Select a part of the database in R

I’m evaluating the database from the transparency portal, which can be obtained at this link. The problem is that I would like to select only part of the database, since my assessment concerns only teacher data. I could clean the data in Excel, but I would like to learn how to do it in R. To read the data I am using the following code:

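library(readr)
library(dplyr)    # provides select(), used below
library(stringr)  # provides str_to_upper(), used below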

df <- read_delim("~/GitHub/Servidores/Setembro/20160930_Cadastro.csv", 
";", escape_double = FALSE, locale = locale(encoding = "ASCII"),
trim_ws = TRUE)

# The only columns that matter are the 3rd (employee ID)
# and the 6th (gross pay) in the remuneration spreadsheet

# Rename the ID column and the gross base pay column, then
# merge into the data frame to add each employee's salary

salarios <- read_delim("~/GitHub/Servidores/Setembro/20160930_Remuneracao.csv", ";",
  escape_double = FALSE, locale = locale(encoding = "ASCII"),
  trim_ws = TRUE) %>% select(3, 6)
head(salarios)

names(salarios) <- c("ID_SERVIDOR_PORTAL", "SALARIO")

names(df) <- str_to_upper(names(df))
df <- merge(df, salarios, by="ID_SERVIDOR_PORTAL")
df$x <- 1

Having done this, I would like to know how to select only the rows of the database that relate to teachers, so that I can restrict my analysis to them.

  • I visited the link and could not find the files 20160930_Cadastro.csv or 20160930_Remuneracao.csv. I only found a file called 201609_GastosDiretos.csv. Also, if your database has only two columns, one called ID_SERVIDOR_PORTAL and the other SALARIO, where would the information about the employee’s job title be? Is it in ID_SERVIDOR_PORTAL itself?

  • Hi @Marcusnunes, I don’t know how, but I got the link wrong! It has been fixed. The registration database has 42 columns of interest, and I added two more from the remuneration database. Thank you, and sorry for the mistake!
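
To illustrate the point raised in these comments: after `merge`, the salary column sits alongside the job description, so the job information comes from the registration table rather than from `ID_SERVIDOR_PORTAL`. Below is a minimal sketch with invented toy data. The rows and values are assumptions for illustration; only the column names come from this thread.

```r
# Toy stand-ins for the two files; every value here is invented.
cadastro <- data.frame(
  ID_SERVIDOR_PORTAL = c(1, 2, 3),
  DESCRICAO_CARGO = c("PROFESSOR", "ANALISTA", "TECNICO"),
  stringsAsFactors = FALSE
)
salarios <- data.frame(
  ID_SERVIDOR_PORTAL = c(1, 2, 3),
  SALARIO = c(9000, 7000, 3000)
)

# merge() matches rows on the key and keeps the other columns of both inputs
toy <- merge(cadastro, salarios, by = "ID_SERVIDOR_PORTAL")
names(toy)
```

The merged data frame therefore carries the ID, the job description, and the salary in one row per employee.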

1 answer

I could not read the data with your original commands, so I changed them to work on my computer. If you can read these files with your original commands, ignore this part of my code.

setwd("~/GitHub/Servidores/Setembro/")

library(readr)
library(dplyr)    # needed for select() in the pipeline below
library(stringr)

cadastro <- read.table(file="20160930_Cadastro.csv", header=TRUE, sep="\t")

df <- read_delim("20160930_Cadastro.csv", "\t", escape_double=FALSE,
locale = locale(encoding = "Latin1"), trim_ws = TRUE)

# The only columns that matter are the 3rd (employee ID)
# and the 6th (gross pay) in the remuneration spreadsheet

# Rename the ID column and the gross base pay column, then
# merge into the data frame to add each employee's salary

salarios <- read_delim("20160930_Remuneracao.csv", "\t", escape_double = FALSE,
locale = locale(encoding = "Latin1"), trim_ws = TRUE) %>% select(3, 6) 

names(salarios) <- c("ID_SERVIDOR_PORTAL", "SALARIO")

names(df) <- str_to_upper(names(df))
df <- merge(df, salarios, by="ID_SERVIDOR_PORTAL")
df$x <- 1

# Find the row positions in df whose job description
# contains the string 'PROFESSOR' anywhere
# (this may need refining depending on
# the goal of the analysis)

professores <- grep("PROFESSOR", df$DESCRICAO_CARGO)

# New data frame with only the teachers' rows
# (more precisely, the employees whose job
# description contains 'PROFESSOR' somewhere)

df.professores <- df[professores, ]

  • Hi @Marcusnunes, that worked great! I already knew the string package, but I didn’t even think of it. One addition: some teachers have a job description that begins with PROF and also contains the word SUBSTITUTO, so some names were missing from the result. To cover them, I added the word SUBSTITUTO to your command, as below: professores <- grep("PROFESSOR|SUBSTITUTO", df$DESCRICAO_CARGO)

  • Perfect. The important thing is to get the desired result :)
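
The filtering discussed in this thread can be tried on a small toy data frame. The sketch below uses invented job descriptions and an assumed alternation pattern, `PROFESSOR|SUBSTITUTO`; the real keywords depend on how the job titles are spelled in the data. It also shows `grepl`, the logical counterpart of `grep`, as an alternative way to build the row filter.

```r
# Toy data; the job descriptions are invented for illustration.
toy <- data.frame(
  ID_SERVIDOR_PORTAL = 1:5,
  DESCRICAO_CARGO = c("PROFESSOR", "ANALISTA", "PROF SUBSTITUTO", NA, "TECNICO"),
  stringsAsFactors = FALSE
)

# grep() returns the positions of the rows that match either keyword
idx <- grep("PROFESSOR|SUBSTITUTO", toy$DESCRICAO_CARGO)

# grepl() returns one TRUE/FALSE per row; a missing description gives FALSE
keep <- grepl("PROFESSOR|SUBSTITUTO", toy$DESCRICAO_CARGO)

# Either form selects the same rows
toy.professores <- toy[keep, ]
```

The logical form is also safer to negate: `toy[!keep, ]` gives the non-teachers, whereas `toy[-idx, ]` would return zero rows if `idx` happened to be empty.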
