How to read the School Census data in R (Enrollment)

Asked

Viewed 219 times

2

I’m having trouble analyzing the 2019 school census data in R.

I read some things on other forums and chose to use the ffbase package that deals with very large databases.

Using the command below I can download the base in R:

matriculas<- read.csv.ffdf(file="MATRICULA_NORDESTE.csv",header=TRUE)

The problem is that I can’t work with her after, for example, trying to run:

Analise<-subset.ffdf(matriculas,CO_UF==27)

Is returned to me:

"Error in gsub(varre, varsub, es) : 
  expressão regular inválida "

Could someone help me? The idea is to create a smaller base (selecting only UF 27 so I can work on it like a normal csv).

Thank you in advance.

  • Difficult to answer without a sample of the data, but try to use CO_UF == "27"

1 answer

4

To read the data and filter them at the same time I will use the package sqldf which I think is ideal for this since it allows you to filter the data with instructions SELECT of language SQL. This avoids overloading the .GlobalEnv and memory with useless data.

1. See which files to read

csv_files <- list.files(pattern = "MATRICULA_.*\\.CSV")
csv_files
#[1] "MATRICULA_CO.CSV"       "MATRICULA_NORDESTE.CSV"
#[3] "MATRICULA_NORTE.CSV"    "MATRICULA_SUDESTE.CSV" 
#[5] "MATRICULA_SUL.CSV" 

2. How large the question file is
To know how many lines the file has I use the Unix/Linux command wc. The call to the operating system is made with system2 which is currently the recommended R command for such.

system2("wc", csv_files[2])
#  15304590   15304590 3927158117 MATRICULA_NORDESTE.CSV

The file has more than 15 million lines and 3.6 GB.

3. Reading the data

library(sqldf)

SQL <- "select * from file where CO_UF = '27'"
t0 <- system.time(
  dados <- read.csv2.sql(csv_files[2], sql = SQL, sep = "|")
)
t0
#  usuário   sistema decorrido 
#  324.799    20.892   514.291 

dim(dados)
#[1] 933873    104

On my system it took 8.5 minutes to read and filter the data. There was no error.
The final basis has about 930 thousand observations of 104 variables, which represents

100*nrow(dados)/15304590
#[1] 6.101915

only 6.10% of file data.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.