How to read microdata from ENEM in R?


14

INEP (Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira) makes the microdata of ENEM (and other surveys) available for download from this link. These microdata are ENEM's raw data, with information at student-level granularity.

When downloading the microdata of ENEM 2014, for example, you get a zip file of approximately 1.2 GB. Extracting its contents yields a folder structure with all the documentation of this data set as well as the microdata themselves. In the DATA folder there is the file MICRODADOS_ENEM_2014.csv. This file is 6 GB in size, so commands like:

dados <- read.csv('MICRODADOS_ENEM_2014.csv', header=T)

do not work on most ordinary PCs due to RAM limitations.

What is the best alternative for reading this data set into R on an ordinary PC? I would especially like concrete ways to perform this reading with R alone and ALSO with packages that use database management systems.

NOTE: I am aware of the question Strategies to analyze very large databases in R (that do not fit in RAM), having even participated as one of the respondents. HOWEVER, that question had a more general character, and its answers produced no R code that could solve the specific problem posed here.

  • @Molx I was one of the people who answered that question. That one was more conceptual, with a more open character. This one is much more specific and tied to a particular data set. As a result of this question, the R code needed to accomplish the task will also be documented. A lot of people search Google specifically for a solution to this problem.

  • I don't think it's a duplicate either.

  • @Carloscinelli I think that when people gain access to the review queue they don't even read the question properly before wanting to "act". Just read it calmly to realize that these are different questions.

  • Would using Hadoop and MapReduce, in order to filter and reduce the amount of data, be an option that would help you?

4 answers

13

Functions like read.csv load the entire contents of the file being read into RAM. So there are three main ways to work with large data sets that do not fit into R's memory:

  1. Increase your RAM memory

  2. Take a sample of your data

  3. Read the data from the disk without loading it into the RAM

Solution 1 would be ideal. After all, there would be no need to learn anything new, as we could work with the data loaded directly into RAM, as if it were a file of normal size. However, this costs money.

Solution 2 is not bad. As long as you take a representative sample of the data, there is no reason to believe the resulting analysis will be poor. But we statisticians have been working with sampling for long enough. Why not take advantage of the abundance of data we have today and analyze all of it? After all, better than making small errors when analyzing a sample is making no error by analyzing the entire population.
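If a sample does solve your problem, here is a minimal sketch of solution 2 (assuming the readr and dplyr packages are installed; the 1% fraction and the chunk size are arbitrary choices): read the file in chunks and keep a random fraction of each chunk, so the full file never sits in RAM.

library(readr)
library(dplyr)

## Keep roughly 1% of the rows of each chunk, chosen at random
amostra_1pct <- function(x, pos) sample_frac(x, 0.01)

amostra <- read_csv_chunked("MICRODADOS_ENEM_2014.csv",
                            DataFrameCallback$new(amostra_1pct),
                            chunk_size = 100000)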

Solution 3 is the one I like best. It means not loading the data fully into RAM, but reading it from disk incrementally. Fortunately, R has several packages that make this easier for us. I will use one of them here, called ff.

In the example below I load the ff package, read the ENEM 2014 microdata and create a table with the number of candidates enrolled per Brazilian state:

library(ff)
enem <- read.csv.ffdf(file="MICRODADOS_ENEM_2014.csv", header=TRUE)
table(enem[, 22])
             AC      AL      AM      AP      BA      CE      DF 
  70096   65269  156393  172709   45553  720330  572238  143196 
     ES      GO      MA      MG      MS      MT      PA      PB 
 149299  218662  354018  987173  152758  155066  459417  233768 
     PE      PI      PR      RJ      RN      RO      RR      RS 
 441963  214878  409248  589757  196980   97263   19064  478792 
     SC      SE      SP      TO 
 143680  117885 1289458   67335  

Note that I used the same table command we normally use to create a table in R. The only difference is in the read.csv.ffdf command, whose syntax is identical to that of read.csv in this case. Even in other cases, the syntax of read.csv.ffdf is very similar to that of read.csv.

Of course this solution has problems. The main one is speed. I evaluated the execution time of the above two commands on my PC and got the following:

system.time(enem <- read.csv.ffdf(file="MICRODADOS_ENEM_2014.csv",
  header=TRUE))
    user   system  elapsed 
3302.942  170.418 3596.523 

system.time(table(enem[, 22]))
 user  system elapsed 
0.550   0.185   0.945 

It took me about an hour to read the data for one year of ENEM. Just for comparison, I did this reading on an Intel i5 2.8 GHz laptop with 8 GB of RAM and a 512 GB SSD. The table, however, was built very quickly, taking less than a second.
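If the reading time bothers you, one thing worth experimenting with (a sketch only; I have not benchmarked how much it helps on this particular file) is the size of the chunks that read.csv.ffdf reads per pass, controlled by the first.rows and next.rows arguments:

library(ff)

## Read larger chunks per pass; the first chunk is also used to guess the
## column types, so a bigger first.rows makes that guess more robust as well.
enem <- read.csv.ffdf(file = "MICRODADOS_ENEM_2014.csv", header = TRUE,
                      first.rows = 50000, next.rows = 500000)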

So this is one way to read large data sets in R. Besides ff, the bigmemory package can also do it. I imagine there are even more packages, but my experience is limited to these two.


Stepping outside of R, what I would suggest, in order to reduce the time spent on the analysis, is to define exactly what interests you in this data set. It has 166 columns. That is a lot of columns. It also has 8,722,249 rows. That is a lot of rows. Of course, they are all important at the national level, but perhaps they are not all important to you or to the analysis you intend to do. I strongly recommend learning a few bash commands, such as grep, cut and awk, to pre-filter this data.

For example, there is no reason to read the results of all Brazilian states into R's memory if only Rio Grande do Sul interests you. Perhaps importing the answer keys of all students, in the columns TX_RESPOSTAS_CN, TX_RESPOSTAS_CH, TX_RESPOSTAS_LC and TX_RESPOSTAS_MT, is not of interest to you. It may be that none of the 76 columns of the socioeconomic questionnaire matter for your analysis. I do not know.
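Still inside R, here is a sketch of the column-dropping idea (the names in manter are placeholders; check the data dictionary for the real ones). read.csv skips a column entirely when its entry in colClasses is "NULL", which already reduces memory use a lot:

## Read only the header to discover the column names
colunas <- names(read.csv("MICRODADOS_ENEM_2014.csv", header = TRUE, nrows = 1))

## Columns to keep (placeholder names; see the data dictionary)
manter <- c("COD_UF_ESC", "NOTA_MT")

## "NULL" means "skip this column"; NA lets R guess the type as usual
classes <- ifelse(colunas %in% manter, NA, "NULL")

dados <- read.csv("MICRODADOS_ENEM_2014.csv", header = TRUE, colClasses = classes)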

But, from my experience, I suggest cutting out everything that does not interest you before importing this data set into R, to make your analysis easier, especially if your interest is to analyze the results of more than one year of ENEM. Imagine the time that will be spent if you need to combine the results of nearly 20 years of the exam. Just reading the data would take 20 hours, and that assumes reading time grows linearly, which is an assumption that does not make much sense.

Anyway, if I were doing an analysis of this type, I would drop the columns and rows that do not interest me, because that would optimize my code and make it run faster, since I would be managing only the data I actually care about.

  • Nice answer! I find it essential to also consider which columns are actually required; the file size can be reduced to the point that reading with read.csv becomes trivial.

7

An elegant solution, which can be practically independent of computational resources, is to pre-process the file line by line, filtering only the desired records. The best-known and most popular tool for this kind of task is AWK, a text-manipulation language, whose main implementation is gawk. gawk has builds for Linux, Windows and macOS.

I do not have the ENEM data at hand, but suppose you want to process the 2015 School Census files (which, I imagine, are quite similar). The TURMAS.CSV table contains data on all classes of regular basic education in Brazil. There is a field in this table, whose header gives the name CO_UF, which is the code of the federative unit. To filter only the records of Espírito Santo, for example, whose federative unit code is 32 (according to IBGE), I would call gawk as follows:

gawk 'BEGIN {FS="|"} { if($73 == 32) print $0 }' TURMAS.CSV

First (BEGIN) I declare the field separator (FS) as "|", which is the separator used in these INEP files. Then I check whether field 73 (this part is tedious: I need to count the fields in the CSV documentation, or in the first row, which is the header; see the sketch further below for a shortcut) is equal to 32 (the ES code). In gawk each field, split by the separator, is numbered and referenced after the dollar sign ($).

If the test is positive, I print the entire line ($0 is special and prints the whole line reassembled, i.e., the fields and their separators). To make use of this filter, I redirect the output to a file:

gawk 'BEGIN {FS="|"} { if($73 == 32) print $0 }' TURMAS.CSV > filtrado_es.csv

The filtrado_es.csv file then contains only the records that satisfy the condition of UF code equal to 32, and will therefore be smaller, making R suffer less!
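By the way, counting fields by hand is error-prone. A quick way to find the position of a column (here CO_UF, as in the example above) is to read just the first rows of the file in R and look the name up:

## Read the header plus one row and locate the index of the CO_UF column
cabecalho <- read.csv("TURMAS.CSV", sep = "|", nrows = 1, header = TRUE)
which(names(cabecalho) == "CO_UF")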

If you have gawk in your PATH, you can call it from within R itself via the pipe() command. For example:

read.csv(pipe("gawk 'BEGIN {FS=\"|\"} {if ($73 == 32) print $0}' /caminho/para/TURMAS.CSV"),
         sep = "|", header = FALSE)

In this case, read.csv will read the file already pre-processed by gawk.
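One detail of this approach: the filter discards the file's header line, so read.csv has no column names to work with (hence header = FALSE in the call above). A small variation also prints line number 1 (NR == 1), keeping the header so that read.csv can use the real column names:

## NR == 1 keeps the header line; the second condition keeps the ES records
dados_es <- read.csv(pipe(
  "gawk 'BEGIN {FS=\"|\"} NR == 1 || $73 == 32 {print $0}' /caminho/para/TURMAS.CSV"
), sep = "|", header = TRUE)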

5


Concepts

The answers so far have been great, but I am going to add my two cents. I would say that when we are working with data sets well beyond the size of the RAM, the ideal, for me, is always to work with database management systems such as PostgreSQL, MySQL or MonetDB. Especially with public microdata, which in general you only need to load once and from then on only query, I believe DBMSs are the best approach. Besides allowing consistent data storage, it is also possible to use the dplyr package in R with a database backend, so that you can work with the database tables using virtually the same syntax you would use with a data.frame. It is a solution that is fast, does not have R's RAM limitations and that you can keep using for a long time.

Of the available DBMSs, two that give no trouble at all to use with R, requiring no software to be installed other than R itself, are:

  • SQLite
  • MonetDB

The first can be used easily from R through the RSQLite package and the second by means of the MonetDBLite package. My favorite, and the one I am going to use here as an example, is MonetDB. I will use it because, of the two, it is the one that stores data by columns. Hence the write operation is a little more expensive, but any query operation is much cheaper. If you work with data that does not need to be written frequently but does need to be read frequently, columnar database management systems are superior. On the other hand, if besides reading you also need to write to the database frequently, a database like SQLite (among others) should be superior. Specifically in the case of microdata, such as the Census, ENEM or the School Census, which you load into the database once and from then on only read, I believe MonetDB is the best option available today.

Applying

To use the solution I propose here you must have the MonetDBLite package installed on your computer. In the R console, type:

install.packages('MonetDBLite', dependencies = TRUE)

From there, the next step is to load the data into the database. Fortunately MonetDBLite is a package that lets you do this automatically using the monetdb.read.csv() function. Assuming you have the ENEM data set, in my case microdados_enem2014.csv, in the same working directory where you will run the script, run the following commands:

## Load the required packages
library(MonetDBLite)
library(DBI)

## Define a directory for the database
dbdir <- 'database/'

## Create the connection to a database, created in the database folder
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)

## Ingest the csv into the database
monetdb.read.csv(conn = con, files = 'microdados_enem2014.csv', tablename = 'enem2014',
                 header = TRUE, na.strings = '', delim = ',')

## List the tables in the database
dbListTables(con)

## Count the number of records in the table
dbGetQuery(con, 'SELECT count(*) FROM enem2014')

## Query the first 100 rows
teste <- dbGetQuery(con, "SELECT * FROM enem2014 LIMIT 100")

The load did not even take a minute here on my machine, a Core i5 with 16 GB of RAM, but I believe that on a more modest machine it should take a little longer. Another fundamental point is that the file should ideally be in UTF-8 for loading, and the ENEM csv files are in ISO-8859-1. I converted the file easily with the iconv command in the Linux terminal:

iconv -f ISO-8859-1 -t UTF-8 MICRODADOS_ENEM_2014.csv > microdados_enem2014.csv

but on Windows you should follow the iconv installation procedure shown here. In fact there are several ways to change a file's encoding on Windows, and this is only a suggestion.
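If you prefer to stay inside R (a sketch that should also work on Windows, since it only uses base R connections), you can re-encode the file in blocks so it never needs to fit in RAM:

## Open the original file as ISO-8859-1 and a new file that will receive UTF-8
entrada <- file("MICRODADOS_ENEM_2014.csv", open = "r", encoding = "ISO-8859-1")
saida   <- file("microdados_enem2014.csv", open = "w", encoding = "UTF-8")

## Copy the contents block by block, re-encoding along the way
while (length(linhas <- readLines(entrada, n = 100000)) > 0) {
  writeLines(linhas, saida)
}

close(entrada)
close(saida)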

Another point to note is that, since the database and the table have already been created, you can run queries directly in SQL, as if you were at the database's own terminal. If you know SQL, your problem is solved: just produce the data you need with a query and then process it as a data.frame in R.
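For example, a purely SQL version of an aggregation (using the same COD_UF_ESC and NOTA_MT columns that appear in the dplyr example further below) would be:

## Average math grade by the state of the school, computed inside MonetDB
medias <- dbGetQuery(con, "
  SELECT COD_UF_ESC, AVG(NOTA_MT) AS media_mt
  FROM enem2014
  GROUP BY COD_UF_ESC
")

head(medias)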

Using dplyr

Of course, the best option for using databases with R is to use dplyr directly, which lets the R user work with the database without writing a single line of SQL. The coolest part of this strategy is that dplyr converts the R commands into SQL queries, and even the intermediate results of the queries stay inside the database, so there are no performance problems related to R's RAM limitations.

As an example, let's compute the students' average mathematics grade per state and administrative dependency, using dplyr.

## Load the package
library(dplyr)

## Point dplyr at the table in the database
my_db <- MonetDBLite::src_monetdb(embedded = dbdir)
my_tbl <- tbl(my_db, "enem2014")

## Average mathematics grade by state and administrative dependency
consulta <- my_tbl %>%
  group_by(COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC) %>%
  summarise(mean(NOTA_MT))

## Save the query result as a data.frame
consulta <- collect(consulta)

which results in:

  COD_UF_ESC ID_DEPENDENCIA_ADM_ESC       L1
        <int>                  <int>    <dbl>
1          NA                     NA 472.1223
2          26                      2 441.0514
3          32                      2 454.2741
4          35                      2 468.1310
5          11                      2 440.4378
6          33                      4 554.1954
7          33                      2 456.5003
8          23                      2 436.1984
9          29                      2 433.5961
10         53                      4 447.5004

Note that you will need to convert the codes of the federative units and of the administrative dependency into their respective names, which are available in the dictionary that comes with the ENEM data. Another point is that at the end of the query, to "save" the query result as a data.frame in R, you must use the collect() function.
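A sketch of that conversion for the federative unit codes (the lookup table below is illustrative and incomplete; the complete code-to-name mapping is in the data dictionary):

## Small illustrative lookup table of IBGE state codes
ufs <- data.frame(COD_UF_ESC = c(11, 26, 29, 32, 33, 35, 53),
                  SIGLA_UF   = c("RO", "PE", "BA", "ES", "RJ", "SP", "DF"))

## Since consulta has already been collected, this is an ordinary data.frame join
consulta_nomeada <- consulta %>% left_join(ufs, by = "COD_UF_ESC")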

Finally, when you finish using the database, you can disconnect and shut down the MonetDB instance that was created and is running on your machine:

## Disconnect from the database
dbDisconnect(con, shutdown = TRUE)

rest assured that your data remains intact. If you need to use the database again, just reconnect as we did before:

## Recreate the connection to the database created in the database folder
con <- dbConnect( MonetDBLite::MonetDBLite() , dbdir )

4

A very good solution is to use the read_csv_chunked() function from the readr package. It reads the file in chunks (a given number of lines) and applies a function to each chunk iteratively. Suppose you only want the data from the state of São Paulo: you can limit the number of lines imported at a time and filter them with UF == "SP" before moving on to the next chunk. To do so, just follow this procedure:

library(readr)

## Keep only the rows from São Paulo within each chunk
f <- function(x, pos) subset(x, UF == "SP")

df <- read_csv_chunked("MICRODADOS_ENEM_2014.csv",
                       DataFrameCallback$new(f),
                       chunk_size = 1000)
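The same mechanism also works for summaries rather than filters. A sketch (using the same UF column as above): count the candidates per state chunk by chunk and then combine the partial counts.

library(readr)
library(dplyr)

## Count rows per state within each chunk
contar <- function(x, pos) count(x, UF)

parciais <- read_csv_chunked("MICRODADOS_ENEM_2014.csv",
                             DataFrameCallback$new(contar),
                             chunk_size = 100000)

## Combine the per-chunk counts into state totals
totais <- parciais %>% group_by(UF) %>% summarise(n = sum(n))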
