Concepts
The other answers have been great so far, but I will add my two cents. When working with data sets well over the size of RAM, my preference is to always use a database management system (DBMS) such as PostgreSQL, MySQL or MonetDB. Especially with public microdata, which in general you only need to load once and query from then on, I believe a DBMS is the best approach. Besides providing consistent data storage, it also lets you use the dplyr package in R with a database backend, so that you can work with the database tables using virtually the same syntax you would use with a data.frame. It is a solution that is fast, does not suffer from R's limitations with RAM, and that you can keep using for a long time.
Of the available DBMSs, two that give no trouble at all with R, requiring no software other than R itself, are SQLite and MonetDB.
The first can be used from R through the RSQLite package and the second through the MonetDBLite package. My favourite, and the one I will use in the example here, is MonetDB. I am going to use it because, of the two, it is the one that stores data by columns. Writing is therefore a little more expensive, but any query is much cheaper. If you work with data that does not need to be written frequently but does need to be read frequently, columnar DBMSs are superior. On the other hand, if besides reading you also need to write to the database frequently, a database like SQLite (among others) should do better. Specifically for microdata such as the Census, ENEM or the School Census, which you load into the database once and from then on only read, I believe MonetDB is the best option available today.
Applying
To use the solution I propose here you must have the MonetDBLite package installed on your computer. In the R console type:
install.packages('MonetDBLite', dependencies = TRUE)
From there, the next step is to load the data into the database. Fortunately, MonetDBLite lets you do this automatically with the monetdb.read.csv() function. Assuming you have the ENEM data set, in my case microdados_enem2014.csv, in the same working directory where the script will run, execute the following commands:
## Loading the required packages
library(MonetDBLite)
library(DBI)
## Setting a directory for the database files
dbdir <- 'database/'
## Creating the connection to a database, created in the database folder
con <- dbConnect( MonetDBLite::MonetDBLite() , dbdir )
## Ingesting the csv into the database
monetdb.read.csv(conn = con, files = 'microdados_enem2014.csv', tablename = 'enem2014', header = TRUE, na.strings = '', delim = ',')
## Listing the tables in the database
dbListTables(con)
## Counting the number of records in the table
dbGetQuery(con, 'SELECT count(*) FROM enem2014')
## Querying the first 100 rows
teste <- dbGetQuery(con, "SELECT * FROM enem2014 LIMIT 100")
The load took less than a minute here on my machine (Core i5, 16 GB of RAM), but on a more modest machine it may take a little longer. Another fundamental point is that the file should ideally be encoded in UTF-8 for loading, while the ENEM csv files come in ISO-8859-1. I converted the file easily with the iconv command in the Linux terminal:
iconv -f ISO-8859-1 -t UTF-8 MICRODADOS_ENEM_2014.csv > microdados_enem2014.csv
In Windows you would need to follow the iconv installation procedure shown here; in fact there are several ways to change a file's encoding in Windows and this is only a suggestion.
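If you prefer to stay inside R (on any operating system), one alternative is to re-encode the file with base R connections. This is just a sketch using the file names from this answer; the chunked loop is there so that files larger than RAM can also be converted:
## Sketch: converting ISO-8859-1 (latin1) to UTF-8 using only base R connections
infile  <- file('MICRODADOS_ENEM_2014.csv', open = 'r', encoding = 'latin1')
outfile <- file('microdados_enem2014.csv', open = 'w', encoding = 'UTF-8')
## Read and write in chunks of 100,000 lines so memory use stays low
while (length(chunk <- readLines(infile, n = 100000)) > 0) {
  writeLines(chunk, outfile)
}
close(infile)
close(outfile)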
Another point to note is that, since the database and the table have already been created, you can now query it directly in SQL, as if you were at the database prompt. If you know SQL your problem is solved: just generate the data you need with a query and then process it as a data.frame in R.
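For instance, the same kind of aggregation done later with dplyr could be written directly in SQL; this is only a sketch, using the ENEM 2014 column names that appear further below:
## Average math score by state and administrative dependency, straight in SQL
medias <- dbGetQuery(con, "
  SELECT COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC, AVG(NOTA_MT) AS media_mt
  FROM enem2014
  GROUP BY COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC
")
head(medias)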
Using dplyr
Of course the best option for using databases with R is to use dplyr directly, which lets the R user work with the database without writing a single line of SQL. The nicest part of this strategy is that dplyr translates the R commands into SQL queries, and even the intermediate results of the queries stay inside the database, so there are no performance problems related to R's limitations with RAM.
As an example, let’s look at the average grade of students per state and administrative dependency, using dplyr.
## Loading the package
library(dplyr)
## Pointing dplyr at the table in the database
my_db <- MonetDBLite::src_monetdb(embedded=dbdir)
my_tbl <- tbl(my_db, "enem2014")
## Getting the average math score by state and administrative dependency
consulta <- my_tbl %>% group_by(COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC) %>% summarise(mean(NOTA_MT))
## Saving the query result as a data.frame
consulta <- collect(consulta)
which results in:
   COD_UF_ESC ID_DEPENDENCIA_ADM_ESC       L1
        <int>                   <int>    <dbl>
1          NA                      NA 472.1223
2          26                       2 441.0514
3          32                       2 454.2741
4          35                       2 468.1310
5          11                       2 440.4378
6          33                       4 554.1954
7          33                       2 456.5003
8          23                       2 436.1984
9          29                       2 433.5961
10         53                       4 447.5004
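As a side note, if you are curious about the SQL that dplyr generates for this pipeline, show_query() can print it before collect(); this is a sketch and assumes a dplyr/dbplyr version recent enough to support it with the MonetDBLite backend:
## Inspecting the SQL translation of the dplyr pipeline (before collect())
my_tbl %>%
  group_by(COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC) %>%
  summarise(media_mt = mean(NOTA_MT)) %>%
  show_query()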
Note that you will still need to convert the codes of the Federation Units and of the administrative dependency to their respective names, which are available in the dictionary that comes with the ENEM data. Another point is that, at the end of the query, to "save" the query result as a data.frame in R you must use the collect() function.
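As an illustration of that conversion, the sketch below joins the result to a small lookup table; the state codes listed are just a few illustrative IBGE codes, and the complete, authoritative mapping (including the administrative dependency codes) must be taken from the ENEM dictionary:
## Partial, illustrative lookup table -- check every code against the official dictionary
ufs <- data.frame(
  COD_UF_ESC = c(26L, 33L, 35L, 53L),
  SIGLA_UF   = c('PE', 'RJ', 'SP', 'DF')
)
## Adding the state abbreviation to the collected result
consulta_nomeada <- consulta %>% left_join(ufs, by = 'COD_UF_ESC')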
Finally, when you finish using the database, you can disconnect and shut down the MonetDB instance that was created and is running on your machine:
## Disconnecting from the database
dbDisconnect(con, shutdown=TRUE)
Rest assured that your data remains intact. If you need to use the database again, just reconnect as we did before:
## Creating the connection to the database, created in the database folder
con <- dbConnect( MonetDBLite::MonetDBLite() , dbdir )
Possible duplicate of Strategies to analyze very large databases in R (that do not fit in RAM)
– Molx
@Molx I was one of the people who answered that question. That one was a more conceptual, more open-ended question. This one is much more specific and tied to a particular data set. As a result, this question will also end up documenting the R code needed to accomplish the task. A lot of people go to Google looking specifically for a solution to this problem.
– Flavio Barros
I don’t think it’s a duplicate either.
– Carlos Cinelli
@Carloscinelli I think that when people gain access to the review queue they don’t even read the question properly before wanting to "act". Reading it calmly is enough to see that they are different questions.
– Flavio Barros
Would using Hadoop and MapReduce, in order to filter and reduce the amount of data, be an option that helps you?
– Gilberto Santos