Best SQL DBMS for large datasets (RAIS > RAM)


I am trying to write a routine to automate the import of the RAIS public microdata and to work with it on an ordinary computer (RAIS size > RAM).

What is the best SQL software (DBMS/protocol) to use? I would prefer RSQLite, since the database can be created from within R itself, without requiring the user to install any other software.

But I would like to know the trade-offs relative to other DBMSs, in terms of performance, ability to handle large datasets, and so on.

In practice, besides the SQL software, I would also like to automate the download of the data, because the data are made available by the MTE via Google Drive (I asked about this on Stack Overflow, in English). Moreover, the files come compressed in .7z format and, from what I have seen, they cannot be unpacked with R commands alone (but you can install 7-Zip and run it from the command line in the shell, from within R).
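A minimal sketch of that last step, assuming 7-Zip is installed and on the PATH; the Google Drive URL and file name below are placeholders, not the real RAIS links:

```r
# Download a (hypothetical) RAIS archive and unpack it by calling
# 7-Zip from the shell, since base R cannot read .7z files
url <- "https://drive.google.com/uc?export=download&id=FILE_ID"  # placeholder
destfile <- "RAIS_UF2015.7z"                                     # placeholder
download.file(url, destfile, mode = "wb")

# system() runs the command in the shell from within R
system(paste("7z x", shQuote(destfile)))
```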

1 answer


The choice of DBMS depends on what you will do with the data and on the type of data involved. I will describe what are, in my opinion, the advantages and disadvantages of three DBMSs: Postgres, MonetDB, and SQLite3.

SQLite3 is the simplest of the three and also the one with the fewest features and the least capacity to store data. Its great advantage is simplicity: there is nothing to install on the computer. You just use an R package, and the data can be transferred easily by copying a single file. If the database is relatively small and SQLite3's performance is satisfactory in terms of speed and features, this DBMS is an excellent option.
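A minimal sketch of this workflow with the DBI and RSQLite packages (the table and column names are illustrative): the whole database lives in a single file, created from within R.

```r
library(DBI)
library(RSQLite)

# The database is just one file; copying it moves the whole database
con <- dbConnect(SQLite(), "rais.sqlite")

# Illustrative table; a real import would append the RAIS files in chunks
dbWriteTable(con, "rais", data.frame(uf = c("SP", "SP", "RJ"),
                                     vinculos = c(10, 20, 5)))

# The query runs inside SQLite; only the result is returned to R
dbGetQuery(con, "SELECT uf, SUM(vinculos) AS total FROM rais GROUP BY uf")

dbDisconnect(con)
```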

Postgres is certainly the one with the most features. Virtually any feature available in other DBMSs is also available in Postgres. There is also the MADlib library, which extends it so that complex analyses can be performed directly from R, whether using the dplyr package or PivotalR. Through these packages it is not necessary to write SQL queries directly, and users can work using only the knowledge of R they already have.
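A sketch of what working through dplyr looks like, assuming a local Postgres server with a `rais` table already loaded; the connection details are hypothetical, and the RPostgres package is assumed as the driver:

```r
library(DBI)
library(dplyr)

# Hypothetical connection; adjust dbname/host/user to your setup
con <- dbConnect(RPostgres::Postgres(), dbname = "rais", host = "localhost")

# These verbs are translated to SQL and executed inside Postgres;
# collect() brings only the (small) aggregated result into R
tbl(con, "rais") %>%
  group_by(uf) %>%
  summarise(vinculos = n()) %>%
  collect()

dbDisconnect(con)
```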

MonetDB is a columnar DBMS, that is, it stores the data internally as columns. This is a great advantage of MonetDB, because read speed is significantly faster than in Postgres and SQLite3. However, the write cost is higher, so in some applications the positive effect on reads may be cancelled out by the negative effect on writes.

So my suggestion is this: if the dataset is reasonably stable, that is, there is no need for regular writes, MonetDB may be the best choice. If it is necessary to write and read regularly, and the dataset is large but not much larger than RAM, SQLite3 may be the better option. Finally, if you need to run complex queries and modeling on very large volumes of data, perhaps the best option is Postgres + dplyr + PivotalR.

Especially in the case of microdata, where after the initial load there is practically no need to write to the database again, read time can be critical. In these cases my preference is MonetDB. Another advantage of MonetDB for microdata is that, if the data come from a complex sample design, the sampling plan must be included in the analysis; with the sqlsurvey package it is easy to include the sampling plan in the analysis using MonetDB.
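A rough sketch of that idea, assuming a MonetDB server with the microdata already loaded; the column names here are hypothetical, and the exact arguments should be checked against the sqlsurvey documentation (`?sqlsurvey`):

```r
library(sqlsurvey)
library(MonetDB.R)

# Illustrative design object: "psu", "estrato", "peso" and "row_id" are
# hypothetical column names; the connection details must match your setup
des <- sqlsurvey(id = "psu", strata = "estrato", weights = "peso",
                 table.name = "rais", key = "row_id",
                 database = "monetdb://localhost/rais",
                 driver = MonetDB.R())

# Survey-weighted mean of a (hypothetical) salary variable;
# the heavy lifting runs inside MonetDB, not in R
svymean(~salario, design = des)
```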
