Strategies to analyze very large databases in R (that do not fit in RAM)


Suppose I have a huge database that doesn't fit in RAM. What strategies are there for analyzing this database in R, given that I cannot load it entirely into memory?

PS: The question isn't just about how to get R to talk to a relational/non-relational database. If your data lives in a relational database, for example, you still can't load it all at once into R to run a random forest or a regression.

  • The question was rather open-ended and generic... Even after editing, it still gave the impression that what was missing was database support, which left room to suggest PostgreSQL as a "solution". Fitting a linear regression only requires having the data in two vectors and handing them to a function, which can be done with PL/R... Perhaps asking "how do I fit a linear regression in R with PL/R on data from a table XY?" would be less open-ended.

  • @Peterkrauss have a look here: http://stackoverflow.com/questions/16612320/when-running-pl-r-on-postgresql-can-handle-data-bigger-then-ram

  • Hmm... by the looks of it you have now found exactly what you needed, within the scope of the answer I wanted to give... But there was a "... but unfortunately..." from J. Conway himself, who is "the guy" behind PL/R. A pity.

  • Clearly a job for data.table. I suggest reading this presentation. data.table lets you handle and analyze tables with hundreds of millions of rows, and fread reads 20 GB files in a few minutes (see the sketch after these comments).

  • @Paulocardoso data.table works inside R and is therefore also limited by how much RAM you have.
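
A small sketch of the data.table/fread workflow mentioned above, assuming the file still fits in RAM once read (as the last comment points out, data.table does not remove that limit). The file name and column names are hypothetical.

    # fast, multi-threaded CSV reader plus an in-memory grouped summary
    library(data.table)

    dt <- fread("big_file.csv")                  # hypothetical large CSV file
    dt[, .(total = sum(value)), by = group]      # grouped aggregation, all in RAM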

4 answers

20

The answer depends on factors such as the analysis task you want to perform and the size of the data set, that is, how big it is relative to the RAM (and sometimes the hard drive) of the computer where the analysis will run. There are a few cases to consider:

Size of the data set:

  • Data sets larger than RAM but smaller than the hard drive of a typical personal computer, say around 20 GB.

  • Data sets larger than both the RAM and the hard drive of a personal computer.

As to the type of analysis:

  • Descriptive analyses, simple queries and calculations.

  • More complex analyses, including fitting models such as random forests, linear regressions, and so on.

When the data set is moderately large - bigger than RAM, but not so big that it is impossible to handle on a single PC - R packages like ff, bigmemory, or Revolution Analytics' RevoScaleR can perform both simple and more complex analyses. A caveat in these cases are the situations where, even with these packages, the procedure is too slow for the user's needs. Another, less well-known solution is the MADlib library, which extends PostgreSQL and allows complex analyses on large data sets - such as linear/logistic regressions and random forests - to be run directly from R via the PivotalR package.
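
To make the "larger than RAM, fits on disk" case concrete, here is a minimal sketch of an incremental linear regression with the biglm package, reading a CSV in chunks; the file name, column names, and chunk size are hypothetical:

    # fit y ~ x1 + x2 without ever loading the whole file into memory
    library(biglm)

    chunk_size <- 1e5
    con <- file("data.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]

    fit <- NULL
    repeat {
      lines <- readLines(con, n = chunk_size)
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, col.names = header)
      if (is.null(fit)) {
        fit <- biglm(y ~ x1 + x2, data = chunk)  # first chunk initializes the model
      } else {
        fit <- update(fit, chunk)                # later chunks update it incrementally
      }
    }
    close(con)
    summary(fit)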

If the analysis involves only simple queries and descriptive statistics, an interesting solution may be simply to load the data set into a database management system (DBMS) such as PostgreSQL, MySQL, SQLite3, or MonetDB and turn the calculations into SQL queries. Alternatively, use the dplyr package: the user declares one of these DBMSs as the data source and the package automatically converts dplyr operations into SQL. Beyond these alternatives, dplyr also supports Big Data cloud services such as BigQuery, where the user can run query operations straight from the console with dplyr, just as if working with a data frame.
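
A minimal sketch of this idea with dplyr over a DBMS backend, using SQLite for convenience (recent dplyr versions also need the dbplyr translation layer installed); the database file, table, and column names are hypothetical:

    library(DBI)
    library(dplyr)

    con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")

    sales <- tbl(con, "sales")            # lazy reference: nothing is loaded into RAM yet

    summary_by_region <- sales %>%
      group_by(region) %>%
      summarise(n = n(), total = sum(amount, na.rm = TRUE)) %>%
      collect()                           # the aggregation runs as SQL inside the DBMS;
                                          # only the small result comes back to R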

In situations where the data set is far larger than RAM, sometimes intractable on a single computer, you need frameworks that allow distributed processing of large data sets, such as Apache Hadoop or Apache Spark. In these cases, depending on the type of analysis - simple queries and calculations, say - Hadoop + R with the RHadoop package or Spark + R with the SparkR package may be enough.
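
A minimal SparkR sketch of a distributed aggregation, assuming a running Spark installation; the file path and column names are hypothetical:

    library(SparkR)

    sparkR.session(master = "local[*]")   # or the URL of a real cluster

    events <- read.df("hdfs:///data/events.csv", source = "csv",
                      header = "true", inferSchema = "true")

    # grouped count computed by the Spark executors, not by the R session
    by_category <- summarize(groupBy(events, events$category),
                             total = count(events$category))
    head(by_category)

    sparkR.session.stop()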

Both Hadoop and Spark have associated projects that implement machine learning methods, such as Apache Mahout and MLlib, which are not available for use from R. However, there is 0xdata's H2O engine, which has an R API so the user can fit models on large data sets. MADlib, cited above, can also be used with distributed database management systems such as Greenplum, so that, together with the PivotalR package, it allows complex analyses to be performed. Revolution's RevoScaleR package can also be used in these cases, using a Big Data infrastructure as the backend.
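
A minimal sketch of fitting a random forest with H2O from R; the file path, the response column name, and the number of trees are hypothetical:

    library(h2o)

    h2o.init()                            # starts (or connects to) an H2O cluster

    clients <- h2o.importFile("hdfs://cluster/data/clients.csv")

    model <- h2o.randomForest(y = "churn",              # response column
                              training_frame = clients,
                              ntrees = 100)

    h2o.performance(model)                # training metrics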

  • +1 very good answer, Flavius!

14


R is a specialized language whose sweet spot is in-memory data analysis (an extremely significant class of problems).

That said, the R ecosystem is large and several solutions are emerging to tackle problems with huge amounts of data. Keep in mind that Big Data problems use specific techniques (and often specific software, hardware, file systems, and protocols), such as MapReduce. Don't assume you can do everything you do with a data.frame on gigantic data volumes, and even when a given technique can be applied, don't assume the algorithms are similar.

Keep in mind that issues like regression on MapReduce are still open research problems; new algorithms and new implementations keep emerging inside and outside the R ecosystem (you can find more information in papers such as Robust Regression on MapReduce).

To give you a taste of where to start, there are already R packages that implement these kinds of chunked and distributed approaches.
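
As a toy illustration of the map/reduce idea in plain R - each chunk contributes only its sufficient statistics, which are then combined - here is a global mean computed without ever holding the full file in memory; the file name and column name are hypothetical:

    con <- file("big_file.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]

    partials <- list()
    repeat {
      lines <- readLines(con, n = 1e5)
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, col.names = header)
      # "map": each chunk is reduced to a sum and a count
      partials[[length(partials) + 1]] <- c(s = sum(chunk$value), n = nrow(chunk))
    }
    close(con)

    # "reduce": combine the partial results into the global mean
    totals <- Reduce(`+`, partials)
    totals["s"] / totals["n"]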

3

There is always the option of working with databases external to R and loading only the variables needed for the analysis (it is rare for an analysis to use all the variables at once). Several packages allow working with other kinds of databases, among which I highlight RSQLite. It has functions that create SQLite databases from delimited and fixed-width files.
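
A minimal sketch of that workflow: keep the data in SQLite and pull into R only the columns (and rows) the analysis needs; the database file, table, and column names are hypothetical:

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "survey.sqlite")

    # bring back only two columns and a filtered subset of rows
    dat <- dbGetQuery(con,
      "SELECT income, education FROM respondents WHERE year = 2013")

    fit <- lm(income ~ education, data = dat)

    dbDisconnect(con)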

3

As a solution for handling large volumes of data I suggest a plain, "rice and beans" database management system (relational model)... In that context, the best FOSS software (it satisfies demands for robustness, stability, scalability, etc.) is PostgreSQL.

Some R users have been using PostgreSQL since ~2006, so it is by now quite stable and well documented: the "embedded R" module (PL/R - R Procedural Language for PostgreSQL) gives you the freedom to create database procedures in R - for example UPDATE/INSERT/DELETE triggers written in R instead of PL/pgSQL or another procedural language - and to perform virtually all R operations (on basic data types) inside the database scripts themselves.
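
A minimal sketch of what that looks like, registering a PL/R function from R via DBI (it assumes the plr extension is already installed in the database; the connection details, function, table, and column names are hypothetical):

    library(DBI)

    con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

    # the body between $$ ... $$ is ordinary R code, executed inside PostgreSQL
    dbExecute(con, "
      CREATE OR REPLACE FUNCTION r_median(double precision[])
      RETURNS double precision AS $$
        median(arg1)  # arg1 is how PL/R exposes the first (unnamed) argument
      $$ LANGUAGE plr;
    ")

    # the database can now compute medians server-side, in R
    dbGetQuery(con, "SELECT r_median(array_agg(amount)) FROM sales;")

    dbDisconnect(con)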

There are some tips on how to install it over on SOen (Stack Overflow in English).

However, PL/R apparently still has the same RAM restrictions as R itself, as can be seen in this answer on SOen.

  • It would be very interesting to cite sources for the claims in the first paragraph.

  • Peter, to complement: how would you run a random forest model or a regression without being able to access all the data at once?

  • Dear friends, the intention of the first paragraph was to make a case for using a database with R, and for PostgreSQL in particular... It was a marketing text, I agree; even so, having no time to trim it, I answered and opened it as a wiki... I don't understand the downvote. Do you suggest I delete the answer?

  • @Carloscinelli, I understand that the relational model has limitations - so much so that complex data, as in PostGIS (geometries), use the relational database only as an alternative to the filesystem, not to "do the math". You can store complex structures in their binary representation, which at least solves the data-management problem. The advantage of using R with the database becomes clearer for automatic report generation and when there is a great diversity of simple, related data.

  • @gmsantos, the links are the sources. I have also made edits to make it less of a sales pitch; I look forward to your reassessment.

  • @Peterkrauss the answer itself is entirely promotional. The only valuable suggestion is the mention of PostgreSQL's PL/R module, but the answer does not exemplify its use, suggest an implementation, or anything of the sort.

  • Peter, what is still missing is the part about analyzing the data - running predictive models, for example - under the restriction of not being able to load everything into R at once.

  • Peter, it appears that PL/R does not solve the RAM limitation problem: http://stackoverflow.com/questions/16612320/when-running-pl-r-on-postgresql-can-handle-data-bigger-then-ram .

  • @Carloscinelli, yes, as I commented above (on the question), J. Conway himself has already given his verdict. PS: if the problem can be mathematically "refactored" (usually not without loss of accuracy), SQL helps to fragment it and then join/summarize the partial results.

