This question depends on factors such as the type of analysis you wish to perform and the size of the data set relative to the RAM (and sometimes the hard drive) of the computer where the analysis will run. There are a few cases to consider:
Size of the data set:
Data sets larger than RAM but smaller than the hard drives of typical personal computers, say around 20 GB.
Data sets larger than both the RAM and the hard drive of a personal computer.
As to the type of analysis:
Descriptive analyses, simple queries, and calculations.
More complex analyses, including fitting models such as random forests, linear regressions, and so on.
When the data set is moderately sized (larger than RAM, but not so large that it cannot be handled on a single PC), R packages such as ff, bigmemory, or even Revolution Analytics' ScaleR can perform both simple and more complex analyses. One caveat: even in these cases, the procedure may still be too slow for the user's needs. Another, less well-known solution is the MADlib library, which extends PostgreSQL and allows complex analyses, such as linear/logistic regression and random forests, to be run on large data sets directly from R via the PivotalR package.
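As a minimal sketch of the file-backed approach, here is how bigmemory could be used to work on a table larger than RAM (assumes the bigmemory package is installed; the file "dados.csv" and its column layout are hypothetical):

```r
library(bigmemory)

# read.big.matrix() streams the CSV into a file-backed big.matrix,
# so the data lives on disk rather than in RAM.
x <- read.big.matrix("dados.csv", type = "double", header = TRUE,
                     backingfile = "dados.bin",
                     descriptorfile = "dados.desc")

# Simple descriptive statistics can then be computed column by column
# without ever loading the whole table into memory:
media_col1 <- mean(x[, 1])
```

Later sessions can reattach to the same data instantly with attach.big.matrix("dados.desc"), which is what makes the file-backed design attractive for repeated analyses.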
If the analysis involves only simple queries and descriptive statistics, an interesting solution may be to simply load the data set into a database management system (DBMS) such as PostgreSQL, MySQL, SQLite3, or MonetDB and express the calculations as SQL queries. Alternatively, use the dplyr package: the user defines one of these DBMSs as the data source and the package automatically converts dplyr operations into SQL code. Beyond these alternatives, dplyr also supports cloud Big Data services such as BigQuery, where the user can run query operations directly from the R session, in the same way as when working with a data frame.
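A sketch of the dplyr-on-DBMS workflow, assuming the dplyr and RSQLite packages and a hypothetical SQLite file "dados.sqlite3" containing a "vendas" table (src_sqlite() is the database-source API from the dplyr versions of that era; newer releases moved it to dbplyr):

```r
library(dplyr)

con    <- src_sqlite("dados.sqlite3")  # connect to the SQLite file
vendas <- tbl(con, "vendas")           # lazy reference; no data loaded yet

# These verbs are translated into SQL and executed inside the database;
# only the (small) summary result comes back to R at collect().
resumo <- vendas %>%
  group_by(regiao) %>%
  summarise(total = sum(valor)) %>%
  collect()
```

The key point is laziness: until collect() is called, everything runs in the DBMS, so the table can be far larger than R's available memory.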
In situations where the data set is much larger than RAM, and sometimes intractable on a single computer, there is a need for frameworks that allow distributed processing of large data sets, such as Apache Hadoop or Apache Spark. In these cases, depending on the type of analysis you want to perform, such as simple queries and calculations, Hadoop + R with the RHadoop package, or Spark + R with the SparkR package, may be sufficient.
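A sketch of a simple aggregation with SparkR, using the Spark 1.x-style API that was current when SparkR was a separate package (assumes a local Spark installation; the Parquet file "dados.parquet" and its columns are hypothetical):

```r
library(SparkR)

# Start Spark locally; on a cluster, pass the master URL instead.
sc         <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# The DataFrame stays distributed; only the small summary is collected.
df     <- read.df(sqlContext, "dados.parquet", source = "parquet")
resumo <- collect(summarize(groupBy(df, df$regiao),
                            total = sum(df$valor)))

sparkR.stop()
</```

The same code scales from a laptop to a cluster by changing only the master URL, which is the main appeal of Spark for this scenario.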
Both Hadoop and Spark have associated projects that implement machine learning methods, namely Apache Mahout and MLlib, but these are not available for use from R. However, there is the H2O engine from 0xdata, which has an R API that lets the user fit models on large data sets. MADlib, cited above, can also be used in distributed database management systems such as Greenplum, so that, together with the PivotalR package, it allows complex analyses to be performed. Revolution's ScaleR package can also be used in these cases, using a Big Data infrastructure as a backend.
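To illustrate the H2O route, a sketch of fitting a random forest from R (assumes the h2o package; the file "dados.csv" and the column names x1, x2, y are hypothetical):

```r
library(h2o)

# Start (or connect to) an H2O instance; the data is parsed into
# H2O's own memory, so R holds only a lightweight reference to it.
h2o.init()

dados <- h2o.importFile("dados.csv")

# Fit a random forest inside the H2O engine, not in R's memory.
modelo <- h2o.randomForest(x = c("x1", "x2"), y = "y",
                           training_frame = dados,
                           ntrees = 100)
```

Because the model is trained inside the H2O cluster, the same script works whether H2O is running on the local machine or on a multi-node deployment.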
The question was somewhat open and generic... Even after editing, I still had the impression that it lacked database support, which left room to suggest PostgreSQL as a "solution". Fitting a linear regression only requires having the data in two vectors and passing them to a function, which can be done with PL/R... Perhaps asking "how to fit a linear regression in R with PL/R and data from an XY table?" would be less open.
– Peter Krauss
@Peterkrauss have a look here: http://stackoverflow.com/questions/16612320/when-running-pl-r-on-postgresql-can-handle-data-bigger-then-ram
– Carlos Cinelli
Hmm... by the looks of it you have now found exactly what you needed, within the scope of the answer I wished to give... But there was a "...but unfortunately..." from J. Conway himself, who is "the guy" behind PL/R. A pity.
– Peter Krauss
Clearly a case for data.table. I suggest reading this presentation. data.table lets you handle and analyze tables with hundreds of millions of rows, and fread reads 20 GB files in a few minutes.
– Paulo Cardoso
@Paulocardoso data.table works within R, and is therefore also limited by what your RAM allows.
– Carlos Cinelli