Here are some tips to solve your problem:
1 - Read the file using another function:
> library(microbenchmark)
> microbenchmark(
+ base = read.csv(file = "df-write-csv.csv", header = T),
+ readr = readr::read_csv("df-write-csv.csv"),
+ data.table = data.table::fread("df-write-csv.csv"),
+ rio = rio::import("df-write-csv.csv", format = "csv")
+ )
Unit: microseconds
      expr      min        lq      mean    median        uq      max neval
      base 1836.230 1912.1815 2253.6071 1980.3995 2282.1675 4148.787   100
     readr  823.960  881.3625 1072.4790  921.6605 1120.2365 3538.359   100
data.table  327.759  364.4810  442.5933  402.3295  458.7895  920.436   100
       rio  312.317  351.2260  444.1087  382.9325  439.7960 2938.490   100
Note that reading the file with the fread function from data.table or the import function from rio is about 4x faster than with the native R function. Make sure you have actually read the file.
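If you want to confirm that the file really made it into memory, a quick check could look like this (a minimal sketch; df is just an assumed name for the object returned by fread, and the file name is the example one used above):

df <- data.table::fread("df-write-csv.csv")
dim(df)   # number of rows and columns actually read
str(df)   # column names and types
head(df)  # first rows, to see that the content looks sensible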
2 - Check that you have actually managed to filter your database, and save the result of the subset to an auxiliary object (see the sketch after the benchmark below). If this is the problem, try filtering with functions from other packages, such as dplyr or data.table. For long operations, data.table can be much faster.
> df <- data.frame(x = 1:100000, y = 1:100000, l = sample(letters, size = 100, replace = T))
> microbenchmark(
+ base = subset(df, l == "a"),
+ dplyr = dplyr::filter(df, l == "a"),
+ data.table = data.table::data.table(df)[l == "a", ]
+ )
Unit: milliseconds
      expr       min        lq      mean    median        uq      max neval
      base 10.329514 12.467143 14.962479 13.976907 17.171858  24.3761   100
     dplyr  7.331626  8.624356 10.063947  8.853807 11.140871  16.8939   100
data.table  2.986519  4.580536  6.774548  4.824227  5.957255 119.9709   100
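As mentioned in the tip above, keep the filtered result in an auxiliary object and inspect it. A minimal sketch, assuming the same toy df and the condition l == "a" used in the benchmark:

df_a <- subset(df, l == "a")   # or dplyr::filter(df, l == "a")
nrow(df_a)                     # 0 rows means the filter matched nothing
head(df_a)                     # look at a few of the filtered rows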
3 - Use the write_csv function from the readr package; it is roughly 2x faster than R's native write.csv function.
microbenchmark(
base = write.csv(df, file = "df-write-csv.csv", row.names = F),
rio = rio::export(df, file = "df-rio.csv", format = "csv"),
readr = readr::write_csv(df, path = "df-readr.csv")
)
Unit: microseconds
 expr     min       lq     mean    median       uq      max neval
 base 713.564 1097.534 2025.377 1467.4980 2996.136 4168.352   100
  rio 718.141 1156.998 2243.143 2011.5310 3106.479 7368.046   100
readr 366.306  594.629 1265.297  734.0445 1793.405 5852.142   100
Anyway, if you were able to read the 5GB file, it is very likely that you can also write it, since it is already in the RAM of your computer.
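If you want to check how much RAM the object is taking up before trying to write it, something along these lines should work (df is the assumed name of your data frame):

format(object.size(df), units = "Gb")  # approximate size of the object in memory
gc()                                   # summary of memory used by the R session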
Did it not work because it never finished running? That is probably because the file is large. Have you run this kind of analysis on data of this size before? How long was it running? How much RAM does the computer have?
– Molx
André, add the specs of the computer you are using to the question. It is also worth looking at these answers and seeing if they help you: http://answall.com/questions/30631/strategy-parsingbases-databases-very big
– Carlos Cinelli
Carlos, the configuration of my machine is as follows: Memory 2.9GB, Intel CPU 585 2.16GHz, 32-bit, running Ubuntu 14.04 LTS. Your link can help a lot; for now it has already given me several ideas that I will see if I can put into practice.
– André Oliveira
Do you have a link to the csv? If possible, post the link here so we can test the solutions. Your question is very similar to this one: http://answall.com/questions/35469/pre-processr-grande-arquivo-de-texto-txt-substr-por; the ff package will probably solve your problem.
– Carlos Cinelli
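For reference, a minimal sketch of reading a csv that does not fit comfortably in RAM with the ff package mentioned in the comment above; the file name and chunk sizes here are placeholders, not values from the question:

library(ff)
# Read the csv in chunks into an on-disk ffdf object instead of an in-memory data frame
df_ff <- read.csv.ffdf(file = "big-file.csv", header = TRUE,
                       first.rows = 100000, next.rows = 500000)
dim(df_ff)  # dimensions are available without loading the whole file into RAM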