I am trying to make the following code read a giant file that does not fit in memory.
library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2

df1 <- read.csv(arq_grande, nrows = 10, header = T, sep = ",", dec = ".")
df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
write.table(df_filtrado, "sexoM.csv", row.names = F, sep = ",", dec = ".")

nrow <- 1
repeat {
  df <- read.csv(arq_grande, header = FALSE, col.names = names(df1), nrows = tam_chunk)
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0)
    break
  df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = T, col.names = F, row.names = F, sep = ",", dec = ".")
}
close(arq_grande)
The problem is that the reading does not advance: it just keeps repeating the first 10 lines non-stop.
The writing seems to be working.
Notice you have nrows = 10 there; isn't that it?
– SCOFIELD
There is nothing here that makes the reads advance 10 lines at a time; the first 10 lines are always read. You can try using the skip argument and a counter to record how many lines you have already read. Actually, this is not the best way to deal with this problem; there are much more elegant and probably more efficient approaches. See the answers to this question: Strategies to analyze very large databases in R (that do not fit in RAM)
– Molx
I think the answers to this question may also help: http://answall.com/questions/108729/filtrar-aquivo-csv-de-5gb-no-r/109028#109028
– Daniel Falbel
@Molx When you read files from a connection, as in the code that André is using (with file(...)), the lines are read sequentially even without the skip argument. @Andréoliveira your code is correct. Are you sure there is no problem with your file?
– Daniel Falbel
Actually, I just found the bug! Inside the repeat loop, you have to use df_filtrado <- df %>% filter(TP_SEXO == 'M') instead of df_filtrado <- df1 %>% filter(TP_SEXO == 'M'). The problem is df1, which is the data frame used only to read the first lines.
– Daniel Falbel
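A minimal sketch of the loop with that fix applied (same file and column names as in the question; note that read.csv raises an error rather than returning zero rows when the connection is exhausted, so the sketch also stops on that error):

```r
library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2

# First chunk: read the header and write the filtered rows with column names
df1 <- read.csv(arq_grande, nrows = tam_chunk, header = TRUE, sep = ",", dec = ".")
write.table(df1 %>% filter(TP_SEXO == 'M'), "sexoM.csv",
            row.names = FALSE, sep = ",", dec = ".")

# Later chunks: the connection advances on every read, so no skip is needed
repeat {
  df <- tryCatch(
    read.csv(arq_grande, header = FALSE, col.names = names(df1),
             nrows = tam_chunk),
    error = function(e) NULL  # "no lines available in input" at end of file
  )
  if (is.null(df) || nrow(df) == 0)
    break
  # Filter the current chunk (df), not the first one (df1)
  write.table(df %>% filter(TP_SEXO == 'M'), "sexoM.csv",
              append = TRUE, col.names = FALSE, row.names = FALSE,
              sep = ",", dec = ".")
}
close(arq_grande)
```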
@Danielfalbel I honestly did not know this... but I still suspect that other approaches would be better, so half of my comment still stands. :)
– Molx
I would take a look at two options that seem simpler to me... One is to use the read.csv.sql function from the sqldf package to bring in the data already filtered. The other, if you just want to split the records by sex, is to do it from the command line with awk, grep and the like. Ex: http://superuser.com/questions/827539/grep-to-filter-gigantic-csv-file
– Icaro Bombonato
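For the first option, a hedged sketch with read.csv.sql (by default sqldf refers to the CSV as `file` inside the SQL query):

```r
library(sqldf)

# The query runs in a temporary on-disk SQLite database, so the whole CSV is
# never loaded into R's memory; only the matching rows come back as a data frame
homens <- read.csv.sql("dados2014.csv",
                       sql = "select * from file where TP_SEXO = 'M'")

write.table(homens, "sexoM.csv", row.names = FALSE, sep = ",", dec = ".")
```

The command-line route from the linked question amounts to the same filter: keep the header line and the rows whose TP_SEXO column matches M.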
@Danielfalbel I think you could post the error you found as an answer
– Carlos Cinelli
@Carloscinelli I believe the question should be closed, actually. Correcting a typo is unlikely to help anyone else, don't you think?
– Daniel Falbel
Possible duplicate of Pre-process large text files in R
– Flavio Barros