Reading in chunks with read.csv does not advance. How can I fix it?


I am trying to use the following code to read a giant file that does not fit in memory.

library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2
df1 <- read.csv(arq_grande, nrows = 10, header = T, sep = ",", dec = ".")
df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
write.table(df_filtrado, "sexoM.csv", row.names = F, sep = ",", dec = ".")

nrow <- 1

repeat {
  df <- read.csv(arq_grande, header = FALSE, col.names = names(df1), nrows = tam_chunk)
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0)
    break
  df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = T, col.names = F, row.names = F, sep = ",", dec = ".")
}
close(arq_grande)

The problem is that the reading does not advance: it just keeps repeating the first 10 lines non-stop.

The writing seems to be working.

  • Notice that you have nrows = 10 there; could that be the problem?

  • There is nothing here that makes the reads advance 10 lines at a time; the first 10 lines are always read. You can try using the skip argument and a counter to keep track of how many lines you have already read (a sketch of this appears after the comments). In reality, this is not the best way to deal with the problem; there are much more elegant and probably more efficient approaches. See the answers to this question: Strategies to analyze very large databases in R (that do not fit in RAM)

  • I think the answers to this question may also help: http://answall.com/questions/108729/filtrar-aquivo-csv-de-5gb-no-r/109028#109028

  • @Molx When you read from a connection, as in the code André is using (with file(...)), the lines are read sequentially even without the skip argument. @Andréoliveira your code is correct. Are you sure there is nothing wrong with your file?

  • Actually, I just found the bug! Inside the repeat loop you have to use df_filtrado <- df %>% filter(TP_SEXO == 'M') instead of df_filtrado <- df1 %>% filter(TP_SEXO == 'M'); the problem is df1, which is the data frame used only to read the first few lines (the corrected loop appears after the comments).

  • @Danielfalbel I honestly did not know that... but I still suspect other approaches would be better, so half of my comment still stands. :)

  • I would look at two options that seem simpler to me... One is to use the read.csv.sql function from the sqldf package to bring in the data already filtered (a sketch appears after the comments). The other: if you just want to split the records by sex, you can do it from the command line with awk, grep and the like. Ex: http://superuser.com/questions/827539/grep-to-filter-gigantic-csv-file

  • @Danielfalbel I think you could post the bug you found as an answer.

  • @Carloscinelli I think the question should actually be closed. Correcting a typo is unlikely to help anyone else, don't you think?

  • Possible duplicate of Pre-process large text files in R

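For reference, the skip-and-counter idea mentioned in the comments would look roughly like the sketch below (a standalone sketch, reusing the file name, chunk size and TP_SEXO column from the question). Each pass reopens the file and skips everything already read, so read.csv rescans the file from the top on every iteration; that is why the commenter points to more efficient approaches.

library(dplyr)

tam_chunk <- 1e2
cabecalho <- names(read.csv("dados2014.csv", nrows = 1))
linhas_lidas <- 0   # counter of data rows already consumed

# write the header of the output file once
writeLines(paste(cabecalho, collapse = ","), "sexoM.csv")

repeat {
  # skip the header plus everything already read; when nothing is left,
  # read.csv errors with "no lines available in input", treated here as
  # an empty chunk
  df <- tryCatch(
    read.csv("dados2014.csv", header = FALSE, col.names = cabecalho,
             skip = linhas_lidas + 1, nrows = tam_chunk),
    error = function(e) data.frame()
  )
  if (nrow(df) == 0) break
  linhas_lidas <- linhas_lidas + nrow(df)
  df_filtrado <- df %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = TRUE, col.names = FALSE,
              row.names = FALSE, sep = ",", dec = ".")
}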
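To make the fix found in the comments concrete, here is the question's code with df_filtrado computed from df (the chunk just read) instead of df1. The only other change is a tryCatch around the read: depending on the file, the final read from an exhausted connection can raise "no lines available in input" rather than return zero rows, so it is caught and treated as an empty chunk.

library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2

# first chunk: read with the header and create the output file
df1 <- read.csv(arq_grande, nrows = 10, header = TRUE, sep = ",", dec = ".")
df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
write.table(df_filtrado, "sexoM.csv", row.names = FALSE, sep = ",", dec = ".")

repeat {
  df <- tryCatch(
    read.csv(arq_grande, header = FALSE, col.names = names(df1),
             nrows = tam_chunk),
    error = function(e) df1[0, ]   # no lines left: empty chunk
  )
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0) break
  # filter the chunk just read (df), not the first chunk (df1)
  df_filtrado <- df %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = TRUE, col.names = FALSE,
              row.names = FALSE, sep = ",", dec = ".")
}
close(arq_grande)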
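For the sqldf suggestion, a minimal sketch using the file and column names from the question: read.csv.sql stages the data in a temporary on-disk SQLite database and runs the query there, so only the rows matching the WHERE clause are loaded into R. Note that if the character fields in the CSV are quoted, the comparison value may need to include the quote characters, a known quirk of the SQLite import.

library(sqldf)

# filter while reading: only rows with TP_SEXO = 'M' reach R
sexo_m <- read.csv.sql("dados2014.csv",
                       sql = "select * from file where TP_SEXO = 'M'",
                       header = TRUE, sep = ",")
write.table(sexo_m, "sexoM.csv", row.names = FALSE, sep = ",", dec = ".")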