Reading in chunks with read.csv does not advance. How can I fix it?


I am trying to use the following code to read a giant file that does not fit in memory.

library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2
df1 <- read.csv(arq_grande, nrows = 10, header = T, sep = ",", dec = ".")
df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
write.table(df_filtrado, "sexoM.csv", row.names = F, sep = ",", dec = ".")

nrow <- 1

repeat {
  df <- read.csv(arq_grande, header = FALSE, col.names = names(df1), nrows = tam_chunk)
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0)
    break
  df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = T, col.names = F, row.names = F, sep = ",", dec = ".")
}
close(arq_grande)

The problem is that the reading does not advance: it just keeps repeating the first 10 lines non-stop.

The writing seems to be working.

  • Notice that you have nrows = 10 there; could that be the problem?

  • There is nothing here that makes the reads advance 10 lines at a time; the first 10 lines are always read. You can try using the skip argument and a counter to keep track of how many lines you have already read (a sketch of this appears after the comments). In reality, this is not the best way to deal with the problem; there are much more elegant and probably more efficient approaches. See the answers to this question: Strategies to analyze very large databases in R (that do not fit in RAM)

  • I think the answers to this question may also help: http://answall.com/questions/108729/filtrar-aquivo-csv-de-5gb-no-r/109028#109028

  • @Molx When you read from a connection, as in the code André is using (with file(...)), the lines are read sequentially even without the skip argument. @Andréoliveira your code is correct. Are you sure there is nothing wrong with your file?

  • Actually, I just found the bug! Inside the repeat loop you have to use df_filtrado <- df %>% filter(TP_SEXO == 'M') instead of df_filtrado <- df1 %>% filter(TP_SEXO == 'M'); the problem is df1, which is the data frame used only to read the first few lines (the corrected loop appears after the comments).

  • @Danielfalbel I honestly did not know that... but I still suspect other approaches would be better, so half of my comment still stands. :)

  • I would look at two options that seem simpler to me... One is to use the read.csv.sql function from the sqldf package to bring in the data already filtered (a sketch appears after the comments). The other: if you just want to split the records by sex, you can do it from the command line with awk, grep and the like. Ex: http://superuser.com/questions/827539/grep-to-filter-gigantic-csv-file

  • @Danielfalbel I think you could post the bug you found as an answer.

  • @Carloscinelli I think the question should actually be closed. Correcting a typo is unlikely to help anyone else, don't you think?

  • Possible duplicate of Pre-process large text files in R

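For reference, the skip-and-counter idea mentioned in the comments would look roughly like the sketch below (a standalone sketch, reusing the file name, chunk size and TP_SEXO column from the question). Each pass reopens the file and skips everything already read, so read.csv rescans the file from the top on every iteration; that is why the commenter points to more efficient approaches.

library(dplyr)

tam_chunk <- 1e2
cabecalho <- names(read.csv("dados2014.csv", nrows = 1))
linhas_lidas <- 0   # counter of data rows already consumed

# write the header of the output file once
writeLines(paste(cabecalho, collapse = ","), "sexoM.csv")

repeat {
  # skip the header plus everything already read; when nothing is left,
  # read.csv errors with "no lines available in input", treated here as
  # an empty chunk
  df <- tryCatch(
    read.csv("dados2014.csv", header = FALSE, col.names = cabecalho,
             skip = linhas_lidas + 1, nrows = tam_chunk),
    error = function(e) data.frame()
  )
  if (nrow(df) == 0) break
  linhas_lidas <- linhas_lidas + nrow(df)
  df_filtrado <- df %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = TRUE, col.names = FALSE,
              row.names = FALSE, sep = ",", dec = ".")
}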
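To make the fix found in the comments concrete, here is the question's code with df_filtrado computed from df (the chunk just read) instead of df1. The only other change is a tryCatch around the read: depending on the file, the final read from an exhausted connection can raise "no lines available in input" rather than return zero rows, so it is caught and treated as an empty chunk.

library(dplyr)

arq_grande <- file("dados2014.csv", "r")
tam_chunk <- 1e2

# first chunk: read with the header and create the output file
df1 <- read.csv(arq_grande, nrows = 10, header = TRUE, sep = ",", dec = ".")
df_filtrado <- df1 %>% filter(TP_SEXO == 'M')
write.table(df_filtrado, "sexoM.csv", row.names = FALSE, sep = ",", dec = ".")

repeat {
  df <- tryCatch(
    read.csv(arq_grande, header = FALSE, col.names = names(df1),
             nrows = tam_chunk),
    error = function(e) df1[0, ]   # no lines left: empty chunk
  )
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0) break
  # filter the chunk just read (df), not the first chunk (df1)
  df_filtrado <- df %>% filter(TP_SEXO == 'M')
  write.table(df_filtrado, "sexoM.csv", append = TRUE, col.names = FALSE,
              row.names = FALSE, sep = ",", dec = ".")
}
close(arq_grande)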
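For the sqldf suggestion, a minimal sketch using the file and column names from the question: read.csv.sql stages the data in a temporary on-disk SQLite database and runs the query there, so only the rows matching the WHERE clause are loaded into R. Note that if the character fields in the CSV are quoted, the comparison value may need to include the quote characters, a known quirk of the SQLite import.

library(sqldf)

# filter while reading: only rows with TP_SEXO = 'M' reach R
sexo_m <- read.csv.sql("dados2014.csv",
                       sql = "select * from file where TP_SEXO = 'M'",
                       header = TRUE, sep = ",")
write.table(sexo_m, "sexoM.csv", row.names = FALSE, sep = ",", dec = ".")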