Filter a 5GB CSV file in R

Viewed 940 times

6

I am trying, in every way I can think of, to work with a 5GB file for my monograph.

The code I am trying to run is as follows:

> write.csv(subset(read.csv("enem.csv", header=TRUE), UF_ESC=="SP"), "filtro.csv", row.names=FALSE)

I tested it with a sample file and everything went fine, but not with the original database. I imagine it is the size, because it just keeps thinking and nothing happens.

If anyone has another idea, it would be a huge help.

  • 1

    Did it fail because it never finished running? That is probably because the file is so large. Have you ever run this analysis on data of this magnitude? How long did it run for? How much RAM does the computer have?

  • Andre, add the information about the computer you are using to the question. These answers are also worth checking to see if they help you: http://answall.com/questions/30631/strategy-parsingbases-databases-very big

  • Carlos, my machine's configuration is as follows: 2.9GB of memory, Intel CPU 585 at 2.16GHz, 32-bit, running Ubuntu 14.04 LTS. Your link can help a lot; it has already given me several ideas that I will try to put into practice.

  • Do you have a link to the CSV? If possible, post the link here so we can test the solutions. Your question is very similar to this one: http://answall.com/questions/35469/pre-processr-grande-arquivo-de-texto-txt-substr-por; the ff package probably solves your problem.

2 answers

7

Here are some tips to solve your problem:

1 - Read the file using another function:

> library(microbenchmark)
> microbenchmark(
+   base = read.csv(file = "df-write-csv.csv", header = T),
+   readr = readr::read_csv("df-write-csv.csv"),
+   data.table = data.table::fread("df-write-csv.csv"),
+   rio = rio::import("df-write-csv.csv", format = "csv")
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq      max neval
       base 1836.230 1912.1815 2253.6071 1980.3995 2282.1675 4148.787   100
      readr  823.960  881.3625 1072.4790  921.6605 1120.2365 3538.359   100
 data.table  327.759  364.4810  442.5933  402.3295  458.7895  920.436   100
        rio  312.317  351.2260  444.1087  382.9325  439.7960 2938.490   100

Note that reading the file with the fread function from data.table, or with the import function from rio, is about 4x faster than with the native R function. Also check that the file really was read correctly.
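
For example, a minimal sketch of this first tip applied to the file from the question (the file name enem.csv comes from the question itself; this is only an illustration):

library(data.table)

# read the large file with fread instead of read.csv
enem <- fread("enem.csv")

# confirm the read really worked before going further
dim(enem)
head(enem)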

2 - Check that you actually managed to filter your database. Save the result of subset to an auxiliary object. If this is where the problem is, try filtering with functions from other packages, such as dplyr or data.table.
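
For example, a sketch of the question's one-liner split into separate steps (the file and column names are taken from the question), so it becomes clear which step is actually failing:

# step 1: read the file
dados <- read.csv("enem.csv", header = TRUE)

# step 2: filter, saving the result of subset to an auxiliary object
dados_sp <- subset(dados, UF_ESC == "SP")
nrow(dados_sp) # did the filter return any rows?

# step 3: only then write the filtered result
write.csv(dados_sp, "filtro.csv", row.names = FALSE)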

For large operations, data.table can be much faster.

> df <- data.frame(x = 1:100000, y = 1:100000, l = sample(letters, size = 100, replace = T))
> microbenchmark(
+   base = subset(df, l == "a"),
+   dplyr = dplyr::filter(df, l == "a"),
+   data.table = data.table::data.table(df)[l == "a",]
+ )
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
       base 10.329514 12.467143 14.962479 13.976907 17.171858  24.3761   100
      dplyr  7.331626  8.624356 10.063947  8.853807 11.140871  16.8939   100
 data.table  2.986519  4.580536  6.774548  4.824227  5.957255 119.9709   100

3 - Use the write_csv function from the readr package; it is roughly 2x faster than R's native write.csv function.

microbenchmark(
  base = write.csv(df, file = "df-write-csv.csv", row.names = F),
  rio = rio::export(df, file = "df-rio.csv", format = "csv"),
  readr = readr::write_csv(df, path = "df-readr.csv")
)

Unit: microseconds
  expr     min       lq     mean    median       uq      max neval
  base 713.564 1097.534 2025.377 1467.4980 2996.136 4168.352   100
   rio 718.141 1156.998 2243.143 2011.5310 3106.479 7368.046   100
 readr 366.306  594.629 1265.297  734.0445 1793.405 5852.142   100

Anyway, if you were able to read the 5GB file, it is very likely that you can also write it, since it is already in the RAM of your computer.
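
Putting the three tips together, here is a minimal sketch of the question's one-liner rewritten with the faster functions (the file name and the UF_ESC column are assumptions taken from the question):

library(data.table)
library(readr)

enem <- fread("enem.csv")        # tip 1: faster read
enem_sp <- enem[UF_ESC == "SP"]  # tip 2: data.table filter
write_csv(enem_sp, "filtro.csv") # tip 3: faster write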

  • Daniel, there is no way I can load the entire file into memory, unfortunately. After I extract the data I will use in my work, it will be much smaller.

1

André, since you will only filter your database and it will be much smaller afterwards, you can read it in chunks. To do that, you can proceed as follows.

Just to test, I created the following "large" file:

library(readr)
library(dplyr)
x <- data.frame(x = runif(3e6), y = 1:3e6)
write_csv(x, path = "test.csv")

The following code snippet reads the database in small chunks (of tam_chunk rows each), filters each chunk, and appends the result to a file called filtrado.csv.

See if it works this way. It should be time-consuming, but at least you get around the memory problem:

# open a connection to the large file
arq_grande <- file("test.csv", "r")
tam_chunk <- 1e5 # number of rows to read per chunk
# read the first 100 rows (with the header) and create
# the filtered output file
df1 <- read.csv(arq_grande, nrows = 100, header = T)
df_filtrado <- df1 %>% filter(x <= 0.5)
write.table(df_filtrado, "filtrado.csv", row.names = F, sep = ",", dec = ".")
# read the rest of the file in chunks
repeat {
  # the connection keeps its position, so each call reads the next chunk (no header)
  df <- read.csv(arq_grande, header = FALSE, col.names = names(df1), nrows = tam_chunk)
  cat("Read", nrow(df), "rows\n")
  if (nrow(df) == 0) # nothing left to read?
    break
  df_filtrado <- df %>% filter(x <= 0.5) # filter the chunk
  # append the filtered chunk to the output file
  write.table(df_filtrado, "filtrado.csv", append = T,
              col.names = F, row.names = F, sep = ",", dec = ".")
}
close(arq_grande)

This answer was heavily inspired by this Stack Overflow answer.
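
As a side note, if your version of readr already has read_csv_chunked, the same chunk-and-filter idea can be written more compactly. This is only a sketch of that alternative, using the test file and filter from the example above:

library(readr)
library(dplyr)

# read test.csv in chunks of 1e5 rows, filter each chunk and
# combine the filtered pieces into a single data frame
filtrado <- read_csv_chunked(
  "test.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk %>% filter(x <= 0.5)
  }),
  chunk_size = 1e5
)

write_csv(filtrado, "filtrado.csv")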

  • Daniel, thanks for the help. But when I tried to use your answer, I got the following error: "Could not find function %>%". I tried to search for information about it, but without success. The error occurs on the following line of code: df_filtrado <- df1 %>% filter(x <= 0.5)

  • @Andréoliveira you probably did not have the dplyr package loaded. Install it with install.packages("dplyr") and then use the command library(dplyr) to load it.

  • You were right, I had to update R and everything else. Now the code runs, but it only writes the first rows; the ones from the loop are not written.

  • Does any error appear?

  • No, it just says it has read more than 70 thousand rows, but it only writes the first 1000 rows. I have tried everything I could think of.
