Pre-process large text files in R

Question

Pre-process large text files in R

Asked 11 years ago

Viewed 1,955 times

13

I am writing a script, which I will make public, to open the micro-data of RAIS (dehydrated, available here) in R using Monetdb. However the bank does not accept vígula (,) as decimal separator. Each RAIS Ufano.txt file is quite large (up to 7GB) so the solution cannot require modifications that fit into the RAM. Two alternatives:

a) import into the database, as if everything were string, and then make an UPDATE within SQL by creating new columns for the numeric variables and replacing "," by ".".

b) pre-process the file by replacing . txt with a semicolon.

The question is about the alternative "b".

Is there any efficient way to do this substitution? Ajdamico indicates a slow way, replacing line by line here.

As an example we can from the file of Acre 2012 (AC2012.txt), which I make available at this link

As it is to be packaged as R command, the solution cannot depend on the OS nor require the installation of things outside the R.

3 answers

Browser other questions tagged sql r csv txt

You are not signed in. Login or sign up in order to post.

by Flavio Barros • **1,717** points · Answer 1 · 2014-10-11T16:08:02+00:00

Very good solution @Carloscinelli. But an alternative solution is using the package Iterators. The change_dot() function basically reads a line, swaps the ',' for the '.' and writes the line in a text file.

library(iterators)

change_dot <- function(file, saida='teste.txt', chunk=1) {
  con1 <- file(file, 'r')
  con2 <- file(saida, open = 'w')
  linha <- 0
  it <- ireadLines(con1, n=chunk)
  out <- tryCatch(expr=write(x = gsub(pattern = ',', replacement = '.', x = nextElem(it)), con2), 
                   error=function(e) e)

  while(!any(class(out) == "error")) {
    linha = linha + 1
    print(paste('Escrita linha ', linha))
    out <- tryCatch(expr=write(x = gsub(pattern = ',', replacement = '.', x = nextElem(it)), con2, append = T), 
                  error=function(e) e)
  }
}

system.time(change_dot(file = 'AC2012.txt', saida = 'saida.csv'))



 user  system elapsed 
  48.65    4.70   53.04

In this file, the AC2012.txt, the procedure took 48 seconds on my machine.

Here it should be noted that it is possible to increase the size of Chunk to values greater than 1. For example, increasing to 40000 obtained the following times using this solution, and Damico’s solution:

change_ponto <- function() {
            file.create("acre.txt")
            outcon <- file( "acre.txt" , "w" )
            incon <- file("AC2012.txt" , "r" )
            while( length( one.line <- readLines( incon , 40000 , encoding="latin1") ) > 0 ){
              one.line <- gsub( ',' , '.' , one.line )
              writeLines( one.line , outcon )
            }
}

system.time(change_ponto())

 user  system elapsed 
   6.53    0.82    7.36

system.time(change_dot(file = 'AC2012.txt', saida = 'teste4.csv', chunk = 40000))

 user  system elapsed 
   6.71    3.12    9.92

And now testing if the files are the same:

    teste2 <- read.csv("acre.txt", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
    teste4 <- read.csv("teste4.csv", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
    all.equal(teste2, teste4)

[1] TRUE

I wrote a post about Iterators on my blog a while back: http://www.rmining.com.br/2015/09/07/preparacao-de-dados-parte-2/

by Carlos Cinelli • **16,826** points · Answer 2 · 2014-10-10T02:42:18+00:00

I made a partial solution (which should not be the most efficient) that reads the file in Chunks and goes saving in a csv, already with the decimal as a point instead of a comma. Then you just set the size of Chunk that fits in your RAM. It worked with the ACRE test file, in theory it would work with larger files.

chunks <- 40000
acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, stringsAsFactors=FALSE)
write.table(acre, "teste.csv",sep=",")
oldrows <- chunks
while(nrow(acre)>0){
  acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, skip=oldrows, 
                               stringsAsFactors=FALSE)
  cat("Lido: linhas", oldrows, "a", oldrows + nrow(acre), "\n")
  oldrows <- oldrows+nrow(acre)
  write.table(acre, "teste.csv", col.names=FALSE, append=TRUE, sep=",")
}

# testando
original <- read.csv2("AC2012.txt", fileEncoding="latin1",stringsAsFactors=FALSE)
novo <- read.csv("teste.csv", header=TRUE,sep=",", stringsAsFactors=FALSE, row.names=NULL)
all.equal(novo[-1], original)
TRUE

I also tested the solution proposed in Damico’s question, with Chunks, and this is faster, at least in this example:

file.create("acre.txt")
outcon <- file( "acre.txt" , "w" )
incon <- file("AC2012.txt" , "r" )
while( length( one.line <- readLines( incon , 40000 , encoding="latin1") ) > 0 ){
  one.line <- gsub( ',' , '.' , one.line )
  writeLines( one.line , outcon )
}

by LucasMation • **411** points · Answer 3 · 2014-10-13T13:10:45+00:00

Thanks to @Carloscinelli and @Flaviobarros for the great answers. I decided to put a 3rd alternative to my question above, using the "ff" package to import the data into R for a "ffdf" and then export to a traditional CSV:

dat <- read.csv2.ffdf(file="AC2012.txt", strip.white=TRUE,
                      na.strings = c('{ñclass}','{ñ','{ñ class}','{ñc','000-1')))
write.csv.ffdf(dat, "AC2012.csv")

I measured the total time elapsed. For Acre are 4,65s to read and 5,17s to save, in a hard-disk SSD. For all 2002 RAIS state files, whose . txt occupy 11.8GB, were in total 29min to import and 35min to export.

Using the ff package has some advantages:

The ff package decides how to optimize between disk and RAM when importing the data. The user does not think about it and uses a function with usual syntax (read.csv2).
As in read.csv2, the read.csv2.ffdf function assumes that the . CSV is European/Latin, using "," as decimal separator and ";" as field separator
The function also handles other problems of the current version of RAIS files such as some fields with database spaces before the data (e.g., "" 999") and the strings used to identify missings ( "{ñclass}", "{ñ", "{ñ class}", "{ñc", "000-1"). After fixed these problems the file size of 2002 decreases to 4.7GB, less than half.

Perhaps a debauchery is that the ff package does not hold up externally large bases, while the other solutions of this post are scalable without major problems.