In R, a function that reads only a few columns of a dataframe in Rda format

I have 27 txt files. Each file represents a state of Brazil; in fact, each file is a data frame on the labor market. Together the 27 files add up to more than 20 GB. The first thing I did to reduce this was to save each file in Rda format. With this, for example, 20 GB can be reduced to approximately 3 GB. This is very good, but the big problem is that often I don't need to read all the variables of the data frame (approximately 40 variables in total). For example, in the case of txt, I can use the fread function to read only 3 variables:

fread("data.txt", select = c("var1","var2","var3") )

Unfortunately, I couldn't find an equivalent for the Rda case, so I decided to create a function that lets me read just a few columns. Let's take one of the 27 files as an example: RJ.txt. The idea is to split this data frame by columns, save each column in Rda format, and keep everything in a folder. I created a function to do this:

library(data.table)

df <- fread("RJ.txt")                        # read the original file
arquivo_pasta <- "C:/Meu diretorio/pastaRJ"  # this is the folder where I will store all the variables

# This is my saving function
save2 <- function(df, arquivo_pasta)
{
  dfl <- as.list(df)  # the data frame is now a list, one element per column
  remove(df)
  setwd(arquivo_pasta)
  for (i in 1:length(dfl))
  {
    v <- dfl[[i]]
    save(v, file = paste0(names(dfl)[i], ".Rda"))  # save each column in its own .Rda file
  }
}
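A minimal usage sketch with the objects defined above (the folder must already exist):

save2(df, arquivo_pasta)
list.files(arquivo_pasta)   # should show one .Rda file per column of RJ.txt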

This way, I have a folder with the 40 columns of RJ.txt, each in Rda format. Now I create a function to read only a few columns:

read2 <- function(arquivo_pasta, colunas)
{
  setwd(arquivo_pasta)

  # Build a small table with an auxiliary variable so we can select only the columns we want
  path <- list.files(path = arquivo_pasta, all.files = TRUE, full.names = TRUE)
  path <- as.data.frame(path)

  # Create the auxiliary variable containing only the variable name
  path$aux <- gsub(arquivo_pasta, "", path$path)
  path$aux <- gsub("/", "", path$aux)
  path$aux <- gsub(".Rda", "", path$aux)

  # Finally, keep only the requested columns
  path <- subset(path, aux %in% colunas)

  # Auxiliary variable so we can start binding the columns
  df_ret <- 1

  for (i in 1:nrow(path))
  {
    load(as.character(path$path[i]))   # loads the object `v` saved by save2
    dfaux <- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret <- cbind(df_ret, dfaux)
  }

  # Drop the auxiliary variable
  df_ret <- df_ret[, df_ret := NULL]
  return(df_ret)
}
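For example, to load only three of the 40 variables (a sketch; var1, var2 and var3 are placeholder column names):

df_rj <- read2(arquivo_pasta, c("var1", "var2", "var3"))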

As you can imagine, I am doing this because I want to get rid of all the txt files. The problem is that I want to do this a bit more efficiently and faster. I wonder if you have any idea how to improve this, especially in terms of execution time.

  • Improving code execution time in R is not a trivial task; every case is different. In general, it is recommended not to use loops. In your case, in addition to the loops, there is disk access, which slows the code down. See if the tips in this link help you in any way. Another thing you can try is working with packages that let you read all the data into memory, as suggested in this post.

1 answer



A good solution is to use the fst package. Note that it is not yet ideal for long-term storage, since it is still under heavy development.

According to the README, it compresses about as well as saveRDS, is faster to read and write, and allows reading only a few columns (and rows).

Example:

library(fst)

# Generate a random data frame with 10 million rows and various column types
nrOfRows <- 1e7

x <- data.frame(
  Integers = 1:nrOfRows,  # integer
  Logicals = sample(c(TRUE, FALSE, NA), nrOfRows, replace = TRUE),  # logical
  Text = factor(sample(state.name, nrOfRows, replace = TRUE)),  # text
  Numericals = runif(nrOfRows, 0.0, 100),  # numericals
  stringsAsFactors = FALSE)

# Store it
write.fst(x, "dataset.fst")

# Retrieve it
y <- read.fst("dataset.fst")

# Read only some rows and columns
read.fst("dataset.fst", c("Logicals", "Text"), 2000, 4990)  # subset rows and columns
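Applied to the question's case, the conversion could look roughly like this (a sketch; RJ.txt and var1/var2/var3 are placeholders taken from the question):

library(data.table)
library(fst)

df <- fread("RJ.txt")
write.fst(df, "RJ.fst")    # one fst file per state instead of one .Rda per column

df_sel <- read.fst("RJ.fst", c("var1", "var2", "var3"))  # read only the needed columns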
  • What do you mean by long-term storage? Do you mean it is not safe to store my data in this format? What risks am I taking?

  • The package README says: "Note to users: The binary format used for data storage by the package (the 'fst file format') is expected to evolve in the coming months. Therefore, fst should not be used for long-term data storage." The risk is that you may not be able to access the data with newer versions of the package. (You can always use the older version to read the data and save it again with the new one.)

  • I tested saving in fst and rda format here. For saving, write.fst was indeed much faster than save, although the space taken by the rda is somewhat smaller than the fst. For reading, load is a little faster than read.fst (when I read all the variables). But I think it's worth it. I'll keep testing. (A rough timing sketch is shown after these comments.)

  • It seems it has a compress parameter that affects both the speed and the disk space the file takes up, as in the sketch below.
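A rough way to check both points (a sketch; x is the example data frame from the answer, and the exact numbers will depend on your disk and on the compress level):

library(fst)

system.time(save(x, file = "dataset.Rda"))                       # rda: smaller on disk, slower to write
system.time(write.fst(x, "dataset_fast.fst", compress = 0))      # fastest write, largest file
system.time(write.fst(x, "dataset_small.fst", compress = 100))   # slowest write, smallest file

system.time(load("dataset.Rda"))                                 # reads everything
system.time(y <- read.fst("dataset_fast.fst"))                   # reads everything
system.time(z <- read.fst("dataset_fast.fst", c("Logicals", "Text")))  # reads only two columns

file.size("dataset.Rda") / 1e6                                   # sizes in MB
file.size("dataset_fast.fst") / 1e6
file.size("dataset_small.fst") / 1e6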
