I have 27 txt files, one for each Brazilian state. Each file is, in fact, a dataframe on the labor market, and together the 27 files add up to more than 20 GB. The first thing I did to reduce this was to save each file in Rda format; with that, for example, 20 GB can be reduced to approximately 3 GB. This is very good, but the big problem is that I often don't need to read all the variables in the dataframe (approximately 40 variables in total). For example, in the txt case, I can use the fread function to read only 3 variables:
fread("data.txt", select = c("var1","var2","var3") )
Unfortunately, I couldn't find an equivalent for the Rda case, so I decided to create a function that lets me read just a few columns. Let's take one of the 27 files as an example: RJ.txt. The idea is to split this dataframe by columns, save each column in Rda format, and keep everything in a folder. I created a function to do this:
df <- fread ( "RJ.txt") # Leio o arquivo original
arquivo_pasta<- "C:/Meu diretorio/pastaRJ" # Esta é a minha pasta onde vou guardar todas as variáveis.
# Esta é a minha função para salvar
save2<- function(df , arquivo_pasta )
{
dfl <- as.list(df) # nossa matrix agora é uma lista
remove(df)
setwd(arquivo_pasta)
for( i in 1:length(dfl))
{
v <- dfl[[i]]
save( v , file = paste0( names(dfl)[i], ".Rda" ) ) #salvamos
}
}
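With df and arquivo_pasta defined as above, the saving step is then a single call:

save2(df, arquivo_pasta)   # writes one .Rda file per column into pastaRJ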
This way, I have a folder with the 40 columns of RJ.txt, each in Rda format. Now I create a function to read back only a few columns:
read2 <- function(arquivo_pasta, colunas)
{
  setwd(arquivo_pasta)
  # Build a small table with an auxiliary variable so we can select only the variables we want
  path <- list.files(path = arquivo_pasta, all.files = TRUE, full.names = TRUE)
  path <- as.data.frame(path)
  # Create the auxiliary variable holding only the variable name
  path$aux <- gsub(arquivo_pasta, "", path$path)
  path$aux <- gsub("/", "", path$aux)
  path$aux <- gsub(".Rda", "", path$aux, fixed = TRUE)
  # Finally, keep only the requested columns
  path <- subset(path, aux %in% colunas)
  # Create an auxiliary variable to start stacking the columns
  df_ret <- 1
  for (i in 1:nrow(path))
  {
    load(as.character(path$path[i]))   # loads the saved column as `v`
    dfaux <- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret <- cbind(df_ret, dfaux)
  }
  # Drop the auxiliary variable
  df_ret <- df_ret[, df_ret := NULL]
  return(df_ret)
}
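For comparison, here is a sketch of the same per-column idea using saveRDS()/readRDS() from base R; readRDS() returns the saved value directly, so no auxiliary path table or load() name indirection is needed. The column names in the usage line are hypothetical.

library(data.table)

# Sketch: save each column to its own .rds file (same idea as save2 above)
save2_rds <- function(df, arquivo_pasta) {
  for (coluna in names(df)) {
    saveRDS(df[[coluna]], file.path(arquivo_pasta, paste0(coluna, ".rds")))
  }
}

# Sketch: read back only the requested columns as a data.table
read2_rds <- function(arquivo_pasta, colunas) {
  vetores <- lapply(colunas, function(coluna) {
    readRDS(file.path(arquivo_pasta, paste0(coluna, ".rds")))
  })
  names(vetores) <- colunas
  as.data.table(vetores)
}

# Hypothetical usage:
# df_rj <- read2_rds("C:/Meu diretorio/pastaRJ", c("var1", "var2", "var3"))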
As you can imagine, I am doing this because I want to get rid of all the txt files. The problem is that I want to do this a little more efficiently and, above all, faster. Do you have any idea how to improve this, especially in terms of execution time?
Improving code execution time in R is not a trivial task; every case is different. In general, it is recommended to avoid loops. In your case, besides the loops, there is disk access, which slows the code down. See whether the tips in this link help you in any way. Another thing you can do is try packages that let you work with the data without loading it all into memory, as suggested in this post.
– Marcus Nunes
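As an illustration of the suggestions above (a sketch only; the linked post is not reproduced here, and the fst package is an assumption, not something named in the comment): fst stores a whole data frame in one compressed file and can read back a subset of columns, which avoids both the per-column loop and most of the repeated disk access. Column names are hypothetical.

library(data.table)
library(fst)   # assumes the fst package is installed

df <- fread("RJ.txt")
write_fst(df, "RJ.fst", compress = 50)   # one compressed file per state

# Later, read only the columns you need:
df_parcial <- read_fst("RJ.fst", columns = c("var1", "var2", "var3"), as.data.table = TRUE)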