I have 27 txt files, one for each Brazilian state. Each file is, in fact, a dataframe on the labor market, and together the 27 files add up to more than 20 GB. The first thing I did to reduce this was to save each file in Rda format; with that, for example, 20 GB can be reduced to approximately 3 GB. This is very good, but the big problem is that I often don't need to read all the variables in the dataframe (approximately 40 variables in total). For example, in the txt case, I can use the fread function to read only 3 variables:
fread("data.txt", select = c("var1","var2","var3") )
Unfortunately, I couldn't find an equivalent for the Rda case, so I decided to create a function that lets me read just a few columns. Let's take one of the 27 files as an example: RJ.txt. The idea is to split this dataframe by columns, save each column in Rda format, and keep everything in a folder. I created a function to do this:
df <- fread ( "RJ.txt") # Leio o arquivo original
arquivo_pasta<- "C:/Meu diretorio/pastaRJ" # Esta é a minha pasta onde vou guardar todas as variáveis.
# Esta é a minha função para salvar
save2<- function(df , arquivo_pasta )
{
dfl <- as.list(df) # nossa matrix agora é uma lista
remove(df)
setwd(arquivo_pasta)
for( i in 1:length(dfl))
{
v <- dfl[[i]]
save( v , file = paste0( names(dfl)[i], ".Rda" ) ) #salvamos
}
}
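With df and arquivo_pasta defined as above, the saving step is then a single call:

save2(df, arquivo_pasta)   # writes one .Rda file per column into pastaRJ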
This way, I have a folder with the 40 columns of RJ.txt, each in Rda format. Now I create a function to read back only a few columns:
read2 <- function(arquivo_pasta, colunas)
{
  setwd(arquivo_pasta)
  # Build a small table with an auxiliary variable so we can select only the variables we want
  path <- list.files(path = arquivo_pasta, all.files = TRUE, full.names = TRUE)
  path <- as.data.frame(path)
  # Create the auxiliary variable holding only the variable name
  path$aux <- gsub(arquivo_pasta, "", path$path)
  path$aux <- gsub("/", "", path$aux)
  path$aux <- gsub(".Rda", "", path$aux, fixed = TRUE)
  # Finally, keep only the requested columns
  path <- subset(path, aux %in% colunas)
  # Create an auxiliary variable to start stacking the columns
  df_ret <- 1
  for (i in 1:nrow(path))
  {
    load(as.character(path$path[i]))   # loads the saved column as `v`
    dfaux <- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret <- cbind(df_ret, dfaux)
  }
  # Drop the auxiliary variable
  df_ret <- df_ret[, df_ret := NULL]
  return(df_ret)
}
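For comparison, here is a sketch of the same per-column idea using saveRDS()/readRDS() from base R; readRDS() returns the saved value directly, so no auxiliary path table or load() name indirection is needed. The column names in the usage line are hypothetical.

library(data.table)

# Sketch: save each column to its own .rds file (same idea as save2 above)
save2_rds <- function(df, arquivo_pasta) {
  for (coluna in names(df)) {
    saveRDS(df[[coluna]], file.path(arquivo_pasta, paste0(coluna, ".rds")))
  }
}

# Sketch: read back only the requested columns as a data.table
read2_rds <- function(arquivo_pasta, colunas) {
  vetores <- lapply(colunas, function(coluna) {
    readRDS(file.path(arquivo_pasta, paste0(coluna, ".rds")))
  })
  names(vetores) <- colunas
  as.data.table(vetores)
}

# Hypothetical usage:
# df_rj <- read2_rds("C:/Meu diretorio/pastaRJ", c("var1", "var2", "var3"))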
As you can imagine, I am doing this because I want to get rid of all the txt files. The problem is that I want to do this a little more efficiently and, above all, faster. Do you have any idea how to improve this, especially in terms of execution time?
Improving code execution time in R is not a trivial task; every case is different. In general, it is recommended to avoid loops. In your case, besides the loops, there is disk access, which slows the code down. See whether the tips in this link help you in any way. Another thing you can do is try packages that let you work with the data without loading it all into memory, as suggested in this post.
– Marcus Nunes
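As an illustration of the suggestions above (a sketch only; the linked post is not reproduced here, and the fst package is an assumption, not something named in the comment): fst stores a whole data frame in one compressed file and can read back a subset of columns, which avoids both the per-column loop and most of the repeated disk access. Column names are hypothetical.

library(data.table)
library(fst)   # assumes the fst package is installed

df <- fread("RJ.txt")
write_fst(df, "RJ.fst", compress = 50)   # one compressed file per state

# Later, read only the columns you need:
df_parcial <- read_fst("RJ.fst", columns = c("var1", "var2", "var3"), as.data.table = TRUE)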