How to decode many columns of a R data frame

Asked

Viewed 871 times

6

I have a data frame with more than 300 columns that are categorical but are encoded as numerical. Each of these columns has its own "type", that is, it has its own encoding table. My problem is to create a new data frame with the decoded variables.

I have loaded the following data frames:

  • the main data frame called "data", which has 347 columns I want to decode.
  • an auxiliary data frame called "vars data" with: name (variable.name) and "type" (data.type) of all variables of the main df
  • an auxiliary data frame called "codes": with "type" (data.type), possible codes (value) for the respective "type" and meaning (content) of each code

I’m trying to use the LAPD to make it easier. What I’ve managed to do so far is:

# pego uma das variáveis do df principal 
variavel <- "abc"
# busco no df "dados_vars" qual é o tipo desta variável
tipo.variavel <- as.character(dados_vars[dados_vars$variable.name == variavel, "data.type"])
# filtro no df "codes" os códigos específicos que esta variável pode ter
codigos <- codes %>% filter(data.type==tipo.variável) %>% select(value,content)
# crio um novo data frame com esta variável decodificada
novos.dados <- mutate(dados, var1=factor(var1,label=codigos$content,levels=codigos$value))

Now, how do I apply this procedure to all main df columns?

  • If it’s just to turn your code into something that works for all variables, you can put it inside a for (i in 1:347) and define var1 <- paste0("var", i). You can change the for by a lapply and check if performance improves, too.

  • To facilitate the explanation of the problem I described the variables with the name "varX" but, in fact, each variable has a different name.

  • In this case, another possibility is you exchange var for colnames(dados)[i]. If I understood correctly, this would solve.

  • Yes, this goes in the direction of the solution given by @Rcoster and which in principle solves my problem.

2 answers

2

A solution using the base package:

dados <- data.frame(replicate(10, sample(1:3, 10, rep = T)))
dados_vars <- data.frame(variable.name = paste0('X', 1:10), data.type = sample(1:4, 10, rep = T))
codes <- data.frame(tipo = rep(1:4, each = 3), value = rep(1:3, 4), code = letters[1:12])

for (i in colnames(dados)) {
    tipo.atual <- dados_vars[dados_vars$variable.name == i, 'data.type']
    dados[, i] <- factor(dados[, i], levels = subset(codes, tipo == tipo.atual)$value, labels = subset(codes, tipo == tipo.atual)$code)
}
  • Unfortunately this solution did not work in my data frames. For example, if I do: i <- "hxadd" tipo.atual <- dados_vars[dados_vars$variable.name == i, 'data.type'] dados[, i] <- factor(dados[, i], levels = subset(codes, data.type == tipo.atual)$value, labels = subset(codes, data.type == tipo.atual)$code), gives the following error message: Error in Ops.factor(data.type, tipo.atual) : level sets of factors are different

  • I was able to solve the problem of the error "level sets factors are differents" using my initial code only instead of referring by variable name, I now reference by index i. So the solution of @Rcoster works, having the advantage of me manipulating some columns differently (for example, some columns, instead of searching for a code, should be discretized and I use the cut for this). However, how do I keep the original columns in df (getting this with the encoded and decoded columns)?

  • In R tu cannot apply Labels without modifying values (similar to what happens in SPSS). When applying the factor, your variable will always be transformed into an integer vector with a 'mask'. The only solution to this, I think, is to have 2 different data.frames: one with the same values and the other with the coded values.

  • I think I got what I was looking for. To create the variables encoded in new columns of the data frame (and thus keep the original columns) I changed the last assignment of for for dados[,paste0(i,".new")] <- .... I still have to test if this impacts the loop since the stop condition is based on the data columns. If it’s a problem, maybe I’ll actually create a new data frame with just the decoded variables.

0


I ended up adopting the solution below, following the tips given.

for (i in colnames(dados)){
    tipo.variavel <- as.character(dados_vars[dados_vars$variable.name == i, "data.type"])
    fatores.variavel <- subset(codes,toupper(data.type)==toupper(tipo.variavel), c("value","content"))
    dados[,paste0(i,".new")] <- factor(dados[,i],labels=fatores.variavel$content,levels=fatores.variavel$value)
}

Thanks for your help.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.