How to eliminate variables that have some NA values in R

Question

Asked 5 years, 3 months ago

I have a data frame with both text and numeric variables. Some of the variables have NA in a few observations, but not in all of them. For example, take the following data frame:

Cidade      Estado populacao idh   area
Salvador    BA     21212     3     NA   
Salvador    BA     21212     NA    23323 
Salvador    BA     21212     3     23323
Salvador    BA     21212     3     23323
Salvador    BA     21212     NA    23323

In this case I would need to drop the variables idh and area entirely. My real data has more than 2,000 variables, so I can't inspect them one by one. How can I solve this? Note that I want to exclude the variable (column), not the observation (row).

Just to be clear: is the goal to eliminate every column that has at least one observation equal to NA?

– Marcus Nunes

2020/04/22 at 17:34

2 answers

by Rui Barradas • **15,422** points · Answer 1 · 2020-04-22T18:37:47+00:00

One way to do it in base R is as follows: use sapply to apply the function anyNA to each column of the data frame, then keep only the columns for which it returns FALSE.

dados[, !sapply(dados, anyNA)]
#    Cidade Estado populacao
#1 Salvador     BA     21212
#2 Salvador     BA     21212
#3 Salvador     BA     21212
#4 Salvador     BA     21212
#5 Salvador     BA     21212
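For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's table first. The column types and the helper names tem_na and limpo are assumptions for illustration, not part of the original post. I also pass drop = FALSE, which is not in the one-liner above, so that the result stays a data frame even if only one column survives.

```r
# Rebuild the question's example table (column types are an assumption).
dados <- data.frame(
  Cidade    = rep("Salvador", 5),
  Estado    = rep("BA", 5),
  populacao = rep(21212, 5),
  idh       = c(3, NA, 3, 3, NA),
  area      = c(NA, 23323, 23323, 23323, 23323)
)

# Named logical vector: TRUE for each column with at least one NA.
tem_na <- sapply(dados, anyNA)

# Keep only the columns without NA; drop = FALSE keeps a data frame
# even when a single column is left.
limpo <- dados[, !tem_na, drop = FALSE]
names(limpo)
```

Here tem_na is TRUE only for idh and area, so limpo keeps Cidade, Estado and populacao.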

Here is a test of both solutions, this one and the one from user neves, with the microbenchmark package.

The test uses a table with about 2,000 columns. Since the example table in the question has 5 columns, and each cbind() of the table with itself doubles the number of columns,

log2(2000/5)
#[1] 8.643856

rounds up to nine iterations of cbind() to get more than 2,000 columns. I will take the opportunity to add more rows as well.

d2 <- dados
for(i in 1:10) d2 <- rbind(d2, d2)
for(i in 1:9) d2 <- cbind(d2, d2)
dim(d2)
#[1] 5120 2560
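The dimensions above follow directly from the doubling. As a quick sanity check (a sketch only; n_lin and n_col are names I made up for the 5 rows and 5 columns of the question's table):

```r
n_lin <- 5  # rows in the question's example
n_col <- 5  # columns in the question's example

# Doublings needed to exceed 2000 columns
iter_col <- ceiling(log2(2000 / n_col))
iter_col            # 9

# Each rbind()/cbind() of the table with itself doubles a dimension
linhas  <- n_lin * 2^10      # 10 rbind() iterations
colunas <- n_col * 2^iter_col
c(linhas, colunas)  # 5120 2560
```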

library(microbenchmark)

mb <- microbenchmark(
  colSums = d2[, colSums(is.na(d2)) == 0],
  anyNA = d2[, !sapply(d2, anyNA)]
)
print(mb, unit = "relative", order = "median")
#Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
#   anyNA 1.000000  1.00000  1.00000  1.00000  1.00000 1.000000   100  a 
# colSums 3.429795 11.07481 10.40991 10.90102 10.81671 6.761014   100   b


ggplot2::autoplot(mb)

by neves • **5,644** points · Answer 2 · 2020-04-22T18:09:17+00:00

Consider the following data set:

dados <- data.frame(x = c(1, 2, 3, 4, 5, 8), y = c(NA, 0, 0, 0, 2, NA), 
               w = c(88, 2, 3, 4, 5, 8), z = c(5, 2, 2, 9, NA, 18))

To eliminate the variables that contain NAs, you can use the following code:

dados[, colSums(is.na(dados)) == 0]
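To see what that line does step by step, here is a small sketch on the same data. The names na_por_coluna and resultado are only for illustration, and drop = FALSE is my addition so that a data frame is returned even if a single column remains.

```r
dados <- data.frame(x = c(1, 2, 3, 4, 5, 8), y = c(NA, 0, 0, 0, 2, NA),
                    w = c(88, 2, 3, 4, 5, 8), z = c(5, 2, 2, 9, NA, 18))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_por_coluna <- colSums(is.na(dados))
na_por_coluna
# x y w z
# 0 2 0 1

# Keep only the columns whose NA count is zero.
resultado <- dados[, na_por_coluna == 0, drop = FALSE]
names(resultado)
# [1] "x" "w"
```

Columns y and z are dropped because they contain two NAs and one NA, respectively.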