How to partially disregard NA in R operations with a historical data series?

Question

I have a set of rainfall data measured every hour and I need to sum these values for each day. For this, I am using the commands below:

library(dplyr)

# funs() is deprecated since dplyr 0.8.0; a named list of lambdas is the replacement
df %>%
  group_by(data) %>%
  summarise_all(list(Media = ~sum(., na.rm = TRUE)))

However, my df contains three kinds of days: i) days without NA, ii) days with some NAs, and iii) days with only NA, as in the example below:

With na.rm=TRUE, days that have only NA, such as 03/01/10, come back with a value of 0 after the sum. That is wrong, because I don’t know whether it actually rained, since I have no data. Conversely, na.rm=FALSE returns NA both for days that have only NA and for days that mix NAs with measurements, which is also bad.

In the output shown below, 02/01/2010 should show a value other than 0.0 and 03/01/2010 should show NA. When I compute the average with Media=mean the result is correct, but with Media=sum I can’t get it right.

data    "rain_mm_ToT"
01/01/2010  19.7
02/01/2010  0.0
03/01/2010  0.0
04/01/2010  0.5
05/01/2010  0.0
06/01/2010  0.0
07/01/2010  6.3
08/01/2010  1.9
09/01/2010  1.4
10/01/2010  0.0
11/01/2010  0.0

Since the series is very long, I can’t check by hand where the errors will be. So I would like to know whether there is a way to treat the NA values "partially", that is, to return NA only for days that have no measurements at all and compute the sum for days that have both NAs and measurements?

Thank you!

Can you please edit the question to include the output of dput(df) or, if the data set is too large, of dput(head(df, 20))?

– Rui Barradas

2020/01/21 at 17:20

1 answer

by Rui Barradas • **15,422** points · Answer 1 · 2020-01-21T17:34:59+00:00

This answer handles the case where at least one element of the vector being averaged is NA, by setting na.rm = TRUE only in that case. Contrary to what is written in the question, though, when all the elements are NA the value of the mean is NaN, not 0. This makes sense: once all the data are removed, we have the sum of zero elements divided by zero, the length of that empty vector, and 0/0 gives NaN.

library(dplyr)

df %>%
  group_by(data) %>%
  summarise_all(list(Media = ~mean(., na.rm = anyNA(.))))
## A tibble: 3 x 2
#   data  Media
#  <int>  <dbl>
#1     1   2.5 
#2     2   2.33
#3     3 NaN
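
As a quick check of the NaN in group 3, here is a minimal base R sketch. The vector x is made up for illustration and stands for a group in which nothing was measured. It also shows why the same na.rm setting behaves differently for sum, which returns 0 for an empty vector.

```r
# Hypothetical group with no measurements at all
x <- c(NA_real_, NA_real_)

kept     <- x[!is.na(x)]           # what na.rm = TRUE leaves behind: numeric(0)
mean_all <- mean(x, na.rm = TRUE)  # NaN, i.e. 0 / 0
sum_all  <- sum(x, na.rm = TRUE)   # 0, the sum of an empty vector
```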

This second way of computing group means in the presence of missing values treats the case where all values are NA separately and returns NA for it. For the reason explained above, though, I believe the first version is the more correct one.

df %>%
  group_by(data) %>%
  summarise_all(list(Media = ~mean(., na.rm = !all(is.na(.)))))
## A tibble: 3 x 2
#   data Media
#  <int> <dbl>
#1     1  2.5 
#2     2  2.33
#3     3 NA

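Test data. Note that data.frame(), with its default check.names = TRUE, renames the column `chuva(mm)` to chuva.mm.

df <- data.frame(data = rep(1:3, each = 4),
                 `chuva(mm)` = c(1:4, c(1,2,NA,4), rep(NA, 4)))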
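
Since the question asks about sum rather than mean, here is a sketch of how the same idea could be applied to a sum. It is only one possible approach. The data frame df_example, the column rain_mm and the helper sum_or_na are names assumed for illustration and do not come from the question. The helper returns NA when a group has no measurements and otherwise sums the observed values.

```r
library(dplyr)

# Hypothetical data: group 1 is complete, group 2 has one NA, group 3 is all NA
df_example <- data.frame(
  data    = rep(1:3, each = 4),
  rain_mm = c(1:4, c(1, 2, NA, 4), rep(NA, 4))
)

# Hypothetical helper: NA if nothing was measured, otherwise the sum of
# the values that were observed
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

daily <- df_example %>%
  group_by(data) %>%
  summarise(rain_mm_total = sum_or_na(rain_mm))
```

With this data, daily holds 10 for group 1, 7 for group 2 and NA for group 3. Writing sum(x, na.rm = !all(is.na(x))) should give the same result, because sum returns NA when na.rm is FALSE and the vector contains NA.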