How to group information into a data frame from missing data?

Question

Asked 7 years, 4 months ago

Viewed 587 times

3

I need to remove empty rows from a data frame holding a 30-year time series, with three daily measurements for each variable. I have already used subset(x, ...), which solves part of the problem. However, in some cases no measurement was recorded at all, as in the "prec" column for the date "1961-08-21". In that case, I need to keep one row showing that no measurement was made that day, i.e., the value should stay NA. How can I do this?

date        id      prec    tair    tw      tmax    tmin
1961-08-21  83377   NA      22.6    14.1    27.9    NA
1961-08-21  83377   NA      23.8    15.2    NA      13.8
1961-08-21  83377   NA      24.2    15.4    NA      NA
1961-08-22  83377   NA      22.6    14.1    29.7    NA
1961-08-22  83377   0       24.8    14.6    NA      13.9
1961-08-22  83377   NA      27      16      NA      NA
1961-08-23  83377   NA      24.6    14      28.8    NA
1961-08-23  83377   1       19.8    14.6    NA      13.8
1961-08-23  83377   2       18.8    14.7    NA      13.6

I don’t understand the problem. Do you need to keep only one row for the day "1961-08-21" instead of three? If so, what should be done with the other columns? Can you give an example of the output expected for this data?

– Rui Barradas

2018/03/27 at 14:38
I need a continuous time series, with no duplicate or missing dates. If I apply subset(x, ...) to the prec column, for example, I lose that day's information for this variable, while for the other columns I would have the average of the three measurements (which I also could not automate). So my dataset would be different for each column.

– Andreia Almeida

2018/03/27 at 15:21

1 answer

by Marcus Nunes • **17,915** points · Answer 1 · 2018-03-27T15:28:26+00:00

You can solve this problem with the dplyr package:

dados <- structure(list(date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
 .Label = c("1961-08-21", "1961-08-22", "1961-08-23"), class = "factor"), 
 id = c(83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L), 
 prec = c(NA, NA, NA, NA, 0L, NA, NA, 1L, 2L), 
 tair = c(22.6, 23.8, 24.2, 22.6, 24.8, 27, 24.6, 19.8, 18.8), 
 tw = c(14.1, 15.2, 15.4, 14.1, 14.6, 16, 14, 14.6, 14.7), 
 tmax = c(27.9, NA, NA, 29.7, NA, NA, 28.8, NA, NA), 
 tmin = c(NA, 13.8, NA, NA, 13.9, NA, NA, 13.8, 13.6)), 
 .Names = c("date", "id", "prec", "tair", "tw", "tmax", "tmin"), 
 class = "data.frame", 
 row.names = c(NA, -9L))

library(dplyr)

dados %>%
  group_by(date) %>%
  # mean of every non-grouping column, ignoring NAs; "_Media" is appended to each name
  summarise_all(list(Media = ~ mean(., na.rm = TRUE)))
# A tibble: 3 x 7
  date       id_Media prec_Media tair_Media tw_Media tmax_Media tmin_Media
  <fct>         <dbl>      <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
1 1961-08-21   83377.     NaN          23.5     14.9       27.9       13.8
2 1961-08-22   83377.       0.         24.8     14.9       29.7       13.9
3 1961-08-23   83377.       1.50       21.1     14.4       28.8       13.7

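Basically, I grouped the data by date and calculated the mean of each of the other columns. The NaN in prec_Media for 1961-08-21 appears because all three values for that day are NA, so nothing is left to average once they are removed. Note that I also calculated the mean of id, but since I assume the id is the same within each date, it makes no difference whether this mean is calculated or not.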
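
Since the question asks for days without any measurement to stay NA, the same grouping idea can be adapted so that an all-missing group returns NA instead of NaN. The following is only a minimal sketch, assuming dplyr 1.0.0 or later (for across() and the .groups argument). The helper mean_or_na and the reduced data frame dados_demo are hypothetical names introduced for illustration; they are not part of dplyr or of the original answer.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): mean that ignores NAs, but
# returns NA instead of NaN when every value in the group is missing.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

# Small stand-in for the data in the question (two days, three readings each).
dados_demo <- data.frame(
  date = rep(c("1961-08-21", "1961-08-22"), each = 3),
  id   = 83377L,
  prec = c(NA, NA, NA, NA, 0, NA),
  tair = c(22.6, 23.8, 24.2, 22.6, 24.8, 27)
)

resumo <- dados_demo %>%
  group_by(date, id) %>%   # grouping by id as well keeps it out of the averages
  summarise(across(everything(), mean_or_na), .groups = "drop")

resumo
```

With this variant there is one row per date, prec for 1961-08-21 comes back as NA rather than NaN, and id is carried through as a grouping column instead of being averaged. Grouping by id in addition to date assumes that each station should get its own daily row.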