How to group rows of a data frame that contains missing data?

I need to remove empty rows from a data frame holding a 30-year time series, with three daily measurements for each variable. I have already used the function subset(x, ...), which solves part of the problem. However, in some cases no measurement was recorded at all, as in the "prec" column for the date "1961-08-21". In that case, I need to keep a row indicating that no measurement was made that day, i.e., the value should remain NA. How can I do this?

date        id      prec    tair    tw      tmax    tmin
1961-08-21  83377   NA      22.6    14.1    27.9    NA
1961-08-21  83377   NA      23.8    15.2    NA      13.8
1961-08-21  83377   NA      24.2    15.4    NA      NA
1961-08-22  83377   NA      22.6    14.1    29.7    NA
1961-08-22  83377   0       24.8    14.6    NA      13.9
1961-08-22  83377   NA      27      16      NA      NA
1961-08-23  83377   NA      24.6    14      28.8    NA
1961-08-23  83377   1       19.8    14.6    NA      13.8
1961-08-23  83377   2       18.8    14.7    NA      13.6
  • I don’t understand the problem. Do you need to keep only one row for the day "1961-08-21" instead of three? If so, what should be done with the other columns? Can you give an example of the output corresponding to this data?

  • I need a continuous time series, with no duplicate or missing dates. If I apply subset(x, ...) to the prec column, for example, I lose that day's information for this variable, while for the other columns I would have the mean of the three measurements (which I also could not automate). So my dataset would be different for each column.

1 answer


You can solve this problem with the dplyr package:

dados <- structure(list(date = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
 .Label = c("1961-08-21", "1961-08-22", "1961-08-23"), class = "factor"), 
 id = c(83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L, 83377L), 
 prec = c(NA, NA, NA, NA, 0L, NA, NA, 1L, 2L), 
 tair = c(22.6, 23.8, 24.2, 22.6, 24.8, 27, 24.6, 19.8, 18.8), 
 tw = c(14.1, 15.2, 15.4, 14.1, 14.6, 16, 14, 14.6, 14.7), 
 tmax = c(27.9, NA, NA, 29.7, NA, NA, 28.8, NA, NA), 
 tmin = c(NA, 13.8, NA, NA, 13.9, NA, NA, 13.8, 13.6)), 
 .Names = c("date", "id", "prec", "tair", "tw", "tmax", "tmin"), 
 class = "data.frame", 
 row.names = c(NA, -9L))

library(dplyr)

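# Group the three daily readings by date, then take the mean of every
# other column, ignoring missing values.
dados %>%
  group_by(date) %>%
  summarise_all(funs(Media = mean(., na.rm = TRUE)))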
# A tibble: 3 x 7
  date       id_Media prec_Media tair_Media tw_Media tmax_Media tmin_Media
  <fct>         <dbl>      <dbl>      <dbl>    <dbl>      <dbl>      <dbl>
1 1961-08-21   83377.     NaN          23.5     14.9       27.9       13.8
2 1961-08-22   83377.       0.         24.8     14.9       29.7       13.9
3 1961-08-23   83377.       1.50       21.1     14.4       28.8       13.7      

Basically, I grouped the data by date and calculated the mean of each of the other columns. Note that I also calculated the mean of id, but since I assume the id is the same for every row of a given date, it makes no difference whether this mean is calculated or not.
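
One detail in the output above: prec for 1961-08-21 is NaN rather than NA, because mean() with na.rm = TRUE is left with no values when all three readings are missing. If the day should stay NA, as the question asks, something along these lines may help. It is only a sketch, assuming dplyr 1.0.0 or later (where across() is available and funs() is deprecated); mean_or_na and diario are names made up for this example, and grouping by id as well as date is my own choice so that id is carried through instead of averaged.

```r
library(dplyr)

# Same sample data as above, rebuilt with data.frame() for brevity.
dados <- data.frame(
  date = rep(c("1961-08-21", "1961-08-22", "1961-08-23"), each = 3),
  id   = 83377L,
  prec = c(NA, NA, NA, NA, 0, NA, NA, 1, 2),
  tair = c(22.6, 23.8, 24.2, 22.6, 24.8, 27, 24.6, 19.8, 18.8),
  tw   = c(14.1, 15.2, 15.4, 14.1, 14.6, 16, 14, 14.6, 14.7),
  tmax = c(27.9, NA, NA, 29.7, NA, NA, 28.8, NA, NA),
  tmin = c(NA, 13.8, NA, NA, 13.9, NA, NA, 13.8, 13.6)
)

# Hypothetical helper: mean of the non-missing values,
# or NA when every value in the group is missing.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

# One row per date (and station id); across() applies the helper
# to every column that is not a grouping column.
diario <- dados %>%
  group_by(date, id) %>%
  summarise(across(everything(), mean_or_na), .groups = "drop")

print(diario)
```

With this, a day without any reading keeps NA in that column, while the other columns are still averaged over whatever readings exist.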

  • Thank you so much! It worked. If instead of the mean I need the sum or a specific value among the three, for example the second value recorded for the date 1961-08-21, can I also use the dplyr package?

  • Yes. Just replace mean with the function appropriate to each case. I'm glad my answer helped. Please consider voting for and accepting the answer, so that people who run into the same problem in the future have a reference for solving it.
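
To illustrate the comment above, here is a rough sketch that uses a different function per column: the daily sum of prec and the second reading of tair, picked with dplyr::nth(). The output names prec_sum, tair_second and resumo are invented for the example. One caveat that is my own addition: sum(..., na.rm = TRUE) returns 0 when every value is missing, which would look like a dry day, so the sketch returns NA in that case.

```r
library(dplyr)

# Same sample data as in the answer (only the columns used here).
dados <- data.frame(
  date = rep(c("1961-08-21", "1961-08-22", "1961-08-23"), each = 3),
  prec = c(NA, NA, NA, NA, 0, NA, NA, 1, 2),
  tair = c(22.6, 23.8, 24.2, 22.6, 24.8, 27, 24.6, 19.8, 18.8)
)

resumo <- dados %>%
  group_by(date) %>%
  summarise(
    # Daily total, kept as NA when no reading exists for that day.
    prec_sum    = if (all(is.na(prec))) NA_real_ else sum(prec, na.rm = TRUE),
    # Second of the three daily readings, in row order.
    tair_second = nth(tair, 2)
  )

print(resumo)
```

Note that nth() relies on the rows being in measurement order within each date, so it may be worth sorting the data first if that order is not guaranteed.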
