The "date" format disappears after using the "group_by" function followed by "summarise (Mean())" in R

Asked

Viewed 36 times

1

I am working with the following global temperature database:

https://drive.google.com/open?id=1nSwP3Y0V7gncbnG_DccNhrTRxmUNqMqa

I import the data with the function import() package rio and engrave on the object df.

df<-rio::import("TemperaturasGlobais.csv")

head(df)

 dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E 

However, the column dt (date) comes in format character.

str(df)

'data.frame':   8599212 obs. of  7 variables:
 $ dt                           : chr  "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
 $ AverageTemperature           : num  6.07 NA NA NA NA ...
 $ AverageTemperatureUncertainty: num  1.74 NA NA NA NA ...
 $ City                         : chr  "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
 $ Country                      : chr  "Denmark" "Denmark" "Denmark" "Denmark" ...
 $ Latitude                     : chr  "57.05N" "57.05N" "57.05N" "57.05N" ...
 $ Longitude                    : chr  "10.33E" "10.33E" "10.33E" "10.33E" ...

So I apply the function ymd() of the lubridate to convert it to the format date and engrave on the object df2.

df2<-df %>% 
  mutate(dt=ymd(dt))

head(df2)

dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E

Confiro, and I see that it worked. The column "dt" is now in the format "date"

str(df2)

'data.frame':   8599212 obs. of  7 variables:
 $ dt                           : Date, format: "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
 $ AverageTemperature           : num  6.07 NA NA NA NA ...
 $ AverageTemperatureUncertainty: num  1.74 NA NA NA NA ...
 $ City                         : chr  "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
 $ Country                      : chr  "Denmark" "Denmark" "Denmark" "Denmark" ...
 $ Latitude                     : chr  "57.05N" "57.05N" "57.05N" "57.05N" ...
 $ Longitude                    : chr  "10.33E" "10.33E" "10.33E" "10.33E" ...

The problem comes now: I do the grouping (group_by) PER YEAR, filter for country only Brazil, and request the annual average summarise (mean()) with the removal of missing values (na.rm = T).

df3<-df2 %>% 
  group_by(ano=year(dt)) %>% 
  filter(Country=="Brazil") %>% 
  summarise(media.anual=mean(AverageTemperature, na.rm = T))

The output is a Tibble whose column dt is no longer in format date.

# A tibble: 190 x 2
     ano media.anual
   <dbl>       <dbl>
 1  1824        26.5
 2  1825        26.5
 3  1826        26.4
 4  1827        26.7
 5  1828        26.1
 6  1829        26.0
 7  1830       NaN  
 8  1831       NaN  
 9  1832        20.5
10  1833        21.4
# ... with 180 more rows


str(df3)

tibble [190 x 2] (S3: tbl_df/tbl/data.frame)
 $ ano        : num [1:190] 1824 1825 1826 1827 1828 ...
 $ media.anual: num [1:190] 26.5 26.5 26.4 26.7 26.1 ...

Hence, there are 3 my doubts:

  1. Why after using the function of group_by+summarise(mean()) the result undoes the formatting of date that I had previously achieved?
  2. How do I make this Tibble stay in format date?
  3. A curiosity: why the missing values appear in Tibble df3 as NaN and not as NA? What does that mean NaN?

1 answer

1


  1. By definition, the function year package lubridate returns a numeric value. This is the reason for the column dt was double and not a date.
x <- ymd("2012-03-26")
    
is.numeric(year(x))
## [1] TRUE    

is.Date(year(x))
## [1] FALSE
  1. In the R, a date is, in the vast majority of times, composed by year, month, day, hour, minute and second. In other words, the output here is to create a fictional day and month in order to have a date in the program. My favorite way is to reference the value of the year for January 1st:
ano <- c(2003, 2010, 2015)

data <- ymd(paste(ano, 1, 1, sep = "-"))

is.Date(data)
## [1] TRUE

In your specific case, the code

df3<-df2 %>% 
  group_by(ano = year(dt)) %>% 
  filter(Country == "Brazil") %>% 
  summarise(media.anual = mean(AverageTemperature, na.rm = T))
  mutate(ano = ymd(paste(anos, 1, 1, sep = "-")))

should work.

  1. NaN is the acronym for Not a Number. This is because the R could not represent computationally the result of the requested operation. The most classic case of this is trying to divide 0 by 0:
0/0
## [1] NaN

The best way to try to understand why this is happening with your data is to look directly at it, without doing any operation. Below I list all values of AverageTemperature for Brazil in 1830:

df2 %>% 
    group_by(ano=year(dt)) %>% 
    filter(Country=="Brazil") %>% 
    filter(ano == 1830) %>%
    print(n = Inf)

## # A tibble: 48 x 8
## # Groups:   ano [1]
##    dt         AverageTemperat… AverageTemperat… City  Country Latitude Longitude
##    <date>                <dbl>            <dbl> <chr> <chr>   <chr>    <chr>    
##  1 1830-01-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  2 1830-02-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  3 1830-03-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  4 1830-04-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  5 1830-05-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  6 1830-06-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  7 1830-07-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  8 1830-08-01               NA               NA Boa … Brazil  2.41N    60.27W   
##  9 1830-09-01               NA               NA Boa … Brazil  2.41N    60.27W   
## 10 1830-10-01               NA               NA Boa … Brazil  2.41N    60.27W   
## 11 1830-11-01               NA               NA Boa … Brazil  2.41N    60.27W   
## 12 1830-12-01               NA               NA Boa … Brazil  2.41N    60.27W   
## 13 1830-01-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 14 1830-02-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 15 1830-03-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 16 1830-04-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 17 1830-05-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 18 1830-06-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 19 1830-07-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 20 1830-08-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 21 1830-09-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 22 1830-10-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 23 1830-11-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 24 1830-12-01               NA               NA Maca… Brazil  0.80N    50.63W   
## 25 1830-01-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 26 1830-02-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 27 1830-03-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 28 1830-04-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 29 1830-05-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 30 1830-06-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 31 1830-07-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 32 1830-08-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 33 1830-09-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 34 1830-10-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 35 1830-11-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 36 1830-12-01               NA               NA Mana… Brazil  2.41S    60.27W   
## 37 1830-01-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 38 1830-02-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 39 1830-03-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 40 1830-04-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 41 1830-05-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 42 1830-06-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 43 1830-07-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 44 1830-08-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 45 1830-09-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 46 1830-10-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 47 1830-11-01               NA               NA Sant… Brazil  2.41S    55.45W   
## 48 1830-12-01               NA               NA Sant… Brazil  2.41S    55.45W   
## # … with 1 more variable: ano <dbl>

Note that there is no recorded temperature value for Brazil in 1830. If you are going to calculate the average of these values, removing anything NA, no observation will be left. Therefore, an indetermination is generated, making it impossible to calculate the average Brazil in this specific year. You can check that the same goes for 1831.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.