I am working with the following global temperature database:
https://drive.google.com/open?id=1nSwP3Y0V7gncbnG_DccNhrTRxmUNqMqa
I import the data with the import() function from the rio package and save it to the object df.
df<-rio::import("TemperaturasGlobais.csv")
head(df)
 dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E 
However, the dt (date) column is imported as character.
str(df)
'data.frame':   8599212 obs. of  7 variables:
 $ dt                           : chr  "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
 $ AverageTemperature           : num  6.07 NA NA NA NA ...
 $ AverageTemperatureUncertainty: num  1.74 NA NA NA NA ...
 $ City                         : chr  "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
 $ Country                      : chr  "Denmark" "Denmark" "Denmark" "Denmark" ...
 $ Latitude                     : chr  "57.05N" "57.05N" "57.05N" "57.05N" ...
 $ Longitude                    : chr  "10.33E" "10.33E" "10.33E" "10.33E" ...
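So I apply the ymd() function from the lubridate package to convert it to Date format and save the result to the object df2.
library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides ymd() and year()
df2<-df %>% 
  mutate(dt=ymd(dt))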
head(df2)
dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E
I check, and I see that it worked. The "dt" column is now in "Date" format.
str(df2)
'data.frame':   8599212 obs. of  7 variables:
 $ dt                           : Date, format: "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
 $ AverageTemperature           : num  6.07 NA NA NA NA ...
 $ AverageTemperatureUncertainty: num  1.74 NA NA NA NA ...
 $ City                         : chr  "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
 $ Country                      : chr  "Denmark" "Denmark" "Denmark" "Denmark" ...
 $ Latitude                     : chr  "57.05N" "57.05N" "57.05N" "57.05N" ...
 $ Longitude                    : chr  "10.33E" "10.33E" "10.33E" "10.33E" ...
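For anyone who does not want to download the full file, the conversion step can be reproduced on a few values. This is only a minimal sketch: the vector names dates_chr and dates_ymd are made up for illustration, and the three dates are copied from the head() output above.

```r
library(lubridate)

# Three dates copied from the head() output, stored as character
dates_chr <- c("1743-11-01", "1743-12-01", "1744-01-01")

# ymd() parses year-month-day strings into Date objects
dates_ymd <- ymd(dates_chr)

class(dates_chr)   # "character"
class(dates_ymd)   # "Date"

# Base R's as.Date() parses this ISO layout to the same dates
all(dates_ymd == as.Date(dates_chr))   # TRUE
```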
Now comes the problem: I group the data by year (group_by), filter for Brazil only, and compute the annual average with summarise(mean()), removing missing values (na.rm = T).
df3<-df2 %>% 
  group_by(ano=year(dt)) %>% 
  filter(Country=="Brazil") %>% 
  summarise(media.anual=mean(AverageTemperature, na.rm = T))
The output is a tibble whose date column (now ano) is no longer in Date format.
# A tibble: 190 x 2
     ano media.anual
   <dbl>       <dbl>
 1  1824        26.5
 2  1825        26.5
 3  1826        26.4
 4  1827        26.7
 5  1828        26.1
 6  1829        26.0
 7  1830       NaN  
 8  1831       NaN  
 9  1832        20.5
10  1833        21.4
# ... with 180 more rows
str(df3)
tibble [190 x 2] (S3: tbl_df/tbl/data.frame)
 $ ano        : num [1:190] 1824 1825 1826 1827 1828 ...
 $ media.anual: num [1:190] 26.5 26.5 26.4 26.7 26.1 ...
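The same behavior seems to show up on a toy data frame, so it does not depend on the full dataset. The sketch below uses made-up temperatures (not values from TemperaturasGlobais.csv), and the names toy, by_number and by_date are hypothetical. It also compares year() with floor_date(), which I am only assuming is a reasonable way to keep a Date column, not necessarily the recommended one.

```r
library(dplyr)
library(lubridate)

# Toy data: hypothetical values, not taken from the real file
toy <- data.frame(
  dt = ymd(c("1824-01-01", "1824-02-01", "1825-01-01", "1825-02-01")),
  AverageTemperature = c(26.0, 27.0, NA, NA)
)

# year() extracts the year as a plain number, so the grouping key is numeric
by_number <- toy %>%
  group_by(ano = year(dt)) %>%
  summarise(media.anual = mean(AverageTemperature, na.rm = TRUE))
class(by_number$ano)   # "numeric"

# floor_date() rounds each date down to January 1st of its year
# and keeps the Date class, so the grouping key stays a Date
by_date <- toy %>%
  group_by(ano = floor_date(dt, "year")) %>%
  summarise(media.anual = mean(AverageTemperature, na.rm = TRUE))
class(by_date$ano)     # "Date"
```

In this sketch the dt column itself is never changed by summarise(); it is simply replaced by whatever the grouping expression returns.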
So I have three questions:
- Why does the result of group_by + summarise(mean()) undo the Date formatting that I had previously achieved?
- How do I make this tibble keep the Date format?
- A curiosity: why do the missing values appear in the tibble df3 as NaN and not as NA? What does NaN mean?
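Regarding the last point, the NaN can be reproduced in base R without the dataset. This is a minimal sketch, assuming that the years 1830 and 1831 contain only NA temperatures for Brazil (I have not verified this against the file); the vector name only_na is made up for illustration.

```r
# A year in which every monthly value is missing (assumed situation)
only_na <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE all values are dropped, so mean() averages an empty vector
result <- mean(only_na, na.rm = TRUE)
result             # NaN

# The same happens for an explicitly empty numeric vector
mean(numeric(0))   # NaN

# Without na.rm the result is NA instead
mean(only_na)      # NA

# NaN counts as missing for is.na(), but NA is not NaN
is.na(result)      # TRUE
is.nan(NA)         # FALSE
```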