I am working with the following global temperature database:
https://drive.google.com/open?id=1nSwP3Y0V7gncbnG_DccNhrTRxmUNqMqa
I import the data with the import() function from the rio package and store it in the object df.
df<-rio::import("TemperaturasGlobais.csv")
head(df)
          dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E
However, the column dt (date) is imported as character.
str(df)
'data.frame': 8599212 obs. of 7 variables:
$ dt : chr "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
$ AverageTemperature : num 6.07 NA NA NA NA ...
$ AverageTemperatureUncertainty: num 1.74 NA NA NA NA ...
$ City : chr "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
$ Country : chr "Denmark" "Denmark" "Denmark" "Denmark" ...
$ Latitude : chr "57.05N" "57.05N" "57.05N" "57.05N" ...
$ Longitude : chr "10.33E" "10.33E" "10.33E" "10.33E" ...
So I apply the ymd() function from the lubridate package to convert it to the Date class and store the result in the object df2.
library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides ymd() and year()
df2 <- df %>%
  mutate(dt = ymd(dt))
head(df2)
          dt AverageTemperature AverageTemperatureUncertainty   City Country Latitude Longitude
1 1743-11-01              6.068                         1.737 Ã…rhus Denmark   57.05N    10.33E
2 1743-12-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
3 1744-01-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
4 1744-02-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
5 1744-03-01                 NA                            NA Ã…rhus Denmark   57.05N    10.33E
6 1744-04-01              5.788                         3.624 Ã…rhus Denmark   57.05N    10.33E
I check, and I see that it worked. The column dt is now of class Date.
str(df2)
'data.frame': 8599212 obs. of 7 variables:
$ dt : Date, format: "1743-11-01" "1743-12-01" "1744-01-01" "1744-02-01" ...
$ AverageTemperature : num 6.07 NA NA NA NA ...
$ AverageTemperatureUncertainty: num 1.74 NA NA NA NA ...
$ City : chr "Ã…rhus" "Ã…rhus" "Ã…rhus" "Ã…rhus" ...
$ Country : chr "Denmark" "Denmark" "Denmark" "Denmark" ...
$ Latitude : chr "57.05N" "57.05N" "57.05N" "57.05N" ...
$ Longitude : chr "10.33E" "10.33E" "10.33E" "10.33E" ...
Now comes the problem: I group the data by year (group_by), filter for Brazil only (Country == "Brazil"), and compute the annual mean with summarise(mean()), removing missing values (na.rm = T).
df3 <- df2 %>%
  group_by(ano = year(dt)) %>%
  filter(Country == "Brazil") %>%
  summarise(media.anual = mean(AverageTemperature, na.rm = T))
The output is a tibble in which the date information (now the column ano) is no longer in Date format.
# A tibble: 190 x 2
ano media.anual
<dbl> <dbl>
1 1824 26.5
2 1825 26.5
3 1826 26.4
4 1827 26.7
5 1828 26.1
6 1829 26.0
7 1830 NaN
8 1831 NaN
9 1832 20.5
10 1833 21.4
# ... with 180 more rows
str(df3)
tibble [190 x 2] (S3: tbl_df/tbl/data.frame)
$ ano : num [1:190] 1824 1825 1826 1827 1828 ...
$ media.anual: num [1:190] 26.5 26.5 26.4 26.7 26.1 ...
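The same behaviour can be reproduced without the full file. Below is a minimal sketch on made-up data; the object names toy and res and the temperature values are hypothetical and not taken from TemperaturasGlobais.csv. It assumes the dplyr and lubridate packages are installed.

```r
library(dplyr)
library(lubridate)

# Made-up data (not from the real dataset): 1830 has only missing values
toy <- data.frame(
  dt = ymd(c("1829-01-01", "1829-02-01", "1830-01-01", "1830-02-01")),
  AverageTemperature = c(26.1, 25.9, NA, NA),
  Country = "Brazil"
)

# Same steps as for df3: keep Brazil, group by year, take the annual mean
res <- toy %>%
  filter(Country == "Brazil") %>%
  group_by(ano = year(dt)) %>%
  summarise(media.anual = mean(AverageTemperature, na.rm = TRUE))

class(toy$dt)    # "Date"
class(res$ano)   # "numeric": the grouping column is no longer a Date
res$media.anual  # second element (1830, all values missing) is NaN
```

In this toy case the grouping column ano comes out as a plain number, and the year whose values are all missing yields NaN, which matches what I see in df3.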
Hence, my three questions:
- Why does the result of group_by + summarise(mean()) undo the Date formatting that I had previously achieved?
- How do I make this tibble keep the Date format?
- A curiosity: why do the missing values appear in the tibble df3 as NaN and not as NA? What does NaN mean?