How to make a double filter on a long-format dataframe

I have a data frame in long format with data from several countries. I would like to filter it to get the last available value of the variable total_tests for each country; missing values are represented by NA. The last date with data differs from country to country, so I ran into problems when combining filter(date == max(date)) with filter(!is.na(total_tests)).

Code:

library(tidycovid19) # Package from GitHub - https://github.com/joachim-gassen/tidycovid19
library(tidyverse)

updates <- download_merged_data(cached = TRUE)

updates %>%
    filter(date == max(date), !is.na(total_tests))

1 answer

I’d do it this way:

  1. Remove all rows with NA in total_tests

  2. Convert date to the Date type, so that an ordering can be established on that column

  3. Sort the data frame by country and by date, to make sure that all the observations of each country are together and in ascending order by date

  4. Group by country

  5. Apply the function tail with argument 1, to keep only the last row of each country's block of observations
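Before running this on the real data, the steps above can be sketched on a small made-up data frame. The toy tibble and its values are invented purely for illustration; only the column names country, date and total_tests mirror the question.

```r
library(dplyr)

# Made-up toy data: country B has no test count on its latest date
toy <- tibble(
  country     = c("A", "A", "A", "B", "B", "B"),
  date        = as.Date(c("2020-05-30", "2020-05-31", "2020-06-01",
                          "2020-05-30", "2020-05-31", "2020-06-01")),
  total_tests = c(10, 20, 30, 5, 8, NA)
)

last_tests <- toy %>%
  filter(!is.na(total_tests)) %>%  # step 1: drop rows without a test count
  # step 2 is skipped here because date is already of class Date
  arrange(country, date) %>%       # step 3: sort by country, then by date
  group_by(country) %>%            # step 4: one block per country
  do(tail(., 1)) %>%               # step 5: keep the last row of each block
  ungroup()

last_tests
```

Country A keeps its 2020-06-01 row, while country B falls back to 2020-05-31, the last date on which it has a value.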

The final code went like this:

library(tidycovid19)
library(tidyverse)
library(lubridate)

updates <- download_merged_data(cached = TRUE)

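updates %>%
  filter(!is.na(total_tests)) %>%  # drop rows without a test count
  mutate(date = ymd(date)) %>%     # make sure date is a Date, so it can be ordered
  arrange(country, date) %>%       # sort by country, then by ascending date
  group_by(country) %>%            # one block of rows per country
  do(tail(., 1))                   # keep only the last row of each block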
# A tibble: 83 x 35
# Groups:   country [83]
   iso3c country date       confirmed deaths recovered ecdc_cases ecdc_deaths
   <chr> <chr>   <date>         <dbl>  <dbl>     <dbl>      <dbl>       <dbl>
 1 ARG   Argent… 2020-06-01     17415    556      5521      16838         539
 2 AUS   Austra… 2020-05-31      7202    103      6618       7185         103
 3 AUT   Austria 2020-06-01     16733    668     15596      16642         668
 4 BHR   Bahrain 2020-06-01     11871     19      7070      11398          19
 5 BGD   Bangla… 2020-05-31     47153    650      9781      44608         610
 6 BLR   Belarus 2020-06-01     43403    240     18776      42556         235
 7 BEL   Belgium 2020-05-30     58186   9453     15769      58061        9443
 8 BOL   Bolivia 2020-05-31      9982    313       968       9592         310
 9 BRA   Brazil  2020-05-29    465166  27878    189476     438238       26754
10 BGR   Bulgar… 2020-06-01      2519    140      1090       2513         140
# … with 73 more rows, and 27 more variables: total_tests <dbl>,
#   tests_units <chr>, soc_dist <dbl>, mov_rest <dbl>, pub_health <dbl>,
#   gov_soc_econ <dbl>, lockdown <dbl>, apple_mtr_driving <dbl>,
#   apple_mtr_walking <dbl>, apple_mtr_transit <dbl>,
#   gcmr_retail_recreation <dbl>, gcmr_grocery_pharmacy <dbl>, gcmr_parks <dbl>,
#   gcmr_transit_stations <dbl>, gcmr_workplaces <dbl>, gcmr_residential <dbl>,
#   gtrends_score <dbl>, gtrends_country_score <int>, region <chr>,
#   income <chr>, population <dbl>, land_area_skm <dbl>, pop_density <dbl>,
#   pop_largest_city <dbl>, life_expectancy <dbl>, gdp_capita <dbl>,
#   timestamp <dttm>
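
As a side note on why the attempt in the question comes up short: without group_by(), max(date) is computed over the whole data frame, so only countries that happen to have a value on the overall latest date survive. Grouping first makes max(date) a per-country maximum, which also removes the need to sort. Below is a minimal sketch of that variant on made-up toy data (the values are invented for illustration); the slice_max() version assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up toy data: country B has no test count on its latest date
toy <- tibble(
  country     = c("A", "A", "A", "B", "B", "B"),
  date        = as.Date(c("2020-05-30", "2020-05-31", "2020-06-01",
                          "2020-05-30", "2020-05-31", "2020-06-01")),
  total_tests = c(10, 20, 30, 5, 8, NA)
)

# Ungrouped: max(date) is the global maximum, so country B is lost
ungrouped <- toy %>%
  filter(date == max(date), !is.na(total_tests))

# Grouped: drop the NAs first, then take each country's own latest date
via_filter <- toy %>%
  filter(!is.na(total_tests)) %>%
  group_by(country) %>%
  filter(date == max(date)) %>%
  ungroup()

# Same idea with slice_max(), keeping exactly one row per country
via_slice <- toy %>%
  filter(!is.na(total_tests)) %>%
  group_by(country) %>%
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

On this toy data, ungrouped contains only country A, whereas via_filter and via_slice each contain one row per country.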
