Minor Date in a Dataset

Asked

Viewed 158 times

6

I have a data set and would like to select only the smallest date among each primary key (column MATRICULA). Follow the example of my DF:

MATRICULA <- c(1,1,3,3,3,4,5,5,5,5,6)
DATA <- c('15/01/2018', '10/12/2017', '20/11/2017', '01/01/2015', 
  '25/10/2018', '02/07/2016', '03/12/2016','17/08/2017', '22/03/2018', 
  '12/06/2018', '13/04/2014')
DADOS <- data.frame(MATRICULA, DATA)

I already use the function abv_data = c(as.Date(DADOS$DATA,"%d/%m/%Y")) to transform the date format.

Therefore, I would like the result to appear only at the earliest date from the column MATRICULA. The result I expect should be:

MATRICULA <- c(1,3,4,5,6)
DATA <- c('10/12/2017', '01/01/2015', '02/07/2016', '03/12/2016', '13/04/2014')
DADOS <- data.frame(MATRICULA,DATA)

3 answers

6


This problem is very easy to solve using the package dplyr. The first thing to do is turn the column DATA in date, so that the R can establish an order relation for it. I’ll just copy your original code, which was correct, and overwrite the column DATA inside DADOS:

DADOS$DATA = c(as.Date(DADOS$DATA,"%d/%m/%Y"))

With this done, the package dplyr will serve to group your data by MATRICULA and then look for the lowest value of DATA for each of them:

DADOS %>% 
  group_by(MATRICULA) %>% 
  slice(which.min(DATA))
# A tibble: 5 x 2
# Groups:   MATRICULA [5]
  MATRICULA DATA      
      <dbl> <date>    
1         1 2017-12-10
2         3 2015-01-01
3         4 2016-07-02
4         5 2016-12-03
5         6 2014-04-13

That’s it. That’s the result. It will serve for any number of registrations you have in your dataset, and it won’t matter how many dates there are within each of these registrations.

6

Two ways with R base.

With aggregate.

aggregate(abv_data ~ MATRICULA, DADOS, min)
#  MATRICULA   abv_data
#1         1 2017-12-10
#2         3 2015-01-01
#3         4 2016-07-02
#4         5 2016-12-03
#5         6 2014-04-13

With tapply.

as.Date(tapply(abv_data, DADOS$MATRICULA, FUN = min), origin = "1970-01-01")
#           1            3            4            5            6 
#"2017-12-10" "2015-01-01" "2016-07-02" "2016-12-03" "2014-04-13"

5

Another solution with dplyr is using filter instead of slice.

library(tidyverse)
library(lubridate)

DADOS %>% 
  mutate(DATA = dmy(as.character(DATA))) %>% 
  group_by(MATRICULA) %>% 
  filter(DATA == min(DATA))

# A tibble: 5 x 2
# Groups:   MATRICULA [5]
  MATRICULA DATA      
      <dbl> <date>    
1         1 2017-12-10
2         3 2015-01-01
3         4 2016-07-02
4         5 2016-12-03
5         6 2014-04-13
  • It worked better that way. As I have a DF of millions of lines, by Aggregate I did not succeed.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.