Remove duplicate cases and keep specific values from another variable

Consider the following situation:

I have a dataset with two variables. The first contains duplicated values (e.g. one CPF xxx.xxx.xxx-xx appears 14 times, another appears 18 times, and so on). The second holds the date on which the event occurred (e.g. 2017-01-18, 2017-01-19, ...) for each CPF.

I use the following function to remove duplicate cases:

new <- dataset[!duplicated(dataset[c("CPFs")]), ]

This does remove the duplicate rows.

My goal: remove the duplicates in CPFs while keeping, in the other variable (date), the most recent (or the oldest) date for each CPF. In other words, the rows need to be sorted before the duplicates are dropped.

So if a CPF has the dates 2018-01-20 and 2017-02-22, the date kept for it (when keeping the oldest) would be 2017-02-22.

A made-up dput to help with answering:

dataset=structure(list(CPFs = c(1234, 2345, 1234, 2345, 1234, 2345, 1234, 
2345), date = c(1998, 1997, 1993, 1992, 1998, 1998, 1992, 1989
)), class = "data.frame", row.names = c(NA, -8L))

Desired result:

CPFs date
1234 1992
2345 1989

1 answer

One simple way to solve this is with the dplyr package, part of the tidyverse:

  library(dplyr)

  new_dataset <- dataset %>% 
    arrange(date) %>%                   # oldest date first
    distinct(CPFs, .keep_all = TRUE)    # keep the first row for each CPF
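
For comparison, the same idea can probably be sketched in base R, building on the duplicated() call from the question: sort by date first, then drop the repeated CPFs. This is only a minimal sketch using the sample data from the question; the names sorted and oldest are made up for illustration.

```r
# Sample data from the question
dataset <- data.frame(
  CPFs = c(1234, 2345, 1234, 2345, 1234, 2345, 1234, 2345),
  date = c(1998, 1997, 1993, 1992, 1998, 1998, 1992, 1989)
)

# Sort rows by date, oldest first (hypothetical helper name: sorted)
sorted <- dataset[order(dataset$date), ]

# duplicated() flags every repeat after the first occurrence,
# so the first (oldest) row of each CPF is the one that survives
oldest <- sorted[!duplicated(sorted$CPFs), ]

print(oldest)
```

With this data, each CPF should end up with its earliest date, matching the desired result in the question.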

Note that the dates should be stored as Date rather than as strings; otherwise the sorting may not work properly (for example, strings in dd/mm/yyyy format sort in character order, not chronologically).
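
To illustrate the point about date types, here is a small sketch of converting text dates with as.Date before sorting. The data frame dataset_chr and the day/month/year format are assumptions made up for this example; the format string has to match however your dates are actually written.

```r
# Hypothetical data: dates stored as day/month/year strings
dataset_chr <- data.frame(
  CPFs = c(1234, 1234, 2345),
  date = c("20/01/2018", "22/02/2017", "05/03/2016"),
  stringsAsFactors = FALSE
)

# As strings, the values are compared character by character,
# so "20/01/2018" is placed before "22/02/2017"
sorted_as_text <- sort(dataset_chr$date)

# Convert to Date; format must describe the layout of the strings
dataset_chr$date <- as.Date(dataset_chr$date, format = "%d/%m/%Y")

# As Date, sorting is chronological
sorted_as_date <- sort(dataset_chr$date)
print(sorted_as_date)
```

Once the column is a Date, arrange(date) or order() should sort it chronologically.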

If you want to keep the most recent date instead, use arrange(desc(date)), that is, sort in descending order.

  • Indeed, dplyr's arrange function solves it. Thanks, @David.
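
As a sketch of the descending variant, applied to the sample data from the question (the name newest is made up for illustration, and dplyr is assumed to be installed):

```r
library(dplyr)

# Sample data from the question
dataset <- data.frame(
  CPFs = c(1234, 2345, 1234, 2345, 1234, 2345, 1234, 2345),
  date = c(1998, 1997, 1993, 1992, 1998, 1998, 1992, 1989)
)

newest <- dataset %>%
  arrange(desc(date)) %>%             # most recent date first
  distinct(CPFs, .keep_all = TRUE)    # keep the first row for each CPF

print(newest)
```

In this sample both CPFs have 1998 as their latest date, so both remaining rows should show 1998.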
