Removing Duplicate Rows from a data.frame

Question

Asked 9 years, 9 months ago

Viewed 2,993 times

5

I have a data.frame that looks like this:

     values        ind
1  10.82000 2011-01-03
2  11.75000 2011-01-03
3  10.82000 2011-01-03
4  11.75000 2011-01-03
5  10.82000 2011-01-03
6  11.75000 2011-01-03
7  10.84048 2011-01-04
8  11.79000 2011-01-04
9  10.87095 2011-01-05
10 11.84000 2011-01-05
11 10.88928 2011-01-06
12 11.88000 2011-01-06
13 10.92000 2011-01-07
14 12.03000 2011-01-07
15 10.93984 2011-01-10
...
121 11.67614 2011-03-03
122 12.47000 2011-03-03
123 11.67481 2011-03-04
124 12.44000 2011-03-04
125 11.68514 2011-03-09
126 12.44000 2011-03-09
127 11.68514 2011-03-09
128 12.44000 2011-03-09
129 11.68514 2011-03-09
130 12.44000 2011-03-09
131 11.68514 2011-03-09
132 12.44000 2011-03-09
133 11.68514 2011-03-09
134 12.44000 2011-03-09
135 11.67746 2011-03-10

I need to delete rows 1 through 4, keeping rows 5 and 6. Likewise, I need to delete rows 125 through 132, keeping rows 133 and 134 (row 135 is not a repeat and stays).

Ideally the deletion would happen in order: whenever a pair of rows repeats, I would like to drop the earlier copies until only the last repetition is left.

Can you suggest something? I am a beginner in R and have been trying to write this since yesterday, but I am finding it very difficult.

It would help to see your code, so we can follow your logic and give a better answer. You can [edit] the question to add details.

– brasofilo

2015/10/12 at 14:47
It is not really code, just the command Matrix[-c(1:4, 15:18, 29:32, 43:46, 55:58), ].

– cassius

2015/10/12 at 14:49
As you mentioned "I am trying to create a code/function" in the question...

– brasofilo

2015/10/12 at 14:53
Okay. I’ll edit it.

– cassius

2015/10/12 at 14:54
Note what the duplicated function actually does. It flags only the repeats of a row, not every row that has a copy somewhere, so summing it counts the extra copies rather than all rows involved in a repetition. Negating it selects the non-duplicated rows plus the first row of each series of duplicates. Worth checking.

– C Lederman

2020/07/14 at 15:57

2 answers

3

Another way to do this is by using the function duplicated.

Using the data recreated by Carlos:

df <- read.table(text = "values        ind
1  10.82000 2011-01-03
2  11.75000 2011-01-03
3  10.82000 2011-01-03
4  11.75000 2011-01-03
5  10.82000 2011-01-03
6  11.75000 2011-01-03
7  10.84048 2011-01-04
8  11.79000 2011-01-04
9  10.87095 2011-01-05
10 11.84000 2011-01-05
11 10.88928 2011-01-06
12 11.88000 2011-01-06
13 10.92000 2011-01-07
14 12.03000 2011-01-07
15 10.93984 2011-01-10
121 11.67614 2011-03-03
122 12.47000 2011-03-03
123 11.67481 2011-03-04
124 12.44000 2011-03-04
125 11.68514 2011-03-09
126 12.44000 2011-03-09
127 11.68514 2011-03-09
128 12.44000 2011-03-09
129 11.68514 2011-03-09
130 12.44000 2011-03-09
131 11.68514 2011-03-09
132 12.44000 2011-03-09
133 11.68514 2011-03-09
134 12.44000 2011-03-09
135 11.67746 2011-03-10")

The function duplicated returns a logical vector with TRUE for every row that repeats another row.

duplicados <- duplicated(df, fromLast = TRUE)

The argument fromLast = TRUE makes the search start from the last row, so the earlier occurrences are the ones flagged as duplicates and the last occurrence of each row is left unflagged.
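To see the effect of fromLast in isolation, here is a minimal sketch on a small character vector. The vector x is made up for illustration and is not part of the question's data.

```r
# Hypothetical toy vector: "a" and "b" each appear twice
x <- c("a", "b", "a", "b", "c")

# Default: the first occurrence is kept, later repeats are flagged
first_kept <- duplicated(x)
print(first_kept)   # FALSE FALSE  TRUE  TRUE FALSE

# fromLast = TRUE: the last occurrence is kept, earlier ones are flagged
last_kept <- duplicated(x, fromLast = TRUE)
print(last_kept)    #  TRUE  TRUE FALSE FALSE FALSE

# Summing the flags counts the extra copies (2), not all repeated elements (4)
n_extra <- sum(first_kept)
```

Either way, exactly one copy of each distinct value stays unflagged; fromLast only decides which copy that is.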

The command below shows which rows were flagged as duplicates:

which(duplicados)

To get the data frame without the duplicated rows, subset it with the negated vector:

df[!duplicados,]
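The steps above can be run end to end on a smaller data frame. This is only a sketch: toy, dup_last and toy_clean are hypothetical names, and the six rows merely imitate the pattern in the question.

```r
# Hypothetical toy data: rows 1-2 are repeated as rows 3-4
toy <- data.frame(
  values = c(10.82, 11.75, 10.82, 11.75, 10.84, 11.79),
  ind    = c("2011-01-03", "2011-01-03", "2011-01-03", "2011-01-03",
             "2011-01-04", "2011-01-04")
)

# Flag every row that has an identical row further down
dup_last <- duplicated(toy, fromLast = TRUE)
which(dup_last)        # 1 2

# Keep only the unflagged rows; the original row names are preserved
toy_clean <- toy[!dup_last, ]
rownames(toy_clean)    # "3" "4" "5" "6"
```

Because the original row names survive the subset, it is easy to check which copy of each repeated row was kept.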

by Carlos Cinelli • **16,826** points · Answer 1 · 2015-10-12T20:20:09+00:00

You can use the function unique, which keeps only the unique rows of your data.frame. For example, recreating your data:

df <- read.table(text = "values        ind
1  10.82000 2011-01-03
2  11.75000 2011-01-03
3  10.82000 2011-01-03
4  11.75000 2011-01-03
5  10.82000 2011-01-03
6  11.75000 2011-01-03
7  10.84048 2011-01-04
8  11.79000 2011-01-04
9  10.87095 2011-01-05
10 11.84000 2011-01-05
11 10.88928 2011-01-06
12 11.88000 2011-01-06
13 10.92000 2011-01-07
14 12.03000 2011-01-07
15 10.93984 2011-01-10
121 11.67614 2011-03-03
122 12.47000 2011-03-03
123 11.67481 2011-03-04
124 12.44000 2011-03-04
125 11.68514 2011-03-09
126 12.44000 2011-03-09
127 11.68514 2011-03-09
128 12.44000 2011-03-09
129 11.68514 2011-03-09
130 12.44000 2011-03-09
131 11.68514 2011-03-09
132 12.44000 2011-03-09
133 11.68514 2011-03-09
134 12.44000 2011-03-09
135 11.67746 2011-03-10")

And applying unique:

unique(df)
      values        ind
1   10.82000 2011-01-03
2   11.75000 2011-01-03
7   10.84048 2011-01-04
8   11.79000 2011-01-04
9   10.87095 2011-01-05
10  11.84000 2011-01-05
11  10.88928 2011-01-06
12  11.88000 2011-01-06
13  10.92000 2011-01-07
14  12.03000 2011-01-07
15  10.93984 2011-01-10
121 11.67614 2011-03-03
122 12.47000 2011-03-03
123 11.67481 2011-03-04
124 12.44000 2011-03-04
125 11.68514 2011-03-09
126 12.44000 2011-03-09
135 11.67746 2011-03-10

Note that in this case it kept the first occurrence of each repeated row. If you want to keep the last one, as you described in your question, just set fromLast = TRUE.

unique(df, fromLast = TRUE)
      values        ind
5   10.82000 2011-01-03
6   11.75000 2011-01-03
7   10.84048 2011-01-04
8   11.79000 2011-01-04
9   10.87095 2011-01-05
10  11.84000 2011-01-05
11  10.88928 2011-01-06
12  11.88000 2011-01-06
13  10.92000 2011-01-07
14  12.03000 2011-01-07
15  10.93984 2011-01-10
121 11.67614 2011-03-03
122 12.47000 2011-03-03
123 11.67481 2011-03-04
124 12.44000 2011-03-04
133 11.68514 2011-03-09
134 12.44000 2011-03-09
135 11.67746 2011-03-10
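As a quick sanity check, unique with fromLast = TRUE should give the same result as subsetting with a negated duplicated. The sketch below compares the two on a made-up data frame; toy, via_unique and via_duplicated are hypothetical names used only for illustration.

```r
# Hypothetical toy data: the first two rows are repeated once
toy <- data.frame(
  values = c(1.5, 2.5, 1.5, 2.5, 3.5),
  ind    = c("a", "a", "a", "a", "b")
)

# Approach 1: unique() keeping the last occurrence of each row
via_unique <- unique(toy, fromLast = TRUE)

# Approach 2: drop the rows flagged by duplicated()
via_duplicated <- toy[!duplicated(toy, fromLast = TRUE), ]

# Both approaches keep rows 3, 4 and 5
same_result <- identical(via_unique, via_duplicated)
print(same_result)   # TRUE
```

unique is the shorter call when only the cleaned data is needed; duplicated is handy when you also want to inspect which rows were dropped.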