How to return only repeated values in R?

Question

Suppose the following data frame:

ref<-data.frame(autores=c("AZEVEDO, L. S.; NASCIMENTO, E. F.; CANDEIAS, A. L. B.",
       "BERGER, R.; SILVA, J. A. A.; FERREIRA, R. L. C.; CANDEIAS, A. L. B.; RUBILAR, R.",
       "AZEVEDO, L. S.; CANDEIAS, ANA LÚCIA BEZERRA",
       "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA",
       "OLIVEIRA, CLAUDIANNE BRAINER DE SOUZA; CANDEIAS, ANA LÚCIA BEZERRA; TAVARES JUNIOR, J. R.",
       "SANTOS, AMANDA PEREIRA; SILVA, EDER BATISTA DA; CANDEIAS, ANA LÚCIA BEZERRA; COSTA, MARIA APARECIDA TENÓRIO DA",
       "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA",
       "SILVA, JADSON FREIRE; PAZ, YENÊ MEDEIROS; LIMA-SILVA, PEDRO PAULO; PEREIRA, JOÃO ANTÔNIO DOS SANTOS; CANDEIAS, ANA LÚCIA BEZERRA",
       "ALEXANDRE, FERNANDO DA SILVA; CANDEIAS, ANA LÚCIA BEZERRA; GOMES, DANIEL DANTAS MOREIRA"))

autores
1                                                                            AZEVEDO, L. S.; NASCIMENTO, E. F.; CANDEIAS, A. L. B.
2                                                 BERGER, R.; SILVA, J. A. A.; FERREIRA, R. L. C.; CANDEIAS, A. L. B.; RUBILAR, R.
3                                                                                      AZEVEDO, L. S.; CANDEIAS, ANA LÚCIA BEZERRA
4                                                     SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA
5                                        OLIVEIRA, CLAUDIANNE BRAINER DE SOUZA; CANDEIAS, ANA LÚCIA BEZERRA; TAVARES JUNIOR, J. R.
6                   SANTOS, AMANDA PEREIRA; SILVA, EDER BATISTA DA; CANDEIAS, ANA LÚCIA BEZERRA; COSTA, MARIA APARECIDA TENÓRIO DA
7                                                     SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA
8 SILVA, JADSON FREIRE; PAZ, YENÊ MEDEIROS; LIMA-SILVA, PEDRO PAULO; PEREIRA, JOÃO ANTÔNIO DOS SANTOS; CANDEIAS, ANA LÚCIA BEZERRA
9                                          ALEXANDRE, FERNANDO DA SILVA; CANDEIAS, ANA LÚCIA BEZERRA; GOMES, DANIEL DANTAS MOREIRA

There is a repeated value: "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA"

I can identify it with duplicated():

duplicated(ref)

[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

I can find the position of the duplicated value with which():

which(duplicated(ref))

[1] 7

But what I really want is a data frame containing only the repeated value.

My actual data are in an Excel file: references

I import the file:

df<-rio::import("coautoria.artigos.original.xlsx")

Since it is a data.frame with multiple columns, I use the form that keeps all columns:

df2<-df[duplicated(df$artigo), ]

I try to put the repeated articles next to each other by sorting the data on the artigo column. But the result is not just the repeated articles.

library(dplyr)  # provides %>% and arrange()

df2 %>% 
  arrange(artigo)

Some repeated articles appear, but others do not.

It should return only the repeated ones, shouldn't it?

An example right at the beginning of the data frame: the first article that appears ("THE PRODUCTION OF THE TOURIST AREA VIA ACCUMULATION...") is repeated. The same article is listed under two authors (column "teacher"): "Itamar" and "Edvania".

The two rows should then appear one below the other, right? One for the teacher "Edvania" and another for the teacher "Itamar". Or am I wrong?

repetido <- ref[duplicated(ref), , drop = FALSE]. It is necessary to use drop = FALSE to maintain the dataframe structure.

– Rui Barradas

2020/08/31 at 16:45

I'm sorry, Rui, but I don't get it. The square brackets take a "row" and a "column" reference, don't they? In this case, "duplicated(ref)" would be the row part of "ref[ ]" and ", ," would mean "all columns"? It works, I tested it here, but I would like to understand why.

– itamar

2020/08/31 at 19:26

Yes, the empty position between the two commas refers to all columns. Giving only a row index is the same as saying "these rows, regardless of column", that is, all columns. The same happens when you give only column indices: you are referring to all rows.

– Rui Barradas

2020/08/31 at 22:23
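
The indexing rule discussed in these comments can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical and stand in for `ref`; the only point is the difference between leaving the column position empty with and without `drop = FALSE`.

```r
# Hypothetical one-column data frame, used only for illustration
toy <- data.frame(autores = c("A; B", "C; D", "A; B"))

# Logical row index: TRUE marks a row that repeats an earlier row
rows <- duplicated(toy)

# Rows selected, column position left empty (meaning all columns).
# With a single column, the default drop = TRUE simplifies the result to a vector.
as_vector <- toy[rows, ]

# Same selection, but drop = FALSE keeps the data.frame structure
as_df <- toy[rows, , drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_df)      # TRUE
```

With two or more columns the result of `toy[rows, ]` stays a data frame either way, so `drop = FALSE` matters mainly in the single-column case.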

2 answers

library(readxl)
library(dplyr)
df <- read_xlsx('./coautoria.artigos.original.xlsx')

df %>% add_count(artigo) %>%
  filter( n > 1)

Have you tried add_count()? I'm not sure this is what you want, but when I tested it this way it returned the duplicated values.
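
For readers without dplyr, the same idea (count the rows per `artigo`, then keep the groups with more than one row) can be sketched in base R. This is only an illustration under assumptions: `toy` is a made-up stand-in for the spreadsheet, and the column name `docente` is hypothetical.

```r
# Hypothetical stand-in for the imported spreadsheet
toy <- data.frame(
  artigo  = c("X", "Y", "X", "Z", "Y"),
  docente = c("Itamar", "Itamar", "Edvania", "Itamar", "Edvania"),
  stringsAsFactors = FALSE
)

# Number of rows sharing each artigo (the role add_count() plays above)
toy$n <- ave(seq_along(toy$artigo), toy$artigo, FUN = length)

# Keep every row whose artigo occurs more than once,
# then sort so that the copies sit one below the other
repetidos <- toy[toy$n > 1, ]
repetidos <- repetidos[order(repetidos$artigo), ]
repetidos
```

Because the count is attached to every row of a group, the first occurrence of each repeated article is kept along with the later ones.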

Perfect, lmonferrari, that's exactly what I needed. Thank you very much.

– itamar

2020/09/01 at 12:35

by Carlos Eduardo Lagosta • **5,497** points · Answer 1 · 2020-08-31T20:07:12+00:00

Rui Barradas has already answered in the comments; I will expand on it. To make things easier to follow and keep the answer general, I will use simulated data:

set.seed(736)
letras <- sample(LETTERS[1:10], 10, replace = TRUE)

vector

letras[duplicated(letras)]

data.frame with a column

exdf <- as.data.frame(letras)

# Returning a vector:
exdf[duplicated(exdf), ]

# Keeping the data.frame structure:
exdf[duplicated(exdf), , drop = FALSE]

data.frame with multiple columns

exdf$numero <- 1:nrow(exdf)

# Keeping all columns:
exdf[duplicated(exdf$letras), ]

# Only the letras column, as a vector:
exdf[duplicated(exdf$letras), "letras"]

# Only the letras column, keeping the structure:
exdf[duplicated(exdf$letras), "letras", drop = FALSE]
# or
exdf[duplicated(exdf$letras), ]["letras"]

# More than one column as the criterion:
exdf[duplicated(exdf[c("letras", "numero")]), ]

# All columns (i.e. identical rows):
exdf[duplicated(exdf), ]
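
One point worth noting about the examples above: duplicated() marks only the second and later occurrences of a value, so the first copy is never returned. This is probably why the filtered spreadsheet in the question shows some repeated articles only once. A minimal sketch of recovering every copy, on a made-up data frame (`toy` and its values are hypothetical), combines a forward and a backward pass using the `fromLast` argument:

```r
# Hypothetical data, used only for illustration
toy <- data.frame(
  letras = c("A", "B", "A", "C", "B", "A"),
  numero = 1:6
)

# Forward pass: flags rows 3, 5 and 6 (later copies only)
later <- duplicated(toy$letras)

# Backward pass: flags every copy except the last one
earlier <- duplicated(toy$letras, fromLast = TRUE)

# Either flag set: all rows whose value appears more than once
todas <- toy[later | earlier, ]

# Sort so that the copies appear one below the other
todas[order(todas$letras), ]
```

Here the row with "C" is dropped because it occurs once, while all three "A" rows and both "B" rows are kept.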

General advice for R: learn how indexing and generic extraction work. It is a basic and powerful feature, but often overlooked. See the help page with ?"[". In Portuguese, this UFPR class gives a good overview.