How to return only repeated values in R?

0

Suppose the following data frame:

ref<-data.frame(autores=c("AZEVEDO, L. S.; NASCIMENTO, E. F.; CANDEIAS, A. L. B.",
       "BERGER, R.; SILVA, J. A. A.; FERREIRA, R. L. C.; CANDEIAS, A. L. B.; RUBILAR, R.",
       "AZEVEDO, L. S.; CANDEIAS, ANA LÚCIA BEZERRA",
       "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA",
       "OLIVEIRA, CLAUDIANNE BRAINER DE SOUZA; CANDEIAS, ANA LÚCIA BEZERRA; TAVARES JUNIOR, J. R.",
       "SANTOS, AMANDA PEREIRA; SILVA, EDER BATISTA DA; CANDEIAS, ANA LÚCIA BEZERRA; COSTA, MARIA APARECIDA TENÓRIO DA",
       "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA",
       "SILVA, JADSON FREIRE; PAZ, YENÊ MEDEIROS; LIMA-SILVA, PEDRO PAULO; PEREIRA, JOÃO ANTÔNIO DOS SANTOS; CANDEIAS, ANA LÚCIA BEZERRA",
       "ALEXANDRE, FERNANDO DA SILVA; CANDEIAS, ANA LÚCIA BEZERRA; GOMES, DANIEL DANTAS MOREIRA"))
autores
1                                                                            AZEVEDO, L. S.; NASCIMENTO, E. F.; CANDEIAS, A. L. B.
2                                                 BERGER, R.; SILVA, J. A. A.; FERREIRA, R. L. C.; CANDEIAS, A. L. B.; RUBILAR, R.
3                                                                                      AZEVEDO, L. S.; CANDEIAS, ANA LÚCIA BEZERRA
4                                                     SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA
5                                        OLIVEIRA, CLAUDIANNE BRAINER DE SOUZA; CANDEIAS, ANA LÚCIA BEZERRA; TAVARES JUNIOR, J. R.
6                   SANTOS, AMANDA PEREIRA; SILVA, EDER BATISTA DA; CANDEIAS, ANA LÚCIA BEZERRA; COSTA, MARIA APARECIDA TENÓRIO DA
7                                                     SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA
8 SILVA, JADSON FREIRE; PAZ, YENÊ MEDEIROS; LIMA-SILVA, PEDRO PAULO; PEREIRA, JOÃO ANTÔNIO DOS SANTOS; CANDEIAS, ANA LÚCIA BEZERRA
9                                          ALEXANDRE, FERNANDO DA SILVA; CANDEIAS, ANA LÚCIA BEZERRA; GOMES, DANIEL DANTAS MOREIRA

There is a repeated value: "SILVA, JADSON FREIRE; MIRANDA, RODRIGO QUEIROGA; CANDEIAS, ANA LÚCIA BEZERRA"

I can detect it with duplicated():

duplicated(ref)

[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

I can find the position of the duplicated value with which():

which(duplicated(ref))

[1] 7

But what I really want is a data frame containing only the repeated value.

The file in Excel: references

I import the file

df<-rio::import("coautoria.artigos.original.xlsx")

Since it is a data.frame with multiple columns, I use the form that keeps all columns:

df2<-df[duplicated(df$artigo), ]

I then try to group the repeated articles by sorting on the artigo column. But the result does not contain only repeated articles.

df2 %>% 
  arrange(artigo)

Some repeated articles appear, but others do not.

Shouldn't it return only the repeated ones?

An example right at the beginning of the data frame: the first article that appears ("THE PRODUCTION OF THE TOURIST AREA VIA ACCUMULATION...") is repeated. The same article is listed under two names in the "teacher" column: "Itamar" and "Edvania".

The two rows should then appear one below the other, right? One for the teacher "Edvania" and another for the teacher "Itamar". Or am I wrong?

  • 3

    repetido <- ref[duplicated(ref), , drop = FALSE]. It is necessary to use drop = FALSE to maintain the dataframe structure.

  • I’m sorry Rui, but I don’t get it. The square brackets take a "row" and a "column" index, right? In this case, "duplicated(ref)" would be the row index of "ref[ ]", and the empty slot in ", ," would mean "all columns"? It works, I tested it here, but I wanted to understand why.

  • Yes, the empty position between the commas refers to all columns. Giving only a row index is the same as saying "these rows, regardless of column", that is, all columns. Likewise, when you give only column index(es), you are referring to all rows.
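The indexing rule discussed in these comments can be tried on a small example. This is only a minimal sketch: the data frame df and its values are made up for illustration and are not the asker's data.

```r
# Toy one-column data frame (hypothetical data, not the asker's file)
df <- data.frame(x = c("a", "b", "a", "c", "b"), stringsAsFactors = FALSE)

dup <- duplicated(df)                 # logical vector, one value per row

# Row index only, column slot left empty: "these rows, all columns".
# With a single column, R drops the result to a plain vector.
as_vector <- df[dup, ]

# Same selection, but drop = FALSE keeps the data.frame structure.
as_frame <- df[dup, , drop = FALSE]

class(as_vector)   # "character"
class(as_frame)    # "data.frame"
```

With two or more columns the result stays a data frame even without drop = FALSE, so the argument matters mainly in the one-column case.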

2 answers

2

Rui Barradas has already answered in the comments; I will expand on it. To make things easier to see and the answer more general, I will use simulated data:

set.seed(736)
letras <- sample(LETTERS[1:10], 10, replace = TRUE)

vector

letras[duplicated(letras)]

data.frame with a column

exdf <- as.data.frame(letras)

# Returning the result as a vector:
exdf[duplicated(exdf), ]

# Keeping the data.frame structure:
exdf[duplicated(exdf), , drop = FALSE]

data.frame with multiple columns

exdf$numero <- 1:nrow(exdf)

# Keeping all columns:
exdf[duplicated(exdf$letras), ]

# Only the letras column, as a vector:
exdf[duplicated(exdf$letras), "letras"]

# Only the letras column, keeping the structure:
exdf[duplicated(exdf$letras), "letras", drop = FALSE]
# or
exdf[duplicated(exdf$letras), ]["letras"]

# More than one column as the criterion:
exdf[duplicated(exdf[c("letras", "numero")]), ]

# All columns (i.e. identical rows):
exdf[duplicated(exdf), ]
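One detail worth keeping in mind with all of the forms above: duplicated() flags only the second and later occurrences of a value, so the first copy of each repeated item is not returned. If every copy is wanted, one common approach is to combine it with a second pass using fromLast = TRUE. A minimal sketch on made-up data (the column professor and all values are hypothetical; only the column name artigo comes from the question):

```r
# Hypothetical data: the same article listed under different teachers
df <- data.frame(
  artigo    = c("A", "B", "A", "C", "B", "A"),
  professor = c("p1", "p2", "p3", "p4", "p5", "p6"),
  stringsAsFactors = FALSE
)

# Second and later occurrences only (first copies are missing)
later_only <- df[duplicated(df$artigo), ]

# Every row whose artigo occurs more than once:
# flagged scanning forward OR flagged scanning backward
is_repeated <- duplicated(df$artigo) | duplicated(df$artigo, fromLast = TRUE)
all_copies  <- df[is_repeated, ]

# Sort so that copies of the same article sit next to each other
all_copies <- all_copies[order(all_copies$artigo), ]
```

Here later_only has 3 rows, while all_copies has 5: all rows except the one for article "C".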

General advice for R: learn how to work with indexing and subsetting. It is a basic and powerful feature, but often neglected. See the help page with ?"[". In Portuguese, this UFPR class gives a good overview.

  • Hi Carlos Eduardo, your explanation is very good. I can reproduce your examples, but when I work with my own df I still have no success. I can’t figure out what I’m doing wrong. I edited the question to add the example I’m working with and a link to the file.

  • 1

    Which column, or combination of columns, is your criterion for a duplicate? The full reference? Use ref[duplicated(ref$referencia.completa),]. The title of the article? ref[duplicated(ref$artigo),]. The combination of title and authors? ref[duplicated(ref[c("autores","artigo")]), ]. That is why I made a generic example and pointed to the topic your question involves: once you understand the principle, you can apply it to any situation.

  • Now it works, Carlos. Thank you very much

1


library(readxl)
library(dplyr)
df <- read_xlsx('./coautoria.artigos.original.xlsx')

df %>% add_count(artigo) %>%
  filter( n > 1)

Have you tried using add_count? I don’t know if this is what you want, but I tested it this way and it returned the duplicated values.
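The idea behind add_count followed by filter(n > 1) is to count how many times each artigo occurs and keep the rows whose count is above one, which returns every copy, including the first. For readers without dplyr, a rough base R counterpart is sketched below. It is an illustration of the same idea rather than the code above, and the data and the column professor are hypothetical.

```r
# Hypothetical data standing in for the imported spreadsheet
df <- data.frame(
  artigo    = c("A", "B", "A", "C", "B"),
  professor = c("p1", "p2", "p3", "p4", "p5"),
  stringsAsFactors = FALSE
)

# For each row, the number of rows that share its artigo value
n <- ave(seq_along(df$artigo), df$artigo, FUN = length)

# Keep only rows whose artigo appears more than once
repeated <- df[n > 1, ]
```

On this toy data, repeated keeps the four rows for articles "A" and "B" and drops the single row for "C".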

  • 1

    Perfect lmonferrari, that’s exactly what I needed. Thank you very much
