Remove duplicated names with regular expression

9

Suppose I have the following vector, with the names of presidents of the republic:

presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca", 
"Peixoto, FlorianoFloriano Peixoto", "de Morais, PrudentePrudente de Morais", 
"Sales, CamposCampos Sales")

I would like to format this vector so that it is possible to directly read each president’s name:

"Deodoro da Fonseca" "Floriano Peixoto" "Prudente de Morais" "Campos Sales"      

I imagine there’s some regular expression that does this, but I can’t build it.

5 answers

7


It’s not pretty but it worked:

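library(stringr)
# Match "Surname, Firstname" plus the first capital letter of the readable name
rex <- ".*, [:alpha:]{1,}[A-Z]{1}"
# Extract that prefix and drop its trailing capital letter
nomes_invertidos <- str_extract_all(presidentes, rex) %>% unlist() %>% str_sub(end = -2)
# Remove each prefix from its own string
str_replace_all(presidentes, nomes_invertidos, replacement = "")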

[1] "Deodoro da Fonseca" "Floriano Peixoto"   "Prudente de Morais" "Campos Sales"   

The regex picks up:

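  • anything up to the comma (.*),
  • the comma,
  • a space,
  • one or more letters followed by a single capital letter ([:alpha:]{1,}[A-Z]{1}). That capital is the first letter of the readable name, which is why str_sub(end = -2) drops it before the replacement.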
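The steps above can also be followed one at a time in base R. This is only a sketch, not the answer's own code: it assumes perl = TRUE with the bracketed class [[:alpha:]] in place of stringr's [:alpha:], and the intermediate name com_maiuscula is made up for illustration.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# Step 1: match "Surname, Firstname" plus the first capital of the readable name.
rex <- ".*, [[:alpha:]]+[A-Z]"
com_maiuscula <- regmatches(presidentes, regexpr(rex, presidentes, perl = TRUE))
# com_maiuscula[1] is "da Fonseca, DeodoroD"

# Step 2: drop that trailing capital letter.
nomes_invertidos <- substr(com_maiuscula, 1, nchar(com_maiuscula) - 1)

# Step 3: remove each inverted prefix from its own string, as literal text.
resultado <- mapply(function(prefixo, nome) sub(prefixo, "", nome, fixed = TRUE),
                    nomes_invertidos, presidentes, USE.NAMES = FALSE)
print(resultado)
```

Using fixed = TRUE in the last step treats the extracted prefix as plain text, so a name containing a regex metacharacter would not be misread as a pattern.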

  • It worked perfectly, Daniel. Thank you.

4

The solution may vary. I used the following:

Regex -> ^.+?,\s*(\w+)\1(.+?)$

Replacement -> $1$2

I don't know whether this works in R, but the backreference part (\w+)\1 captures only the chunk that is repeated (a single name), and the replacement concatenates it with the rest of the string, which is held in the second capture.

I tested it in Notepad++ and it worked.

In R, this expression can be used as follows:

gsub("^.+?,\\s*(\\w+)\\1(.+?)$", "\\1\\2", presidentes)
#[1] "Deodoro da Fonseca" "Floriano Peixoto"   "Prudente de Morais" "Campos Sales"
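To see what each capture group holds, the same pattern can be inspected with regexec and regmatches. This is a sketch assuming perl = TRUE; the names padrao, grupos and nomes are made up for illustration.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

padrao <- "^.+?,\\s*(\\w+)\\1(.+?)$"

# For each string: element 1 is the full match, 2 is group 1, 3 is group 2.
grupos <- regmatches(presidentes, regexec(padrao, presidentes, perl = TRUE))
print(grupos[[1]])
# group 1 is "Deodoro", group 2 is " da Fonseca" (with its leading space)

# Gluing group 1 and group 2 together is what the "\\1\\2" replacement does.
nomes <- vapply(grupos, function(g) paste0(g[2], g[3]), character(1))
print(nomes)
```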

  • It worked and looks very elegant. In R, with gsub, your regex would look like this: gsub("^.+?,\\s*(\\w+)\\1(.+?)$", "\\1\\2", presidentes). I took the liberty of adding it to the answer.

  • +1 for extending and improving the solution for this language. Thank you.

1

gsub(".*[a-z]([A-Z])", "\\1", presidentes)

that is to say:

de Morais, PrudentePrudente de Morais
..................eP
                   ↓
                   Prudente de Morais
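A small check of this pattern, assuming the vector is called presidentes as in the question and adding perl = TRUE so that [a-z] and [A-Z] are plain ASCII ranges. The greedy .* runs to the last place where a lowercase letter is directly followed by a capital, so the approach relies on that boundary appearing only at the duplicated name. The "McDonald" input is a made-up example, not from the question, that shows where it would cut too much.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# Everything up to the last lowercase/uppercase boundary is replaced
# by the captured capital letter.
resultado <- gsub(".*[a-z]([A-Z])", "\\1", presidentes, perl = TRUE)
print(resultado)

# Hypothetical input: the surname has its own lowercase/uppercase
# boundary ("cD"), so the match runs past the duplicated first name.
exemplo <- gsub(".*[a-z]([A-Z])", "\\1", "McDonald, RonaldRonald McDonald", perl = TRUE)
print(exemplo)
# exemplo is "Donald"
```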

  • Hi there, how's it going? Can you explain in detail what your code does?

1

I don't know R, but as I explained here, you can do a simple search that finds an immediately repeated sequence and replaces it with a single copy.

  • Pattern : ([a-z]+)\1
  • replace : $1
  • flag : i, and g, if the language needs to specify "replace all".

REGEX in JS : str.replace(/([a-z]+)\1/gi, '$1')

See it working on regex101

Explanation

  • ([a-z]+) - Group 1.
  • [a-z] - Restricts the match to letters; because of the i flag, both uppercase and lowercase are accepted.
  • \1 - Matches the text captured by group 1 again, which is what finds the duplicated part.
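A rough R translation of this pattern, assuming ignore.case = TRUE stands in for the i flag and gsub supplies the "replace all" behaviour. As far as I can tell, on the question's data this step only collapses the doubled first name and leaves the "Surname, " prefix in place, so the sketch adds a second sub() call to strip it. That second step is my assumption and not part of the answer above.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# Step 1: collapse any run of letters that is immediately repeated.
sem_duplicata <- gsub("([a-z]+)\\1", "\\1", presidentes,
                      ignore.case = TRUE, perl = TRUE)
print(sem_duplicata)
# sem_duplicata[1] is "da Fonseca, Deodoro da Fonseca"

# Step 2 (assumption, not in the answer): drop everything up to ", ".
resultado <- sub("^.*, ", "", sem_duplicata)
print(resultado)
```

One caveat: because the pattern matches any repeated run of letters, a name with a doubled letter (for example "ss") would also be collapsed to a single letter.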

1

I know that there are already some answers to this question and that it was asked years ago, but I was practicing other ways of doing this.

I ended up arriving at the following with str_replace:

str_replace(presidentes,
            pattern = "(.*, \\w+)([A-Z].*)",
            replacement = "\\2")

[1] "Deodoro da Fonseca" "Floriano Peixoto"   "Prudente de Morais" "Campos Sales"

The meaning:

  • (.*, \\w+) - take "anything, followed by a comma, followed by a space, followed by a word" and place in a first block;

  • ([A-Z].*) - after the first block, take "the first uppercase letter followed by anything" and place in a second block

With replacement = "\\2" I ask to return only the second block.
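For anyone without stringr, the same idea should carry over to base R. This is a sketch assuming sub() with perl = TRUE; the names padrao, primeiro_bloco and segundo_bloco are made up for illustration. Returning "\\1" instead of "\\2" shows what the first block swallows.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

padrao <- "(.*, \\w+)([A-Z].*)"

# First block: the inverted "Surname, Firstname" part that gets discarded.
primeiro_bloco <- sub(padrao, "\\1", presidentes, perl = TRUE)
print(primeiro_bloco)

# Second block: the readable name that is kept.
segundo_bloco <- sub(padrao, "\\2", presidentes, perl = TRUE)
print(segundo_bloco)
```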
