Word combination identification in R

I have hundreds of sentences like the ones below, and I would like to identify those that contain the words Telefone (phone) and apagar (delete). The two words must have exactly one term between them, in this case a telephone number.

I tried using the grepl function, but it returns every sentence that contains Telefone followed by apagar, regardless of how many terms come between them.

teles <- c("1 Telefone 44221201 apagar Bairro CENTRO apagar Notebook apagar",
           "2 Telefone 44221201 44221202 Bairro CENTRO2 apagar",
           "3 Telefone 44221203 44221202 EQUIPAMENTOS Blue-ray apagar",
           "4 Telefone 44220000 apagar EQUIPAMENTOS Televisão apagar",
           "5 EQUIPAMENTOS Televisão apagar Telefone 64221201 apagar",
           "6 EQUIPAMENTOS Antena apagar Telefone 54221201 apagar",
           "7 EQUIPAMENTOS DVD apagar EQUIPAMENTOS Antena apagar",
           "8 EQUIPAMENTOS DVD apagar EQUIPAMENTOS Antena apagar")

tel_apagar1 <- grepl("Telefone[^\\.,!?:;]*apagar", teles)
tel_apagar1

In this case, the function returns:

#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

However, I would like to keep only sentences 1, 4, 5 and 6, which contain the sequence Telefone (a number) apagar. The expected result would therefore be:

#[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

1 answer

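The following regular expression does what the question asks. It matches Telefone, optional whitespace, a single run of alphanumeric characters, optional whitespace, and then apagar, so a second term between the two words prevents a match.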

grep("Telefone\\s*[[:alnum:]]+\\s*apagar", teles, ignore.case = TRUE)
#[1] 1 4 5 6
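
If a logical vector like the one in the question is preferred over indices, grepl can be used with a slightly stricter pattern. This is only a sketch under a few assumptions: the term between the two words is always purely numeric, the terms are separated by at least one whitespace character, and Perl-style regular expressions (perl = TRUE) are acceptable. The names pattern and tel_apagar are chosen here for illustration.

```r
# Same sample data as in the question
teles <- c("1 Telefone 44221201 apagar Bairro CENTRO apagar Notebook apagar",
           "2 Telefone 44221201 44221202 Bairro CENTRO2 apagar",
           "3 Telefone 44221203 44221202 EQUIPAMENTOS Blue-ray apagar",
           "4 Telefone 44220000 apagar EQUIPAMENTOS Televisão apagar",
           "5 EQUIPAMENTOS Televisão apagar Telefone 64221201 apagar",
           "6 EQUIPAMENTOS Antena apagar Telefone 54221201 apagar",
           "7 EQUIPAMENTOS DVD apagar EQUIPAMENTOS Antena apagar",
           "8 EQUIPAMENTOS DVD apagar EQUIPAMENTOS Antena apagar")

# Stricter pattern (assumes the middle term is always a number):
#   \\b   word boundary, so "Telefone" and "apagar" match only as whole words
#   \\s+  at least one whitespace character between terms
#   \\d+  exactly one run of digits (the telephone number)
pattern <- "\\bTelefone\\s+\\d+\\s+apagar\\b"

# grepl returns one TRUE/FALSE per sentence
tel_apagar <- grepl(pattern, teles, perl = TRUE)
tel_apagar
#[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE

# Keep only the matching sentences
teles[tel_apagar]
```

Unlike \\s*, the \\s+ quantifier requires a separator, so a string such as "Telefone44221201apagar" would not match. If the middle term may contain letters as well, [[:alnum:]]+ from the answer above can be substituted for \\d+.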
