Classify a text vector using regular expressions in R

Let’s say I have the following character vector of texts:

d <- data.frame(id=1:3, 
                txt=c('Um gato e um cachorro', 
                      'Cachorros jogam bola usando alpargatas', 
                      'gatinhos cospem bolas de pêlos'), stringsAsFactors=FALSE)

I would like to add a Boolean column to d that is TRUE if the text contains (gato or cachorro) and bola, that is, (cat or dog) and ball.

One alternative would be to create a column for each of these expressions and then combine them with a logical operation. Using the packages dplyr and stringr (note that I don’t know much about regex, so the patterns ended up big, ugly and inefficient, but that’s not important):

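library(dplyr)
library(stringr)
# regex(..., ignore_case=TRUE) replaces ignore.case(), which newer stringr versions no longer provide
d %>%
  mutate(gato=str_detect(txt, regex('^gat[aio]| gat[aio]', ignore_case=TRUE)),
         cachorro=str_detect(txt, regex('cachor', ignore_case=TRUE)),
         bola=str_detect(txt, regex('bola', ignore_case=TRUE)),
         result=(gato | cachorro) & bola)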

Result:

  id                                    txt  gato cachorro  bola result
1  1                  Um gato e um cachorro  TRUE     TRUE FALSE  FALSE
2  2 Cachorros jogam bola usando alpargatas FALSE     TRUE  TRUE   TRUE
3  3         gatinhos cospem bolas de pêlos  TRUE    FALSE  TRUE   TRUE

Now, generalizing the question: say I have a set of p regular expressions to be applied to a text vector of length n, and I want to create a Boolean column that is the result of a logical operation on the detection of these expressions in the texts.

Is there a way to solve this without having to evaluate the text p times? That is, can I reduce the number of times I apply str_detect to my text?

The reason for the question is that i) both my n and my p are very large, and ii) I did not want to explicitly write out a lot of Boolean variables.

An answer compatible with dplyr would be great, but it is not required. I would appreciate any contribution!

1 answer

@Juliotrecenti, there is a way: put the logical tests inside the regex itself. The pipe (|) is the equivalent of an OR operation, but to implement the AND operation we need lookahead. I am also going to use grepl() instead of str_detect(), because here I only need a TRUE/FALSE answer for a regular expression, and grepl() already does that job. Assuming the data.frame was created as in your question, the following code solves the problem in one line:

d %>% 
  # the OR alternatives are grouped so that .* applies to both of them, not only to gat[aio]
  mutate(result = grepl('(?=.*(?:gat[aio]|[cC]achor))(?=.*bola)', txt, perl = TRUE))

  id                                    txt result
1  1                  Um gato e um cachorro  FALSE
2  2 Cachorros jogam bola usando alpargatas   TRUE
3  3         gatinhos cospem bolas de pêlos   TRUE

Note that (?=something)(?=something else) tells the regex to look ahead for "something" and then, from the same position, look ahead again for "something else". Here something = gato OR cachorro, where the OR is represented by the pipe. Also note the perl = TRUE option in grepl() (T is shorthand for TRUE), which tells R to use Perl-compatible regular expressions. Without it, lookahead does not work.
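For the general case of p expressions, the same idea can be sketched as a small helper that builds the pattern, so the text is passed to grepl() only once. This is just an illustration under assumptions: build_and_pattern is a hypothetical name (not a function from base R, dplyr or stringr), the leading ^ is added only so the engine tests the lookaheads once per text, and ignore.case = TRUE is used in place of [cC].

```r
# Hypothetical helper: each element of `groups` is a character vector of
# alternatives (joined with OR); the groups themselves are joined with AND
# by wrapping each one in its own lookahead.
build_and_pattern <- function(groups) {
  ors <- vapply(groups, function(g) paste(g, collapse = "|"), character(1))
  paste0("^", paste0("(?=.*(?:", ors, "))", collapse = ""))
}

txt <- c("Um gato e um cachorro",
         "Cachorros jogam bola usando alpargatas",
         "gatinhos cospem bolas de pêlos")

# (gato OR cachorro) AND bola
pattern <- build_and_pattern(list(c("gat[aio]", "cachor"), "bola"))
result <- grepl(pattern, txt, perl = TRUE, ignore.case = TRUE)
print(result)  # FALSE TRUE TRUE
```

Because every lookahead starts from the same position, the order of the words in the text does not matter: a text such as "A bola do cachorro" is also matched.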

  • Thanks, @Flaviobarros! One thing I didn’t understand: why is the first ?= needed in (?=something), if it will look back and find nothing?

  • That is how the AND is improvised. In effect we are telling the regex to check the text twice, once for each pattern.

  • Wonderful. Thank you.

  • One more thing: grepl('.*gat[aio]', 'Cachorros jogam bola usando alpargatas') returns TRUE, which I don’t want. That’s why I used '^gat[aio]| gat[aio]'. Is there a better solution?

  • @Juliotrecenti you could use \s to represent a whitespace character before the word. In R you have to add an extra backslash (\\s), so it would look like this: grepl('.*\\sgat[aio]', 'Cachorros jogam bola usando alpargatas')
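One caveat with \\s is that it needs an actual whitespace character, so a text that starts with the word (like "gatinhos ...") is not matched. A possible alternative, shown here only as a sketch and not as something proposed in the thread, is the word boundary \\b, which also matches at the start of the string:

```r
txt <- c("Um gato e um cachorro",
         "Cachorros jogam bola usando alpargatas",
         "gatinhos cospem bolas de pêlos")

# \\b matches between a word character and a non-word character (or the
# string edge), so the "gata" inside "alpargatas" is not matched.
starts_with_gat <- grepl("\\bgat[aio]", txt, perl = TRUE, ignore.case = TRUE)
print(starts_with_gat)  # TRUE FALSE TRUE

# The same boundary used inside the combined lookahead pattern.
result <- grepl("^(?=.*\\b(?:gat[aio]|cachor))(?=.*\\bbola)", txt,
                perl = TRUE, ignore.case = TRUE)
print(result)  # FALSE TRUE TRUE
```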
