Classify a text vector using regular expressions in R

Question

Asked 11 years, 5 months ago

Let’s say I have the following vector of texts (character), stored in a data frame:

d <- data.frame(id=1:3, 
                txt=c('Um gato e um cachorro', 
                      'Cachorros jogam bola usando alpargatas', 
                      'gatinhos cospem bolas de pêlos'), stringsAsFactors=FALSE)

I would like to add a Boolean column to d that is TRUE if the text contains (cat or dog) and ball, i.e. (gato or cachorro) and bola.

One alternative would be to create a column for each of these expressions and then combine them with a logical operation. Using the dplyr and stringr packages (note that I don’t know much about regex, so the patterns came out big, ugly and inefficient, but that’s not important):

library(dplyr)
library(stringr)
# regex(..., ignore_case=TRUE) replaces the deprecated ignore.case() wrapper
d %>%
  mutate(gato=str_detect(txt, regex('^gat[aio]| gat[aio]', ignore_case=TRUE)),
         cachorro=str_detect(txt, regex('cachor', ignore_case=TRUE)),
         bola=str_detect(txt, regex('bola', ignore_case=TRUE)),
         result=(gato | cachorro) & bola)

Result:

  id                                    txt  gato cachorro  bola result
1  1                  Um gato e um cachorro  TRUE     TRUE FALSE  FALSE
2  2 Cachorros jogam bola usando alpargatas FALSE     TRUE  TRUE   TRUE
3  3         gatinhos cospem bolas de pêlos  TRUE    FALSE  TRUE   TRUE

Now, generalizing the question: say I have a set of p regular expressions to apply to a text vector of length n, and I want to create a Boolean column that is the result of a logical operation on the detections of these expressions in the texts.

My question: is there a way to solve this without having to evaluate the text p times? That is, can I reduce the number of times I apply str_detect to my text?

The reason for the question is that i) both my n and my p are very large, and ii) I do not want to explicitly write out a lot of Boolean variables.

An answer compatible with dplyr would be great, but it is not required. I would appreciate any contribution!

1 answer

by Flavio Barros • **1,717** points · Answer 1 · 2014-10-03T20:43:59+00:00

@Juliotrecenti, there is a way: include the logical tests in the regex itself. Note that the pipe (|) is the equivalent of an OR operation, but to implement the AND operation we need lookahead. Another point is that I’m going to use grepl() instead of str_detect(), because in this case I simply want to evaluate a regular expression with a TRUE/FALSE response, and grepl() already does that job. Assuming the data.frame was created as in your question, the following code solves the problem in one line:

# (?:...) groups the alternatives so that .* applies to both of them
d %>%
  mutate(result = grepl('(?=.*(?:gat[aio]|[cC]achor))(?=.*bola)', txt, perl = TRUE))

  id                                    txt result
1  1                  Um gato e um cachorro  FALSE
2  2 Cachorros jogam bola usando alpargatas   TRUE
3  3         gatinhos cospem bolas de pêlos   TRUE

Note that (?=something)(?=something else) means: look ahead for "something", then return to the same position and look ahead for "something else", so both must be found. Here "something" is gato OR cachorro (cat or dog), where the OR is represented by the pipe. Also note the "perl = TRUE" option in grepl(), which tells R to use Perl-compatible regular expressions. Without it, lookahead does not work.
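
For the general case in the question (p expressions), the combined pattern could also be assembled programmatically instead of typed by hand. What follows is only a sketch in base R, not part of the original answer: regex_any() and regex_all() are hypothetical helper names, and the fourth text is an extra example added to check that the order of the words in the text does not matter.

```r
# Hypothetical helpers, written for illustration only.
# regex_any(): OR  -> wraps the alternatives in a non-capturing group,
#                     so a preceding .* applies to every alternative.
regex_any <- function(patterns) {
  paste0("(?:", paste(patterns, collapse = "|"), ")")
}

# regex_all(): AND -> one lookahead per condition, all evaluated from the
#                     start of the string (^), so word order is irrelevant.
# Conditions that contain a top-level | should go through regex_any() first.
regex_all <- function(patterns) {
  paste0("^", paste0("(?=.*", patterns, ")", collapse = ""))
}

txt <- c("Um gato e um cachorro",
         "Cachorros jogam bola usando alpargatas",
         "gatinhos cospem bolas de pêlos",
         "A bola do cachorro")   # extra text: 'bola' appears before the animal

# (gato OR cachorro) AND bola, compiled into a single pattern
pattern <- regex_all(c(regex_any(c("gat[aio]", "cachor")), "bola"))
print(pattern)

# A single grepl() call over the whole vector; perl = TRUE enables lookaheads
result <- grepl(pattern, txt, perl = TRUE, ignore.case = TRUE)
print(data.frame(txt, result))
```

Inside a dplyr pipeline the same pattern can be used as mutate(result = grepl(pattern, txt, perl = TRUE, ignore.case = TRUE)). One caveat: gat[aio] also matches inside "alpargatas"; if that matters, a word boundary such as \\bgat[aio] is one option, in the same spirit as the '^gat[aio]| gat[aio]' pattern in the question.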