Let's say I have the following vector of texts (character):
d <- data.frame(id = 1:3,
                txt = c('Um gato e um cachorro',
                        'Cachorros jogam bola usando alpargatas',
                        'gatinhos cospem bolas de pêlos'),
                stringsAsFactors = FALSE)
I would like to add a Boolean column to d that is TRUE if the text contains (cat or dog) and ball.
One alternative would be to create a column for each of these expressions and then combine them with a logical operation. Using the packages dplyr and stringr (note that I don't know much about regex, so the patterns came out big, ugly and inefficient, but that's not important):
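library(dplyr)
library(stringr)
# ignore.case() is deprecated in stringr >= 1.0.0; regex(..., ignore_case = TRUE) is its replacement
d %>%
  mutate(gato = str_detect(txt, regex('^gat[aio]| gat[aio]', ignore_case = TRUE)),
         cachorro = str_detect(txt, regex('cachor', ignore_case = TRUE)),
         bola = str_detect(txt, regex('bola', ignore_case = TRUE)),
         result = (gato | cachorro) & bola)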
Result:
  id                                    txt  gato cachorro  bola result
1  1                  Um gato e um cachorro  TRUE     TRUE FALSE  FALSE
2  2 Cachorros jogam bola usando alpargatas FALSE     TRUE  TRUE   TRUE
3  3         gatinhos cospem bolas de pêlos  TRUE    FALSE  TRUE   TRUE
Now, generalizing the question: say I have a set of p regular expressions to apply to a text vector of size n, and I want to create a Boolean column that is the result of a logical operation on the detection of these expressions in the texts.
My question: is there a way to solve this without having to evaluate the text p times? That is, can I reduce the number of times I apply str_detect to my text?
The reason for the question is that i) both my n and my p are very large and ii) I would rather not write out a lot of Boolean variables explicitly.
An answer compatible with dplyr would be great, but it is not required. I would appreciate any contribution!
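For point ii), one possible sketch keeps the patterns in a named vector and builds an n x p logical matrix, so no Boolean column has to be written by hand. This is only an illustration under assumptions: the names in `patterns` and the variable `hits` are made up for the example, and the text is still scanned once per pattern, so it does not address point i).

```r
d <- data.frame(id = 1:3,
                txt = c('Um gato e um cachorro',
                        'Cachorros jogam bola usando alpargatas',
                        'gatinhos cospem bolas de pêlos'),
                stringsAsFactors = FALSE)

# Named patterns (the names are illustrative, taken from the question's columns).
patterns <- c(gato     = '^gat[aio]| gat[aio]',
              cachorro = 'cachor',
              bola     = 'bola')

# One grepl() call per pattern; the result is an n x p logical matrix
# whose columns are named after the patterns.
hits <- vapply(patterns,
               function(p) grepl(p, d$txt, ignore.case = TRUE),
               logical(nrow(d)))

# Combine the columns with the desired logical expression.
d$result <- (hits[, 'gato'] | hits[, 'cachorro']) & hits[, 'bola']
```

With the three example texts, `d$result` matches the `result` column shown above.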
Thanks, @Flaviobarros! One thing I didn't understand: why is the first ?= in (?=alguma coisa) necessary, if it will look back and see nothing? – Julio Trecenti
That is how the AND is improvised. In this case we are telling the regex to check the text twice in search of the patterns.
– Flavio Barros
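A minimal sketch of the lookahead-based AND discussed in these comments, applied to the example from the question. The exact pattern is an assumption built from the question's own sub-patterns, not necessarily the one in the answer. Each `(?=...)` is zero-width, so every condition is tested from the same starting position; that is why the first condition also needs its own `?=`. It is a single `grepl()` call, but the engine still scans the text once per lookahead.

```r
txt <- c('Um gato e um cachorro',
         'Cachorros jogam bola usando alpargatas',
         'gatinhos cospem bolas de pêlos')

# (?=.*bola)                       -> the text contains "bola"
# (?=.*((^| )gat[aio]|cachor))     -> AND it contains a cat or dog word
# Lookaheads require the PCRE engine, hence perl = TRUE.
pattern <- '^(?=.*bola)(?=.*((^| )gat[aio]|cachor))'

result <- grepl(pattern, txt, perl = TRUE, ignore.case = TRUE)
```

For the three example texts this gives the same answer as `(gato | cachorro) & bola` in the question.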
Wonderful. Thank you.
– Julio Trecenti
One more thing:
grepl('.*gat[aio]', 'Cachorros jogam bola usando alpargatas')
returns TRUE (which I don't want). That's why I used ^gat[aio] or " gat[aio]" (with a leading space). Is there a better solution? – Julio Trecenti
@Juliotrecenti you could use \s to represent a whitespace character before the word. In R you have to add an extra backslash (\\s), so it would look like this:
grepl('.*\\sgat[aio]', 'Cachorros jogam bola usando alpargatas')
– Flavio Barros
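A small comparison, offered as a sketch rather than as part of the original exchange: requiring `\\s` before the word rejects "alpargatas", but it also misses a text that starts with the word, because nothing precedes it. The word-boundary anchor `\\b` is one alternative that covers both cases. The variable names below are illustrative.

```r
txt <- c('Um gato e um cachorro',
         'Cachorros jogam bola usando alpargatas',
         'gatinhos cospem bolas de pêlos')

# Whitespace required before the word: misses "gatinhos" at the start of the text.
with_space <- grepl('.*\\sgat[aio]', txt, perl = TRUE)

# Word boundary before the word: matches at the start of the text or after a space,
# but not inside "alpargatas".
with_boundary <- grepl('\\bgat[aio]', txt, perl = TRUE)
```

Here `with_space` is TRUE only for the first text, while `with_boundary` is TRUE for the first and third.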