Regular expression by string index

1

Consider this variable:

x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada', 
'auio', 'aqoio')

I used the following regex to extract strings with the second letter o:

library(tidyverse)

str_subset(string = x, regex(pattern = '.o[^o]'))

[1] "horccaeon" "coleon"    "volues"    "mol"       "tom"  

It works, but I know that this is wrong.

I also tried a lookahead:

str_subset(string = x, regex(pattern = '.(?=o)'))

[1] "horccaeon" "coleon"    "volues"    "mol"       "nao"       "tom"       "auio"     
[8] "aqoio" 

But it returns every string that contains an o, not just those whose second letter is o.

Now consider the reverse: extract the strings whose penultimate letter is o. I couldn’t come up with a regex for that.

So:

  • how do I write a regex that matches at a specific position/index of the string?

2 answers

3


If you want the letter o to be the second letter of the string, you can use the anchor ^, which marks the start of the string:

str_subset(string = x, regex(pattern = '^.o'))

So we have the beginning of the string (^), followed by any character (the dot, which means "any character, except line breaks"), followed by the letter o. The result in this case is:

[1] "horccaeon" "coleon"    "volues"    "mol"       "tom"   

Notice that the strings 'nao', 'auio' and 'aqoio' are left out because the letter o is not their second character (the string 'nada' is also not returned, because it doesn't even contain an o).
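When the position is fixed and the target is a single literal character, the same check can also be sketched without a regex, by comparing the character at that index directly. This is a base R alternative to the ^.o pattern, not something the question asked for, and `second_is_o` is just an illustrative name:

```r
x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada',
       'auio', 'aqoio')

# substr(x, 2, 2) extracts the character at index 2 of every element
second_is_o <- substr(x, 2, 2) == 'o'

# keep only the strings whose second character is "o"
second_letter_o <- x[second_is_o]
print(second_letter_o)
```

The regex version is more flexible (character classes, variable lengths), while `substr` makes the index explicit.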


To check whether the antepenultimate letter is an o, you can use the anchor $, which marks the end of the string:

str_subset(string = x, regex(pattern = 'o.{2}$'))

Now we have the letter o, followed by two characters (.{2}), followed by the end of the string ($). The result is:

[1] "aqoio"
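The same idea covers the penultimate letter mentioned in the question: o.$ means an o, exactly one more character, then the end of the string. The sketch below uses base R's `grepl` rather than `str_subset` (my choice, so the logical vector can be negated directly), and the variable names are only illustrative:

```r
x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada',
       'auio', 'aqoio')

# TRUE where the penultimate character is "o"
penultimate_o <- grepl('o.$', x)

with_o    <- x[penultimate_o]   # strings that have "o" as penultimate letter
without_o <- x[!penultimate_o]  # the same vector with those strings removed
print(with_o)
print(without_o)
```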

In general, use ^ if you want to check that the letter o is the nth letter counting from the beginning, or $ if you want to check that it is a given number of positions from the end. For example:

  • ^o - begins with o
  • ^.{3}o - the fourth letter is o (because it has any 3 characters before)
  • o$ - ends with o
  • o.{3}$ - has the letter o, plus 3 characters, and the end of the string
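The patterns in this list all follow the same shape, so they can be generated. A minimal sketch, assuming the target `ch` is a single literal character with no special meaning in a regex; `nth_from_start` and `nth_from_end` are hypothetical helper names, not functions from any package:

```r
# hypothetical helpers: build an anchored pattern that requires the literal
# character `ch` at position `n`, counted from the start or from the end
# (n = 2 means "second letter" / "penultimate letter")
nth_from_start <- function(ch, n) sprintf('^.{%d}%s', n - 1, ch)
nth_from_end   <- function(ch, n) sprintf('%s.{%d}$', ch, n - 1)

x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada',
       'auio', 'aqoio')

second_o <- x[grepl(nth_from_start('o', 2), x, perl = TRUE)]  # pattern '^.{1}o'
third_last_o <- x[grepl(nth_from_end('o', 3), x, perl = TRUE)] # pattern 'o.{2}$'
print(second_o)
print(third_last_o)
```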

Of course, if you want, you can replace the dot with something more specific (for example [a-z], so that the regex only accepts the letter o when the other positions hold lowercase letters; with the dot, those positions can hold any character, including non-alphanumeric ones).
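To illustrate the difference between the dot and a character class, here is a small sketch with made-up test strings (`y` is not part of the question's data):

```r
# made-up strings: the second character is "o" in all three,
# but only 'tom' has a lowercase letter in front of it
y <- c('tom', '1o2', ' om')

any_first   <- y[grepl('^.o', y)]      # the dot accepts a digit or a space too
lower_first <- y[grepl('^[a-z]o', y)]  # only a lowercase letter is accepted
print(any_first)
print(lower_first)
```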


As for your regexes: neither of them uses the anchors ^ and $, which means the pattern can match in the middle of the string (so there is no guarantee that the letter o is the second, the antepenultimate, or at any other specific position).

.o[^o] is any character, followed by o, followed by a character other than o. That means something like 'zoo' does not match, because the o must be followed by a character that is not o. It also requires something after the o, which excludes two-letter words such as 'do'.

.(?=o) is any character that has an o right after it. In practice it matches any string of at least two characters in which some o has a character before it (but that o is not necessarily the second character).
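One way to see where an unanchored pattern actually matches is base R's `regexpr`, which returns the 1-based position at which the first match starts. A small sketch; 'bacon' is a made-up test string, and `perl = TRUE` is needed because base R's default regex engine does not support lookahead:

```r
# 'bacon' has "a" as its second letter, yet '.o[^o]' still matches "con"
pos_plain <- regexpr('.o[^o]', 'bacon')

# in 'nao' the lookahead pattern matches the "a" that precedes the final "o"
pos_look <- regexpr('.(?=o)', 'nao', perl = TRUE)

# neither 'zoo' nor 'do' has an "o" followed by a non-"o" character
rejected <- grepl('.o[^o]', c('zoo', 'do'))

print(as.integer(pos_plain))
print(as.integer(pos_look))
print(rejected)
```

In both matching cases the match starts after position 1, which is exactly what ^ would forbid.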

  • 1

    Regex only comes with a lot of practice. I knew those operators but couldn’t apply them. Thanks again.

  • How would I use regex(pattern = '.(?=o)') if "o" were actually a vector?

0

Hey, do you know grep? It has an argument called value: if it is FALSE, grep returns the indices of the matching elements of x; if it is TRUE, it returns the matching elements themselves. You don’t need to set it, since it is FALSE by default.

Test:

x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada', 'auio', 'aqoio')
p <- c('[^o]')
sapply(p , function(y) grep(y,x))
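To make the effect of value concrete, here is a small sketch that combines grep with the anchored pattern ^.o from the other answer (pairing the two is my assumption about what the question needs; `idx` and `vals` are illustrative names):

```r
x <- c('horccaeon', 'coleon', 'volues', 'mol', 'nao', 'tom', 'nada',
       'auio', 'aqoio')

# value = FALSE (the default): indices of the elements that match
idx <- grep('^.o', x)

# value = TRUE: the matching elements themselves
vals <- grep('^.o', x, value = TRUE)

print(idx)
print(vals)
```

Indexing with the result, as in x[idx], gives the same strings as value = TRUE.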
