Regex able to ignore the prefixes of a word


2

I have the following regexes:

regex_list = [
    r'(?:arquiv|auto)(\w+) (?:auto|arquiv) (?:definitiv)',
    r'(?:arquiv|auto)(\w+) (?:auto|arquiv)',
    r'(?:arquiv)(\w+) (?:definitiv)'
]

My goal is to capture phrases such as "arquivados os autos" (filed the records), "autos arquivados" (filed records), "arquivamento definitivo" (definitive filing), etc.

The problem is that when the text contains "desarquivamento" (unarchiving), the regex matches it as well. Phrases can also appear as: "arquive-se, portanto, os autos, de modo definitivo".

I would like to know how to ignore terms that carry a prefix (in this case "desarquivamento"; I only want words that start with arquiv) and how to ensure that phrases like the last one are also captured.

  • 1

    Regular expression does not seem to be the solution to your problem - because apparently the expressions are not regular. Why don’t you use a natural language processing tool? Read about NLTK.

2 answers

3


You can use \b, which is a word boundary. Basically, it matches positions in the string that have an alphanumeric character before and a non-alphanumeric character after (or vice versa), as well as the start or end of the string when it is next to an alphanumeric character.

For example, by putting a \b before arquiv, you already eliminate the "desarquivamento" case.
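A minimal sketch of that effect, comparing the same pattern with and without the \b. The word list is hypothetical and chosen only to illustrate the boundary:

```python
import re

# Hypothetical sample words, chosen only to illustrate the boundary.
words = ["arquivamento", "desarquivamento", "arquive-se"]

with_boundary = re.compile(r'\barquiv\w+')
without_boundary = re.compile(r'arquiv\w+')

# \b requires "arquiv" to start a word, so "desarquivamento" is rejected.
matched_with = [w for w in words if with_boundary.search(w)]
# Without \b, "arquiv" may appear anywhere inside a word.
matched_without = [w for w in words if without_boundary.search(w)]

print(matched_with)     # ['arquivamento', 'arquive-se']
print(matched_without)  # ['arquivamento', 'desarquivamento', 'arquive-se']
```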

For the other cases, it wasn’t very clear which variations the phrase can have, but from what I understood, the criteria are:

  • may have "arquiv" or "auto"
  • may or may not have "arquiv" or "auto" right after
  • "definitiv" is optional
  • between these words can have zero or more words

You can do it in a single regex:

import re

frases = [
 "arquivados os autos",
 "autos arquivados",
 "arquivamento definitivo",
 "desarquivamento definitivo",
 "arquive-se, portanto, os autos, de modo definitivo"
]

r = re.compile(r'\b(?:arquiv|auto)(\w+).*?(?:auto|arquiv)?.*?(?:definitiv)?')
for frase in frases:
    if r.search(frase):
        print('encontrou a frase {}'.format(frase))

The regex finds all the phrases except the one that has "desarquivamento". The .*? means "zero or more characters" (as few as possible), so anything can appear between the desired words.

The ? right after the parentheses makes the whole group optional, so both the second group (with auto|arquiv) and the third (with definitiv) are optional.


Keep in mind that this regex is very "naive" and prone to false positives, depending on how complex the sentences you want to evaluate are. Depending on what you want to do, it may be best to follow the recommendation that Anderson gave in the comments.
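To make the "naive" part concrete, here is a small sketch using the same pattern. Because the later groups are optional and .*? is lazy, they are free to match nothing, so in practice any word starting with arquiv or auto is enough. The second sentence is a hypothetical example, not taken from the question:

```python
import re

# Same pattern as in the answer above.
r = re.compile(r'\b(?:arquiv|auto)(\w+).*?(?:auto|arquiv)?.*?(?:definitiv)?')

# The optional groups and the lazy .*? may match nothing,
# so the match ends right after the first word.
m = r.search("arquivamento definitivo")
matched_text = m.group(0)
print(matched_text)  # arquivamento

# Hypothetical unrelated sentence: "automóvel" also starts with "auto".
false_positive = r.search("comprei um automóvel") is not None
print(false_positive)  # True
```

If the later words must actually be present, they would need to be required rather than optional, which brings back the need for separate patterns like the ones in the question.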

  • Thanks, @hkotsubo. Actually, I’m experimenting with alternatives. Depending on how things go, I will change the approach to better address the problem. But your instructions served for what I needed at the time.

0

With this regex you get something similar:

(autos?)|((?<!\w)arquiva(dos)?|(\-se)?)

The key here is the negative lookbehind.

(?<!\w) means: look at the character immediately before the current position; if it is a word character (a letter, digit, or underscore), do not match.
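A short sketch of the lookbehind in isolation, on a hypothetical sample text. It also checks one detail of the full pattern above that may be worth knowing: its last alternative, (\-se)?, can match the empty string, so search() on the full pattern appears to succeed on any input.

```python
import re

# Only the lookbehind part of the pattern, isolated for illustration.
lookbehind = re.compile(r'(?<!\w)arquiv\w*')

# Hypothetical sample text.
text = "desarquivamento e arquivamento"
found = lookbehind.findall(text)
print(found)  # ['arquivamento']

# The full pattern from the answer, unchanged.
full = re.compile(r'(autos?)|((?<!\w)arquiva(dos)?|(\-se)?)')

# (\-se)? may match nothing, so search() succeeds with an empty match
# even on text that contains none of the target words.
empty_match = full.search("xyz")
print(repr(empty_match.group(0)))  # ''
```

When using the full pattern, checking that the matched text is non-empty is one way to avoid treating these empty matches as hits.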

This regex may not be 100%, but I think it will help you.
