You can use \b, which is a word boundary. It matches a position in the string that has a word character (letter, digit or _) on one side and a non-word character on the other; the start and end of the string also count.
For example, by putting a \b before arquiv, you already eliminate the case of "desarquivamento".
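A minimal sketch of that effect, just to illustrate the point (the names sem_borda and com_borda are made up for this example):

```python
import re

# Without \b, "arquiv" is also found in the middle of "desarquivamento".
sem_borda = re.compile(r'arquiv')
# With \b, "arquiv" has to start at a word boundary.
com_borda = re.compile(r'\barquiv')

for palavra in ['arquivamento', 'desarquivamento']:
    print(palavra, bool(sem_borda.search(palavra)), bool(com_borda.search(palavra)))
# arquivamento True True
# desarquivamento True False
```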
For the other cases, it wasn't very clear which variations the phrase can have, but from what I understood, the criteria are:
- it has a word beginning with "arquiv" or "auto"
- it may or may not have "arquiv" or "auto" again afterwards
- "definitiv" is optional
- between these words there can be zero or more other words
You can do it with a single regex:
import re

frases = [
    "arquivados os autos",
    "autos arquivados",
    "arquivamento definitivo",
    "desarquivamento definitivo",
    "arquive-se, portanto, os autos, de modo definitivo"
]

# \b makes "arquiv"/"auto" start at a word boundary; the later groups are optional
r = re.compile(r'\b(?:arquiv|auto)(\w+).*?(?:auto|arquiv)?.*?(?:definitiv)?')

for frase in frases:
    if r.search(frase):
        print('encontrou a frase {}'.format(frase))
This regex finds all the sentences except the one that has "desarquivamento". The .*? means "zero or more characters" (as few as possible), which ensures that anything can appear between the desired words.
The ? right after a parenthesized group makes that whole group optional, so both the second group (with auto|arquiv) and the third (with definitiv) are optional.
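To see what the optional, lazy parts do in practice, here is a small sketch that prints the text each match actually covers. The helper trecho_casado is a name invented for this illustration, not part of the original answer:

```python
import re

r = re.compile(r'\b(?:arquiv|auto)(\w+).*?(?:auto|arquiv)?.*?(?:definitiv)?')

def trecho_casado(frase):
    """Return the text covered by the first match, or None if there is no match."""
    m = r.search(frase)
    return m.group(0) if m else None

frases = [
    "arquivados os autos",
    "autos arquivados",
    "arquivamento definitivo",
    "desarquivamento definitivo",
    "arquive-se, portanto, os autos, de modo definitivo",
]

for frase in frases:
    print(repr(frase), '->', repr(trecho_casado(frase)))
# 'arquivados os autos' -> 'arquivados'
# 'autos arquivados' -> 'autos'
# 'arquivamento definitivo' -> 'arquivamento'
# 'desarquivamento definitivo' -> None
# 'arquive-se, portanto, os autos, de modo definitivo' -> 'arquive'
```

Since everything after the first word is optional and lazy, the match stops at that first word. With search, the regex effectively only checks that some word starts with "arquiv" or "auto" and has at least one more character.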
Keep in mind that this regex is very naive and prone to false positives, depending on how complex the sentences you want to evaluate are. Depending on what you want to do, it may be best to follow the recommendation that Anderson gave in the comments:
Regular expression does not seem to be the solution to your problem - because apparently the expressions are not regular. Why don’t you use a natural language processing tool? Read about NLTK.
– Woss
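As a rough illustration of the false positives mentioned above, the sketch below runs the same regex against a few sentences that I made up for this example and that have nothing to do with archiving case files:

```python
import re

r = re.compile(r'\b(?:arquiv|auto)(\w+).*?(?:auto|arquiv)?.*?(?:definitiv)?')

# Hypothetical sentences, invented only for this illustration.
nao_relacionadas = [
    "o autor do livro",
    "automóvel usado",
    "nada a declarar",
]

# Any word that merely starts with "auto" or "arquiv" is enough for a match.
falsos_positivos = [f for f in nao_relacionadas if r.search(f)]
print(falsos_positivos)
# ['o autor do livro', 'automóvel usado']
```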