This happens because the quantifier * is greedy (greedy quantifier): it tries to grab as many characters as possible that satisfy the expression.
To cancel the greediness, just put a ? right after the *:
texto = re.search("teste.*?blocos", TXT)
With this, only the stretch up to the first occurrence of blocos is captured.
Because *? takes the minimum necessary to satisfy the expression, it is called a lazy quantifier.
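To see the difference side by side, here is a minimal sketch. The sample text is my own assumption (the original TXT is not reproduced here); only the two patterns matter.

```python
import re

# Hypothetical sample text, assumed for illustration only
TXT = "teste com blocos que tem mais blocos."

# Greedy: .* extends as far as it can, so the match ends at the LAST "blocos"
greedy = re.search("teste.*blocos", TXT)

# Lazy: .*? stops as early as it can, so the match ends at the FIRST "blocos"
lazy = re.search("teste.*?blocos", TXT)

print(greedy.group())  # teste com blocos que tem mais blocos
print(lazy.group())    # teste com blocos
```

The only change between the two calls is the ? after the *.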
Just one detail: if your string is like the example below:
TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search("teste.*?blocos", TXT)
The captured stretch will be teste com cablocos, because cablocos itself ends in blocos. If you only want the word blocos (and not cablocos), use \b to delimit the word:
TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search(r"teste.*?\bblocos\b", TXT)
With this, the captured stretch will be teste com cablocos com blocos.
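A short sketch that runs both patterns against the same string, to confirm where each match ends. The variable names without_boundary and with_boundary are mine, chosen just for this comparison.

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

# Without \b, the lazy match stops inside "cablocos" ("ca" + "blocos")
without_boundary = re.search(r"teste.*?blocos", TXT)

# With \b on both sides, "blocos" must be a whole word,
# so the "blocos" inside "cablocos" is skipped
with_boundary = re.search(r"teste.*?\bblocos\b", TXT)

print(without_boundary.group())  # teste com cablocos
print(with_boundary.group())     # teste com cablocos com blocos
```

\b matches only at a position between a word character and a non-word character, which is why the b in the middle of cablocos does not qualify.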
Note that I've now used r"teste..." to create a raw string, so the character \ does not need to be escaped. Without the r, I would have to write it as \\:
# without the r"..." the character "\" must be written as "\\"
texto = re.search("teste.*?\\bblocos\\b", TXT)
Since \ is a character widely used in regular expressions, it is worth using raw strings to make the expression less confusing.
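As a quick check of the point above, this sketch compares the three ways of writing the pattern. The names raw, escaped and unescaped are only illustrative.

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

raw = r"teste.*?\bblocos\b"         # raw string: backslashes kept as typed
escaped = "teste.*?\\bblocos\\b"    # normal string: each backslash doubled
unescaped = "teste.*?\bblocos\b"    # normal string: "\b" becomes a backspace character

print(raw == escaped)               # True: both produce the same pattern text
print("\b" == "\x08")               # True: in a normal string, "\b" is backspace
print(re.search(raw, TXT).group())  # teste com cablocos com blocos
print(re.search(unescaped, TXT))    # None: the text contains no backspace
```

The third form is the one to watch for: Python accepts it silently, but the regex engine then looks for a literal backspace instead of a word boundary.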
I know the correct word is "caboclos," but I couldn’t find a better example.