Return a block-delimited substring in Python up to the first occurrence

In Python, I am trying to capture a substring delimited by two words. However, the word "blocos" repeats in the text, and I would like to get the substring up to its first occurrence. In this example, the match extends to the last occurrence:

import re
TXT = "Este é um texto de teste para verificar a captura de blocos que estão dentro de uma String. E agora inserimos outros blocos para confundir."
texto = re.search("teste.*blocos", TXT)
print(texto[0])

1 answer


This happens because the quantifier * is greedy (greedy quantifier): it tries to match as many characters as possible while still satisfying the expression.

To turn off the greediness, just put a ? right after the *:

texto = re.search("teste.*?blocos", TXT)

With this, only the stretch up to the first occurrence of blocos is captured.

Since *? takes the minimum number of characters needed to satisfy the expression, it is called a lazy quantifier.
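
To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The variable names greedy and lazy are only illustrative, and the string is split across lines purely for readability:

```python
import re

# Same text as in the question, split across lines only for readability.
TXT = ("Este é um texto de teste para verificar a captura de blocos "
       "que estão dentro de uma String. E agora inserimos outros blocos "
       "para confundir.")

# Greedy: .* grabs as much as possible, so the match ends at the LAST "blocos".
greedy = re.search(r"teste.*blocos", TXT)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST "blocos".
lazy = re.search(r"teste.*?blocos", TXT)

print(greedy[0])
print(lazy[0])  # teste para verificar a captura de blocos
```

Both searches start at the same position (the first teste); only the end of the match changes.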


Just one detail: if your string is like the example below:

TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search("teste.*?blocos", TXT)

The captured stretch will be teste com cablocos, because blocos also appears inside cablocos. If you only want the word blocos (and not cablocos), use \b to delimit the word:

TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search(r"teste.*?\bblocos\b", TXT)

With this, the captured stretch will be teste com cablocos com blocos.
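
As a quick check of the two behaviors, this sketch runs the pattern with and without \b on the same string. The names loose and strict are made up for illustration:

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

# Without \b, "blocos" is also found inside "cablocos".
loose = re.search(r"teste.*?blocos", TXT)

# With \b on both sides, "blocos" must be a whole word.
strict = re.search(r"teste.*?\bblocos\b", TXT)

print(loose[0])   # teste com cablocos
print(strict[0])  # teste com cablocos com blocos
```

The \b anchor matches a position, not a character: the boundary between a word character and a non-word character, so it adds nothing to the captured text.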

Note that I have now used r"teste..." to create a raw string, so the character \ does not need to be escaped. Without the r, I would have to write it as \\:

# without the r"..." the character "\" must be written as "\\"
texto = re.search("teste.*?\\bblocos\\b", TXT)

Since \ is a character widely used in regular expressions, it is a good idea to use raw strings to make the expression less confusing.
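
To illustrate why this matters, the sketch below shows what happens when \b is written in a regular string without doubling the backslash: Python turns it into a backspace character before the regex engine ever sees it, so the search quietly finds nothing. The variable names are illustrative only:

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

# In a regular string literal, "\b" is a single backspace character (0x08).
print("\b" == "\x08")  # True
# In a raw string, r"\b" is two characters: a backslash and the letter b.
print(len(r"\b"))      # 2

# No r prefix and no doubled backslash: the regex looks for literal
# backspace characters around "blocos", which TXT does not contain.
unescaped = re.search("teste.*?\bblocos\b", TXT)
print(unescaped)       # None

# The raw string and the manually escaped string are the same pattern.
raw = re.search(r"teste.*?\bblocos\b", TXT)
escaped = re.search("teste.*?\\bblocos\\b", TXT)
print(raw[0] == escaped[0])  # True
```

Since the unescaped version raises no error and simply returns None, this kind of mistake can be hard to spot, which is one more reason to default to raw strings for patterns.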


I know the correct word is "caboclos," but I couldn’t find a better example.
