For all tests below, I will consider this string:
texto = r"""
\questao{1}
\begin{enumerate}
esse não
\end{enumerate}
\begin{enumerate}
esse sim
\end{enumerate}
\questao{2}
\begin{enumerate}
esse também não
\end{enumerate}
\begin{enumerate}
esse também sim
\end{enumerate}
\questao{3}
"""
So you want to get only the \begin{enumerate} blocks that come right before a \questao (in the example above, the ones that contain the excerpts "esse sim" and "esse também sim").
For this you can use the re module, which is used to work with regular expressions.
A first alternative would be to use this - long and complicated - regex:
import re
r = re.compile(r'\\begin\{enumerate\}(?:(?!\\begin\{enumerate\}).)+\\end\{enumerate\}(?=(?:(?!\\\w+\{\w+\}).)+\\questao\{\d+\})', re.DOTALL)
resultados = r.findall(texto)
for res in resultados:
    print(res)
As you can see, the regex is quite complex. It starts simple, with \\begin\{enumerate\}, which matches the literal text \begin{enumerate}. Note that the characters \, { and } must be escaped with \, since they have special meaning in regex; the escape makes them be interpreted as the characters themselves.
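As a side note on the escaping described above, here is a minimal sketch showing that re.escape can add the backslashes for you. The variable names are only illustrative and are not part of the original code.

```python
import re

# The literal text we want to find, written without any regex escaping.
begin_literal = r'\begin{enumerate}'

# re.escape puts a backslash before the characters that are special in regex.
escaped = re.escape(begin_literal)
print(escaped)

# The escaped pattern matches exactly the literal text.
print(bool(re.fullmatch(escaped, begin_literal)))
```

Writing the escapes by hand, as in the regex above, gives the same pattern; re.escape is just a way to avoid forgetting one.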
Then we have (?:(?!\\begin\{enumerate\}).)+. Explaining from the inside out:
- the dot (.) matches (almost) any character, because by default it does not match line breaks. But since I used the DOTALL flag, it matches line breaks too.
- the part (?!\\begin\{enumerate\}) is a negative lookahead, which checks that something does not exist ahead. In this case, it checks that another \begin{enumerate} does not start at the current position (thus preventing the match from "invading" another \begin{enumerate}).
The trick of a lookahead is that it only checks what is ahead, then goes back to where it was and continues evaluating the regex. That is, the regex first checks that another \begin{enumerate} does not start here, then goes back to where it was and evaluates the dot (which can be any character). All of this is in parentheses with the quantifier + (one or more occurrences). That is, the check is repeated for each character, until the \end{enumerate} is found.
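To see what the negative lookahead in front of the dot buys you, here is a small sketch comparing it with a plain greedy dot. The sample string and variable names are made up for this illustration.

```python
import re

# Two blocks on one line, to keep the example short.
sample = r'\begin{enumerate}A\end{enumerate}\begin{enumerate}B\end{enumerate}'

# Plain greedy dot: runs over the second \begin{enumerate} up to the last \end{enumerate}.
greedy = re.compile(r'\\begin\{enumerate\}.+\\end\{enumerate\}', re.DOTALL)

# Dot guarded by the negative lookahead: it cannot step onto another \begin{enumerate}.
guarded = re.compile(
    r'\\begin\{enumerate\}(?:(?!\\begin\{enumerate\}).)+\\end\{enumerate\}',
    re.DOTALL,
)

greedy_matches = greedy.findall(sample)
guarded_matches = guarded.findall(sample)
print(len(greedy_matches))   # 1 (both blocks glued together)
print(len(guarded_matches))  # 2 (one match per block)
```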
Then we have a lookahead: the part inside (?=...), which checks that something exists ahead. Inside this lookahead, I do something similar to what was done before:
- there is the negative lookahead (?!\\\w+\{\w+\}), which checks whether there is any other structure of the form \something{something} (the shortcut \w matches letters, digits or the character _, so \w+ is one or more of these characters).
- this negative lookahead is used together with the dot and the +, ensuring that I can have one or more characters as long as none of them starts a \something{something}
- finally, I get to the part that matches \questao{x}, where x may be one or more digits (\d+)
In short, the regex looks for \begin{enumerate}, followed by one or more characters (checking before each one that another \begin{enumerate} does not start there), followed by \end{enumerate}, as long as all of this is followed by one or more characters (checking before each one that a \something{something} does not start there), followed by \questao{x}.
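Since the regex is long, one way to make it easier to read is the re.VERBOSE flag, which lets you split the pattern across lines and comment each part. The sketch below is the same pattern as above, only laid out differently; the shortened texto and the name r_verbose are assumptions made for this example.

```python
import re

# Shortened version of the test string, just for this example.
texto = r"""
\questao{1}
\begin{enumerate}
esse não
\end{enumerate}
\begin{enumerate}
esse sim
\end{enumerate}
\questao{2}
"""

r_verbose = re.compile(r'''
    \\begin\{enumerate\}                # opening of the block
    (?: (?!\\begin\{enumerate\}) . )+   # one or more chars, never starting a new block
    \\end\{enumerate\}                  # closing of the block
    (?=                                 # lookahead: what comes next must be
        (?: (?!\\\w+\{\w+\}) . )+       #   chars that do not start another command
        \\questao\{\d+\}                #   and then a questao with digits
    )
''', re.DOTALL | re.VERBOSE)

resultados = r_verbose.findall(texto)
for res in resultados:
    print(res)
```

With re.VERBOSE, whitespace in the pattern is ignored and everything after a # is a comment, so this only works because the pattern has no literal spaces or # characters.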
The findall method returns a list of all occurrences of the regex found in the text. One detail is that the parentheses are written as (?:, which makes them a non-capturing group. I did this because without the ?: the parentheses form a capture group, and findall returns the groups when they are present. To avoid this and return the whole matched text, I use non-capturing groups.
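A tiny sketch of this findall behavior, using a made-up sample string unrelated to the LaTeX text:

```python
import re

sample = 'ab12 cd34'

# Capture group: findall returns only what the group captured.
with_group = re.findall(r'([a-z]+)\d+', sample)

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r'(?:[a-z]+)\d+', sample)

print(with_group)     # ['ab', 'cd']
print(without_group)  # ['ab12', 'cd34']
```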
The output is:
\begin{enumerate}
esse sim
\end{enumerate}
\begin{enumerate}
esse também sim
\end{enumerate}
Another alternative is to use the finditer method, which returns an iterator of match objects, which you can use to get more information about each match:
import re
r = re.compile(r'\\begin\{enumerate\}(?:(?!\\begin\{enumerate\}).)+\\end\{enumerate\}(?=(?:(?!\\\w+\{\w+\}).)+\\questao\{\d+\})', re.DOTALL)
for m in r.finditer(texto):
    print('Trecho "{}" encontrado entre as posições {} e {}'.format(m.group(), m.start(), m.end()))
The output is:
Trecho "\begin{enumerate}
esse sim
\end{enumerate}" encontrado entre as posições 59 e 101
Trecho "\begin{enumerate}
esse também sim
\end{enumerate}" encontrado entre as posições 169 e 219
Another alternative is to separately obtain each "block" as a list, and analyze the elements one by one:
import re
r = re.compile(r'\\begin\{(\w+)\}.*?\\end\{\1\}|\\(?!begin|end)\w+\{\w+\}', re.DOTALL)
questao_regex = re.compile(r'^\\questao\{\d+\}$')
partes = [m.group() for m in r.finditer(texto)]
qtd = len(partes)  # number of parts found
for i, parte in enumerate(partes):
    # it is a \begin{enumerate}, it is not the last element, and the next one is a \questao
    if parte.startswith(r'\begin{enumerate}') and i < qtd - 1 and questao_regex.match(partes[i + 1]):
        print(parte)
Now the main regex begins with \\begin\{(\w+)\} (that is, \begin{something}), where "something" is \w+ (one or more letters, digits or _). Note that the \w+ is in parentheses, thus forming a capture group. This is useful for the \\end\{\1\} part, since \1 refers to what was captured by this group (i.e., it is the same "something" that appeared in the begin). I used \1 because it refers to the first capture group (groups are numbered according to the order in which they appear in the regex).
With that I guarantee that I am getting the end corresponding to the begin (assuming that there are no nested structures of the same kind, since in that case the regex will not work properly).
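A short sketch of the backreference at work, on a made-up string where the first \end does not belong to the \begin:

```python
import re

block = re.compile(r'\\begin\{(\w+)\}.*?\\end\{\1\}', re.DOTALL)

# The first \end is for a different environment, so \1 rejects it.
sample = r'\begin{itemize}x\end{enumerate}y\end{itemize}'

m = block.search(sample)
print(m.group(1))  # itemize
print(m.group())   # the whole sample: the match only stops at \end{itemize}
```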
Then we have the character |, which means "or". After it comes a backslash followed by a negative lookahead, which ensures that what comes next is neither begin nor end, and then we have \w+\{\w+\} (so I pick up all the blocks that are not begin or end, such as \questao, for example).
I also created another regex to specifically check for \questao{x}. It is similar to the one in the previous alternative; the difference is that I added the markers ^ and $, which are respectively the beginning and end of the string. That way I guarantee that the string cannot have any other character before or after.
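A quick sketch of the effect of the ^ and $ markers, with hypothetical variable names. One caveat worth knowing: in Python, $ also matches right before a line break at the very end of the string, so if that matters, re.fullmatch is a stricter option.

```python
import re

anchored = re.compile(r'^\\questao\{\d+\}$')
unanchored = re.compile(r'\\questao\{\d+\}')

ok = bool(anchored.match(r'\questao{2}'))                # True
extra = bool(anchored.match(r'\questao{2} extra'))       # False: text after the }
loose = bool(unanchored.match(r'\questao{2} extra'))     # True: no $ to stop it
trailing_newline = bool(anchored.match('\\questao{2}\n'))                 # True: $ accepts a final \n
strict = bool(re.fullmatch(r'\\questao\{\d+\}', '\\questao{2}\n'))        # False
print(ok, extra, loose, trailing_newline, strict)
```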
Finally, I use finditer to obtain the matches and create a list of the matched strings (using list comprehension syntax, which is much more succinct and pythonic). Then I go through this list and, for each item, check whether it is a \begin{enumerate} and whether the next element is a \questao{x} (using the specific regex already mentioned). The output is:
\begin{enumerate}
esse sim
\end{enumerate}
\begin{enumerate}
esse também sim
\end{enumerate}
Just remember that, although it works for this case, regex is not always the best tool for parsing structured text such as HTML and LaTeX. As I said above, the code above does not deal well with nested structures. In that case you could even use recursive regex by installing the regex module (because the re module does not support this feature), but I think it is not worth the complication.
In your specific case, you might want to try libraries made specifically for working with LaTeX.
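To illustrate the problem with nested structures, here is a small sketch. The first part shows the lazy regex stopping at the inner \end{enumerate}; the second part is a hypothetical helper (top_level_blocks is my own name, not from any library) that simply counts depth instead of relying on a single regex. It is only a rough idea under the assumption that every \begin has a matching \end, not a replacement for a real LaTeX parser.

```python
import re

nested = r'\begin{enumerate}outer \begin{enumerate}inner\end{enumerate} tail\end{enumerate}'

# The lazy regex stops at the first \end{enumerate}, leaving the outer block unbalanced.
lazy = re.compile(r'\\begin\{(\w+)\}.*?\\end\{\1\}', re.DOTALL)
broken = lazy.search(nested).group()
print(broken)

def top_level_blocks(text, env='enumerate'):
    """Hypothetical helper: return the outermost \\begin{env}...\\end{env} blocks."""
    token = re.compile(r'\\(begin|end)\{%s\}' % re.escape(env))
    depth, start, blocks = 0, None, []
    for m in token.finditer(text):
        if m.group(1) == 'begin':
            if depth == 0:
                start = m.start()  # remember where the outermost block opens
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:         # the outermost block has just closed
                blocks.append(text[start:m.end()])
    return blocks

blocks = top_level_blocks(nested)
print(blocks)
```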
Would this be the regular expression?
/(\\begin\{enumerate\})(?=\s\\end\{enumerate\}\s\s\\questao\{\d+\}$)/
Working on regex101 – Marconi
Just one update to @Marconi's demo on Regex101: if I understood the question correctly, it should be without the $ at the end:
(\\begin\{enumerate\})(?=\s\\end\{enumerate\}\s\s\\questao\{\d+\})
– danieltakeshi
@Marconi and @danieltakeshi, the detail is that the expressions you suggested do not consider the case where there is some text between the begin and the end (see here and here). – hkotsubo