Do not use regex to parse HTML
Since the question is tagged beautifulsoup, why not use that library, which was made precisely for parsing and manipulating HTML, instead of regex?
If the HTML is poorly formed, you can install one of the alternative parsers mentioned in the Beautiful Soup documentation itself. In your case, I believe the best option is html5lib, the most lenient of the parsers (the documentation says that, although it is slower, it parses pages the same way a web browser does, and browsers are known to accept extremely malformed HTML). It is unclear how malformed your HTML is, but I think it is worth a try.
Anyway, regex is not meant for HTML (it may even "work" in many cases but, as you have begun to realize, it is not the most suitable tool for the task).
With Beautiful Soup, it would look like this:
html = '''
<tr bgcolor="#CCCCCC">
<td colspan="2"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">1º
Período Ideal</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>
Aula</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>
Trab.</span></font></td>
<td align="center"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CH</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CE</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CP</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">ATPA</span></font></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for span in soup.find_all('span', class_='txt_arial_8pt_black'):
    print(span.text)
The code above finds every span that has the class "txt_arial_8pt_black" and prints the text of each one. Note that the line breaks present in the HTML are also printed:
1º
Período Ideal
Créd.
Aula
Créd.
Trab.
CH
CE
CP
ATPA
If you want to use the other parser already mentioned, which deals better with poorly formed HTML, just change to:
soup = BeautifulSoup(html, 'html5lib')
Do not forget, of course, to install it first, as described in the documentation.
But if you really want to use regex...
Below I show you some solutions with regex, and you will understand why using Beautiful Soup is a much better solution.
Your regex does not work in all cases because the dot matches any character except line breaks. Since some of the span tags have line breaks in their text, .* will not match those snippets, as is the case with the tag you mentioned:
<span class="txt_arial_8pt_black">1º
Período Ideal</span>
An alternative is to use the DOTALL flag, which causes the dot to also match line breaks:
import re
for texto in re.findall(r'<span class="txt_arial_8pt_black">(.*?)</span>', html, re.DOTALL):
    print(texto)
I used findall to search for all occurrences in the HTML. Note that the part corresponding to the tag's contents is in parentheses, which forms a capture group. The documentation says that findall returns only the groups when they are present, so the variable texto already holds the tag's text at each iteration.
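To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without re.DOTALL. The variable names (sample, pattern, without_flag, with_flag) are made up for illustration; the sample string is one of the spans from the question.

```python
import re

# One span from the question's HTML: its text contains a line break.
sample = '<span class="txt_arial_8pt_black">1º\nPeríodo Ideal</span>'
pattern = r'<span class="txt_arial_8pt_black">(.*?)</span>'

# Without DOTALL the dot stops at the line break, so nothing matches.
without_flag = re.findall(pattern, sample)

# With DOTALL the dot also matches '\n', so the whole text is captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['1º\nPeríodo Ideal']
```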
This "works", but with many limitations (and I am not even talking about the fact that the regex also returns the <br/> tags along with the text, and it is unclear whether they should be returned or not). If the HTML changes a little, the regex will need to be modified, which would not be necessary with Beautiful Soup.
For example, the tag might have other attributes or an additional class, such as <span id="abc" class="txt_arial_8pt_black"> or <span class="outra_classe txt_arial_8pt_black">. That is enough to break the regex (while the first code with Beautiful Soup still finds these tags without problems). An alternative would be:
r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>'
That is, the regex allows [^>]+ (one or more characters other than >) before class, and inside the class attribute there can be zero or more non-quote characters ([^"]*) before and after "txt_arial_8pt_black" (the \b shortcut is a word boundary, which ensures that classes like "txt_arial_8pt_black2" do not match).
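As a quick check of the expression above, the sketch below runs it against the variations mentioned in the text. The sample strings and the names samples and results are invented for the example.

```python
import re

# Pattern from the answer: attributes may come before class,
# and other classes may surround the one we are looking for.
pattern = r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>'

samples = [
    '<span class="txt_arial_8pt_black">a</span>',
    '<span id="abc" class="txt_arial_8pt_black">b</span>',
    '<span class="outra_classe txt_arial_8pt_black">c</span>',
    '<span class="txt_arial_8pt_black2">d</span>',  # different class: must not match
]

results = [re.findall(pattern, s, re.DOTALL) for s in samples]
print(results)  # [['a'], ['b'], ['c'], []]
```

Keep in mind that \b also sees a boundary before a hyphen, so a class such as txt_arial_8pt_black-x would still match; that would be one more case to handle in the expression.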
But what if there is one span inside another (which is perfectly possible)?
<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>
Beautiful Soup correctly picks up all the text ("Antes Durante Depois"), but the regex only captures Antes <span>Durante (note: if you want Beautiful Soup to return the full contents of the span, including the tags, replace span.text with span.decode_contents()). This happens because the regex uses a lazy quantifier (the ? right after .*), i.e., it matches as few characters as possible while still satisfying the expression. This makes the regex stop at the first </span> it finds.
If I remove the lazy quantifier, the expression becomes greedy and matches as many characters as possible, that is, it takes everything up to the last </span>.
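The lazy and greedy behaviors can be compared side by side with a small sketch. The input below is the nested example followed by a second span that I added only to make the greedy result more visible; html_nested, lazy and greedy are illustrative names.

```python
import re

# Nested span from the answer, plus an extra (made-up) sibling span.
html_nested = ('<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>'
               ' <span class="txt_arial_8pt_black">Outro</span>')

# Lazy: stops at the first </span> it finds, cutting the outer span short.
lazy = re.findall(r'<span class="txt_arial_8pt_black">(.*?)</span>', html_nested, re.DOTALL)

# Greedy: runs up to the last </span>, swallowing the second span as well.
greedy = re.findall(r'<span class="txt_arial_8pt_black">(.*)</span>', html_nested, re.DOTALL)

print(lazy)    # ['Antes <span>Durante', 'Outro']
print(greedy)  # ['Antes <span>Durante</span> Depois</span> <span class="txt_arial_8pt_black">Outro']
```

Neither result is the content of the outer span, which is the point of the example.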
In that case, you could use something like:
r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>(?![^<>]*</span>)'
The negative lookahead (?![^<>]*</span>) checks that there is no other closing tag after this one, but now the regex fails if there is some tag other than span in between, e.g.:
<span class="txt_arial_8pt_black">Antes <span>Durante</span><br> Depois</span>
The <br> between the two </span> is enough to break the regex, and then you have to change it again to handle this case. An alternative would be:
r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>(?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>)'
Now the negative lookahead (?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>) checks that the closing tag is not followed by another </span>, allowing in between any characters that are neither < nor >, as well as tags that are not span (using another lookahead just for that, i.e. a lookahead inside another, which to me is already a sign that things are more complex than they should be).
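To see how the two lookahead versions behave, the sketch below applies both expressions from the answer to the two nested examples. The helper contents and the variable names are hypothetical, introduced only for this comparison.

```python
import re

# The two expressions from the answer, unchanged.
simple_lookahead = (r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>'
                    r'(.*?)</span>(?![^<>]*</span>)')
nested_lookahead = (r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>'
                    r'(.*?)</span>(?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>)')

only_spans = '<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>'
with_br = '<span class="txt_arial_8pt_black">Antes <span>Durante</span><br> Depois</span>'

def contents(pattern, html):
    """Return the captured span contents (hypothetical helper for this sketch)."""
    return re.findall(pattern, html, re.DOTALL)

print(contents(simple_lookahead, only_spans))  # ['Antes <span>Durante</span> Depois']
print(contents(simple_lookahead, with_br))     # ['Antes <span>Durante']
print(contents(nested_lookahead, with_br))     # ['Antes <span>Durante</span><br> Depois']
```

This only covers the two inputs shown here; it says nothing about other arrangements of tags, which may still need further changes to the expression.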
Although it works for the case above, is it worth using this regex? Think about maintaining it in the future, and compare it with the Beautiful Soup code above. Also, the regex always returns the contents of the span with the tags included, while Beautiful Soup gives you the option of returning only the text. To remove the tags using regex, you would need another expression just for that.
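As a rough idea of what that extra step could look like, here is a naive sketch. The pattern <[^>]+> is my assumption for illustration, not something from the question, and it is easy to break (for example, with an attribute value that contains >).

```python
import re

# What the regexes above capture for one of the spans in the question.
captured = 'Créd.<br/>\nAula'

# Naive tag removal: delete anything that looks like <...>.
text_only = re.sub(r'<[^>]+>', '', captured)
print(repr(text_only))  # 'Créd.\nAula'
```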
Note that each new case brings a new complication to the regex, and the expression gets bigger, more complex, and harder to understand and maintain. Meanwhile, the Beautiful Soup code stays the same as the one shown at the beginning, so is it worth insisting on regex?
There are still other cases, for example if the tag is inside comments:
<!--
comentários HTML, etc
<a href="blabla">estou comentado</a>
<span class="txt_arial_8pt_black">também estou comentado</span>
<p>comentado, me ignorem</p>
-->
<span class="txt_arial_8pt_black">não estou comentado</span>
Beautiful Soup ignores the first span because it is commented out, while the regex does not: it cannot evaluate what is "around" the tag, so it does not detect that the tag is inside a comment. It is possible to change the regex to detect these cases, but is it worth adding that to an expression that is already quite complex, when the Beautiful Soup code would remain the same as at the beginning of the answer?
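The sketch below shows the regex picking up the commented span, followed by one possible workaround: stripping the comments with yet another regex before searching. The comment-stripping pattern is an assumption of mine, not part of the original answer, and it has its own blind spots (for instance, a --> inside a script string).

```python
import re

html_with_comment = '''<!--
comentários HTML, etc
<span class="txt_arial_8pt_black">também estou comentado</span>
-->
<span class="txt_arial_8pt_black">não estou comentado</span>'''

pattern = r'<span class="txt_arial_8pt_black">(.*?)</span>'

# The plain regex cannot tell that the first span is inside a comment.
naive = re.findall(pattern, html_with_comment, re.DOTALL)

# Hypothetical workaround: remove comments first, then search.
without_comments = re.sub(r'<!--.*?-->', '', html_with_comment, flags=re.DOTALL)
filtered = re.findall(pattern, without_comments, re.DOTALL)

print(naive)     # ['também estou comentado', 'não estou comentado']
print(filtered)  # ['não estou comentado']
```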
And so far we have only considered cases where the tag is closed. But since you said the HTML is poorly formed, we also have to consider that a tag may not have its closing tag. In that case it is even harder to write a regex that covers all possible situations: you might have to make the closing tags optional (but then how do you know where the tag ends?), detect variations of malformed tags (and the complexity depends on what those tags look like), etc. Depending on the HTML, writing one or more expressions will be too much work and, in my opinion, not worth it.
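To illustrate the missing-closing-tag problem, here is a minimal sketch with two made-up malformed inputs (broken and never_closed are invented for the example, since the question does not show how its HTML is malformed).

```python
import re

pattern = r'<span class="txt_arial_8pt_black">(.*?)</span>'

# Made-up input: the first span is never closed, the second one is.
broken = '<span class="txt_arial_8pt_black">um<td><span class="txt_arial_8pt_black">dois</span>'
# Made-up input: the only span is never closed.
never_closed = '<span class="txt_arial_8pt_black">sem fechamento'

# The first match runs until the only </span>, merging both spans into one result.
merged = re.findall(pattern, broken, re.DOTALL)
# With no </span> at all, the regex finds nothing.
missing = re.findall(pattern, never_closed, re.DOTALL)

print(merged)   # ['um<td><span class="txt_arial_8pt_black">dois']
print(missing)  # []
```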
Don't get me wrong, regex is cool (I like it a lot) and often looks like the best solution. But it isn't always (and for manipulating HTML, it certainly is not).
Actually, the flag to use is DOTALL, because MULTILINE only changes the behavior of ^ and $ (they also match at the beginning and end of each line). DOTALL, on the other hand, makes the dot match line breaks, and then it "works": https://ideone.com/HbKC2C - personally I find the name MULTILINE confusing, and to make matters worse some engines call DOTALL "single line", which makes everything even more confusing. Either way, I don't think regex is the best solution, as I say in my answer :-) – hkotsubo
I switched to DOTALL, but I will keep my answer as it is, since it should solve the OP's problem now. I will emphasize, however, that regexes should not be used.
– jsbueno