Regex does not match the contents of all span tags

1

I know there are HTML parsers, but since my HTML is not well structured, I also need to use regular expressions.

The HTML is like this:

<tr bgcolor="#CCCCCC">
<td colspan="2"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">1º 

                          Período Ideal</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>

                          Aula</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>

                          Trab.</span></font></td>
<td align="center"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CH</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CE</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CP</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">ATPA</span></font></td>
</tr>

And the regex is:

r'''<span class="txt_arial_8pt_black">(.*?)</span>'''

I’m trying to get the text '1º Período Ideal', but the regex does not match it, although other parts are matched, such as 'CE', 'CP' and 'ATPA', and I don’t understand why.

2 answers

8

Do not use regex to parse HTML

Since the question has the Beautiful Soup tag, why not use that library, which was made precisely for parsing and manipulating HTML, instead of regex?

If the HTML is poorly formed, you can install one of the alternative parsers mentioned in the Beautiful Soup documentation itself. In your case, I believe the best option would be html5lib, the most lenient of the parsers (the documentation says that, although it is slower, it is as lenient as browsers, which are known to accept extremely malformed HTML). It is not clear how malformed your HTML is, but I think it is worth a try.

Anyway, regex is not meant for working with HTML (it may even "work" in many cases but, as you have already begun to notice, it is not the most suitable tool for the task).

With Beautiful Soup, it would look like this:

html = '''
<tr bgcolor="#CCCCCC">
<td colspan="2"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">1º 

                          Período Ideal</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>

                          Aula</span></font></td>
<td><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">Créd.<br/>

                          Trab.</span></font></td>
<td align="center"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CH</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CE</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">CP</span></font></td>
<td align="center" width="6%"><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif" size="1"><span class="txt_arial_8pt_black">ATPA</span></font></td>
</tr>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for span in soup.find_all('span', class_='txt_arial_8pt_black'):
    print(span.text)

The code above finds every span that has the class "txt_arial_8pt_black" and prints the text of each one. Note that the line breaks present in the HTML are printed as well:

1º 

                          Período Ideal
Créd.

                          Aula
Créd.

                          Trab.
CH
CE
CP
ATPA
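If the goal is the single-line text '1º Período Ideal' from the question, the whitespace inside each text can be collapsed afterwards. This is only a minimal sketch, assuming plain strings like the ones printed above; the helper name collapse_whitespace is made up for this illustration:

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, line breaks) and discards empty pieces.
    return " ".join(text.split())


raw = "1º \n\n                          Período Ideal"
clean = collapse_whitespace(raw)
print(clean)
```

Inside the loop above, the same helper could be applied to span.text before printing.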

If you want to use the other parser already mentioned, which deals better with poorly formed HTML, just change to:

soup = BeautifulSoup(html, 'html5lib')

Don’t forget, of course, to install it, as described in the documentation.


But if you really want to use regex...

Below I show you some solutions with regex, and you will understand why using Beautiful Soup is a much better solution.

Your regex does not work in all cases because the dot matches any character except line breaks. Since some of the span tags have line breaks in their text, .*? does not match them, as is the case for the tag you mentioned:

<span class="txt_arial_8pt_black">1º 

                          Período Ideal</span>
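The behavior can be reproduced with a small test. The sketch below uses a shortened version of the HTML (the variable names are just for illustration) and the regex from the question without any flags:

```python
import re

snippet = '''<span class="txt_arial_8pt_black">1º 

                          Período Ideal</span>
<span class="txt_arial_8pt_black">CH</span>'''

pattern = r'<span class="txt_arial_8pt_black">(.*?)</span>'

# Without flags, "." does not match line breaks, so the first span
# (whose text spans several lines) is skipped.
without_flag = re.findall(pattern, snippet)
print(without_flag)
```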

An alternative is to use the flag DOTALL, which causes the dot to also match line breaks:

import re
for texto in re.findall(r'<span class="txt_arial_8pt_black">(.*?)</span>', html, re.DOTALL):
    print(texto)

I used findall to find all occurrences in the HTML. Note that the part corresponding to the tag’s contents is in parentheses, which forms a capture group. The documentation says that findall returns only the groups when they are present, so the variable texto already holds the tag’s text at each iteration.

This "works", but with many limitations (not to mention that the regex also returns the <br/> tags along with the text, and it is unclear whether they should be returned or not). If the HTML changes a little, the regex will need to be modified, which would not be necessary with Beautiful Soup.

For example, the tag may have other attributes or more than one class, such as <span id="abc" class="txt_arial_8pt_black"> or <span class="outra_classe txt_arial_8pt_black">. And just like that, the regex is broken (while the first Beautiful Soup code still finds these tags without problems). An alternative would be:
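To make the effect of the capture group concrete, here is a small sketch with a made-up class name ("a") and a made-up sample string:

```python
import re

sample = '<span class="a">CH</span><span class="a">CE</span>'

# With a capture group, findall returns only what the group matched.
with_group = re.findall(r'<span class="a">(.*?)</span>', sample)

# Without a group, findall returns each whole match.
without_group = re.findall(r'<span class="a">.*?</span>', sample)

# finditer yields match objects, so both views are available at once.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(r'<span class="a">(.*?)</span>', sample)]

print(with_group)
print(without_group)
```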

r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>'

That is, the regex allows [^>]+ (one or more characters other than >) before class, and inside the class attribute there can be zero or more non-quote characters ([^"]*) before and after "txt_arial_8pt_black" (the \b shortcut is a word boundary, which ensures that classes such as "txt_arial_8pt_black2" are not matched).

But what if there is one span inside another (which is perfectly possible)?
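A quick check of this expression against the variations mentioned above (the sample string is invented for the test):

```python
import re

pattern = r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>'

sample = '''<span id="abc" class="txt_arial_8pt_black">A</span>
<span class="outra_classe txt_arial_8pt_black">B</span>
<span class="txt_arial_8pt_black2">C</span>'''

# The first two spans have the class; the third has a different class
# that merely starts with the same characters, so \b rejects it.
found = re.findall(pattern, sample, re.DOTALL)
print(found)
```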

<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>

Beautiful Soup correctly picks up the whole text "Antes Durante Depois", but the regex only gets Antes <span>Durante (note: if you want Beautiful Soup to return the entire contents of the span, including the tags, replace span.text with span.decode_contents()). This happens because the regex uses a lazy quantifier (the ? right after the .*), i.e., it matches as few characters as possible while still satisfying the expression. This makes the regex stop at the first </span> it finds.

And if I remove the lazy quantifier, the regex becomes greedy and matches as many characters as possible, that is, it takes everything up to the last </span>.
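The difference between the two quantifiers can be seen in a short sketch. The second span ("X") is an addition of mine, included only to show how far the greedy version goes:

```python
import re

sample = ('<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>'
          '<span class="txt_arial_8pt_black">X</span>')

# Lazy: stops at the first </span> after each opening tag.
lazy = re.findall(r'<span class="txt_arial_8pt_black">(.*?)</span>', sample)

# Greedy: runs up to the last </span> in the whole string,
# swallowing the second span as well.
greedy = re.findall(r'<span class="txt_arial_8pt_black">(.*)</span>', sample)

print(lazy)
print(greedy)
```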

In that case, you could use something like:

r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>(?![^<>]*</span>)'

The negative lookahead (?![^<>]*</span>) checks that there is no other closing tag after this one, but now the regex fails if there is some tag other than span in between, e.g.:

<span class="txt_arial_8pt_black">Antes <span>Durante</span><br> Depois</span>

The <br> between the two </span> already breaks the regex, and you have to change it again to handle this case. An alternative would be:

r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>(?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>)'

Now the negative lookahead (?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>) checks that what follows is not a sequence of characters other than < and >, or of tags other than span, ending in </span> (it uses another lookahead just for that, i.e. one lookahead inside another, which to me is already a sign that things are more complex than they should be).

Although it works for the case above, is it worth using this regex? Think about the future maintenance of it, and compare it with the Beautiful Soup code above. Also, the regex always returns the contents of the span with the tags included, while Beautiful Soup gives you the option of returning only the text without the tags. To remove the tags with regex, you would need another one just for that.
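The two lookahead versions can be compared side by side on the nested examples. In this sketch the names BASE, first_try and second_try are mine; the expressions themselves are the ones shown above:

```python
import re

BASE = r'<span[^>]+class="[^"]*\btxt_arial_8pt_black\b[^"]*"[^>]*>(.*?)</span>'
first_try = BASE + r'(?![^<>]*</span>)'
second_try = BASE + r'(?!(?:</?(?!\bspan\b)[^<>]*>|[^<>])*</span>)'

nested = '<span class="txt_arial_8pt_black">Antes <span>Durante</span> Depois</span>'
nested_br = '<span class="txt_arial_8pt_black">Antes <span>Durante</span><br> Depois</span>'

# The simple lookahead handles plain nesting...
print(re.findall(first_try, nested, re.DOTALL))
# ...but is fooled by the <br> between the two closing tags.
print(re.findall(first_try, nested_br, re.DOTALL))
# The nested lookahead copes with both inputs.
print(re.findall(second_try, nested, re.DOTALL))
print(re.findall(second_try, nested_br, re.DOTALL))
```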
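Such a second expression might look like the sketch below. It is deliberately naive (the function name strip_tags is invented here, and it assumes there is no > inside attribute values), which is one more argument for leaving this job to a parser:

```python
import re


def strip_tags(fragment):
    # Remove anything that looks like a tag: "<", some characters
    # other than "<" and ">", then ">".
    return re.sub(r'<[^<>]+>', '', fragment)


content = 'Antes <span>Durante</span><br> Depois'
text_only = strip_tags(content)
print(text_only)
```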


Note that each new case brings a new complication to the regex, and the expression gets bigger, more complex, and harder to understand and maintain. Meanwhile, the Beautiful Soup code stays the same as at the beginning, so is it worth insisting on regex?

There are still other cases, for example when the tag is inside a comment:

<!--
comentários HTML, etc
<a href="blabla">estou comentado</a>
<span class="txt_arial_8pt_black">também estou comentado</span>
<p>comentado, me ignorem</p>
-->
<span class="txt_arial_8pt_black">não estou comentado</span>

Beautiful Soup ignores the first span because it is commented out, while the regex does not: it cannot evaluate what is "around" the tag, so it does not detect that the tag is inside a comment. It is even possible to change the regex to detect these cases, but is it worth adding that to an expression that is already quite complex, when the Beautiful Soup code would remain the same as at the beginning of the answer?

And we are only considering cases where the tag is closed. But since you said that the HTML is poorly formed, do we have to consider that the tag will not always have its closing tag? In that case it gets even harder to write a regex that covers all possible situations: you may have to make the closing tag optional (but then how do you know where the tag ends?), detect some variations of malformed tags (and the complexity will depend on what these tags look like), etc. Anyway, depending on what the HTML looks like, writing one or more expressions will be too much work, and in my opinion it will not be worth it.

Don’t get me wrong, regex is cool (I like it a lot) and it often looks like the best solution. But it is not always so (and for manipulating HTML, it certainly is not).
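The comment case can be checked without installing anything. The sketch below is not the Beautiful Soup code: it swaps in the standard library's html.parser.HTMLParser (the class on which Beautiful Soup's 'html.parser' option is built), and the SpanText class is written just for this illustration:

```python
import re
from html.parser import HTMLParser

TARGET = "txt_arial_8pt_black"

sample = '''<!--
comentários HTML, etc
<a href="blabla">estou comentado</a>
<span class="txt_arial_8pt_black">também estou comentado</span>
<p>comentado, me ignorem</p>
-->
<span class="txt_arial_8pt_black">não estou comentado</span>'''


class SpanText(HTMLParser):
    """Collects the text of every span that has the TARGET class."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many spans deep we are inside a target span
        self.parts = []     # text pieces of the span being read
        self.texts = []     # finished texts

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or TARGET in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


parser = SpanText()
parser.feed(sample)
parser.close()

# The parser hands comments to handle_comment (ignored here),
# so the commented-out span never reaches handle_starttag.
print(parser.texts)

# The regex has no notion of comments and returns both spans.
regex_texts = re.findall(
    r'<span class="txt_arial_8pt_black">(.*?)</span>', sample, re.DOTALL)
print(regex_texts)
```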
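As a small illustration of the unclosed-tag problem, here is an invented fragment in which the first span is missing its </span>. The regex does not fail; it silently merges two cells:

```python
import re

broken = ('<td><span class="txt_arial_8pt_black">CH</td>'
          '<td><span class="txt_arial_8pt_black">CE</span></td>')

# The lazy quantifier keeps going until it finds some </span>,
# which here belongs to the second span.
matches = re.findall(
    r'<span class="txt_arial_8pt_black">(.*?)</span>', broken, re.DOTALL)
print(matches)
```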

3

The specific problem with this tag, in the example given, is that the cell content spans more than one line. You did not post your Python code (it is hard to answer the question like that), but you have to add the re.DOTALL flag to the call that runs the regular expression (I cannot give an example, since we do not know which function you are using). This flag makes the dot also match line breaks, so the whole block of text is treated together instead of the match stopping at each new line.

I think that in all the functions of the re module, flags is the last parameter: just add re.DOTALL to your call.

Understand that, even when they work, regular expressions are not the proper way to extract web content. You can use them if you want something very specific, on pages whose structure you know and that do not parse correctly, as in this case. But a "universal" regular expression that accounted for any valid HTML structure would not only be very difficult to write and maintain, it would be something monstrous; it would be much easier to use a parser. I will not stress this point any further, because this HTML really is bad, so your ad hoc solution might be the best thing to do.

However, you are making rather poor use of regular expressions there: you are matching exact characters, using none of the features that would let you keep getting your results if the class name in the span element changes, or if there is an extra space somewhere, or other tags, etc. It might work that way all the time, or it might break suddenly, or you might stop picking up tags without even noticing because of spacing, etc. The ideal would be to use a more flexible regexp than this one.
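Since the function used in the question is unknown, the sketch below shows the flag on a few common calls, with an invented sample string. It passes flags by keyword, which avoids depending on the parameter order (in re.sub, for instance, the fourth positional argument is count):

```python
import re

text = "<b>first\nsecond</b>"
pattern = r"<b>(.*?)</b>"

found = re.findall(pattern, text, flags=re.DOTALL)
first = re.search(pattern, text, flags=re.DOTALL).group(1)
compiled = re.compile(pattern, flags=re.DOTALL)

# Passing re.DOTALL positionally to re.sub would set count instead,
# so the keyword form is the safe one here.
replaced = re.sub(pattern, "X", text, flags=re.DOTALL)

print(found, first, replaced)
```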
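What a "more flexible" expression could look like is a matter of taste. The sketch below is one possibility, not a recommendation: it tolerates extra spaces, single or double quotes, other attributes, additional classes and upper-case tag names, and it still has all the limitations of regex on HTML. The sample string is invented:

```python
import re

tolerant = re.compile(
    r'''<span\b[^>]*\bclass\s*=\s*["'][^"']*\btxt_arial_8pt_black\b[^"']*["'][^>]*>(.*?)</span\s*>''',
    flags=re.DOTALL | re.IGNORECASE,
)

samples = '''<span class="txt_arial_8pt_black">CE</span>
<SPAN  class = 'x txt_arial_8pt_black' >CH</span >'''

flexible = tolerant.findall(samples)
print(flexible)
```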

  • 1

    Actually the flag to use is DOTALL, because MULTILINE only changes the behavior of ^ and $ (they then also match at the beginning and end of each line). DOTALL makes the dot match line breaks, and then it "works": https://ideone.com/HbKC2C. Personally I find the name MULTILINE confusing, and to make matters worse some engines call DOTALL "single line", which makes everything even more confusing. Anyway, I think regex is not even the best solution, as I say in my answer :-)

  • 1

    I switched to DOTALL, but I will keep the answer, since it should solve the OP’s problem now. I will emphasize that you should not use regexps, however.
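The difference between the two flags discussed in these comments can be verified with a short sketch (the sample strings are invented):

```python
import re

text = "<b>first\nsecond</b>"
pattern = r"<b>(.*?)</b>"

no_flag = re.findall(pattern, text)
multiline = re.findall(pattern, text, flags=re.MULTILINE)
dotall = re.findall(pattern, text, flags=re.DOTALL)

# MULTILINE only affects the anchors: ^ now matches at each line start.
line_starts = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)
string_start = re.findall(r"^\w+", "one\ntwo")

print(no_flag, multiline, dotall)
print(line_starts, string_start)
```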
