As André Figueiredo explained very well, regular expressions (in the formal sense of the term) are not powerful enough to handle context-free languages: they cannot keep track of arbitrarily deep nesting, which is exactly what HTML tags require. Therefore, parsing HTML with regexes is not good practice.
A more practical explanation may be useful.
Sometimes the language is not HTML
Regular expressions are useful for parsing text, even if the content is HTML. Consider a list like the one below:
<ul class="my-list">
<li>item1</li>
<li>item2</li>
<li>item3</li>
</ul>
It is easy to pick up the contents of each item with a regular expression (example in Python):
>>> a = '''<ul class="my-list">
... <li>item1</li>
... <li>item2</li>
... <li>item3</li>
... </ul>'''
>>> import re
>>> def items(string):
...     return re.findall(r'<li>(.*)</li>', string)
...
>>> items(a)
['item1', 'item2', 'item3']
This is not “wrong” per se. It’s quite practical, actually. In this case, we don’t even want to parse a context-free language like HTML. We just want to match the regular language <li>.*</li>.
Then the language becomes HTML
The (potential) problem with this approach is that content can change in ways that a regex cannot follow.
Let’s say the HTML is generated by a dynamic application, and one day it changes to something like this:
<ul class="my-list">
<li>
item1
</li>
<li>
item2
<ol class="sublist">
<li>item2.1</li>
<li>item2.2</li>
</ol>
</li>
<li>
item3
</li>
</ul>
The result will be quite different now:
>>> b = '''<ul class="my-list">
... <li>
... item1
... </li>
... <li>
... item2
... <ol class="sublist">
... <li>item2.1</li>
... <li>item2.2</li>
... </ol>
... </li>
... <li>
... item3
... </li>
... </ul>'''
>>> items(b)
['item2.1', 'item2.2']
This happens because the dot does not match newlines, so only the <li>...</li> pairs that sit on a single line are captured. Some parts of the problem are easy to solve. We can change the expression to include item1 and item3 by discarding the blank spaces, as in <li>\s*(.*)\s*</li>. Others are more complicated. How do we catch item2? Do we want the sub-list inside item2, or just the text? And if we want to pick only the items from the first list, how do we do that?
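To make the failure concrete, here is the adjusted expression applied to b (note how item2 itself is lost, while its sub-items leak in):
>>> def items(string):
...     return re.findall(r'<li>\s*(.*)\s*</li>', string)
...
>>> items(b)
['item1', 'item2.1', 'item2.2', 'item3']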
That’s where regular expressions start to fail. It is practically impossible to make a regular expression understand whether it is at the first level of the list or the second, for example.
This scenario is common: a simple regex solution starts to fail in one case, then in another, then another... It is gradually adapted, but the regex gets more and more complicated. Over time, it becomes almost impossible to understand.
The solution with a parser
An HTML parser is slightly more complicated to use than a regular expression, but it is much more powerful.
For example, our function could use BeautifulSoup instead of regular expressions. It would look something like this:
>>> from bs4 import BeautifulSoup
>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     items = b.find_all('li')
...     return [i.text for i in items]
...
>>> items(a)
[u'item1', u'item2', u'item3']
There’s more code here, but not much. Now, what happens if we call this function with the new value b?
>>> items(b)
[u'\nitem1\n', u'\nitem2\n\nitem2.1\nitem2.2\n\n', u'item2.1', u'item2.2', u'\nitem3\n']
This time, it returned the contents of every li!
On the other hand, it failed too, just in a different way. But it’s easier to fix now. To remove the surrounding whitespace, we use the strip() method:
>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     elements = b.find_all('li')
...     return [e.text.strip() for e in elements]
...
>>> items(b)
[u'item1', u'item2\n\nitem2.1\nitem2.2', u'item2.1', u'item2.2', u'item3']
The nested list is still included as text in item2. Fortunately, the parser is flexible: we can use each element’s find() method to pick up only its direct text:
>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     elements = b.find_all('li')
...     return [e.find(text=True, recursive=False).strip() for e in elements]
...
>>> items(b)
[u'item1', u'item2', u'item2.1', u'item2.2', u'item3']
If we just want the items from the primary list, we can be more specific. Let’s get the list first, and then get all of its <li> elements without recursion:
>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     ul = b.find('ul')
...     elements = ul.find_all('li', recursive=False)
...     return [e.find(text=True, recursive=False).strip() for e in elements]
...
>>> items(b)
[u'item1', u'item2', u'item3']
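And, as a quick sanity check, it still works on the original, simpler input:
>>> items(a)
[u'item1', u'item2', u'item3']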
The function is not as simple as the first one, but it does a lot more. And, even better, we were able to evolve it gradually, in a readable way.
One good tool versus the best tool for the job
This does not mean you can never use regular expressions to extract HTML content. It is a question of tradeoffs:
- If you are writing a disposable script, a regular expression is probably a good choice;
- If you trust that the HTML format will not change, a regular expression may be sufficient;
- If you have control over the HTML to be processed, and are willing to enforce a rigid format on it (which is not a good idea), the regular expression will be stable;
- If you’re willing to go in and change your regex every time someone changes something in the HTML, you can do that (but prepare to waste time).
It is very easy for a beginner to be fascinated by the simplicity of regular expressions and make the wrong choice. That is why it is so common to recommend using a parser from the first line of code. A parser is always a right choice, and 90% of the time it’s the right choice.
Now, if you’re experienced and already used to parsers, you can weigh the pros and cons case by case. Ultimately, even if a regular expression is not the most scalable choice, keeping it encapsulated in a function makes it quite easy to replace with a parser later.
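As a sketch of that encapsulation idea (the names items_regex and items_parser are just illustrative, not from the examples above): both versions expose the same interface, so swapping one for the other does not touch the calling code.
>>> import re
>>> from bs4 import BeautifulSoup
>>> def items_regex(string):
...     # the quick regex version (illustrative name)
...     return re.findall(r'<li>(.*)</li>', string)
...
>>> def items_parser(string):
...     # the parser version, with the same signature (illustrative name)
...     soup = BeautifulSoup(string, 'html.parser')
...     return [e.text.strip() for e in soup.find_all('li')]
...
>>> items_regex(a) == items_parser(a)
True
Swap the body, keep the signature, and the rest of the program never notices.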
'Cause Regex pees his pants :)
– Maniero