Why shouldn’t Regex be used to handle HTML?

Viewed 576 times

27

I understand that if I try to use regex over HTML, HTML tags will drip from my eyes like liquid pain, among other horrors. And that I should use an XML parser or something of the sort.

My curious inner child won’t stop asking me: but why? Why, dear Stack Overflow, can’t I use regular expressions to extract fields from markup languages?

  • 16

'Cause Regex pees its pants :)

2 answers

24


TL;DR

HTML is simply more complex than what regex can express, to the point that no regular expression can handle HTML satisfactorily.

Explanation

The most direct explanation comes from formal language theory, with the help of the Chomsky hierarchy, which organizes languages according to their complexity, that is, the freedom of their rules and the machine needed to recognize them.

It ranges from Type 0, the most complex (grammars generating every language recognizable by a Turing machine), down to Type 3, the least complex, recognizable by a simple finite automaton.

  • A regex (regular expression) is the implementation of a regular grammar, Type 3 in the Chomsky hierarchy: a linear grammar, easily recognized by a finite automaton.

  • HTML, on the other hand, derives from SGML, which is a context-free language (CFL), generated by a context-free grammar (CFG), Type 2, recognized by a pushdown automaton. And HTML is not even context-free, although it is Turing complete via Rule 110 in combination with CSS3.

That is, a finite automaton is not enough to recognize HTML; it takes a pushdown automaton to recognize XML, from which HTML is derived. Thus, by definition, it is not possible to recognize HTML with a regular expression in a satisfactory way (one that covers all cases).
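The difference in machinery can be made concrete. A finite automaton (what a classic regex compiles to) has no memory of nesting depth; a pushdown automaton adds a stack, which is exactly what checking balanced tags requires. Below is a minimal sketch; the `tags_balanced` helper is hypothetical and handles only bare tags like `<ul>` and `</ul>`, with no attributes or self-closing tags:

```python
import re

def tags_balanced(html):
    """Check that tags open and close in matching, nested order.
    The regex only tokenizes (a regular task); the stack does the
    non-regular part that a regex alone cannot."""
    stack = []
    for closing, name in re.findall(r'<(/?)([a-z]+)>', html):
        if not closing:
            stack.append(name)          # push on open tag
        elif not stack or stack.pop() != name:
            return False                # close without matching open
    return not stack                    # every open tag was closed

print(tags_balanced('<ul><li>ok</li></ul>'))      # True
print(tags_balanced('<ul><li>broken</ul></li>'))  # False
```

Note the irony: a regex is perfectly fine for the tokenizing step, because recognizing individual tags is a regular problem; it is the arbitrary nesting that demands the stack.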


Notes:

  1. The grammar classes in the Chomsky hierarchy are not mutually exclusive; they are arranged as nested subsets: Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0.
  • HTML is not context-free? Is there a reference for this Rule 110 claim?

  • I found a reference: https://en.wikipedia.org/wiki/Rule_110 Conway’s Game of Life =)

  • I knew this was true for CSS3 combined with HTML, not HTML per se. There is also the point that the description itself is a CFG (even if the computation that description enables is Turing complete). For example, I wrote a recognizer for a Type 0 language using only CFGs, in this answer about Meta Language

  • @Jeffersonquesado HTML per se, no, by definition. I corrected the text. However, HTML is not context-free (although there are some proponents of that idea, mistakenly in my view). I will not delve into this topic here because it is not pertinent to the post, but there is plenty of material on the internet.

12

As André Figueiredo explained perfectly, regular expressions (in the formal sense of the term) are not powerful enough to handle context-free languages. Therefore, handling HTML with regexes is not good practice.

A more practical explanation may be useful.

Sometimes the language isn’t HTML

Regular expressions are useful for parsing text, even if the content is HTML. Consider a list like the one below:

<ul class="my-list">
    <li>item1</li>
    <li>item2</li>
    <li>item3</li>
</ul>

It is easy to pick up the contents of each item with a regular expression (example in Python):

>>> a = '''<ul class="my-list">
...   <li>item1</li>
...   <li>item2</li>
...   <li>item3</li>
... </ul>'''
>>> import re
>>> def items(string):
...     return re.findall('<li>(.*)</li>', string)
... 
>>> items(a)
['item1', 'item2', 'item3']

This is not "wrong" per se. It’s quite practical, actually. In this case, we aren’t even treating HTML as a context-free language; we are just handling the regular language <li>.*</li>.

Then the language becomes HTML

The (potential) problem with this approach is that content can change in ways that a regex cannot follow.

Let’s say HTML is generated by a dynamic application, and one day it changes to something like this:

<ul class="my-list">
    <li>
        item1
    </li>
    <li>
        item2
        <ol class="sublist">
            <li>item2.1</li>
            <li>item2.2</li>
        </ol>
    </li>
    <li>
        item3
    </li>
</ul>

The result will be quite different now:

>>> b = '''<ul class="my-list">
...     <li>
...         item1
...     </li>
...     <li>
...         item2
...         <ol class="sublist">
...             <li>item2.1</li>
...             <li>item2.2</li>
...         </ol>
...     </li>
...     <li>
...         item3
...     </li>
... </ul>'''
>>> items(b)
['item2.1', 'item2.2']

Some parts of the problem are easy to solve. We can change the expression to also match item1 and item3 by discarding whitespace, as in <li>\s*(.*)\s*</li>. Others are more complicated. How do we capture item2? Do we want the sublist inside item2, or just its text? And if we want only the items from the first list, how do we do that?

That’s where regular expressions start to fail. It is practically impossible to make a regular expression understand whether it is at the first level of nesting or the second, for example.

This scenario is common: a simple solution with regex starts to fail in one case, then in another, then another... Gradually it is adapted, but the regex gets more and more complicated. Over time, it becomes almost impossible to understand.
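The degradation described above is easy to observe. Continuing with the nested HTML stored in b, here is what the whitespace-tolerant pattern actually returns (a sketch; the regex is the one discussed above):

```python
import re

b = '''<ul class="my-list">
    <li>
        item1
    </li>
    <li>
        item2
        <ol class="sublist">
            <li>item2.1</li>
            <li>item2.2</li>
        </ol>
    </li>
    <li>
        item3
    </li>
</ul>'''

# Discarding whitespace recovers item1 and item3, but item2 is lost
# (its </li> is separated from the text by the nested <ol>), and the
# sublist items leak in: the pattern has no notion of nesting level.
print(re.findall(r'<li>\s*(.*)\s*</li>', b))
# ['item1', 'item2.1', 'item2.2', 'item3']
```

Each patch to the pattern fixes one symptom and exposes another, which is exactly the spiral of complexity the paragraph above warns about.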

The solution through the parser

An HTML parser is slightly more complicated to use than a regular expression, but is much more powerful.

For example, our function could use Beautifulsoup instead of regular expressions. It would be something like this:

>>> from bs4 import BeautifulSoup
>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     items = b.find_all('li')
...     return [i.text for i in items]
... 
>>> items(a)
[u'item1', u'item2', u'item3']

There’s more code here, but not much. Now, what happens if we call this function with the new value b?

>>> items(b)
[u'\n        item1\n    ', u'\n        item2\n        \nitem2.1\nitem2.2\n\n', u'item2.1', u'item2.2', u'\n        item3\n    ']

This time, it returned the content of all the li elements!

On the other hand, it also failed, though in a different way. But it’s easier to fix now. To remove the surrounding whitespace, we use the strip() method:

>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     elements = b.find_all('li')
...     return [e.text.strip() for e in elements]
... 
>>> items(b)
[u'item1', u'item2\n        \nitem2.1\nitem2.2', u'item2.1', u'item2.2', u'item3']

The nested list is still included as text in item2. Fortunately, the parser is flexible: we can use each element’s find() method to get only its direct text:

>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     elements = b.find_all('li')
...     return [e.find(text=True, recursive=False).strip() for e in elements]
... 
>>> items(b)
[u'item1', u'item2', u'item2.1', u'item2.2', u'item3']

If we want only the items from the primary list, we can be more specific. First we get the list itself, and then all of its <li> without recursion:

>>> def items(string):
...     b = BeautifulSoup(string, 'html.parser')
...     ul = b.find('ul')
...     elements = ul.find_all('li', recursive=False)
...     return [e.find(text=True, recursive=False).strip() for e in elements]
... 
>>> items(b)
[u'item1', u'item2', u'item3']

The function is not as simple as the first one, but it does a lot more. And, even better, we were able to evolve it gradually, in a readable way.

One good tool versus the best tool for the job

This does not mean you can never use regular expressions to extract HTML content. The question is one of tradeoffs:

  • If you are writing a disposable script, a regular expression is probably a good choice;
  • If you trust that the HTML format will not change, a regular expression may be sufficient;
  • If you control the HTML being processed and are willing to impose a rigid format on it (which is not a good idea), the regular expression will be stable;
  • If you’re willing to chase down and change your regex every time someone changes something in the HTML, you can do it (but prepare to waste time).

It is very easy for a beginner to be seduced by the simplicity of regular expressions and make the wrong choice. That is why it is so common to recommend using a parser from the first line of code. The parser is always a safe choice, and 90% of the time it’s the best one.

Now, if you’re experienced and already comfortable with parsers, you can weigh the pros and cons case by case. Ultimately, even if a regular expression is not the most scalable choice, keeping it encapsulated in a function makes it quite easy to replace with a parser eventually.
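The encapsulation point above can be sketched like this: keep the extraction behind a single function, so callers never know whether a regex or a parser does the work (the function name extract_items is illustrative, not from the original answer):

```python
import re

def extract_items(html):
    """Extract list-item text from simple, flat HTML.
    Callers depend only on this signature; the regex body below
    can later be swapped for a BeautifulSoup-based implementation
    without touching any calling code."""
    return [m.strip() for m in re.findall(r'<li>\s*(.*?)\s*</li>', html)]

print(extract_items('<ul><li>a</li><li>b</li></ul>'))  # ['a', 'b']
```

If the HTML later grows nesting and the regex starts to misbehave, the fix is confined to the body of this one function.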

  • Like parsing Java? You can do a few things with regex https://answall.com/a/214947/64969
