How do greedy, non-greedy quantifiers work?

import re
import requests

# req is the HTTP response for the page being scraped (URL is just an example)
req = requests.get('https://example.com')
print(re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', req.text))

The code is meant to pick up links on a web page.

I know it's right, but I ended up getting it right without knowing what I was doing. I wanted to understand how the question mark works in this case.

  • "Greedy coders" or "greedy quantifiers"?

  • You already have explanations about the quantifiers here, here and here. Does that help?

  • Anyway, if you want to extract all the hrefs from an HTML page, you should not use regex but a dedicated lib: https://answall.com/a/440262/112052

  • @anonymity quantifiers, I said it wrong

  • @hkotsubo actually it doesn't help, because those are in JavaScript, and the Python one I think is not using the re library, which is what my code uses

  • But the explanation about the ? is the same regardless of language

  • Hello Rafa, I recommend you read "What is a greedy Regular Expression?" and then "Why should Regex not be used to handle HTML?" ... and a few more links at: https://answall.com/search?q=%5Bpython%5D+pegar+links


1 answer

If you do not know very well what you are doing, I suggest you do not use regex for this; there are better solutions (and throughout this answer we will see why).


But anyway, about the quantifier +: by default it is greedy, that is, it tries to grab as many characters as possible.

And since you used it together with the dot (which matches any character except line breaks), .+ ends up matching "everything".
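A quick sketch of what the dot does and does not match (the sample string is made up for illustration):

```python
import re

# the dot matches any character except a line break,
# so 'a\nc' is NOT matched, while 'abc' and 'axc' are
print(re.findall(r'a.c', 'abc a\nc axc'))
# -> ['abc', 'axc']
```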

But since there is also (?=["\']), which checks whether the next character is a quote, what .+ does is go all the way to the end of the string and then backtrack until it finds a " or ' (this behavior is explained in detail here). This means the regex ends up matching everything from the first href to the last quote.

But if you use .+?, the quantifier becomes lazy (non-greedy), matching as few characters as possible (this is explained here, here and in the link already quoted; although those links are not about Python, the behavior of a lazy quantifier is the same, so I suggest you read them to understand it better). With that, it only takes what is between the quotes right after the href. So "it works".
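You can see both behaviors side by side with a small made-up HTML snippet (the URLs and page are just examples):

```python
import re

html = 'x <a href="https://site1.com">one</a> y <a href="https://site2.com">two</a>'

# greedy: .+ runs to the end of the string and backtracks to the LAST quote,
# so everything from the first href to the last quote becomes one match
print(re.findall(r'(?<=href=["\'])https?://.+(?=["\'])', html))
# -> ['https://site1.com">one</a> y <a href="https://site2.com']

# lazy: .+? stops at the FIRST quote it can, giving one match per link
print(re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html))
# -> ['https://site1.com', 'https://site2.com']
```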

For the record, it could also be like this:

print(re.findall(r'href=["\'](https?://[^"\']+)["\']', req.text))

Instead of the dot, I use [^"\'], indicating that I want anything that is not a quote (the [^ opens a negated character class), so I don't need the lazy quantifier, because I already guarantee that the regex will stop when it finds a quote.
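With the same made-up snippet, this version produces the same URLs; the capturing group means findall returns just the group's content:

```python
import re

html = '<a href="https://site1.com">one</a> <a href=\'https://site2.com\'>two</a>'

# [^"\']+ stops by itself at the closing quote (double or single),
# and findall returns only what the capturing group matched
print(re.findall(r'href=["\'](https?://[^"\']+)["\']', html))
# -> ['https://site1.com', 'https://site2.com']
```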


But as already said here (and here, and here), regex is not the ideal solution. It is better to use a dedicated lib, such as Beautiful Soup:

import re
import requests
from bs4 import BeautifulSoup

req = requests.get('https://example.com')  # the page you want to scrape
soup = BeautifulSoup(req.text, 'html.parser')
for a in soup.find_all('a', href=re.compile('^https?://')):
    print(a['href'])

Note that I used a regex only to check whether the href starts with "http" or "https". But here is a fundamental difference: the guarantee that I am only looking at the href attribute of an <a> tag in the HTML.

This makes a difference, for example, if one of the tags is commented out:

<!--
comment etc...
<a href="http://www.google.com/abc/xyz">fad fa</a>
-->
<p>blablabla</p>
<a href="https://www.abc.com">fafdafadsfsdad fa</a>

The regex matches both links above; Beautiful Soup only gets the second one (www.abc.com). Making a regex detect that a tag is inside a comment would be very complicated.
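You can check this with the snippet above inlined as a string, a sketch assuming bs4 is installed:

```python
import re
from bs4 import BeautifulSoup

html = '''<!--
comment etc...
<a href="http://www.google.com/abc/xyz">fad fa</a>
-->
<p>blablabla</p>
<a href="https://www.abc.com">fafdafadsfsdad fa</a>'''

# the regex also matches the link inside the HTML comment
print(re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html))
# -> ['http://www.google.com/abc/xyz', 'https://www.abc.com']

# Beautiful Soup parses the comment as a Comment node, so the <a>
# inside it is not found as a tag
soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a', href=re.compile('^https?://'))])
# -> ['https://www.abc.com']
```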

The link already quoted shows many other cases where a regex can fail, while Beautiful Soup (or any other HTML parser) handles them normally, without trouble.

Regex is cool (I like it a lot) and often looks like the best solution. But it isn't always (for manipulating HTML, it certainly is not).

  • thank you very much!
