Unless you know exactly what you are doing, I suggest not using regex for this; there are better solutions, and the reasons will become clear over the course of this answer.
That said, about the + quantifier: by default it is greedy, meaning it tries to match as many characters as possible.
And since you used it together with the dot (which matches any character except line breaks), .+ ends up taking "everything".
But then comes (?=["\']), a lookahead that checks whether a quotation mark follows. So what .+ does is run to the end of the line and then backtrack until it finds a " or ' (this behavior is explained in detail here). This means that the regex ends up taking everything from the first href to the last quote on that line.
But if you use .+? instead, the quantifier becomes lazy (non-greedy) and matches as few characters as possible (this is explained here, here and in the link already cited; although those links are not about Python, lazy quantifiers behave the same way, so I suggest reading them for a better understanding). With that, it matches only what sits between the quotation marks after the href. So "it works".
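To see the difference concretely, here is a minimal sketch on a made-up one-line HTML string. The sample string and the variable names are my own assumptions, and the patterns only approximate the one from the question:

```python
import re

# Hypothetical sample: two links on the same line
html = '<a href="http://a.com/x">A</a> <a href="https://b.org/y">B</a>'

# Greedy: .+ runs to the end of the line, then backtracks to the LAST quote
greedy = re.findall(r'href=["\'](https?://.+)(?=["\'])', html)

# Lazy: .+? stops at the FIRST position where the lookahead sees a quote
lazy = re.findall(r'href=["\'](https?://.+?)(?=["\'])', html)

print(greedy)  # a single match spanning from the first URL to the last one
print(lazy)    # one match per link
```

The greedy version returns one long string that swallows the markup between the two links, while the lazy version returns each URL separately.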
For the record, it could also be written like this:
print(re.findall(r'href=["\'](https?://[^"\']+)["\']', req.text))
Instead of the dot, I use [^"\'], meaning any character that is not a quotation mark (the [^ opens a negated character class). That way I don't need the lazy quantifier, because the regex is already guaranteed to stop when it reaches a quote.
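As a quick check, the negated character class can be tried on a small made-up string that mixes both quote styles (the sample and the names `html`, `pattern` and `links` are assumptions for illustration only):

```python
import re

# Hypothetical sample: one href in double quotes, one in single quotes
html = '<a href="http://a.com/x">A</a> <a href=\'https://b.org/y\'>B</a>'

# [^"\']+ can never cross a quote, so no lazy quantifier is needed
pattern = re.compile(r'href=["\'](https?://[^"\']+)["\']')
links = pattern.findall(html)

print(links)
```

Each URL comes back on its own, because the character class cannot consume the closing quote and therefore cannot run on into the next tag.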
But as already said here (and here, and here), regex is not the ideal solution. It is better to use a dedicated library, such as Beautiful Soup:
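import re
from bs4 import BeautifulSoup

# req is the HTTP response object from the question's code
soup = BeautifulSoup(req.text, 'html.parser')
# keep only <a> tags whose href starts with http:// or https://
for a in soup.find_all('a', href=re.compile(r'^https?://')):
    print(a['href'])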
Note that I used a regex only to check whether the href starts with "http://" or "https://". But there is a fundamental difference here: the guarantee that I am only looking at the href attribute of an a tag in the HTML.
This makes a difference, for example, if one of the tags is commented out:
<!--
comentário etc...
<a href="http://www.google.com/abc/xyz">fad fa</a>
-->
<p>blablabla</p>
<a href="https://www.abc.com">fafdafadsfsdad fa</a>
The regex picks up both links above, while Beautiful Soup picks up only the second (www.abc.com). Getting a regex to detect that a tag is inside a comment would be very complicated.
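The same comparison can be reproduced without installing anything. Note that this sketch swaps Beautiful Soup for the standard library's html.parser.HTMLParser (my substitution, not the code from this answer), and the class name LinkCollector is hypothetical:

```python
import re
from html.parser import HTMLParser

html = """<!--
comentário etc...
<a href="http://www.google.com/abc/xyz">fad fa</a>
-->
<p>blablabla</p>
<a href="https://www.abc.com">fafdafadsfsdad fa</a>"""


class LinkCollector(HTMLParser):
    """Collects http(s) href values from real <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Commented-out markup goes to handle_comment, so it never gets here
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and re.match(r'https?://', value):
                self.links.append(value)


# The regex sees plain text, so it also matches inside the comment
regex_links = re.findall(r'href=["\'](https?://[^"\']+)["\']', html)

collector = LinkCollector()
collector.feed(html)
collector.close()
parser_links = collector.links

print(regex_links)   # both URLs
print(parser_links)  # only the URL outside the comment
```

Any HTML parser that treats comments as comments should behave the same way; the regex has no notion of document structure, which is the whole point.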
The link already cited lists many other cases where a regex can fail, while Beautiful Soup (or any other HTML parser) handles them without trouble.
Regex is cool, I like it a lot, and it often looks like the best solution. But it isn't always (and for manipulating HTML, it certainly is not).
"Greedy coders" or "greedy quantifiers"?
– anonimo
You already have explanations of the quantifiers here, here and here; do they help?
– hkotsubo
But anyway, if you want to extract all the href attributes from an HTML document, you should not use regex but a dedicated library: https://answall.com/a/440262/112052
– hkotsubo
@anonimo, quantifiers; I misspoke
– Rafa0712
@hkotsubo, actually they don't help, because they are in JavaScript, and the one in Python, I think, is not using the re library, which is what my code uses
– Rafa0712
But the explanation of the ? is the same regardless of language
– hkotsubo
Hello Rafa, I recommend you read What is a greedy Regular Expression? and then read: Why Regex should not be used to handle HTML? ... and a few more links at: https://answall.com/search?q=%5Bpython%5D+pegar+links
– Guilherme Nascimento