As has already been said here (and also here, here and, mainly, here, and in many other places), regex is not the ideal tool for manipulating HTML (read each of those links to understand all the reasons).
In your case, an option would be to use a dedicated library, such as Beautiful Soup. With it, it’s easy to find all the links on a page:
from bs4 import BeautifulSoup

html = ...  # get the page's HTML...

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link['href'])
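Just as an illustration of where html could come from (the URL below is merely an example, and any HTTP client, such as requests, would work just as well), a minimal sketch using urllib.request:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# example URL, only for illustration
with urlopen('https://pt.wikipedia.org/wiki/Python') as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))  # .get avoids an error for an a tag without href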
A regex might work, but if you read the links indicated at the beginning of the answer, you will see that there are many situations a regex cannot handle (or can handle, but it gets so complicated that it is not worth it).
But nothing prevents you from using regex along with Beautiful Soup, as it is now a more restricted and controlled environment:
import re

for link in soup.find_all('a', href=re.compile(r'https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)')):
    print(link['href'])
That is, here it is not so problematic to use regex, because I am sure the search is done only in the href attribute of the a tags (without the false positives a regex could bring, such as the tag being commented out, the link being in another tag - or in the middle of the JavaScript code that came along with the page - or the text not even being HTML, etc.). In this case, I am looking for http or https links that belong to one of the domains indicated (google.com.br, seila.org or pt.wikipedia.org).
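To illustrate the kind of false positive mentioned above, a small sketch over a made-up snippet of HTML: a plain regex also picks up a link inside an HTML comment, while Beautiful Soup ignores it, because the commented tag is not a real a tag:

import re
from bs4 import BeautifulSoup

# made-up HTML, only to show the false positive
html = '''
<!-- <a href="http://google.com.br/antiga">commented out</a> -->
<a href="http://seila.org/pagina">real link</a>
'''

regex = re.compile(r'https?://(?:google\.com\.br|seila\.org|pt\.wikipedia\.org)[^"\s]*')
print(regex.findall(html))  # finds both links, including the commented-out one

soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a')])  # only the real link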
But if the idea is to validate URLs, why not use a dedicated lib? You can use, for example, urllib:
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# checks whether a URL is valid
def url_valida(url):
    try:
        parsed_url = urlparse(url)
        if not parsed_url:
            return False
        # must be http or https, and the hostname must be google.com or seila.org;
        # or, if it is pt.wikipedia.org, also check the rest of the URL (wiki/Diretorio1/etc...)
        return parsed_url.scheme in ('http', 'https') and \
               (parsed_url.hostname in ('google.com', 'seila.org')
                or (parsed_url.hostname == 'pt.wikipedia.org'
                    and parsed_url.path == '/wiki/Diretorio1/diretorio2/diretorio3/naoseikkk5'))
    except ValueError:
        return False

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=url_valida):
    print(link['href'])
So I check whether the link is http or https, and whether the address is one of the ones I need (google.com or seila.org, or, if it is pt.wikipedia.org, the rest of the URL must be /wiki/Diretorio1/etc...).
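A quick check of url_valida above, with some hypothetical URLs just to exercise the rules:

# hypothetical URLs, only to exercise url_valida
urls = [
    'http://google.com',                           # valid
    'https://seila.org/qualquer/coisa',            # valid (any path on seila.org)
    'https://pt.wikipedia.org/wiki/Diretorio1/diretorio2/diretorio3/naoseikkk5',  # valid
    'https://pt.wikipedia.org/wiki/Outra_Pagina',  # invalid (wrong path)
    'ftp://google.com',                            # invalid (not http/https)
]
for url in urls:
    print(url, url_valida(url))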
For the record, your regex did not work because the shortcut \w matches any letter, digit or the underscore character _ (that is, in practice you were picking up anything that looked like a URL). But at the end you used \w+?, which uses the lazy quantifier, which takes as few characters as possible (read here and here to understand it better). That is, if the URL is http://google.com.br, the regex only takes http://google.com.b - see here.
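A small demonstration of the difference between the greedy and the lazy quantifier (the pattern below is only an approximation, not the exact regex from the question):

import re

url = 'http://google.com.br'

# greedy: takes as many characters as possible
print(re.search(r'https?://[\w.]+', url).group())   # http://google.com.br

# lazy: takes as few characters as possible (here, a single character after ://)
print(re.search(r'https?://[\w.]+?', url).group())  # http://g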
You could even use the regex suggested above (https?://(google\.com\.br|seila\.org|pt\.wikipedia\.org)), but as I said, it is very prone to false positives. The tag may be commented out, the link may be "loose" in the text (or be an attribute of another tag, or be in the middle of the JavaScript that came along with the page, etc.). Regex only looks at the text itself, without taking its structure into account (the text does not even need to be HTML). With Beautiful Soup (or any other lib made for HTML/XML), you get a more reliable way of handling the data (besides being less complicated, since a regex to validate URLs is far from trivial, and if it also has to handle the special cases already mentioned, it becomes ever more complicated and impractical to use).
Not to mention that Beautiful Soup gives you more control over the tags. For example, if you want the text of the a tag, just use link.text. If the a tag has other tags inside it (such as an img, etc.), you can get all of its content with link.decode_contents(), and so on. With regex, you would have to keep growing it to cover these cases, complicating it even further. It is not worth it.
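For example (the snippet below is made up, just to show the difference between the two):

from bs4 import BeautifulSoup

# made-up snippet, only to compare .text and .decode_contents()
html = '<a href="http://seila.org"><img src="logo.png"> go to the site</a>'
link = BeautifulSoup(html, 'html.parser').find('a')

print(link.text)               # only the text: " go to the site"
print(link.decode_contents())  # the whole inner HTML, img tag included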
And simplistic solutions can fool you, because it looks like they "worked", as in the example from another answer. It uses .*?, which is basically "zero or more occurrences of any character", i.e. it accepts anything that comes after http or https, until it finds a quote. So it is not restricting the links at all, and the use of . together with the lazy quantifier (see the links already mentioned above) makes the regex extremely inefficient. So much so that the example given there times out, that is how inefficient it is.
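Just to illustrate how permissive that is (a rough approximation of the pattern described, applied to a made-up snippet): it accepts any address at all, inside an a tag or not:

import re

# made-up text, only to show that the pattern accepts anything up to the next quote
text = 'href="http://anything.whatever/no-relation" and also var x = "https://another.site/something";'
print(re.findall(r'https?://.*?"', text))
# ['http://anything.whatever/no-relation"', 'https://another.site/something"']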
Don't get me wrong, regex is cool - I particularly like it a lot - but it is not always the best solution.
Do not use regex to manipulate HTML. In your case, an alternative is to use Beautiful Soup: https://answall.com/a/440262/112052
– hkotsubo
Because a regex to get URLs is much more complicated than it seems: https://stackoverflow.com/q/161738
– hkotsubo