1
You can use the library Beautifulsoup at the command .find_all()
to extract all tags td
of a website, without specifying any class
, name
or id
, for example.
Code:
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/Web_scraping'
html_page = requests.get(url)
html_source = html_page.text
soup = BeautifulSoup(html_source, 'html.parser')
td_tags = soup.find_all('td')
for td in td_tags:
print(td, '\n')
Output:
<td class="mbox-image"><div style="width:52px"><img alt="Globe icon." data-file-height="290" data-file-width="350" height="40" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/73px-Ambox_globe_content.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/97px-Ambox_globe_content.svg.png 2x" width="48"/></div></td>
<td class="mbox-text"><span class="mbox-text-span">The examples and perspective in this article <b>deal primarily with the United States and do not represent a <a href="/wiki/Wikipedia:WikiProject_Countering_systemic_bias" title="Wikipedia:WikiProject Countering systemic bias">worldwide view</a> of the subject</b>. <span class="hide-when-compact">You may <a class="external text" href="//en.wikipedia.org/w/index.php?title=Web_scraping&action=edit">improve this article</a>, discuss the issue on the <a href="/wiki/Talk:Web_scraping" title="Talk:Web scraping">talk page</a>, or <a href="/wiki/Wikipedia:Article_wizard" title="Wikipedia:Article wizard">create a new article</a>, as appropriate.</span> <small><i>(October 2015)</i></small> <small class="hide-when-compact"><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove this template message</a>)</i></small></span></td>
I understand, but I will not be able to save the information I want in the bank, because I think it is not possible to make a regular expression to identify the text ex: Pick the text of td that starts with "x" and ends with "z"
– DaniloAlbergardi
Explain better what you want to do... I find it difficult "not to be possible"... it is all a matter of data processing. You can use the method
.text
in thetd
and join filters as.startswith("x")
and.endswith("z")
, to select your text, or even a well-defined regular expression, for example.– dot.Py
It worked out, buddy, thanks!
– DaniloAlbergardi