How to collect text when there is no HTML reference class - Crawler Python

Asked

Viewed 788 times

1

I have the following situation below: inserir a descrição da imagem aqui

I want to collect "Text to Crawler" that is below, as I will navigate there without class or id?

<td>Texto para crawler</td>

1 answer

1


You can use the library Beautifulsoup at the command .find_all() to extract all tags td of a website, without specifying any class, name or id, for example.

Code:

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Web_scraping'

html_page = requests.get(url)
html_source = html_page.text

soup = BeautifulSoup(html_source, 'html.parser')

td_tags = soup.find_all('td')

for td in td_tags:
    print(td, '\n')

Output:

<td class="mbox-image"><div style="width:52px"><img alt="Globe icon." data-file-height="290" data-file-width="350" height="40" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/48px-Ambox_globe_content.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/73px-Ambox_globe_content.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Ambox_globe_content.svg/97px-Ambox_globe_content.svg.png 2x" width="48"/></div></td> 

<td class="mbox-text"><span class="mbox-text-span">The examples and perspective in this article <b>deal primarily with the United States and do not represent a <a href="/wiki/Wikipedia:WikiProject_Countering_systemic_bias" title="Wikipedia:WikiProject Countering systemic bias">worldwide view</a> of the subject</b>. <span class="hide-when-compact">You may <a class="external text" href="//en.wikipedia.org/w/index.php?title=Web_scraping&amp;action=edit">improve this article</a>, discuss the issue on the <a href="/wiki/Talk:Web_scraping" title="Talk:Web scraping">talk page</a>, or <a href="/wiki/Wikipedia:Article_wizard" title="Wikipedia:Article wizard">create a new article</a>, as appropriate.</span> <small><i>(October 2015)</i></small> <small class="hide-when-compact"><i>(<a href="/wiki/Help:Maintenance_template_removal" title="Help:Maintenance template removal">Learn how and when to remove this template message</a>)</i></small></span></td> 
  • I understand, but I will not be able to save the information I want in the bank, because I think it is not possible to make a regular expression to identify the text ex: Pick the text of td that starts with "x" and ends with "z"

  • Explain better what you want to do... I find it difficult "not to be possible"... it is all a matter of data processing. You can use the method .text in the td and join filters as .startswith("x") and .endswith("z"), to select your text, or even a well-defined regular expression, for example.

  • It worked out, buddy, thanks!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.