Scrapy XPath: href or span inside a div

Hello, I’m trying to scrape a page where I need to pick up a link and some text, but I’m struggling because of page variations. There are three possible variations:

1.

<div>
<strong>
    <span style="font-family: arial, helvetica, sans-serif;">
        <a href="www...com.br" target="_blank">Edição</a>&nbsp;-&nbsp;
    </span>
</strong>
<span style="font-family: arial, helvetica, sans-serif;">01/12/2017
</span>
</div>

2.

<div>
<span style="font-family: arial, helvetica, sans-serif;">
    <a href="www...com.br">
        <strong>Edição</strong>
    </a>&nbsp;- 04/12/2017
</span>
</div>

3.

<div>
    <a href="www...com.br">
        <strong>Edição</strong>
    </a>&nbsp;- 05/12/2017
</div>

I need to get the link inside the href and the date. I can pick up the link with:

response.xpath('//a[contains(@href,"www...com.br")]')
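
To get the href value itself (and not just the <a> element), I believe something like this works, appending /@href to the same expression:

response.xpath('//a[contains(@href,"www...com.br")]/@href').extract_first()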

I can’t get the date, though. I’m looking for a solution that gets both the link and the date across these variations.

Thanks in advance for your help.

  • Can you post which page you are trying to parse?

  • Page: http://www.uberlandia.mg.gov.br/?pagina=Conteudo&id=3077

2 answers

1

Based on your example, we can see that there are two patterns:

Dates within the span (cases 1 and 2):

response.xpath('//div/span/text()').extract()

Output:

['01/12/2017\n        ', '\n            ', '\xa0- 04/12/2017\n        ']

Loose dates directly in the div (case 3):

response.xpath('//div/text()').extract()

Output:

['\n        ', '\n        ', '\n    ', '\n        ', '\n    ', '\n        ', '\xa0- 05/12/2017\n    ']

A strategy to solve the problem would be:

1) Check if the first pattern matches;

2) If not found in the first, try the second.

In both cases you would have to clean the data: remove the \n, maybe use a regex to find the DD/MM/YYYY pattern, etc. (a sketch combining both steps is shown below).

To reach these conclusions I created an HTML page with just the example you pasted here; the paths may change on the real page.
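
A minimal sketch of that combined strategy, assuming the two XPath patterns above and a DD/MM/YYYY date (the extract_dates name and the fallback order are just illustrative, not tested against the real page):

import re

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')

def extract_dates(response):
    # Cases 1 and 2: the date text lives in a <span> inside the <div>
    texts = response.xpath('//div/span/text()').extract()
    # Case 3: fall back to the loose text nodes directly under the <div>
    if not any(DATE_RE.search(t) for t in texts):
        texts = response.xpath('//div/text()').extract()
    # Clean the data: drop the \n noise and keep only the DD/MM/YYYY matches
    return [DATE_RE.search(t).group() for t in texts if DATE_RE.search(t)]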

1

You can do it the way shown below; I prefer to use BeautifulSoup, which is much simpler and solves this perfectly.

from bs4 import BeautifulSoup
import scrapy

class MgUberlandia(scrapy.Spider):
    name = 'mg_uberlandia'
    start_urls = ['http://www.uberlandia.mg.gov.br/?pagina=Conteudo&id=3077']

    def parse(self, response):
        # Parse the response body with BeautifulSoup; passing an explicit
        # parser avoids the "no parser specified" warning
        soup = BeautifulSoup(response.body_as_unicode(), 'html.parser')

        # Find every <a> tag on the page and print its href
        for link in soup.find_all('a'):
            print(link.get('href'))
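
If you also need the date next to each link, something along these lines should work; the helper below is my own sketch (the function name, the regex, and the walk up to the enclosing <div> are assumptions based on the three variations in the question, not tested against the live page). You could call it from parse() with the soup built above:

import re

def links_with_dates(soup):
    # Return (href, date) pairs found in the parsed page
    results = []
    for link in soup.find_all('a', href=True):
        # In all three variations the date sits somewhere in the text of
        # the enclosing <div>, so search that text for DD/MM/YYYY
        div = link.find_parent('div')
        match = re.search(r'\d{2}/\d{2}/\d{4}', div.get_text()) if div else None
        results.append((link['href'], match.group() if match else None))
    return results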
