Problems with restrict_xpaths parameter in a Crawler

I have no Python experience, but I decided to give Scrapy a try as a test. I'm trying to collect the articles listed on a particular page, specifically inside a div element with the ID devBody.

My aim is to obtain each article's title and URL, so I set up a rule meant to restrict the crawl to the contents of that element.

It turns out that, for some reason, link collection is not limited to that element: irrelevant links are collected as well, which "shuffles" the title-URL pairs when I try to build them. Here is the code:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]'), callback='parse'),)

    def parse(self, response):
        entries = response.xpath('//h4')
        items = []
        # using a counter here is surely not the best solution, but it was
        # the only one I found to avoid getting all the collected data in a
        # single object
        i = 0
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1
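For context, StackItem lives in the project's stack/items.py; a minimal sketch, assuming only the two fields used above:

import scrapy

class StackItem(scrapy.Item):
    # the two fields the spider fills in
    title = scrapy.Field()
    url = scrapy.Field()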

To figure out what's going on, I turned to the browser's developer tools and, running XPath queries there, everything seems correct. When I replicate the same logic in the code, however, something goes wrong. According to the log, 57 links were collected, but quite a few of them fall outside the intended scope (the div with the devBody ID).
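The same XPath can also be checked outside the browser with Scrapy's interactive shell, which downloads the page and exposes it as response:

scrapy shell "http://dev.mysql.com/tech-resources/articles/"
>>> # number of links actually inside the devBody div
>>> len(response.xpath('//div[@id="devBody"]//a'))
>>> # the title/href pairs the spider is after
>>> response.xpath('//div[@id="devBody"]/h4/a/text()').extract()
>>> response.xpath('//div[@id="devBody"]/h4/a/@href').extract()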

I have no idea what might be causing this behavior. I am using version 1.0.5 of Scrapy and Python 2.7.

Thanks in advance for any help.

1 answer

Based on this reply, I changed the structure of the code so it works as intended. Here is the final result:

from scrapy.spiders import Spider
from stack.items import StackItem

class StackSpider(Spider):
    # let 403/404 responses through to the callback instead of dropping them
    handle_httpstatus_list = [403, 404]
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["https://dev.mysql.com/tech-resources/articles/"]

    def parse(self, response):
        # each article on the listing page is an <h4> inside the devBody div
        for row in response.xpath('//div[@id="devBody"]/h4'):
            item = StackItem()
            item['title'] = row.xpath('a/text()').extract_first()
            # build the absolute URL from the relative href
            item['url'] = response.urljoin(row.xpath('a/@href').extract_first())
            yield item
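With the spider in place, it can be run from the project directory and the items exported in one go, for example to JSON:

scrapy crawl stack -o articles.json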

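As for why the original rule was ignored: in Scrapy, the rules attribute is only processed by CrawlSpider, not by the plain Spider the question subclasses, and CrawlSpider additionally reserves parse for its own internal logic, so a rule's callback needs a different name. For comparison, a rules-based version might look roughly like this; it is only a sketch, and it assumes each followed article page carries its title in an h1, which may not match the real pages:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from stack.items import StackItem

class StackCrawlSpider(CrawlSpider):
    name = "stack_crawl"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["https://dev.mysql.com/tech-resources/articles/"]

    # only extract links found inside the devBody div; the callback is
    # deliberately not called "parse", which CrawlSpider uses internally
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]'),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # runs on each followed article page, not on the listing page
        item = StackItem()
        # hypothetical: assumes the article title sits in an <h1>
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['url'] = response.url
        yield item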