I have no Python experience, but I decided to try something with Scrapy as a test. I'm trying to collect the existing articles on a particular page, specifically those inside a DIV element with the ID devBody.
My aim is to obtain each article's title and its URL, so I set up a rule to restrict crawling to the content of that element.
It turns out that, for some reason, link collection is not limited to that element: irrelevant links get collected as well, which then "shuffles" the title-URL pairs when I try to build them. Here is the code:
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from stack.items import StackItem
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]
    rules = (Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]'), callback='parse'),)
    def parse(self, response):
        entries = response.xpath('//h4')
        items = []
        # using a counter here is surely not the best solution, but it was the only
        # one I found to avoid receiving all the collected data in a single object
        i = 0
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1
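To illustrate the mis-pairing I'm describing, here is a stripped-down sketch (plain Python, hypothetical data, no Scrapy) of why indexing two page-wide lists by a shared counter falls apart as soon as any extra link is captured:

```python
# Hypothetical data: the titles I expect inside the devBody div.
titles = ["Article A", "Article B"]

# What the spider actually sees: the page-wide list of link URLs
# includes an extra, unrelated link before the article links.
urls = ["/nav/home", "/articles/a", "/articles/b"]

# Pairing the two lists by a shared counter, as my spider does:
pairs = [(titles[i], urls[i]) for i in range(len(titles))]

# Every title is now attached to the wrong URL.
print(pairs)  # [('Article A', '/nav/home'), ('Article B', '/articles/a')]
```

One stray link shifts every subsequent pair by one position, which matches the "shuffled" output I'm getting.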
To figure out what's going on, I turned to the browser's Developer Tools and, judging by the XPath queries there, everything seems correct. However, when I replicate the same logic in the code, something goes wrong. According to the logs, 57 links were collected, but quite a few of them fall outside the intended scope (the div with the devBody ID).
I have no idea what might be causing this behavior. I am using Scrapy 1.0.5 and Python 2.7.
Thanks in advance for any help.