I need help on a python Crawler

Question

Asked 8 years, 5 months ago

from scrapy import Spider  # BaseSpider is a deprecated alias of Spider
from crawler.items import crawlerlistItem


class MySpider(Spider):
    name = "epoca"
    allowed_domains = ["epocacosmeticos.com.br"]
    start_urls = ["http://www.epocacosmeticos.com.br/maquiagem"]

    def parse(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector wrapper.
        items = []
        for title in response.xpath("//span[@class='pl']"):
            item = crawlerlistItem()
            # extract() returns a list of every matching string.
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        return items

I have this file, but I want to collect all the URLs on epocacosmeticos.com.br, with product name, title and URL, without any duplicated information. Can someone help me?

2 answers

by Miguel • **29,306** points · Answer 1 · 2017-02-16T08:11:33+00:00

If the only problem is that your items end up containing duplicate information, you can check whether an item already exists before appending it:

...
# inside the for loop of parse(), after item["title"] and item["link"] are set:
if item not in items:
    items.append(item)

To prevent duplicates in a collection, my first thought was to suggest a set(), but since item is a dictionary (mutable, and therefore unhashable) it cannot be stored in a set directly, so the check above is the simplest way to go.
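If a set is still preferred, one option is to track a hashable key derived from each item rather than the item itself. The following is a minimal sketch in plain Python, assuming items shaped like the ones in the question (a mapping with list-valued "title" and "link" fields). The helper name dedupe_by_link and the sample data are hypothetical.

```python
def dedupe_by_link(items):
    """Return items in their original order, keeping the first one per link."""
    seen = set()
    unique = []
    for item in items:
        # extract() yields a list, which is unhashable; a tuple can go in a set.
        key = tuple(item["link"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical sample data standing in for scraped items.
sample = [
    {"title": ["Batom"], "link": ["/batom"]},
    {"title": ["Base"], "link": ["/base"]},
    {"title": ["Batom"], "link": ["/batom"]},
]
unique_items = dedupe_by_link(sample)
print(len(unique_items))  # prints 2
```

Unlike the list membership test, the set lookup does not rescan every stored item, which may matter once the number of items grows.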

by stummjr • **111** points · Answer 2 · 2017-03-01T13:36:37+00:00

The solution proposed by Miguel is valid for this Spider, since it makes only one request (the first, for the URL in start_urls). However, it is very common for Spiders, after collecting data from a page in parse() (or in another callback), to make new requests for URLs found on that page. In that case a list built inside a single callback only sees the items from one response, so it cannot catch duplicates across pages.
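To illustrate the limitation of a per-callback check once several pages are involved, here is a small simulation in plain Python, without Scrapy. The function parse_page and the pages mapping are hypothetical stand-ins for a callback and for downloaded responses; the point is only that each call starts with a fresh list.

```python
# Hypothetical stand-in for two listing pages that share one product link.
pages = {
    "page-1": ["/batom", "/base", "/batom"],
    "page-2": ["/base", "/rimel"],
}


def parse_page(links):
    """Mimic a callback that de-duplicates with a list local to the call."""
    items = []
    for link in links:
        item = {"link": link}
        if item not in items:
            items.append(item)
    return items


# Each page is handled by a separate call, as Scrapy does with callbacks.
collected = []
for name in ("page-1", "page-2"):
    collected.extend(parse_page(pages[name]))

all_links = [item["link"] for item in collected]
print(all_links)  # "/base" shows up twice, once per page
```

The duplicate inside page-1 is removed, but the one shared between the two pages survives, because neither call knows what the other has already returned.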

In any case, in Scrapy projects it is good practice to move validation and data-transformation logic into Item Pipelines.

To do this, create a pipeline like the example below in the file pipelines.py inside your project folder:

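from scrapy.exceptions import DropItem


class DropDuplicatesPipeline(object):
    def __init__(self):
        # Links of every item that has already passed through the pipeline.
        self.urls_seen = set()

    def process_item(self, item, spider):
        # extract() returns a list, which is unhashable; convert it to a
        # tuple so it can be looked up in and added to the set.
        link = item['link']
        key = tuple(link) if isinstance(link, list) else link
        if key in self.urls_seen:
            raise DropItem('Duplicate item found: {}'.format(link))
        self.urls_seen.add(key)
        return item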

And enable it in the file settings.py with the following snippet:

ITEM_PIPELINES = {
    'your_project.pipelines.DropDuplicatesPipeline': 300,
}

Once this is done, every item extracted by your Spider passes through the process_item method above and is dropped if its link has already been seen.
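The drop behaviour can be checked without running a crawl by feeding a few hand-made items through a pipeline of the same shape. This is only a sketch under several assumptions: plain dicts stand in for Scrapy items, spider is passed as None, a fallback DropItem class is defined in case Scrapy is not installed, and list-valued links are converted to tuples (lists cannot be stored in a set, and the spider in the question fills link with the list returned by extract()).

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class DropDuplicatesPipeline(object):
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        link = item["link"]
        # Tuples are hashable, so this also works when 'link' is a list.
        key = tuple(link) if isinstance(link, list) else link
        if key in self.urls_seen:
            raise DropItem("Duplicate item found: {}".format(link))
        self.urls_seen.add(key)
        return item


# Hypothetical items; the third repeats the link of the first.
sample = [{"link": ["/batom"]}, {"link": ["/base"]}, {"link": ["/batom"]}]

pipeline = DropDuplicatesPipeline()
kept, dropped = [], []
for item in sample:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        dropped.append(item)

print(len(kept), len(dropped))  # prints: 2 1
```

In a real crawl Scrapy catches DropItem itself and logs the dropped item, so the try/except here exists only to make the outcome visible.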