I need help with a Python crawler


from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crawler.items import crawlerlistItem

class MySpider(BaseSpider):
    name = "epoca"
    allowed_domains = ["epocacosmeticos.com.br"]
    start_urls = ["http://www.epocacosmeticos.com.br/maquiagem"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//span[@class='pl']")
        items = []
        for titles in titles:
            item = crawlerlistItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            # without this append, the method always returns an empty list
            items.append(item)
        return items

I have this file, but I want to collect every URL on epocacosmeticos.com.br along with the product name, title and URL, without duplicated information. Can someone help me?

2 answers


If the only problem is that your items end up containing duplicate information, you can check whether an item already exists before appending it:

item["title"] = titles.select("a/text()").extract()
item["link"] = titles.select("a/@href").extract()
if item not in items:
    items.append(item)

To prevent duplicates in a collection, my first thought was to suggest a set(), but since item behaves like a dictionary (it is mutable), you might as well use the check above, which also avoids extra passes over the data.
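
As a rough sketch of the set() idea: instead of storing the items themselves in a set, one can keep a set of a hashable key derived from each item. The example below uses plain dictionaries and a hypothetical helper named dedupe_by_link (neither comes from the original spider), and it assumes the link is stored as a list, which is what extract() returns.

```python
def dedupe_by_link(items):
    """Return items in their original order, keeping the first item per link."""
    seen = set()   # hashable keys of the links already kept
    unique = []
    for item in items:
        # a list cannot be stored in a set, so convert the link list to a tuple
        key = tuple(item["link"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Made-up sample data standing in for scraped items
scraped = [
    {"title": ["Batom"], "link": ["/batom"]},
    {"title": ["Base"], "link": ["/base"]},
    {"title": ["Batom"], "link": ["/batom"]},
]
unique_items = dedupe_by_link(scraped)
print(len(unique_items))
```

A set lookup takes roughly constant time on average, whereas "item not in items" scans the whole list on every check, so this variant may matter once the number of items grows.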


The solution proposed by Miguel is valid for this Spider, since it makes only one request (the first, made for the URL in start_urls). However, it is very common to have Spiders that, after collecting the data from a page in the parse() method (or in another callback), make new requests for URLs found on that page. In that case a list local to a single callback cannot catch duplicates that show up on different pages.

Anyway, in Scrapy projects it is a good practice to separate the logic of validation and data transformation into Item Pipelines.

To do this, create a pipeline like the example below in the file pipelines.py inside your project folder:

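from scrapy.exceptions import DropItem

class DropDuplicatesPipeline(object):
    def __init__(self):
        # links of every item that has already passed through the pipeline
        self.urls_seen = set()

    def process_item(self, item, spider):
        # item['link'] must be hashable (a string); if the spider stores the
        # list returned by extract(), convert it first, e.g. tuple(item['link'])
        if item['link'] in self.urls_seen:
            raise DropItem('Duplicate item found: {}'.format(item['link']))
        # remember the link so later items with the same link are dropped
        self.urls_seen.add(item['link'])
        return item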

And enable it in the file settings.py with the following snippet:

ITEM_PIPELINES = {
    'your_project.pipelines.DropDuplicatesPipeline': 300,
}

Once that is done, every item extracted by your Spider will go through the process_item method above and will be rejected if it has already been extracted.
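
To see the effect without running a crawl, the flow can be approximated in plain Python. This is only a sketch: DropItem here is a local stand-in for scrapy.exceptions.DropItem, the run_pipeline helper is hypothetical, and the sample items are made up. In a real project Scrapy itself calls process_item for each item.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class DropDuplicatesPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item["link"] in self.urls_seen:
            raise DropItem("Duplicate item found: {}".format(item["link"]))
        self.urls_seen.add(item["link"])
        return item


def run_pipeline(pipeline, items):
    """Hypothetical helper imitating how items pass through a pipeline one by one."""
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem:
            dropped.append(item)
    return kept, dropped


# Made-up items, as if collected from two different pages
items = [
    {"title": "Batom", "link": "/batom"},
    {"title": "Base", "link": "/base"},
    {"title": "Batom", "link": "/batom"},
]
kept, dropped = run_pipeline(DropDuplicatesPipeline(), items)
print(len(kept), len(dropped))
```

Because the set lives on the pipeline instance rather than inside a callback, it keeps its contents across every page the Spider visits during a run.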

  • Great, thank you so much

  • With this I can get the title and URL. If I also wanted to add the product name, how would I do that?

