Error with scrapy requests


I have a CSV file with some URLs that need to be accessed.

http://www.icarros.com.br/Audi, Audi
http://www.icarros.com.br/Fiat, Fiat
http://www.icarros.com.br/Chevrolet, Chevrolet

I've got a spider that makes all the requests.

import scrapy
import csv
from scrapy.selector import Selector

class ModelSpider(scrapy.Spider):
    name = "config_brands"
    start_urls = [
        'http://www.icarros.com/'
    ]

    def parse(self, response):
        file = open("files/brands.csv")
        reader = csv.reader(file)

        for line in reader:
            yield scrapy.Request(line[0], self.success_connect, self.error_connect)

    def success_connect(self, response):
        self.log('Entered URL: %s' % response.url)

    def error_connect(self, response):
        self.log('Could not access %s' % response.url)

When I try to run the spider it cannot connect to any of the URLs, yet the same URLs open normally in the browser. And my errback function doesn't work either.

Debug:

2016-09-09 10:17:00 [scrapy] DEBUG: Crawled (200) <GET http://www.icarros.com.br/principal/index.jsp> (referer: None)
2016-09-09 10:17:00 [scrapy] DEBUG: Retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 1 times): 400 Bad Request
2016-09-09 10:17:07 [scrapy] DEBUG: Retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 2 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Gave up retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 3 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Crawled (400) <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (referer: http://www.icarros.com.br/principal/index.jsp)

1 answer

There are at least two ways to solve this.

  1. The first is to tell the middleware that you want to handle response codes outside the 200-300 range, by listing them in handle_httpstatus_list:

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        handle_httpstatus_list = [400, 403]
    

    See the documentation for more details.

    And my errback function doesn't work either.

    Specify the callback and errback as keyword arguments; passed positionally, error_connect ends up as the method argument of scrapy.Request, which is why your debug log shows the bound method in place of GET and the server answers 400 Bad Request:

    yield scrapy.Request(line[0], callback=self.success_connect,
                         errback=self.error_connect)
    

    With these two changes your code should work as expected (a combined sketch appears after this list).


  2. An alternative is to use the start_requests method, which is more appropriate than parse here: you want to issue requests for a list of URLs, while parse is normally used to process responses.

    You can do it like this:

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
    
        def start_requests(self):
            with open('brands.csv', 'r') as f:
                reader = csv.reader(f)
    
                for url, modelo in reader:
                yield scrapy.Request(url, callback=self.success_connect,
                                     errback=self.error_connect)
    

    In success_connect you handle the received response; for example:

    def success_connect(self, response):
        self.logger.info('Entered URL: {}'.format(response.url))
    
        anuncios = response.xpath('//div[@class="dados_veiculo"]')
    
        for anuncio in anuncios:
            titulo = anuncio.xpath('a[@class="clearfix"]/@title').extract()[0]
            valor = anuncio.xpath('a/p/text()').extract()[0]
    
            # To handle accented characters
            titulo = titulo.encode('utf-8')
            valor = valor.encode('utf-8')
    
            print("{}: {}".format(titulo, valor))
    

    In error_connect you handle or report the failure. Note that an errback receives a Twisted Failure object, which has no url attribute; the original request is available as failure.request:

    def error_connect(self, failure):
        self.logger.error('Could not access: {}'.format(failure.request.url))
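
For reference, here is a minimal sketch putting both pieces together: start_requests reads the CSV, the callbacks are passed as keyword arguments, and handle_httpstatus_list lets success_connect inspect 400/403 responses as well. The files/brands.csv path and the log messages are taken from the question; treat it as a sketch, not a drop-in replacement:

    import csv

    import scrapy


    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        # Let 400/403 responses reach the callback instead of being filtered out
        handle_httpstatus_list = [400, 403]

        def start_requests(self):
            with open('files/brands.csv') as f:
                for url, modelo in csv.reader(f):
                    # Keyword arguments matter: the third positional argument
                    # of scrapy.Request is `method`, not `errback`
                    yield scrapy.Request(url, callback=self.success_connect,
                                         errback=self.error_connect)

        def success_connect(self, response):
            if response.status == 200:
                self.logger.info('Entered URL: {}'.format(response.url))
            else:
                self.logger.warning('Status {} for {}'.format(response.status,
                                                              response.url))

        def error_connect(self, failure):
            self.logger.error('Could not access: {}'.format(failure.request.url))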
    

If you prefer to handle the exceptions raised while processing the request in more detail, take a look at this example in the documentation.
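
Along the lines of that documentation example, the errback can distinguish the type of failure it received. A sketch; HttpError comes from Scrapy's httperror spider middleware and the connection errors from Twisted:

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    def error_connect(self, failure):
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # The server answered, but with a non-2xx status code
            response = failure.value.response
            self.logger.error('HttpError on {}'.format(response.url))
        elif failure.check(DNSLookupError):
            # The domain name could not be resolved
            request = failure.request
            self.logger.error('DNSLookupError on {}'.format(request.url))
        elif failure.check(TimeoutError, TCPTimedOutError):
            # The request timed out
            request = failure.request
            self.logger.error('TimeoutError on {}'.format(request.url))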
