Error with scrapy requests


I have a CSV file with some URLs that need to be accessed.

http://www.icarros.com.br/Audi, Audi
http://www.icarros.com.br/Fiat, Fiat
http://www.icarros.com.br/Chevrolet, Chevrolet

I've got a spider that makes all the requests.

import scrapy
import csv
from scrapy.selector import Selector

class ModelSpider(scrapy.Spider):
    name = "config_brands"
    start_urls = [
        'http://www.icarros.com/'
    ]

    def parse(self, response):
        file = open("files/brands.csv")
        reader = csv.reader(file)

        for line in reader:
            yield scrapy.Request(line[0], self.success_connect, self.error_connect)

    def success_connect(self, response):
        self.log('Entered URL: %s' % response.url)

    def error_connect(self, response):
        self.log('Could not access %s' % response.url)

When I try to run the spider it cannot connect to any of the URLs, yet the same URLs open normally in the browser. And my errback function doesn't work either.

Debug:

2016-09-09 10:17:00 [scrapy] DEBUG: Crawled (200) <GET http://www.icarros.com.br/principal/index.jsp> (referer: None)
2016-09-09 10:17:00 [scrapy] DEBUG: Retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 1 times): 400 Bad Request
2016-09-09 10:17:07 [scrapy] DEBUG: Retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 2 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Gave up retrying <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (failed 3 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Crawled (400) <<BOUND METHOD MODELSPIDER.ERROR_CONNECT OF <MODELSPIDER 'CONFIG_BRANDS' AT 0X7F7D18B45990>> http://www.icarros.com.br/Audi> (referer: http://www.icarros.com.br/principal/index.jsp)

1 answer

There are at least two ways to solve this.

  1. The first is to tell the middleware that you want to handle response codes outside the 200-300 range, by listing them in handle_httpstatus_list:

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        handle_httpstatus_list = [400, 403]
    

    See the documentation for more details.

    And my errback function doesn't work either.

    Specify the callback and errback as keyword arguments; passed positionally, error_connect ends up as the method argument of scrapy.Request, which is why your debug log shows the bound method in place of GET and the server answers 400 Bad Request:

    yield scrapy.Request(line[0], callback=self.success_connect,
                         errback=self.error_connect)
    

    With these two changes your code should work as expected (a combined sketch appears after this list).


  2. An alternative is to use the start_requests method, which is more appropriate than parse here: you want to issue requests for a list of URLs, while parse is normally used to process responses.

    You can do it like this:

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
    
        def start_requests(self):
            with open('brands.csv', 'r') as f:
                reader = csv.reader(f)
    
                for url, modelo in reader:
                yield scrapy.Request(url, callback=self.success_connect,
                                     errback=self.error_connect)
    

    In success_connect you handle the received response; for example:

    def success_connect(self, response):
        self.logger.info('Entered URL: {}'.format(response.url))
    
        anuncios = response.xpath('//div[@class="dados_veiculo"]')
    
        for anuncio in anuncios:
            titulo = anuncio.xpath('a[@class="clearfix"]/@title').extract()[0]
            valor = anuncio.xpath('a/p/text()').extract()[0]
    
            # To handle accented characters
            titulo = titulo.encode('utf-8')
            valor = valor.encode('utf-8')
    
            print("{}: {}".format(titulo, valor))
    

    In error_connect you handle or report the failure. Note that an errback receives a Twisted Failure object, which has no url attribute; the original request is available as failure.request:

    def error_connect(self, failure):
        self.logger.error('Could not access: {}'.format(failure.request.url))
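
For reference, here is a minimal sketch putting both pieces together: start_requests reads the CSV, the callbacks are passed as keyword arguments, and handle_httpstatus_list lets success_connect inspect 400/403 responses as well. The files/brands.csv path and the log messages are taken from the question; treat it as a sketch, not a drop-in replacement:

    import csv

    import scrapy


    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        # Let 400/403 responses reach the callback instead of being filtered out
        handle_httpstatus_list = [400, 403]

        def start_requests(self):
            with open('files/brands.csv') as f:
                for url, modelo in csv.reader(f):
                    # Keyword arguments matter: the third positional argument
                    # of scrapy.Request is `method`, not `errback`
                    yield scrapy.Request(url, callback=self.success_connect,
                                         errback=self.error_connect)

        def success_connect(self, response):
            if response.status == 200:
                self.logger.info('Entered URL: {}'.format(response.url))
            else:
                self.logger.warning('Status {} for {}'.format(response.status,
                                                              response.url))

        def error_connect(self, failure):
            self.logger.error('Could not access: {}'.format(failure.request.url))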
    

If you prefer to handle the exceptions raised while processing the request in more detail, take a look at this example in the documentation.
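
Along the lines of that documentation example, the errback can distinguish the type of failure it received. A sketch; HttpError comes from Scrapy's httperror spider middleware and the connection errors from Twisted:

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    def error_connect(self, failure):
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # The server answered, but with a non-2xx status code
            response = failure.value.response
            self.logger.error('HttpError on {}'.format(response.url))
        elif failure.check(DNSLookupError):
            # The domain name could not be resolved
            request = failure.request
            self.logger.error('DNSLookupError on {}'.format(request.url))
        elif failure.check(TimeoutError, TCPTimedOutError):
            # The request timed out
            request = failure.request
            self.logger.error('TimeoutError on {}'.format(request.url))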
