How can I track the execution and failures of Spiders?


I'm developing a module to collect information about the Spiders that run on my company's system. Below is the model where we store the start of each operation and the job. I would like to check whether the jobs ran correctly and fill in the remaining fields.


# -*- coding: utf-8 -*-

from django.db import models

class CrawlerJob(models.Model):

    job_id = models.CharField(verbose_name=u'ID da Tarefa', editable=False,
                              max_length=255, blank=True, null=True,)

    job_started_dt = models.DateTimeField(
        verbose_name=u'data de início da tarefa', blank=True, null=True)

    job_has_errors = models.BooleanField(
        verbose_name=u'erros?', blank=True, default=False)

    job_finished_dt = models.DateTimeField(
        verbose_name=u'data de fim da tarefa', blank=True, null=True)

tasks.py:

# -*- coding: utf-8 -*-

from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI
import celery
import datetime

scrapy_url = 'http://localhost:6800'
scrapyd = ScrapydAPI(scrapy_url)

def funcao_assincrona():

    crj = CrawlerJob()
    job_id = scrapyd.schedule('projeto_X', 'rodar_spider')
    crj.job_id = job_id
    # record the start time and persist the job
    crj.job_started_dt = timezone.now()
    crj.save()

One idea was to read the system logs and check the stats that Scrapy dumps when the spider closes, as shown below.

2015-01-09 12:40:18-0300 [spider] INFO: Closing spider (finished)
2015-01-09 12:40:18-0300 [spider] INFO: Stored jsonlines feed (11 items) in: scrapyd_build/items/projeto_X/spider/5a3bc7ca980e11e4b396600308991ea6.jl
2015-01-09 12:40:18-0300 [spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2448,
     'downloader/request_count': 5,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 2,
     'downloader/response_bytes': 28218,
     'downloader/response_count': 5,
     'downloader/response_status_count/200': 5,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 9, 15, 40, 18, 93445),
     'item_scraped_count': 11,
     'log_count/DEBUG': 19,
     'log_count/INFO': 8,
     'request_depth_max': 4,
     'response_received_count': 5,
     'scheduler/dequeued': 5,
     'scheduler/dequeued/memory': 5,
     'scheduler/enqueued': 5,
     'scheduler/enqueued/memory': 5,
     'start_time': datetime.datetime(2015, 1, 9, 15, 40, 13, 90020)}
2015-01-09 12:40:18-0300 [spider] INFO: Spider closed (finished)

Is there a more practical way to get this information? And, when something goes wrong, to get the messages generated by the errors or exceptions?

Since it is Scrapy that has access to the real stats (Scrapyd only runs the jobs), I think the way to solve this is to use a spider middleware that sends the crawler statistics to your application when the spider finishes.

You will also need a way to update your application from inside the Scrapy process, and to trigger that update from the spider middleware.

Here is a draft:

from scrapy import signals
import os

class UpdateStatsMiddleware(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # register the close_spider method as a callback for the spider_closed signal
        crawler.signals.connect(self.close_spider, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def close_spider(self, spider, reason):
        spider.log('Finishing spider with reason: %s' % reason)
        stats = self.crawler.stats.get_stats()
        jobid = self.get_jobid()
        self.update_job_stats(jobid, stats)

    def get_jobid(self):
        """Gets jobid through scrapyd's SCRAPY_JOB env variable"""
        return os.environ['SCRAPY_JOB']

    def update_job_stats(self, jobid, stats):
        # TODO: update the stats in the Django application
        pass
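
To fill in the TODO above, one possible starting point is a small helper that turns the stats dict into values for the CrawlerJob fields from the question. This is only a sketch under a few assumptions: stats_to_job_fields is a hypothetical name, it is written for Python 3, naive finish_time values are treated as UTC (which matches the log in the question, where 12:40 -0300 appears as 15:40), and a job is counted as having errors when Scrapy logged at least one ERROR message (the 'log_count/ERROR' stat) or finish_reason is anything other than 'finished'.

```python
import datetime


def stats_to_job_fields(stats):
    """Map a Scrapy stats dict to values for CrawlerJob fields (hypothetical helper)."""
    finished = stats.get('finish_time')
    if finished is not None and finished.tzinfo is None:
        # Assumption: naive timestamps in the stats are UTC
        finished = finished.replace(tzinfo=datetime.timezone.utc)

    # Assumption: any logged ERROR, or an abnormal finish reason, counts as a failure
    has_errors = (
        stats.get('log_count/ERROR', 0) > 0
        or stats.get('finish_reason') != 'finished'
    )
    return {'job_finished_dt': finished, 'job_has_errors': has_errors}
```

If Django is configured inside the Scrapy process, update_job_stats could then do something like CrawlerJob.objects.filter(job_id=jobid).update(**stats_to_job_fields(stats)). Otherwise the same dict could be sent to an endpoint of your application. Either way, the middleware only runs if it is enabled in the project's SPIDER_MIDDLEWARES setting.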

