How to manage the operation and failure in the execution of Spiders?

Question

How to manage the operation and failure in the execution of Spiders?

Asked 11 years, 1 month ago

Viewed 126 times

3

I’m developing a module to get information about the Piders that run on the company’s system. Below is the model where we keep the beginning of operations and the job. I would like to validate if the Js were performed in the correct way and fill in the rest of the fields.

py.models

# -*- coding: utf-8 -*-

from django.db import models


class CrawlerJob(models.Model):

    job_id = models.CharField(verbose_name=u'ID da Tarefa', editable=False,
                              max_length=255, blank=True, null=True,)

    job_started_dt = models.DateTimeField(
        verbose_name=u'data de início da tarefa', blank=True, null=True,
        editable=False)

    job_has_errors = models.BooleanField(
        verbose_name=u'erros?', blank=True, default=False)


    job_finished_dt = models.DateTimeField(
        verbose_name=u'data de fim da tarefa', blank=True, null=True,
        editable=False)

tasks py.

# -*- coding: utf-8 -*-

from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI
import celery
import datetime

scrapy_url = 'http://localhost:6800'
scrapyd = ScrapydAPI(scrapy_url)


@periodic_task(run_every=(crontab(hour="6-19")))
def funcao_assincrona():

    crj = CrawlerJob()
    job_id = scrapyd.schedule('projeto_X', 'rodar_spider')
    crj.job_id = job_id
    crj.job_started_dt = timezone.now()
    crj.save()

One idea for this was to have access to the system logs and check the generated json as follows below.

2015-01-09 12:40:18-0300 [spider] INFO: Closing spider (finished)
2015-01-09 12:40:18-0300 [spider] INFO: Stored jsonlines feed (11 items) in: scrapyd_build/items/projeto_X/spider/5a3bc7ca980e11e4b396600308991ea6.jl
2015-01-09 12:40:18-0300 [spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2448,
     'downloader/request_count': 5,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 2,
     'downloader/response_bytes': 28218,
     'downloader/response_count': 5,
     'downloader/response_status_count/200': 5,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 9, 15, 40, 18, 93445),
     'item_scraped_count': 11,
     'log_count/DEBUG': 19,
     'log_count/INFO': 8,
     'request_depth_max': 4,
     'response_received_count': 5,
     'scheduler/dequeued': 5,
     'scheduler/dequeued/memory': 5,
     'scheduler/enqueued': 5,
     'scheduler/enqueued/memory': 5,
     'start_time': datetime.datetime(2015, 1, 9, 15, 40, 13, 90020)}
2015-01-09 12:40:18-0300 [spider] INFO: Spider closed (finished)

Is there any more practical way to get this information? And in case of errors get the messages generated by possible errors or exceptions?

1 answer

Browser other questions tagged python web-application web-crawler scrapy

You are not signed in. Login or sign up in order to post.

by elias • **3,132** points · Answer 1 · 2015-01-13T02:52:51+00:00

Well, as who has access to the real Stats is the Scrapy (the scrapyd only runs the Jobs), I think the way to solve this problem is to use a Spider middleware send Crawler statistics to your application when Spider is finished.

You will also need a way to update the application in a Scrapy Spider, and trigger this in the Spider middleware.

Here is a draft:

from scrapy import signals
import os

class UpdateStatsMiddleware(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # registra método close_spider como callback para o sinal spider_closed
        crawler.signals.connect(self.close_spider, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def close_spider(self, spider, reason):
        spider.log('Finishing spider with reason: %s' % reason)
        stats = self.crawler.stats.get_stats()
        jobid = self.get_jobid()
        self.update_job_stats(jobid, stats)

    def get_jobid(self):
        """Gets jobid through scrapyd's SCRAPY_JOB env variable"""
        return os.environ['SCRAPY_JOB']

    def update_job_stats(self, jobid, stats):
        # TODO: atualizar as stats na aplicação Django
        pass

How to manage the operation and failure in the execution of Spiders?

1 answer

Read more: