Information contained in two Scrapy pages


I’m not a python programmer, but I’m trying to work with the Scrapy application.

[screenshot of the desired result]

The example above shows what I need; it was produced by a Chrome extension.

To explain: I need each post with all of its available information. Some of it comes from the category pages (the short description, among other fields) and some from the post page itself (the long description). They are different pieces of information about the same post.

My question is about the process: in the first loop I have posts that still need information from a second request, which only becomes available after that second response is parsed.

So it would end up like this:

 Post.short_desc = ['xxxx']  # first loop

 Post.long_desc = ['xxx']   # returned by the second loop

How do I do this?

Now it gets a bit more complicated: inside the second loop I also need to add the category and tag URLs to the queue to be processed.

Fila.lista -> Add -> Url

How do I do this?

I don't know how to accomplish this; I'd appreciate any help. Thank you.

1 answer



The traditional way to extract data from multiple pages is to pass data from one request to the next using the meta dictionary.

Here's how it works: in the callback that extracts content from the first page, you build a dict with the initial data:

def parse_pagina_de_listagem(self, response):
    inicial = dict(
        short_desc=response.css('...').extract(),
        ...
    )
    # get the url of the page that has the rest of the data
    url = response.css('...').extract_first()

    # build a request, passing the data along via the meta parameter
    request = scrapy.Request(response.urljoin(url), callback=self.parse_restante)
    request.meta['item'] = inicial
    yield request

Scrapy will send the request asynchronously and make the meta value available on the response.

That way you can recover the initial item in the parse_restante callback, and also schedule requests for further pages from within it:

def parse_restante(self, response):
    # recover the item from meta
    inicial = response.meta['item']

    # extract the rest of the post's data
    yield dict(
        inicial,
        long_desc=response.css('...').extract_first(),
        ...
    )

    # follow other pages, if necessary
    for link in response.css('...').extract():
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_pagina_de_listagem)
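The flow of the two callbacks above can be exercised without running a crawl. Here is a minimal pure-Python simulation of the meta-passing pattern; the data, URL, and function names are illustrative, not part of the original answer:

```python
# Minimal simulation of Scrapy's meta-passing pattern: the first callback
# builds a partial item, attaches it to a "request", and the second
# callback completes it. No Scrapy needed; plain dicts stand in for requests.

def parse_listing(page):
    # first callback: extract the short description and the detail URL
    inicial = {'short_desc': page['short_desc']}
    # stand-in for scrapy.Request(url, callback=..., meta=...)
    return {'url': page['detail_url'],
            'callback': parse_detail,
            'meta': {'item': inicial}}

def parse_detail(page, meta):
    # second callback: recover the partial item and fill in the rest
    return dict(meta['item'], long_desc=page['long_desc'])

# simulate the crawl: listing page first, then the detail page
request = parse_listing({'short_desc': 'intro', 'detail_url': '/post/1'})
item = request['callback']({'long_desc': 'full text'}, request['meta'])
print(item)  # {'short_desc': 'intro', 'long_desc': 'full text'}
```

As a side note, in Scrapy 1.7 and later the `cb_kwargs` argument of `Request` is a cleaner alternative to `meta` for passing data between callbacks, since the values arrive as keyword arguments of the callback instead of a shared dictionary.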


  • Thanks, Elias. You understood exactly what I meant. I had seen a few tutorials and tried this and that, but none of them followed through. I believe it will work now.

  • @Luiz Cool! If you need anything else, just keep asking here. :)

  • Thanks, man! Your reply helped me solve a bug that was driving me crazy here.
