Scrapy different pages

I am facing a problem and ended up getting confused, so I reverted the code to a working state.

# -*- coding: utf-8 -*-
import scrapy
from mbu2.items import Mbu2Item2
import urlparse
from scrapy.http import Request

class Spider2Spider(scrapy.Spider):
    name = "spider2"
    # allowed_domains = [""]
    start_urls = (
        # 'file:///C:/scrapy/mbu/mbu2/video.html',
        'file:///C:/scrapy/mbu/mbu2/list.htm',
    )

    def parse(self, response):
        # filename = response.url.split("/")[-1] + '.html'
        # with open(filename, 'wb') as f:
            # f.write(response.body)

        # item = Mbu2Item()
        # return item


        posts = response.xpath('/html/body/div/div[2]/div/div[1]/div[2]/ul/li')
        posts.pop(0)
        for post in posts:
            print(post)
            item = Mbu2Item2()
            item['currentitemlist'] = response.url
            item['currentitemlink'] = urlparse.urljoin(response.url,post.xpath('div/div/h2/a/@href').extract()[0].strip())
            item['posttitle'] = post.xpath('div/div/h2/a/text()').extract()[0].strip()
            # print(item['posttitle'])
            item['posturl'] = urlparse.urljoin(response.url,post.xpath('div/div/h2/a/@href').extract()[0].strip())
            item['postautor'] = post.xpath('div/div/div/div[1]/a/text()').extract()[0].strip()
            # print(item['postautorurl'])
            item['postautorurl'] = urlparse.urljoin(response.url,post.xpath('div/div/div/div[1]/a/@href').extract()[0].strip())
            item['postcat'] = post.xpath('div/div/div/div[2]/span/a/text()').extract()[0].strip()
            # print(item['postcaturl'])
            item['postcaturl'] = urlparse.urljoin(response.url,post.xpath('div/div/div/div[2]/span/a/@href').extract()[0].strip())
            # print(item['posttitle'], item['posturl'], item['postautor'], item['postautorurl'])[0].strip()
            # request = Request(item['posturl'],
                      # callback=self.parse_page2)

            # request.meta['item'] = item
            return Request(item['posturl'], meta={'item': item},
                      callback=self.parse_item)
            # return item


    def parse_item(self, response):

        item = response.meta['item']
        item['currentitemlink2'] = response.url
        # item['desc'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div/div[1]/p/text()').extract()[0].strip()
        item['videosrcembed'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/iframe/@src').extract()[0].strip()
        item['textcontent'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/div[1]').extract()[0].strip()
        item['relatedcatlinks'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/div[2]').extract()[0].strip()
        # filename = response.url.split("/")[-1] + '.html'
        # with open(filename, 'wb') as f:
            # f.write(response.body)

        yield item

Main problem

When I run the spider, it records only 1 item.

When I modified the logic, it registered 25 items, but it did not complete the second Request.

(I need to add new requests for every listing that is read, something like add->start_page->append(new_url); a rough sketch of the idea follows.)
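
Roughly what I mean, as a hypothetical sketch only (the "next listing" selector below is made up, not taken from my page):

    # inside parse(), after yielding the items of the current listing:
    next_listing = response.xpath('//a[@rel="next"]/@href').extract()  # hypothetical selector
    if next_listing:
        # schedule the next listing/category page to be parsed the same way
        yield Request(response.urljoin(next_listing[0]), callback=self.parse)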

But I am not able to identify when the cycle of an Item() closes and when a listing is being parsed.

Can you help me?

  • Assign this to a variable and check it: return Request(item['posturl'], meta={'item': item}, callback=self.parse_item). What is its value? Then you can return it normally.

  • Tip: instead of urlparse.urljoin(response.url, XYZ) you can just do response.urljoin(XYZ) :) (there is a small sketch after this comment list)

  • I noticed this yesterday: request = Request(...) and print(request). I saw the data, but it was missing the 3 fields that are filled in parse_item. Could that be the problem?

  • I created another Scrapy project, made a simple spider, and to my surprise it worked. http://codepad.org/e1dbzj39

  • Guys, I got it working. I believe the problem was that I used Request(Static_file, callback=parse2), so it only accepted 1 item. Does that make sense?

  • My list is being processed, but I need it to keep processing more start_pages (categories). I created another question with some follow-up points to wrap this up: http://tinyurl.com/z88y3fu. If you can take a look, thank you.
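
Illustrating the urljoin tip from the comments above, a small fragment meant to sit inside the question's parse() loop (names taken from the question's code; both assignments produce the same absolute URL):

    href = post.xpath('div/div/h2/a/@href').extract()[0].strip()

    item['posturl'] = urlparse.urljoin(response.url, href)  # form used in the question
    item['posturl'] = response.urljoin(href)                # Scrapy shortcut, same result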


1 answer


Replace the return in the parse method with a yield.

The return causes the method to exit on the first iteration of the for loop, without going through the rest of the posts.

Using yield turns the function into a generator, which Scrapy will iterate over and process appropriately.
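
As an illustration, here is a minimal sketch of what the question's parse() loop could look like with yield, reusing the imports, item fields and XPaths from the question (trimmed to a few fields; not the asker's final code):

    def parse(self, response):
        posts = response.xpath('/html/body/div/div[2]/div/div[1]/div[2]/ul/li')
        posts.pop(0)  # the question skips the first <li>
        for post in posts:
            item = Mbu2Item2()
            item['currentitemlist'] = response.url
            item['posttitle'] = post.xpath('div/div/h2/a/text()').extract()[0].strip()
            item['posturl'] = response.urljoin(post.xpath('div/div/h2/a/@href').extract()[0].strip())
            # yield instead of return: the loop keeps running, so one request is
            # scheduled per post rather than exiting on the first iteration
            yield Request(item['posturl'], meta={'item': item}, callback=self.parse_item)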

  • So, I did it using yield, then switched to return; I tried return after yield but it throws an error. I ran many tests and could not understand how the list is formed. I went back to yield. I do a print(post) inside the loop, and on screen I can see it went through 25 items, but only 1 was stored: 2016-07-13 12:59:57 [scrapy] INFO: Stored json feed (1 items) in: out.json, 'dupefilter/filtered': 24. What does this dupefilter mean?

  • Ah, dupefilter/filtered counts requests that were filtered out because they are identical to previous ones (i.e., item['posturl'] was always giving the same result). My guess is that it is a problem in the XPaths: I would try changing the XPaths inside the loop to start with ./, like post.xpath('./div/div/h2/a/@href') (see the sketch below).
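
To show what that suggestion looks like, here is a hypothetical version of the loop in which every XPath starts with ./ so it is anchored to the current <li> (field names as in the question; whether this fixes the duplicate URLs depends on the actual page):

    for post in posts:
        item = Mbu2Item2()
        # ./ makes the XPath relative to this <li>, so each iteration should pick
        # up its own link instead of repeating the same one
        item['posttitle'] = post.xpath('./div/div/h2/a/text()').extract()[0].strip()
        item['posturl'] = response.urljoin(post.xpath('./div/div/h2/a/@href').extract()[0].strip())
        yield Request(item['posturl'], meta={'item': item}, callback=self.parse_item)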
