But I am facing a problem, and I ended up getting confused, so I decided to roll the code back to a working state.
# -*- coding: utf-8 -*-
import scrapy
import urlparse

from mbu2.items import Mbu2Item2
from scrapy.http import Request


class Spider2Spider(scrapy.Spider):
    name = "spider2"
    # allowed_domains = [""]
    start_urls = (
        # 'file:///C:/scrapy/mbu/mbu2/video.html',
        'file:///C:/scrapy/mbu/mbu2/list.htm',
    )

    def parse(self, response):
        # filename = response.url.split("/")[-1] + '.html'
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        # item = Mbu2Item()
        # return item
        posts = response.xpath('/html/body/div/div[2]/div/div[1]/div[2]/ul/li')
        posts.pop(0)
        for post in posts:
            print(post)
            item = Mbu2Item2()
            item['currentitemlist'] = response.url
            item['currentitemlink'] = urlparse.urljoin(response.url, post.xpath('div/div/h2/a/@href').extract()[0].strip())
            item['posttitle'] = post.xpath('div/div/h2/a/text()').extract()[0].strip()
            item['posturl'] = urlparse.urljoin(response.url, post.xpath('div/div/h2/a/@href').extract()[0].strip())
            item['postautor'] = post.xpath('div/div/div/div[1]/a/text()').extract()[0].strip()
            item['postautorurl'] = urlparse.urljoin(response.url, post.xpath('div/div/div/div[1]/a/@href').extract()[0].strip())
            item['postcat'] = post.xpath('div/div/div/div[2]/span/a/text()').extract()[0].strip()
            item['postcaturl'] = urlparse.urljoin(response.url, post.xpath('div/div/div/div[2]/span/a/@href').extract()[0].strip())
            # request = Request(item['posturl'], callback=self.parse_page2)
            # request.meta['item'] = item
            return Request(item['posturl'], meta={'item': item},
                           callback=self.parse_item)
            # return item

    def parse_item(self, response):
        item = response.meta['item']
        item['currentitemlink2'] = response.url
        # item['desc'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div/div[1]/p/text()').extract()[0].strip()
        item['videosrcembed'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/iframe/@src').extract()[0].strip()
        item['textcontent'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/div[1]').extract()[0].strip()
        item['relatedcatlinks'] = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/article/div[2]').extract()[0].strip()
        # filename = response.url.split("/")[-1] + '.html'
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        yield item
Main problem
When I run the spider, it records only 1 item.
After I changed the logic, it registered 25 items, but the second Request never completed.
(I also need to add new Requests for every listing that is read, something like start_page.append(new_url).)
But I cannot tell when the cycle of one Item() closes and when the spider is still parsing a listing.
Can you help me?
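A likely culprit, judging from the code above: return inside the for loop exits parse on the first iteration, so only one Request is ever scheduled. A minimal sketch of the change (field assignments elided, same items and callbacks assumed):

    def parse(self, response):
        posts = response.xpath('/html/body/div/div[2]/div/div[1]/div[2]/ul/li')
        posts.pop(0)  # skip the first <li>, which is not a post
        for post in posts:
            item = Mbu2Item2()
            item['posturl'] = response.urljoin(post.xpath('div/div/h2/a/@href').extract()[0].strip())
            # ... fill the remaining item fields exactly as above ...
            # yield instead of return: every post schedules its own Request,
            # and parse keeps iterating over the rest of the listing
            yield Request(item['posturl'], meta={'item': item},
                          callback=self.parse_item)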
Assign this to a variable and check it:
request = Request(item['posturl'], meta={'item': item},
                  callback=self.parse_item)
What is its value? Then you can return it normally. – Leonel Sanches da Silva
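In code, the suggestion amounts to something like this inside the loop (a debugging sketch, not a fix):

    request = Request(item['posturl'], meta={'item': item},
                      callback=self.parse_item)
    print(request)       # prints e.g. <GET file:///C:/scrapy/...>: which URL was built?
    print(request.meta)  # the partially filled item travels along in meta
    return request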
Tip: instead of urlparse.urljoin(response.url, XYZ) you can do response.urljoin(XYZ) :) – elias
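A tiny standalone illustration of the equivalence (hypothetical URLs, Python 2 urlparse as in the spider above):

    import urlparse

    base = 'file:///C:/scrapy/mbu/mbu2/list.htm'
    href = 'video.html'  # a relative link as it might appear in the listing
    print(urlparse.urljoin(base, href))  # file:///C:/scrapy/mbu/mbu2/video.html
    # Inside a spider callback, response.urljoin(href) performs the same
    # join against response.url, with no extra import.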
I tried this yesterday: request = Request(...) and then print(request). I saw the data, but it was missing the 3 fields that are filled in parse_item. Could that be the problem?
– Luiz Brz Developer
I created another Scrapy project, made a simple spider, and to my surprise it worked. http://codepad.org/e1dbzj39
– Luiz Brz Developer
Guys, I got it working. I believe the problem was that I used Request(static_file, callback=parse2), so it only accepted 1 item. Does that make sense?
– Luiz Brz Developer
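If duplicate URLs were indeed the issue (several posts pointing at the same local file), note that Scrapy's built-in dupefilter silently drops repeated requests; it can be bypassed per request. A hedged sketch, in case that was the cause:

    yield Request(item['posturl'], meta={'item': item},
                  callback=self.parse_item, dont_filter=True)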
My list is now being processed, but I need the spider to keep processing more start pages (categories). I opened another question with the remaining doubts: http://tinyurl.com/z88y3fu, if you can take a look, thank you.
– Luiz Brz Developer
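For that follow-up goal, the usual pattern is to yield extra Requests from parse back into parse itself, rather than appending to start_urls. A sketch with an assumed selector for the category/next-page links:

    def parse(self, response):
        # ... yield one Request per post, as above ...
        # Hypothetical selector: adjust to the real category/pagination markup.
        for href in response.xpath('//a[@rel="next"]/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse)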