I need to collect information from a website using Scrapy spiders in Python, but the site's listing is served through a POST request, and I am still learning the language while developing the project. I found an example of a POST request, but I can't get it to work. The code I have is this:
scrapy.FormRequest(
    url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
    formdata={
        'filter[search]': '',
        'filter[res]': '',
        'filter[ano]': '',
        'limit': paginas,
        'limitstart': quantidadeDeRegistros,
        'task': '',
        'boxchecked': 0,
        'filter_order': '',
        'filter_order_Dir': '',
        '46598c34d1ab5af3b00e8d84a4281fbc': 1,
        'list[fullordering]': 'null ASC'
    },
    callback=self.parsePagina
)
Is this right, or is there a better way to do it?
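One thing worth checking in a request like this: in the Scrapy versions I am aware of, `FormRequest` expects every `formdata` value to be a string, so integers such as `0` and `1` can raise a `TypeError` before the request is ever sent. The sketch below is a minimal illustration, not the site's confirmed contract. `build_formdata` is a hypothetical helper, and it assumes the usual Joomla convention that `limit` is the number of records per page and `limitstart` is the offset of the first record.

```python
def build_formdata(limit, limitstart, token):
    """Return the POST fields with every value converted to a string.

    Hypothetical helper: `token` is the 32-character field name from the
    question, which is assumed here to be a per-session form token.
    """
    fields = {
        'filter[search]': '',
        'filter[res]': '',
        'filter[ano]': '',
        'limit': limit,            # assumed: records per page
        'limitstart': limitstart,  # assumed: offset of the first record
        'task': '',
        'boxchecked': 0,
        'filter_order': '',
        'filter_order_Dir': '',
        token: 1,
        'list[fullordering]': 'null ASC',
    }
    # Convert everything to str so the form encoder never sees an int.
    return {key: str(value) for key, value in fields.items()}


formdata = build_formdata(20, 40, '46598c34d1ab5af3b00e8d84a4281fbc')
```

The resulting dictionary can be passed as `formdata=` to `scrapy.FormRequest`. If the long hexadecimal key really is a session token, `scrapy.FormRequest.from_response(response, formdata=..., callback=...)` may be a better fit, since it reads the hidden fields from the form in the page instead of hard-coding them.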
What problem are you facing?
– Laerte
As I have little experience with Python and I am using the language for a project at work, I can't tell whether this approach is correct, because it doesn't print anything in the IDE log.
– Jonathan Igor Bockorny Pereira
I work with Scrapy. It will only return something if you parse the content in your callback, parsePagina. What do you have in this method?
– Laerte
I have the following function:
def parsePagina(self, response):
    itemResolucao = response.xpath('//*[@id="resolucaoList"]/tbody/tr')
    urlBase = "http://www.camex.gov.br"
    for itens in itemResolucao:
        links_resolucoes = itens.xpath('.//a/@href').extract_first()
        if not '://' in links_resolucoes:
            link = urlBase + itens.xpath('.//a/@href').extract_first()
            req = Request(url=link, callback=self.parseResolucao)
            yield req
        print 'trabalhando na pagina'
– Jonathan Igor Bockorny Pereira
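The link handling in `parsePagina` can be made a little more robust. As written, a row without an `<a>` makes `extract_first()` return `None`, and the `'://' in ...` test then fails. A minimal sketch of the same idea, assuming Python 3 (in Python 2 the function lives in `urlparse`); `absolute_link` is a hypothetical helper name:

```python
from urllib.parse import urljoin

URL_BASE = "http://www.camex.gov.br"


def absolute_link(href, base=URL_BASE):
    """Return an absolute URL for href, or None when there is no link."""
    if not href:
        # extract_first() returns None when the row has no <a> element.
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base, href)


relative = absolute_link("/resolucoes-camex/resolucoes/1")
absolute = absolute_link("https://example.com/doc.pdf")
missing = absolute_link(None)
```

Inside a spider, `response.urljoin(href)` does the same resolution against the URL of the current page, so the base does not need to be hard-coded.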
From what I understand of parsePagina, you take each link from the list and then make a request for each resolution, so the parsing of the content happens in parseResolucao?
– Laerte
Exactly. In parseResolucao we use one of the project's classes. Before I tried the POST, though, it captured only the information from the first page of the site.
– Jonathan Igor Bockorny Pereira
Oh, right, but I still don't understand your question; the syntax of your code is correct.
– Laerte
My problem is that it reports that the callback is incorrect. I have searched several sources and cannot identify the problem.
– Jonathan Igor Bockorny Pereira
Can you post the full code of your spider, so I can take a look and help you?
– Laerte
class camax_mdic(Spider):
    name = "camax_mdic"
    start_urls = ["http://www.camex.gov.br/resolucoes-camex/resolucoes"]

    diretorio_temporario = settings["TEMP_DIR"]
    pdf2text = settings["PDF2TEXT"]
    data_dir = settings["DATA_DIR"]
    diretorio_arquivos = os.path.join(data_dir, name, "docs")
    link_arquivos = 'https://s3.amazonaws.com/plugar-contents/normativas/src/camax_mdic/docs/'
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': os.path.join(data_dir, 'camax_mdic', 'data', '%(time)s.json'),
    }
– Jonathan Igor Bockorny Pereira
def __init__(self):
    pathBase = settings['DATA_DIR']

    if not os.path.exists(os.path.join(pathBase, self.name)):
        os.mkdir(os.path.join(pathBase, self.name))
    if not os.path.exists(os.path.join(pathBase, self.name, 'data')):
        os.mkdir(os.path.join(pathBase, self.name, 'data'))
    if not os.path.exists(os.path.join(pathBase, self.name, 'docs')):
        os.mkdir(os.path.join(pathBase, self.name, 'docs'))
    if not os.path.exists(os.path.join(pathBase, self.name, )

– Jonathan Igor Bockorny Pereira
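The repeated "check, then `os.mkdir`" pattern in `__init__` can be collapsed. A minimal sketch, assuming Python 3.2 or later for `exist_ok`; `ensure_dirs` is a hypothetical helper, and the temporary directory only stands in for `settings['DATA_DIR']`:

```python
import os
import tempfile


def ensure_dirs(path_base, name, subdirs=('data', 'docs')):
    """Create <path_base>/<name>/<subdir> for each subdir, if missing."""
    created = []
    for sub in subdirs:
        path = os.path.join(path_base, name, sub)
        # makedirs also creates the intermediate <name> directory, and
        # exist_ok=True makes the call a no-op when the path is already there.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


demo_base = tempfile.mkdtemp()  # stand-in for settings['DATA_DIR']
paths = ensure_dirs(demo_base, 'camax_mdic')
paths_again = ensure_dirs(demo_base, 'camax_mdic')  # safe to repeat
```

If the spider keeps its own `__init__`, it is usually worth accepting `*args, **kwargs` and calling the parent constructor, so that Scrapy can still pass its arguments through.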
def parse(self, response):
    quantidadeDeRegistrosPorPagina = 20
    quantidadeDeRegistros = response.xpath('//*[@class="pagination-list"]/li/a[@title="Fim"]/@onclick').extract_first()[:-1].split('=')[1].replace(";","").replace("Joomla.submitform()return false","")
    quantidadeDePaginas = int(quantidadeDeRegistros)/quantidadeDeRegistrosPorPagina
    for paginas in xrange(0, quantidadeDePaginas, 1):
– Jonathan Igor Bockorny Pereira
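The pagination arithmetic in `parse` can be sketched on its own. This assumes the "Fim" link's `onclick` has the shape implied by the string handling in the snippet (a number after `=`, followed by `Joomla.submitform();return false;`) and that this number is the offset of the last page. Both are assumptions, and the sample string is invented for illustration:

```python
import re


def last_offset_from_onclick(onclick):
    """Pull the first integer that follows '=' out of the onclick text."""
    match = re.search(r'=\s*(\d+)', onclick or '')
    return int(match.group(1)) if match else 0


def page_offsets(last_offset, per_page=20):
    """List every limitstart value from 0 up to and including last_offset."""
    return list(range(0, last_offset + 1, per_page))


# Assumed shape of the attribute; the real page may differ.
sample = "document.adminForm.limitstart.value=60; Joomla.submitform();return false;"
offsets = page_offsets(last_offset_from_onclick(sample))
```

Each offset would then go into one POST request as `limitstart`, with `limit` kept at the page size. Note that the original loop iterates over page indices (0, 1, 2, ...) rather than record offsets, and that under Python 3 the `/` division returns a float, which `range` does not accept.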
Please post it at https://gist.github.com/ instead. It is easier for me to read there; code in the comments is hard to follow.
– Laerte
After that comes the POST request I showed you.
– Jonathan Igor Bockorny Pereira
https://gist.github.com/jonathanigorpereira/d7c3e2277c0404a26eb349a618c11ccb
– Jonathan Igor Bockorny Pereira
I’ll test it here and I’ll get right back to you!
– Laerte