Scraping with the parameters of a POST method, using Scrapy in Python

Question

Asked 7 years, 3 months ago

I need to collect information from a website using spiders in Scrapy (Python), but the site uses the POST method, and I'm learning the language while developing the project. I found an example of a POST request, but I can't get it to work. This is the code I have:

scrapy.FormRequest(
    url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
    formdata={
        'filter[search]': '',
        'filter[res]': '',
        'filter[ano]': '',
        'limit': paginas,
        'limitstart': quantidadeDeRegistros,
        'task': '',
        'boxchecked': 0,
        'filter_order': '',
        'filter_order_Dir': '',
        '46598c34d1ab5af3b00e8d84a4281fbc': 1,
        'list[fullordering]': 'null ASC'
    },
    callback=self.parsePagina
)

Is this correct, or is there a better way to do it?
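A side note on the formdata values, offered as a sketch rather than a diagnosis: to my knowledge, FormRequest expects every formdata value to be a string, so integers such as 0, 1, or a loop counter would need converting first. The helper below (stringify_formdata is a hypothetical name, and the numbers are placeholders for paginas and quantidadeDeRegistros) does that with the Python 3 standard library only and shows roughly what the encoded POST body looks like.

```python
from urllib.parse import urlencode


def stringify_formdata(formdata):
    """Return a copy of formdata with every key and value converted to str."""
    return {str(key): str(value) for key, value in formdata.items()}


# Placeholder values standing in for the loop variables in the question.
paginas = 20
quantidadeDeRegistros = 40

formdata = stringify_formdata({
    'filter[search]': '',
    'limit': paginas,
    'limitstart': quantidadeDeRegistros,
    'boxchecked': 0,
    'list[fullordering]': 'null ASC',
})

# Roughly the body that a form-encoded POST request would carry.
body = urlencode(formdata)
print(body)
```

The resulting dictionary could then be passed as formdata in place of the mixed int/str one.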

What problem are you facing?

– Laerte

2018/05/07 at 11:45
Since I have little experience with Python and I'm using the language for a project at work, I can't tell whether this approach is correct, because it doesn't output anything in the IDE log.

– Jonathan Igor Bockorny Pereira

2018/05/07 at 11:46
I work with Scrapy. It will only return something if you parse the content in your callback, parsePagina. What do you have in that method?

– Laerte

2018/05/07 at 11:48
I have the following function:
def parsePagina(self, response):
    itemResolucao = response.xpath('//*[@id="resolucaoList"]/tbody/tr')
    urlBase = "http://www.camex.gov.br"
    for itens in itemResolucao:
        links_resolucoes = itens.xpath('.//a/@href').extract_first()
        if not '://' in links_resolucoes:
            link = urlBase + itens.xpath('.//a/@href').extract_first()
            req = Request(url=link, callback=self.parseResolucao)
            yield req
    print 'trabalhando na pagina'

– Jonathan Igor Bockorny Pereira

2018/05/07 at 11:49
From what I understand of parsePagina, you take each link from the list and make a request for each resolution, so the content is actually parsed in parseResolucao?

– Laerte

2018/05/07 at 12:00
Exactly. In parseResolucao we apply one of the project's classes, but before I tried the POST it captured only the information from the first page of the site.

– Jonathan Igor Bockorny Pereira

2018/05/07 at 12:16
Oh, right, but I still don't understand your question; the syntax of your code is correct.

– Laerte

2018/05/07 at 12:26
My problem is that it reports the callback as incorrect. I have searched several sources and cannot identify the problem.

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:43
Can you post the full code of your spider so I can take a look and help you?

– Laerte

2018/05/07 at 13:44
class camax_mdic(Spider):
    name = "camax_mdic"
    start_urls = ["http://www.camex.gov.br/resolucoes-camex/resolucoes"]

    diretorio_temporario = settings["TEMP_DIR"]
    pdf2text = settings["PDF2TEXT"]
    data_dir = settings["DATA_DIR"]
    diretorio_arquivos = os.path.join(data_dir, name, "docs")
    link_arquivos = 'https://s3.amazonaws.com/plugar-contents/normativas/src/camax_mdic/docs/'
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': os.path.join(data_dir, 'camax_mdic', 'data', '%(time)s.json'),
    }

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:52
def __init__(self):
    pathBase = settings['DATA_DIR']

    if not os.path.exists(os.path.join(pathBase, self.name)):
        os.mkdir(os.path.join(pathBase, self.name))
    if not os.path.exists(os.path.join(pathBase, self.name, 'data')):
        os.mkdir(os.path.join(pathBase, self.name, 'data'))
    if not os.path.exists(os.path.join(pathBase, self.name, 'docs')):
        os.mkdir(os.path.join(pathBase, self.name, 'docs'))
    if not os.path.exists(os.path.join(pathBase, self.name, )

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:53
def parse(self, response):
    quantidadeDeRegistrosPorPagina = 20
    quantidadeDeRegistros = response.xpath('//*[@class="pagination-list"]/li/a[@title="Fim"]/@onclick').extract_first()[:-1].split('=')[1].replace(";","").replace("Joomla.submitform()return false","")
    quantidadeDePaginas = int(quantidadeDeRegistros)/quantidadeDeRegistrosPorPagina
    for paginas in xrange(0, quantidadeDePaginas, 1):

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:53
Post it at https://gist.github.com/ instead; that is easier for me to read, since code is hard to follow in the comments.

– Laerte

2018/05/07 at 13:53
And after that comes the POST request I showed you.

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:54
https://gist.github.com/jonathanigorpereira/d7c3e2277c0404a26eb349a618c11ccb

– Jonathan Igor Bockorny Pereira

2018/05/07 at 13:55
I’ll test it here and I’ll get right back to you!

– Laerte

2018/05/07 at 13:57

1 answer

by Laerte • **22,243** points · Answer 1 · 2018-05-07T14:52:20+00:00

I ran some tests. The problem occurs because the page whose form you are trying to submit contains two form elements. Scrapy is sending the request to the first one, but it should go to the second.
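One way to check for this situation is to list the forms a page contains, in document order. The sketch below uses only the Python 3 standard library and made-up markup: the name adminForm comes from this thread, while searchForm, the action paths, and the input fields are assumptions for illustration, not the real markup of the site.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the name attribute of every <form> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.forms.append(dict(attrs).get('name'))


# Made-up markup: only the name "adminForm" is taken from this thread.
SAMPLE_HTML = """
<html><body>
  <form name="searchForm" action="/busca" method="post">
    <input type="text" name="q">
  </form>
  <form name="adminForm" action="/resolucoes-camex/resolucoes" method="post">
    <input type="hidden" name="limitstart" value="0">
  </form>
</body></html>
"""

lister = FormLister()
lister.feed(SAMPLE_HTML)

# Position 0 is the form that gets picked when no form is specified.
for position, name in enumerate(lister.forms):
    print(position, name)
```

If the form you want shows up at any position other than 0, the request has to identify that form explicitly.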

To fix this and get your spider working, add the formname argument to the method call:

yield FormRequest.from_response(
    response,
    url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
    formname="adminForm",  # name of the form you want to submit
    formdata={
            'filter[search]': '',  # code omitted