Scraping the parameters of a POST method with Scrapy in Python

I need to collect information from a website using Spiders in Scrapy in Python, but the site uses a POST method and I'm learning the language while developing the project. I found a model for a POST request, but I'm not getting it to work. The code I have is this:

scrapy.FormRequest(
    url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
    formdata={
        'filter[search]': '',
        'filter[res]': '',
        'filter[ano]': '',
        'limit': paginas,
        'limitstart': quantidadeDeRegistros,
        'task': '',
        'boxchecked': 0,
        'filter_order': '',
        'filter_order_Dir': '',
        '46598c34d1ab5af3b00e8d84a4281fbc': 1,
        'list[fullordering]': 'null ASC'
    },
    callback=self.parsePagina
)

Is this right, or is there a better way to do it?
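
For reference, here is a minimal, self-contained sketch of how a call like this could be wired into a spider. The class and method names are placeholders, and the hard-coded page values stand in for the `paginas` and `quantidadeDeRegistros` variables; one detail worth noting is that Scrapy expects every value in `formdata` to be a string, so the numeric fields are quoted:

import scrapy


class ResolucoesSketchSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the request above.
    name = 'resolucoes_sketch'
    start_urls = ['http://www.camex.gov.br/resolucoes-camex/resolucoes']

    def parse(self, response):
        # The FormRequest must be yielded (or returned) from a callback,
        # otherwise Scrapy never schedules it.
        yield scrapy.FormRequest(
            url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
            formdata={
                'filter[search]': '',
                'filter[res]': '',
                'filter[ano]': '',
                'limit': '20',       # placeholder for the paginas variable
                'limitstart': '0',   # placeholder for quantidadeDeRegistros
                'task': '',
                'boxchecked': '0',   # numeric values must be sent as strings
                'filter_order': '',
                'filter_order_Dir': '',
                '46598c34d1ab5af3b00e8d84a4281fbc': '1',
                'list[fullordering]': 'null ASC',
            },
            callback=self.parsePagina,
        )

    def parsePagina(self, response):
        self.logger.info('received %s (status %s)', response.url, response.status)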

  • What problem are you facing?

  • Since I have little experience with Python and am using the language to develop a project at work, I can't say whether this approach is correct, because it doesn't return anything in the IDE log.

  • I work with Scrapy; it will only return something if you parse the content in your callback parsePagina. What do you have in that method?

  • I have the following function:

    def parsePagina(self, response):
        itemResolucao = response.xpath('//*[@id="resolucaoList"]/tbody/tr')
        urlBase = "http://www.camex.gov.br"
        for itens in itemResolucao:
            links_resolucoes = itens.xpath('.//a/@href').extract_first()
            if '://' not in links_resolucoes:
                link = urlBase + links_resolucoes
                req = Request(url=link, callback=self.parseResolucao)
                yield req
        print 'trabalhando na pagina'

  • From what I understand of this parsePagina, you take each link from the list and make a request for each resolution, so the parsing of the content happens in parseResolucao?

  • Exactly. In parseResolucao we apply a class from the project, but before I tried to do the POST it was only capturing information related to the first page of the site.

  • Oh right, but I still don't understand your question; the syntax of your code is correct.

  • My problem is that it is indicating that the callback is incorrect; I have researched several sources and cannot identify the problem.

  • Can you post the full code of your Spider? Then I can take a look and help you.

  • class camax_mdic(Spider):
        name = "camax_mdic"
        start_urls = ["http://www.camex.gov.br/resolucoes-camex/resolucoes"]

        diretorio_temporario = settings["TEMP_DIR"]
        pdf2text = settings["PDF2TEXT"]
        data_dir = settings["DATA_DIR"]
        diretorio_arquivos = os.path.join(data_dir, name, "docs")
        link_arquivos = 'https://s3.amazonaws.com/plugar-contents/normativas/src/camax_mdic/docs/'
        custom_settings = {
            'FEED_FORMAT': 'json',
            'FEED_URI': os.path.join(data_dir, 'camax_mdic', 'data', '%(time)s.json'),
        }

  • def __init__(self):
        pathBase = settings['DATA_DIR']

        if not os.path.exists(os.path.join(pathBase, self.name)):
            os.mkdir(os.path.join(pathBase, self.name))
        if not os.path.exists(os.path.join(pathBase, self.name, 'data')):
            os.mkdir(os.path.join(pathBase, self.name, 'data'))
        if not os.path.exists(os.path.join(pathBase, self.name, 'docs')):
            os.mkdir(os.path.join(pathBase, self.name, 'docs'))
        if not os.path.exists(os.path.join(pathBase, self.name, )


  • def parse(self, response):
        quantidadeDeRegistrosPorPagina = 20
        quantidadeDeRegistros = response.xpath('//*[@class="pagination-list"]/li/a[@title="Fim"]/@onclick').extract_first()[:-1].split('=')[1].replace(";","").replace("Joomla.submitform()return false","")
        quantidadeDePaginas = int(quantidadeDeRegistros)/quantidadeDeRegistrosPorPagina
        for paginas in xrange(0, quantidadeDePaginas, 1):

  • Post it on https://gist.github.com/ instead; it's easier for me to read there, in the comments it is difficult.

  • Done, I posted it on the site you showed me:

  • https://gist.github.com/jonathanigorpereira/d7c3e2277c0404a26eb349a618c11ccb

  • I’ll test it here and I’ll get right back to you!


1 answer


I ran some tests; this problem happens because the page whose form you are trying to submit has two form elements. Scrapy is sending the request to the first one, but it should go to the second.

[screenshot: the page contains two form elements]

To fix this and make your Spider work, you must add the formname argument to the method call:

yield FormRequest.from_response(
    response,
    url='http://www.camex.gov.br/resolucoes-camex/resolucoes',
    formname="adminForm",  # name of the form you want to send the request to
    formdata={
        'filter[search]': '',
        # ... remaining fields omitted, as in the question ...
    },
    callback=self.parsePagina
)
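
To make the fix concrete, here is a hedged sketch of how it could be combined with the parse() and parsePagina() methods posted in the comments above. The pagination arithmetic and XPaths are copied from the question; the spider name is a placeholder, and treating limit as the page size and limitstart as the record offset is an assumption:

from scrapy import Spider, FormRequest


class CamexSketchSpider(Spider):
    # Hypothetical spider name; the structure follows the code in the comments.
    name = 'camax_mdic_sketch'
    start_urls = ['http://www.camex.gov.br/resolucoes-camex/resolucoes']

    def parse(self, response):
        registros_por_pagina = 20
        # Total record count scraped from the "Fim" (last page) pagination link,
        # exactly as in the question's parse() method.
        onclick = response.xpath(
            '//*[@class="pagination-list"]/li/a[@title="Fim"]/@onclick'
        ).extract_first()
        total_registros = int(
            onclick[:-1].split('=')[1]
            .replace(';', '')
            .replace('Joomla.submitform()return false', '')
        )
        total_paginas = total_registros // registros_por_pagina

        for pagina in range(total_paginas + 1):
            yield FormRequest.from_response(
                response,
                formname='adminForm',  # submit the second form, as explained above
                formdata={
                    # Assumption: limit = page size, limitstart = record offset.
                    'limit': str(registros_por_pagina),
                    'limitstart': str(pagina * registros_por_pagina),
                    'list[fullordering]': 'null ASC',
                },
                callback=self.parsePagina,
            )

    def parsePagina(self, response):
        # Link extraction as in the question's parsePagina().
        for row in response.xpath('//*[@id="resolucaoList"]/tbody/tr'):
            href = row.xpath('.//a/@href').extract_first()
            if href and '://' not in href:
                yield response.follow(href, callback=self.parseResolucao)

    def parseResolucao(self, response):
        self.logger.info('resolution page: %s', response.url)
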
  • Can this formname be anything? I mean, can I choose what the formname will be, or is it something specific?

  • No, formname is the name of the form element in the page's HTML. It changes according to the page you are scraping, but you only need it in cases like this, where there is more than one form on the page; otherwise you don't even need to pass it. (If the form has no name attribute at all, you can also select it by position; see the formnumber sketch after these comments.)

  • Got it, thank you so much for the support; I've really been stuck on this function for a long time.

  • If the answer helped you, mark it as accepted =D
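
As a side note, when the form you need has no name attribute, FormRequest.from_response can also select it by position via the formnumber argument. A small, hypothetical sketch (spider name and field chosen only for illustration):

import scrapy


class FormNumberSketchSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate formnumber-based selection.
    name = 'formnumber_sketch'
    start_urls = ['http://www.camex.gov.br/resolucoes-camex/resolucoes']

    def parse(self, response):
        # formnumber is 0-based, so 1 selects the second <form> on the page,
        # an alternative to formname="adminForm".
        yield scrapy.FormRequest.from_response(
            response,
            formnumber=1,
            formdata={'filter[search]': ''},  # remaining fields as in the question
            callback=self.parsePagina,
        )

    def parsePagina(self, response):
        self.logger.info('got %s', response.url)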
