Only json content between []

I have a question about a spider I wrote with Scrapy to collect data and export it to a JSON file.

The file's formatting is not what I usually see, which struck me as odd, and I am not sure whether something is wrong.

Below are the contents of the file and the code:

[
{"uf": "AL", "area": "C\u00edvel", "juiz": "Henrique Gomes de Barros Teixeira\n", "partes": [{"nome": "Maria Edite dos Santos", "tipo": "Autora", "Advogado(s)": [{"nome": "Defensoria P\u00fablica do Estado de Alagoas", "tipo": "Defensor P"}]}, {"nome": "Hipercard Banco Multiplo S/A", "tipo": "R\u00e9u", "Advogado(s)": [{"nome": "Raoni Souza Drummond", "tipo": "Advogado"}, {"nome": "Eduardo Fraga", "tipo": "Advogado"}, {"nome": "Andrea Freire Tynan", "tipo": "Advogado"}]}, {"nome": "W. dos S. F.", "tipo": "Testemunha"}, {"nome": "P. V. R. de L.", "tipo": "Testemunha"}]}
]

CODE:

import scrapy

class TjalSpdrSpider(scrapy.Spider):

    name = 'tjal'
    allowed_domains = ['www2.tjal.jus.br']  # domain only, without a URL path
    # url_path = www2.tjal.jus.br/cpopg/open.do
    start_urls = [
        'https://www2.tjal.jus.br/cpopg/show.do?processo.codigo=01000I1FT0000&processo.foro=1&processo.'
        'numero=0731425-82.2014.8.02.0001&uuidCaptcha=sajcaptcha_2976d855423340b4be91a23ff5add85d'
    ]

    def parse(self, response):

        table_partes = response.xpath('//table[@id="tableTodasPartes"]/tr[@class="fundoClaro"]')

        area = ''.join(response.xpath('//table[@class="secaoFormBody"]/tr[4]/td[2]/table/tr/td/text()').getall())
        juiz = response.xpath('//table[@class="secaoFormBody"]/tr[10]/td/span/text()').get()
        partes = []

        for dados in table_partes:
            tipo = dados.xpath('./td/span/text()').get().strip()[:-1]
            tipo_adv = dados.xpath('./td[2]/span[@class="mensagemExibindo"]/text()').get()
            nome = dados.xpath('./td[2]/text()').get().strip()
            advg = [{'nome': f'{adv}'.strip(),'tipo': f'{tipo_adv}'.strip()[:-1]}
                    for adv in dados.xpath('./td[2]/text()[preceding-sibling::span]').getall() if adv.strip() != '']
            if nome != '':
                if tipo != 'Testemunha':
                    partes.append({
                        'nome': nome,
                        'tipo': tipo,
                        'Advogado(s)': advg
                        })
                else:
                    partes.append({
                        'nome': nome,
                        'tipo': tipo,
                    })

        yield {
               'uf': 'AL',
               'area': area.strip(),
               'juiz': juiz,
               'partes': partes
              }

1 answer

If the question is only about formatting the JSON output, the solution is to use the dumps() function from the json library.

See below:

>>> import json

>>> meu_json = [ {"uf": "AL", "area": "C\u00edvel", "juiz": "Henrique Gomes de Barros Teixeira\n", "partes": [{"nome": "Maria Edite dos Santos", "tipo": "Autora", "Advogado(s)": [{"nome": "Defensoria P\u00fablica do Estado de Alagoas", "tipo": "Defensor P"}]}, {"nome": "Hipercard Banco Multiplo S/A", "tipo": "R\u00e9u", "Advogado(s)": [{"nome": "Raoni Souza Drummond", "tipo": "Advogado"}, {"nome": "Eduardo Fraga", "tipo": "Advogado"}, {"nome": "Andrea Freire Tynan", "tipo": "Advogado"}]}, {"nome": "W. dos S. F.", "tipo": "Testemunha"}, {"nome": "P. V. R. de L.", "tipo": "Testemunha"}]} ]

>>> print(json.dumps(meu_json, indent=2))

The output will be:

[
  {
    "uf": "AL",
    "area": "C\u00edvel",
    "juiz": "Henrique Gomes de Barros Teixeira\n",
    "partes": [
      {
        "nome": "Maria Edite dos Santos",
        "tipo": "Autora",
        "Advogado(s)": [
          {
            "nome": "Defensoria P\u00fablica do Estado de Alagoas",
            "tipo": "Defensor P"
          }
        ]
      },
...
]

Note that I deliberately left out part of the output to keep the post short.

However, if the question is about accented characters, you can convert with the parameter ensure_ascii=False.

>>> str_json = json.dumps(meu_json, ensure_ascii=False)

>>> print(str_json)

[{"uf": "AL", "area": "Cível", "juiz": "Henrique Gomes de Barros Teixeira\n", "partes": [{"nome": "Maria Edite dos Santos", "tipo": "Autora", "Advogado(s)": [{"nome": "Defensoria Pública do Estado de Alagoas", "tipo": "Defensor P"}]}, {"nome": "Hipercard Banco Multiplo S/A", "tipo": "Réu", "Advogado(s)": [{"nome": "Raoni Souza Drummond", "tipo": "Advogado"}, {"nome": "Eduardo Fraga", "tipo": "Advogado"}, {"nome": "Andrea Freire Tynan", "tipo": "Advogado"}]}, {"nome": "W. dos S. F.", "tipo": "Testemunha"}, {"nome": "P. V. R. de L.", "tipo": "Testemunha"}]}]

Of course, you can combine the two:

>>> print(json.dumps(meu_json, ensure_ascii=False, indent=2))
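If the data is already in a file, the same two parameters can be applied after the fact. The following is a minimal sketch using only the standard library; the function name pretty_print_json_file and the file paths are hypothetical and chosen just for illustration.

```python
import json
from pathlib import Path


def pretty_print_json_file(src, dst, indent=2):
    """Read JSON from src and write an indented, non-escaped copy to dst.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Parse the (possibly single-line) JSON document.
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    # ensure_ascii=False keeps accented characters such as "í" readable;
    # indent controls the number of spaces per nesting level.
    text = json.dumps(data, ensure_ascii=False, indent=indent)
    Path(dst).write_text(text, encoding="utf-8")
    return data
```

For example, pretty_print_json_file("saida.json", "saida_formatada.json") would leave the original file untouched and write the formatted copy next to it (both file names are placeholders).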

I hope I’ve helped.

  • The point is that I have to send the data through yield, running the spider with 'scrapy crawl <spider> -o <filename>.json', and in this file all the content goes on one line. I tried to yield the result of json.dumps as you suggested, but got the following error: ERROR: Spider must return request, item, or None, got 'str' in

  • If the JSON is well formed, I see no problem with it being on a single line. Are any errors being generated?

  • No, it was just that doubt. I recently started studying web scraping, and normally when I exported the data to a JSON file it came out differently. But if there is no problem, all right, thank you!!
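On the single-line output raised in the comments: a spider has to yield items (for example dicts), which is why yielding the string returned by json.dumps fails. As far as I know, Scrapy's feed exports read the settings FEED_EXPORT_INDENT and FEED_EXPORT_ENCODING, so the formatting can be changed without touching the yield. The sketch below is only a settings fragment: the name FEED_SETTINGS is hypothetical, and the values are one possible choice, not something taken from the original spider.

```python
# Hypothetical settings fragment, not part of the original spider.
# The same keys could go in the project's settings.py, or in the
# spider class as:  custom_settings = FEED_SETTINGS
FEED_SETTINGS = {
    # Pretty-print the exported JSON with 2 spaces per nesting level.
    # With the default of 0, each item is written on a single line.
    "FEED_EXPORT_INDENT": 2,
    # Write accented characters as UTF-8 instead of \uXXXX escapes.
    "FEED_EXPORT_ENCODING": "utf-8",
}
```

With these in place, the spider keeps yielding a plain dict and the command 'scrapy crawl <spider> -o <filename>.json' stays the same; only the exporter's formatting should change.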
