Scraping in Python: building an INSERT

  • Have you written any code yet? Are you using a library?

  • Do you know how to connect to the database and run an INSERT, or do you want the answer to cover that part too? You should post what you already have.

1 answer


Starting from the beginning: in general we would use the requests and beautifulsoup modules to read the content of a web page, but it can also be done with the standard library alone. There is another problem: these Receita Federal pages use an SSL certificate authority that is not configured in browsers or in Python installs. With pure Python, we need to create an SSL context and explicitly disable certificate verification, which gives:

import urllib.request
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = "https://idg.receita.fazenda.gov.br/orientacao/tributaria/pagamentos-e-parcelamentos/codigos-de-receita/codigos-de-receita-de-contribuicao-previdenciaria"
data = urllib.request.urlopen(url, context=ctx).read().decode("utf-8")

With the requests library installed, it is simply:

import requests
url = "https://idg.receita.fazenda.gov.br/orientacao/tributaria/pagamentos-e-parcelamentos/codigos-de-receita/codigos-de-receita-de-contribuicao-previdenciaria"
data = requests.get(url, verify=False).text

The next step is to inspect the page's HTML. It is easy to see that, although verbose, the page has a single table (a "table" element), which is well structured, with all <tr> and <td> tags properly closed. Without any third-party library, Python offers the class html.parser.HTMLParser. It is one of the few APIs in the standard library that requires you to subclass an existing class. Fortunately, we only need three methods: one called when the parser finds an opening tag (any tag), one called when it finds a closing tag, and one called when it finds text content between tags. See the HTMLParser documentation for details.
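To make the order of those three callbacks concrete, here is a minimal sketch that only records what the parser reports. The class name EventLogger and the HTML snippet (including the value "1007") are made up for illustration and are not taken from the page:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; whitespace-only chunks are skipped here.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
# Made-up snippet with a nested tag inside the cell, similar in spirit to the page.
logger.feed("<table><tr><td><strong>1007</strong></td></tr></table>")
logger.close()
for event in logger.events:
    print(event)
```

Note that the text arrives while both "td" and "strong" are open, which is why tracking only the "td" state is enough to capture a cell's content.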

In practice, we add a few state attributes, attributes to hold the extracted data, and a few "if" checks in the methods that handle opening and closing tags. Then, in the method called with the text content, a single "if" checks whether we are inside a table cell and, if so, stores the data. This approach also ignores all the stray tags this page puts inside the "td" elements themselves ("strong", "p", "span"):

from html.parser import HTMLParser

class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.table_data = []
        self.inside_table = False
        self.inside_tr = False
        self.inside_td = False
        self.tmp_row_data = []
        self.tmp_cell_data = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.inside_table = True
        elif tag == "tr":
            self.inside_tr = True
        elif tag == "td":
            self.inside_td = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.inside_table = False
        elif tag == "tr":
            self.inside_tr = False
            self.table_data.append(self.tmp_row_data)
            self.tmp_row_data = []
        elif tag == "td":
            self.inside_td = False
            self.tmp_row_data.append(" ".join(self.tmp_cell_data).strip())
            self.tmp_cell_data = []

    def handle_data(self, data):
        if self.inside_td:
            self.tmp_cell_data.append(data)

With this class in place, we can get a list of lists, one for each row of the page's table:

parser = TableParser()
parser.feed(data)
lines = parser.table_data[1:]
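Since the parser only collects "td" cells, a row made up solely of "th" cells ends up as an empty list, and a row with merged cells may have fewer entries than the others. If the later steps assume three cells per row (item, code, description), it may be worth filtering first. A minimal sketch, using made-up sample rows rather than real data from the page:

```python
# Made-up sample shaped like parser.table_data[1:]; not real page content.
sample_lines = [
    ["1", "1007", "Example description A"],
    [],                                  # e.g. a row that had only <th> cells
    ["2", "1104", "Example description B"],
    ["Merged cell spanning the row"],    # e.g. a row using colspan
]

EXPECTED_CELLS = 3  # assumption: item, code, description

# Keep only the rows that can be safely unpacked into three names.
clean_lines = [row for row in sample_lines if len(row) == EXPECTED_CELLS]
skipped = len(sample_lines) - len(clean_lines)
print(f"kept {len(clean_lines)} rows, skipped {skipped}")
```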

Now, to "build the INSERTs", one option is a "for" loop with the Python DB API, making the calls directly to the database. That varies a little from database to database. To only build the INSERTs, as you asked, we can save a series of INSERT statements to a text file:

with open("codigo_receita.sql", "wt", encoding="utf-8") as file_:
    for item, codigo, especificacao in lines:
        # Escape single quotes so the description is a valid SQL string literal
        especificacao = especificacao.replace("'", "''")
        file_.write(f"INSERT INTO TABELA (CAMPO1, CAMPO2) VALUES ({codigo}, '{especificacao}');\n")
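For the other route, calling the database directly through the DB API, here is a minimal sketch using sqlite3 from the standard library. The table name, column names and sample rows are assumptions for illustration; with another database the connect call changes, and so may the placeholder style (sqlite3 uses "?", while psycopg2, for example, uses "%s"):

```python
import sqlite3

# Made-up sample rows in the same (item, code, description) shape as `lines`.
sample_lines = [
    ["1", "1007", "Example description"],
    ["2", "1104", "Description with an 'apostrophe'"],
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
# Hypothetical table and column names.
conn.execute("CREATE TABLE codigo_receita (codigo TEXT, especificacao TEXT)")

# "?" placeholders let the driver handle quoting, so no manual escaping is needed.
conn.executemany(
    "INSERT INTO codigo_receita (codigo, especificacao) VALUES (?, ?)",
    [(codigo, especificacao) for _item, codigo, especificacao in sample_lines],
)
conn.commit()

rows = conn.execute("SELECT codigo, especificacao FROM codigo_receita").fetchall()
print(rows)
conn.close()
```

Passing the values as parameters instead of formatting them into the SQL string avoids the quote-escaping step entirely.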
