Web Scraping - convert HTML table to python Dict

I’m trying to turn an HTML table into a Python dict. I ran into some problems along the way and could use your help.

This is as far as I’ve gotten:

def impl12(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    import csv

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    # grab the table's text, with cells separated by '; '
    data = soup.find_all(id="countries")[0].get_text(separator='; ')

    #print('\ndata >>>', data, type(data))
    data = data.split('; ')

    content = csv.DictReader(data)
    for linha in content:
        print(linha)
  • What is your script returning so far?

  • A dictionary for each element: https://pastebin.com/WjiebRLZ

  • What do you want as the key and what as the value? Each line carries several pieces of data. Or do you want the keys to be the country code and the values another dictionary with the rest of the row's data? {"BR": {"Country": "Brazil", "Capital": "Brasilia", ...}}

  • That's it. With csv.DictReader, the field header becomes the key and each line becomes a dictionary.

  • Got it. I'll try it here...

  • Thank you very much.


1 answer


BeautifulSoup is being used redundantly there: you just find the beginning of the table and then use "brute force" to separate all the elements with "; ", treating the result as plain text. That way you don't preserve the table structure, and it is hard to tell what is a table header and what is content.

Nothing will create the headers for you "by magic". The csv module has tools to extract dictionaries from a text file structured as CSV on disk. Even if the call to get_text("; ") turned your data into well-structured CSV text - which it does not, because the line breaks a CSV file requires won't be there (except by coincidence of the HTML formatting) - you would still have to pass DictReader an iterator that yields one of those lines at a time. But after splitting on "; ", your iterator yields one cell at a time. DictReader then returns a dictionary for the contents of each cell, with no way of knowing what is a header and what is not.
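To see the problem concretely, here is a minimal sketch (the cell values are made up) of what csv.DictReader does when it is fed one cell per "line" instead of one row per line:

    import csv

    # after split('; '), each element is a single cell, not a whole row:
    cells = ['ISO-3166', 'Country', 'Capital', 'AD', 'Andorra', 'Andorra la Vella']

    for row in csv.DictReader(cells):
        print(row)
    # the first cell becomes the only field name, and every remaining
    # cell becomes a one-entry dictionary:
    # {'ISO-3166': 'Country'}
    # {'ISO-3166': 'Capital'}
    # {'ISO-3166': 'AD'}
    # ...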

For this kind of thing there is no ready-made recipe - each page is different, and writing, just by "looking at the HTML", the parsing code that will work on the first try is very hard. The best approach is to do it in Python's interactive mode: you retrieve the page data with requests.get, create the soup object, and then experiment with the various methods of that soup object and the page structure until you figure out the shape you want your data in.
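An exploratory session for this page might look like this (a sketch; the outputs depend on the page's current HTML):

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> page = requests.get('http://www.geonames.org/countries/', timeout=2)
    >>> soup = BeautifulSoup(page.content, 'html.parser')
    >>> table = soup.find(id='countries')    # same element as find_all(...)[0]
    >>> table.name                           # check what kind of tag this is
    'table'
    >>> rows = table.find_all('tr')
    >>> len(rows)                            # header row + one row per country
    ...
    >>> rows[0].get_text(separator=' | ')    # peek at the header row
    ...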

In this case, you would see that once we find the "countries" table, iterating over its children with a "for" alternately returns a table row (including the header row) and a text string that is just whitespace.
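You can verify that alternation right at the interactive prompt (again a sketch):

    for child in table:
        print(type(child).__name__, repr(str(child))[:40])
    # typical output alternates between the two kinds of children:
    # NavigableString '\n'
    # Tag '<tr>...'
    # NavigableString '\n'
    # ...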

Maybe it’s possible to do something like this then:

def importa(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    from collections import OrderedDict

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    table = soup.find_all(id="countries")[0]

    result = []

    headers = None
    for row in table:
        # Skip the children that are not HTML tags (whitespace strings)
        if isinstance(row, str):
            continue
        # Assume the first row with content holds the headers
        if not headers:
            # build a list with the text content of each tag in the row:
            headers = [cell.get_text() for cell in row]
            continue

        row_contents = [cell.get_text() for cell in row]
        data_dict = OrderedDict(zip(headers, row_contents))
        result.append(data_dict)

    return result


from pprint import pprint
pprint(importa())

(It works here - note the use of OrderedDict to make the dictionaries easier to visualize when printed.)
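Two side notes of my own, beyond the original answer: on Python 3.7+ the built-in dict preserves insertion order, so OrderedDict is no longer necessary; and if you want the shape discussed in the comments ({"BR": {...}}), you can re-key the result list by the country-code column. Which column holds the code is an assumption here - check the real header text first:

    # on Python 3.7+ this line inside importa() works just as well:
    #     data_dict = dict(zip(headers, row_contents))

    rows = importa()
    # hypothetical: assumes the first column holds the country code;
    # verify with print(list(rows[0])) before relying on it
    code_column = list(rows[0])[0]
    by_code = {row[code_column]: row for row in rows}
    # by_code['BR'] is then the full row for Brazil as a dictionary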

  • It took me a while to understand the code, but I finally got it. Thanks for the help!
