Web Scraping - convert HTML table to python Dict

I’m trying to turn an HTML table into a Python dict. I ran into some problems along the way and could use your help.

This is as far as I’ve gotten:

def impl12(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    import csv

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    # grab the table's text, with cells separated by '; '
    data = soup.find_all(id="countries")[0].get_text(separator='; ')

    #print('\ndata >>>', data, type(data))
    data = data.split('; ')

    content = csv.DictReader(data)
    for linha in content:
        print(linha)
  • What is your script returning so far?

  • A dictionary for each element: https://pastebin.com/WjiebRLZ

  • What do you want as the key and what as the value? Each line carries several pieces of data. Or do you want the keys to be the country code and the values another dictionary with the rest of the row's data? {"BR": {"Country": "Brazil", "Capital": "Brasilia", ...}}

  • That's it. With csv.DictReader, the field header becomes the key and each line becomes a dictionary.

  • Got it. I'll try it here...

  • Thank you very much.


1 answer


BeautifulSoup is being used redundantly there: you just find the beginning of the table and then use "brute force" to separate all the elements with "; ", treating the result as plain text. That way you don't preserve the table structure, and it is hard to tell what is a table header and what is content.

Nothing will create the headers for you "by magic". The csv module has tools to extract dictionaries from a text file structured as CSV on disk. Even if the call to get_text("; ") turned your data into well-structured CSV text - which it does not, because the line breaks a CSV file requires won't be there (except by coincidence of the HTML formatting) - you would still have to pass DictReader an iterator that yields one of those lines at a time. But after splitting on "; ", your iterator yields one cell at a time. DictReader then returns a dictionary for the contents of each cell, with no way of knowing what is a header and what is not.
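To see the problem concretely, here is a minimal sketch (the cell values are made up) of what csv.DictReader does when it is fed one cell per "line" instead of one row per line:

    import csv

    # after split('; '), each element is a single cell, not a whole row:
    cells = ['ISO-3166', 'Country', 'Capital', 'AD', 'Andorra', 'Andorra la Vella']

    for row in csv.DictReader(cells):
        print(row)
    # the first cell becomes the only field name, and every remaining
    # cell becomes a one-entry dictionary:
    # {'ISO-3166': 'Country'}
    # {'ISO-3166': 'Capital'}
    # {'ISO-3166': 'AD'}
    # ...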

For this kind of thing there is no ready-made recipe - each page is different, and writing, just by "looking at the HTML", the parsing code that will work on the first try is very hard. The best approach is to do it in Python's interactive mode: you retrieve the page data with requests.get, create the soup object, and then experiment with the various methods of that soup object and the page structure until you figure out the shape you want your data in.
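An exploratory session for this page might look like this (a sketch; the outputs depend on the page's current HTML):

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> page = requests.get('http://www.geonames.org/countries/', timeout=2)
    >>> soup = BeautifulSoup(page.content, 'html.parser')
    >>> table = soup.find(id='countries')    # same element as find_all(...)[0]
    >>> table.name                           # check what kind of tag this is
    'table'
    >>> rows = table.find_all('tr')
    >>> len(rows)                            # header row + one row per country
    ...
    >>> rows[0].get_text(separator=' | ')    # peek at the header row
    ...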

In this case, you would see that once we find the "countries" table, iterating over its children with a "for" alternately returns a table row (including the header row) and a text string that is just whitespace.
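You can verify that alternation right at the interactive prompt (again a sketch):

    for child in table:
        print(type(child).__name__, repr(str(child))[:40])
    # typical output alternates between the two kinds of children:
    # NavigableString '\n'
    # Tag '<tr>...'
    # NavigableString '\n'
    # ...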

Maybe it’s possible to do something like this then:

def importa(url='http://www.geonames.org/countries/', tmout=2):
    import requests
    from bs4 import BeautifulSoup
    from collections import OrderedDict

    page = requests.get(url=url, timeout=tmout)
    soup = BeautifulSoup(page.content, 'html.parser')
    #print('\nsoup >>>', soup)
    table = soup.find_all(id="countries")[0]

    result = []

    headers = None
    for row in table:
        # Skip the children that are not HTML tags (whitespace strings)
        if isinstance(row, str):
            continue
        # Assume the first row with content holds the headers
        if not headers:
            # build a list with the text content of each tag in the row:
            headers = [cell.get_text() for cell in row]
            continue

        row_contents = [cell.get_text() for cell in row]
        data_dict = OrderedDict(zip(headers, row_contents))
        result.append(data_dict)

    return result


from pprint import pprint
pprint(importa())

(It works here - note the use of OrderedDict to make the dictionaries easier to visualize when printed.)
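Two side notes of my own, beyond the original answer: on Python 3.7+ the built-in dict preserves insertion order, so OrderedDict is no longer necessary; and if you want the shape discussed in the comments ({"BR": {...}}), you can re-key the result list by the country-code column. Which column holds the code is an assumption here - check the real header text first:

    # on Python 3.7+ this line inside importa() works just as well:
    #     data_dict = dict(zip(headers, row_contents))

    rows = importa()
    # hypothetical: assumes the first column holds the country code;
    # verify with print(list(rows[0])) before relying on it
    code_column = list(rows[0])[0]
    by_code = {row[code_column]: row for row in rows}
    # by_code['BR'] is then the full row for Brazil as a dictionary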

  • It took me a while to understand the code, but I finally got it. Thanks for the help!
