BeautifulSoup is being used redundantly there: you only find the beginning of the table, then use "brute force" to separate all the elements with "; " and treat the result as plain text. This way you do not preserve the table structure, and it is difficult to know what is a table header and what is content.
Nothing will create the headers for you "by magic". The csv module has tools to extract dictionaries from a well-structured text file on disk. Even if the call to get_text("; ") turned your data into a well-structured CSV file (which is not the case, because the line breaks a CSV file requires won't be there, except by coincidence of the HTML formatting), you would have to pass csv.DictReader an iterator that delivers one line at a time. When you split on "; ", your iterator delivers one cell at a time instead. DictReader then returns a dictionary for each cell, without knowing what is a header and what is not.
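The difference can be seen with a small sketch that needs no network access. The sample rows below are made up for illustration and are not taken from the real page; only csv.DictReader itself comes from the discussion above.

```python
import csv

# Hypothetical sample data, not the real geonames content.
lines = ["ISO;Country;Capital", "BR;Brazil;Brasilia"]

# One line per item: the first line becomes the field names,
# and every following line becomes one dictionary.
per_line = list(csv.DictReader(lines, delimiter=";"))

# What splitting the get_text("; ") output gives: one cell per item.
flat = "ISO; Country; Capital; BR; Brazil; Brasilia"
cells = flat.split("; ")

# DictReader treats the first cell as the only field name and
# every other cell as a separate one-column "row".
per_cell = list(csv.DictReader(cells))

print(per_line)
print(per_cell)
```

In the second case every dictionary has the single key "ISO", which is why the headers and the content can no longer be told apart.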
For this kind of task there is no ready-made recipe: every page is different, and it is very difficult to look at the HTML and write a parsing structure that works on the first try. The best approach is to work in Python's interactive mode: you retrieve the page data with requests.get, create the soup object, and then experiment with the various methods of that object and with the page structure until you find out how to get your data in the shape you want.
In this case, you would see that once we find the "countries" table, iterating over it with a "for" alternately returns a table row (including the header row) and a text string that contains only whitespace.
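This alternation can be reproduced offline with a tiny stand-in table. The HTML below is invented for illustration and is much simpler than the real page; it only assumes a table with id="countries" whose rows are separated by line breaks.

```python
from bs4 import BeautifulSoup, Tag

# Made-up miniature of the page, for experimenting without a request.
html = """<table id="countries">
<tr><th>ISO</th><th>Country</th></tr>
<tr><td>BR</td><td>Brazil</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="countries")

# Iterating over a tag yields its direct children: the line breaks
# between rows show up as strings, the rows themselves as tags.
kinds = [type(child).__name__ for child in table]
print(kinds)

# Keep only the tags, then read the text of each header or data cell.
rows = [child for child in table if isinstance(child, Tag)]
headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
records = [
    dict(zip(headers, (cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))))
    for row in rows[1:]
]
print(records)
```

Asking each row for its th and td tags, instead of iterating over the row directly, is one way to avoid picking up stray whitespace strings inside a row.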
Maybe it’s possible to do something like this then:
    def importa(url='http://www.geonames.org/countries/', tmout=2):
        import requests
        from bs4 import BeautifulSoup
        from collections import OrderedDict

        page = requests.get(url=url, timeout=tmout)
        soup = BeautifulSoup(page.content, 'html.parser')
        #print('\nsoup >>>', soup)
        table = soup.find_all(id="countries")[0]
        result = []
        headers = None
        for row in table:
            # Skip the items that are plain text (whitespace), not HTML tags
            if isinstance(row, str):
                continue
            # Assume the first row with content holds the headers
            if not headers:
                # Build a list with the text content of each tag in the row
                headers = [cell.get_text() for cell in row]
                continue
            row_contents = [cell.get_text() for cell in row]
            # Pair each header with the cell in the same position
            data_dict = OrderedDict(zip(headers, row_contents))
            result.append(data_dict)
        return result

    from pprint import pprint
    pprint(importa())
(Here it works - note the use of OrderedDict to make the dictionaries easier to read.)
What is your script returning so far?
– Marlysson
A dictionary for each element: https://pastebin.com/WjiebRLZ
– britodfbr
What do you want as the key and what as the value? The rows have multiple fields. Or do you want the keys to be the country code and the values another dictionary with the rest of the row's data? {"BR":{"Country":"Brazil","Capital":"Brasilia",,,}}
– Marlysson
That. With csv.DictReader, the field title becomes the key, and each line becomes a dictionary.
– britodfbr
I got it. I’ll check here..
– Marlysson
Thank you very much.
– britodfbr
Let’s continue this discussion in chat.
– Marlysson