Use Python to import locally saved htm file information

I have an htm file with the following code:

<html>
<head>
</head>
<body bgcolor="#FFFFFF">
<p><strong><big><big><font face="Arial" color="#004080">Numero de Vendas</font>
</big></big></strong></p>
</p>
<table border="0" cellspacing="1" cellpadding="0" width="800">
<tr>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">Vendedor</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">January</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">February</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">March</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">April</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">May</font></small></th>
<th height="20" bgcolor="0087E9"><small><font face="Arial" color="#FFF500">June</font></small></th>
</tr>
<tr>
<td rowspan="1">Pedro</td>
<td rowspan="1">26</td>
<td rowspan="1">12</td>
<td rowspan="1">21</td>
<td rowspan="1">23</td>
<td rowspan="1">57</td>
<td rowspan="1">24</td>
<td rowspan="1">76</td>
</tr>
</tr>
<td rowspan="1">Joao</td>
<td rowspan="1">22</td>
<td rowspan="1">15</td>
<td rowspan="1">11</td>
<td rowspan="1">13</td>
<td rowspan="1">22</td>
<td rowspan="1">28</td>
<td rowspan="1">50</td>
</tr>
</tr>
</table>
</body>
</html>

I would like to extract the information and write it to a txt file in the following format:

Pedro 26 12 21 23 57 24 76
Joao 22 15 11 13 22 28 50

I am using the code below, but it does not give a satisfactory result. I am a beginner in programming and would really appreciate it if someone could help me.

import lxml.html as PARSER
data = open('C:/Vendas/vendas.htm').read()
root = PARSER.fromstring(data)
for ele in root.getiterator():
    if len(ele) < 1:
        print(ele.text_content())

1 answer

There are many ways to extract information from HTM files, but I think the easiest is the BeautifulSoup library, which is designed for this kind of task. Below is the code I wrote.

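from bs4 import BeautifulSoup

# Read the locally saved file (same path as in the question).
with open('C:/Vendas/vendas.htm') as f:
    data = f.read()

soup = BeautifulSoup(data, "html.parser")
elements = soup.findAll("td")  # every <td> cell, in document order

for element in elements:
    text = element.getText()

    # A non-numeric cell is a name, so it starts a new output line
    # (except for the very first cell).
    if not text.isnumeric() and element is not elements[0]:
        print("")
    print(text, end=" ")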

In the code above, I extract all the <td> elements with the findAll method, which returns a list of elements, and then get the text of each one.

To match the format you wanted, I print an empty string (which produces a line break) whenever the text is not numeric, except for the first element. I then print the value with the end parameter set to a single space, so that the values that follow stay on the same line.
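Since the question also asks for a txt file, here is a rough sketch of the same grouping idea (a non-numeric cell starts a new line) that collects the lines and can write them out. To keep it dependency-free it swaps BeautifulSoup for the standard library's html.parser module. The names CellCollector, rows_from_html and save_rows are made up for this illustration, and the sample string is a shortened stand-in for the real file, so treat it as a starting point rather than a finished solution.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts
        self._in_td = False   # True while we are inside a <td>
        self._buffer = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buffer = []

    def handle_endtag(self, tag):
        # Stray closing tags such as an unmatched </tr> are simply ignored.
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buffer).strip())
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buffer.append(data)


def rows_from_html(html):
    """Group cell texts into lines; a non-numeric cell starts a new line."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    rows = []
    for text in parser.cells:
        if not text.isnumeric() or not rows:
            rows.append([])
        rows[-1].append(text)
    return [" ".join(row) for row in rows]


def save_rows(lines, path):
    """Write one line per row to a text file (path is chosen by the caller)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")


# Shortened stand-in for the real file, including the unmatched </tr>.
sample = """
<table>
<tr><td rowspan="1">Pedro</td><td rowspan="1">26</td><td rowspan="1">12</td></tr>
</tr><td rowspan="1">Joao</td><td rowspan="1">22</td><td rowspan="1">15</td></tr>
</table>
"""

lines = rows_from_html(sample)
for line in lines:
    print(line)
```

For the real data you would pass the contents of vendas.htm to rows_from_html and then call save_rows with whatever output path you like. Note that this relies on every row starting with a non-numeric name, just like the loop above; a name made only of digits, or an empty cell, would break the grouping.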

  • It worked the way I wanted it to. Thanks for the quick response.
