How to use "for" and "while" to capture cell tags from several tables in an HTML file?

I have several HTML files with tables whose data I need to extract and load into a database, but I can't navigate the HTML tree to find the tags that hold the cells. The HTML looks like this:

<div class="details">
   <div class="title-table"><h2> BEAUNE</h2>
   <div class="table-responsive">
      <div class="table-towers">
        <div id="table472dc5e9b46304cf95865f7db6c459aa" class="collapse in table-content">
           <div class="table-towers">
                 <div class="table-row">
                    <div class="table-cell build_type">Apartamento</div>
                    <div class="table-cell area_useful">220m²</div>
                    <div class="table-cell rooms">3</div>
                    <div class="table-cell garage">4</div>
                    <div class="table-cell bird_estimate_average">R$ 2.816.344,33*
            <p><small>(R$ 2.393.892,68 a R$ 3.238.795,98)</small></p>
        </div>
                 <div class="table-row">
        <div class="table-cell build_type">Cobertura</div>
                    <div class="table-cell area_useful">396m²</div>
                    <div class="table-cell rooms">3</div>
                    <div class="table-cell garage">5</div>
                    <div class="table-cell bird_estimate_average">R$ 5.069.419,80*
                             <p><small>(R$ 4.309.006,83 a R$ 5.829.832,77)</small></p>
                     </div>
   <div class="title-table"><h2>BERGERAC</h2>
      <div class="table-responsive">
          <div class="table-towers">
               <div id="table0b60c9a0a450b921186c91102da447d9" class="collapse table-content">
                   <div class="table-towers">
                       <div class="table-row">
                            <div class="table-cell build_type">Apartamento</div>
                    <div class="table-cell area_useful">220m²</div>
                    <div class="table-cell rooms">3</div>
                    <div class="table-cell garage">4</div>
                    <div class="table-cell bird_estimate_average">R$ 2.816.344,33*
                               <p><small>(R$ 2.393.892,68 a R$ 3.238.795,98)</small></p>
                 </div>
                                            <!-- asdasd -->
                </div>
                        </div>

There are 10 more tables in the same HTML file, all following the same structure, so I thought of using a "for" loop to fetch each "title-table" tag, which holds the name of the table, like this:

for id_torre in soup.find("div",{"class":"details"}).findAll("div",{"class":"title-table"}):#.findAll("h2"):
    nm = id_torre.find("h2")
    print(nm)

Then, with the list of table titles, I thought of using a "while" loop to find the table for each title and capture the cell data in each row, so that I can load it into the database:

while len(id_torre) >0:
    nm = id_torre
    print(nm)

    tipo = soup.find("div",{"class":id_torre}).find("div",{"class":"table-cell build_type"})
    print(tipo)

    m2_util = soup.find("div",{"class":id_torre}).find("div",{"class":"table-cell area_useful"})
    print(m2_util)

    dt = soup.find("div",{"class":id_torre}).find("div",{"class":"table-cell rooms"})
    print(dt)

But it returns "None" for every field and keeps looping endlessly. What's wrong with the code? I'm new to programming, and Python is the first language I'm learning.

  • Do all tables have their title in an <h2> inside the title-table class? In this case it is BEAUNE, right?

  • Yes, they all have the title in an <h2>. In this case there are two tables, one named BEAUNE and the other BERGERAC, but each table also has its own <div id="table...

  • And you want the text of the elements whose class is table-cell, together with the table each one belongs to, correct?

  • Correct, I need the class=table-cell elements of each table, linked to the table name, to load into the database.

  • I have the complete HTML file; I only posted part of it here. Do you want me to send it to you?

  • I don't need it, I'll work from what you posted here, no problem.

  • '\nBERGERAC': ['\nApartamento\n', '\n220m²\n', '\n3', '\n4', '\nR$ 2.816.344,33*\n\n\n\n(R$ 2.393.892,68 a R$ 3.238.795,98)\n\n\n', '\nCobertura\n', '\n396m²\n', '\n3', '\n5', '\nR$ 5.069.419,80*\n\n\n\n\n\n(R$ 4.309.006,83 a R$ 5.829.832,77)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'],

  • It worked, but how do I remove the "\n" and strip the extra spaces from the texts?

  • I'm still working on a solution; see if the answer below helps.
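
For the whitespace question raised in the comments above, one possible approach is to collapse every run of whitespace (newlines included) with str.split() followed by " ".join(). This is only a minimal sketch: the helper name clean_text and the sample dictionary are made up for illustration and merely imitate the shape of the output shown above.

```python
def clean_text(text):
    """Collapse all whitespace runs (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())


# Hypothetical sample, shaped like the scraped output (shortened).
raw = {
    "\nBERGERAC": [
        "\nApartamento\n",
        "\n220m²\n",
        "\n3",
        "\nR$ 2.816.344,33*\n\n\n\n(R$ 2.393.892,68 a R$ 3.238.795,98)\n\n\n",
    ]
}

# Clean both the table title (the key) and every cell text (the values).
cleaned = {
    clean_text(title): [clean_text(cell) for cell in cells]
    for title, cells in raw.items()
}
print(cleaned)
```

Calling split() with no argument splits on any whitespace and discards empty pieces, so leading and trailing newlines disappear as well.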


2 answers

0

Open the file with the open function (for example, file = open('filename.type')) so you can loop over it with a for, which goes through it line by line and cannot loop forever. If you know regular expressions, the re library may be more useful than find. With re you can match, for example, this line:

<div class="table-cell build_type">Apartamento</div>

with a pattern of the form

<...>Apartamento</...>

and then extract the word 'Apartamento' and save it in a database.
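
The line-by-line idea in this answer could be sketched roughly as follows. The pattern CELL_RE and the function cells_from_lines are hypothetical names, and the pattern assumes each cell's opening tag looks exactly like the ones in the question (class="table-cell <field>"). Regular expressions are brittle on HTML, so treat this as an illustration rather than a robust parser.

```python
import re

# Captures the second class name and the text up to the next tag.
# Assumes the attribute is written exactly as class="table-cell <field>".
CELL_RE = re.compile(r'<div class="table-cell (\w+)">\s*([^<]*)')


def cells_from_lines(lines):
    """Yield (field, text) for every table-cell div found, line by line."""
    for line in lines:
        for field, text in CELL_RE.findall(line):
            yield field, text.strip()


# Lines copied from the HTML in the question.
sample = [
    '<div class="table-cell build_type">Apartamento</div>',
    '<div class="table-cell area_useful">220m²</div>',
    '<div class="table-cell bird_estimate_average">R$ 2.816.344,33*',
    '<p><small>(R$ 2.393.892,68 a R$ 3.238.795,98)</small></p>',
]
result = list(cells_from_lines(sample))
print(result)
```

An open file object can be passed in place of the sample list, since iterating over a file also yields one line at a time. Note that the price range inside <small> sits on its own line and is not captured by this pattern.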

0

This HTML has several unclosed tags, which is why the BeautifulSoup parser gets lost. You can check the flaws on this site: https://www.aliciaramirez.com/closing-tags-checker/

The following tags don't seem to be closed:

Line 1: <div class="details">
Line 2: <div class="title-table">
Line 3: <div class="table-responsive">
Line 4: <div class="table-towers">
Line 5: <div id="table472dc5e9b46304cf95865f7db6c459aa" class="collapse in table-content">
Line 6: <div class="table-towers">
Line 7: <div class="table-row">
Line 15: <div class="table-row">
Line 23: <div class="title-table">
Line 24: <div class="table-responsive">
Line 25: <div class="table-towers">
Line 26: <div id="table0b60c9a0a450b921186c91102da447d9" class="collapse table-content">

If you haven't pasted the full code, please do so. If the page really does contain these errors, let us know so we can help.
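
If the page really is delivered with unclosed tags, one way around the nesting problem is to avoid relying on it: walk the titles and the cells in document order and attach each cell to the most recent <h2>. The sketch below assumes BeautifulSoup 4 with the built-in "html.parser"; the names is_title_or_cell, clean and parse_tables are invented for this example, and the HTML is a shortened copy of the snippet in the question. It also assumes that a repeated field name (such as build_type) marks the start of a new row.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML, unclosed tags included.
html = """
<div class="details">
   <div class="title-table"><h2> BEAUNE</h2>
   <div class="table-row">
      <div class="table-cell build_type">Apartamento</div>
      <div class="table-cell rooms">3</div>
      <div class="table-cell bird_estimate_average">R$ 2.816.344,33*
          <p><small>(R$ 2.393.892,68 a R$ 3.238.795,98)</small></p>
      </div>
   <div class="table-row">
      <div class="table-cell build_type">Cobertura</div>
      <div class="table-cell rooms">3</div>
   <div class="title-table"><h2>BERGERAC</h2>
   <div class="table-row">
      <div class="table-cell build_type">Apartamento</div>
      <div class="table-cell rooms">3</div>
"""


def is_title_or_cell(tag):
    """Match table titles (<h2>) and cell divs, whatever their nesting."""
    return tag.name == "h2" or (
        tag.name == "div" and "table-cell" in tag.get("class", [])
    )


def clean(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def parse_tables(markup):
    """Return {title: [row_dict, ...]}, grouping cells under the last <h2> seen."""
    soup = BeautifulSoup(markup, "html.parser")
    tables = {}
    rows = None
    for tag in soup.find_all(is_title_or_cell):  # document order
        if tag.name == "h2":
            rows = tables.setdefault(clean(tag.get_text()), [])
            continue
        if rows is None:
            continue  # a cell that appears before any title is skipped
        extra = [c for c in tag["class"] if c != "table-cell"]
        field = extra[0] if extra else "cell"
        # Assumption: a field name seen again means a new row has started.
        if not rows or field in rows[-1]:
            rows.append({})
        rows[-1][field] = clean(tag.get_text())
    return tables


tables = parse_tables(html)
for title, table_rows in tables.items():
    print(title, table_rows)
```

A single for loop over the matched tags is enough here. The while loop in the question cannot terminate because nothing inside it changes id_torre, so its condition stays true.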
