Convert HTML5 table + images to CSV or SQL

0

I’m in big trouble!

I have roughly 1.5 million records, including images, in an HTML5 table (and the trouble starts right there: the browser cannot render all the images).

My idea was to convert this table to CSV, load it into a MySQL database, and then build a pagination algorithm to display it.

How can I convert these tags (<table>, <tr>, <td>, some <a>, and also <img>) into an Excel file?

Is there any alternative? Here is a summarized example of what each row of the table looks like:

<tr>
    <td class="">
        <a href="#">Processo 2333382</a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
    <td>
        <a>
            <img src="LINK DA IMAGEM">
        </a>
    </td>
</tr>

In short, I need to get every image link, together with its process number, into the database, or at least into a CSV file.

  • Do you want to switch to CSV because you think it is better, or because you actually need a CSV?

  • It is one of the ways I found to later load this table into MySQL, but if there is a way to do that directly, that works too.

  • I believe using an HTML5 parser in PHP would be the best option.

2 answers

3

You did not specify the expected output, so I am guessing. For the following input file, input.html:

<tr>
    <td class=""> <a href="#">Processo 1</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
<tr>
    <td class=""> <a href="#">Processo 2</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>

The output is a CSV file, output.csv, in the following format:

 Processo 1,LINK DA IMAGEM,LINK DA IMAGEM,LINK DA IMAGEM
 Processo 2,LINK DA IMAGEM,LINK DA IMAGEM,LINK DA IMAGEM

The Python script below performs this conversion:

import csv

from lxml import html

# Read the input file into s
with open('input.html', 'r') as myfile:
    s = myfile.read()

# Parse the HTML and find every table row (<tr>), at any depth
page = html.fromstring(s)
rows = page.iter('tr')

# Extract the content of each row
data = []
for row in rows:
    datarow = []
    for c in row:
        # If the cell contains an image, save its link
        imgel = c.find('a/img')
        if imgel is not None:
            datarow.append(imgel.get('src'))
        # Otherwise, save the cell text (the process name)
        else:
            datarow.append(c.text_content().strip())
    data.append(datarow)

# Write the output to a CSV file
# (Python 3: open in text mode with newline='' instead of 'wb')
with open('output.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(data)
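
With roughly 1.5 million rows, it may be worth not holding the whole table in memory at once. The following is only a sketch of one possible variant, not the script above: it swaps lxml for the standard library's html.parser and writes each row as soon as its closing </tr> is seen. The names RowExtractor and html_rows_to_csv are hypothetical, and the sketch assumes that every cell holds either plain text or a single <img>.

```python
import csv
import io
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Emit one CSV row per <tr>: the img src for image cells, else the cell text."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer  # csv.writer that receives each finished row
        self.row = None       # cells of the current <tr>, or None outside a row
        self.cell = None      # state of the current <td>, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag == 'td' and self.row is not None:
            self.cell = {'text': [], 'src': None}
        elif tag == 'img' and self.cell is not None:
            self.cell['src'] = dict(attrs).get('src')

    def handle_data(self, data):
        if self.cell is not None:
            self.cell['text'].append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self.cell is not None:
            text = ''.join(self.cell['text']).strip()
            self.row.append(self.cell['src'] or text)
            self.cell = None
        elif tag == 'tr' and self.row is not None:
            self.writer.writerow(self.row)  # row is written immediately
            self.row = None


def html_rows_to_csv(chunks, out):
    """Feed HTML text chunk by chunk and write CSV rows to the file-like `out`."""
    parser = RowExtractor(csv.writer(out))
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()


# Small demonstration; the input is deliberately split into two chunks.
sample = (
    '<tr><td class=""> <a href="#">Processo 1</a> </td>'
    '<td> <a> <img src="LINK DA IMAGEM"> </a> </td></tr>'
)
buf = io.StringIO()
html_rows_to_csv([sample[:40], sample[40:]], buf)
print(buf.getvalue())
```

For a real file, the chunks could come from something like iter(lambda: f.read(65536), '') on the opened input, with the output file opened using newline='' as in the script above.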

0

Excel VBA

This can be accomplished in Excel VBA with a regular expression.

Regex

The pattern is: (?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))

It can be validated on Regex101 and on Debuggex. The JavaScript snippet below tests it against sample input:

const regex = /(?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))/g;
const str = `<tr>
    <td class=""> <a href="#">Processo 1</a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> 
</a> 
</td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
<tr>
    <td class=""> <a href="#">Processo 2</a> </td>
    <td>
 <a>
 <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
    <td> <a> <img src="LINK DA IMAGEM"> </a> </td>
</tr>
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Enable Regex in Excel

  1. Regex support needs to be enabled. First, enable the Developer tab.
    1. In the 'Developer' tab, click 'Visual Basic' and the VBA window will open.
    2. Go to 'Tools' -> 'References...' and a window will open.
    3. Find 'Microsoft VBScript Regular Expressions 5.5', as in the image below, and check this option.

References window

VBA code

Just set the str variable to the desired input string; it can come from an Excel cell, a Range, an Array, or simply literal text, as in the example.

This code does not do everything you asked for, but I suggest breaking the problem into several smaller problems and assembling code that performs the full task.

Dim str As String
Dim objRegExp As Object
Dim objMatches As Object
Dim m As Object
str = "<tr> <td class=""""> <a href=""#"">Processo 1</a> </td> <td> <a> <img src=""LINK DA IMAGEM""> </a> </td>  <td> <a> <img src=""LINK DA IMAGEM""></a></td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td></tr><tr>    <td class=""""> <a href=""#"">Processo 2</a> </td><td> <a> <img src=""LINK DA IMAGEM""> </a> </td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td>    <td> <a> <img src=""LINK DA IMAGEM""> </a> </td></tr>"
Set objRegExp = CreateObject("VBScript.RegExp") 'New regexp
objRegExp.Pattern = "(?:<a.*?>\s*[<img src=""]*)(.+?)(?=""?>?\s*(?:<\/a>|$))"
objRegExp.Global = True
Set objMatches = objRegExp.Execute(str)
If objMatches.Count <> 0 Then
    For Each m In objMatches
        Debug.Print m.SubMatches(0) ' captured value: process name or image link
    Next
End If
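
One of the remaining smaller problems is grouping the captures by table row. As a rough illustration only (in Python rather than VBA, and not part of the code above), the sketch below applies the same pattern to each piece of text between </tr> tags and writes one CSV line per row. The helper name rows_from_html is hypothetical, and the sketch assumes the markup contains no line breaks inside a link.

```python
import csv
import io
import re

# Same pattern as above, unchanged.
PATTERN = re.compile(r'(?:<a.*?>\s*[<img src="]*)(.+?)(?="?>?\s*(?:<\/a>|$))')


def rows_from_html(text):
    """Split the text on </tr> and collect the pattern's captures for each row."""
    rows = []
    for chunk in text.split('</tr>'):
        cells = [value.strip() for value in PATTERN.findall(chunk)]
        if cells:  # skip the empty remainder after the last </tr>
            rows.append(cells)
    return rows


sample = (
    '<tr> <td class=""> <a href="#">Processo 1</a> </td> '
    '<td> <a> <img src="LINK DA IMAGEM"> </a> </td></tr>'
    '<tr> <td class=""> <a href="#">Processo 2</a> </td> '
    '<td> <a> <img src="LINK DA IMAGEM"></a></td></tr>'
)

rows = rows_from_html(sample)

# Write the grouped captures as CSV (to memory here; a real file works the same way).
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

One caveat about the pattern itself: the character class [<img src="]* also consumes any leading characters of the link that happen to be in that set, so a relative path such as img/a.png would be captured as /a.png. Links beginning with http are not affected.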

Result

Using the input string from @Klaus's answer:

Result image
