Extract data from an HTML list with PHP and write JSON

Hey there, guys. I'm trying to use PHP to get information from an HTML table (a ul list) on a website, but I can't separate the extracted data. My goal is to generate a JSON file with this information split up by row and by HTML tag.

Below are the table and the code I wrote using explode.

<ul class="milestones">
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        IPATINGA/MG
        <br>
        <small>3 semanas</small>
    </li>
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        SÃO PAULO/SP
        <br>
        <small>3 semanas</small>
    </li>
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        GOIANIA/GO
        <br>
        <small>3 semanas</small>
    </li>
</ul>

<?php
  $url = 'https://meusite.com.br/tabela';
  $dadosSite = file_get_contents($url);

  $var1 = explode('<ul class="milestones">',$dadosSite);
  $var2 = explode('</ul>',$var1[1]);

  $var3 = explode('<li>',$var2[0]);
  $var4 = explode('</li>',$var3[1]);

  $dados_json = json_encode($var3[1]);
  $fp = fopen("dados.json", "a");
  $escreve = fwrite($fp, $dados_json);
  fclose($fp);
?>

The JSON file should look like this:

[
  {"imagem":"https://imagem.png","data_hora":"04/08/2020 10:09","titulo":"Entrega", "sub_titulo":"IPATINGA/MG", "semanas":"3 semanas"},
  {"imagem":"https://imagem.png","data_hora":"04/08/2020 10:09","titulo":"Entrega", "sub_titulo":"SÃO PAULO/SP", "semanas":"3 semanas"},
  {"imagem":"https://imagem.png","data_hora":"04/08/2020 10:09","titulo":"Entrega", "sub_titulo":"GOIANIA/GO", "semanas":"3 semanas"}
]

1 answer

I believe a better solution is to parse the HTML with DOMDocument and DOMXPath.

DOMDocument builds an object you can manipulate from a DOM document, in this case HTML, although it could be XML as well.

DOMXPath lets you run XPath queries, which work much like CSS selectors, and returns the matching document nodes so you can read or manipulate them.
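To make the comparison with CSS selectors concrete, here is a minimal sketch. The $snippet string and its second item ("Coleta") are made up for illustration and are not from your page. The XPath expression below plays roughly the role of the CSS selector "ul.milestones > li > strong", with one difference: @class="milestones" matches the whole attribute value, not a single class among several.

```php
<?php
// Hypothetical, ASCII-only snippet used only to demonstrate the query.
$snippet = '<ul class="milestones"><li><strong>Entrega</strong></li><li><strong>Coleta</strong></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($snippet);

$xpath = new DOMXPath($dom);

// Roughly the CSS selector "ul.milestones > li > strong".
$nodes = $xpath->query('//ul[@class="milestones"]/li/strong');

$titulos = [];
foreach ($nodes as $node) {
    // nodeValue is the text content of the matched element.
    $titulos[] = $node->nodeValue;
}

print_r($titulos);
```

query() returns a DOMNodeList, so you can loop over it with foreach just like an array.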

One suggestion is the html_to_array function in the code example below. Note that I assumed you are reading a complete HTML document (with html, head and body tags).

<?php
$html = '
<html>
<head>
</head>
<body>
<ul class="milestones">
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        IPATINGA/MG
        <br>
        <small>3 semanas</small>
    </li>
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        SÃO PAULO/SP
        <br>
        <small>3 semanas</small>
    </li>
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        GOIANIA/GO
        <br>
        <small>3 semanas</small>
    </li>
</ul>
</body>
</html>
';

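function html_to_array($html, $json_encode) {
    // Real pages are rarely perfectly valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumes $html is UTF-8. Without a declared charset, loadHTML() may misread
    // accented text such as "SÃO PAULO", so a charset declaration is prepended.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    $tags = $xpath->query('//ul[@class="milestones"]/li');

    $array = [];

    foreach ($tags as $tag) {
      $linha = [];

      // Each expression is evaluated relative to the LI being read, so the
      // result does not depend on how whitespace between the tags is parsed.
      $linha['imagem'] = trim($xpath->evaluate('string(img/@src)', $tag));
      $linha['data_hora'] = trim($xpath->evaluate('string(span[@class="out"])', $tag));
      $linha['titulo'] = trim($xpath->evaluate('string(strong)', $tag));
      // First non-blank text node directly inside the LI (the text between the BR tags).
      $linha['sub_titulo'] = trim($xpath->evaluate('string(text()[normalize-space()][1])', $tag));
      $linha['semanas'] = trim($xpath->evaluate('string(small)', $tag));

      $array[] = $linha;
    }

    if ($json_encode) {
      // Keep "/" and accented characters readable, as in the desired JSON.
      return json_encode($array, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    }

    return $array;
}

print_r(html_to_array($html, false));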

Note that this is a deliberately "rigid" parser suggestion: it was written for exactly the HTML I included above and expects each field at a fixed place inside the LI being read, so it will need adjusting if the page's markup changes.

Example output from running the code:

Array
(
    [0] => Array
        (
            [imagem] => https://imagem.png
            [data_hora] => 04/08/2020 10:09
            [titulo] => Entrega
            [sub_titulo] => IPATINGA/MG
            [semanas] => 3 semanas
        )

    [1] => Array
        (
            [imagem] => https://imagem.png
            [data_hora] => 04/08/2020 10:09
            [titulo] => Entrega
            [sub_titulo] => SÃO PAULO/SP
            [semanas] => 3 semanas
        )

    [2] => Array
        (
            [imagem] => https://imagem.png
            [data_hora] => 04/08/2020 10:09
            [titulo] => Entrega
            [sub_titulo] => GOIANIA/GO
            [semanas] => 3 semanas
        )

)

Note that reading the children of an LI by position (for example with childNodes->item(n)) gives a sequence with "jumps", because every node inside the LI is part of the DOM tree: the BR tags and the text between tags, including whitespace, are nodes too.

See the MDN documentation for a better understanding of the DOM (Document Object Model).
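If you want to see those "jumps" for yourself, a small sketch like the one below lists the child nodes of a single LI. It uses one item from the question's HTML; the variable names are mine. I am not showing the printed output on purpose, because how many whitespace-only text nodes survive depends on the libxml version behind your PHP build.

```php
<?php
// One LI with the same structure as in the question.
$html = '<ul class="milestones">
    <li>
        <img src="https://imagem.png">
        <span class="out">04/08/2020 10:09</span>
        <strong>Entrega</strong>
        <br>
        IPATINGA/MG
        <br>
        <small>3 semanas</small>
    </li>
</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$li = $dom->getElementsByTagName('li')->item(0);

$nomes = [];
foreach ($li->childNodes as $node) {
    // Text nodes report the name "#text"; elements report their tag name.
    $nomes[] = $node->nodeName;
}

// Print each position next to the kind of node found there.
foreach ($nomes as $posicao => $nome) {
    echo $posicao . ': ' . $nome . PHP_EOL;
}
```

You should see "#text" and "br" entries mixed in with img, span, strong and small, which is why the elements you care about do not sit at consecutive positions.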

  • Perfect! This is exactly what I wanted, but the HTML table is on a site outside my server. How would I capture it using the site URL? Thank you for your solution!

  • One way would be to fetch the content with file_get_contents(url) .... I'd appreciate it if you marked the answer as accepted; that is how the community identifies the solutions found for the problems raised
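As a rough sketch of that idea, the function below reads the page, extracts the rows and writes the JSON file in one go. The function name milestones_para_json and its two parameters are my own invention, and it looks each field up by tag name inside the LI. It assumes allow_url_fopen is enabled when $origem is a URL, and that the page declares its own charset.

```php
<?php
// Hypothetical helper: $origem may be a URL or a local file path,
// $destino is the JSON file to write. Returns true on success.
function milestones_para_json($origem, $destino) {
    $html = file_get_contents($origem);
    if ($html === false) {
        return false; // the page could not be read
    }

    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $linhas = [];

    foreach ($xpath->query('//ul[@class="milestones"]/li') as $li) {
        // Every expression is evaluated relative to the current LI.
        $linhas[] = [
            'imagem'     => trim($xpath->evaluate('string(img/@src)', $li)),
            'data_hora'  => trim($xpath->evaluate('string(span[@class="out"])', $li)),
            'titulo'     => trim($xpath->evaluate('string(strong)', $li)),
            'sub_titulo' => trim($xpath->evaluate('string(text()[normalize-space()][1])', $li)),
            'semanas'    => trim($xpath->evaluate('string(small)', $li)),
        ];
    }

    $json = json_encode($linhas, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return false; // the data could not be encoded
    }

    // Overwrites the file, so it always holds one valid JSON array.
    return file_put_contents($destino, $json) !== false;
}
```

You would call it as milestones_para_json('https://meusite.com.br/tabela', 'dados.json'). Note that it overwrites the file on purpose: opening dados.json in append mode ("a"), as in the question, would glue a new array onto the old one on every run and the file would stop being valid JSON.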
