Traverse HTML content and remove parts of it using PHP

1

For example, I have this code that removes every div with the class contextual from the HTML held in the string $sHTML:

$nPosIni = strpos($sHTML, '<div class="contextual">');
while ($nPosIni !== false) { // remove every div with the class "contextual"
    $nPosFim = strpos($sHTML, '</div>', $nPosIni);
    if ($nPosFim === false) {
        break; // no closing tag found, so stop instead of corrupting the string
    }
    $sHTML = substr($sHTML, 0, $nPosIni) .
             substr($sHTML, $nPosFim + strlen('</div>'));
    $nPosIni = strpos($sHTML, '<div class="contextual">');
}

What I need now is to remove another <div>, with a different class, from the HTML, while keeping the <h3> CONTEÚDO </h3> that is inside that <div>.


I have tried many approaches but could not find an efficient one. Does anyone know a good practice for this?


Note: The environment I am using does not accept scripts or functions, only PHP, HTML and CSS ...


SAMPLE HTML:

<html>
    <head></head>
    <body>
        <div class="xy">
            <h3> conteúdo </h3>
        </div>
    </body>
</html>

HTML AS IT SHOULD BE:

<html>
    <head></head>
    <body>
        <h3> conteúdo </h3>
    </body>
</html>
  • 2

    Have you already tried something with the DOMDocument class?

  • Can you provide a copy of the HTML code that serves as the basis for this routine?

  • Do you want the div to remain if it contains the H3? Or do you want to remove the div and keep the H3 element inside it?

  • @Andersoncarloswoss, I haven't tried it. How would you suggest doing it?

  • @Diegoschmidt, I want to remove the whole DIV and leave only the H3 that is inside it.

  • @Caiubyfreitas, I believe the code is not necessary, because the routine should work for any kind of HTML ...

  • Can you add an example to the question showing the HTML as it is and how it should look?

  • Sure, just a moment...

  • Question updated. It is quite straightforward; I included just one example of the use I need.


2 answers

1


As I said, the best way to manipulate HTML with PHP is to use the native DOM classes. In this case, I used the DOMDocument and DOMXPath classes directly. The code is commented step by step, so I think it will be easy to follow:

<?php

$html = <<<HTML
<html>
    <head></head>
    <body>
        <div class="xy">
            <h3> conteúdo </h3>
        </div>
    </body>
</html>
HTML;

// 1. Create a DOMDocument instance:
$dom = new DOMDocument();

// 2. Load the HTML code from a string:
$dom->loadHTML($html);

// 3. Create a DOMXPath instance:
$xpath = new DOMXPath($dom);

// 4. Find every `div` element in the HTML whose class is `xy`:
$nodes = $xpath->query("//div[@class='xy']");

// 5. Loop over the list of elements found:
foreach ($nodes as $node) {

    // 6. Get the first `h3` element inside the `div`:
    $h3 = $node->getElementsByTagName("h3")->item(0);

    // Skip any `div` that has no `h3`, since there is nothing to keep:
    if ($h3 === null) {
        continue;
    }

    // 7. Replace the `div` in the HTML with its `h3`:
    $node->parentNode->replaceChild($h3, $node);
}

// 8. Print the final HTML:
echo $dom->saveHTML(), PHP_EOL;

See it working on Repl.it
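
The XPath expression in step 4 matches only when the class attribute is exactly xy, and step 7 keeps only the first h3. The following is a minimal sketch of a variation, assuming the div may carry additional classes and that all of its children should be kept. The sample markup and the variable names are made up for illustration and are not part of the answer above:

```php
<?php
// Sketch only: sample markup with an extra class on the div (an assumption).
$html = '<html><head></head><body><div class="box xy"><h3> title </h3><p>text</p></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Match "xy" as one of several space-separated class names.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' xy ')]";

foreach ($xpath->query($query) as $div) {
    // Move every child out of the div, in order, then drop the empty div.
    while ($div->firstChild !== null) {
        $div->parentNode->insertBefore($div->firstChild, $div);
    }
    $div->parentNode->removeChild($div);
}

// Store the result in a variable instead of printing it.
$result = $dom->saveHTML();
```

This version has no callback or user-defined function, which may matter given the restriction mentioned in the question.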

  • Is there a way to save the modified HTML to a variable instead of displaying it with echo?

  • 1

    Yes, just assign the value returned by $dom->saveHTML() to the variable.

  • Perfect, very good Anderson, it worked perfectly. This way of using the DOM is cool; I didn't know it, and it is very practical. Just out of curiosity, what does the PHP_EOL at the end do?

  • 1

    It's a PHP constant, end of line, used only to insert the line break. Its value is \n or \r\n, depending on the operating system.

  • Got it, I didn't know that. So whenever I need a \n or \r\n, is it better to use PHP_EOL because it will be interpreted correctly, or does it depend on the case?

  • 1

    If your application runs on different operating systems, it is best to use the constant.
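
The two points from these comments can be sketched together: saveHTML() returns a string, so it can be stored instead of echoed, and PHP_EOL holds the line ending of the platform PHP is running on. The markup and variable names below are illustrative only:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><head></head><body><h3> title </h3></body></html>');

// saveHTML() returns the serialized document; nothing is printed here.
$sHTML = $dom->saveHTML();

// PHP_EOL is "\n" on Unix-like systems and "\r\n" on Windows.
$line = 'length: ' . strlen($sHTML) . PHP_EOL;
echo $line;
```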


0

It is possible to replace the div using the function preg_replace_callback(), leaving only the H3 tag.

$new_sHTML = preg_replace_callback('/<div class="xy">.*?<\/div>/si',
  function ($match) {
    // Keep only the first <h3>...</h3> found inside the matched div.
    return preg_match('/<h3\b.*?<\/h3>/si', $match[0], $h3) ? $h3[0] : '';
  }, $sHTML
);

See the example working at https://repl.it/MMxC/3
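
Since the question says the environment does not accept functions, a callback-free variation may be possible with preg_replace() and a capture group. This is only a sketch under the same assumptions as the code above: the div is written exactly as class="xy", it contains an h3, and it holds no nested div. If a matching div had no h3, the lazy pattern could run on to a later h3, so this is not a general HTML parser:

```php
<?php
// Sample input made up for this sketch.
$sHTML = '<body><div class="xy"><h3> title </h3></div></body>';

// Capture the first <h3>...</h3> inside the div and keep only that capture.
$pattern = '/<div class="xy">.*?(<h3\b.*?<\/h3>).*?<\/div>/si';
$new_sHTML = preg_replace($pattern, '$1', $sHTML);
```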

  • I'm sorry Filipe, but it didn't work that way. As I understood it, this second method removes the whole <div>, but I need the <h3> inside that <div> to stay. It also didn't work because the code I'm using doesn't accept functions either. ':D

  • I answered based on what you had previously posted; the question did not have all this detail at the time. :(

  • Only the example was added, and when testing your code I saw that functions did not work either, so I added that to the question.

  • I didn't have enough reputation to comment, I'm new, but okay. You've already managed to solve the problem.

  • All right, Filipe, and thank you for your contribution. Although it is not good practice, I suggest you remove your answer so you don't lose points, or edit it with some other idea or suggestion.

  • Hello Marcos, I made changes to my answer. It basically follows what I had proposed, but now that I understand your problem better, I managed to make it meet the goal, apparently. Hugs.

  • 2

    @Filipericardo It might be interesting for you to read Why Regex should not be used to handle HTML? for future reference.

  • @Andersoncarloswoss cool, I will read it, thanks for the tip.

  • 1

    The good thing about programming is that there are different ways of achieving the same goal, and the more we know, the better we get at choosing among them, always aiming for the shortest path or the best one to follow.
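
A short illustration of the limitation behind the regex warning in the comments above, assuming markup with a nested div (the sample string is made up for this sketch):

```php
<?php
$sHTML = '<div class="xy"><div>inner</div><h3> title </h3></div>';

// The lazy .*? stops at the FIRST </div>, which closes the inner div.
preg_match('/<div class="xy">.*?<\/div>/si', $sHTML, $match);
$matched = $match[0];
// $matched is '<div class="xy"><div>inner</div>'
```

The h3 falls outside the match, so a replacement built on this pattern would have nothing to keep and would leave a stray closing tag behind. A DOM parser tracks nesting and does not have this problem.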
