How to extract the text between HTML tags from a string into a PHP array

I want to separate tags into one array in PHP, but I couldn’t find an efficient way yet.

I want to turn this:

$variavel="<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>";

into

$array[0]="texto1";
$array[1]="texto2";
$array[2]="texto3";
$array[3]="texto4";

And so on. I want to capture the text displayed on a site into arrays, so that I can process the pieces one by one.

  • For the given example, this already produces the expected result, but I doubt it will work for the real need (because using regex on HTML is fragile), so I believe it would be better if you [Edit] the question and describe in more detail what you intend to do. What would this captured text from a website be? Does it follow any format? Is it the full page?

  • What I want to build is an alphabet translator, similar to what Google does with its language translator...

  • Ah, the index 2 I forgot in the list was just a typo, hehe

  • Then do as I advised and edit the question, adding a real example of your need, as you will probably need to use the DOMDocument class.

  • But I want to do this in PHP, not JavaScript... And my question already explains what I need to do: take an HTML string and capture the text of the tags, placing each piece in its own position of an array. I just need to know how to do it. I have tried many things but nothing worked; splitting on the < character and using regex did not work either.

  • Would this link be useful?

  • DOMDocument is a PHP class, not JavaScript.

2 answers

2

As mentioned in the comments, the best way to handle HTML text in PHP is to use the DOMDocument class. You can load an HTML page into a DOMDocument object as follows:

$dom = new DOMDocument();
$dom->loadHTML($html);

Here $html is the content of the file to be analyzed. Since we only want the content of the document body, we can get the node for the body as follows:

$body = $dom->getElementsByTagName("body")->item(0);

Here $body is a DOMNode object. You can check whether a node has children with the hasChildNodes method and iterate over them through the childNodes attribute. With that, we can write a recursive function that extracts the text from every node of the page:

/**
 * Extracts the text found in an HTML element, returning it as a list.
 *
 * @param DOMNode $element Element from which the text will be extracted.
 * @param array   $texts   List of texts collected so far.
 * @return array List of texts found in the element.
 */
function getTextsOfElements(DOMNode $element, array $texts = [])
{
    // Check whether the element has child nodes:
    if ($element->hasChildNodes()) {
        // Yes, so walk through all child nodes recursively:
        foreach ($element->childNodes as $e) {
            // Collect the texts of the child nodes:
            $texts = getTextsOfElements($e, $texts);
        }
    } else {
        // No, so check whether the node is a text node:
        if ($element->nodeType === XML_TEXT_NODE) {
            // Yes, strip the surrounding whitespace:
            $text = trim($element->nodeValue);

            // Check that the text is not empty (a plain truthiness test would also drop "0"):
            if ($text !== '') {
                // Yes, so append the text to the list:
                $texts[] = $text;
            }
        }
    }

    // Return the list of texts:
    return $texts;
}

So, to get the list of texts, just call the function by passing the object $body as a parameter:

print_r(getTextsOfElements($body));

If the input is the one specified in the question (full HTML):

$html = '<html>
            <head>
                <meta charset="UTF-8">
                <title>Document</title>
            </head>
            <body>
                <div>
                    <div>
                        texto1
                        <a>texto2</a>
                        <b>texto3</b>
                    </div>
                    texto4
                </div>
            </body>
        </html>';

The output will be:

Array
(
    [0] => texto1
    [1] => texto2
    [2] => texto3
    [3] => texto4
)

See working on Repl.it.
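
As a complement to the recursive function above, here is a minimal sketch of the same idea using DOMXPath, applied directly to the fragment from the question rather than to a full page. The function name textsViaXPath is made up for this illustration, and the scalar type hints assume PHP 7 or later. It relies on loadHTML wrapping a fragment that starts with an element in html and body elements, which is why the query starts at body.

```php
<?php
// Hypothetical helper (not from the original answer): collects the
// non-empty text nodes of an HTML string using an XPath query.
function textsViaXPath(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];

    // "//body//text()" selects every text node below <body>, in document order.
    foreach ($xpath->query('//body//text()') as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    return $texts;
}

// Value from the question
$variavel = '<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>';
print_r(textsViaXPath($variavel));
```

For the question's input this should list texto1 through texto4 in order, the same result as the recursive version.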

  • Strange, for some reason it raises an error on the line function getTextsOfElements(DOMNode $element, array $texts = []): Parse error: syntax error, unexpected '['. But the brackets apparently

  • @Megaanim, the problem with the brackets may be due to the snippet array $texts = []. This is the short array syntax, introduced in PHP 5.4. If your version is older, a syntax error will be raised. In that case, change it to the long form, $texts = array().

  • Daniel, I was able to make your approach work... But I realized something: I can’t discard the tags, I need them, and apparently your approach might help me... Is it possible to save them in the same array? Example: Texts[0]="<body> <div> <div>"; Texts[1]=text1; Texts[2]="<a>"; Texts[3]="</a>" ... and so on?

  • I’ve been playing with your example and found a way to do what I want, but I’m having a few problems with closing tags such as </div>. I take the name of the tag with $tagName=$e->tagName; and build $tag="<".$tagName.">"; then I just append it with $texts[] = $tag;. So far so good. The problem is that for the closing tags the tagName value comes back empty; it should be /div or /a and so on. Is there any way to know which tag is being closed?

  • Worse than that, I found out what is going on... It closes tags that are not actually closed yet. For example: <html><body><div>value<a>content></a></div></body></html>. After value an a element starts, and this a is part of the content of the div, but for some reason DOMNode creates an empty node (which I took for a closing tag) after value, and because of that the div is closed too early.
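
Regarding the last comments: the DOM tree has no separate nodes for closing tags, so the "empty" nodes observed there are most likely text nodes, which have no tagName. A closing tag has to be emitted by the code itself once all children of an element have been visited. The sketch below illustrates this under stated assumptions: the names htmlToTokens and tokensOfNode are invented for this example, attributes are dropped, and void elements such as br would also get a closing token.

```php
<?php
// Hypothetical helper: flattens a node into opening tags, texts and closing tags.
function tokensOfNode(DOMNode $node, array $tokens = []): array
{
    if ($node->nodeType === XML_TEXT_NODE) {
        // Text node: keep it only if something remains after trimming.
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $tokens[] = $text;
        }
        return $tokens;
    }

    if ($node->nodeType !== XML_ELEMENT_NODE) {
        // Comments and other node types are skipped.
        return $tokens;
    }

    // Opening tag, then the children, then the closing tag.
    $tokens[] = '<' . $node->nodeName . '>';
    foreach ($node->childNodes as $child) {
        $tokens = tokensOfNode($child, $tokens);
    }
    $tokens[] = '</' . $node->nodeName . '>';

    return $tokens;
}

// Hypothetical wrapper: parses the HTML and tokenizes the children of <body>.
function htmlToTokens(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    $tokens = [];
    foreach ($body->childNodes as $child) {
        $tokens = tokensOfNode($child, $tokens);
    }
    return $tokens;
}

print_r(htmlToTokens('<div>value<a>content</a></div>'));
```

With this traversal the closing </div> is only appended after the a element has been fully processed, so the div is not closed early.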

1

The basic idea is to create a function that replaces every HTML tag with a single "generic tag", so to speak, and then uses explode to turn the string into an array. This procedure will probably create many empty entries in the array, so we use array_filter to clean them up. A practical example:

<?php
    function remover_vazio($array) {
        return array_filter($array, '_remover_vazio_interno');
    }

    function _remover_vazio_interno($value) {
        // Keep everything except empty strings; unlike empty(), this also keeps the string "0"
        return $value !== '';
    }

    function separaTextoDoHTML($variavel) {
        // Replace every HTML tag with a single generic tag
        $variavel = preg_replace('#<[^>]+>#', '<HTML>', $variavel);

        // Explode on the generic tag, which splits the string into an array
        $array = explode('<HTML>', $variavel);

        // Filter the array to drop any empty entries
        $array = remover_vazio($array);

        // Create an auxiliary array to reorder the entries
        $arrayAux = array();

        // Walk through the array, copying the values to sequential indexes
        foreach ($array as $value) {
            $arrayAux[] = $value;
        }

        // Return the result
        return $arrayAux;
    }
?>

The code of the function separaTextoDoHTML is commented for clarity. In short, it receives a string, runs preg_replace to turn every tag into a single delimiter, and then splits on that delimiter with explode. This generates empty entries, so the function remover_vazio is used to clean the array. An example of use would be:

// Value given in the question
$variavel = '<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>';

// Call the function created above
$retorno = separaTextoDoHTML($variavel);

// Dump the returned variable to check its value
var_dump($retorno);

This will print on the screen:

array (size=4)
  0 => string 'texto1' (length=6)
  1 => string 'texto2' (length=6)
  2 => string 'texto3' (length=6)
  3 => string 'texto4' (length=6)

  • Read this reply on Stack Overflow about using regex and HTML together. To illustrate: what if a tag in the body of the file has an attribute that contains the character >? For example: <div data-foo="<teste>" class="red">Texto</div>. The text returned for this situation would be " class="red">Texto instead of just Texto.

  • @Andersoncarloswoss I agree with you, the solution would fail in this case. But for the proposed problem the code would work, given that if we are talking about a text with some HTML in the middle, we will hardly run into the kind of situation you describe. Anyway, I’ll improve the answer.

  • This question is also interesting: https://answall.com/q/203940/5878

  • Hello, I’m testing the function and it works, but I have a small problem: the array comes back with gaps in its indexes. I need to loop over it, for example with a for loop, to read the values, and that goes wrong, because both count and sizeof count only the elements. For example, it returns 4 elements, but at positions 2, 3, 5, 7, so I can’t iterate over the array. I tried to clean it up, but $return=array_filter($return); didn’t help.

  • @Megaanim I made an adjustment to the function that solves the problem you reported. Please test again.

  • As I said, and as is very explicit in the other questions cited: using regex on HTML will always have flaws. Besides, the way you reset the indexes is strange, because you can just do return array_values($array)

  • It worked this way. My translator is almost working now; there is just one small problem left, which is capturing the HTML of the iframe xD

  • I managed to solve everything now. First I grab the string with document.getElementById("site").contentWindow.document.body.outerHTML, then I process it by converting it to katakana in PHP via AJAX, and with the response I turn it back into an HTML object with var doc = parseFromString(stringConverted, "text/html"); and then I just set the body inside the iframe: document.getElementById("site").contentWindow.document.body=doc.body;
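
To make the index discussion in the comments above concrete, here is a small sketch (variable names are illustrative only) showing that array_filter keeps the original keys, that array_values renumbers them, and that preg_split with the PREG_SPLIT_NO_EMPTY flag is a possible one-step alternative. It is still a regex approach, so the attribute case raised in the comments breaks it in the same way.

```php
<?php
// Value from the question
$variavel = '<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>';

// Same steps as the answer: replace the tags, explode, drop empty strings.
$parts = explode('<HTML>', preg_replace('#<[^>]+>#', '<HTML>', $variavel));
$filtered = array_filter($parts, function ($value) {
    return $value !== '';
});

// array_filter() preserves keys, hence the gaps reported in the comments.
print_r(array_keys($filtered));

// array_values() renumbers the entries from 0.
$reindexed = array_values($filtered);
print_r($reindexed);

// Alternative: split on the tags directly and discard empty pieces.
$split = preg_split('#<[^>]+>#', $variavel, -1, PREG_SPLIT_NO_EMPTY);
print_r($split);

// The limitation pointed out in the comments still applies to the regex:
$broken = preg_split('#<[^>]+>#', '<div data-foo="<teste>" class="red">Texto</div>', -1, PREG_SPLIT_NO_EMPTY);
print_r($broken);
```

For input where attributes may contain >, the DOMDocument approach from the other answer is the safer choice.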
