How to separate tags from a variable in PHP array

Question

Asked 7 years, 5 months ago

I want to extract the text between HTML tags into a single array in PHP, but I haven't found an efficient way to do it yet.

I want to turn this:

$variavel="<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>";

into

$array[0]="texto1";
$array[1]="texto2";
$array[2]="texto3";
$array[3]="texto4";

And so on. I want to display the captured text on a site, with each piece in its own array position, so that I can process the pieces one by one.

For the given example, doing this already produces the expected result, but I doubt it will work for the real need (because using regex on HTML is fragile), so I think it would be better if you [Edit] the question and describe in more detail what you intend to do. What would this captured text from a website be? Does it follow any format? Is it the full page?

– Woss

2017/06/13 at 15:51
What I want to do is an alphabet translator, like what Google does with its language translator...

– Mega Anim

2017/06/13 at 17:51
Ah, the input in list 2 that I left out was just a typo, hehehe

– Mega Anim

2017/06/13 at 17:55
Then do as I advised and edit the question, adding a real example of what you need, as you will probably need to use the DOMDocument class.

– Woss

2017/06/13 at 17:56
But I want to do this in PHP, not JavaScript... And my question already explains what I need to do, which is to take an HTML string and capture the text of the tags, placing each piece in its own position of an array. I just need to know how to do it. I have tried many things but nothing worked: splitting on the < character and using regex did not work either.

– Mega Anim

2017/06/13 at 18:58
Would this link be useful?

– Don't Panic

2017/06/13 at 19:05
DOMDocument is a PHP class, not JavaScript.

– Woss

2017/06/13 at 21:43
Related or duplicate: Regex to capture fixed strings in HTML and JS codes.

– Woss

2017/06/13 at 23:08

2 answers

by Woss • **73,416** points · Answer 1 · 2017-06-13T22:15:14+00:00

As mentioned in the comments, the best way to handle HTML text in PHP is to use the DOMDocument class. You can load an HTML page into a DOMDocument object as follows:

$dom = new DOMDocument();
$dom->loadHTML($html);

Here $html is the contents of the file to be analyzed. Since we only want the content of the document body, we can get the node for the body as follows:

$body = $dom->getElementsByTagName("body")->item(0);

Here $body is a DOMNode object. You can check whether a node has children with the hasChildNodes method and iterate over them through the childNodes property. With that, we can write a recursive function that extracts the text from every node on the page:

/**
 * Gets the text present in an HTML document, returning it as a list.
 *
 * @param DOMNode $element Element from which the text will be extracted.
 * @param array   $texts   List of texts collected so far.
 * @return array List of texts found in the element.
 */
function getTextsOfElements(DOMNode $element, array $texts = [])
{
    // Check whether the element has child nodes:
    if ($element->hasChildNodes()) {
        // Yes, so walk through all child nodes recursively:
        foreach ($element->childNodes as $e) {
            // Collect the texts of the child nodes:
            $texts = getTextsOfElements($e, $texts);
        }
    } else {
        // No, so check whether the node is a text node:
        if ($element->nodeType === XML_TEXT_NODE) {
            // Yes, strip the surrounding whitespace:
            $text = trim($element->nodeValue);

            // Check that the text is not empty (a plain "0" is kept):
            if ($text !== '') {
                // Not empty, so add the text to the list:
                $texts[] = $text;
            }
        }
    }

    // Return the list of texts:
    return $texts;
}

So, to get the list of texts, just call the function by passing the object $body as a parameter:

print_r(getTextsOfElements($body));
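One thing worth checking first: the string in the question is a bare fragment, with no html or body tags. As far as I know, loadHTML wraps such fragments in an implied html and body, so the body lookup above should still work. A minimal sketch, assuming the fragment from the question (the libxml_use_internal_errors call is only a precaution against parser warnings on unusual markup, not something the answer requires):

```php
<?php
// Precaution (assumption): collect parser warnings instead of printing them.
$previous = libxml_use_internal_errors(true);

// The fragment from the question, without <html> or <body>.
$html = '<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard any collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser adds the implied <html><body> around the fragment.
$body = $dom->getElementsByTagName('body')->item(0);

echo $body->nodeName, "\n";     // body
echo $body->textContent, "\n";  // texto1texto2texto3texto4
```

Note that textContent glues all the pieces together into one string, which is why walking the nodes one by one is needed to get a separate array entry per piece.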

If the input is the one from the question, wrapped in a full HTML document:

$html = '<html>
            <head>
                <meta charset="UTF-8">
                <title>Document</title>
            </head>
            <body>
                <div>
                    <div>
                        texto1
                        <a>texto2</a>
                        <b>texto3</b>
                    </div>
                    texto4
                </div>
            </body>
        </html>';

The output will be:

Array
(
    [0] => texto1
    [1] => texto2
    [2] => texto3
    [3] => texto4
)

See it working on Repl.it.
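As a side note, the same list can probably be obtained without writing the recursion by hand, using DOMXPath to select the text nodes directly. This is a different technique from the function above, not a replacement for it, and the function name getTextsWithXPath is made up for this sketch:

```php
<?php
/**
 * Hypothetical helper: returns the non-blank text pieces of an HTML string.
 */
function getTextsWithXPath(string $html): array
{
    // Precaution (assumption): keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];

    // Select every text node under <body> that is not whitespace only.
    // The nodes come back in document order.
    foreach ($xpath->query('//body//text()[normalize-space()]') as $node) {
        $texts[] = trim($node->nodeValue);
    }

    return $texts;
}

$texts = getTextsWithXPath('<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>');
print_r($texts);
```

For the string in the question this should print the same four entries, texto1 through texto4.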

by Pedro Souza • **1,631** points · Answer 2 · 2017-06-13T19:28:39+00:00

The basic idea is to write a function that replaces every HTML tag with a single "generic tag", so to speak, and then use explode to turn the string into an array. This will probably leave many empty entries in the array, so we use array_filter to clean them up. A practical example:

<?php
    function remover_vazio($array) {
        return array_filter($array, '_remover_vazio_interno');
    }

    function _remover_vazio_interno($value) {
        // Keep anything that is not an empty string, including the text "0"
        return $value !== '' && $value !== null;
    }

    function separaTextoDoHTML($variavel) {
        // Replace every HTML tag with a single placeholder tag
        $variavel = preg_replace('#<[^>]+>#', '<HTML>', $variavel);

        // Explode on the placeholder tag so the text is split into an array
        $array = explode('<HTML>', $variavel);

        // Filter the array to drop the empty entries
        $array = remover_vazio($array);

        // Reindex so the keys run 0, 1, 2, ... in order
        return array_values($array);
    }
?>

The code of separaTextoDoHTML is commented for clarity, but the process is: receive a variable and run preg_replace to produce a single delimiter, which makes the explode on the next line possible. This generates empty entries, which the remover_vazio function clears from the array. An example of use:

// Value given in the question
$variavel = '<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>';

// Call the function created above
$retorno = separaTextoDoHTML($variavel);

// Dump the returned variable to check its value
var_dump($retorno);

This will print on the screen:

array (size=4)
  0 => string 'texto1' (length=6)
  1 => string 'texto2' (length=6)
  2 => string 'texto3' (length=6)
  3 => string 'texto4' (length=6)
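If it helps, the replace, explode and filter steps can probably be collapsed into one call to preg_split with the PREG_SPLIT_NO_EMPTY flag. This is only a sketch of the same regex idea, with a made-up function name (splitTextByTags), and it shares the weakness raised in the comments: a regex like this can be fooled by real-world HTML, for example a > inside an attribute value or a comment.

```php
<?php
/**
 * Hypothetical helper: splits a string on anything that looks like a tag.
 */
function splitTextByTags(string $html): array
{
    // Split on tags; PREG_SPLIT_NO_EMPTY drops the empty pieces
    // that appear between adjacent tags.
    $parts = preg_split('#<[^>]+>#', $html, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on a regex error.
    if ($parts === false) {
        return [];
    }

    // Trim each piece and drop the ones that were whitespace only,
    // then reindex the keys from 0.
    $parts = array_map('trim', $parts);
    $parts = array_filter($parts, static function ($part) {
        return $part !== '';
    });

    return array_values($parts);
}

$parts = splitTextByTags('<div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div>');
print_r($parts);
```

For the string in the question this should give the same four entries as above. For anything beyond simple, well-formed snippets, the DOMDocument approach is likely the safer choice.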