As noted in the comments, the best way to handle HTML text in PHP is the DOMDocument class. You can load an HTML page into a DOMDocument object as follows:
$dom = new DOMDocument();
$dom->loadHTML($html);
Here $html is the contents of the file to be analyzed. Since we only want the contents of the body, we can get the body node as follows:
$body = $dom->getElementsByTagName("body")->item(0);
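The two steps above assume well-formed input with a body element. For pages scraped from real sites that may not hold, so a slightly more defensive sketch could look like the following. The helper name loadBody is hypothetical, not part of the original answer; it only wraps the same loadHTML and getElementsByTagName calls and returns null when there is nothing to work with.

```php
<?php
// Hypothetical helper (not part of the original answer): loads an HTML string
// and returns its <body> node, or null when no body could be found.
function loadBody(string $html): ?DOMNode
{
    // Collect libxml parse warnings instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // An empty string is rejected by loadHTML, so skip the call in that case.
    $loaded = $html !== '' && $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // item(0) returns null when the node list is empty.
    return $dom->getElementsByTagName('body')->item(0);
}

$body = loadBody('<html><body><p>hello</p></body></html>');
```

The result should be checked against null before it is passed on to any further processing.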
$body here is a DOMNode object. You can check whether a node has children with the hasChildNodes method and iterate over them through the childNodes property. This lets us write a recursive function that extracts the text from every node under it:
/**
 * Collects the text found under an HTML element and returns it as a list.
 *
 * @param DOMNode $element Element from which the text will be extracted.
 * @param array $texts List of texts collected so far.
 * @return array List of texts found in the element.
 */
function getTextsOfElements(DOMNode $element, array $texts = [])
{
    // Does the element have child nodes?
    if ($element->hasChildNodes()) {
        // Yes: visit every child recursively.
        foreach ($element->childNodes as $e) {
            // Collect the texts of the child nodes.
            $texts = getTextsOfElements($e, $texts);
        }
    } else {
        // No: check whether the node is a text node.
        if ($element->nodeType === XML_TEXT_NODE) {
            // Strip the surrounding whitespace.
            $text = trim($element->nodeValue);
            // Keep the text only if it is not empty. The explicit comparison
            // avoids discarding the string "0", which PHP treats as false.
            if ($text !== '') {
                $texts[] = $text;
            }
        }
    }
    // Return the list of texts.
    return $texts;
}
So, to get the list of texts, just call the function, passing the $body object as the argument:
print_r(getTextsOfElements($body));
If the input is the one specified in the question (full HTML):
$html = '<html>
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
<div>
<div>
texto1
<a>texto2</a>
<b>texto3</b>
</div>
texto4
</div>
</body>
</html>';
The output will be:
Array
(
[0] => texto1
[1] => texto2
[2] => texto3
[3] => texto4
)
See it working on Repl.it.
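As an alternative to walking the tree by hand, the same idea can be sketched with DOMXPath, which selects the non-blank text nodes under the body in a single query. The function name getTextsWithXPath is hypothetical, and the filter that skips text inside script and style elements is an addition of this sketch, not something the answer above does.

```php
<?php
// Hypothetical alternative to the recursive function: same idea, via XPath.
function getTextsWithXPath(DOMDocument $dom): array
{
    $xpath = new DOMXPath($dom);
    // Text nodes under <body> that are not whitespace-only and do not sit
    // inside a <script> or <style> element.
    $nodes = $xpath->query(
        '//body//text()[normalize-space() and not(ancestor::script) and not(ancestor::style)]'
    );

    $texts = [];
    foreach ($nodes as $node) {
        // Strip the surrounding whitespace, as the recursive version does.
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><div>texto1<a>texto2</a><b>texto3</b></div>texto4</div><script>var x = 1;</script></body></html>');
print_r(getTextsWithXPath($dom));
```

For the input from the question this should yield the same four strings in document order. The XPath version is shorter, while the recursive version is easier to extend if each node needs custom handling.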
For the given example, that already produces the expected result, but I doubt it will work for the real need (using regex on HTML is awkward), so I think it would be better to [edit] the question and describe in more detail what you intend to do. What would this text captured from a website look like? Does it follow any format? Is it the full page?
– Woss
What I want to build is an alphabet translator, like what Google does with its language translator...
– Mega Anim
Ah, I forgot: the input in list 2 really was a typo, hehehe
– Mega Anim
Then do as I advised and edit the question, adding a real example of what you need, since you will probably need to use the DOMDocument class.
– Woss
But I want to do this in PHP, not JavaScript... And my question already explains what I need to do: take an HTML string and capture the text of the tags, putting each piece in its own position of an array. I just need to know how to do it. I have tried many things but nothing worked; splitting on the < character and using regex did not work either.
– Mega Anim
Would this link be useful?
– Don't Panic
DOMDocument is a PHP class, not JavaScript.
– Woss
Related or duplicate: Regex to capture fixed strings in HTML and JS codes.
– Woss