The problem with using explode is that it splits the string without taking the semantics of HTML into account (i.e., the meaning of each tag, the distinction between a tag and its content, etc.).
To manipulate HTML content the way you need, you can use DOMDocument:
$links = "<ul><li>CONTEUDO
<div class='conteudo'>CORPO 1</div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'>CORPO 2</div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'>CORPO 3</div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'>CORPO 4</div>
</li></ul>";
$dom = new DOMDocument();
$dom->loadHTML($links);
$xpath = new DOMXPath($dom);
// look for div elements with the class "conteudo"
foreach ($xpath->query('//div[@class="conteudo"]') as $div) {
    echo $div->textContent . "<br>";
}
So I look for all the div elements that have the class "conteudo" (using XPath syntax) and print the text content of each one. The output of the above code is:
CORPO 1
CORPO 2
CORPO 3
CORPO 4
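One caveat about the XPath expression above: @class="conteudo" compares the whole attribute, so it matches only when the class attribute is exactly conteudo. The sketch below assumes a div that carries a second class (the name destaque is hypothetical and is not in the question's HTML) and uses a common XPath 1.0 idiom to match conteudo as one class among several:

```php
<?php
// Hypothetical input: the second div has an extra class ("destaque"),
// which the HTML in the question does not have.
$html = "<div class='conteudo'>CORPO 1</div>
<div class='conteudo destaque'>CORPO 2</div>
<div class='outro'>CORPO 3</div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact comparison: matches only when the attribute is exactly "conteudo"
$exact = $xpath->query('//div[@class="conteudo"]');

// Token comparison: pads the class list with spaces and looks for " conteudo "
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " conteudo ")]'
);

echo $exact->length, "\n"; // 1
echo $token->length, "\n"; // 2
```

If every div in your HTML has only the one class, the simpler expression is enough.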
Using textContent works if the div contains only plain text. But if the div contains other tags and you want all of that content, tags included, you need an auxiliary function to get the HTML of the inner content (the function below has been taken from here):
$links = "<ul><li>CONTEUDO
<div class='conteudo'>CORPO 1</div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'>CORPO 2</div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'><p>CORPO 3 <span>teste com <strong>outras tags</strong></span> dentro do div</p></div>
</li></ul>
<ul><li>CONTEUDO
<div class='conteudo'><span>CORPO 4</span></div>
</li></ul>";
// returns the HTML of everything inside $element (child tags included)
function innerHTML(DOMNode $element) {
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child) {
        // serialize each child node back to HTML
        $innerHTML .= $element->ownerDocument->saveHTML($child);
    }
    return $innerHTML;
}
$dom = new DOMDocument();
$dom->loadHTML($links);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@class="conteudo"]') as $div) {
    echo innerHTML($div) . "<br>";
}
The output is:
CORPO 1
CORPO 2
<p>CORPO 3 <span>teste com <strong>outras tags</strong></span> dentro do div</p>
<span>CORPO 4</span>
But if you have <div class='conteudo'><span>CORPO 2</span></div>, is there no way to get <span>CORPO 2</span>? I tested it and it only gets CORPO 2
– Rogério Silva
@Rogériosilva I updated the answer
– hkotsubo
Everything is OK now, but do you know why, when something in the body has the & character, the error DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity appears?
– Rogério Silva
@Rogériosilva Because & has a special meaning in HTML: it is used for HTML entities (a "loose" & in the text is wrong). But that is already outside the scope of the question (which is how to get the content of a given tag). Anyway, if the text has to contain an &, the correct way is to write it as &amp;
– hkotsubo
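As a rough illustration of that last comment, assuming the text is inserted into the HTML by your own PHP code: htmlspecialchars() can do the escaping for you before the string reaches loadHTML(). The text used here is hypothetical.

```php
<?php
// Hypothetical text containing a "loose" ampersand
$text = "CORPO 1 & CORPO 2";

// htmlspecialchars() turns & into &amp; (it also escapes <, > and quotes)
$html = "<div class='conteudo'>" . htmlspecialchars($text) . "</div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="conteudo"]') as $div) {
    // textContent decodes the entity back to a plain &
    echo $div->textContent, "\n"; // CORPO 1 & CORPO 2
}
```

This only applies to text you control. If the HTML comes from somewhere else and cannot be corrected, calling libxml_use_internal_errors(true) before loadHTML() keeps the parser warnings from being emitted, although the markup itself stays invalid.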