I have some texts in HTML, which may carry specific style attributes. I would like to write a method that removes certain tags together with their content, because they are titles, as well as specific images that should be filtered out of the HTML. Below is an example snippet of the HTML content:
<span style="font-size:18px">
<span style="font-family:helvetica-light"><span style="color:rgb(140, 190, 207)">Cultura</span></span></span></p>
<p> </p>
<h2 style="text-align:center"><span style="font-size:42px"><span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">O CIRCO CHEGOU!</span></strong></span></span></h2>
<p style="margin-left:80px; margin-right:80px; text-align:center"><span style="color:rgb(71, 71, 71); font-family:helveticaneue-light; font-size:30px">Cirque du Soleil apresenta espetáculo “Amaluna” em São Paulo e no Rio de Janeiro, na sexta passagem da maior companhia circense do mundo pelo Brasil</span></p>
<p style="text-align:center"> </p>
<p style="text-align:center"><span style="font-size:22px"><span style="font-family:helveticaneue"><span style="color:#8cbecf"><em>Por Melissa Schröder - Edição de André Schröder</em><br />
25/09/2017</span></span></span></p>
I would like to remove the title, for example, based on its specific attributes, as in the example below:
<span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">
I have a method that almost does that:
public function htmlToTextTags($Document) {
    if (preg_match('/<img(.+)? style=\".+?height:(4\d|5\d|6\d|7\d)(%|px);.+?\"[^>]*>/', $Document, $matches)) {
        if(count($matches)) {
            $Document = str_replace($matches[0], "", $Document);
        }
    }
    $Rules = array (
        '@<script[^>]*?>.*?<\/script>@si',
        '@<style[^>]*?>.*?<\/style>@si',
        '@<h2 style=\"text-align:center\">.*?<\/h2>@si',
        '@<span style=\"color\:rgb(\(71\, 71\, 71\)); font-family:helveticaneue-light.*?\"?>*.?<\/span>@si',
        '@<span.*?><span style=\"color\:#8cbecf\">.*?</span></span>@si',
        '@<p style=\"text-align:center\"><img.*?></p>@si',
        '@([\r\n])[\s]+@',
        '@&(quot|#34);@i',
        '@&(amp|#38);@i',
        '@&(lt|#60);@i',
        '@&(gt|#62);@i',
        '@&(nbsp|#160);@i',
        '@<div style=\"transform: rotate(\(\-90deg\)); \-webkit\-transform: rotate(\(\-90deg\))\;.+\"?>*.?<\/div>@'
    );
    $Replace = array (
        '',
        '',
        '',
        '',
        '',
        '',
        '',
        '"',
        '&',
        '<',
        '>',
        ' ',
        ''
    );
    return html_entity_decode(utf8_decode((preg_replace($Rules, $Replace, $Document))));
}
However, it does not work in all cases: the content is always different, so I have to keep creating new rules.
My question is whether there is a better, more efficient way to do this. Can anyone tell me?
It would be worth reading this question (and the classic one), as well as the documentation for PHP's native DOMDocument class.
– Woss
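To illustrate what the DOMDocument suggestion could look like, here is a minimal sketch, not a definitive implementation. The function name `htmlToTextWithDom`, the `wrapper` id and the XPath queries are hypothetical names chosen for this example. It assumes the input is a UTF-8 HTML fragment and that selecting elements by a substring of their `style` attribute is precise enough for the content at hand.

```php
<?php
// Hypothetical helper (not from the question): drops every element matched by
// one of the XPath queries, then returns the remaining text.
function htmlToTextWithDom(string $html, array $xpathQueries): string
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup

    $doc = new DOMDocument();
    // The XML prolog is a common trick to make libxml read the input as UTF-8;
    // the wrapper div gives the fragment a single, easy-to-find root.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrapper">' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpathQueries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false) {
            continue; // malformed XPath expression: skip it
        }
        // Copy the matches to an array first, so that removing nodes does not
        // disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $text = $wrapper !== null ? $wrapper->textContent : '';

    // Collapse runs of whitespace into a single space.
    return trim((string) preg_replace('/\s+/u', ' ', $text));
}

// Example rules, assumed for illustration: centered h2 titles and the byline color.
$rules = [
    '//h2[contains(@style, "text-align:center")]',
    '//span[contains(@style, "color:#8cbecf")]',
];
$sample = '<h2 style="text-align:center"><span>TITLE</span></h2>'
        . '<p>Body text</p>'
        . '<p><span style="color:#8cbecf"><em>By Someone</em></span></p>';
echo htmlToTextWithDom($sample, $rules), PHP_EOL;
```

With this approach each new case becomes one more XPath query rather than a new regular expression, and nested tags are handled by the parser instead of by pattern matching. A stray `</div>` in the input would close the wrapper early, so very broken fragments may need a different way of locating the root.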
Doesn't the strip_tags($html) function solve this?
– Wilson Faustino
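As a quick check of this suggestion, the small sketch below runs `strip_tags` on a shortened, made-up version of the snippet from the question. It removes the markup but keeps the text inside every tag, so the title the question wants to drop would still be present in the result.

```php
<?php
// Shortened sample, assumed for illustration.
$html = '<h2 style="text-align:center"><span>O CIRCO CHEGOU!</span></h2><p>Body text</p>';

// strip_tags() deletes the tags only; the text they wrap is preserved.
$text = strip_tags($html);

echo $text, PHP_EOL; // prints "O CIRCO CHEGOU!Body text"
```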
Chomsky turns in his grave every time a context-free language tries to be interpreted using a regular grammar.
– Pedro Corso