Regular expression to remove certain content

I have some HTML texts that may carry specific style attributes. I would like to write a method that removes certain tags together with their content, because they are titles and specific images that should be filtered out of the HTML. Below is an example snippet of the HTML content:

<span style="font-size:18px">
<span style="font-family:helvetica-light"><span style="color:rgb(140, 190, 207)">Cultura</span></span></span></p>
<p>&nbsp;</p>
<h2 style="text-align:center"><span style="font-size:42px"><span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">O CIRCO CHEGOU!</span></strong></span></span></h2>
<p style="margin-left:80px; margin-right:80px; text-align:center"><span style="color:rgb(71, 71, 71); font-family:helveticaneue-light; font-size:30px">Cirque du Soleil apresenta espet&aacute;culo &ldquo;Amaluna&rdquo; em S&atilde;o Paulo e no Rio de Janeiro, na sexta passagem da maior companhia circense do mundo pelo Brasil</span></p>
<p style="text-align:center">&nbsp;</p>
<p style="text-align:center"><span style="font-size:22px"><span style="font-family:helveticaneue"><span style="color:#8cbecf"><em>Por Melissa Schr&ouml;der -&nbsp;Edi&ccedil;&atilde;o de Andr&eacute; Schr&ouml;der</em><br />
    25/09/2017</span></span></span></p>

For example, I would like to remove the title, identifying it by its specific attributes, as in the example below:

<span style="color:rgb(140, 190, 207)"><strong><span style="font-family:helveticaneue">

I have a method that almost does that.

 public function htmlToTextTags($Document) {
        if (preg_match('/<img(.+)? style=\".+?height:(4\d|5\d|6\d|7\d)(%|px);.+?\"[^>]*>/', $Document, $matches)) {
            if(count($matches)) {
                $Document = str_replace($matches[0], "", $Document);
            }
        }
        $Rules = array (
            '@<script[^>]*?>.*?<\/script>@si',
            '@<style[^>]*?>.*?<\/style>@si',
            '@<h2 style=\"text-align:center\">.*?<\/h2>@si',
            '@<span style=\"color\:rgb(\(71\, 71\, 71\)); font-family:helveticaneue-light.*?\"?>.*?<\/span>@si',
            '@<span.*?><span style=\"color\:#8cbecf\">.*?</span></span>@si',
            '@<p style=\"text-align:center\"><img.*?></p>@si',
            '@([\r\n])[\s]+@',
            '@&(quot|#34);@i',
            '@&(amp|#38);@i',
            '@&(lt|#60);@i',
            '@&(gt|#62);@i',
            '@&(nbsp|#160);@i',
            '@<div style=\"transform: rotate(\(\-90deg\)); \-webkit\-transform: rotate(\(\-90deg\))\;.+\"?>.*?<\/div>@'
        );
        $Replace = array (
            '',
            '',
            '',
            '',
            '',
            '',
            '',
            '"',
            '&',
            '<',
            '>',
            ' ',
            ''
        );
        return html_entity_decode(utf8_decode((preg_replace($Rules, $Replace, $Document))));
    }

However, it does not work in all cases: the content is always different, so I have to keep creating new rules.

My question is whether there is a better and more efficient way to do this. Can anyone tell me?

  • It would be interesting for you to read this question (and the classic one), and about PHP's native DOMDocument class.

  • Doesn't the strip_tags($html) function solve it?

  • Chomsky turns in his grave every time someone tries to parse a context-free language with a regular grammar.
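
The DOMDocument suggestion from the comments can be sketched roughly as follows. This is only an illustration under assumptions, not a tested replacement for the method in the question: the function name htmlToTextWithout() and the XPath rules are hypothetical, the input is assumed to be UTF-8, and the function returns plain text rather than markup. It also shows why strip_tags() alone does not help here: it drops the tags but keeps the text inside them.

```php
<?php
// Hypothetical helper: parse the HTML, remove every node matched by the
// given XPath expressions (tag AND content), and return the remaining text.
function htmlToTextWithout(string $html, array $queries): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings instead of printing them; CMS markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common trick to make loadHTML() treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($queries as $query) {
        $nodes = $xpath->query($query);
        if ($nodes === false) {
            continue; // invalid expression: skip this rule
        }
        // Copy the list first so removals do not disturb the iteration.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // loadHTML() wraps fragments in <html><body>; read the text from <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

// Assumed rules, written against the sample snippet in the question.
// translate(@style, " ", "") removes spaces so "text-align: center" also matches.
$rules = [
    '//script | //style',
    '//h2[contains(translate(@style, " ", ""), "text-align:center")]',
    '//span[contains(translate(@style, " ", ""), "color:#8cbecf")]',
];

$html = '<h2 style="text-align:center"><span style="font-size:42px">O CIRCO CHEGOU!</span></h2>'
      . '<p>Cirque du Soleil apresenta espet&aacute;culo em S&atilde;o Paulo</p>';

$text = htmlToTextWithout($html, $rules); // title removed, paragraph text kept
$stripped = strip_tags($html);            // tags removed, but the title text is still there
```

With this shape, a new case means adding one XPath expression to the list instead of another regular expression, and nested or reordered tags no longer break the match. If the remaining markup is needed rather than text, DOMDocument::saveHTML() could be used in place of textContent.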

No answers
