Remove HTML snippet between specific comments

7

I have a class called Page (page.class.php) that "assembles" the page, and one of the functions of that class censors content according to the user's level.

    <?php
    class Page {
        //(...)
        static function sensurar($str){
            $tipoInt = User::tipoInt();
            for ($i=0; $i < 11; $i++) {
                if ($tipoInt == $i) continue;
                $str = Page::clearTag2($str,"<!--a$i-->","<!--$i-->","<!--a-->");
            }
            return $str;
        }
        static function clearTag2($str,$tA,$tB,$msg=""){
            $str0 = $str;
            $pattern = "/({$tA})(.|\n)*({$tB})/";
            $str = preg_replace($pattern,$msg,$str);
            if (is_null($str)) {echo "erro"; return $str0; };
            if($str == "") {
                    $len = strlen($str0);
                    $error = preg_last_error();
                    Page::error("
                    Limpou a string.
                    [tA] = '$tA', [tB] = '$tB',[pattern] = '$pattern', [str].length = {$len}
                    $error
                    $str0
                    ","Page::clearTag2");
                }
            return $str;
        }

It was working wonderfully well until it started to show the error below when I use it on the page pagina("string").

<pre><h2>Erro Page::clearTag2</h2>
Limpou a string.
[tA] = '<!--a5-->', [tB] = '<!--5-->',[pattern] = '/(<!--a5-->)(.|
)*(<!--5-->)/', [str].length = 6086
6
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Sistema H |Produto, Odin</title>
        <link rel="stylesheet" type="text/css" href="tema.d/oficial.d/css/page.css">
        <link rel="stylesheet" type="text/css" href="tema.d/oficial.d/css/menu.css">
        <script type="text/javascript">
            server = "http://localhost/g2%20soft/ecomerce/";
        </script>
        <script src="tema.d/oficial.d/js/wrequest.js"></script>
    </head>
    <body lang="pt-br">
        <nav id="menunav">
            <header>
    <img src="tema.d/oficial.d/img/logo com fundo transparente.png" alt="">
</header>
<ul class="menu">
      <li
  class="menufechado"
  link="perfil"
  submenu="true"
  >
  <span onclick="menuOpen(this)">+ FabricaA[Fabrica]</span>
  <ul class="submenu">
      <li
  class="menufechado"
  link="edit_perfil"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Perfil</span>

</li>
<li
  class="menufechado"
  link="edit_perfil?a=sair"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Sair</span>

</li>

</ul>

</li>
<li
  class="menufechado"
  link="list_meusprodutos"
  submenu="true"
  >
  <span onclick="menuOpen(this)">+ Produtos</span>
  <ul class="submenu">
      <li
  class="menufechado"
  link="list_meusprodutos"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Meus Produtos</span>

</li>
<li
  class="menufechado"
  link="add_produto"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Cadastrar Produto</span>

</li>
<li
  class="menufechado"
  link="list_valortipo"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Tabelas de Preço</span>

</li>

</ul>

</li>
<li
  class="menufechado"
  link="list_meusclientes"
  submenu="true"
  >
  <span onclick="menuOpen(this)">+ Clientes</span>
  <ul class="submenu">
      <li
  class="menufechado"
  link="list_meusclientes"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Meus Clientes</span>

</li>
<li
  class="menufechado"
  link="list_naoclientes"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Não Clientes</span>

</li>

</ul>

</li>
<li
  class="menufechado"
  link="#"
  submenu="true"
  >
  <span onclick="menuOpen(this)">+ Cadastro</span>
  <ul class="submenu">
      <li
  class="menufechado"
  link="add_produto"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Produto</span>

</li>
<li
  class="menufechado"
  link="add_formadepagamento"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Forma de pagamento</span>

</li>
<li
  class="menufechado"
  link="add_prasodeentrega"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Praso de pagamento</span>

</li>

</ul>

</li>
<li
  class="menufechado"
  link="list_pedidosfabrica"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Pedidos</span>

</li>
<li
  class="menufechado"
  link="mensagens"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Mensagens</span>

</li>
<li
  class="menufechado"
  link=""
  submenu="true"
  >
  <span onclick="menuOpen(this)">+ Relatorios</span>
  <ul class="submenu">
      <li
  class="menufechado"
  link="rela_produtosvendidos"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Produtos Vendidos</span>

</li>
<li
  class="menufechado"
  link="rela_produtoscadastrados"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Produtos Cadastrados</span>

</li>
<li
  class="menufechado"
  link="rela_clientescadastrados"
  submenu="false"
  >
  <span onclick="menuOpen(this)">Clientes Cadastrados</span>

</li>

</ul>

</li>
<li
  class="menufechado"
  link=""
  submenu="false"
  >
  <span onclick="menuOpen(this)">Lançamentos</span>

</li>

</ul>

        </nav>
        <content>
            <h1>Produto, Odin</h1>
            <link rel="stylesheet" href="tema.d/oficial.d/css/view_produto.css">
<div class="page_listabas">
  <!--a5-->
  <div class="page_abalabel " onclick="link('list_loja')">Loja</div>
  <!--5-->
  <div class="page_abalabel page_abalabel_opened">Ver</div>
  <!--a6-->
  <div class="page_abalabel" onclick="link('edit_produto?id=52')">Detalhes</div>
  <div class="page_abalabel" onclick="link('edit_produtomidia?id=52')">Midias</div>
  <!--6-->
</div>
<div class="page_aba">
  <div id="referencia">Odin</div>
  <div class="midias">
    <div class="midia_view">
      <img src="anexo\97" alt="midia0" id="midia_view_img">
    </div>
    <div class="midias_left">
      <span> <img src="tema.d/oficial.d/img/midiasview_arrow_left.svg" alt=""> </span>
    </div>
    <div class="midias_right">
      <span> <img src="tema.d/oficial.d/img/midiasview_arrow_right.svg" alt=""> </span>
    </div>
    <div class="midia_list"><div class="">
  <img src="anexo/97" alt="" onclick="setMidia(this)">
</div>
<div class="">
  <img src="anexo/98" alt="" onclick="setMidia(this)">
</div>
</div>
  </div>
  <div class="detalhes">
    <p>Odim, tambem conhecido como pai de todos.</p>
    <p>
      2cx por <valor>R$ 0,00</valor>
    </p>
    <p>[52]Hidralica Industrial/Eletrica</p>
  </div>
  <div class="formasdepagamento">
    <p>Podendo ser pago:</p>
    {{formas de pagamento}}
  </div>
  <div class="outrosprodutos">
    <div class="produto">
      <img src="anexo/0" alt="">
    </div>
    {{outrosprodutos}}
  </div>
</div>
<script type="text/javascript" src="tema.d/oficial.d/js/view_produto.js"></script>

            <footer>G2</footer>
        </content>
        <div class="menu-button" onclick="menuShow()">&equiv;</div>
        <div class="flutuante" id="flutuante">Loading...</div>
        <div class="msgbox_fundo" id="msgbox_fundo" onclick="MSGbox.close()">
            <div class="msgbox_box">
                <span class="button msgbox_close" onclick="MSGbox.close()">X</span>

                <div class="msgbox_conteudo" id="msgbox_conteudo">
                </div>
            </div>
        </div>
        <script type="text/javascript" src="tema.d/oficial.d/js/page.js"></script>
        <script type="text/javascript" src="tema.d/oficial.d/js/menu.js"></script>
    </body>
</html>

I've been doing some tests, and I think the likely cause of the problem is the expression:

$pattern = "/({$tA})(.|\n)*({$tB})/";

My guess is that there is a limit on the number of characters an expression can check.

  • What is the full error message?

  • There is no native error message; what happens is that preg_replace() returns null instead of the "treated" string, but I see no reason for it.

  • The documentation says that if the return value is NULL, an error has occurred. You can check the return value of preg_last_error(): https://www.php.net/manual/en/function.preg-last-error.php

  • The error is "almost handled" in the posted code: it returns error 6 (PREG_BAD_UTF8_OFFSET_ERROR). The question is: why? How do I solve it? Or is there another way to do this?

  • The manual says: "PREG_BAD_UTF8_OFFSET_ERROR: Returned by preg_last_error() if the offset did not match the start of a valid UTF-8 code point (only when running a regex in UTF-8 mode)."

  • I don't understand: do you pass all the HTML as a string to the function sensurar? What should the result be?

  • The file is UTF-8 and must have been saved with a BOM. If that is the case, the BOM should be removed before the string is processed by preg_replace().

  • I checked, converted, and rewrote the files (template), and the same error still occurred.

  • @Augustovasques Actually, error 6 is PREG_JIT_STACKLIMIT_ERROR (it could not be a UTF-8 error, because the regex only runs in that mode if it has the u flag, and in this case it does not). Maybe it is related to some setting in php.ini, because I ran some tests and did not get the same error. I'll dig a little more and, if that turns out to be the case, I'll post an answer.


2 answers

6


In your code, the return value of preg_last_error() was 6, which corresponds to the error PREG_JIT_STACKLIMIT_ERROR. Basically, this error relates to the PCRE JIT, a feature that performs various optimizations on a regex. But these optimizations are not free: they need extra memory for their internal structures, and this error occurs when the regex ends up using more memory than is available (by default, the JIT uses a 32K stack).
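
Since the comments on the question disagreed about what code 6 means, a small lookup can settle it. This is only a sketch: pregErrorName is a hypothetical helper name, not part of PHP, while the PREG_* constants are PHP's own. On PHP 8 and later, preg_last_error_msg() gives a readable message directly.

```php
<?php
// Hypothetical helper (not part of PHP): translate a preg_last_error()
// code into the name of the matching PREG_* constant.
function pregErrorName(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown PCRE error ($code)";
}

echo pregErrorName(6), "\n"; // PREG_JIT_STACKLIMIT_ERROR
echo pregErrorName(5), "\n"; // PREG_BAD_UTF8_OFFSET_ERROR
```

Calling pregErrorName(preg_last_error()) inside the error report would make the log say which limit was hit instead of showing a bare number.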

The PCRE JIT is enabled by default in PHP >= 7, and searching around I found several links that suggest disabling it in php.ini (setting pcre.jit = 0) or directly in the code, by calling ini_set('pcre.jit', false). But I believe the best option is to optimize the regex so that it is more efficient, thus consuming fewer resources and keeping the JIT stack from overflowing.
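
For completeness, the ini_set() workaround could look roughly like the sketch below. replaceWithoutJit is a hypothetical name, and the sample HTML is made up. I have not verified how the setting interacts with patterns PHP has already compiled and cached, so changing it before the pattern is first used is the safer assumption.

```php
<?php
// Sketch of the workaround: run one replacement with the PCRE JIT switched
// off, then restore the previous setting. replaceWithoutJit is a
// hypothetical name used only for this example.
function replaceWithoutJit(string $pattern, string $replacement, string $subject): ?string
{
    // ini_set() returns the old value, or false if the setting could not be
    // changed (for example, when PHP was built without JIT support).
    $previous = ini_set('pcre.jit', '0');
    try {
        return preg_replace($pattern, $replacement, $subject);
    } finally {
        if ($previous !== false) {
            ini_set('pcre.jit', $previous);
        }
    }
}

// Made-up sample input; the pattern is the original one from the question.
$html = "<p>x</p>\n<!--a5-->\n<div>Loja</div>\n<!--5-->\n<p>y</p>";
$result = replaceWithoutJit("/(<!--a5-->)(.|\n)*(<!--5-->)/", "<!--a-->", $html);
```

This only hides the symptom: the pattern still does the same amount of work, it just runs in the interpreter instead of the JIT.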

For starters, your regex uses parentheses, which create capture groups. These consume extra memory, since each group is an extra structure to be generated and maintained (it also consumes more memory when searching for a match, because the groups have to be stored separately). But since you are not using the groups (you simply replace everything with another string), you can remove them.

Another point is the alternation (.|\n). The dot means "any character, except line breaks", and maybe that's why you put the |\n after it. The problem is that this is also inefficient, because the alternation has to be tested for each character (whether it matches one alternative or the other).

Fortunately, several languages and engines have an option that makes the dot also match line breaks, and the difference is dramatic. Take the regex with alternation and note the number of steps performed (over 11,000). With the s flag (which makes the dot match line breaks as well), the regex becomes more efficient and needs just over 2,600 steps (about 4 times fewer).

You can still improve it a little more. By default, quantifiers such as * are greedy and try to match as many characters as possible. Since we are using the dot, which matches any character, it first goes to the end of the string and then goes back until it finds something that satisfies the rest of the regex (a process known as backtracking, which also consumes more resources, since the engine needs to keep the states in memory until all possibilities are exhausted).

To avoid this, you can use .*?, which makes the quantifier lazy. It then takes as few characters as possible and advances slowly through the string (instead of taking everything and going back). This dramatically decreases the number of steps performed (in this case, down to about 490 steps).
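
Besides the cost, greedy and lazy quantifiers can also give different results when the same pair of markers appears more than once. A minimal sketch, using a made-up input with two censored blocks:

```php
<?php
// Sample input (made up for illustration): two censored blocks with
// content in between that must survive.
$html = "<!--a5-->\nsecret 1\n<!--5-->\nkeep me\n<!--a5-->\nsecret 2\n<!--5-->";

// Greedy: .* runs to the LAST <!--5-->, swallowing "keep me" as well.
$greedy = preg_replace('/<!--a5-->.*<!--5-->/s', '<!--a-->', $html);

// Lazy: .*? stops at the FIRST <!--5--> after each opening comment.
$lazy = preg_replace('/<!--a5-->.*?<!--5-->/s', '<!--a-->', $html);

var_dump($greedy === "<!--a-->");                  // bool(true)
var_dump($lazy === "<!--a-->\nkeep me\n<!--a-->"); // bool(true)
```

So for this use case the lazy version is not only cheaper, it is also the one that censors each block separately.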


In short, your regex could look like this:

$pattern = "/{$tA}.*?{$tB}/s";

I removed the parentheses since the groups are not used for anything. I used the s flag so that the dot also matches line breaks, and the lazy quantifier *? to reduce backtracking.
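
Put together, a clearTag2-style function with this pattern might look like the sketch below. clearTagLazy is a hypothetical name, and the preg_quote() calls are an addition of this sketch (the original code interpolates the markers directly); the input is made up to exceed 6000 characters.

```php
<?php
// Sketch of a clearTag2-style function using the improved pattern.
// preg_quote() is an addition of this sketch, not in the original code.
function clearTagLazy(string $str, string $tA, string $tB, string $msg = ""): string
{
    $pattern = '/' . preg_quote($tA, '/') . '.*?' . preg_quote($tB, '/') . '/s';
    $result = preg_replace($pattern, $msg, $str);
    if ($result === null) {
        // On failure, keep the original text (as the original code does);
        // a real implementation might prefer to fail closed so that
        // censored content is never shown.
        return $str;
    }
    return $result;
}

// Made-up page: filler pushes the input well past 6000 characters.
$filler = str_repeat("<p>filler</p>\n", 600);
$page = $filler . "<!--a5-->\n<div>Loja</div>\n<!--5-->\n<div>Ver</div>";
$clean = clearTagLazy($page, "<!--a5-->", "<!--5-->", "<!--a-->");
```

Quoting the markers does not matter for the comments used here, but it keeps the function safe if a marker ever contains a regex metacharacter.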

On my machine I did not get the same errors as you, but testing on Ideone.com we can see that your original regex does return error 6, and after changing to the regex above, the error no longer occurs.

In short, your suspicion about the number of characters was not completely unfounded. After all, the larger the string, the more backtracking is needed for the regex to find a match. By improving the regex, we reduce the backtracking and, consequently, the resources used to run it.


Do not use regex

But perhaps the main problem is that you are processing an HTML file with regex, which is not the best tool for the task. Although regex might work, in many cases it is better to use a dedicated API. In this case, you could use DOMDocument:

$dom = new DOMDocument;
$dom->loadHTML($html); // $html is a string containing the whole HTML
$xpath = new DOMXPath($dom);
for ($i = 0; $i < 11; $i++) {
    foreach ($xpath->query('//comment()') as $comment) { // look for comments
        if ($comment->nodeValue == "a$i") { // opening comment
            $parent = $comment->parentNode;
            $remover = [];
            // walk the sibling nodes until the closing comment is found
            // (stops at the last sibling if there is no closing comment)
            $node = $comment->nextSibling;
            while ($node !== null) {
                $remover[] = $node;
                if ($node->nodeType == XML_COMMENT_NODE && $node->nodeValue == "$i") {
                    break; // closing comment reached (it is removed too)
                }
                $node = $node->nextSibling;
            }
            foreach ($remover as $n) { // remove
                $parent->removeChild($n);
            }
            // replace the "aX" comment with "a"
            $parent->replaceChild($dom->createComment('a'), $comment);
        }
    }
}
// print the final HTML (with the tags removed)
echo $dom->saveHTML();

This way we go through the HTML looking for comments. When a comment matches the one you want to find, we walk through the following nodes until we find the comment that closes the section to be censored, and remove them all (at the end, I also replace the text of the comment, the same way the regex was doing).

1

I was doing some tests, and I realized that the error always happened after 6000 characters. I split the string and tested... well... it worked. The code follows:

    static function clearTag($str,$tA,$tB,$msg=""){
        $pattern = "/({$tA})(.|\n)*({$tB})/";
        $str0 = $str;
        $str = "";
        while ($str0 != "") {
            // process the next chunk of at most 6000 characters
            // (assigned, not appended, so earlier chunks are not processed twice)
            $str1 = substr($str0,0,6000);
            $str .= preg_replace($pattern,$msg,$str1);
            if (preg_last_error() != 0) {
                $len = strlen($str0);
                $error = preg_last_error();
                Page::error("
                Limpou a string.
                [tA] = '$tA', [tB] = '$tB',[pattern] = '$pattern', [str].length = {$len}
                $error
                $str0
                ","Page::clearTag2");
            }
            // drop the chunk that was just processed
            $str0 = (string) substr($str0,6000);
        }
        return $str;
    }

The only problem is that if the "censored" section straddles character 6000, the content will not be removed.
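
One way around that boundary problem, sketched here as an alternative technique rather than a fix to the chunking code, is to skip the regex entirely and locate the markers with strpos(), which has no regex engine limits to run into. stripBetween is a hypothetical name, and the sketch assumes both markers are non-empty strings.

```php
<?php
// Hypothetical helper: replace every span from $tA through $tB
// (markers included) with $msg, using plain string search.
function stripBetween(string $str, string $tA, string $tB, string $msg = ""): string
{
    $out = "";
    $pos = 0;
    while (($start = strpos($str, $tA, $pos)) !== false) {
        $end = strpos($str, $tB, $start + strlen($tA));
        if ($end === false) {
            break; // no closing marker: the rest is left untouched
        }
        // keep the text before the opening marker, then the replacement
        $out .= substr($str, $pos, $start - $pos) . $msg;
        $pos = $end + strlen($tB);
    }
    return $out . substr($str, $pos);
}

// Made-up input whose censored section straddles character 6000.
$input = str_repeat("x", 5990) . "<!--a5-->secret<!--5-->tail";
$output = stripBetween($input, "<!--a5-->", "<!--5-->", "<!--a-->");
```

Note that an unclosed opening marker leaves its content visible in this sketch; whether that or removing everything after it is the right behavior depends on the page.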
