Regex - Very high replace process time

Recently, after developing a process, I noticed that it was taking an exorbitant 5 to 6 minutes to run, when it should take at most 2 seconds. I started debugging the code with timers to find out which step was taking so long, and I narrowed it down to this:

$html = preg_replace('~[^#]*(<Ajax>[^\~]*?</ajax>)[^#]*~', '$1', $html);

The HTML I am running the replace on has more than 2,000 lines, so I will not post all of it here, but it follows this pattern:

<Ajax>
    <Sucesso>True</Sucesso>
    <DadosRetorno><![CDATA[
        <br />
        <input type="button" id="btnExportarExtNfe" class="button" value="Exportar resultado completo da pesquisa para arquivo texto" onclick="btnExportarExtNfe_click();" />
        <br />
        <br />
        <table class="painel">
            <tr class="listaHeaderEcac">
                <th><label>#</label></th>
                <th><label>Dt Emit</label></th>
                <th><label>Dt Ent/Sai</label></th>
                <th><label>IE Emit</label></th>
                <th><label>UF Emit</label></th>
                <th><label>CNPJ Emit</label></th>
                <th><label>IE Dest/Remet</label></th>
                <th><label>UF Dest/Remet</label></th>
                <th><label>CNPJ Dest/Remet</label></th>
                <th><label>Mod</label></th>
                <th><label>Série</label></th>
                <th><label>Número</label></th>
                <th><label>Total NF-e</label></th>
                <th><label>Total BC ICMS</label></th>
                <th><label>Total ICMS</label></th>
                <th><label>Total BC ICMS ST</label></th>
                <th><label>Total ICMS ST</label></th>
                <th><label>Sit</label></th>
                <th><label>E/S</label></th>
            </tr>
            <tr>
                <td><span class="linha"><a onclick="ExibeNfeCompleta('00000000000000000000000000000000000000000020')" style="cursor:pointer"><img src='../Imagens/lupa.png' alt='Visualizar' border=0></a></span></td>
                <td><span class="linha">03/08/15</span></td>
                <td><span class="linha">03/08/15</span></td>
                <td><span title='Empresa 1' class="linha">000/0000000</span></td>
                <td><span class="linha">AN</span></td>
                <td><span title='Empresa 1' class="linha">00.000.000/0000-00</span></td>
                <td><span title='Empresa 2' class="linha">000/0000000</span></td>
                <td><span class="linha">AN</span></td>
                <td><span title='Empresa 2' class="linha">00000000000000</span></td>
                <td><span class="linha">55</span></td>
                <td><span class="linha">1</span></td>
                <td><span class="linha">00000</span></td>
                <td align="right"><span class="linha">0,00</span></td>
                <td align="right"><span class="linha">0,00</span></td>
                <td align="right"><span class="linha">0,00</span></td>
                <td align="right"><span class="linha">0,00</span></td>
                <td align="right"><span class="linha">0,00</span></td>
                <td><span title='Normal' class="linha">N</span></td>
                <td><span title='Saída' class="linha">S</span></td>
            </tr>
        </table>
        <div width="000%"><span class="linha">NFes Emitidas até: <strong>00/00/0005 09:01:03</strong></span></div>
        <div width="000%" align="center">
            &nbsp;
            <SPAN title="Linha Inicial e Final da Página">Linhas de 1 a 000</SPAN> - &nbsp;
            <SPAN title="Total de Linhas Recuperadas">Total de Linhas: 000</SPAN>
            <br> &nbsp;
            <SPAN title="Total de Páginas">Páginas: 4</SPAN>
            <br> &nbsp;|&nbsp;
            <span class="menu4"><b>1</b></span>&nbsp;|&nbsp;<a href="javascript:trocaPagina(2);" style="font-weight: bold;color: #000000; text-decoration: underline;" class="LinkNavActive">2</a>&nbsp;|&nbsp;<a href="javascript:trocaPagina(3);" style="font-weight: bold;color: #000000; text-decoration: underline;" class="LinkNavActive">3</a>&nbsp;|&nbsp;<a href="javascript:trocaPagina(4);" style="font-weight: bold;color: #000000; text-decoration: underline;" class="LinkNavActive">4</a>&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="javascript:trocaPagina(2);" style="font-weight: bold;color: #000000; text-decoration: underline;" class="menu4">Próx.</a>&nbsp;&nbsp;&nbsp;:&nbsp;&nbsp;<a href="javascript:trocaPagina(4);" style="font-weight: bold;color: #000000; text-decoration: underline;" class="menu4">Final</a>
        </div>
        ]]></DadosRetorno>
</Ajax>

The real response also has a few extra header and footer tags around this block, which is why I do the replace. According to my debugging, this replace alone accounts for the 5 to 6 minutes.

Does anyone know why it takes so long?
Can anyone suggest a better regex?

  • Could you tell us why you are doing the replace? Since you are manipulating XML, have you tried using XPath?

  • I'm actually using simpleHtmlDom. The replace is a precaution regarding the returned data, since this HTML comes from a cURL request and, as I mentioned, it arrives with some header tags.

1 answer

Changing the quantifier [^\~]*? to [^\~]* may solve it.

With the lazy (non-greedy) quantifier *?, the engine tries the rest of the regular expression after every single character it matches, hence the delay.

With the greedy quantifier *, the engine first consumes every character the group can match, up to the end of the string or the first character that does not fit, and only then backtracks, giving characters back one at a time until the rest of the regular expression matches.
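As a minimal sketch of the change suggested above, the two patterns can be compared side by side. The $html value here is a short, hypothetical stand-in for the real cURL response, and the i modifier is an assumption added so that </ajax> in the pattern also matches </Ajax> in the data. On an input with a single <Ajax> block, both variants should capture the same text; only the way the engine reaches the closing tag differs.

```php
<?php
// Hypothetical, shortened stand-in for the real response (not the asker's data).
$html = "<header>ignored</header>\n"
      . "<Ajax>\n    <Sucesso>True</Sucesso>\n</Ajax>\n"
      . "<footer>ignored</footer>";

// Lazy quantifier, as in the question. The "i" modifier is an assumption.
$lazy = '~[^#]*(<Ajax>[^\~]*?</ajax>)[^#]*~i';

// Greedy quantifier, as suggested in this answer.
$greedy = '~[^#]*(<Ajax>[^\~]*</ajax>)[^#]*~i';

$fromLazy   = preg_replace($lazy, '$1', $html);
$fromGreedy = preg_replace($greedy, '$1', $html);

// preg_replace() returns null on failure (for example when the backtrack
// limit is hit), so it is worth checking before using the result.
if ($fromLazy === null || $fromGreedy === null) {
    echo 'Regex error code: ' . preg_last_error() . PHP_EOL;
} else {
    echo $fromGreedy . PHP_EOL;
}
```

Wrapping each call in microtime(true) measurements on the real 2,000-line input would show whether the greedy variant actually helps in this case.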

However, since this is XML, it is recommended to use an XML parser rather than a regular expression.
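One possible way to follow that advice, sketched under assumptions: cut the <Ajax> block out of the response with plain string functions, then hand it to SimpleXML. The helper name extractAjaxBlock and the sample $html are hypothetical, and the sketch assumes the response contains exactly one well-formed <Ajax>...</Ajax> block.

```php
<?php
// Hypothetical, shortened stand-in for the real response, including the
// extra header and footer markup mentioned in the question.
$html = "<html><body>header\n"
      . "<Ajax>\n"
      . "    <Sucesso>True</Sucesso>\n"
      . "    <DadosRetorno><![CDATA[<table class=\"painel\"></table>]]></DadosRetorno>\n"
      . "</Ajax>\n"
      . "footer</body></html>";

// Hypothetical helper: returns the first <Ajax>...</Ajax> block, or null.
function extractAjaxBlock(string $html): ?string
{
    $open  = '<Ajax>';
    $close = '</Ajax>';

    // Case-insensitive search for the opening tag.
    $start = stripos($html, $open);
    if ($start === false) {
        return null;
    }

    // Search for the closing tag only after the opening tag.
    $end = stripos($html, $close, $start);
    if ($end === false) {
        return null;
    }

    return substr($html, $start, $end + strlen($close) - $start);
}

$sucesso = null;
$dados   = null;

$block = extractAjaxBlock($html);
if ($block !== null) {
    // LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
    $xml = simplexml_load_string($block, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($xml !== false) {
        $sucesso = (string) $xml->Sucesso;       // status flag
        $dados   = (string) $xml->DadosRetorno;  // inner HTML from the CDATA
    }
}
```

Because stripos() and substr() scan the string linearly, this avoids regex backtracking altogether. The HTML held in $dados can then be passed to whatever HTML parser is already in use.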
