How to use backreferences with regex in PHP?

I want to use a backreference to match the same text again inside the pattern, and keep only another group in the replacement.

That may sound confusing, so here is what I want. The string is:

$Str = '
<p>Os <b>parágrafos</b> são as estruturas que compõe um texto e podem ser: longos, médios e curtos, 
dependendo do tipo de produção <a>textual</a>. Longos: estão mais presentes em textos científicos e acadêmicos, 
<i>os quais exigem uma</i> <strong>explicação</strong> mais complexa, com exemplos e especificações.</p>';

The goal is for the strong, b, i and a tags to be replaced by their inner content only:

<b>Negrito</b>

Becomes:

Negrito

But:

<b>Negrito</i>

This must not be touched and should stay the same. (This is likely to happen when you have bold or italics inside a link, or vice versa.)

I tried this:

$Str = preg_replace(
    array(
        "/<(a|b|strong|i)>(.*?)<\/\1>/si"
    ),
    array(
        "$2"
    ),
    $Str
);

And this:

$Str = preg_replace(
    array(
        "/<(a|b|strong|i)>(.*?)<\/$1>/si"
    ),
    array(
        "$2"
    ),
    $Str
);

But they don’t seem to work.

1 answer


To begin with, it is enough to pass the regex and the replacement string directly, without using arrays:

$Str = preg_replace('/<\b([abi]|strong)\b>(.*?)<\/\1>/', '$2', $Str);

I use the character class [abi] to match any one of these letters, or else strong, all surrounded by \b (the shortcut for "word boundary", which ensures that there are no other letters before or after the a, b, etc.).

The backreference \1 ensures that the closing tag is the same as the opening tag: the tag name is in parentheses, which form a capture group, and since it is the first pair of parentheses in the regex, it is group 1 (so I can later check for exactly the same content using \1).
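As a side note, one plausible reason the two attempts in the question do not match is PHP string quoting rather than the regex itself. Inside double quotes, PHP interprets \1 as an octal escape (the byte 0x01) before the pattern reaches PCRE. In the second attempt, $1 inside the pattern is read as an end-of-subject anchor followed by a literal 1, because backreferences inside a pattern are written \1. A minimal sketch of the first case, where the sample string and variable names are illustrative assumptions:

```php
<?php
// Hypothetical sample input, not the string from the question.
$html = '<b>Negrito</b> e <b>quebrado</i>';

// Single quotes: PHP leaves \1 untouched, so PCRE sees a backreference.
$single = '/<\b([abi]|strong)\b>(.*?)<\/\1>/';
$result = preg_replace($single, '$2', $html);

// Double quotes: PHP converts "\1" to the byte 0x01 before PCRE sees it.
$double = "/<(a|b|strong|i)>(.*?)<\/\1>/si";
$containsControlByte = strpos($double, "\x01") !== false;
$unchanged = preg_replace($double, '$2', $html);
```

With single quotes the matching pair is unwrapped and the mismatched pair is left alone. With double quotes the pattern looks for a control character that never occurs, so nothing is replaced.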


But this is a very "naive" and error-prone approach. In general, manipulating HTML with regex is not ideal (see here for some examples, and here for the "definitive answer" on the subject).

For example, if the tags have attributes (such as <b id="whatever" class="algo">), the regex has to be adapted, because as written it accepts only <b>. If the tag is inside a comment (<!-- <b>etc</b> -->), the regex does not notice and replaces it anyway (and for it to detect comments, you would have to add something like this). In short, with every small variation in the HTML the regex gets more and more complicated, until it is no longer worth it.

The ideal is to use a dedicated HTML library. In the case of PHP, you can use DOMDocument:
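The two failure cases just described can be reproduced with a short sketch. The input strings are taken from the examples in the paragraph above; the variable names are assumptions made for illustration:

```php
<?php
$regex = '/<\b([abi]|strong)\b>(.*?)<\/\1>/';

// A tag with attributes: the regex expects ">" right after the tag name.
$withAttributes = '<b id="whatever" class="algo">texto</b>';
$afterAttributes = preg_replace($regex, '$2', $withAttributes);

// A tag inside a comment: the regex has no notion of comments.
$insideComment = '<!-- <b>etc</b> -->';
$afterComment = preg_replace($regex, '$2', $insideComment);
```

Here $afterAttributes is identical to the input (nothing was removed), while $afterComment has lost its b tags even though they were commented out.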

$Str = ... // string containing the HTML

$dom = new DOMDocument;
$dom->loadHTML($Str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//b | //strong | //i | //a') as $tag) { // look for b or strong or i or a
    // replace the tag with a text node containing only its text
    $tag->parentNode->replaceChild($dom->createTextNode($tag->textContent), $tag);
}
// print the final HTML
echo $dom->saveHTML();

One detail: if you really have something like <b>Negrito</i>, then the HTML is not valid. With nested tags, the correct form would be something like <b>Negrito <i>italic <a>link</a> etc</i> blabla</b>; you would never have an opening b followed by a closing i or anything like that.

DOMDocument does accept invalid HTML (it emits some warnings and "fixes" some tags as far as possible), but you should really treat this problem at the source, where the string is generated, to ensure that the HTML is valid, instead of trying to fix it with regex or with any other tool.

As for nested tags (like an a inside a b inside an i, etc.), the regex approach would have to use preg_replace_callback with recursion (i.e., an unnecessary complication):
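If those warnings are a nuisance, one option is to collect them instead of letting PHP emit them. The following is only a sketch: the input string is a hypothetical invalid fragment, and which messages libxml reports (if any) depends on the installed libxml version.

```php
<?php
// Hypothetical invalid input: <b> is closed by </i>.
$html = '<p><b>Negrito</i> texto</p>';

// Collect parser problems internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Each entry is a LibXMLError object with message, line and column.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting

$output = $dom->saveHTML();
```

Inspecting $errors shows what the parser complained about, and $output shows how the markup was repaired, which can help track down where the invalid HTML is generated.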

function removeTags($input) {
    $regex = '/<\b([abi]|strong)\b>(.*?)<\/\1>/';

    if (is_array($input)) { // if this is the result of a previous replacement
        $input = $input[2]; // take the second group (the tag's content)
    }

    return preg_replace_callback($regex, 'removeTags', $input);
}

$Str = removeTags($Str);
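For comparison, a loop can do the same job as the recursion: repeat the replacement until nothing is left to replace. This is a different technique from the callback version above, shown only as a sketch. The function name stripTagsRepeatedly and the sample string are assumptions, and the s flag is added here so that . also matches line breaks.

```php
<?php
// Hypothetical helper: unwrap a, b, i and strong tags, including nested ones.
function stripTagsRepeatedly(string $input): string {
    $regex = '/<\b([abi]|strong)\b>(.*?)<\/\1>/s';
    do {
        // $count receives the number of replacements made in this pass.
        $input = preg_replace($regex, '$2', $input, -1, $count);
    } while ($count > 0);
    return $input;
}

$nested = '<i>a <b>b <a>c</a> d</b> e</i>';
$flat = stripTagsRepeatedly($nested);
```

Each pass unwraps the outermost matching pairs, so the loop runs once per nesting level plus a final pass that finds nothing.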

With DOMDocument this is not needed: the code shown earlier already removes nested tags.

  • Help: I started using DOMDocument recently and am having a hard time detecting divs that have divs inside them as the first child and eliminating them - can you help me? I used $excess_of_divs = $xpath->query("//div/div[0]"); to make the selection, but I’m not sure how to eliminate only the div that wraps another div as its first child.

  • By the way, I left it in an array because in my original code I was making several substitutions.

  • @Felipealves You can get the first div inside another with //div/div[1] (and then just take its parent). But, not wanting to be annoying (or to seem like I don’t want to help), the idea of the site is to have one question per specific problem. The regex problem was answered above, so please tell us whether it worked or not (and if it worked, consider marking it as correct - see here how and why to do it. It is not mandatory, but it is good practice on the site, to indicate to future visitors that it solved the problem)

  • But the div problem is different from what is in the question, so it would be better to ask a new one (not forgetting to search first, of course, as there should already be several questions about XPath on the site)

  • @Felipealves Or, depending on what you want to do, //div/div[position()=1] - anyway, a good resource on the subject is: https://developer.mozilla.org/en-US/docs/Web/XPath
