What is the advantage of using Perl-compatible regular expression functions or POSIX-compatible ones in PHP? In what context do you use both?

Asked 5 years, 2 months ago

PHP has regular expression functions compatible with Perl and POSIX. However, I can’t see if one stands out from the other or which contexts of use favor one or the other.

Is it just a matter of style or preference between one standard and the other?

According to the documentation, the POSIX functions were deprecated in PHP 5.3 and removed in PHP 7, so it is no longer a matter of preference :-)

– hkotsubo

2020/06/03 at 22:10
You are right. I was relying on the documentation in Portuguese, and this information is not in the Portuguese version. Thank you.

– Everton da Rosa

2020/06/03 at 23:41
I didn’t answer yesterday because I didn’t have time, but there is a more detailed explanation of the differences below...

– hkotsubo

2020/06/04 at 12:58

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-06-04T12:57:08+00:00

As stated in the documentation, the POSIX-compatible regex functions (such as ereg and ereg_replace, among others) have been deprecated since PHP 5.3 and were removed in PHP 7. So if you are using version 7 or later, the only option is PCRE (Perl Compatible Regular Expressions) regex and the corresponding preg_* functions (such as preg_match, preg_replace, etc.).

But if you are using an older version in which these functions still exist (5.x or earlier), I think it is worth mentioning some important differences, including some that are mentioned in the documentation. Let’s see some examples.

Character classes, shortcuts and delimiters

$str = 'abc123xyz';
if (preg_match('/\d+/', $str, $matches)) {
    var_dump($matches);
}
if (ereg('[[:digit:]]+', $str, $matches)) {
    var_dump($matches);
}

In the PCRE functions, the regex must be placed between delimiters. In the example above I used slashes: everything between the / characters is the regex, but the slashes themselves are not part of the expression. In the POSIX functions, on the other hand, delimiters must not be used. Both calls above return:

array(1) {
  [0]=>
  string(3) "123"
}

Another difference we can notice is the absence of the \d shortcut in POSIX regex. Instead, we must use the special class [[:digit:]] to match any digit from 0 to 9 (this class also works in PCRE). In both cases, the character class [0-9] also works.
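
As a minimal sketch of the two points above, restricted to the preg_* functions (so that it still runs on PHP 7 and later): the three ways of writing "a digit" give the same result, and the delimiter does not have to be a slash. The sample string 'page 3/10' is only an illustration.

```php
<?php
$str = 'abc123xyz';

// Three equivalent ways to match a run of ASCII digits in PCRE.
$patterns = ['/\d+/', '/[[:digit:]]+/', '/[0-9]+/'];
$found = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $str, $matches);
    $found[] = $matches[0];
    echo $pattern, ' => ', $matches[0], "\n"; // each line ends with 123
}

// The delimiter does not have to be "/": with "#" as the delimiter,
// slashes inside the pattern need no escaping.
preg_match('#\d+/\d+#', 'page 3/10', $fraction);
echo $fraction[0], "\n"; // 3/10
```

Choosing a delimiter that does not occur in the pattern avoids having to escape it.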

Other shortcuts

The same goes for other shortcuts, such as \w (which matches letters, digits or the _ character). Its closest POSIX counterpart is [[:alnum:]], which matches only letters and digits, but not the _:

$str = 'abc123_ xyz';
if (preg_match('/\w+/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('[[:alnum:]]+', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches);
}

\w also matches the _ character, but [[:alnum:]] does not, so the result is:

PCRE
array(1) {
  [0]=>
  string(7) "abc123_"
}
POSIX
array(1) {
  [0]=>
  string(6) "abc123"
}

If we want the POSIX regex to match the _ as well, we have to change the expression to [[:alnum:]_]+.
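
Since the POSIX classes also work inside PCRE patterns, the same comparison can be sketched with preg_match alone (the variable names are just for illustration):

```php
<?php
$str = 'abc123_ xyz';

preg_match('/\w+/', $str, $word);            // letters, digits and "_"
preg_match('/[[:alnum:]]+/', $str, $alnum);  // letters and digits only
preg_match('/[[:alnum:]_]+/', $str, $both);  // POSIX class plus "_"

echo $word[0], "\n";  // abc123_
echo $alnum[0], "\n"; // abc123
echo $both[0], "\n";  // abc123_
```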

Flags

Another difference is that PCRE regex has flags (modifiers) that change some details of how the expression works. For example, to do a case-insensitive search, just use the i flag. With POSIX regex you have to call a different function for this (in this case, eregi):

$str = 'ABCabc';
if (preg_match('/[a-z]+/', $str, $matches)) {
    echo "PCRE case sensitive\n";
    var_dump($matches);
}
if (preg_match('/[a-z]+/i', $str, $matches)) { // <- note the "i" after the slash
    echo "PCRE case insensitive\n";
    var_dump($matches);
}
if (ereg('[a-z]+', $str, $matches)) {
    echo "POSIX case sensitive\n";
    var_dump($matches);
}
if (eregi('[a-z]+', $str, $matches)) {
    echo "POSIX case insensitive\n";
    var_dump($matches);
}

Note that in the PCRE regex the i flag is placed after the closing delimiter (after the last /), making the regex case insensitive (it does not distinguish between upper and lower case). POSIX regex has no flags, so the only way is to call another function (eregi). The result is:

PCRE case sensitive
array(1) {
  [0]=>
  string(3) "abc"
}
PCRE case insensitive
array(1) {
  [0]=>
  string(6) "ABCabc"
}
POSIX case sensitive
array(1) {
  [0]=>
  string(3) "abc"
}
POSIX case insensitive
array(1) {
  [0]=>
  string(6) "ABCabc"
}

PCRE has other flags with no POSIX equivalent, such as the m flag, which makes the anchors ^ and $ (respectively the beginning and end of the string) also match the beginning and end of each line. For example:

$str = "abc\n123\nxyz\n456";
if (preg_match('/\d$/m', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('[[:digit:]]$', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches);
}

Both expressions search for a digit followed by $. In the PCRE regex, the m flag means that $ can be either the end of the string or the end of a line, so it finds the digit 3 (preg_match always returns the first occurrence found).

In the POSIX regex, $ always matches the end of the string, so ereg finds the digit 6. For it to find the 3, I would have to modify the expression a little, to allow either a \n or the end of the string after the digit:

if (ereg("([[:digit:]])(\n|$)", $str, $matches)) {
    var_dump($matches[1]); // 3
}

I put the digit in parentheses to form a capture group, and since it is the first pair of parentheses in the regex, it becomes the first group, which can be retrieved with $matches[1]. I had to do it this way to get only the digit, because if there is a \n, it is also returned in the match (as I only want the digit, I create a capture group containing only it).
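
For comparison, a small PCRE-only sketch of the m flag: the same pattern with and without it, plus two flags combined. The second test string ("ABC\ndef") is made up for the illustration.

```php
<?php
$str = "abc\n123\nxyz\n456";

// Without "m", $ matches only at the end of the string
// (or just before a final line break).
preg_match('/\d$/', $str, $noFlag);
echo $noFlag[0], "\n"; // 6

// With "m", $ also matches at the end of each line, and
// preg_match_all() collects every digit found there.
preg_match_all('/\d$/m', $str, $all);
echo implode(',', $all[0]), "\n"; // 3,6

// Flags can be combined: "i" and "m" together.
preg_match_all('/^[a-z]+$/im', "ABC\ndef", $lines);
echo implode(',', $lines[0]), "\n"; // ABC,def
```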

Matching algorithm

PCRE regex starts searching at the beginning of the string and returns as soon as it finds the first match, while POSIX regex always returns the longest possible match. Using the example from the documentation:

$str = "oneselfsufficient";
if (preg_match('/one(self)?(selfsufficient)?/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches[0]); // oneself
}
if (ereg('one(self)?(selfsufficient)?', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches[0]); // oneselfsufficient
}

The regex searches for the string "one", optionally followed by "self", optionally followed by "selfsufficient". The PCRE regex returns the first match it finds: as it starts at the beginning of the string, it finds "one" and then checks whether the optional "self" is present. Since it is (the string contains "oneself") and the rest of the expression is optional, the match "oneself" is returned.

The POSIX regex, in turn, tests all possibilities in order to return the longest possible match (which in this case is the whole string).

Depending on the regex and the strings being searched, this can have performance implications: since PCRE returns as soon as it finds the first match, it can be faster than POSIX, which needs to test all possibilities to know which match is the longest.
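
One practical consequence, sketched below with an alternation instead of the two optional groups from the documentation example: in PCRE the order of the alternatives decides which match is returned, so listing the longer alternative first is a simple way to get the longer match. This only illustrates PCRE behavior; it does not reproduce the POSIX algorithm.

```php
<?php
$str = 'oneselfsufficient';

// PCRE tries the alternatives from left to right and keeps the first
// one that lets the overall pattern succeed.
preg_match('/one(self|selfsufficient)?/', $str, $first);
echo $first[0], "\n"; // oneself

// With the longer alternative listed first, the longer match is returned.
preg_match('/one(selfsufficient|self)?/', $str, $longest);
echo $longest[0], "\n"; // oneselfsufficient
```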

Other features

There are also other differences that are not mentioned in the documentation:

Non-capturing groups

PCRE supports non-capturing groups. When I do not want the parentheses to form a capture group, I just use the syntax (?: ... ):

$str = "abc 123";
if (preg_match('/([a-z]+) (?:1)/', $str, $matches)) { // the "1" is not a capture group
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('([a-z]+) (?:1)', $str, $matches)) { // ERROR! (invalid expression)
    echo "POSIX\n";
    var_dump($matches);
}

In the PCRE regex, only the letters are in a capture group, whereas the number 1 is not. In the POSIX regex, the (?: syntax makes the expression invalid and causes an error:

PCRE
array(2) {
  [0]=>
  string(5) "abc 1"
  [1]=>
  string(3) "abc"
}

Warning: ereg(): REG_BADRPT in /mnt/c/Users/hkotsubo/teste.php on line 8

Note that in the case of preg_match, the $matches array has 2 elements: the first is the whole match, and the second (at position 1) is the first, and only, capture group.
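
To make the effect on the result array concrete, here is a short sketch comparing the same pattern with a capturing and a non-capturing group around the 1 (the variable names are illustrative):

```php
<?php
$str = 'abc 123';

preg_match('/([a-z]+) (1)/', $str, $capturing);      // two capture groups
preg_match('/([a-z]+) (?:1)/', $str, $nonCapturing); // one capture group

echo count($capturing), "\n";    // 3: whole match, "abc", "1"
echo count($nonCapturing), "\n"; // 2: whole match, "abc"
```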

Lazy quantifiers

PCRE regex has lazy quantifiers.

By default, quantifiers such as + (one or more occurrences) and * (zero or more occurrences) are "greedy" and try to take as many characters as possible. But it is possible to make them "lazy", so that they take the smallest amount that satisfies the expression.

For example:

$str = "a123";
if (preg_match('/a\d*/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches); // a123
}
if (ereg('a[[:digit:]]*', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches); // a123
}

Both expressions search for the letter a followed by zero or more digits. As the * quantifier is greedy, it takes as many digits as possible (so the match, in this case, is a123). But in PCRE regex it is possible to make it lazy by adding a ? right after it:

$str = "a123";
if (preg_match('/a\d*?/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches); // a
}
if (ereg('a[[:digit:]]*?', $str, $matches)) { // ERROR!
    echo "POSIX\n";
    var_dump($matches);
}

With this, the regex takes as few characters as possible. Since * means zero or more occurrences, the smallest number of digits it can take is zero, so preg_match finds only "a". The POSIX regex, in turn, gives an error because it does not support this feature.

For more details on lazy quantifiers, see here and here.
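
A case where the difference is easier to see, as a sketch: with something that must follow the quantifier (here a closing >), the greedy version runs to the last > in the string, while the lazy one stops at the first. The HTML fragment is only sample input.

```php
<?php
$html = '<b>bold</b>';

preg_match('/<.+>/', $html, $greedy); // .+ extends to the last ">"
preg_match('/<.+?>/', $html, $lazy);  // .+? stops at the first ">"

echo $greedy[0], "\n"; // <b>bold</b>
echo $lazy[0], "\n";   // <b>
```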

Lookarounds

Lookarounds are ways of "looking" at what comes after or before a given position, without it being part of the match. For example:

$str = "a1 bc d3";
if (preg_match_all('/[a-z](?=\d)/', $str, $matches)) {
    var_dump($matches);
}

The regex searches for a letter from a to z, provided that it is followed by a digit. The lookahead (?=\d) checks whether there is a digit right after, but this digit is not part of the match. The result contains only the letters a and d, because the others don’t have a digit right after them:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(1) "a"
    [1]=>
    string(1) "d"
  }
}

If we try to do something similar with POSIX (something like [a-z](?=[[:digit:]])), it gives an error, because POSIX regex does not support lookarounds.

Note also that I used preg_match_all to get all occurrences (preg_match returns only the first one found). There is no POSIX equivalent (there is no such thing as ereg_all).
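
The example above uses a positive lookahead. As a sketch of the other directions, PCRE also has a negative lookahead, (?! ... ), and a lookbehind, (?<= ... ). The second input string is made up for the illustration.

```php
<?php
// Negative lookahead: letters that are NOT followed by a digit.
preg_match_all('/[a-z](?!\d)/', 'a1 bc d3', $notBeforeDigit);
echo implode(',', $notBeforeDigit[0]), "\n"; // b,c

// Lookbehind: digits preceded by "$" (the "$" is not part of the match).
preg_match('/(?<=\$)\d+/', 'cost $25 and 30 units', $price);
echo $price[0], "\n"; // 25
```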

Replace with callback

The PCRE functions allow a callback function to be used in a replacement, via preg_replace_callback:

echo preg_replace_callback('/\b([A-Z]+)\b/', // looks for words written entirely in uppercase
    function ($match) {
        // keeps only the first letter uppercase, the rest in lowercase
        $word = $match[1];
        return $word[0] . strtolower(substr($word, 1));
    },
    'ABC def GHI X jkl'); // Abc def Ghi X jkl

There is no equivalent in POSIX.
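
Because the callback is ordinary PHP code, the replacement can be computed from the match, which a fixed replacement string cannot do. A minimal sketch, with made-up input, that doubles every integer in a string:

```php
<?php
// Double every integer found in the string.
$result = preg_replace_callback(
    '/\d+/',
    function (array $match): string {
        // $match[0] holds the matched digits
        return (string) ((int) $match[0] * 2);
    },
    'a1 b22 c3'
);
echo $result, "\n"; // a2 b44 c6
```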

And much more

There are many other PCRE features that POSIX regex does not support, such as recursive regex, comments, atomic groups, Unicode properties, inline flags, conditionals, branch reset, control verbs (there is a usage example here), subroutines, etc.
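
A brief sketch of three of these features (Unicode properties, inline flags and comments). It assumes the source file and the input strings are encoded in UTF-8; the sample strings are illustrative.

```php
<?php
// Unicode properties: needs the "u" flag and UTF-8 input.
// \p{Lu} is an uppercase letter, \p{Ll} a lowercase letter.
preg_match_all('/\p{Lu}\p{Ll}+/u', 'Olá Érica, tudo bem?', $names);
echo implode(',', $names[0]), "\n"; // Olá,Érica

// Inline flag: (?i) turns on case-insensitivity from that point on.
$mixed = preg_match('/abc(?i)def/', 'abcDEF'); // 1: "DEF" is matched ignoring case
$upper = preg_match('/abc(?i)def/', 'ABCdef'); // 0: "abc" is still case sensitive

// "x" flag: whitespace in the pattern is ignored and # starts a comment.
$pattern = '/
    (\d{4})  # year
    -
    (\d{2})  # month
/x';
preg_match($pattern, '2020-06', $date);
echo $date[1], ' ', $date[2], "\n"; // 2020 06
```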

Some of these features I consider "esoteric" and don’t usually use, because they tend to serve situations in which regex is not the best solution. For example, subroutines can be used for parsing (see examples here and here), but using a dedicated parser is much better; recursive regex can be used to check for balanced parentheses/brackets/braces (see examples here and here), but there are simpler algorithms for this that do not need a complicated regex.
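
As an illustration of such a simpler algorithm, here is a minimal stack-based sketch for the balanced-symbols case. The function name is_balanced is hypothetical (it is not part of PHP), and the sketch ignores complications such as brackets inside string literals.

```php
<?php
// Hypothetical helper: checks that (), [] and {} are balanced
// and properly nested in $text.
function is_balanced(string $text): bool
{
    $pairs = [')' => '(', ']' => '[', '}' => '{'];
    $stack = [];
    foreach (str_split($text) as $char) {
        if (in_array($char, $pairs, true)) {
            $stack[] = $char; // opening symbol: remember it
        } elseif (isset($pairs[$char])) {
            // closing symbol: must match the most recent opening one
            if (array_pop($stack) !== $pairs[$char]) {
                return false;
            }
        }
    }
    return $stack === []; // nothing may be left unclosed
}

var_dump(is_balanced('f(a[1] + {b})')); // bool(true)
var_dump(is_balanced('f(a[1)]'));       // bool(false)
var_dump(is_balanced('((x)'));          // bool(false)
```

Each character is visited once, and a failure can be reported at the exact position where the nesting breaks, which is harder to obtain from a single recursive pattern.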