As stated in the documentation, the POSIX-compatible regex functions (such as ereg and ereg_replace, among others) were deprecated in PHP 5.3 and removed in PHP 7. So if you're using version 7, the only option is PCRE (Perl Compatible Regular Expressions) and the corresponding preg_* functions (such as preg_match, preg_replace, etc.).
But if you are using a version earlier than 7, where the POSIX functions still exist, I think it is worth pointing out some important differences, including some that are mentioned in the documentation. Let's see some examples.
Character classes, shortcuts and delimiters
$str = 'abc123xyz';
if (preg_match('/\d+/', $str, $matches)) {
    var_dump($matches);
}
if (ereg('[[:digit:]]+', $str, $matches)) {
    var_dump($matches);
}
In the PCRE functions, the regex must be enclosed in delimiters. In the example above I used slashes: everything between the / characters is the regex, but the slashes themselves are not part of the expression. The POSIX functions, on the other hand, take no delimiters. Both calls above return:
array(1) {
[0]=>
string(3) "123"
}
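As a side note on delimiters, the slash is only a convention: PCRE accepts other non-alphanumeric characters as delimiters too. The sketch below uses a made-up path (an assumption, not taken from the examples above) to show why another delimiter can be handy when the pattern itself contains slashes:

```php
<?php
// Hypothetical input, chosen only because it contains slashes.
$path = '/var/www/html/index.php';

// With "/" as the delimiter, every literal slash must be escaped.
preg_match('/^\/var\/www\/(.+)$/', $path, $withSlashes);

// With "#" as the delimiter, the same pattern needs no escaping.
preg_match('#^/var/www/(.+)$#', $path, $withHash);

// Both patterns capture the same text in group 1.
var_dump($withSlashes[1] === $withHash[1]);
```

Either form works; picking a delimiter that does not appear in the pattern just keeps it easier to read.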
Another difference we can notice is that POSIX regex has no \d shortcut. Instead, we should use the special class [[:digit:]] to match any digit from 0 to 9 (this class also works in PCRE). In both cases the character class [0-9] can be used as well.
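Since the ereg functions no longer exist in PHP 7+, the three notations can only be compared with preg_match there. A minimal sketch, reusing the string from the example above:

```php
<?php
$str = 'abc123xyz';

// Three ways of writing "one or more digits" that PCRE accepts.
$patterns = ['/\d+/', '/[[:digit:]]+/', '/[0-9]+/'];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, $str, $m);
    $results[$pattern] = $m[0]; // the full match
}

var_dump($results); // every entry holds "123"
```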
Other shortcuts
The same goes for other shortcuts, such as \w (which matches letters, digits or the _ character). Its closest POSIX equivalent is [[:alnum:]] (which matches only letters and digits, but not _):
$str = 'abc123_ xyz';
if (preg_match('/\w+/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('[[:alnum:]]+', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches);
}
\w also matches the _ character, but [[:alnum:]] does not, so the result is:
PCRE
array(1) {
[0]=>
string(7) "abc123_"
}
POSIX
array(1) {
[0]=>
string(6) "abc123"
}
If we want the POSIX regex to match the _ as well, we have to change the expression to [[:alnum:]_]+.
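The bracket expression [[:alnum:]_] is also valid in PCRE, so the equivalence can be checked with preg_match alone. A small sketch on the same ASCII string (with other locales or the u flag the two may differ, which is not covered here):

```php
<?php
$str = 'abc123_ xyz';

preg_match('/\w+/', $str, $shortcut);            // shortcut
preg_match('/[[:alnum:]_]+/', $str, $withUnder); // POSIX class plus "_"
preg_match('/[[:alnum:]]+/', $str, $alnumOnly);  // POSIX class alone

var_dump($shortcut[0], $withUnder[0], $alnumOnly[0]);
```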
Flags
Another difference is that PCRE regexes have flags (modifiers) that change some details of how the expression works. For example, to do a case-insensitive search, just use the i flag. With POSIX regex you have to call a different function for that (in this case, eregi):
$str = 'ABCabc';
if (preg_match('/[a-z]+/', $str, $matches)) {
    echo "PCRE case sensitive\n";
    var_dump($matches);
}
if (preg_match('/[a-z]+/i', $str, $matches)) { // <- note the "i" after the slash
    echo "PCRE case insensitive\n";
    var_dump($matches);
}
if (ereg('[a-z]+', $str, $matches)) {
    echo "POSIX case sensitive\n";
    var_dump($matches);
}
if (eregi('[a-z]+', $str, $matches)) {
    echo "POSIX case insensitive\n";
    var_dump($matches);
}
Note that in the PCRE regex the i flag is placed after the closing delimiter (after the last /), making the regex case insensitive (it does not distinguish upper from lower case). POSIX regex has no flags, so the only way is to call another function (eregi). The result is:
PCRE case sensitive
array(1) {
[0]=>
string(3) "abc"
}
PCRE case insensitive
array(1) {
[0]=>
string(6) "ABCabc"
}
POSIX case sensitive
array(1) {
[0]=>
string(3) "abc"
}
POSIX case insensitive
array(1) {
[0]=>
string(6) "ABCabc"
}
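In PCRE the same flag can also be written inside the expression. The sketch below assumes the inline forms (?i) and (?i:...), which apply case insensitivity from that point on or only inside the group, respectively:

```php
<?php
$str = 'ABCabc';

// Flag after the closing delimiter: applies to the whole pattern.
preg_match('/[a-z]+/i', $str, $afterDelimiter);

// Inline flag at the start: same effect as above.
preg_match('/(?i)[a-z]+/', $str, $inline);

// Scoped inline flag: only the second "ABC" ignores case.
preg_match('/ABC(?i:ABC)/', $str, $scoped);

// Without any flag, the second "ABC" does not match "abc".
$strict = preg_match('/ABCABC/', $str);

var_dump($afterDelimiter[0], $inline[0], $scoped[0], $strict);
```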
PCRE has other flags with no POSIX equivalent, such as the m flag, which makes the anchors ^ and $ (which normally match the beginning and end of the string, respectively) also match the beginning and end of each line. Example:
$str = "abc\n123\nxyz\n456";
if (preg_match('/\d$/m', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('[[:digit:]]$', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches);
}
Both expressions search for a digit followed by $. In the PCRE regex, because of the m flag, $ can match either the end of the string or the end of a line, so it finds the digit 3 (preg_match always returns the first occurrence found).
In the POSIX regex, $ always matches the end of the string, so ereg finds the digit 6. For it to find the 3, I would have to modify the expression a little, so that the digit may be followed by either a \n or the end of the string:
if (ereg("([[:digit:]])(\n|$)", $str, $matches)) {
    var_dump($matches[1]); // 3
}
I put the digit in parentheses to form a capture group, and since it is the first pair of parentheses in the regex, it becomes the first group, which can be retrieved with $matches[1]. I had to do it this way to get only the digit, because if there is a \n, it is also included in the match (since I only want the digit, I create a capture group containing only it).
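To see the effect of the m flag in isolation, the same string can be searched with and without it. This sketch uses preg_match_all (which returns every occurrence instead of only the first) purely so that both results are visible at once:

```php
<?php
$str = "abc\n123\nxyz\n456";

// With "m", $ matches at the end of every line.
preg_match_all('/\d$/m', $str, $multiline);

// Without "m", $ matches only at the end of the string.
preg_match_all('/\d$/', $str, $stringEnd);

var_dump($multiline[0]); // "3" and "6"
var_dump($stringEnd[0]); // "6"
```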
Matching algorithm
A PCRE regex starts searching at the beginning of the string and returns as soon as it finds the first match, while a POSIX regex always returns the longest possible match. Using the example from the documentation:
$str = "oneselfsufficient";
if (preg_match('/one(self)?(selfsufficient)?/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches[0]); // oneself
}
if (ereg('one(self)?(selfsufficient)?', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches[0]); // oneselfsufficient
}
The regex searches for the string "one", optionally followed by "self", optionally followed by "selfsufficient". The PCRE regex returns the first match it finds: since it starts at the beginning of the string, it finds "one" and then checks whether the optional "self" is present. Since it is (the string contains "oneself") and the rest of the expression is optional, the match "oneself" is returned.
The POSIX regex, on the other hand, tests all possibilities in order to return the longest possible match (which in this case is the whole string).
Depending on the regex and the strings being searched, this can have performance implications: since PCRE returns as soon as it finds the first match, it can be faster than POSIX, which needs to test all possibilities to know which match is the longest.
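A practical consequence for anyone porting a POSIX expression to PCRE is that the order of alternatives starts to matter. The following is only a sketch of one possible rewrite of the documentation example (the rewritten patterns are my own illustration, not from the documentation):

```php
<?php
$str = 'oneselfsufficient';

// Original pattern: PCRE stops at the first successful path.
preg_match('/one(self)?(selfsufficient)?/', $str, $original);

// Shorter alternative first: PCRE is satisfied with "self".
preg_match('/one(self|selfsufficient)?/', $str, $shortFirst);

// Longer alternative first: PCRE tries it before "self".
preg_match('/one(selfsufficient|self)?/', $str, $longFirst);

var_dump($original[0], $shortFirst[0], $longFirst[0]);
```

Putting the longer alternative first gives the same result POSIX would return for this particular string, but it is not a general substitute for longest-match semantics.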
Other features
There are also other differences that are not mentioned in the documentation:
Non-capturing groups
PCRE supports non-capturing groups. When I do not want the parentheses to form a capture group, I just use the syntax (?: ... ):
$str = "abc 123";
if (preg_match('/([a-z]+) (?:1)/', $str, $matches)) { // the "1" is not a capture group
    echo "PCRE\n";
    var_dump($matches);
}
if (ereg('([a-z]+) (?:1)', $str, $matches)) { // ERROR! (invalid expression)
    echo "POSIX\n";
    var_dump($matches);
}
In the PCRE regex, only the letters are in a capture group, whereas the number 1 is not. In the POSIX regex, the syntax (?: makes the regex invalid and raises an error:
PCRE
array(2) {
[0]=>
string(5) "abc 1"
[1]=>
string(3) "abc"
}
Warning: ereg(): REG_BADRPT in /mnt/c/Users/hkotsubo/teste.php on line 8
Note that in the case of preg_match, the $matches array has 2 elements: the first is the entire match, and the second (at position 1) is the first (and only) capture group.
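The difference becomes clearer when the same expression is written with and without the non-capturing syntax. A minimal sketch, using \d+ instead of the literal 1 just to make the second group more visible:

```php
<?php
$str = 'abc 123';

// Two capturing groups: $matches gets three elements.
preg_match('/([a-z]+) (\d+)/', $str, $capturing);

// Second group is non-capturing: $matches gets two elements.
preg_match('/([a-z]+) (?:\d+)/', $str, $nonCapturing);

var_dump($capturing);    // "abc 123", "abc", "123"
var_dump($nonCapturing); // "abc 123", "abc"
```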
Lazy quantifiers
PCRE regexes have lazy quantifiers.
By default, quantifiers such as + (one or more occurrences) and * (zero or more occurrences) are "greedy" and try to match as many characters as possible. But it is possible to make them "lazy", so that they take the smallest amount that satisfies the expression.
For example:
$str = "a123";
if (preg_match('/a\d*/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches); // a123
}
if (ereg('a[[:digit:]]*', $str, $matches)) {
    echo "POSIX\n";
    var_dump($matches); // a123
}
Both expressions search for the letter a followed by zero or more digits. Since the * quantifier is greedy, it takes as much as possible (which in this case is a123). But in a PCRE regex it is possible to make it lazy by adding a ? right after it:
$str = "a123";
if (preg_match('/a\d*?/', $str, $matches)) {
    echo "PCRE\n";
    var_dump($matches); // a
}
if (ereg('a[[:digit:]]*?', $str, $matches)) { // ERROR!
    echo "POSIX\n";
    var_dump($matches);
}
With this, the regex takes as few characters as possible. Since * means zero or more occurrences, the smallest number of digits it can take is zero, so preg_match finds only "a". The POSIX regex, in turn, raises an error because it does not support this feature.
For more details on lazy quantifiers, see here and here.
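A more typical use of a lazy quantifier is stopping at the first closing marker instead of the last one. The snippet below is only an illustration with a made-up string (it is not a recommendation to parse HTML with regex):

```php
<?php
// Hypothetical input with two tagged words.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the last "</b>" in the string.
preg_match('/<b>.*<\/b>/', $html, $greedy);

// Lazy: .*? stops at the first "</b>" it can reach.
preg_match('/<b>.*?<\/b>/', $html, $lazy);

var_dump($greedy[0]); // <b>one</b> and <b>two</b>
var_dump($lazy[0]);   // <b>one</b>
```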
Lookarounds
Lookarounds are ways of "looking" at what comes after or before a certain position, without that being part of the match. Example:
$str = "a1 bc d3";
if (preg_match_all('/[a-z](?=\d)/', $str, $matches)) {
    var_dump($matches);
}
The regex searches for a letter from a to z, provided that it is followed by a digit. The lookahead (?=\d) checks whether there is a digit right after, but this digit will not be part of the match. The result contains only the letters a and d, because the others are not immediately followed by a digit:
array(1) {
[0]=>
array(2) {
[0]=>
string(1) "a"
[1]=>
string(1) "d"
}
}
If we try to do something similar with POSIX (something like [a-z](?=[[:digit:]])), it raises an error, because POSIX regex does not support lookarounds.
Note also that I used preg_match_all to get all occurrences (because preg_match returns only the first one found). There is no POSIX equivalent (there is no such thing as ereg_all).
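The example above uses a positive lookahead. As a complement, here is a sketch of two other lookaround variants that PCRE supports, a negative lookahead and a positive lookbehind, applied to the same string:

```php
<?php
$str = 'a1 bc d3';

// Negative lookahead: letters NOT followed by a digit.
preg_match_all('/[a-z](?!\d)/', $str, $notBeforeDigit);

// Positive lookbehind: digits preceded by a letter.
preg_match_all('/(?<=[a-z])\d/', $str, $afterLetter);

var_dump($notBeforeDigit[0]); // "b" and "c"
var_dump($afterLetter[0]);    // "1" and "3"
```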
Replace with callback
With the PCRE functions it is possible to use a callback function to compute the replacement, using preg_replace_callback:
echo preg_replace_callback('/\b([A-Z]+)\b/', // searches for words written entirely in uppercase
    function ($match) {
        // keeps only the first letter uppercase, the rest in lowercase
        $word = $match[1];
        return $word[0] . strtolower(substr($word, 1));
    },
    'ABC def GHI X jkl'); // Abc def Ghi X jkl
There is no equivalent in POSIX.
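The same replacement could also be sketched with ucfirst and the optional count argument of preg_replace_callback, which reports how many replacements were made. This is just a variation on the example above, not a better or official way of doing it:

```php
<?php
$count = 0;
$result = preg_replace_callback(
    '/\b[A-Z]+\b/', // whole words made only of uppercase letters
    function (array $match): string {
        // $match[0] is the full match; lowercase it, then capitalize.
        return ucfirst(strtolower($match[0]));
    },
    'ABC def GHI X jkl',
    -1,     // no limit on the number of replacements
    $count  // filled in with the number of replacements performed
);

echo $result, "\n"; // Abc def Ghi X jkl
echo $count, "\n";  // 3 (ABC, GHI and X)
```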
And much more
There are many other PCRE features that POSIX regex does not support, such as recursive regex, comments, atomic groups, Unicode properties, inline flags, conditionals, branch reset, control verbs (there is an example of use here), subroutines, etc.
Some of these features I consider "esoteric" and I don't usually use them, because they tend to serve situations in which regex is not the best solution. For example, subroutines can be used for parsing (see examples here and here), but using a dedicated parser is much better; recursive regex can be used to check for balanced parentheses/brackets/braces (see examples here and here), but there are simpler algorithms for this that do not need a complicated regex.
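As an illustration of the "simpler algorithm" for balanced brackets, here is a minimal stack-based sketch. The function name isBalanced is made up for this example, and it assumes that only the three bracket pairs below matter and that brackets inside quotes are not treated specially:

```php
<?php
// Hypothetical helper: true if (), [] and {} are correctly nested.
function isBalanced(string $text): bool
{
    $pairs = [')' => '(', ']' => '[', '}' => '{']; // closing => opening
    $stack = [];

    foreach (str_split($text) as $char) {
        if (in_array($char, $pairs, true)) {
            // Opening bracket: remember it.
            $stack[] = $char;
        } elseif (isset($pairs[$char])) {
            // Closing bracket: must match the most recent opening one.
            if (array_pop($stack) !== $pairs[$char]) {
                return false;
            }
        }
    }

    // Anything left on the stack was never closed.
    return $stack === [];
}

var_dump(isBalanced('(a[b]{c})')); // bool(true)
var_dump(isBalanced('(a[b)]'));    // bool(false)
```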
According to the documentation, the POSIX functions were deprecated in PHP 5.3 and removed in PHP 7, so it is no longer a matter of preference :-)
– hkotsubo
You are right. I was going by the documentation in Portuguese, and this information is not in the Portuguese version. Thank you.
– Everton da Rosa
I didn't answer yesterday because I didn't have time, but there is a more detailed explanation of the differences below...
– hkotsubo