Migrating from preg_replace to preg_replace_callback

I’m having trouble migrating a script of mine that cleans up characters in a sentence.

With preg_replace (deprecated), I use the keys and values of the array for the replacement. However, preg_replace_callback passes whatever the regex matched to the callback as a separate array, which prevents me from using the keys of the original array.

function LimpaTexto($texto) {

    $texto = html_entity_decode($texto);

    $texto = strtolower(trim($texto));

    $replaces = array(
        '/[áaãâäÁAAÂÄ]/'     => 'a',
        '/[éèêë&ÉEeË]/'     => 'e',
        '/[íìîïÍ]/'      => 'i',
        '/[óòõôöOÔÓÖO]/'     => 'o',
        '/[úùûüÚUUÜ]/'      => 'u',
        '/[çÇ]/'         => 'c',
        '/[ñnN]/'         => 'n',
        '/\s[\s]+/'      => '-',
        '/( )/'          => '-',
        '/( )\/( )/'          => '-',
        '/( )[-]( )/'          => '-',
        '/\//'       => '-',
        '/[^a-z0-9\-_]/' => '', 
        '/-+/'           => '-', 
        '/[.]/'          => '-'
        );

    $texto = preg_replace(array_keys($replaces), array_values($replaces), $texto);

    return $texto;
}

That is, each pattern in array_keys is matched and replaced by the corresponding entry in array_values.

I couldn’t find a way to do the same with preg_replace_callback, other than splitting the text character by character inside the callback and comparing each one for replacement, which makes the script more expensive in both performance and size.

  • 2

    preg_replace was not deprecated; what was deprecated was the e (eval) modifier ;)

  • Yes, indeed... after studying it more closely I found that out. Thank you.

  • About your comment on not understanding English: the documentation is also available in Portuguese -> https://www.php.net/manual/en/function.preg-replace.php

2 answers

0

First, I recommend reading this question: as stated by @mgibsonbr, and as the answers and the related article show, regex in PHP does not support Unicode characters by default. Without the u modifier, patterns and subjects are processed byte by byte.

Thus the expressions

'/[áaãâäÁAAÂÄ]/'
'/[éèêë&ÉEeË]/'
'/[íìîïÍ]/'
'/[óòõôöOÔÓÖO]/'
'/[úùûüÚUUÜ]/'
'/[çÇ]/'
'/[ñnN]/'

will not work as expected on UTF-8 text.

To solve this problem you can use the str_replace function, which searches for exactly the specified string.
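
A minimal sketch of the byte-versus-character issue, assuming PHP 7.0+ (for the "\u{...}" escapes) and UTF-8 text. The variable names are only illustrative. The same character class is applied once without and once with the u modifier:

```php
<?php
// "\u{E1}" is á encoded as UTF-8 (two bytes: 0xC3 0xA1).
$subject = "\u{E1}";
$class   = "[\u{E1}\u{E3}]"; // the character class [áã]

// Without u: the pattern and the subject are treated as raw bytes,
// so each of the two bytes of á matches the class and is replaced.
$withoutU = preg_replace('/' . $class . '/', 'a', $subject);

// With u: both are interpreted as UTF-8, so á is one character.
$withU = preg_replace('/' . $class . '/u', 'a', $subject);

var_dump($withoutU); // string(2) "aa"
var_dump($withU);    // string(1) "a"
```

If the script and its input are stored in a single-byte encoding such as ISO-8859-1, each accented letter is one byte and the patterns without u may appear to work, which could explain why the problem goes unnoticed.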

For example this function:

function changeLetters($string, $down = true){

    $letters = array(
        'A'=>array('@','â','ä','à','å','Ä','Å','á','ª','Á','Â','À','ã','Ã'),
        'E'=>array('&','é','ê','ë','è','É','£','Ê','Ë','È'),
        'I'=>array('!','ï','î','ì','¡','Í','Î','Ï','Ì','í'),
        'O'=>array('ô','ö','ò','Ö','ø','Ø','ó','º','¤','ð','Ó','Ô','Ò','õ','Õ'),
        'U'=>array('ü','û','ù','Ü','ú','µ','Ú','Û','Ù'),
        'B'=>array('ß'),
        'C'=>array('Ç','ç','©','¢'),
        'D'=>array('Ð'),
        'F'=>array('ƒ'),
        'L'=>array('¦'),
        'N'=>array('ñ','Ñ'),
        'S'=>array('$','§'),
        'X'=>array('×'),
        'Y'=>array('ÿ','¥','ý','Ý'),
        'AE'=>array('æ','Æ'),
        'P'=>array('þ','Þ'),
        'R'=>array('®'),
        '0'=>array('°'),
        '1'=>array('¹','ı'),
        '2'=>array('²'),
        '3'=>array('³'),
    );

    foreach ($letters as $letter => $change){
        // Use the lowercase replacement when $down is true
        if ($down) { $letter = strtolower($letter); }
        $string = str_replace($change, $letter, $string);
    }

    return $string;
}
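
For a quick usage example, here is a reduced sketch of the same str_replace idea. The function name replaceAccents and the shortened map are hypothetical, and it assumes the file and the input are both UTF-8:

```php
<?php
// Hypothetical, shortened map: plain letter => accented variants to replace.
function replaceAccents(string $text): string
{
    $map = [
        'a' => ['á', 'à', 'â', 'ã', 'ä'],
        'e' => ['é', 'è', 'ê', 'ë'],
        'o' => ['ó', 'ò', 'ô', 'õ', 'ö'],
        'c' => ['ç'],
    ];

    // str_replace compares exact byte sequences, so multibyte
    // characters are matched as whole strings, not byte by byte.
    foreach ($map as $plain => $accented) {
        $text = str_replace($accented, $plain, $text);
    }

    return $text;
}

echo replaceAccents('Ação'), PHP_EOL; // Acao
```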

  • Hello my friend. Thank you for the answer. After studying a little more, I concluded that my function can continue to be used because it is not deprecated. My English is very bad, so I didn’t understand the text in the PHP manual properly. The point you raised about separating the characters is very interesting; however, the function I describe in the question has been working perfectly for 5 years, without any bugs. My concern was the migration to a new function; since that is not needed, I am quite relieved.

0

The preg_replace function is not deprecated. What was removed in PHP 7 is the e flag, but since you don’t use it in any of your regexes, there is no problem.

Just to complement: it is possible to remove the accents another way. First you must have the intl extension installed (it provides the Normalizer class). Then it is enough to normalize the string and apply the regex /\p{M}/u:
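
To illustrate why no migration is needed, here is a minimal sketch (the sample patterns and input are made up for the example, not taken from the question). It shows that the array form of preg_replace behaves like a loop that applies each pattern/replacement pair in order, and how the same map could be passed to preg_replace_callback_array (available since PHP 7.0) if a callback were ever required:

```php
<?php
// Illustrative data only: pattern => replacement.
$replaces = [
    '/\s+/' => '-',
    '/-+/'  => '-',
];
$input = 'hello   big -- world';

// Array form: pair N is applied to the result of pair N-1.
$viaArrays = preg_replace(array_keys($replaces), array_values($replaces), $input);

// The same thing written as an explicit loop.
$viaLoop = $input;
foreach ($replaces as $pattern => $replacement) {
    $viaLoop = preg_replace($pattern, $replacement, $viaLoop);
}

// Callback variant: the keys are the patterns, the values are callbacks.
// Each closure captures its own replacement string.
$callbacks = [];
foreach ($replaces as $pattern => $replacement) {
    $callbacks[$pattern] = function (array $match) use ($replacement): string {
        return $replacement;
    };
}
$viaCallbacks = preg_replace_callback_array($callbacks, $input);

echo $viaArrays, PHP_EOL; // hello-big-world
```

For fixed replacement strings like these, the callback version adds nothing; it only becomes useful when the replacement has to be computed from the match.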

$replaces = array(
    '/\p{M}/u'       => '',
    '/N/'            => 'n',
    '/\s[\s]+/'      => '-',
    '/( )/'          => '-',
    '/( )\/( )/'     => '-',
    '/( )[-]( )/'    => '-',
    '/\//'           => '-',
    '/[^a-z0-9\-_]/' => '', 
    '/-+/'           => '-', 
    '/[.]/'          => '-'
    );
$texto = preg_replace(array_keys($replaces), array_values($replaces), Normalizer::normalize($texto, Normalizer::FORM_D));

Basically, normalization to NFD (the FORM_D above) decomposes an accented character into two or more. For example, á is decomposed into two characters: the letter a without an accent, and the accent itself ´. Then the regex \p{M} matches characters in the Unicode categories "Mark, Spacing Combining", "Mark, Enclosing" or "Mark, Nonspacing", which encompass all these accents.

And remember that this code is not limited to accents. For example, the character ç (c with cedilla) in NFD is decomposed into two: the letter c and the character ̧ (COMBINING CEDILLA). Since COMBINING CEDILLA is in the category "Mark, Nonspacing", it is also removed, so the ç ends up being replaced by c.

In any case, the regex \p{M}, together with normalization, already removes all the accents (Note: the normalization algorithm is defined by Unicode; for more details about how it works, see the Unicode documentation on normalization forms).
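
The decomposition and removal described above can be sketched in isolation as follows. The function name stripMarks is hypothetical, and the sketch assumes PHP 7.0+ with the intl extension loaded and valid UTF-8 input:

```php
<?php
// Hypothetical helper: decompose to NFD, then drop every combining mark.
function stripMarks(string $text): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // NFD turns "ç" into "c" + U+0327 and "ã" into "a" + U+0303.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // \p{M} matches the combining marks left behind by the decomposition.
    return preg_replace('/\p{M}/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    // "a\u{E7}\u{E3}o" is the word "ação" written with escapes.
    echo stripMarks("a\u{E7}\u{E3}o"), PHP_EOL; // acao
}
```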


The other expressions are the same as in your code, although I think there is room for improvement. Since the shortcut \s means "space, TAB, line breaks, among others", having one regex with \s and then another with only a space, both replacing with the same character, seems redundant to me.

Another detail is that [\s]+ is the same as \s+ (when there is only one element inside a character class, that is, inside the brackets, there is no gain in using them, so they can be removed). So the second regex would be \s\s+ (two or more occurrences of \s), which you can simplify to \s{2,}.
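
A quick check of that equivalence, as a small sketch with made-up sample strings:

```php
<?php
// Illustrative inputs: runs of 2+ whitespace, a single space, no space.
$samples = ["a  b", "a \t\n b", "a b", "ab"];

$results = [];
foreach ($samples as $sample) {
    $original   = preg_replace('/\s[\s]+/', '-', $sample); // pattern from the question
    $simplified = preg_replace('/\s{2,}/', '-', $sample);  // simplified form
    $results[]  = [$original, $simplified];
}

// Both patterns only touch runs of two or more whitespace characters,
// so a single space ("a b") is left alone by either one.
foreach ($results as [$original, $simplified]) {
    var_dump($original === $simplified); // bool(true) for every sample
}
```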

The expressions /( )\/( )/ and /( )[-]( )/ can be replaced by /( )[-\/]( )/ (a space, followed by a hyphen or slash, and another space). In fact, you can even remove the parentheses and leave only the spaces, which makes no difference. The parentheses form capture groups, but since you don’t use the groups for anything, you don’t need them. I understand that in some cases the parentheses serve to make it clear that there is a space there (some find it more readable than / \/ /, or more "obvious" that the expression contains spaces), but they are not necessary for this regex to work.
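
As a small sketch (the input string is invented for the example), the two separate expressions, the merged one, and the merged one without capture groups all give the same result:

```php
<?php
$input = 'rock / pop - jazz'; // illustrative input

// The two expressions from the question, applied one after the other.
$separate = preg_replace(['/( )\/( )/', '/( )[-]( )/'], '-', $input);

// Merged into a single character class.
$merged = preg_replace('/( )[-\/]( )/', '-', $input);

// Same pattern without the unused capture groups.
$noGroups = preg_replace('/ [-\/] /', '-', $input);

echo $separate, PHP_EOL; // rock-pop-jazz
echo $merged, PHP_EOL;   // rock-pop-jazz
echo $noGroups, PHP_EOL; // rock-pop-jazz
```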

Notice also that I replaced [ñnN] with just N, since you were replacing these characters with n. Since the first regex already removes the accents (so ñ becomes n) and replacing n with n is redundant, only [N] would remain. And, as already said, when there is only one element between the brackets they can be removed, so this expression becomes just N.


Another detail is that '/[^a-z0-9\-_]/' => '' removes everything that is not a lowercase letter, digit, hyphen or _. This means that this regex also removes the periods in the string, so when the last expression ([.]) is reached, there are no periods left to be replaced by - (because preg_replace applies the substitutions in the order they appear in the array). Check whether this is what you intend (if so, the regex [.] serves no purpose, since the periods have already been removed and there is nothing left for it to replace).
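
The effect of the ordering can be seen with a short sketch that uses only the two expressions involved (the input is made up):

```php
<?php
$input = 'v1.5.'; // illustrative input

// Order used in the question: the negated class runs first and deletes
// the periods, so [.] has nothing left to replace.
$asInQuestion = preg_replace(['/[^a-z0-9\-_]/', '/[.]/'], ['', '-'], $input);

// Reversed order: periods become hyphens first, and hyphens survive
// the negated class because it explicitly allows them.
$dotFirst = preg_replace(['/[.]/', '/[^a-z0-9\-_]/'], ['-', ''], $input);

echo $asInQuestion, PHP_EOL; // v15
echo $dotFirst, PHP_EOL;     // v1-5-
```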

By the way, one more detail: if you want a regex that matches only the period, you can switch to /\./. This is a somewhat confusing detail, but basically, outside the brackets, the metacharacters (those that have a special meaning, such as the ., which means "any character (except line breaks)") need to be escaped with \, whereas inside the brackets they usually do not.

Finally, the last two expressions (-+, which is "one or more hyphens", and [.], which is the dot character itself) can be replaced by /-+|\./ (the | character indicates an alternation, that is, the regex matches -+ or \.). You can do this because both are replaced by the same thing, so there is no reason to do it separately in two regexes.
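
A minimal sketch comparing the two separate expressions with the single alternation (sample input invented for the example):

```php
<?php
$input = 'a--b.c---d'; // illustrative input

// Two separate expressions, as in the original array.
$twoSteps = preg_replace(['/-+/', '/[.]/'], ['-', '-'], $input);

// One expression with alternation: a run of hyphens OR a literal dot.
$oneStep = preg_replace('/-+|\./', '-', $input);

echo $twoSteps, PHP_EOL; // a-b-c-d
echo $oneStep, PHP_EOL;  // a-b-c-d
```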
