Refactoring function to remove punctuation, spaces and special characters


Viewed 59,028 times

24

I have a rather old function that "cleans" the contents of a variable:

Function

function sanitizeString($string) {

    // input array
    $what = array( 'ä','ã','à','á','â','ê','ë','è','é','ï','ì','í','ö','õ','ò','ó','ô','ü','ù','ú','û','À','Á','É','Í','Ó','Ú','ñ','Ñ','ç','Ç',' ','-','(',')',',',';',':','|','!','"','#','$','%','&','/','=','?','~','^','>','<','ª','º' );

    // output array
    $by   = array( 'a','a','a','a','a','e','e','e','e','i','i','i','o','o','o','o','o','u','u','u','u','A','A','E','I','O','U','n','n','c','C','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_' );

    // return the string
    return str_replace($what, $by, $string);
}

Usage

<?php
$pessoa = 'João dos Santos Videira';

$pastaPessoal = sanitizeString($pessoa);

// result
echo $pastaPessoal; // Joao_dos_Santos_Videira
?>

When this function was written, replacing character A with character B was the best option available. But maintaining an input array and an output array in parallel is not easy, and every now and then an unforeseen case turns up.

Now that PHP has evolved, how can this function be refactored to use solutions built into the language, or at least to be easier to maintain?

  • 1

    Note referring to the tag Sanitize: I couldn’t translate or find an existing tag for this word.

  • Mark an answer as accepted if one of them works for you.

  • Sanitize means to clean up or make hygienic. But a literal translation would be kind of ugly, haha! Better to leave it as it is.

7 answers

25

Just use regular expressions!

<?php
function sanitizeString($str) {
    $str = preg_replace('/[áàãâä]/ui', 'a', $str);
    $str = preg_replace('/[éèêë]/ui', 'e', $str);
    $str = preg_replace('/[íìîï]/ui', 'i', $str);
    $str = preg_replace('/[óòõôö]/ui', 'o', $str);
    $str = preg_replace('/[úùûü]/ui', 'u', $str);
    $str = preg_replace('/[ç]/ui', 'c', $str);
    // $str = preg_replace('/[,(),;:|!"#$%&/=?~^><ªº-]/', '_', $str);
    $str = preg_replace('/[^a-z0-9]/i', '_', $str);
    $str = preg_replace('/_+/', '_', $str); // Bacco's idea :)
    return $str;
}
?>

The line right after the commented-out one replaces every character that is not a letter or a digit with "_".
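One side effect of the /i flag above is that uppercase accented letters come out lowercase ("Á" becomes "a"). A minimal sketch of a case-preserving variant follows; the function name sanitizeStringKeepCase is hypothetical, and it assumes the source file and the input are valid UTF-8.

```php
<?php
// Sketch only: sanitizeStringKeepCase is a hypothetical name, not part of the answer.
// Assumes this file and the input are UTF-8 encoded.
function sanitizeStringKeepCase(string $str): string
{
    // Each pattern carries the /u flag so PCRE reads the subject as UTF-8;
    // upper and lower case get separate classes instead of the /i flag.
    $map = [
        '/[áàãâä]/u' => 'a', '/[ÁÀÃÂÄ]/u' => 'A',
        '/[éèêë]/u'  => 'e', '/[ÉÈÊË]/u'  => 'E',
        '/[íìîï]/u'  => 'i', '/[ÍÌÎÏ]/u'  => 'I',
        '/[óòõôö]/u' => 'o', '/[ÓÒÕÔÖ]/u' => 'O',
        '/[úùûü]/u'  => 'u', '/[ÚÙÛÜ]/u'  => 'U',
        '/ç/u' => 'c', '/Ç/u' => 'C',
        '/ñ/u' => 'n', '/Ñ/u' => 'N',
    ];
    $str = preg_replace(array_keys($map), array_values($map), $str);

    // Collapse every run of other characters into a single underscore.
    $str = preg_replace('/[^A-Za-z0-9]+/', '_', $str);

    // Drop underscores left at either end.
    return trim($str, '_');
}

echo sanitizeStringKeepCase('João dos Santos Videira'), "\n"; // Joao_dos_Santos_Videira
echo sanitizeStringKeepCase('Ávila, João (Çedilha)'), "\n";   // Avila_Joao_Cedilha
```

Trimming the leading and trailing underscores is an extra design choice here, not something the answer does.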

  • If I had to implement it, I’d do something similar to that answer. However, for a project in production I think using Urlify (see my answer) is more advantageous in the long term, since it is well tested, easy to extend and modify.

6

I think this would be the best and simplest solution to your problem:

$valor = "João dos Santos Videira";
$valor = str_replace(" ","_",preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities(trim($valor))));
// Joao_dos_Santos_Videira

If you want to keep the spaces instead of exchanging them for "_", just remove the str_replace:

$valor = "João dos Santos Videira";
$valor = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities(trim($valor)));
// Joao dos Santos Videira
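
To make the trick easier to follow, here is a rough sketch of the intermediate steps. The function name stripAccentsViaEntities is hypothetical, and the flags and charset are passed explicitly as an assumption so the result does not depend on the PHP version's defaults. It also shows a caveat: any named entity is reduced to its first letter, so "&" ends up as "a".

```php
<?php
// Sketch only: stripAccentsViaEntities is a hypothetical name.
// Flags and charset are explicit so the result does not depend on PHP defaults.
function stripAccentsViaEntities(string $value): string
{
    // "ã" becomes "&atilde;", "ç" becomes "&ccedil;", and so on.
    $entities = htmlentities(trim($value), ENT_QUOTES, 'UTF-8');

    // Keep only the first letter of each named entity.
    return preg_replace('/&([a-z])[a-z]+;/i', '$1', $entities);
}

echo htmlentities('João', ENT_QUOTES, 'UTF-8'), "\n";  // Jo&atilde;o
echo stripAccentsViaEntities('João dos Santos'), "\n"; // Joao dos Santos
echo stripAccentsViaEntities('Tom & Jerry'), "\n";     // Tom a Jerry
```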
  • 1

    This was the best answer and solution. I added two things to my login system: $FormataLogin = str_replace(" ","",strtolower(preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities(trim($_POST['login']))))); Thank you.

  • Great answer.

6

You may want to use the library Urlify.php (source code here), which has extensive testing to support multiple characters and languages, and also supports adding more complex mappings than 1 character -> 1 character.

It also ignores symbols it cannot transliterate, which makes it robust enough to use in a URL or filename.

Here are some examples from the project page:

Cleaning a string for use in a URL or filename

echo URLify::filter (' J\'étudie le français ');
// "jetudie-le-francais"    
echo URLify::filter ('Lo siento, no hablo español.');
// "lo-siento-no-hablo-espanol"

Only converting special characters to ASCII

echo URLify::downcode ('J\'étudie le français');
// "J'etudie le francais"
echo URLify::downcode ('Lo siento, no hablo español.');
// "Lo siento, no hablo espanol."

Mapping complex characters to expressions

URLify::add_chars (array (
    '¿' => '?', '®' => '(r)', '¼' => '1/4',
    '½' => '1/2', '¾' => '3/4', '¶' => 'P'
));
echo URLify::downcode ('¿ ® ¼ ½ ¾ ¶');
// "? (r) 1/4 1/2 3/4 P"

6

You can strip accents with PHP's iconv, which preserves case and avoids mapping conflicts. //IGNORE discards characters that have no transliteration. preg_replace then removes spaces and everything that is not A-Z, a-z, 0-9 or a hyphen, leaving a clean string without spaces, symbols or special characters.

$string = "ÁÉÍÓÚáéíóú! äëïöü";
$string = iconv( "UTF-8" , "ASCII//TRANSLIT//IGNORE" , $string );
$string = preg_replace( array( '/[ ]/' , '/[^A-Za-z0-9\-]/' ) , array( '' , '' ) , $string );

-----------------------------------------------------------------
Input:  ÁÉÍÓÚáéíóú! äëïöü
Output: AEIOUaeiouaeiou

See an example on ideone
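
The two steps above can be wrapped in a small function, roughly as sketched here. The name stripToAlnum is hypothetical. Note that what //TRANSLIT produces depends on the iconv implementation and locale in use, so no expected output is shown for the accented input; the final filter only guarantees that the result is limited to ASCII letters, digits and hyphens.

```php
<?php
// Sketch only: stripToAlnum is a hypothetical name wrapping the two steps above.
function stripToAlnum(string $input): string
{
    // TRANSLIT asks iconv for an ASCII approximation; IGNORE drops what has none.
    // The exact approximation depends on the iconv implementation and locale.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    // iconv() returns false on failure; fall back to the raw input so the
    // filter below still removes every non-ASCII byte.
    if ($ascii === false) {
        $ascii = $input;
    }

    // Keep ASCII letters, digits and hyphens; spaces and everything else go away.
    return preg_replace('/[^A-Za-z0-9\-]/', '', $ascii);
}

echo stripToAlnum('ÁÉÍÓÚáéíóú! äëïöü'), "\n";
echo stripToAlnum('abc-123 x'), "\n"; // abc-123x
```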

  • A small change for those who need to keep spaces:

    $string = "ÁÉÍÓÚáéíóú! äëïöü";
    $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $string = preg_replace('/[^a-zA-Z0-9\s]/', '', $string);
    -------------------
    Input: ÁÉÍÓÚáéíóú! äëïöü
    Output: AEIOUaeiou aeiou

4

You are looking for the strtr() function. Regular expressions help to deal with the exceptional cases:

function sanitizeString($str)
{
    return preg_replace('{\W}', '', preg_replace('{ +}', '_', strtr(
        utf8_decode(html_entity_decode($str)),
        utf8_decode('ÀÁÃÂÉÊÍÓÕÔÚÜÇÑàáãâéêíóõôúüçñ'),
        'AAAAEEIOOOUUCNaaaaeeiooouucn')));
}

PS: I used utf8_decode() because my source files are saved as UTF-8 on my system (OS X). You probably don't need it if the file is saved in a single-byte encoding such as ISO-8859-1 or CP1252.
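
Since utf8_decode() is deprecated as of PHP 8.2, one possible alternative is to pass strtr() an array of replacement pairs, which works directly on UTF-8 strings. This is a sketch under that assumption, not the answer's own code; the name sanitizeWithMap is hypothetical, and both the source file and the input are assumed to be UTF-8.

```php
<?php
// Sketch only: sanitizeWithMap is a hypothetical name.
// Assumes the source file and the input are both UTF-8.
function sanitizeWithMap(string $str): string
{
    // With an array, strtr() replaces whole substrings, so multi-byte
    // UTF-8 characters work as keys and utf8_decode() is not needed.
    $map = [
        'À' => 'A', 'Á' => 'A', 'Ã' => 'A', 'Â' => 'A', 'É' => 'E', 'Ê' => 'E',
        'Í' => 'I', 'Ó' => 'O', 'Õ' => 'O', 'Ô' => 'O', 'Ú' => 'U', 'Ü' => 'U',
        'Ç' => 'C', 'Ñ' => 'N',
        'à' => 'a', 'á' => 'a', 'ã' => 'a', 'â' => 'a', 'é' => 'e', 'ê' => 'e',
        'í' => 'i', 'ó' => 'o', 'õ' => 'o', 'ô' => 'o', 'ú' => 'u', 'ü' => 'u',
        'ç' => 'c', 'ñ' => 'n',
    ];
    $str = strtr(html_entity_decode($str, ENT_QUOTES, 'UTF-8'), $map);

    // Runs of spaces become one underscore.
    $str = preg_replace('/ +/', '_', $str);

    // Remove everything that is not an ASCII letter, digit or underscore.
    return preg_replace('/[^A-Za-z0-9_]/', '', $str);
}

echo sanitizeWithMap('João dos Santos Videira'), "\n"; // Joao_dos_Santos_Videira
echo sanitizeWithMap('Ação Única'), "\n";              // Acao_Unica
```

The explicit character class replaces \W here so the result does not depend on locale settings.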

  • Thanks for the answer, but I'm a little confused after reading it. You mention utf8_encode() but use the utf8_decode() function. Also, you say you use utf8* because of the way you save the file? I assume you're saying that because the application's headers are not in UTF-8? I haven't had a chance to test your solution, but if you can clear up my doubts in the meantime, I'd appreciate it!

  • Oops! I did mean utf8_decode(); I've already corrected my remark. In my case utf8_decode() is necessary because I saved the files with UTF-8 encoding on my system, and the function itself contains accented characters.

0

I developed a different way to remove all special characters and accents.

<?php

$VString = "Matan-zá@Ç/_454ç8";

echo "With special characters and accents: " . $VString . "<br>";

// The /u modifier makes PCRE treat pattern and subject as UTF-8, so each
// accented letter is matched as one character rather than as separate bytes.
$VString = preg_replace("/[ÁÀÂÃÄ]/u", "A", $VString);
$VString = preg_replace("/[áàâãä]/u", "a", $VString);
$VString = preg_replace("/[ÉÈÊ]/u", "E", $VString);
$VString = preg_replace("/[éèê]/u", "e", $VString);
$VString = preg_replace("/[ÍÌ]/u", "I", $VString);
$VString = preg_replace("/[íì]/u", "i", $VString);
$VString = preg_replace("/[ÓÒÔÕÖ]/u", "O", $VString);
$VString = preg_replace("/[óòôõö]/u", "o", $VString);
$VString = preg_replace("/[ÚÙÜ]/u", "U", $VString);
$VString = preg_replace("/[úùü]/u", "u", $VString);
$VString = preg_replace("/[Ç]/u", "C", $VString);
$VString = preg_replace("/[ç]/u", "c", $VString);
$VString = preg_replace("/[Ñ]/u", "N", $VString);
$VString = preg_replace("/[ñ]/u", "n", $VString);

echo "With special characters: " . $VString . "<br>";

// Keep only ASCII letters and digits; mb_substr() indexes by character,
// matching the character count returned by mb_strlen().
$VNovo = "";
for ($i = 0; $i < mb_strlen($VString); $i++)
  if (preg_match("/[a-zA-Z0-9]/", mb_substr($VString, $i, 1)) == 1)
    $VNovo .= mb_substr($VString, $i, 1);

echo "Without special characters or accents: " . $VNovo . "<br>";

-1

public static function removerCaracteresEspeciaiss($str){
    $str_saida = "";
    for($i=0; $i<strlen($str); $i++){
        $num_asc = ord($str[$i]);
        if( ($num_asc>=65 && $num_asc<=90) || ($num_asc>=97 && $num_asc<=122) || ($num_asc>=48 && $num_asc<=57)){
            $str_saida .= $str[$i];
        }
    }
    return $str_saida;
}
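
The loop above keeps only the bytes for A-Z, a-z and 0-9, so accented letters are dropped rather than converted. As a rough illustration, the same filter can be written as a single preg_replace; the standalone function name keepAsciiAlnum is hypothetical.

```php
<?php
// Sketch only: keepAsciiAlnum is a hypothetical standalone name.
function keepAsciiAlnum(string $str): string
{
    // Remove every byte that is not an ASCII letter or digit.
    return preg_replace('/[^A-Za-z0-9]/', '', $str);
}

echo keepAsciiAlnum('Matan-zá@Ç/_454ç8'), "\n"; // Matanz4548
echo keepAsciiAlnum('João'), "\n";              // Joo
```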
