str_split does not work well on strings containing UTF-8 characters?


8

I want to iterate over a string with foreach. I learned that for this I must use the function str_split, which splits the string into an array with one element per character. However, this does not work as expected on strings that contain accented characters, for example (multibyte UTF-8 characters).

Example:

str_split('coração da programação');

The result for this is:

Array
(
    [0] => c
    [1] => o
    [2] => r
    [3] => a
    [4] => �
    [5] => �
    [6] => �
    [7] => �
    [8] => o
    [9] =>  
    [10] => d
    [11] => a
    [12] =>  
    [13] => p
    [14] => r
    [15] => o
    [16] => g
    [17] => r
    [18] => a
    [19] => m
    [20] => a
    [21] => �
    [22] => �
    [23] => �
    [24] => �
    [25] => o
)

How do I split a string the same way str_split does, but keeping the UTF-8 characters intact?

5 answers

9

As already mentioned, most of the standard PHP functions do not support multibyte strings, and for these cases the ideal is to use the multibyte string (mb_*) functions. More specifically, for your question, the ideal is mb_split (which splits a string on a regular-expression pattern).
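To illustrate why the byte-oriented functions break accented characters, here is a minimal sketch. It assumes PHP 7.4 or later with the mbstring extension enabled and a source file saved as UTF-8. It uses mb_str_split, which is not the function named in this answer; it is shown only as one multibyte-aware way to get one element per character.

```php
<?php
// Assumes PHP >= 7.4 with the mbstring extension; file saved as UTF-8.
$str = 'coração';

// strlen() counts bytes, mb_strlen() counts characters:
// each accented letter takes more than one byte in UTF-8.
$byteCount = strlen($str);
$charCount = mb_strlen($str, 'UTF-8');

// mb_str_split() cuts on character boundaries instead of byte boundaries.
$chars = mb_str_split($str, 1, 'UTF-8');

foreach ($chars as $char) {
    echo $char, PHP_EOL;
}
```

Because str_split cuts after every byte, a two-byte character such as ç ends up as two invalid fragments, which is what the � entries in the question's output are.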

8


Since some PHP functions do not support multibyte characters, the solution is regex O.o, because the PCRE library does support them.

You can use the dot metacharacter (.) to break the string into an array and get the same result as str_split(). Keep in mind that this requires the PCRE u modifier.

$str = 'ação';
preg_match_all('/./u', $str, $arr);

echo "<pre>";
print_r($arr);

Output:

Array
(
    [0] => Array
        (
            [0] => a
            [1] => ç
            [2] => ã
            [3] => o
        )

)
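Since the goal in the question is to iterate with foreach, the same idea can be wrapped in a small function. This is only a sketch: the name utf8_chars is hypothetical, and the s modifier is an addition (not in the answer) so that the dot also matches newlines.

```php
<?php
// Hypothetical helper (not from the answer): returns the characters of a UTF-8 string.
function utf8_chars(string $str): array
{
    // 'u' treats pattern and subject as UTF-8; 's' lets '.' also match newlines.
    if (preg_match_all('/./us', $str, $matches) === false) {
        return []; // e.g. the subject is not valid UTF-8
    }
    // $matches[0] holds the full matches, one per character.
    return $matches[0];
}

foreach (utf8_chars('ação') as $index => $char) {
    echo $index, ' => ', $char, PHP_EOL;
}
```

Returning $matches[0] removes the extra nesting level seen in the output above.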
  • Interesting. I'd thought about using preg_split with the ., but it leaves a lot of empty entries, so you have to use PREG_SPLIT_NO_EMPTY. preg_match_all doesn't have that problem.

  • @Wallacemaxters, each string lib with Unicode support? I don't want stricter typing in the language :P, why not solve more serious problems before adding new super features.

  • You meant "where is". The only UTF-8 feature I saw implemented is the one for special characters, like: "My name x66 Wallace".

2

PHP's native string functions work on bytes and do not handle multibyte Unicode characters, but you can handle them with a regular expression.

preg_split('//u', 'coração da programação');

u is the modifier for Unicode.

1

You can do this with the preg_split() function.

A regular expression that provides greater compatibility is /(?<!^)(?!$)/u

   $str = 'coração da programação';
   preg_split("/(?<!^)(?!$)/u", $str);





Below I show where the other answers are flawed or unreliable.

Testing the regular expressions proposed in the other answers using the string 日本語:

   $str = '日本語';
   /*
   This is the regular expression that provides the greatest safety
   */
   print_r(preg_split("/(?<!^)(?!$)/u", $str));
   /**
   Returns:

   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */

   /*
   This expression appears in one of the other answers (currently marked as accepted),
   where it is used with preg_match_all
   */
   print_r(preg_split("/./u", $str));
   /*
   With preg_split every character is treated as a delimiter, so only the empty
   strings around them are returned, whether the text is Latin or kanji:

   Array
   (
       [0] => 
       [1] => 
       [2] => 
       [3] => 
   )
   */

   print_r(preg_split("//u", $str));
   /*
   This one does separate the characters, but returns empty entries at the beginning and at the end.

   Array
   (
       [0] => 
       [1] => 日
       [2] => 本
       [3] => 語
       [4] => 
   )

   If you want to use the expression "//u" without the empty entries, add these parameters:
   */
   print_r(preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY));
   /**
   Returns:

   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */





Optionally, to control the number of characters per chunk:

$str = '日本語';

$l = 1;
print_r(preg_split('/(.{'.$l.'})/us', $str, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE));
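As a rough illustration of the chunk-length variant, here is the same pattern with a length of 2, where the string length is not a multiple of the chunk size. The variable names are illustrative only.

```php
<?php
// Same pattern as above with a chunk length of 2; names are illustrative only.
$str = '日本語';
$length = 2;

// Each full chunk is matched as a delimiter and kept through
// PREG_SPLIT_DELIM_CAPTURE; empty pieces between chunks are dropped.
$chunks = preg_split(
    '/(.{' . $length . '})/us',
    $str,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
);

print_r($chunks);
```

The trailing character that does not fill a whole chunk is not matched by the pattern, so it comes back as the last ordinary piece: the result is 日本 followed by 語.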

Finally, a simple routine, just traversing each character of the string and populating an array:

$str = '日本語';
$arr = [];
$l = mb_strlen($str);
for ($i = 0; $i < $l; $i++) {
    $arr[] = mb_substr($str, $i, 1);
}
print_r($arr);
// Depending on the case, this can perform better than all the others.
// It is just a matter of knowing how and when to use the language's features.

Note: The above examples are for environments where the character set is correctly configured.
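One way to guard against a misconfigured character set is to check the input before splitting. With the u modifier, the preg_* functions return false when the subject is not valid UTF-8. The sketch below assumes the mbstring extension, and the function name split_utf8_or_null is hypothetical.

```php
<?php
// Hypothetical wrapper: rejects input that is not valid UTF-8 instead of
// letting preg_split() fail silently. Assumes the mbstring extension.
function split_utf8_or_null(string $str): ?array
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null;
    }
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? null : $parts;
}

var_dump(split_utf8_or_null('日本語'));     // array with three characters
var_dump(split_utf8_or_null("\xE7\xE3o")); // NULL: ISO-8859-1 bytes, not valid UTF-8
```

Returning null keeps a failed split distinguishable from an empty string, which yields an empty array.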

0

From php.net

<?php
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
?>
