str_split does not work well with strings containing UTF-8?


Asked 9 years, 6 months ago

Viewed 631 times

8

I want to iterate over a string with foreach. For that, I learned that I must use the function str_split, which splits the string into an array with one character per element. However, this does not work as expected with strings that contain accented characters, for example (multibyte UTF-8 characters).

Example:

str_split('coração da programação');

The result for this is:

Array
(
    [0] => c
    [1] => o
    [2] => r
    [3] => a
    [4] => �
    [5] => �
    [6] => �
    [7] => �
    [8] => o
    [9] =>  
    [10] => d
    [11] => a
    [12] =>  
    [13] => p
    [14] => r
    [15] => o
    [16] => g
    [17] => r
    [18] => a
    [19] => m
    [20] => a
    [21] => �
    [22] => �
    [23] => �
    [24] => �
    [25] => o
)

How do I split a string the same way str_split does, but keeping the UTF-8 characters intact?

5 answers

8

Since some PHP functions do not support multibyte characters, the solution is regex O.o, because the PCRE library does support them.

You can use the dot metacharacter (.) to break the string into an array and get the same result as str_split(). Keep in mind that this requires the PCRE u modifier.

$str = 'ação';
preg_match_all('/./u', $str, $arr);

echo "<pre>";
print_r($arr);

Output:

Array
(
    [0] => Array
        (
            [0] => a
            [1] => ç
            [2] => ã
            [3] => o
        )

)
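
One caveat worth checking: by default the dot does not match a newline, so a string with line breaks would lose those characters. A minimal sketch, assuming a sample string that is not from the question and a source file saved as UTF-8; adding the s modifier makes the dot match newlines too:

```php
<?php
// Assumed sample input containing a line break (not from the original question).
$str = "ação\nfim";

// Without "s", the dot skips the newline, so that character is dropped.
preg_match_all('/./u', $str, $withoutS);

// With "s", the dot also matches the newline, so nothing is lost.
preg_match_all('/./us', $str, $withS);

echo count($withoutS[0]), "\n"; // 7
echo count($withS[0]), "\n";    // 8
```

If the text never contains line breaks, the original /./u pattern is enough.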

Interesting. I had thought about using preg_split with the ., but it leaves a lot of empty elements, so you have to use PREG_SPLIT_NO_EMPTY. preg_match doesn't have that problem.

– Wallace Maxters

2016/02/01 at 19:28
@WallaceMaxters, where is the string lib with Unicode support? Didn't they want stricter typing in the language :P, why not solve more serious problems before adding new super features.

– rray

2016/02/01 at 19:33
You meant "where is". The only UTF-8-related thing I saw implemented is the one for special characters, like: "My name x66 Wallace".

– Wallace Maxters

2016/02/01 at 19:45


by BrunoRB • **5,526** points · Answer 1 · 2016-02-01T19:09:58+00:00

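As already mentioned, most of the standard PHP functions do not support multibyte strings, and for these cases the ideal is to use the multibyte string (mb_*) functions. For the specific case in your question the suggestion is mb_split, though note that it splits on a regular-expression pattern rather than into individual characters; on PHP 7.4 and later, mb_str_split is the direct multibyte counterpart of str_split.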
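
A minimal sketch of the multibyte route, assuming PHP 7.4 or later with the mbstring extension enabled (mb_str_split does not exist in older versions) and a source file saved as UTF-8:

```php
<?php
// mb_str_split() is available from PHP 7.4 onward and needs ext-mbstring.
// Arguments: the string, the chunk length in characters, the encoding.
$chars = mb_str_split('coração da programação', 1, 'UTF-8');

// Each element is one complete character, so foreach works as the question wanted.
foreach ($chars as $char) {
    echo $char, "\n";
}
```

Passing the encoding explicitly avoids depending on the configured internal encoding.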

by Guilherme Lautert • **15,097** points · Answer 2 · 2016-02-01T19:03:56+00:00

PHP's standard string functions are not Unicode-aware; however, you can force Unicode handling through a regular expression.

preg_split('//u', 'coração da programação');

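u is the PCRE modifier that treats the pattern and the subject as UTF-8. Note that this call returns an empty string at the start and at the end of the array unless PREG_SPLIT_NO_EMPTY is passed.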

by Daniel Omine • **19,666** points · Answer 3 · 2016-02-02T10:56:31+00:00

You can do this with the preg_split() function.

A regular expression that provides greater compatibility is /(?<!^)(?!$)/u:

   $str = 'coração da programação';
   preg_split("/(?<!^)(?!$)/u", $str);
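
One edge case to be aware of with this pattern: without the D modifier, $ also matches just before a newline at the very end of the subject, so the last character and a trailing newline are not separated. A small sketch, assuming a sample string that is not part of the answer:

```php
<?php
// Assumed sample input ending in a newline (not from the original answer).
$str = "ção\n";

// "$" also matches just before a trailing newline, so "o" and "\n" stay joined.
$default = preg_split('/(?<!^)(?!$)/u', $str);

// With "D" (PCRE_DOLLAR_ENDONLY), "$" matches only at the very end of the subject.
$dollarEndOnly = preg_split('/(?<!^)(?!$)/uD', $str);

var_dump(count($default), count($dollarEndOnly));
```

For strings that never end in a newline, both patterns behave the same.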

I will show how the other answers are flawed or unreliable in terms of functionality.

Testing the regular expressions proposed in the other answers with the string 日本語:

   $str = '日本語';
   /*
   This is the regular expression that provides the greatest reliability
   */
   print_r(preg_split("/(?<!^)(?!$)/u", $str));
   /**
   Output:

   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */

   /*
   This expression appears in one of the answers (currently marked as accepted),
   where it is used with preg_match_all
   */
   print_r(preg_split("/./u", $str));
   /*
   With preg_split it does not return the characters: each one is consumed as a
   delimiter, leaving only empty strings (the same happens with Latin text, not only kanji)

   Array
   (
       [0] => 
       [1] => 
       [2] => 
       [3] => 
   )
   */

   print_r(preg_split("//u", $str));
   /*
   This one does separate the characters, but it returns empty elements at the start and at the end.

   Array
   (
       [0] => 
       [1] => 日
       [2] => 本
       [3] => 語
       [4] => 
   )

   If you want to use the expression "//u" without the empty elements, add a few parameters:
   */
   print_r(preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY));
   /**
   Output:
   Array
   (
       [0] => 日
       [1] => 本
       [2] => 語
   )
   */

Optionally, to control the number of characters per chunk:

$str = '日本語';

$l = 1;
print_r(preg_split('/(.{'.$l.'})/us', $str, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE));

Finally, a simple routine that just walks through each character of the string and populates an array:

$str = '日本語';
$arr = []; // initialize the array before appending to it
$l = mb_strlen($str);
for ($i = 0; $i < $l; $i++) {
    $arr[] = mb_substr($str, $i, 1);
}
print_r($arr);
// Depending on the case, this can perform better than all the others.
// It is just a matter of knowing how and when to use the language's features.

Note: The above examples are for environments where the character set is correctly configured.
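
On that note, when the subject is not valid UTF-8, preg_split with the u modifier returns false instead of an array, so it can be worth checking the result before iterating. A minimal sketch; the helper name split_utf8_or_null is hypothetical and not part of PHP or of any answer here:

```php
<?php
// Hypothetical helper (the name is an assumption, not an existing PHP function).
// Returns one character per element, or null when the input is not valid UTF-8.
function split_utf8_or_null(string $str): ?array
{
    $parts = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // With the "u" modifier, preg_split() returns false for malformed UTF-8.
    if ($parts === false) {
        return null;
    }
    return $parts;
}

$valid = split_utf8_or_null('日本語');

// "\xE7" is "ç" in ISO-8859-1, but on its own it is not a valid UTF-8 sequence.
$invalid = split_utf8_or_null("\xE7");

var_dump($valid !== null, $invalid === null);
```

If the input may arrive in another encoding, converting it to UTF-8 first (for example with mb_convert_encoding) is one option, assuming the source encoding is known.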

by Ivan Nack • **478** points · Answer 4 · 2016-02-01T18:56:01+00:00

From php.net

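<?php
// Splits a UTF-8 string into chunks of $l characters.
// With $l = 0 (the default) it returns one character per element.
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        // Step through the string $l characters at a time.
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    // No chunk size given: split between every character, dropping empty elements.
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
?>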