How do I get a string's length correctly using UTF-8?

Question

Asked 9 years, 1 month ago

Viewed 605 times

4

I'm running some tests and noticed that when the string contains special characters, each of them is counted as more than one character by substr.

Example:

$string = "PAÇOCA";

echo strlen($string);
echo substr($string, 0, 3);

This should print PAÇ, but it only prints PA. If I increase the length from 3 to 4, the Ç is printed, and if I replace the Ç with a C, it counts correctly. So, from what I understand, it is treating the Ç as if it were two characters. How can I count the characters correctly?

I have already tried using the mbstring functions as well, and a header with UTF-8.

1

It worked for me: http://ideone.com/PRGNGR

– Maniero

2016/07/08 at 18:12
It counts the same way here. Convert it like this: $string = "PAÇOCA"; $a = utf8_decode($string); echo utf8_encode(substr($a, 0, 3));

– denis

2016/07/08 at 18:21
The mbstring functions are sufficient, as long as they are configured for UTF-8.

– Bacco

2016/07/08 at 18:25
1

This looks like a duplicate.

– Wallace Maxters

2016/07/08 at 18:39
strlen() returns the number of bytes in the string (which matches the number of characters only if you are lucky); mb_strlen() returns the number of characters in the string.

– rray

2016/07/08 at 18:40
Related or duplicate: http://answall.com/questions/78308/por-que-deveria-utilizar-fun%C3%A7%C3%B5es-que-come%C3%A7am-by-Mb

– Wallace Maxters

2016/07/08 at 18:40
@Wallacemaxters I think it is related rather than a duplicate: in this case he tried to use mb_ and it did not help, and that question does not focus on configuration. But it's a good supplementary reference.

– Bacco

2016/07/08 at 18:43
@Bacco that's true. But maybe it's because he had to pass the fourth parameter, right? Hahaha.

– Wallace Maxters

2016/07/08 at 18:44
Now I'm thinking: I use substr in 80% of my project, so I'll have to modify everything. :O

– Gabriel Rodrigues

2016/07/08 at 18:46
1

@Gabrielrodrigues if your encoding is UTF-8, yes. If it's ISO, you can keep substr. And for operations that work on bytes, always use the functions without mb_ (for example, when extracting encoded or binary data). Also, look at the link I posted about the ini settings; that is better than changing internal_encoding at runtime.

– Bacco

2016/07/08 at 18:48
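
To illustrate the byte-operation case mentioned in the comment above, here is a minimal sketch. The choice of a raw SHA-256 digest is only an assumed example of binary data; it is not from the original discussion.

```php
<?php
// Raw SHA-256 digest: 32 arbitrary bytes, not text in any encoding.
$digest = hash('sha256', 'PAÇOCA', true);

// Byte operation: keep the first 8 bytes.
// substr()/strlen() work on bytes, which is what we want here.
$short = substr($digest, 0, 8);

echo strlen($digest), PHP_EOL;  // 32
echo strlen($short), PHP_EOL;   // 8
echo bin2hex($short), PHP_EOL;  // 16 hexadecimal digits
```

Using mb_substr here would try to interpret the bytes as characters in the internal encoding, which is not meaningful for binary data.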

3 answers

6

The mb_ functions are sufficient, but you need to configure them for the correct encoding:

mb_internal_encoding('UTF-8');

Then the result is:

$string = "PAÇOCA";

echo mb_strlen($string);            // 6
echo mb_substr($string, 0, 3);      // PAÇ

Just make sure your source file has also been saved as UTF-8 in the editor/IDE!
After all, you are providing a literal value in the source code, and that is not affected by PHP's own settings.
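
A minimal sketch for checking this, assuming the file itself is saved as UTF-8 (in that case Ç is stored as the two bytes C3 87, which is why strlen and mb_strlen disagree):

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = "PAÇOCA";

// Is the literal valid UTF-8? If the file was saved as ISO-8859-1,
// Ç would be the lone byte C7 and this check would fail.
var_dump(mb_check_encoding($string, 'UTF-8'));

// Byte view of the string: 5041c3874f4341 for a UTF-8 file.
echo bin2hex($string), PHP_EOL;

echo strlen($string), PHP_EOL;             // 7 (bytes)
echo mb_strlen($string, 'UTF-8'), PHP_EOL; // 6 (characters)
```

If the hex dump shows c7 instead of c387, the editor saved the file in a single-byte encoding and the mb_ settings alone will not fix the output.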

Take care not to change other settings unnecessarily, to avoid confusion. Ideally, configure everything in php.ini, if possible, rather than at runtime.

Manual:

http://php.net/manual/en/function.mb-internal-encoding.php

Configuring in php.ini:

http://php.net/manual/en/mbstring.configuration.php
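
To see which settings are actually in effect, something like the sketch below can help. As far as I know, from PHP 5.6 onward the mbstring.internal_encoding directive is deprecated in favor of default_charset, so that is the php.ini value inspected here. The runtime fallback is only an assumption for environments where php.ini cannot be edited.

```php
<?php
// Show the encodings currently in effect.
echo 'default_charset: ', ini_get('default_charset'), PHP_EOL;
echo 'mb_internal_encoding: ', mb_internal_encoding(), PHP_EOL;

// Fallback only: prefer fixing php.ini over setting this at runtime.
if (strcasecmp(mb_internal_encoding(), 'UTF-8') !== 0) {
    mb_internal_encoding('UTF-8');
}
```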

That's it, thanks!

– Gabriel Rodrigues

2016/07/08 at 18:38

by Ricardo Mota • **2,905** points · Answer 1 · 2016-07-08T18:24:52+00:00

I suggest updating the settings. It would look like this:

    setlocale(LC_ALL,'pt_BR.UTF8');
    mb_internal_encoding('UTF8'); 
    mb_regex_encoding('UTF8');

    $string = "PAÇOCA";
    echo strlen($string);
    echo '<br>';
    echo mb_substr($string, 0, 3);

by Gabriel Rodrigues • **15,969** points · Answer 2 · 2016-07-08T18:37:53+00:00

2

Adding a small detail.

In the documentation for substr, Andreas Bur says:

For getting a substring of UTF-8 characters, I recommend mb_substr

Example:

<?php
 $string = "PAÇOCA";

 echo strlen($string);
 echo mb_substr($string, 0, 3, 'UTF-8');
?>

1

This syntax is also correct. It is worth noting that this encoding parameter is only recommended for exceptions, for example a case where you need the encoding of a source that differs from your default. That way you don't have to pass it in every mb_ call in the code, and if the encoding changes one day, you don't have to change everything again, only internal_encoding.

– Bacco

2016/07/08 at 18:41
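
A minimal sketch of that exception case, assuming a hypothetical legacy input encoded as ISO-8859-1 while the project default stays UTF-8. The variable names and the sample bytes are illustrative only.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical legacy input: "PAÇOCA" in ISO-8859-1, where Ç is the single byte 0xC7.
$legacy = "PA\xC7OCA";

// Exception: pass the source encoding explicitly for this one call only.
$prefix = mb_substr($legacy, 0, 3, 'ISO-8859-1'); // still ISO-8859-1 bytes

// Convert to the project default before mixing it with other strings.
$utf8 = mb_convert_encoding($prefix, 'UTF-8', 'ISO-8859-1');

echo $utf8, PHP_EOL;            // PAÇ
echo mb_strlen($utf8), PHP_EOL; // 3, using the internal encoding
```

Every other mb_ call in the project can keep relying on internal_encoding, so only this boundary code knows about the legacy encoding.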