How do I get a string's length correctly with UTF-8?


4

I'm running some tests and noticed that when a string contains special characters, each of them counts as more than one character in substr.

Example:

$string = "PAÇOCA";

echo strlen($string);
echo substr($string, 0, 3);

It should print PAÇ, but it only prints PA. If I increase the length from 3 to 4, it prints PAÇ, and if I replace the Ç with a C, it counts correctly. So, as far as I understand, PHP is treating the Ç as if it were two characters. How can I count the characters correctly?

I have already tried the mbstring functions as well, and a header with UTF-8.

  • 1

    It worked for me: http://ideone.com/PRGNGR

  • It counts the same way here. Convert it like this: $string = "PAÇOCA"; $a = utf8_decode($string); echo utf8_encode(substr($a, 0, 3));

  • The mbstring functions are sufficient, as long as they are set to UTF-8

  • 1

    This looks like a duplicate

  • strlen() returns the number of bytes in the string (which matches the number of characters only if you are lucky); mb_strlen() returns the number of characters in the string.

  • Related or duplicate: http://answall.com/questions/78308/por-que-deveria-utilizar-fun%C3%A7%C3%B5es-que-come%C3%A7am-by-Mb

  • @Wallacemaxters I think it is more of a related question: in this case he tried to use mb_ and it did not help, and that question does not focus on configuration. But it's a good supplementary reference.

  • @Bacco true. But maybe that was because he had to pass the 4th parameter, right? Hahaha.

  • Now I'm thinking: I use substr in 80% of my project, so I'll have to change everything. :O

  • 1

    @Gabrielrodrigues if your encoding is UTF-8, yes. If it's ISO, you can keep substr. And for byte-level operations (for example, extracting encoded or binary data), always use the functions without mb_. Also look at the link I posted about the ini settings; it's better than changing internal_encoding at runtime.
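
To illustrate the point about byte-level operations, here is a minimal sketch (not taken from the thread). It assumes the mbstring extension is loaded, and the variable names are only illustrative. The text is written with escape sequences so the result does not depend on how the file was saved.

```php
<?php
// UTF-8 text: Ç is the two bytes C3 87.
$text = "PA\xC3\x87OCA";

// Sizes for storage or transport are measured in bytes: plain strlen().
$byteLength = strlen($text);              // 7

// What a reader would count is characters: mb_strlen().
$charLength = mb_strlen($text, 'UTF-8');  // 6

// Binary data has no character encoding, so the plain functions are the
// right tool. This is the 8-byte PNG file signature.
$binary = "\x89PNG\r\n\x1a\n";
$magic  = substr($binary, 1, 3);          // "PNG"

echo $byteLength, " ", $charLength, " ", $magic, "\n";
```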


3 answers

6


The mb_ functions are sufficient, but they need to be configured for the correct encoding:

mb_internal_encoding('UTF-8');

Then the result is

$string = "PAÇOCA";

echo mb_strlen($string);            // 6
echo mb_substr($string, 0, 3);      // PAÇ

However, your code must also have been saved as UTF-8 in the editor/IDE!
After all, you are providing a literal value in the source, and that is not affected by PHP's own settings.

Take care not to change other settings unnecessarily, to avoid confusion. Ideally, set everything in php.ini if possible, rather than at runtime.
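
A small sketch of why the file encoding matters, assuming the mbstring extension is loaded. The Ç is written as its UTF-8 bytes (C3 87) with escape sequences, so the example behaves the same however the file itself is saved, and the output is shown in hex to make the cut visible.

```php
<?php
// "\xC3\x87" is the UTF-8 byte sequence for Ç.
$string = "PA\xC3\x87OCA";

echo strlen($string), "\n";                             // 7 (bytes)
echo mb_strlen($string, 'UTF-8'), "\n";                 // 6 (characters)

// substr() counts bytes, so 3 bytes ends in the middle of Ç.
echo bin2hex(substr($string, 0, 3)), "\n";              // 5041c3

// mb_substr() counts characters, so 3 characters is PAÇ (4 bytes).
echo bin2hex(mb_substr($string, 0, 3, 'UTF-8')), "\n";  // 5041c387
```

The half character left by substr() is the reason only PA shows up in the question: the lone C3 byte is not valid UTF-8 on its own.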


Manual:

http://php.net/manual/en/function.mb-internal-encoding.php

Configuring in php.ini

http://php.net/manual/en/mbstring.configuration.php
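
As a hedged sketch of the php.ini route: on PHP 5.6 and later, mbstring.internal_encoding is deprecated and mbstring falls back to default_charset, so a single directive is usually enough. Check the manual page above for the version you are running.

```ini
; php.ini (sketch; adjust to your installation)
default_charset = "UTF-8"
```

Calling mb_internal_encoding() with no arguments returns the encoding currently in effect, which is a quick way to confirm the setting was picked up.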

5

I suggest updating the settings. It would look like this:

    setlocale(LC_ALL,'pt_BR.UTF8');
    mb_internal_encoding('UTF8'); 
    mb_regex_encoding('UTF8');

    $string = "PAÇOCA";
    echo strlen($string);
    echo '<br>';
    echo mb_substr($string, 0, 3);

2

Adding a small detail.

In the documentation for substr, Andreas Bur says:

For UTF-8 character subsequence, I recommend mb_substr

Example:

<?php
 $string = "PAÇOCA";

 echo strlen($string);
 echo mb_substr($string, 0, 3, 'UTF-8');
?>
  • 1

    This syntax is also correct. It is worth noting that this 4th parameter is only recommended for exceptional cases, for example when you need an encoding from a source other than your default. That way you don't have to pass it in every mb_ call in the code, and if the encoding changes one day, you only have to change internal_encoding instead of changing everything again.
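
A minimal sketch of that split between the default and the exception, assuming the mbstring extension is loaded. The ISO-8859-1 input is a hypothetical legacy value, and the variable names are only illustrative.

```php
<?php
mb_internal_encoding('UTF-8');    // project-wide default, set once

$utf8   = "PA\xC3\x87OCA";        // Ç as UTF-8 bytes (C3 87)
$latin1 = "PA\xC7OCA";            // Ç as one ISO-8859-1 byte (C7); hypothetical legacy input

// Default case: no encoding argument, the internal encoding is used.
$a = mb_substr($utf8, 0, 3);

// Exception: data known to be in another encoding gets the explicit argument.
$b = mb_substr($latin1, 0, 3, 'ISO-8859-1');

// Alternative: convert at the boundary so the rest of the code stays on the default.
$c = mb_substr(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'), 0, 3);

echo bin2hex($a), "\n";  // 5041c387
echo bin2hex($b), "\n";  // 5041c7
echo bin2hex($c), "\n";  // 5041c387
```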
