Accented characters are considered as two characters

1

Accented characters are counted as two characters (I imagine strlen() is returning the number of bytes). How can I fix this?

$t = "á";
if(strlen($t) == 1){
    echo "ONE CHARACTER";
}
if(strlen($t) == 2){
    echo "TWO CHARACTERS";
}
if(strlen($t) == 3){
    echo "THREE CHARACTERS";
}

Another problem I'm facing is that $string{0} cannot handle accented characters.

$text = "á25";

echo $text{0}."<br>"; // returns �
echo $text{1}."<br>"; // returns �
echo $text{2}."<br>"; // returns 2
echo $text{3}."<br>"; // returns 5

And when the output is displayed as ISO-8859-1, the result is:

$text = "á25";

echo $text{0}."<br>"; // returns Ã
echo $text{1}."<br>"; // returns ¡
echo $text{2}."<br>"; // returns 2
echo $text{3}."<br>"; // returns 5
  • Related: https://answall.com/questions/84100/strtoupper-acentos/84104

  • Lucas, isn't this a new problem, different from the one reported earlier?

  • More or less; the problem still has to do with accented characters. If you think it's better, I can open a new question :D

3 answers

4


This depends on the encoding, as you noticed yourself. UTF-8, the most common encoding, uses from 1 byte (7 useful bits) to 4 bytes (21 useful bits) per character. ASCII uses only 7 bits, i.e. the most significant bit of each byte is always zero (0xxxxxxx).

Accented characters are outside ASCII; they do not exist in it. That is why other encodings exist to support accents. UTF-8 uses more than one byte for them, while ISO-8859-1, also known as Latin-1, still uses a single byte but makes use of all 8 bits.

When you write á you have to say which encoding it is in. In most cases that will be UTF-8, where it takes 2 bytes.
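To make the byte counts above concrete, here is a small sketch that prints how many bytes a few characters occupy in UTF-8. The sample characters and the labels are my own choice for illustration, and the strings are written as explicit byte escapes so the result does not depend on how the script file is saved.

```php
<?php
// Each value is spelled out as raw UTF-8 bytes (illustrative samples,
// not taken from the question except for U+00E1).
$samples = [
    'U+0061 (a)'       => "a",
    'U+00E1 (a-acute)' => "\xC3\xA1",
    'U+20AC (euro)'    => "\xE2\x82\xAC",
    'U+1F600 (emoji)'  => "\xF0\x9F\x98\x80",
];

$byteLengths = [];
foreach ($samples as $label => $char) {
    // strlen() counts bytes, so it shows the size of the UTF-8 sequence.
    $byteLengths[$label] = strlen($char);
    printf("%-18s bytes=%d hex=%s\n", $label, strlen($char), bin2hex($char));
}
```

The accented á from the question falls in the 2-byte group, which is why strlen() reports 2 for it.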


One solution is to use:

mb_strlen('á', 'UTF-8');
// = 1

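It is important to pass the second parameter explicitly. Without it, the result depends on the configured internal encoding (see mb_internal_encoding()), and in older PHP versions the behavior of the string functions could even be changed by mbstring.func_overload (deprecated in PHP 7.2 and removed in PHP 8.0).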

If you want to cut a section you can use:

mb_substr('á25', 0, 1, 'UTF-8');
// = á

If you want to create an array with multi-byte values:

preg_split('//u', 'á25', -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit (passing null is deprecated since PHP 8.1)
// = array(3) { [0]=> string(2) "á" [1]=> string(1) "2" [2]=> string(1) "5" }
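For the second part of the question (reading one character at a time), a minimal sketch along the same lines might look like this. It assumes the mbstring extension is available and that the text is UTF-8; the variable names are only illustrative. Note that the $text{0} syntax was deprecated in PHP 7.4 and removed in PHP 8.0, so the byte access below uses square brackets.

```php
<?php
// "á25" written as explicit UTF-8 bytes so the file encoding does not matter.
$text = "\xC3\xA1" . "25";

// Byte-wise access: index 0 is only the first half of "á".
$firstByte = $text[0];

// Character-wise access: ask mb_substr() for one character at a time.
$chars = [];
$length = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($text, $i, 1, 'UTF-8');
}

foreach ($chars as $i => $char) {
    echo $i, ': ', $char, "\n";
}
```

On PHP 7.4 or later, mb_str_split($text, 1, 'UTF-8') should give the same array in a single call.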

1

You are correct, strlen() returns the number of bytes. To return the number of characters, use mb_strlen() or iconv_strlen():

$t = "à";
print strlen($t); // 2
print mb_strlen($t); // 1
print iconv_strlen($t); // 1
  • I edited my question; could you expand your answer? :D

0

The function strlen() works well for single-byte encodings such as ISO-8859-1, because strlen() does not count characters; it counts bytes.

When the text contains accented characters in a multibyte encoding such as UTF-8, use mb_strlen(), which returns the number of characters.

The function mb_strlen() allows you to define a parameter called encoding.

$t = "á";
$tam = mb_strlen($t, 'UTF-8');
echo $tam; // result: 1
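As a rough illustration of the difference between the two encodings, the sketch below converts the same character to ISO-8859-1 and compares byte and character counts. It assumes the mbstring extension is loaded; the $results array and its keys are just names I made up for the comparison.

```php
<?php
// "á" as UTF-8 (two bytes), written as explicit byte escapes.
$utf8 = "\xC3\xA1";

// The same character converted to ISO-8859-1 is the single byte 0xE1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

$results = [
    'utf8_bytes'   => strlen($utf8),
    'utf8_chars'   => mb_strlen($utf8, 'UTF-8'),
    'latin1_bytes' => strlen($latin1),
    'latin1_chars' => mb_strlen($latin1, 'ISO-8859-1'),
];

print_r($results);
```

Only the UTF-8 byte count differs from the character count, which is the mismatch described in the question.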

Accented letters are a common source of problems; see this article: use accents in a text message

  • What was wrong here to deserve a -1? 0-0
