Question about the PHP function mb_strlen

Question

Asked 7 years, 6 months ago

I understand how mb_strlen works, but there is one example I don't understand:

<?php mb_strlen($string, '8bit'); ?>

What does that '8bit' mean?

3 answers

8bit is one of the character encodings supported by the Multibyte String functions (mb_[function]).

Basically, the encoding tells the Multibyte functions how the bytes of the string should be interpreted so that the operation is carried out correctly.

For example, if you run the code below you will get the following output:

<?php
    $string = 'ὼ'; // Any special character

    echo strlen($string);             // Output: 3
    echo mb_strlen($string, '8bit');  // Output: 3
    echo mb_strlen($string, 'UTF-8'); // Output: 1 - CORRECT!

In conclusion, strlen() works well for ASCII characters, while for multibyte text the 8bit encoding returns the number of bytes rather than the number of characters that UTF-8 gives. UTF-8 (Unicode) is the most widely used encoding and is the one recommended by the W3C.
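To make the difference between bytes and characters concrete, here is a minimal sketch that counts a few sample strings both ways. It assumes the mbstring extension is loaded; the `$samples` and `$counts` names are only for illustration, and the strings are written as byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
$samples = [
    'U+0061'  => "a",                 // 1 byte in UTF-8
    'U+00E9'  => "\xC3\xA9",          // 2 bytes in UTF-8
    'U+1F7C'  => "\xE1\xBD\xBC",      // 3 bytes in UTF-8 (the character from the example)
    'U+1F600' => "\xF0\x9F\x98\x80",  // 4 bytes in UTF-8
];

$counts = [];
foreach ($samples as $label => $bytes) {
    $counts[$label] = [
        'bytes' => mb_strlen($bytes, '8bit'),   // raw byte count
        'chars' => mb_strlen($bytes, 'UTF-8'),  // character count
    ];
    printf(
        "%-8s hex=%-8s bytes=%d chars=%d\n",
        $label,
        bin2hex($bytes),
        $counts[$label]['bytes'],
        $counts[$label]['chars']
    );
}
```

Every sample is a single character, so only the byte count changes from row to row.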

To know what is the default encoding set in your project, you can run:

<?php
    echo mb_internal_encoding(); // Returned here: UTF-8

Or, to set the internal encoding to UTF-8:

<?php
    mb_internal_encoding('UTF-8');
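The internal encoding matters because mb_strlen() falls back to it when the second argument is omitted. A small sketch of that effect, assuming mbstring is loaded (the variable names are illustrative):

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
$omega = "\xE1\xBD\xBC";              // UTF-8 bytes of U+1F7C

$previous = mb_internal_encoding();   // remember the current setting

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($omega);          // no second argument: internal encoding is used

mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($omega);        // single-byte encoding: one character per byte

mb_internal_encoding($previous);      // restore the original setting

echo $asUtf8, ' ', $asLatin1, "\n";   // 1 3
```

Passing the encoding explicitly, as in the question's example, avoids depending on this global setting.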

The PHP manual's mbstring documentation has the list of supported encodings.

"The strlen() function works well for ASCII table characters": no, its behavior can be changed by function overloading.

– Inkeliz

2018/04/07 at 11:46
Yes, but the overloading feature internally uses the Multibyte String functions: "For example, mb_substr() is called instead of substr() if function overloading is enabled." - http://php.net/manual/en/mbstring.overload.php

– user98628

2018/04/07 at 13:42
But you are still calling the same function (you still write strlen()). Its behavior changes and, worse, it changes through a natively available configuration option. So there is no guarantee that strlen() is equivalent to mb_strlen(..., '8bit') unless you know the overload setting. You don't always have control over the environment. If you are developing open source software, for example, you cannot be sure how the environment is configured, and a native PHP option can completely change the behavior of strlen().

– Inkeliz

2018/04/07 at 21:34

by Inkeliz • **20,671** points · Answer 1 · 2018-04-07T11:46:00+00:00

Summary: strlen is unreliable, but using mb_strlen(..., '8bit') is not always possible.

The question is interesting because 8bit is not commonly used, as the other answers point out. But I think @Paul Imon's answer is misleading in several cases. There is nothing wrong with mb_strlen('ὼ', '8bit') returning 3: that call simply ignores the encoding the text was written in, and 3 is the correct answer for 8bit.

Imagine, for example, that you have the following two bytes:

0xDF     0xBF
11011111 10111111

These are two arbitrary bytes, which may (or may not) have been generated uniformly at random. If you only care about bytes, the encoding does not matter. UTF-8, however, has a kind of "signaling" for the bytes that follow: the first byte indicates how many bytes the sequence has, so the whole sequence can be treated as a single character.

In UTF-8, a single-byte character is always ASCII (0xxxxxxx). In a two-byte sequence the first byte must have the form 110xxxxx, and every byte that is not the first must have the form 10xxxxxx.
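The bit patterns above can be checked with a few masks. The sketch below uses two hypothetical helpers, `utf8SequenceLength()` and `isContinuationByte()`, which are not part of PHP; the three- and four-byte patterns (1110xxxx and 11110xxx) are added from the UTF-8 definition. It only looks at the bit pattern and does not reject lead bytes that are otherwise forbidden in UTF-8.

```php
<?php
/**
 * Hypothetical helper: sequence length announced by the first byte of a
 * UTF-8 sequence. Returns 0 when the byte cannot start a sequence.
 */
function utf8SequenceLength(int $firstByte): int
{
    if (($firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx (ASCII)
    if (($firstByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($firstByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($firstByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;                                   // 10xxxxxx or other pattern
}

/** Hypothetical helper: true for bytes of the form 10xxxxxx. */
function isContinuationByte(int $byte): bool
{
    return ($byte & 0xC0) === 0x80;
}

echo utf8SequenceLength(0xDF), "\n"; // 2
var_dump(isContinuationByte(0xBF));  // bool(true)
```

For the pair 0xDF 0xBF this confirms the reading used in this answer: the first byte announces a two-byte sequence and the second is a continuation byte.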

As UTF-8, this byte pair decodes to the code point U+07FF, which was not an assigned character when this answer was written; try:

echo "\xDF\xBF"; //= ߿

But its encoding indicates a two-byte sequence, so run:

echo mb_strlen("\xDF\xBF", 'UTF-8'); //= 1

This returns 1, even though the character does not exist. The same two bytes do form a character in UTF-16LE, however, where they represent 뿟:

echo iconv('UTF-16LE', 'UTF-8', "\xDF\xBF"); //= 뿟

Meanwhile, using 8bit returns 2, because there are 2 bytes. I don't think "wrong" is the best word here, because all of these results are correct, depending on where you apply them.

8bit treats each byte individually, regardless of the encoding. It simply counts each byte as one unit, even for values outside ASCII such as 0xFF.
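The three readings of the same two bytes can be put side by side. This is a minimal sketch assuming mbstring is loaded; `$lengths` and `$raw` are illustrative names.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
$bytes = "\xDF\xBF";

$lengths = [
    'UTF-8'    => mb_strlen($bytes, 'UTF-8'),    // one two-byte sequence
    'UTF-16LE' => mb_strlen($bytes, 'UTF-16LE'), // one 16-bit code unit
    '8bit'     => mb_strlen($bytes, '8bit'),     // two raw bytes
];
print_r($lengths);

// 8bit never inspects the values, so bytes outside ASCII are counted the same way.
$raw = "\xFF\x00\x80";
echo mb_strlen($raw, '8bit'), "\n";                // 3
echo bin2hex(mb_substr($raw, 1, 2, '8bit')), "\n"; // 0080
```

The last line shows that the other mb_* functions also work on raw bytes when given 8bit: mb_substr() slices by byte offset.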

mb_strlen(..., '8bit') should be used to avoid problems with mbstring.func_overload, which was deprecated in PHP 7.2 (and later removed in PHP 8.0). The problem does not arise if the Multibyte String extension is not installed.

This is where @Paul Imon's answer goes wrong again. A native language feature, configured in php.ini, changes the behavior of strlen() entirely:

mbstring.func_overload = 2

Testing:

echo strlen("\xDF\xBF");  //= 1

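As you can see, with mbstring.func_overload = 2 the behavior of strlen is no longer the same as mb_strlen(..., '8bit').

In summary, if you want to work with bytes:

$texto = "\xDF\xBF";

// MB_OVERLOAD_STRING is only defined where mbstring supports function overloading.
$overloaded = extension_loaded('mbstring')
    && defined('MB_OVERLOAD_STRING')
    && ((int) ini_get('mbstring.func_overload') & MB_OVERLOAD_STRING);

if ($overloaded) {
    // strlen() has been replaced, so ask mbstring for the raw byte count.
    echo mb_strlen($texto, '8bit');
} else {
    echo strlen($texto);
}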

This uses strlen by default, but if overloading is enabled it uses mb_strlen instead, so that we never call the modified strlen. Note that not everyone has mbstring installed, so using mb_strlen(..., '8bit') by default is not always possible. If you are sure that mbstring is installed, you can simply use mb_strlen(..., '8bit') everywhere. ;)
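If you need this in more than one place, the check can be wrapped in a function. The sketch below is a simplified variation rather than the exact logic above: it assumes you are happy to prefer mb_strlen(..., '8bit') whenever mbstring is available, whether or not overloading is on. `byteLength()` is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: length of $data in bytes, independent of
 * mbstring.func_overload and of the internal encoding.
 */
function byteLength(string $data): int
{
    // With mbstring available, '8bit' always counts raw bytes.
    if (function_exists('mb_strlen')) {
        return mb_strlen($data, '8bit');
    }

    // Without mbstring there is nothing that could overload strlen().
    return strlen($data);
}

echo byteLength("\xDF\xBF"), "\n"; // 2
```

Checking function_exists() once per call keeps the helper usable in environments where you do not control php.ini or the installed extensions.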

by Phelipe • **1,541** points · Answer 2 · 2018-02-02T11:43:16+00:00

The second parameter is the character encoding you are using. Most likely you will want this parameter set to UTF-8.

If you want to understand the function better, I suggest taking a look at the mb_strlen reference in the PHP manual.