Question about the PHP function mb_strlen

Viewed 592 times

1

I understand how mb_strlen works, but I didn’t understand this example:

<?php mb_strlen($string, '8bit'); ?>

What does that '8bit' mean?

3 answers

4


8bit is one of the internal character encodings supported by the Multibyte String functions - mb_[function].

Basically, this encoding tells the Multibyte functions how the string should be interpreted so that it is processed correctly.

For example, if you run the code below you will get the following outputs:

<?php
    $string = 'ὼ'; // Any special character

    echo strlen($string);             // Output: 3
    echo mb_strlen($string, '8bit');  // Output: 3
    echo mb_strlen($string, 'UTF-8'); // Output: 1 - CORRECT!

In conclusion, strlen() works well for ASCII characters, and the 8bit encoding returns an incorrect count relative to UTF-8. The UTF-8 standard (Unicode) is the most efficient and is the one recommended by W3.org.

To find out which default encoding is set in your project, you can run:

<?php
    echo mb_internal_encoding(); // Returned here: UTF-8

Or, to set the internal encoding to UTF-8:

<?php
    mb_internal_encoding('UTF-8');

Here you can see the list of supported encodings.
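The list can also be inspected from code. As a minimal sketch, assuming the mbstring extension is loaded (the variable names are only illustrative), mb_list_encodings() returns the encodings your installation supports, so you can check for 8bit directly:

```php
<?php
// Assumes the mbstring extension is loaded.
$encodings = mb_list_encodings();

// Check whether '8bit' is among the supported encodings.
$has8bit = in_array('8bit', $encodings, true);

echo $has8bit ? "8bit is supported\n" : "8bit is not supported\n";
```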

  • "the strlen() function works well for ASCII table characters": no, its behavior can be modified by using the overload.

  • Yes, but the overloading feature internally uses the Multibyte String functions: "For example, mb_substr() is called instead of substr() if function overloading is enabled." - http://php.net/manual/en/mbstring.overload.php

  • But you are still calling the same function (you will use the same strlen()). The behavior of the function is changed and, worse than that, it is changed by a natively available configuration. So, without knowing whether overloading is on, there is no guarantee that using strlen() is equivalent to using mb_strlen(..., '8bit'). You don’t always have control over the environment. If you are developing open source software, for example, you cannot be sure how the environment is configured, and a native PHP option can entirely modify the behavior of strlen().

1

Summary: strlen is unreliable, but using mb_strlen(..., '8bit') is not always possible.


The question is interesting, because 8bit is not very common, as stated in the other answers. But I think @Paul Imon's answer is misleading in several cases. There’s nothing wrong with mb_strlen('ὼ', '8bit') resulting in 3; you are just ignoring the encoding used, and this result is correct for 8bit.

Imagine that, for example, you have the following two bytes:

0xDF     0xBF
11011111 10111111

These are two arbitrary bytes, which may (or may not) have been generated at random. If you are interested in bytes, the encoding does not matter. UTF-8 has a kind of "signaling" for the bytes that follow: the first byte indicates how many bytes the sequence has, so the whole sequence can be treated as a single character.

In UTF-8, for example, a single-byte character is always ASCII (0xxxxxxx); in a two-byte sequence the first byte must be (110xxxxx), and every byte that is not the first must be (10xxxxxx).
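The signaling described above can be sketched in a few lines. This is only an illustration, not how mbstring is implemented: utf8SequenceLength() is a hypothetical helper, and the three- and four-byte patterns (1110xxxx and 11110xxx) come from the UTF-8 definition rather than from the text above.

```php
<?php
/**
 * Hypothetical helper: expected length of a UTF-8 sequence, judged only
 * from the bit pattern of its first byte. Returns 0 for a byte that does
 * not match any lead-byte pattern (for example a continuation byte).
 * It does not reject overlong or out-of-range lead bytes.
 */
function utf8SequenceLength(int $firstByte): int
{
    if (($firstByte & 0x80) === 0x00) { return 1; } // 0xxxxxxx: ASCII
    if (($firstByte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($firstByte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($firstByte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;                                       // 10xxxxxx or 11111xxx
}

echo utf8SequenceLength(0x41), "\n"; // 'A', plain ASCII
echo utf8SequenceLength(0xDF), "\n"; // first byte of the pair shown earlier
echo utf8SequenceLength(0xBF), "\n"; // second byte: a continuation byte
```

For the pair 0xDF 0xBF, the first byte announces a two-byte sequence and the second matches the continuation pattern, which is why the pair is counted as one character under UTF-8.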

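This character DOES NOT EXIST in UTF-8 (the bytes form a well-formed two-byte sequence for code point U+07FF, but no character was assigned to that code point when this was written), try: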

echo "\xDF\xBF"; //= ߿

But its encoding indicates that it has two bytes, so run:

echo mb_strlen("\xDF\xBF", 'UTF-8'); //= 1

It returns 1, even though the character doesn’t exist. The same bytes, however, do form an existing character in UTF-16LE:

echo iconv('UTF-16LE', 'UTF-8', "\xDF\xBF"); //= 뿟

Meanwhile, using 8bit results in 2; after all, there are 2 bytes. I believe that "wrong" is not the word that best describes this, because all of the results are correct, depending on where you apply them, of course.
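To see the three interpretations side by side, here is a small sketch, assuming the mbstring extension is loaded (the variable names are only illustrative):

```php
<?php
// Assumes the mbstring extension is loaded.
$bytes = "\xDF\xBF";

$as8bit  = mb_strlen($bytes, '8bit');     // counts raw bytes
$asUtf8  = mb_strlen($bytes, 'UTF-8');    // one two-byte sequence
$asUtf16 = mb_strlen($bytes, 'UTF-16LE'); // one 16-bit code unit

printf("8bit: %d, UTF-8: %d, UTF-16LE: %d\n", $as8bit, $asUtf8, $asUtf16);
// prints "8bit: 2, UTF-8: 1, UTF-16LE: 1"
```

The bytes never change; only the rule used to group them into characters does.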


The 8bit encoding treats each byte individually, regardless of the actual encoding: each byte counts as one, as simply as possible, even for values outside ASCII, such as 0xFF.

mb_strlen(..., '8bit') should be used to avoid problems with mbstring.func_overload, which has only recently been deprecated (as of PHP 7.2.0). This problem does not apply if you do not have the Multibyte String extension installed.

This is where @Paul Imon's answer goes wrong again. A native language feature configured in php.ini changes strlen() entirely:

mbstring.func_overload = 2

Testing:

echo strlen("\xDF\xBF");  //= 1

As you can see, with mbstring.func_overload = 2 the behavior of strlen is no longer the same as mb_strlen(..., '8bit').


In summary, if you want to handle bytes:

$texto = "\xDF\xBF";

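// Overloading replaces strlen() with mb_strlen(), so ask for bytes explicitly.
if (extension_loaded('mbstring') && defined('MB_OVERLOAD_STRING') && ((int) ini_get('mbstring.func_overload') & MB_OVERLOAD_STRING)) {
    echo mb_strlen($texto, '8bit');
} else {
    // strlen() is not overloaded here, so it counts bytes.
    echo strlen($texto);
}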

This uses strlen by default, but if overloading is in use, mb_strlen is called instead, which ensures that the modified strlen is never used. Note that not everyone has mbstring installed, so using mb_strlen(..., '8bit') by default is not always possible. If you are sure that mbstring is installed, you can simply use mb_strlen(..., '8bit') on its own. ;)
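If the check is needed in more than one place, it can be wrapped in a function. The following is only a sketch: byteLength() is a hypothetical name, not a PHP built-in, and it applies the same condition as the snippet above.

```php
<?php
/**
 * Hypothetical helper: length of $data in bytes, regardless of whether
 * mbstring.func_overload has replaced strlen().
 */
function byteLength(string $data): int
{
    // Without mbstring, or without the overload constant, strlen() is untouched.
    $overloaded = extension_loaded('mbstring')
        && defined('MB_OVERLOAD_STRING')
        && ((int) ini_get('mbstring.func_overload') & MB_OVERLOAD_STRING);

    return $overloaded ? mb_strlen($data, '8bit') : strlen($data);
}

echo byteLength("\xDF\xBF"), "\n";       // two raw bytes
echo byteLength(random_bytes(16)), "\n"; // arbitrary binary data, not text
```

The second call illustrates the non-text case: for binary data such as keys or image contents, only the byte count is meaningful.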

  • "There’s nothing wrong with mb_strlen('ὼ', '8bit') resulting in 3" - I agree! However, I wanted to show that it is incorrect with respect to UTF-8, since the W3 recommends this standard for web content: "Choose UTF-8 for any content..." - "In Unicode there are three different character encodings: UTF-8, UTF-16 and UTF-32. Of these three only UTF-8 should be used for Web content".

  • @Pauloimon "recommends". The word says it all. If you do not follow it, that is not "incorrect", just not recommended. Also, in the same text, UTF-8 should be chosen IF you use Unicode. The phrase clearly applies to Unicode, not to the web as a whole. You have an inaccuracy in your answer, which is the statement that UTF-8 is more efficient. UTF-8 is very inefficient compared to ISO-8859, for example, because it always needs processing. Tables of 256 characters are always 1:1, efficient in both space and processing. A simple reverse search by position in UTF-8 needs to scan the entire string.

  • My answer is open to edits, feel free to change anything.

  • For me there is another issue, beyond the one pointed out by Bacco. We are not always dealing with text, and in those cases UTF-8 is inapplicable. If you are, for example, reading an image or a private key, you are only dealing with "some set of arbitrary bytes", and PHP has no specific type for bytes. The OP does not specify the context. The main problem, in my view, is that there is no guarantee that strlen() always behaves the same as 8bit, so using mb_strlen(..., '8bit') becomes necessary.

1

The second parameter is the character encoding you are using. Most likely you will want this parameter set to UTF-8.

If you want to understand the function better, I suggest you take a look at the reference.
