Why should we use functions that start with mb_?


Sometimes problems arise in PHP with certain string functions because of the encoding of the strings they operate on.

One example is strlen.

$a = strlen('str');

$b = strlen('stré');

var_dump($a, $b); // Prints 3 and 5


As we can see, for $b the output says the string has 5 characters, not 4.

I know from experience that the fix is to use mb_strlen, one of PHP's multibyte functions.

Example:

var_dump(mb_strlen('stré', 'utf-8')); // Prints 4
  • What exactly does multibyte mean?

  • Since UTF-8 is very common here in Brazil, should we always use the mb_ functions instead of the regular string functions?

  • Why can't the regular string functions handle this simply by changing default_charset in php.ini?

  • 1

    Because prefixes are cool :D haha. Important question, +1.

  • 1

    Important and no one answered yet :\

  • A site can serve several languages, so each one may need different handling. UTF-8 may suit one language but not another (an ISO encoding, for example, may be used instead), which is why setting default_charset in php.ini alone does not solve it.

1 answer

10


PHP functions whose names start with "mb_" belong to the Mbstring extension.

MB stands for "Multibyte", that is, they are functions for manipulating multibyte strings.

Encodings such as UTF-8 are multibyte: a single character may take more than one byte. See the list of supported encodings in the official documentation: http://php.net/manual/en/mbstring.supported-encodings.php
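To make "multibyte" concrete, here is a minimal sketch that lists how many bytes each character of a UTF-8 string occupies. The helper name bytes_per_character is hypothetical (it is not part of Mbstring), and the sketch assumes PHP 7.4 or later for mb_str_split.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into characters and
// reports, for each one, its byte count and its bytes in hex.
function bytes_per_character(string $text): array
{
    $report = [];
    // mb_str_split() splits by character, not by byte (PHP 7.4+).
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // strlen() counts bytes, so it gives the width of this one character.
        $report[] = [$char, strlen($char), bin2hex($char)];
    }
    return $report;
}

// "\u{E9}" is "é"; the escape keeps the example independent of the file's encoding.
foreach (bytes_per_character("str\u{E9}") as [$char, $bytes, $hex]) {
    echo $char, ': ', $bytes, ' byte(s), hex ', $hex, PHP_EOL;
}
```

The first three characters take one byte each, while "é" takes two (c3 a9), which is why strlen reports 5 for a 4-character string.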

Practical example

<?php
date_default_timezone_set('Asia/Tokyo');

ini_set('error_reporting', E_ALL);
error_reporting(E_ALL);
ini_set('log_errors',TRUE);
ini_set('html_errors',FALSE);
ini_set('display_errors',TRUE);

define( 'CHARSET',   'UTF-8' );

ini_set( 'default_charset', CHARSET );

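// Compare versions with version_compare(); PHP_VERSION is a string such as "5.6.40".
if( version_compare( PHP_VERSION, '5.6.0', '<' ) ){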
    ini_set( 'mbstring.http_output', CHARSET );
    ini_set( 'mbstring.internal_encoding', CHARSET );
}

header( 'Content-Type: text/html; charset=' . CHARSET );

/*
Returns 6
The "heart" character takes up 3 bytes.
If you want to count the number of bytes, strlen() is the right choice.
*/
echo strlen('I♥NY') . PHP_EOL . '<br />';

/*
Returns 4
If you want to count the number of characters, use the equivalent MBString function.
*/
echo mb_strlen('I♥NY') . PHP_EOL . '<br />';


/*
Note that even accented Latin characters are multibyte
*/
echo strlen('ação') . PHP_EOL . '<br />';
echo mb_strlen('ação');
?>

Another, less commonly used term for multibyte encodings is "variable-width encoding".

https://en.wikipedia.org/wiki/Variable-width_encoding

Additional note

It is not always necessary to use mbstring functions. One such case is when a given string is known to contain no multibyte characters.

Example:

echo strlen('123') . PHP_EOL . '<br />';
echo mb_strlen('123');

As the example shows, mbstring is unnecessary in this case. However, we can go further with another numeric example.

echo strlen('１２３') . PHP_EOL . '<br />';
echo mb_strlen('１２３');

In this example the characters are still digits, but full-width ones, which are multibyte.
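Whether a string is safe for the byte-based functions can be checked at runtime instead of assumed. The following is a sketch only: is_ascii_only is a hypothetical helper name, and the full-width digits are written as Unicode escapes (PHP 7+) so the result does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper: true when every byte is in the 7-bit ASCII range,
// in which case strlen() and mb_strlen() agree.
function is_ascii_only(string $text): bool
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}

$ascii     = '123';
$fullWidth = "\u{FF11}\u{FF12}\u{FF13}"; // full-width digits 1, 2, 3

var_dump(is_ascii_only($ascii));     // bool(true)
var_dump(is_ascii_only($fullWidth)); // bool(false)

// Each full-width digit takes 3 bytes in UTF-8.
echo strlen($fullWidth), ' bytes, ', mb_strlen($fullWidth, 'UTF-8'), ' characters', PHP_EOL; // 9 bytes, 3 characters
```

A check like this only tells you that the byte-based functions are safe for one particular value; input that comes from users can change at any time.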

There are many well-developed systems that "believe" they are internationalized, but the vast majority are never tested against the real world, as if "global" meant only the American and European continents.

More than 60% of the planet (Arabic, Greek, Russian, Indian, Asian) uses multibyte characters, and each language has its peculiarities, such as the multibyte digits of the Japanese character set in this example.

Therefore, we recommend using the Mbstring functions if you want to build a system that is more compatible with the various existing encodings.
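The difference is not limited to counting. As a short sketch (the variable names are illustrative), cutting a string with the byte-based substr can split a character in half, while mb_substr cuts on character boundaries:

```php
<?php
$word = "a\u{E7}\u{E3}o"; // "ação", written with escapes (PHP 7+)

// substr() counts bytes: 2 bytes is "a" plus the first half of "ç".
$byteCut = substr($word, 0, 2);

// mb_substr() counts characters: the result is "aç".
$charCut = mb_substr($word, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): the cut left a broken sequence
echo $charCut, PHP_EOL;                          // aç
echo mb_strtoupper($word, 'UTF-8'), PHP_EOL;     // AÇÃO
```

Truncating names, titles, or previews with substr is a common way to end up with broken characters in the output.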

Another important note: UTF-8 is not the best-suited encoding for every language, and the Mbstring functions are not limited to UTF-8.

For example, Chinese characters are often encoded in Big5. UTF-16 and UTF-32 are also used.

However, UTF-8 is used for Chinese characters too, with reasonable safety, since it is "rare" for the Chinese themselves to use all the ideograms. There are more than 60 thousand of them.
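To illustrate that the Mbstring functions work with encodings other than UTF-8, the sketch below converts the same two Chinese characters to several encodings and compares byte and character counts. The helper name lengths_in is hypothetical, and the sketch assumes the 'BIG-5' encoding name is available in your Mbstring build (it appears in the supported encodings list linked above).

```php
<?php
// Hypothetical helper: converts a UTF-8 string to $encoding and returns
// [byte count, character count] for the converted string.
function lengths_in(string $utf8, string $encoding): array
{
    $converted = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    // strlen() counts bytes; mb_strlen() must be told which encoding to decode.
    return [strlen($converted), mb_strlen($converted, $encoding)];
}

$text = "\u{4E2D}\u{6587}"; // "中文", written with escapes (PHP 7+)

foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE', 'BIG-5'] as $encoding) {
    [$bytes, $chars] = lengths_in($text, $encoding);
    printf("%-8s %d bytes, %d characters\n", $encoding, $bytes, $chars);
}
```

The character count stays at 2 in every encoding while the byte count varies, which is why the encoding argument to the mb_ functions has to match the data.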

  • To strengthen the content, add this to it: "Since UTF-8 is very common here in Brazil, should we always use the mb_ functions instead of the regular string functions?"

  • 3

    Then do me a favor and accept the answer, so I earn the 50 points easily. Hahaha

  • I preferred not to comment on a specific region because this subject is global

  • But our region uses UTF-8, and I was wondering whether it is important to use mb_ in everything you create (name validation, etc.)

  • 3

    It is described in the answer, regardless of "our region". What did you fail to understand in the answer?
