This other question is a reminder that in PHP it is not enough to use the correct function, which, as @bfavaretto rightly suggested, is mb_substr() in place of substr(): we also need to configure PHP correctly so that the multibyte functions do not cause surprises.
The configuration I suggest, to be used whenever you work with Portuguese text, is:
setlocale(LC_ALL,'pt_BR.UTF8');
mb_internal_encoding('UTF8');
mb_regex_encoding('UTF8');
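To make the difference between substr() and mb_substr() concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the sample word and the variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
mb_internal_encoding('UTF-8');

// Illustrative sample: 7 characters, but 9 bytes in UTF-8 ("ç" and "ã" take 2 bytes each).
$word = 'coração';

// Byte-oriented functions count and cut bytes.
$byteLength = strlen($word);       // 9
$byteCut    = substr($word, 0, 5); // "cora" plus the first byte of "ç": broken UTF-8

// Multibyte functions count and cut characters.
$charLength = mb_strlen($word);       // 7
$charCut    = mb_substr($word, 0, 5); // "coraç"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The broken cut is what usually shows up in the browser as a "?" or a replacement character at the end of a truncated string.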
Use UTF-8 and compatible functions for everything!
ISO Latin 1 (formally ISO-8859-1) was retired years ago; the W3C has been suggesting the use of UTF-8 (defined in RFC 3629) in all its recommendations.
Likewise, for Brazilian websites, the e-PING recommendation sets UTF-8 as the standard charset.
The "de facto standard", the most popular among minimally serious "national" Portuguese-language sites, is likewise UTF-8. If you check the large Brazilian portals, you will see right in the HTML header that the adopted standard is UTF-8 (e.g. the <meta http-equiv="Content-Type"../> in the source code of UOL).
Historical legacies
Anyone who works with PHP deals with two historical legacies that still cause some confusion today, so I think it is important to recall them:
The ISO Latin 1 charset was for a long time, in Brazil and for Portuguese in general, the "official standard" for HTML pages, TXT, XML, SGML, etc. That is natural, because UTF-8 came after ISO Latin 1, and Unicode houses it in its structure, unchanged, as the Latin-1 Supplement block (the code points are the same, although the UTF-8 byte encoding of those characters is not).
PS: since Windows 3.x, Microsoft, isolating its users from any standardization initiative, has always enforced its own "Microsoft Latin ISO" (known as code page Windows-1252), and even today some Brazilian programmers and web designers publish HTML with this charset. It is an insult to international standards and to the user.
PHP tried to get rid of this tedious duplication of string functions (the mb_* library for variable-width (multibyte) charsets such as UTF-8, and the plain functions for fixed 8-bit ISO charsets) with the PHP 6 proposal, but never succeeded, even though languages such as Python did it long before. To this day it causes inconvenience for Portuguese-language programmers (here we are, spending time on this question!).
Where else are there UTF-8 "gotchas"?
Regular Expressions
Again, the multiplicity of options for doing the same thing in PHP causes some confusion. I have worked a lot with regular expressions and I am fully convinced that the best library (the most powerful, and the one accepted as standard in other languages) is PCRE (Perl Compatible Regular Expressions). I have never had to use the multibyte functions "mb_ereg_*". The preg_* family can handle it. Just pay attention to two details:
- Use the /u modifier when the regular expression contains accented or special characters.
- (see the discussion below) Your PHP script needs to be saved as UTF-8 so that a regular expression written in it is understood as UTF-8.
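The effect of the /u modifier can be sketched as follows. The sample string and variable names are illustrative, and the script is assumed to be saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8.
$text = 'ação'; // illustrative sample: 4 characters, 6 bytes in UTF-8

// Without /u, pattern and subject are treated as bytes: "." matches one byte.
preg_match_all('/./', $text, $bytes);
// With /u, both are treated as UTF-8: "." matches one whole character.
preg_match_all('/./u', $text, $chars);

$byteMatches = count($bytes[0]); // 6
$charMatches = count($chars[0]); // 4

// A character class with accented letters only works per character with /u.
$masked = preg_replace('/[çã]/u', '_', $text); // "a__o"

// Unicode properties such as \p{L} (any letter) are also available.
$onlyLetters = preg_match('/^\p{L}+$/u', $text); // 1
```

If the subject is not valid UTF-8, the preg_* functions fail when /u is used (preg_match() returns false), which is one more reason to keep everything in a single charset.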
Word count
The function str_word_count(), like so many others in PHP, has some flaws for the "general case" of UTF-8. See discussion here.
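One possible workaround, sketched here with the preg_* family rather than taken from the linked discussion, is to count runs of Unicode letters. The function name utf8_word_count() is hypothetical, and the rule for what counts as a "word" (letters, optionally joined by a hyphen or apostrophe) is an assumption you may need to adjust.

```php
<?php
// Hypothetical helper, not part of PHP. Assumes $text is valid UTF-8.
function utf8_word_count(string $text): int
{
    // A word: one or more letters, optionally joined by "-" or "'" (e.g. "guarda-chuva").
    $count = preg_match_all('/\p{L}+(?:[\'-]\p{L}+)*/u', $text);

    // preg_match_all() returns false on failure, e.g. when $text is invalid UTF-8.
    return $count === false ? 0 : $count;
}

echo utf8_word_count('Não há ação sem coração'), "\n"; // 5
echo utf8_word_count('guarda-chuva é útil'), "\n";     // 3
```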
Your PHP scripts... are they UTF-8?
Another common problem is your own PHP script, which also needs to be saved as UTF-8(!). Check with a serious and reliable editor (never Windows Notepad!), for example Sublime Text or TextPad.
The same goes for databases, XML files, etc. Everything needs to use the same charset, and it is easy: just always configure everything with the "universal standard", which is UTF-8.
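When data from a legacy source cannot be fixed at the origin, it can be checked and converted on the way in. The sketch below is only one way to do it: the helper name to_utf8() is hypothetical, and treating everything that is not valid UTF-8 as ISO-8859-1 is an assumption that must match your actual data (it could just as well be Windows-1252).

```php
<?php
// Hypothetical helper, not part of PHP.
// Assumption: input that is not valid UTF-8 is encoded in $fallback.
function to_utf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$latin1 = "a\xE7\xE3o";     // "ação" as ISO-8859-1 bytes (4 bytes)
$utf8   = to_utf8($latin1); // "ação" as UTF-8 (6 bytes)

var_dump(strlen($latin1), strlen($utf8), mb_strlen($utf8, 'UTF-8'));
```

Checking first avoids double-encoding strings that are already UTF-8, which is the usual cause of text such as "Ã§Ã£" appearing on the page.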
I suspect your string is multibyte. In that case you could use mb_substr. Can you post the relevant chunk of code (and an example string) to confirm?
– bfavaretto
@bfavaretto, I was able to use mb_substr, but what is the difference between substr and mb_substr?
– Leandro Costa
I posted an answer with the explanation.
– bfavaretto
@bfavaretto, Thank you!
– Leandro Costa
Related: http://answall.com/questions/78308/por-que-deveria-utilizar-fun%C3%A7%C3%B5es-come%C3%A7am-by-Mb
– Wallace Maxters