Problem using substr on text with PHP

Question


Asked 11 years, 5 months ago

Viewed 3,756 times

13

When using substr in a variable with text, it is returning a special character " "someone could help me?

I’m using the following code:

$excerpt = get_the_content();
$excerpt = strip_shortcodes($excerpt);
$excerpt = strip_tags($excerpt);
$the_str = substr($excerpt, 0, 335);
echo $the_str . '...';

1

I suspect your string is multibyte. In that case you could use mb_substr. Can you post the relevant chunk of code (and an example string) to confirm?

– bfavaretto

2014/03/06 at 13:26
@bfavaretto, I got it working with mb_substr, but what is the difference between substr and mb_substr?

– Leandro Costa

2014/03/06 at 13:40
1

I posted an answer with the explanation.

– bfavaretto

2014/03/06 at 13:48
@bfavaretto, Thank you!

– Leandro Costa

2014/03/06 at 13:49
1

Related: http://answall.com/questions/78308/por-que-deveria-utilizar-fun%C3%A7%C3%B5es-come%C3%A7am-by-Mb

– Wallace Maxters

2015/10/30 at 15:27

3 answers

18

Your string is probably encoded as UTF-8, which is desirable, since it lets you represent an immense range of special characters. In UTF-8, certain characters, including all accented letters, occupy more than one byte. The substr function, however, assumes that each character occupies exactly one byte. What is happening is that substr is cutting a character in half, keeping only its first byte. When the browser displays the output of substr, that lone byte is treated as an invalid character.

The solution is to use the function mb_substr, which is designed to handle multibyte characters:

$the_str = mb_substr($excerpt, 0, 335);
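
To make the difference concrete, here is a minimal sketch using a hypothetical sample word (not the asker's actual text), assuming the mbstring extension is enabled. The word is written with byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// Hypothetical sample: "ação" in UTF-8. "ç" and "ã" take two bytes each.
$text = "a\xC3\xA7\xC3\xA3o";

$byteLength = strlen($text);              // counts bytes
$charLength = mb_strlen($text, 'UTF-8');  // counts characters

// substr() cuts after 2 bytes: "a" plus only the first byte of "ç".
$brokenCut = substr($text, 0, 2);
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8');

// mb_substr() cuts after 2 characters: "aç", which is still valid UTF-8.
$safeCut = mb_substr($text, 0, 2, 'UTF-8');
$safeIsValid = mb_check_encoding($safeCut, 'UTF-8');

var_dump($byteLength, $charLength, $brokenIsValid, $safeIsValid);
```

The byte count (6) and the character count (4) differ, and the string cut by substr fails the UTF-8 check, which is what the browser ends up showing as an invalid character.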

Thank you very much, man, your answer was a great help!

– Leandro Costa

2014/03/06 at 13:50
The real problem is that he is using PHP: why doesn't the regular substr support UTF-8?

– Gustavo Rodrigues

2014/03/06 at 13:54
@Gustavorodrigues The technical reason is that the standard string manipulation functions assume the encoding uses only one byte per character. The reason for this design choice is probably historical: when PHP was created, UTF-8 was not yet the default on the web.

– bfavaretto

2014/03/06 at 14:08
And they created another function for compatibility instead of changing the old one, causing problems for developers. That is where the dozens of functions with the same purpose but for different situations come from.

– Gustavo Rodrigues

2014/03/06 at 14:32
2

And what's the point of standing here crying about it? That's how the language is; if you don't like it or can't use it that way, look for another one. There is no perfect language.

– Henrique Barcelos

2014/03/10 at 12:18
@bfavaretto, I usually pass "utf-8" as the fourth parameter. Isn't that necessary?

– Wallace Maxters

2015/10/30 at 15:25
@Wallacemaxters You can set this globally with mb_internal_encoding("UTF-8"); if UTF-8 is not already the server default. But when you don't know the configuration, it is indeed safer to pass it as the last parameter.

– bfavaretto

2015/10/30 at 15:39
@bfavaretto, according to the manual, the third parameter is mb_detect_encode. That is, if nothing is passed, the encoding of the string (and not the preset one) is detected. At least that's what I understood.

– Wallace Maxters

2015/10/30 at 15:40
@Henriquebarcelos, you only stopped short of telling him to go program in Python. Just kidding.

– Wallace Maxters

2015/10/30 at 15:41
@Wallace are we looking at the same manual?!

– bfavaretto

2015/10/30 at 15:58
Forget that nonsense, @bfavaretto. The manual confuses me more than it helps :)

– Wallace Maxters

2015/10/30 at 15:59
@Wallacemaxters So you can see there is no detection; it uses the default multibyte encoding.

– bfavaretto

2015/10/30 at 16:01
Yes, that's right. Actually, if you want to detect the encoding you have to do it like this: mb_substr($str, 0, 3, mb_detect_encoding($str)).

– Wallace Maxters

2015/10/30 at 16:04


by Peter Krauss • **1,830** points · Answer 1 · 2014-03-09T13:29:23+00:00

This other question is a reminder that in PHP it is not enough to use the correct function, which, as @bfavaretto rightly suggested, is mb_substr() in place of substr(): we also need to configure PHP correctly so that the multibyte functions do not cause surprises.

The configuration I suggest, to be used whenever you work with Portuguese, is:

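// Locale for Brazilian Portuguese; it must be installed on the server.
setlocale(LC_ALL, 'pt_BR.UTF8');
// Default encoding for the mb_* functions and for the mb_ereg_* functions.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');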
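
As a rough illustration of what mb_internal_encoding() changes, the sketch below (with a hypothetical sample word, and assuming the mbstring extension is enabled) shows that, once the internal encoding is set, mb_substr() without an encoding argument gives the same result as passing 'UTF-8' explicitly.

```php
<?php
// Hypothetical sample: "coração" in UTF-8, written with byte escapes.
$text = "cora\xC3\xA7\xC3\xA3o";

// Set the default encoding used by mb_* functions when none is passed.
mb_internal_encoding('UTF-8');

// No encoding argument: the internal encoding set above is used.
$withDefault = mb_substr($text, 0, 5);

// Encoding passed explicitly: does not depend on any global setting.
$explicit = mb_substr($text, 0, 5, 'UTF-8');

var_dump($withDefault === $explicit); // both are "coraç"
```

Passing the encoding explicitly is the more defensive choice in code that may run on servers you do not configure yourself.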

Use UTF-8 and compatible functions for everything!

ISO Latin 1 (formally ISO-8859-1) was retired years ago, and the W3C has been suggesting UTF-8 (see RFC 3629) in all of its recommendations.

Likewise, for Brazilian websites, the e-PING recommendation sets UTF-8 as the standard charset. The "de facto standard", the one most widely adopted by minimally serious Portuguese-language sites, is also UTF-8. If you check the large Brazilian portals, you will see right in the HTML header that the adopted standard is UTF-8 (e.g. the <meta http-equiv="Content-Type"../> in the source code of UOL).

Historical legacies

Anyone who works with PHP deals with two historical legacies that still cause some confusion today, so I think it is important to recall them:

1. The ISO-Latin-1 charset was for a long time, in Brazil and for Portuguese, the "official standard" for HTML pages, TXT, XML, SGML, etc. That is natural, because UTF-8 came after ISO Latin, and Unicode houses it in its structure, unchanged, as the Latin-1 Supplement block.
PS: Since Windows 3.x, Microsoft, to isolate its users from any standardization initiative, has always enforced the "Microsoft Latin ISO" (known as code page Windows-1252), and even today some Brazilian programmers and web designers publish HTML with this charset. It is an insult to international standards and to the user.
2. PHP tried to get rid of this annoyance of duplicated string functions (an mb_* library for variable-width, multibyte charsets such as UTF-8, and another set for fixed 8-bit ISO charsets) with the PHP 6 proposal, but never succeeded, even though languages such as Python did it long before. To this day it causes inconvenience for Portuguese-language programmers (here we are, spending time on this question!).

Where else are there UTF-8 "gotchas"?

Regular Expressions

Once again, the multiplicity of ways to do the same thing in PHP causes some confusion. I have worked a lot with regular expressions and am fully convinced that the best library (the most powerful, and accepted as a standard in other languages) is PCRE (Perl Compatible Regular Expressions). I have never had to use the multibyte mb_ereg_* functions; the preg_* family can handle it. Just pay attention to two details:

1. Use the /u modifier when the regular expression contains accented or special characters.
2. Your PHP script must itself be saved in UTF-8 for its regular expressions to be read as UTF-8 (see the discussion below).
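
The effect of the /u modifier can be seen in a small sketch like the following. The sample word is hypothetical and written with byte escapes, so this particular example works regardless of how the script file is saved.

```php
<?php
// Hypothetical sample: "ação" in UTF-8 (4 characters, 6 bytes).
$word = "a\xC3\xA7\xC3\xA3o";

// Without /u, the pattern works byte by byte: "." matches a single byte.
$byteMatches = preg_match_all('/./', $word, $bytes);

// With /u, subject and pattern are treated as UTF-8: "." matches a whole character.
$charMatches = preg_match_all('/./u', $word, $chars);

var_dump($byteMatches, $charMatches);
```

Without the modifier the accented letters are split into separate bytes; with it, each match in $chars[0] is a complete character.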

Word count

The str_word_count() function, like so many others in PHP, has some flaws in the "general case" of UTF-8. See discussion here.
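
One possible workaround, not taken from the linked discussion, is to count runs of Unicode letters with preg_match_all() and the /u modifier. The function name count_words_utf8 is hypothetical, and this sketch treats hyphens and apostrophes as separators, which may not suit every text.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters (\p{L}) in a UTF-8 string.
function count_words_utf8(string $text): int
{
    // preg_match_all() returns the number of matches, or false on error
    // (for example, when $text is not valid UTF-8).
    $count = preg_match_all('/\p{L}+/u', $text);
    return $count === false ? 0 : $count;
}

// "A programação é divertida", written with byte escapes.
$sentence = "A programa\xC3\xA7\xC3\xA3o \xC3\xA9 divertida";
$wordCount = count_words_utf8($sentence);

echo $wordCount, "\n";
```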

Are your PHP scripts in UTF-8?

Another common problem is the PHP script itself, which also needs to be saved in UTF-8(!). Check it with a serious, reliable editor (never Windows Notepad!), for example Sublime Text or TextPad.

The same goes for databases, XML files, etc. Everything needs to use the same charset, and the easy way is to always configure everything with the "universal standard", which is UTF-8.
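
Besides checking in an editor, the encoding can also be tested from PHP itself. The sketch below is only an illustration with two hypothetical one-line "sources", assuming the mbstring extension is enabled: the same statement as it would be stored in UTF-8 and in ISO-8859-1.

```php
<?php
// The same line of source, "echo 'ação';", as bytes in two encodings.
$utf8Source   = "echo 'a\xC3\xA7\xC3\xA3o';"; // saved as UTF-8
$latin1Source = "echo 'a\xE7\xE3o';";         // saved as ISO-8859-1

// mb_check_encoding() reports whether the bytes form valid UTF-8.
$utf8IsValid   = mb_check_encoding($utf8Source, 'UTF-8');
$latin1IsValid = mb_check_encoding($latin1Source, 'UTF-8');

var_dump($utf8IsValid, $latin1IsValid);
```

For a real file, the same check can be applied to the result of file_get_contents() on the script's path. A file containing only ASCII passes in either case, since ASCII is valid UTF-8.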

by Silvio Andorinha • **8,394** points · Answer 2 · 2014-03-06T13:29:11+00:00

That strange character appears when the character set a character belongs to is not recognized. To solve the problem you need to convert your string to UTF-8, the universal character standard.

Try applying utf8_encode($sua_string); to the string. Note that utf8_encode assumes the input is ISO-8859-1 and is deprecated as of PHP 8.2.

For more details, see http://www.php.net/manual/en/function.utf8-encode.php

Or try:

$string = mb_convert_encoding($sua_string, 'UTF-8', 'ISO-8859-1');