Problem using substr on text in PHP

13

When I use substr on a variable containing text, the result contains a special character " ". Could someone help me?

I'm using the following code:

$excerpt = get_the_content();
$excerpt = strip_shortcodes($excerpt);
$excerpt = strip_tags($excerpt);
$the_str = substr($excerpt, 0, 335);
echo $the_str . '...'; 
  • I suspect your string is multibyte. In that case you could use mb_substr. Can you post the relevant chunk of code (and an example string) to confirm?

  • @bfavaretto, mb_substr worked, but what is the difference between substr and mb_substr?

  • I posted an answer with the explanation.

  • @bfavaretto, Thank you!

  • Related: http://answall.com/questions/78308/por-que-deveria-utilizar-fun%C3%A7%C3%B5es-come%C3%A7am-by-Mb

3 answers

18


Your string is probably encoded as UTF-8, which is desirable because it can represent a huge range of special characters. In UTF-8, certain characters, including all accented letters, occupy more than one byte. But substr assumes that each character occupies exactly one byte. What is happening is that substr is cutting a character in half, keeping only its first byte. When the browser displays the output of substr, that lone byte is treated as an invalid character.

The solution is to use the function mb_substr, which is designed to handle multibyte characters:

$the_str = mb_substr($excerpt, 0, 335);
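
To make the difference concrete, here is a minimal sketch. The sample word is an assumption chosen for illustration (it is not from the question), and it is written with \u{...} escapes (PHP 7+) so the result does not depend on how the file is saved:

```php
<?php
// Hypothetical sample: "ação", where "ç" and "ã" take two bytes each in UTF-8.
$text = "a\u{e7}\u{e3}o";

// substr counts bytes: 2 bytes = "a" plus only the first byte of "ç".
$bytes = substr($text, 0, 2);

// mb_substr counts characters once it knows the encoding.
$chars = mb_substr($text, 0, 2, 'UTF-8');

var_dump(strlen($text));                      // int(6): bytes
var_dump(mb_strlen($text, 'UTF-8'));          // int(4): characters
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false): truncated character
var_dump(mb_check_encoding($chars, 'UTF-8')); // bool(true)
```

The truncated result from substr is exactly the kind of string that the browser renders with an invalid-character mark.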
  • Thank you very much, your answer was a great help!

  • The real problem is that PHP is being used: why doesn't plain substr support UTF-8?

  • @Gustavorodrigues The technical reason is that the standard string manipulation functions assume an encoding that uses only one byte per character. The reason for this design choice is probably historical: when PHP was created, UTF-8 was not yet the default on the web.

  • And they created another function for compatibility instead of changing the old one, causing problems for developers. That is how we end up with dozens of functions with the same purpose but for different situations.

  • And what is the point of complaining about it here? That is how the language is; if you don't like it or can't work with it that way, look for another one. There is no perfect language.

  • @bfavaretto, I usually pass the fourth parameter, which is UTF-8. Is that not necessary?

  • @Wallacemaxters You can set this globally with mb_internal_encoding("UTF-8"); if UTF-8 is not already the server default. But without checking that, it is actually safer to pass it as the last parameter.

  • @bfavaretto, according to the manual, the third parameter is mb_detect_encode. That is, if nothing is passed, the encoding of the string (and not the preset one) is detected. At least that is what I understood.

  • @Henriquebarcelos, all that was missing was telling them to go program in Python. Just kidding.

  • @Wallace are we looking at the same manual?!

  • Forget that nonsense, @bfavaretto. The manual confuses me more than it helps :)

  • @Wallacemaxters so you see there is no detection; it uses the default multibyte encoding

  • Yes, that's right. Actually, if you want to detect the encoding you have to do it like this: mb_substr($str, 0, 3, mb_detect_encoding($str)).
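
The three options discussed in these comments can be sketched roughly as follows. The sample string and the candidate list passed to mb_detect_encoding are assumptions for illustration; detection is a heuristic, so its result is deliberately not relied on here:

```php
<?php
// Hypothetical sample: "ação" written with escapes (PHP 7+).
$str = "a\u{e7}\u{e3}o";

// Option 1: pass the encoding explicitly as the fourth argument.
$explicit = mb_substr($str, 0, 3, 'UTF-8');

// Option 2: set the default once; later calls may omit the argument.
mb_internal_encoding('UTF-8');
$implicit = mb_substr($str, 0, 3);

// Option 3: try to detect the encoding (heuristic), falling back to UTF-8
// when detection fails and returns false.
$detected = mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1'], true);
$guessed  = mb_substr($str, 0, 3, $detected ?: 'UTF-8');

var_dump($explicit === $implicit); // bool(true)
```

When the encoding of the input is known in advance, options 1 and 2 avoid depending on a guess.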


10

This other question is a reminder that in PHP it is not enough to use the correct function, which, as @bfavaretto rightly suggested, is mb_substr() in place of substr(): we also need to configure PHP correctly so that the multibyte functions cause no surprises.

The configuration I suggest, to be used whenever working with Portuguese, is

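// Brazilian Portuguese locale with UTF-8 (must be installed on the server).
setlocale(LC_ALL, 'pt_BR.UTF-8');
// Default encoding for the mb_* functions and for mb_ereg_*.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');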
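
A minimal sketch for checking that such a configuration actually took effect. The list of alternative locale names is an assumption, since the name that is installed varies between systems:

```php
<?php
// setlocale() accepts several candidate names and returns the first one
// that works, or false when none of them is installed.
$locale = setlocale(LC_ALL, 'pt_BR.UTF-8', 'pt_BR.utf8', 'pt_BR');
if ($locale === false) {
    echo "No pt_BR locale is installed on this system\n";
}

mb_internal_encoding('UTF-8');

// Called without arguments, mb_internal_encoding() reports the current value.
echo mb_internal_encoding(), "\n"; // UTF-8

// With the default set, mb_* calls no longer need the encoding argument.
echo mb_strlen("a\u{e7}\u{e3}o"), "\n"; // 4
```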

Use UTF-8 and compatible functions for everything!

ISO Latin 1 (formally ISO-8859-1) was retired years ago; the W3C has been recommending UTF-8 (see RFC-3629) in all of its recommendations.

Likewise, for Brazilian websites, the e-PING recommendation sets UTF-8 as the standard charset. The "de facto standard", the one most popular among reasonably serious national sites in Portuguese, is likewise UTF-8. If you check the large Brazilian portals, you will see right in the HTML header that the adopted standard is UTF-8 (e.g. the <meta http-equiv="Content-Type"../> in the source code of UOL).

Historical legacies

Anyone who works with PHP deals with two historical legacies that still cause some confusion today, so I think it is important to recall them:

  1. The ISO-Latin-1 charset was for a long time, in Brazil and for Portuguese in general, the "official standard" for HTML pages, TXT, XML, SGML, etc. That is natural, because UTF-8 came after ISO-Latin-1, and Unicode houses it in its structure, unchanged, as the Latin-1 Supplement block.
    PS: Microsoft, since Windows 3.x, to isolate its users from any standardization initiative, has always enforced the "Microsoft Latin ISO" (known as code page Windows-1252), and even today some Brazilian programmers and web designers publish HTML with this charset. It is an insult to international standards and to the user.

  2. PHP tried to get rid of this annoying duplication of string functions (an mb_* library for variable-width, multibyte charsets such as UTF-8, and another set of functions for fixed 8-bit ISO charsets) with the PHP 6 proposal, but never succeeded (even though languages such as Python managed it long ago). It still causes inconvenience for Portuguese-speaking programmers today (here we are, spending time on this question!).


Where else are there UTF-8 "gotchas"?

Regular Expressions

Once again, the multiplicity of options for doing the same thing in PHP causes some confusion. I have worked a lot with regular expressions and I am fully convinced that the best library (the most powerful, and accepted as the standard in other languages) is PCRE (Perl Compatible Regular Expressions). I have never had to use the multibyte "mb_ereg_*" functions; the preg_* family can handle it. Just pay attention to two details:

  • Use the /u modifier whenever the regular expression contains accented or special characters.
  • Your PHP script needs to be saved as UTF-8 for its regular expressions to be understood as UTF-8 (see the discussion below).
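
A small sketch of what the /u modifier changes in the preg_* family. The sample word is an assumption for illustration and is written with escapes so it does not depend on the file encoding:

```php
<?php
// Hypothetical sample: "ação" (6 bytes, 4 characters in UTF-8).
$word = "a\u{e7}\u{e3}o";

// Without /u, "." matches one byte at a time.
$byteMatches = preg_match_all('/./s', $word);

// With /u, subject and pattern are treated as UTF-8, so "." matches one character.
$charMatches = preg_match_all('/./su', $word);

var_dump($byteMatches, $charMatches); // int(6) int(4)

// Unicode properties such as \p{L} (any letter) are meant to be used with /u.
var_dump(preg_match('/^\p{L}+$/u', $word)); // int(1)
```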

Word count

The function str_word_count(), like so many in PHP, has some flaws in the "general case" of UTF-8... See the discussion here.
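
One possible workaround, sketched here with preg_match_all and the /u modifier. The helper name utf8_word_count is hypothetical and is not necessarily what the linked discussion proposes; it treats any run of Unicode letters and combining marks as a word, which is an assumption about what should count as a word (a hyphenated word counts as two):

```php
<?php
// Hypothetical helper, not part of PHP: counts runs of Unicode letters.
function utf8_word_count(string $text): int
{
    // \p{L} = letters, \p{M} = combining marks; /u makes the match UTF-8 aware.
    $count = preg_match_all('/[\p{L}\p{M}]+/u', $text);

    // preg_match_all returns false on error (for example, invalid UTF-8 input).
    return $count === false ? 0 : $count;
}

// Sample sentence written with escapes: "A composição de resíduos orgânicos".
$sentence = "A composi\u{e7}\u{e3}o de res\u{ed}duos org\u{e2}nicos";
echo utf8_word_count($sentence), "\n"; // 5
```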

Your PHP scripts... are they UTF-8?

Another common problem is your own PHP script, which also needs to be saved as UTF-8(!). Check with a serious and reliable editor (never Windows Notepad!), for example Sublime Text or TextPad.

The same goes for databases, XML files, etc. Everything needs to use the same charset, and the easy way is to always configure everything with the "universal standard", UTF-8.
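
Besides checking in an editor, a script can be checked from PHP itself. This is only a sketch; the helper name is_utf8_file is hypothetical. Note that a pure ASCII file also passes, since ASCII is a subset of UTF-8:

```php
<?php
// Hypothetical helper, not part of PHP: true when the file's bytes are valid UTF-8.
function is_utf8_file(string $path): bool
{
    $bytes = file_get_contents($path);
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}

// Check the script that is currently running.
var_dump(is_utf8_file(__FILE__));
```

A file saved as ISO-8859-1 that contains accented letters would normally fail this check, because those single bytes do not form valid UTF-8 sequences.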

  • Great answer!

3

PHP automatically outputs that strange character when it does not recognize the character set the character belongs to. To solve the problem, you need to convert your string to UTF-8, the universal character standard.

Try applying utf8_encode($sua_string); to the string.

For more details, see http://www.php.net/manual/en/function.utf8-encode.php

Or try:

$string= mb_convert_encoding(utf8_encode($sua_string), 'ISO-8859-1', 'UTF-8');
  • If I add utf8_encode to the string, it returns the text like this: "I’ve spoken here of composting, which is the transformation of food remains into fertilizer, a way to decrease organic waste and take better care of your plants. Today I bring another tip that also helps to make the soil richer in nutrients. If you have a garden or a yard in your home, try to leave the leaves and flowers that..."

  • Got it, I'll edit the answer.

  • The problem persisted; I was able to solve it using mb_substr instead of substr.
