JSON return with special characters in URL

I’m having a problem with a JSON response. I wrote a query to fetch some images from the database, and some users registered image URLs containing special characters such as ç and ã. Other text fields that contain these characters are returned correctly, with no problems.

Some URLs stored in the database:

media/upload/revista/capa_edição_810.jpg
media/upload/revista/capa_edição_806.jpg

In the JSON response they come back like this:

media/upload/revista/capa_edi\u00c3\u00a7\u00c3\u00a3o_810.jpg
media/upload/revista/capa_edi\u00c3\u00a7\u00c3\u00a3o_806.jpg

My code in PHP:

<?php

require_once('conexao.php');

$query = " SELECT CONCAT('http://www.meusite.com.br/media/', imagem) as imagem, \n" 
    ." concat('edicao_', rr.edicao) as edicao, rr.edicao as numedicao, ei.id as Id_revista  \n"
    ." FROM `revistas_revista` rr   \n"
    ." inner join edicao_impressa_edicaoanterior ei on ei.revista_anterior_id = rr.id   \n"
    ."  where ei.ativo = 1 and year(rr.data) = '".$_GET['Ano']."' \n"
    ." order by edicao desc ";

$result = mysql_query($query) or die("Erro ao buscar dados");

mysql_close();  

$linhas = array();   

while ($r = mysql_fetch_assoc($result)){
   $linhas[] = $r;
}

echo json_encode($linhas); 

?>

Is there a solution to this problem?

  • The problem could be some kind of hack like array_map('utf8_encode', $r); that double-encodes the file names (in this case the variable name does not match, so it is probably left over from an earlier test, but it points to guess-driven programming). This type of problem can be detected with a simple test that prints the data while debugging the code. When in doubt, print the strings in hexadecimal to see exactly how they are encoded. Since you posted the code with parts missing, it is hard to give more details.

  • The comment above still stands. You are probably double-encoding UTF-8, or UTF-8 data is being interpreted as ISO, and to locate the problem you need to test each step, from the DB to json_encode, to find where it happens. For example, \u00c3\u00a7 is your ç, which for some reason has been split into 2 Unicode code points. This is very common with double encoding.

  • Understood, but the funny thing is that the part that returns text is correct. Any text that has this type of special character comes back fine; the problem is only in this URL field.

  • Don’t trust what appears on the screen; print in hexadecimal and check the bytes. You need to check whether the DB is correct too. Maybe the data is in UTF-8 but the table or field is wrongly configured as ISO. The problem may even have happened earlier, when the DB was populated. You could even patch it with utf8_decode($filename), but that opens the door to hell for good. It would be the eXtreme Go Horse technique taken to extremes (fine for testing, but no way for production use).

  • If you use MySQL Workbench or Query Browser, you can view the data in hexadecimal in the DB as well.

  • JSON_UNESCAPED_UNICODE solved for me, but it’s a weird bug.

1 answer

Quick fix.

Just copy and paste.
You don’t have to use your brain very much.

while ($r = mysql_fetch_assoc($result)){
   $linhas[] = $r;
}

echo json_encode($linhas); 

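Replace it with this and you’re done:

while ($r = mysql_fetch_assoc($result)){
   // utf8_decode() expects a string, not the whole row array,
   // so decode only the affected field
   $r['imagem'] = utf8_decode($r['imagem']);
   $linhas[] = $r;
}

echo json_encode($linhas, JSON_UNESCAPED_UNICODE);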



The rest of the answer is in case you want to understand what’s been done.
Continue reading if you want to fix the problem correctly.

Detailed answer.

Note: It takes a little brain to continue reading. Texts longer than 3 lines make people sleepy. Good luck.

Concerning the issue

I’m not entirely sure what you want, but I understand that you want the JSON output to show the characters in their original form.

By default, json_encode() escapes non-ASCII characters as \uXXXX sequences. That is why you get results like \u00e7.

Simple solution (PHP 5.4+)

A simple way to solve it is to pass the second parameter of json_encode(). There is a constant called JSON_UNESCAPED_UNICODE, which can be used like this:

echo json_encode(array('acentuação'), JSON_UNESCAPED_UNICODE);
// outputs
// ["acentuação"]

echo json_encode(array('acentuação'));
// outputs
// ["acentua\u00e7\u00e3o"]

Backward compatibility

This option of json_encode() is only available from PHP 5.4 onward. Since it is still common to find servers running older versions, below is an example of how to provide backward compatibility with PHP versions below 5.4:

function JsonEncode($val, $unescapeUnicode = false)
{
    // Default behaviour: non-ASCII characters are escaped as \uXXXX
    if (!$unescapeUnicode) {
        return json_encode($val);
    }

    // PHP 5.4+ supports the flag natively
    if (version_compare(PHP_VERSION, '5.4.0', '>=')) {
        return json_encode($val, JSON_UNESCAPED_UNICODE);
    }

    // Older PHP: encode normally, then turn each \uXXXX escape back
    // into its UTF-8 character. The lookbehind skips sequences that are
    // preceded by another backslash. Surrogate pairs (characters outside
    // the BMP) are not handled here.
    return preg_replace_callback(
        '/(?<!\\\\)\\\\u([0-9a-fA-F]{4})/',
        function ($matches) {
            return html_entity_decode('&#x' . $matches[1] . ';', ENT_COMPAT, 'UTF-8');
        },
        json_encode($val)
    );
}

// Usage sample
echo JsonEncode(array('acentuação'), true);
// outputs
// ["acentuação"]

Double encoding

According to @Bacco’s observations, your original code is probably applying double encoding, which is why it produces the wrong output.

The JSON-escaped form of the string ção is \u00e7\u00e3o. However, the output shown in the question is \u00c3\u00a7\u00c3\u00a3o.

In a simple test, I simulated double encoding and reproduced exactly the result from the question:

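$str = 'acentuação'; // source file saved as UTF-8
$str_utf8 = utf8_encode($str); // encodes an already-UTF-8 string again
echo PHP_EOL.$str_utf8;
echo PHP_EOL.json_encode($str_utf8);
// outputs
// "acentua\u00c3\u00a7\u00c3\u00a3o"

// This is the correct output, as it should be returned
echo PHP_EOL.json_encode(array($str));
// outputs
// ["acentua\u00e7\u00e3o"]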
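
To see the same effect at the byte level, here is a minimal sketch that prints the strings in hexadecimal, as suggested in the comments. It assumes the mbstring extension is available and uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which is deprecated in recent PHP versions. The variable names are only illustrative.

```php
<?php
// "ção" as proper UTF-8: ç = C3 A7, ã = C3 A3, o = 6F
$correct = "\xC3\xA7\xC3\xA3o";

// Simulate the bug: each UTF-8 byte is treated as one ISO-8859-1
// character and encoded to UTF-8 a second time.
$double = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), PHP_EOL;     // c3a7c3a36f
echo bin2hex($double), PHP_EOL;      // c383c2a7c383c2a36f
echo json_encode($correct), PHP_EOL; // "\u00e7\u00e3o"
echo json_encode($double), PHP_EOL;  // "\u00c3\u00a7\u00c3\u00a3o"
```

The byte C3 becomes the character Ã (U+00C3) and A7 becomes § (U+00A7), which is where the \u00c3\u00a7 pair in the question comes from.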

To solve your specific case, there is a quick and dirty option, which is to sweep the dirt under the rug:

utf8_decode(json_encode($val, JSON_UNESCAPED_UNICODE));

The second option is to do it "the right way" and eliminate the problem at the root: find where the double encoding is being applied and fix it, which removes the unnecessary use of utf8_encode() and utf8_decode().
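
While hunting for the root cause, it can help to check which stored values are actually affected. The sketch below is only a heuristic, not a definitive fix: fixDoubleEncoding() is a hypothetical helper (not from the question), it assumes the mbstring extension, and it assumes the extra layer was plain ISO-8859-1 (data that went through Windows-1252 may not be detected). A string that legitimately contains a sequence like "Ã§" would be changed by mistake.

```php
<?php
/**
 * Hypothetical helper: undo one layer of double UTF-8 encoding if the
 * string looks double-encoded, otherwise return it untouched.
 */
function fixDoubleEncoding($s)
{
    // Reverse the suspected extra step: UTF-8 -> ISO-8859-1
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // Re-apply it, to make sure nothing was lost in the reversal
    $roundTrip = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');

    // Accept the decoded form only if it changed something, the
    // reversal was lossless, and the result is itself valid UTF-8
    if ($decoded !== $s && $roundTrip === $s && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }
    return $s;
}

$double  = "capa_edi\xC3\x83\xC2\xA7\xC3\x83\xC2\xA3o_810.jpg"; // double-encoded
$correct = "capa_edi\xC3\xA7\xC3\xA3o_810.jpg";                 // proper UTF-8

var_dump(fixDoubleEncoding($double) === $correct);  // bool(true)
var_dump(fixDoubleEncoding($correct) === $correct); // bool(true)
```

Running each suspicious field through a check like this shows whether the damage is already in the DB or is introduced later in the PHP code.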

The complete code of the test: https://ideone.com/kHyBVf

Note: The use of utf8_encode() and utf8_decode() is unnecessary in the context of the question. It does not mean that it is totally unnecessary, but rather that it is being misused.
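
For contrast, here is a small sketch of a case where such a conversion is appropriate: the input really is ISO-8859-1, and json_encode() only accepts UTF-8. It assumes PHP 7 or later and the mbstring extension, and again uses mb_convert_encoding() in place of utf8_encode().

```php
<?php
// "ção" in ISO-8859-1: one byte per character, not valid UTF-8
$latin1 = "\xE7\xE3o";

// json_encode() rejects the invalid UTF-8 input and returns false
var_dump(json_encode($latin1)); // bool(false)

// Converting once, from the real source encoding, is the correct use
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo json_encode($utf8), PHP_EOL; // "\u00e7\u00e3o"
```

The conversion is applied exactly once and only because the source encoding is known to be ISO-8859-1, which is the difference from the double encoding seen in the question.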

  • The problem is that çã shouldn’t be \u00c3\u00a7\u00c3\u00a3 but \u00e7\u00e3. Something is splitting e7 into the individual bytes c3 a7. There is a high chance of double encoding, or of UTF-8 being treated as ISO.

  • Well observed. I had not tested the Unicode output from the question, but a simple test confirmed that what you observed is right. I added it to the answer above.
