How to detect page encoding with PHP?

Viewed 3,652 times

9

I want to create a function that saves data to the database in the correct encoding (my database is UTF-8), based on the encoding detected in the incoming data.

Is there a native PHP function to do this? Is there any other way?

  • What are you trying to save? User-provided data? Files on the server itself? And what is the format of this data? (The question title suggests HTML pages; is that right?)

  • Yes, data provided by a user. The files are .php files mixed with HTML. What do you mean by the format of the data?

  • 1

    I asked about the format because in some cases (e.g., HTML/XML) this information may be contained in the document itself (e.g., meta http-equiv="Content-Type") and can be extracted from it. In other cases, all that is left is to "guess" (using mb_detect_encoding). See @Guerra's answer. Personally, though, I consider it bad practice to work with unknown encodings, so I want to understand your problem better in order to suggest a more appropriate solution.

  • Once again, the problem is making programmers aware that they should use UTF-8, and getting PHP to handle it properly: see this answer for more details.

6 answers

11


Assuming your server serves pages encoded as UTF-8, the default behavior of most user agents (browsers, etc.) is to use that same encoding when sending data back to the server (through forms/POST, for example). It is also possible to specify which encodings a form accepts through the accept-charset attribute. This way you do not need to "detect" anything: you are instructing the client side to send data already in the desired encoding.

See also this answer on the English-language Stack Overflow. One important point is that a browser that follows the standards will respect this encoding requirement, but it is always possible for a client (accidentally or maliciously) to send data in a different encoding. In that case, it is up to you to decide whether to try to fix the problem the client has created or to leave the burden on the client. Ordinary users with modern browsers will almost certainly never run into this kind of problem (but it costs nothing to run some tests, depending on your target audience).
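If you decide to leave the burden on the client, one option is simply to reject input that is not valid UTF-8. This is a minimal sketch, assuming the mbstring extension is available. The helper name acceptUtf8Only and the form field "comment" are hypothetical, used only for illustration:

```php
<?php
// Hypothetical helper: returns the value if it is a valid UTF-8 string, otherwise null.
function acceptUtf8Only($value)
{
    // mb_check_encoding() validates the byte sequence; it does not guess.
    if (is_string($value) && mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return null;
}

// Usage with a hypothetical form field named "comment":
$comment = acceptUtf8Only(isset($_POST['comment']) ? $_POST['comment'] : null);
if ($comment === null) {
    // Reject the request or ask the client to resubmit; do not store the value.
}
```

Validating is deterministic, unlike detection: a byte sequence either is or is not well-formed UTF-8.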


Update: based on your answer and @Guerra's, I believe it is not necessary to detect anything; simply using utf8_decode should be enough (as your users will always send UTF-8, and your connection to the database always expects ISO 8859-1, regardless of the encoding your database uses).
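One caveat for current PHP versions: utf8_decode and utf8_encode are deprecated as of PHP 8.2. A rough equivalent using mbstring might look like the sketch below; the function name utf8ToLatin1 is hypothetical:

```php
<?php
// Hypothetical helper: converts a UTF-8 string to ISO-8859-1 without utf8_decode().
function utf8ToLatin1($utf8)
{
    // Argument order is: string, target encoding, source encoding.
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// "ç" is the two bytes C3 A7 in UTF-8 and the single byte E7 in ISO-8859-1.
echo bin2hex(utf8ToLatin1("\xC3\xA7")), "\n";
```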

But if you want a robust solution, here’s what I suggest:

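function fixEncoding($in_str)
{
   // List the candidates explicitly (the default detection order may not include
   // ISO-8859-1) and use strict mode so invalid byte sequences are not accepted.
   $cur_encoding = mb_detect_encoding($in_str, array("UTF-8", "ISO-8859-1"), true);

   if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
   {
       return utf8_decode($in_str);
   }
   // mbstring reports this encoding as "ISO-8859-1" (with hyphens)
   elseif($cur_encoding == "ISO-8859-1" && mb_check_encoding($in_str,"ISO-8859-1"))
   {
       return $in_str;
   }
   else
   {
       // Untested:
       // return iconv($cur_encoding, "ISO-8859-1", $in_str);
       throw new Exception('Unsupported encoding.');
   }
}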
  • Just a small comment. PHP's mb_detect_encoding function detects whether a string contains ASCII or UTF-8 characters; it says nothing about the encoding of the page, which was the original question. It helps anyway.
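For the page-level case raised in the comment above, the declared encoding of an HTML document can often be read from its meta tag instead of being guessed. This is only a sketch: the function name declaredCharset is hypothetical, and the regular expression is a simplification that covers the two meta forms discussed in this thread, not a full HTML parser:

```php
<?php
// Hypothetical helper: returns the charset declared in an HTML document's
// <meta> tag, or null when none is found.
function declaredCharset($html)
{
    // Matches both <meta charset="UTF-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

echo declaredCharset('<meta charset="UTF-8">'), "\n";
```

A declared charset is only a claim made by the document; the bytes can still be checked with mb_check_encoding before they are trusted.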

9

Your question is a little vague about the specific problem you are encountering, so I leave here some considerations for correct interaction with user data, with data going to and from the server, and with the database, given that you indicated your database is running with the UTF-8 charset.

Note: This may not answer your question, but it seems relevant enough to help when dealing with encoding issues. Much more information could be added; feel free to add whatever you find useful.


Declarations for the browser

  • HTML pages

    HTML pages should always declare, through a META tag in the head, the charset that the browser should use to display and receive data:

    Example in HTML 5

    <!doctype html>
    <html>
      <head>
        <meta charset="UTF-8">
      </head>
      ...
    

    Example in HTML 4

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
    "http://www.w3.org/TR/html4/strict.dtd">
    
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      </head>
      ...
    
  • PHP files

    The main file responsible for displaying the HTML and handling user interaction (usually index.php) must declare the charset to be used at its very beginning, before any headers are sent to the browser:

    /* Setting charset for proper language
     * support, DB interaction, etc.
     */
    header('Content-Type: text/html; charset=UTF-8');
    

    This will ensure that the information sent to the browser and the information collected from it will be in UTF-8.

  • Posts to the server via HTML > PHP

    If PHP and the header of the HTML page are displaying the same Charset, as seen above, a normal post from a form on the page will send the browser information to the server in UTF-8.

    However, there is a way to indicate that the form should send the data to the server in a specific Charset:

    <form action="mytargetfile.php" accept-charset="UTF-8">
    

    This is not strictly necessary, because the "normal" behavior is the one described above. But it can be used without problems.

  • Posts to the server via Ajax > PHP

    Posts made via Ajax send the information following the declarations of the HTML page. The destination file that receives this information must work with the same charset.

    However, also here it is possible to specify the Charset to be used for sending data:

    $.ajax({
      data: parameters,
      type: "POST",
      url: ajax_url,
      // Form-style POST body; the parameter is written "charset=" (with "="), not "charset:"
      contentType: "application/x-www-form-urlencoded; charset=UTF-8",
      success: callback
    });
    

    The content type varies, of course, according to the content being sent, but it is followed by the charset to be used.


Care with the files

When editing or creating a file, we should always keep in mind that the file itself should be saved in the same charset as the information that will pass through it.

[Image: File encoding]

This is a small detail, but it ensures that the information is handled consistently as far as its encoding is concerned.
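A quick way to check how a file was actually saved is to validate its bytes. The sketch below assumes the mbstring extension; the function name inspectFileEncoding is hypothetical. It also reports a UTF-8 byte order mark (BOM), which matters in PHP files because those three bytes are sent as output before any call to header():

```php
<?php
// Hypothetical helper: reports whether a file is valid UTF-8 and whether it
// starts with a UTF-8 byte order mark (EF BB BF).
function inspectFileEncoding($path)
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }

    return array(
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
        'has_bom'    => strncmp($bytes, "\xEF\xBB\xBF", 3) === 0,
    );
}
```

Calling inspectFileEncoding on index.php, for example, would return an array such as array('valid_utf8' => true, 'has_bom' => false) for a file saved as UTF-8 without a BOM.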


Interaction with the Database

Here it is important to note that the connection we open to the database to save or read data must be using the same Charset that the data and code responsible for this operation are using:

Example of database connection via PDO indicating Charset:

<?php

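/**
 * Creates a new database connection.
 * (Shown as a plain function so the snippet runs on its own;
 * inside a class it can be declared as a protected method.)
 * @return PDO instance of the PDO connection
 */
function InitConnection() {

  $dbh = new PDO(
    // DSN values are not quoted; host and dbname are placeholders
    'mysql:host=meuServidor;dbname=minhaBD',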
    "utilizador",
    "password",
    array(
      PDO::ATTR_PERSISTENT               => false,
      PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true,
      PDO::ATTR_ERRMODE                  => PDO::ERRMODE_EXCEPTION,
      PDO::MYSQL_ATTR_INIT_COMMAND       => "SET NAMES utf8"
    )
  );

  return $dbh;
}

?>

Note that I am using "utf8" instead of "utf-8" because that is the name under which the database server knows this charset. The accepted names are defined by the database server, not by PHP or HTML. If you enter a name that doesn't exist ("utf-8", "bananas"), you get an error, and you know you will have to change it. In MySQL specifically, "utf8" covers only characters of up to 3 bytes; "utf8mb4" is the name for full UTF-8.
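As an alternative to SET NAMES, the PDO MySQL driver also accepts a charset parameter in the DSN itself (PHP 5.3.6 and later). A small sketch, where the helper name buildMysqlDsn is hypothetical and the host and database names are the same placeholders used in the example above:

```php
<?php
// Hypothetical helper that builds a MySQL DSN with the connection charset included.
function buildMysqlDsn($host, $dbname, $charset = 'utf8')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Usage (placeholder credentials, no connection is opened here):
// $dbh = new PDO(buildMysqlDsn('meuServidor', 'minhaBD'), 'utilizador', 'password');
echo buildMysqlDsn('meuServidor', 'minhaBD'), "\n";
```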

  • 2

    +1 Excellent answer, in particular the part about taking care with the files. A potential error, with quite damaging and sometimes irreversible consequences, is to declare one encoding in the metadata while the contents are actually in a different encoding. I have corrupted many files because of this error... (fortunately only personal files, nothing from work or, worse, from clients)

3

The best way I found to convert ISO 8859-1 characters to UTF-8 was this:

function fixEncoding($in_str)
{
  $cur_encoding = mb_detect_encoding($in_str) ;
  if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
    return $in_str;
  else
    return utf8_encode($in_str);
}

But in the case of HTML files just use this header:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 

I strongly recommend reading this article (in English). I found it very useful for understanding encoding, which in PHP can sometimes be a real pain.

For other formats the most suitable function would be iconv, but you would need to run some tests to make the conversion work dynamically based on the current encoding. See the PHP documentation for iconv.
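A sketch of that iconv idea, under one loud assumption: anything that is not valid UTF-8 is treated as a single fallback encoding that you must know in advance (ISO-8859-1 by default here). The function name toUtf8 is hypothetical:

```php
<?php
// Hypothetical helper: returns $str as UTF-8. Valid UTF-8 is returned unchanged;
// anything else is ASSUMED to be in $assumedEncoding and converted with iconv().
function toUtf8($str, $assumedEncoding = 'ISO-8859-1')
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    $converted = iconv($assumedEncoding, 'UTF-8', $str);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $assumedEncoding to UTF-8.");
    }
    return $converted;
}

echo bin2hex(toUtf8("caf\xE9")), "\n"; // the ISO-8859-1 byte E9 becomes C3 A9
```

The check comes first so that text already in UTF-8 is never encoded twice. Text in other multi-byte encodings (UTF-16, for instance) can happen to pass that check, so this only makes sense when the possible sources are known.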

Source: Here

  • utf8_encode($in_str) Are you sure this works? The documentation you indicated says the parameter data must be an ISO-8859-1 string. What happens if the detected encoding is in another format (say, UTF-16) and you pass this string to utf8_encode?

  • 1

    @mgibsonbr you are absolutely right, my mistake. I edited the answer for future reference.

1

Based on @Guerra's answer I was able to find the solution. My HTML page is set to the UTF-8 charset and so is my MySQL database. That is what makes this strange: when the function detects the string as UTF-8, I need to use utf8_decode so that accented characters are stored correctly in the database.

From what I understand, utf8_decode converts the string to ISO-8859-1. Can someone give a better explanation in the comments?

  function fixEncoding($in_str)
  {
       $cur_encoding = mb_detect_encoding($in_str) ;

       if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
       {
           return utf8_decode($in_str);
       }
       else
       {
           return $in_str;
       }
  }
  • Your database may be in UTF-8, but in which encoding does the library that interfaces with it expect its parameters?

  • How can I check that?

  • 1

    I don't know how you find out, but if you post the details here we can help... According to what you reported, though, I strongly suspect it is ISO-8859-1. You need to understand the difference between the encoding used by the database and the encoding you use for the connection to it. If the same library connects to 10 different databases, each with a different encoding, the SQL you pass to it will still always use the same encoding. It is the database's job to use the right encoding; your job is to interact with the library that makes the connection.

  • I don't know what you mean by 'interface library', so I can't tell you. I checked the collation of the database and the tables in HeidiSQL. I don't understand much about encoding, and learning more about it has been very productive. Thank you for the explanations.

  • 1

    Man, I know a lot about encoding but little about PHP... Maybe someone more experienced can help you more, but in the short term, if you are passing ISO-8859-1 to your database and it is accepting it, that is because that is what it is asking for, so don't worry too much about it... :)

  • Thanks, thanks. I just wanted to understand better anyway!
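The behavior discussed in these comments can be reproduced in isolation. The sketch below assumes, as suspected above but not confirmed, that the connection charset is ISO-8859-1 while the tables are UTF-8. Under that assumption the server treats every incoming byte as one ISO-8859-1 character and re-encodes it for storage:

```php
<?php
$utf8 = "\xC3\xA7"; // "ç" as UTF-8 bytes

// What an ISO-8859-1 connection effectively does with raw UTF-8 input:
// each byte is read as one character and encoded again.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// What utf8_decode-style conversion sends instead: the single ISO-8859-1 byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";          // c3a7
echo bin2hex($doubleEncoded), "\n"; // c383c2a7 (displays as "Ã§")
echo bin2hex($latin1), "\n";        // e7
```

That would explain why decoding first stores the accents correctly. Setting the connection charset to UTF-8 would make the conversion unnecessary.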

0

Portuguese language programmers: our charset is UTF8!

Briefly, this fact, for PHP programmers, entails two precautions:

  1. Pages, data, PHP scripts, everything must be encoded in UTF8. Be wary of the architecture, the library, the environment, of whatever is not representing Portuguese in UTF-8.

  2. Keep an eye on PHP: it is not "natively UTF8", and this can cause inconvenience. To overcome this problem, check out the tips and details in this answer.


Edit (after Bacco's comment)

It is not a matter of "personal preference"; it is a matter of respect, just as road signs are respected whether we like them or not.

It means complying with the following conventions, "de jure" and "de facto":

  • Our charset is ISO-8859-1, which has all the Latin characters and accents we use. It is more performant, because it uses 1 character x 1 byte storage. UTF-8 basically serves internationalization. With all due respect to Peter Krauss and his other ideas, which I like, this is not the case. UTF-8 may have its advantages, but this is not one of them. Moreover, if it were not for emoticons, Arabic, Chinese characters, etc., UTF-8 would only have disadvantages. I respect it as a personal choice (regardless of whether it is good or bad), but not when it is presented as an absolute fact.

  • Hi @Bacco, I made an edit and opened the answer as a wiki in case you want to edit some misc... It is a common confusion, and you are wrong when you say that ISO-8859-1 "is more performant, because it uses 1 character x 1 byte storage". UTF-8 is a great success and enjoys broad consensus precisely because, in the Western languages, which are the most used on the Internet, it preserves 1 byte (!!). This 1 byte holds precisely "all the Latin characters and accents we use" (the à, for example, is 195<255)... anyway, double confusion: see the English Wikipedia link I posted (the Portuguese Wikipedia is confusing).

  • I am not going to debate it, because I have commented on it before. I don't know how much you understand about how the encodings mentioned work, but I can say without fear that UTF-8 is slower and more complex to work with than ISO-8859-1. That 1-byte advantage is lost with Portuguese accented characters. As for the links, they are recommendations for data exchange between systems, not for general use. If at some point our schedules line up, we can discuss these questions in a chat or something. It is not about choosing the "best", but about understanding that no format is absolutely superior to the other.
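The storage side of this disagreement can be measured directly. A small sketch, assuming the mbstring extension; the function name byteLengthIn is hypothetical, and the sample word is written as explicit bytes so the result does not depend on how this file is saved:

```php
<?php
// Hypothetical helper: number of bytes a UTF-8 string occupies in a target encoding.
function byteLengthIn($utf8Text, $targetEncoding)
{
    // strlen() counts bytes, not characters.
    return strlen(mb_convert_encoding($utf8Text, $targetEncoding, 'UTF-8'));
}

$word = "a\xC3\xA7\xC3\xA3o"; // "ação" as UTF-8 bytes

echo byteLengthIn($word, 'UTF-8'), "\n";      // 6
echo byteLengthIn($word, 'ISO-8859-1'), "\n"; // 4
echo byteLengthIn('acao', 'UTF-8'), "\n";     // 4
```

Unaccented (ASCII) letters take 1 byte in both encodings; each accented Portuguese letter takes 2 bytes in UTF-8 and 1 byte in ISO-8859-1.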

-3

In the CSS: html:lang(en) { @charset "UTF-8"; }

In the HTML: <html lang="en">

In the HEAD: <meta charset="UTF-8">

I use HTML5 with PHP version 7.3 and it works.
