Why is 'ç' converted to %C3%A7 in a URL, not %E7?


3

When I was encoding the 'ç' character for the query string (the part where the parameters go) of a URL, I got:

%C3%A7

% introduces a byte in hexadecimal, but why do so many characters (including 'ç') have to be specified with two hexadecimal bytes?

And how can %C3%A7 represent 'ç'? Couldn't the 'ç' character be specified with just the single byte %E7 (231)?

To clarify: the intention is to understand how the character 'ç' is encoded, that is, how it becomes %C3%A7.

  • 1

    The table here can help: https://pt.wikipedia.org/wiki/UTF-8 - especially the part that explains the variable number of bytes per character.

2 answers

7


RFC 3986 does not specify which encoding should be used for non-ASCII characters.

URL percent-encoding uses a pair of hexadecimal digits, which is equivalent to 8 bits. In principle, all characters could be represented within that single byte. What made this unworkable is that many languages had their own 8-bit standard for representing their characters, and in languages such as Chinese many characters do not fit in 8 bits at all.

For that reason RFC 3629 was adopted, which standardized the encoding of non-ASCII characters as UTF-8.
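This is exactly what standard URL-encoding routines do today. A minimal sketch in Python (the use of urllib.parse.quote and the example strings are my own illustration, not part of the original answer):

from urllib.parse import quote

# Percent-encoding 'ç' via its UTF-8 bytes, as RFC 3629 recommends:
print(quote('ç'))                      # '%C3%A7'

# The same character percent-encoded from a legacy 8-bit charset
# (Latin-1) would give the single byte the question expected:
print(quote('ç', encoding='latin-1'))  # '%E7'

A receiver cannot tell from %E7 alone which 8-bit charset was meant, which is precisely why standardizing on UTF-8 was necessary.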

It is important to understand that within the ASCII group there are reserved and unreserved characters.

In the table of unreserved characters we have:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~

These are the reserved ones:

! * ' ( ) ; : @ & = + $ , / ? % # [ ]

Note that ~ is not reserved; it may nevertheless be percent-encoded. The recommendation, however, is not to encode it.
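A quick way to see this behaviour is with Python's urllib.parse.quote; a sketch with example strings of my own choosing:

from urllib.parse import quote

print(quote('AZaz09-_.~'))      # 'AZaz09-_.~' -> unreserved characters pass through untouched
print(quote('a b!'))            # 'a%20b%21'   -> space and '!' are percent-encoded
print(quote('/path', safe=''))  # '%2Fpath'    -> reserved '/' is encoded once it is no longer marked safe

By default quote() treats '/' as safe, which is why it has to be removed from the safe set to see it encoded.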

What happens in the posted example, with the c-cedilla?
Since ç is not ASCII, it is treated as UTF-8, as recommended by the above-mentioned RFC 3629.

That alone explains why it is encoded in UTF-8 and represented by two hexadecimal pairs.

"ç" is encoded in UTF-8 with 2 bytes C3 (Hex) and A7 (Hex), being represented in this format "%C3" and "%A7" respectively. The scope %HH%HH. The pair A7 is what identifies as UTF-8.

Browsers usually display only the decoded form. And many protocols transmit the UTF-8 bytes directly, without formatting them into the %HH pattern, whether the character takes one byte or two.

* byte != bit
* URL encoding != HTML entities

Out of curiosity: browsers have supported showing multibyte characters directly in the URL for some years now.


5

The string %C3%A7 is the percent-encoded UTF-8 representation of the ç character for use in URLs.
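Decoding goes the other way around: the %HH pairs are turned back into bytes and those bytes are read as UTF-8. A minimal round trip in Python (my own illustration, not part of the original answer):

from urllib.parse import quote, unquote

encoded = quote('ç')        # '%C3%A7'
decoded = unquote(encoded)  # 'ç' (the bytes C3 A7 read back as UTF-8)
print(encoded, decoded)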

References:
http://www.w3schools.com/tags/ref_urlencode.asp
https://en.wikipedia.org/wiki/UTF-8

Another interesting page:
http://www.fileformat.info/info/unicode/char/00e7/index.htm

Online tools:
http://www.url-encode-decode.com/
http://meyerweb.com/eric/tools/dencoder/

Official definition:
https://tools.ietf.org/html/rfc3629

Transformation of E7 into C3A7

The code point 0xE7 needs two bytes in UTF-8: its top two bits go into the lead-byte template 110xxxxx and its low six bits go into the continuation-byte template 10xxxxxx:

E7: 11 100111
    ^^ ^^^^^^

110x xxxx | 10xx xxxx
1100 0011 | 1010 0111 --> C3A7
       ^^     ^^ ^^^^
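The same bit manipulation written out in Python, assuming the two-byte templates shown above (the variable names are mine):

code_point = 0x00E7  # 'ç', binary 11 100111

# Top two bits go into the lead byte (110xxxxx),
# low six bits into the continuation byte (10xxxxxx).
lead = 0b11000000 | (code_point >> 6)
cont = 0b10000000 | (code_point & 0b00111111)

print(f'{lead:02X} {cont:02X}')           # C3 A7
print('ç'.encode('utf-8').hex().upper())  # C3A7 (same result)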
  • The references still do not answer the question; I have now updated the question to stress the difference between UTF-8 encoding and URL encoding.

  • 1

    E7 is the Unicode code point of the character ç. Its UTF-16 encoding is 00E7. Its UTF-8 encoding is C3A7.

  • Unfortunately I still cannot understand how UTF-8 makes 'ç' become C3A7; there is no explanation. Again, how can C3 A7 represent the character 'ç' if its code point is 0xE7?
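To make the distinction raised in these comments concrete, a quick check in Python (my own illustration): the Unicode code point, the UTF-16 bytes and the UTF-8 bytes of 'ç' are three different things, and the URL carries the UTF-8 bytes.

ch = 'ç'
print(hex(ord(ch)))                  # 0xe7 -> Unicode code point U+00E7
print(ch.encode('utf-16-be').hex())  # 00e7 -> UTF-16 (big-endian) bytes
print(ch.encode('utf-8').hex())      # c3a7 -> UTF-8 bytes, hence %C3%A7 in the URL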
