RFC 3986 does not specify which encoding should be used for non-ASCII characters.
URL encoding uses a pair of hexadecimal digits, which is equivalent to 8 bits (one byte).
It would be possible to represent all non-ASCII characters within this scheme. However, what made it unviable is that many languages have their own 8-bit standard for representing their characters. Furthermore, in languages such as Chinese, many characters simply do not fit in 8 bits.
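As an illustration (a minimal sketch in Python, not part of the original answer), you can see the variable number of UTF-8 bytes per character:

```python
# Sketch: UTF-8 uses a variable number of bytes per character,
# so a single %HH pair cannot cover every character.
for ch in ["a", "ç", "中", "😀"]:
    encoded = ch.encode("utf-8")
    hex_pairs = " ".join(f"{b:02X}" for b in encoded)
    print(f"{ch!r}: {hex_pairs} ({len(encoded)} byte(s))")

# 'a': 61 (1 byte(s))
# 'ç': C3 A7 (2 byte(s))
# '中': E4 B8 AD (3 byte(s))
# '😀': F0 9F 98 80 (4 byte(s))
```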
Therefore, RFC 3629 was adopted, which proposed standardizing non-ASCII characters with the UTF-8 encoding.
It is important to understand that within the ASCII group there are reserved and unreserved characters.
In the table of unreserved characters, we have:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
These are the reserved ones:
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
Note that ~ is not reserved; it can still be encoded, but the recommendation is not to encode it.
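A quick sketch of this distinction, using Python's urllib.parse.quote purely for illustration (unreserved characters, including ~, pass through untouched; reserved ones get the %HH form):

```python
from urllib.parse import quote

# Unreserved characters are left as-is; note ~ is not encoded.
print(quote("AZaz09-_.~"))  # AZaz09-_.~

# Reserved characters are percent-encoded (safe="" so that even
# "/" is encoded, since quote() keeps it safe by default).
print(quote("!*'();:@&=+$,/?%#[]", safe=""))
# %21%2A%27%28%29%3B%3A%40%26%3D%2B%24%2C%2F%3F%25%23%5B%5D
```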
What happens in the example you posted, with the c-cedilla (ç)?
Obviously, since ç is not ASCII, it is treated as UTF-8, as recommended by the above-mentioned RFC 3629.
This in itself explains why it is encoded in UTF-8, represented as 2 hexadecimal pairs.
"ç" is encoded in UTF-8 with 2 bytes C3
(Hex) and A7
(Hex), being represented in this format "%C3" and "%A7" respectively. The scope %HH%HH. The pair A7
is what identifies as UTF-8.
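A small demonstration of exactly this, again assuming Python's standard library just for illustration:

```python
from urllib.parse import quote

# "ç" (U+00E7) -> two UTF-8 bytes -> two %HH pairs
print("ç".encode("utf-8"))  # b'\xc3\xa7' (bytes C3 and A7)
print(quote("ç"))           # %C3%A7
```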
Browsers only display the decoded form. And many protocols transmit UTF-8 directly, without having to format it into the %HH scheme, whether it takes 1 or 2 pairs.
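And the reverse direction, which is roughly what the browser does when displaying the URL (a sketch using unquote):

```python
from urllib.parse import unquote

# Decode the %HH pairs back into UTF-8 text for display.
print(unquote("%C3%A7"))  # ç
```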
* byte != bit
* URL encoding != HTML entities
Out of curiosity, browsers have supported multibyte characters in the URL for some years.
A table here can help: https://pt.wikipedia.org/wiki/UTF-8 - especially the part that explains the variable number of bytes per character.
– Bacco