RFC 3986 does not specify which encoding should be used for non-ASCII characters.
URL encoding uses a pair of hexadecimal digits, which is equivalent to 8 bits (one byte).
It would be possible to represent all non-ASCII characters within this scheme. However, what made it unviable is that many languages have their own 8-bit standard for representing their characters. Furthermore, in languages such as Chinese, many characters simply do not fit in 8 bits.
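As an illustration (a minimal sketch in Python, not part of the original answer), you can see the variable number of UTF-8 bytes per character:

```python
# Sketch: UTF-8 uses a variable number of bytes per character,
# so a single %HH pair cannot cover every character.
for ch in ["a", "ç", "中", "😀"]:
    encoded = ch.encode("utf-8")
    hex_pairs = " ".join(f"{b:02X}" for b in encoded)
    print(f"{ch!r}: {hex_pairs} ({len(encoded)} byte(s))")

# 'a': 61 (1 byte(s))
# 'ç': C3 A7 (2 byte(s))
# '中': E4 B8 AD (3 byte(s))
# '😀': F0 9F 98 80 (4 byte(s))
```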
Therefore, RFC 3629 was adopted, which proposed standardizing non-ASCII characters with the UTF-8 encoding.
It is important to understand that within the ASCII group there are reserved and unreserved characters.
In the table of unreserved characters, we have:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~
These are the reserved ones:
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
Note that ~ is not reserved; it can still be encoded, but the recommendation is not to encode it.
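A quick sketch of this distinction, using Python's urllib.parse.quote purely for illustration (unreserved characters, including ~, pass through untouched; reserved ones get the %HH form):

```python
from urllib.parse import quote

# Unreserved characters are left as-is; note ~ is not encoded.
print(quote("AZaz09-_.~"))  # AZaz09-_.~

# Reserved characters are percent-encoded (safe="" so that even
# "/" is encoded, since quote() keeps it safe by default).
print(quote("!*'();:@&=+$,/?%#[]", safe=""))
# %21%2A%27%28%29%3B%3A%40%26%3D%2B%24%2C%2F%3F%25%23%5B%5D
```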
What happens in the example you posted, with the c-cedilla (ç)?
Obviously, since ç is not ASCII, it is treated as UTF-8, as recommended by the above-mentioned RFC 3629.
This in itself explains why it is encoded in UTF-8, represented as 2 hexadecimal pairs.
"ç" is encoded in UTF-8 with 2 bytes C3
(Hex) and A7
(Hex), being represented in this format "%C3" and "%A7" respectively. The scope %HH%HH. The pair A7
is what identifies as UTF-8.
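A small demonstration of exactly this, again assuming Python's standard library just for illustration:

```python
from urllib.parse import quote

# "ç" (U+00E7) -> two UTF-8 bytes -> two %HH pairs
print("ç".encode("utf-8"))  # b'\xc3\xa7' (bytes C3 and A7)
print(quote("ç"))           # %C3%A7
```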
Browsers only display the decoded form. And many protocols transmit UTF-8 directly, without having to format it into the %HH scheme, whether it takes 1 or 2 pairs.
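And the reverse direction, which is roughly what the browser does when displaying the URL (a sketch using unquote):

```python
from urllib.parse import unquote

# Decode the %HH pairs back into UTF-8 text for display.
print(unquote("%C3%A7"))  # ç
```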
* byte != bit
* URL encoding != HTML entities
Out of curiosity, browsers have supported multibyte characters in the URL for some years.
A table here can help: https://pt.wikipedia.org/wiki/UTF-8 - especially the part that explains the variable number of bytes per character.
– Bacco