Responding point by point:
What can be considered a character in programming?
For a long time, the definition "1 character = 1 byte" was used. Today, in programming, the best definition is "a character is a symbol defined in the Unicode table"; it can be represented by 1 to 4 bytes, depending on the encoding. I’ll explain Unicode and encoding further down.
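A quick sketch in Python 3 makes that 1-to-4-byte range concrete (the characters below are just illustrative examples; .encode() uses UTF-8 by default):
>>> len('a'.encode('utf-8'))
1
>>> len('é'.encode('utf-8'))
2
>>> len('♞'.encode('utf-8'))
3
>>> len('😁'.encode('utf-8'))
4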
The classic, now obsolete definition: 1 character = 1 byte
It is worth understanding this definition, and also why it is obsolete. One byte can represent 256 different symbols.
The ASCII table (pronounced "ASS-kee", not "A-S-C two") is an important standard to this day, but it uses only the first 128 possible values to represent the uppercase and lowercase letters (without accents), digits, some symbols, and "invisible" characters that indicate line breaks, tabs and other operations that made sense when computers had no screen and could only output text on a teletype (which is why, in many programming languages, the function that displays text is called print).
This 128-character set of the ASCII table is practically only enough for English text, a language that does not use accents (and even then there are exceptions).
For languages that use accents, or non-Latin characters, companies and governments created alternative tables using the 128 byte values that ASCII does not use. In Brazil, many data sources use the ISO-8859-1 table or derivatives of it, such as Windows-1252. This solution creates several problems in data exchange between countries that use different languages, because different languages need different tables, so the meaning of the bytes is not universal. The Unicode standard was created to eliminate this confusion.
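To make the problem concrete, here is a small Python sketch (the word 'ação' is just an example): the same accented text becomes different bytes under ISO-8859-1 (latin-1) and UTF-8, and reading the bytes with the wrong table produces garbage:
>>> 'ação'.encode('latin-1')
b'a\xe7\xe3o'
>>> 'ação'.encode('utf-8')
b'a\xc3\xa7\xc3\xa3o'
>>> 'ação'.encode('utf-8').decode('latin-1')
'aÃ§Ã£o'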
Unicode
With the expansion of the Internet, the need arose for a single standard to represent characters, one that meets the needs of all human languages, plus mathematical symbols, emojis and many other signs. That standard is Unicode. Note that Unicode and UTF-8 are related but different things. First let’s talk only about Unicode.
On the Unicode.org site you can find PDF tables where you can see Arabic, Chinese and Egyptian characters (from the time of the pharaohs), emojis, etc. It is worth a visit. The Unicode standard provides codes for 1,114,112 possible characters, but its latest version uses just over 10% of that code space. More than 100,000 of those characters are dedicated to just three languages: Chinese, Korean and Japanese.
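In Python that code space is easy to check (a quick sketch): chr() converts a number into the corresponding character and only accepts values up to 0x10FFFF, that is, 1,114,112 codes counting from zero:
>>> 0x10FFFF + 1
1114112
>>> chr(0x10FFFF)
'\U0010ffff'
>>> chr(0x110000)
Traceback (most recent call last):
  ...
ValueError: chr() arg not in range(0x110000)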
Is there a difference between a character and an alphanumeric character?
Informally, we use "alphanumeric character" to refer only to the letters from A to Z and the digits from 0 to 9. Often this informal definition excludes accented letters.
Formally, the Unicode standard defines properties attached to each character; one of them says whether it is a letter, a number or another kind of sign.
Modern programming languages, such as Java and Python 3, accept accented characters and even non-Latin letters - such as Chinese ideograms - in variable identifiers, etc. The compilers or interpreters of these languages use a Unicode property to decide what is and is not a "letter".
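A sketch of how this can be inspected in Python: unicodedata.category() reports the Unicode category of a character, and str.isidentifier() tells whether a name is a valid identifier (the names below are just examples):
>>> import unicodedata
>>> unicodedata.category('A')   # Lu = Letter, uppercase
'Lu'
>>> unicodedata.category('7')   # Nd = Number, decimal digit
'Nd'
>>> unicodedata.category('♞')   # So = Symbol, other
'So'
>>> 'variável'.isidentifier()   # accented letters are valid in Python 3 identifiers
True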
In the official Unicode.org documents, characters are formally called codepoints and are identified by a hexadecimal code with the U+ prefix and a unique name. For example:
U+0041 A LATIN CAPITAL LETTER A
U+0042 B LATIN CAPITAL LETTER B
U+0043 C LATIN CAPITAL LETTER C
These three examples are ASCII table characters, and their Unicode codes are the same as in the ASCII table: the code of the letter A is 41 in hexadecimal, or 65 in decimal. Here are some emojis:
U+1F600 GRINNING FACE
U+1F601 GRINNING FACE WITH SMILING EYES
U+1F602 FACE WITH TEARS OF JOY
U+1F603 SMILING FACE WITH OPEN MOUTH
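Python’s unicodedata module exposes these official names, and they work in both directions (a quick sketch):
>>> import unicodedata
>>> hex(ord('A'))
'0x41'
>>> unicodedata.name('A')
'LATIN CAPITAL LETTER A'
>>> unicodedata.name('\U0001F600')
'GRINNING FACE'
>>> unicodedata.lookup('FACE WITH TEARS OF JOY')
'😂'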
Looking at the codes above, and knowing that the largest hexadecimal number that fits in one byte is FF, how do we represent on the computer the thousands of characters beyond the first 256? That is where encoding comes in.
Encoding
Mathematically speaking, is it like programming?
An encoding is an algorithm to convert a codepoint - such as U+1F601 - into bytes for storage on the computer or transmission over the network, and also for the reverse operation, from bytes back to a codepoint.
See how this happens in the interactive Python 3 interpreter (the basic idea is the same, regardless of the programming language):
>>> cavalo = '\u265e'
>>> print(cavalo)
♞
>>> cavalo.encode('utf-8')
b'\xe2\x99\x9e'
In a Python string, the codepoints from U+0000 to U+FFFF can be represented by the escape sequence '\uXXXX', where 'XXXX' is 4 hexadecimal digits (no more, no less, always 4). Note that the string assigned to cavalo
contains only one character, U+265E (BLACK CHESS KNIGHT). For codepoints from U+10000 on, it is necessary to use the prefix '\U' (uppercase) and 8 hexadecimal digits, no more, no less:
>>> cara = '\U0001F601'
>>> print(cara)
😁
>>> cara.encode('utf-8')
b'\xf0\x9f\x98\x81'
To save any string in a file, or transmit it over a network, you need to encode it. Several encodings are in use, but the most common, and recommended as the default by the W3C, is UTF-8. To encode a string in Python we use the .encode() method, as in the examples above.
Note that the chess knight is encoded as 3 bytes in UTF-8, and the happy face as 4 bytes. This shows that the number of bytes varies according to the encoding and the character. An important advantage of UTF-8 is that the original 128 ASCII characters are encoded with only one byte each, with the same code as in the ASCII table.
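The reverse operation, decoding, uses the .decode() method of bytes; a quick sketch of the round trip, plus the one-byte ASCII case:
>>> b'\xe2\x99\x9e'.decode('utf-8')
'♞'
>>> 'A'.encode('utf-8')
b'A'
>>> len('A'.encode('utf-8'))
1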
Are U+0041 and U+005A also considered characters?
These two codepoints represent the letters A and Z, as we can see with Python:
>>> '\u0041'
'A'
>>> '\u005a'
'Z'
In addition to being characters, they are alphanumeric characters.
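Python’s str.isalnum() follows the formal, Unicode-based definition, so it also accepts accented letters, unlike the informal definition mentioned earlier (a quick check):
>>> '\u0041'.isalnum()
True
>>> 'ç'.isalnum()
True
>>> '♞'.isalnum()
False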
Happy hacking!
Complementing Luciano’s (excellent) response: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
– rmonico
Related: https://answall.com/q/394834/112052
– hkotsubo