What can be considered a character?

25

In another question, Is it bad practice to put numbers as id in HTML elements? If yes why?, the use of numbers as the id of HTML elements is discussed.

After a few minutes I noticed a great deal of confusion about characters, on my part and maybe on the part of other people as well.

I would like some clarification on the subject:

  • What can be considered a character in programming?
  • Is there a difference between a character and an alphanumeric character?
  • Mathematically speaking, is it the same as in programming?

There is also a comment saying:

If it were only letters and numbers, they would have said so, as they did in other excerpts: "Character in the range U+0041 to U+005A", i.e., specifying which characters they are.

  • Are U+0041 and U+005A also considered characters?
  • Complementing Luciano’s (excellent) answer: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  • Related: https://answall.com/q/394834/112052

4 answers

26


Responding point by point:

What can be considered a character in programming?

For a long time the definition "1 character = 1 byte" was used. Today, within programming, the best definition is "a character is a sign defined in the Unicode table"; it can be represented by 1 to 4 bytes, depending on the encoding. I explain Unicode and encodings further down.

The classical and obsolete definition: 1 character = 1 byte

It is worth understanding this definition, and also why it is obsolete. In one byte we can represent 256 different symbols.

The ASCII table (pronounced "ASS-kee", not "ASC-two") is an important standard to this day, but it uses only the first 128 possible values to represent the uppercase and lowercase letters (without accents), digits, some symbols, and "invisible" characters that indicate line breaks, tabs and other operations which made sense when computers had no screen and could only output text on a teletype (which is why, in many programming languages, the function that displays text is called print).
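
As a quick illustration (mine, not part of the original answer), Python's built-in ord() and chr() expose these ASCII codes directly, including the "invisible" control characters:

>>> ord('A')        # code of the letter A in the ASCII (and Unicode) table
65
>>> chr(65)         # the reverse operation: code -> character
'A'
>>> chr(10)         # an "invisible" character: the line feed
'\n'
>>> print('one' + chr(10) + 'two')   # print just sends characters to the output
one
two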

This 128-character set of the ASCII table is practically only sufficient for text in English, a language that does not use accents (and even to that there are exceptions).

For languages that use accents, or non-Latin characters, companies and governments created alternative tables using the 128 byte values that ASCII does not use. In Brazil, many data sources use the ISO-8859-1 table or derivatives of it, such as Windows-1252. This solution creates several problems in data transmission between countries with different languages, because different languages need different tables, so the meaning of a byte is not universal. The Unicode standard was created to eliminate this confusion.
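
To make the problem concrete (an illustration of mine, not from the original answer): the very same byte means different characters depending on which legacy table you assume. In Python 3:

>>> dado = b'\xe9'          # a single byte, value E9 in hexadecimal
>>> dado.decode('latin-1')  # interpreted with ISO-8859-1 (Western Europe)
'é'
>>> dado.decode('cp1251')   # interpreted with Windows-1251 (Cyrillic)
'й'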

Unicode

With the expansion of the Internet, there arose the need for a single standard to represent characters, one that meets the needs of all human languages and also covers mathematical symbols, emojis and many other signs. That standard is Unicode. Note that Unicode and UTF-8 are related, but they are different things. First let's talk only about Unicode.

On the Unicode.org site you can find PDF tables where you can see Arabic, Chinese and Egyptian characters (from the time of the pharaohs), emojis, etc. It is worth a visit. The Unicode standard provides codes for 1,114,112 possible characters, but its latest version uses just over 10% of that code space. More than 100,000 of those characters are intended for just three languages: Chinese, Korean and Japanese.
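
Just to check those numbers in the interpreter (my own aside, not from the answer): the highest valid codepoint is U+10FFFF, so the code space really has 1,114,112 positions:

>>> import sys
>>> sys.maxunicode          # highest codepoint Python accepts
1114111
>>> hex(sys.maxunicode)
'0x10ffff'
>>> sys.maxunicode + 1      # total size of the Unicode code space
1114112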

Is there a difference between a character and an alphanumeric character?

Informally, we use "alphanumeric character" to refer only to the letters from A to Z and the digits from 0 to 9. Often this informal definition excludes accented letters.

Formally, the Unicode standard defines properties linked to each character; one of them says whether it is a letter, a number, or another type of sign.

Modern programming languages, such as Java and Python 3, accept accented characters and even non-Latin letters - such as Chinese ideograms - in variable identifiers, etc. Compilers or interpreters of these languages use a Unicode property to decide what is or is not a "letter".
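
A quick way to see these properties in action (an illustration of mine, not part of the original answer) is through Python's string methods and the unicodedata module:

>>> import unicodedata
>>> 'é'.isalpha(), '中'.isalpha(), '9'.isalpha()
(True, True, False)
>>> '9'.isdigit(), '♞'.isalnum()
(True, False)
>>> unicodedata.category('A'), unicodedata.category('9'), unicodedata.category('♞')
('Lu', 'Nd', 'So')

Here 'Lu' stands for "uppercase letter", 'Nd' for "decimal digit number" and 'So' for "other symbol" - these are the properties that the compilers and interpreters mentioned above rely on.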

In the official Unicode.org documents, characters are formally called codepoints and are identified by a hexadecimal code with the U+ prefix and by a unique name. For example:

U+0041  A       LATIN CAPITAL LETTER A
U+0042  B       LATIN CAPITAL LETTER B
U+0043  C       LATIN CAPITAL LETTER C

These three examples are characters from the ASCII table, and their Unicode codes are the same as in ASCII: the code of the letter A is 41 in hexadecimal, or 65 in decimal. Now see some emojis:

U+1F600   GRINNING FACE
U+1F601   GRINNING FACE WITH SMILING EYES
U+1F602   FACE WITH TEARS OF JOY
U+1F603   SMILING FACE WITH OPEN MOUTH

Looking at the codes above, and knowing that the largest hexadecimal number that fits in one byte is FF, how do we represent on the computer the thousands of characters beyond the first 256? That is where encoding comes in.

Encoding

Mathematically speaking, is it the same as in programming?

An encoding is an algorithm for converting a codepoint - such as U+1F601 - into bytes, for storage on the computer or for transfer over the network, and also for the reverse operation, from bytes back to a codepoint.

See how this happens in the interactive Python 3 interpreter (the basic idea is the same regardless of the programming language):

>>> cavalo = '\u265e'
>>> print(cavalo)
♞
>>> cavalo.encode('utf-8')
b'\xe2\x99\x9e'

In a Python string, codepoints from U+0000 to U+FFFF can be represented by the escape sequence '\uXXXX', where 'XXXX' stands for 4 hexadecimal digits (neither more nor less, always 4). Note that the string assigned to cavalo contains only one character, U+265E (BLACK CHESS KNIGHT). For codepoints from U+10000 onward it is necessary to use the prefix '\U' (uppercase) and 8 hexadecimal digits, neither more nor less:

>>> cara = '\U0001F601'
>>> print(cara)
😁
>>> cara.encode('utf-8')
b'\xf0\x9f\x98\x81'

To save any string to a file, or transmit it over the network, you need to encode it. There are several encodings in use, but the most common, and the one recommended as the standard by the W3C, is UTF-8. To encode a string in Python, we use the .encode() method, as in the examples above.

Note that the chess knight is encoded in 3 bytes in UTF-8, and the happy face is encoded in 4 bytes. This shows that the number of bytes varies according to the encoding and to the character. An important advantage of UTF-8 is that the original 128 ASCII characters are encoded in a single byte each, with the same codes as in the ASCII table.
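
Complementing the examples above (my own addition, not from the original answer): pure ASCII text occupies one byte per character in UTF-8, .decode() does the reverse path from bytes back to the string, and saving to a file also involves an encoding (the file name below is just an example):

>>> 'A'.encode('utf-8')                   # ASCII characters keep their 1-byte codes
b'A'
>>> len('abc'.encode('utf-8')), len('ação'.encode('utf-8'))
(3, 6)
>>> b'\xe2\x99\x9e'.decode('utf-8')       # the reverse operation: bytes -> codepoint
'♞'
>>> with open('exemplo.txt', 'w', encoding='utf-8') as f:
...     f.write(cavalo)                   # writes the knight encoded as UTF-8
...
1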

Are U+0041 and U+005A also considered characters?

These two codepoints represent the letters A and Z, as we can see with Python:

>>> '\u0041'
'A'
>>> '\u005a'
'Z'

In addition to being characters, they are alphanumeric characters.

Happy hacking!

8

What can be considered a character in programming?

A character is a symbol, visible or not.

Is there a difference between a character and an alphanumeric character?

Alphanumeric is a category represented by A-Z and 0-9, which humans use to communicate measures and digits. In ASCII there are also control characters, such as \n and \r.
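
A small illustration in Python (mine, not the answerer's) of how a control character such as \n is still a character, although not an alphanumeric or visible one:

>>> len('\n')                 # a control character is one character
1
>>> 'a'.isalnum(), '7'.isalnum(), '\n'.isalnum()
(True, True, False)
>>> '\n'.isprintable()        # and it is not a visible symbol
False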

  • Hmmm, you forgot the lowercase letters in the alphanumerics...

7

Much can be said about this, but a word to the wise is enough.

Here is some representative code in a few languages.

C# / Java

char um = '1';

I think that in the strongly typed world the discussion is already settled.

Let’s go to the weakly typed world:

Javascript

var um = '1';

PHP

$um = '1';

In both cases, the character type does not exist. What we have above are strings. But if you try to get the first character of each string...

Finally, the discussion started because the W3C spoke of characters and was not clear about it. And since every discussion that takes more than half an hour tends to descend into a semantic maze (BLOCH, 2001)¹, it remained up in the air whether a number can be a character or not.

Well, directly from the W3C itself, on the Structure of HTML 5:

The ASCII digits are the characters in the range ASCII digits.

Further on, about how parsing is done:

Collect a sequence of characters that are ASCII digits, and interpret the resulting sequence as an integer in base ten.

There are other mentions of digits being characters, or even entire strings, further down in the text.


Ah, just one more thing:

Is there a difference between a character and an alphanumeric character?

Both are groups of characters, and one is contained in the other.

Anything you can see in a text (and even some things you cannot see) is a character. But alphanumeric characters are those captured by the following regular expression:

[a-zA-Z0-9\u00C0-\u00FF]

Mathematical operators (+, -, /, %, *, !), despite the "mathematics" in their name, are neither numeric nor alphabetic, for example. The space is not alphanumeric either.
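
A minimal check of that expression in Python (my addition, not in the original answer):

>>> import re
>>> alfanum = re.compile('[a-zA-Z0-9\u00C0-\u00FF]')
>>> [bool(alfanum.match(c)) for c in ['a', 'Z', '7', 'ç', '+', ' ']]
[True, True, True, True, False, False]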

Mathematically speaking, is it the same as in programming?

It depends on the algebra and on the universe used ;) For example, in linear algebra there is a thing called the Levenshtein distance, in which characters (all of them) are points in a multidimensional space. It is used to determine, for example, how similar two words are. In the everyday algebra we use to pay boletos (bank slips), on the other hand, the concept of character does not even exist.
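
As an aside of mine (not part of the original answer), the usual way to compute the Levenshtein distance is a small dynamic-programming routine that counts the character insertions, deletions and substitutions needed to turn one word into another:

def levenshtein(a, b):
    # distances between '' and each prefix of b
    anterior = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        atual = [i]                       # distance between a[:i] and ''
        for j, cb in enumerate(b, 1):
            atual.append(min(anterior[j] + 1,                 # delete ca
                             atual[j - 1] + 1,                # insert cb
                             anterior[j - 1] + (ca != cb)))   # substitute (or keep)
        anterior = atual
    return anterior[-1]

print(levenshtein('casa', 'caso'))   # 1: one substitution
print(levenshtein('casa', 'gato'))   # 3: three substitutions

The fewer the edits, the more similar the two words are.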

¹ Murphy's Complete Law, Arthur Bloch, 2001, Record.

  • Great reference. Kind of old, but I think canonical.

  • @Jeffersonquesado when I went looking for it, I could have sworn the author was Millor Fernandes, I don't know why.

  • "It depends on the algebra and the universe used". OK, I believe there is nothing else I can contribute in an answer.

  • Another branch of mathematics that uses "letters" is formal languages/formal grammars.

1

Good afternoon, this is my first contribution; I hope it is a valid one.

A character is a position that holds a symbol, whether it is meant for humans or not (line feed, for example, is not for humans).

At the beginning of computing, 1 character was one byte: 7 bits were enough for uppercase, lowercase, numbers and a few more things, plus one extra bit for parity. As things evolved, the parity bit died and we got 256 characters.

With evolution and globalization, 256 were no longer enough and code pages appeared: depending on the country, bytes above 127 were interpreted according to the special characters of that code page.

Later came UTF-8 and Unicode, and 1 character went from being one byte to being defined by up to 4 bytes.

Databases stopped using "varchar" and started using types such as "vargraphic" (in DB2), so that a field of 40 can hold either 40 Chinese symbols or 40 Latin characters. If it were varchar, it would only hold 10 to 20 Chinese symbols.

HTML files started to use UTF-8 as their encoding so that, anywhere in the world, we can read Chinese, Arabic, Greek or any other character collection.
