Is there a problem using Unicode characters for code identifiers?

Viewed 1,067 times

35

Nowadays it is common for programming language compilers to accept source files containing Unicode characters.

This is useful, especially for those of us who write in Portuguese or other languages that go beyond ASCII, to create strings with accents and to write better comments in our own language.

But using accented identifiers in code is unusual, and some people even recommend against it.

I don't usually do it myself, but it seems to read better in cases like this (just an isolated example, in no particular language):

class Validação {
    bool ÉValido;
    ...
}

Is there any technical reason to avoid accents and other Unicode characters in identifiers?

If there is no technical problem, is there any practical reason to avoid them?

Does it depend on the programming language? Assume the language fully supports accented characters.

Does it matter whether the code is proprietary and developed by a small, closed team, or widely developed, possibly in the open?

Is there any specific care we should take when using accents in identifiers?

When does using non-ASCII characters become abuse?

  • 7

    I asked this question precisely to demonstrate how to write something suitable that still leaves room for breadth and opinion. Programmers also need to learn how to write specifications, whether in programs or in questions, to negotiate with their peers, defend their arguments, work around obstacles without breaking established rules, and communicate so that everyone understands. You have to demonstrate real effort and a real need in the question. Most closed questions could be saved if this were always done. In some cases third parties can save them; in others only the OP can do it.

  • Curious tag, estilo-de-codificação: it appears as one of your top 3 tags on this page... Another one with a curious tag is Math: string. Who the heck is the master of string? :D

  • @brasofilo me :D http://answall.com/tags/string/topusers I'm about to earn a badge for it :D

  • @brasofilo hahahahaha... I've thought about that too. The worst part is that every time I answer a string question it gets a lot of upvotes. My only two "Nice Answer" badges are from string :p

2 answers

25


When it comes to using syntactic elements in general (and not just identifiers) that go beyond ASCII, there are a number of factors to consider:

  1. The compiler needs to offer proper support for Unicode input. This goes beyond simple character encoding: you need to know whether support is limited to the BMP or extends to the supplementary planes, whether it handles surrogate pairs well, whether it works with combining characters or only with precomposed ones, and whether it accepts escaped characters in source code. There may be other considerations; this is just what comes to mind.

    An example is how the word árvore ("tree") can be represented in Unicode:

    '\xe1rvore',    // Latin Small Letter A with acute,              r,v,o,r,e
    'a\u0301rvore', // Latin Small Letter A, Combining Acute Accent, r,v,o,r,e
    

    If a library was written in an editor that produces precomposed characters, and the code that tries to use it was written in one that produces combining sequences, the identifier may not be recognized.

  2. Does the language distinguish between upper and lower case or not? If not, there is the problem of collation: unless the computer where the code is being compiled has the same locale as the computer where the code was originally written, the same identifier may be interpreted in different ways when capitalization is normalized. Example:

    "mail".toUpperCase(); // MAİL (Turkish locale)
                          // MAIL (rest of the world)
    

    Again, if a library was compiled on a computer with a Turkish locale, and the person using it does not have that locale, the identifiers may not be recognized (when the compiler tries to normalize them).

  3. How hard is it to type Unicode characters? For us Portuguese speakers, typing accented characters is easy: our own keyboard layout supports it. But if we had to use a library with Japanese identifiers, for example, how would we type them? Likewise, other people may have difficulty typing accented letters, but everyone has at least good ASCII support.
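
A quick way to see problem 1 in practice, using Python as an illustration (Python normalizes identifiers to NFKC at parse time, per PEP 3131; other languages may behave differently):

```python
import unicodedata

# The same visible word, two different code point sequences:
precomposed = '\xe1rvore'    # 'á' as one code point (U+00E1)
combining = 'a\u0301rvore'   # 'a' + combining acute accent (U+0301)

print(precomposed == combining)                                # False
print(unicodedata.normalize('NFC', combining) == precomposed)  # True

# Python normalizes identifiers while parsing, so both spellings
# end up naming the same variable:
ns = {}
exec('\xe1rvore = 1', ns)      # defined with the precomposed 'á'
exec('v = a\u0301rvore', ns)   # referenced with the combining accent
print(ns['v'])                 # 1
```

A language without this normalization step would treat the two spellings as distinct identifiers, which is exactly the library-interoperability problem described above.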

Does this mean that using Unicode identifiers is always bad? No. It depends much more on the scope of the system being developed. As in the case of "whether or not to write code in Portuguese", there are a number of factors that help determine whether it is acceptable for the system to have a more local scope, even though this excludes it from a global audience (see my answer to the linked question for more details). It is useful to write programs in Portuguese, and it is useful that they be written in correct Portuguese. So, in the absence of problems to the contrary, I see no reason not to use non-ASCII characters.

  • Explicitly: if the entire development team uses the same text editor or IDE, problem 1 practically does not exist (unless you program in traditional Chinese); if everyone is in the same locale, problem 2 does not apply; and if everyone uses the same keyboard layout, problem 3 puts no one at a "disadvantage". That is, these factors are much less relevant for an in-house project than for one open to the public.

Addendum: why didn't I talk about the encoding problem, in the sense of one programmer editing in one encoding and another programmer in another? Because that is a much more general problem, one that affects even the comments in the code. The development team needs to use the same encoding everywhere anyway, so it is no obstacle to using Unicode identifiers if so desired.

  • 1

    Note: as an aspiring programming language designer, I've thought a lot about this dilemma (as well as about being case-insensitive or not). I have some interesting ideas about it (you have to "think outside the box" and break some taboos), but I can't go into much detail here, because it would only be opinion and/or speculation...

  • 4

    Is there some way we can find these ideas? :)

  • 2

    @bigown Soon... I'm not ready to reveal details of my project yet, and it would be too long to present here. Simply put, it would be a matter of normalizing identifiers (and other names) to ensure interoperability. That is, just as in SQL select == SELECT, we would have árvore == arvore. This alleviates problem 3. As for 1 and 2, I leave a question for reflection: is it really necessary that source code always be represented on disk as plain text? (I've even thought about discussing it here, but I think it's too subjective for the site)

  • 1

    That's awesome :P

  • 1

    Anyway, I wanted to stick to what matters to developers today, and not speculate about the future (after all, this is a question about "coding style", not "language design" or "compiler design"). P.S.: I added several examples to the answer, in Java (ideone) and JavaScript (jsFiddle), demonstrating the problems in practice. And a note clarifying that, for a local project, they are not as bad as they seem.
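
The normalization idea mentioned in this thread (making árvore == arvore, like select == SELECT in SQL) can be sketched as follows. This is a minimal sketch in Python; `canonical` is a hypothetical name, and a real language would apply something like it inside the compiler:

```python
import unicodedata

def canonical(name):
    # Case-fold, then decompose (NFKD) and drop the combining marks,
    # so 'Árvore', 'árvore' and 'ARVORE' all collapse to 'arvore'.
    decomposed = unicodedata.normalize('NFKD', name.casefold())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(canonical('Árvore') == canonical('arvore'))   # True
print(canonical('SELECT') == canonical('select'))   # True
```

Whether such aggressive folding is desirable is exactly the open design question being debated here; it trades precision for typing convenience.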

18

Most modern environments do in fact support working with Unicode. But from there to using it in code is a big leap. The first point to consider, before thinking about aesthetics and good practice, is whether your language supports it. Most languages define a finite (and small) set of characters from which source code must be composed, usually a subset of ASCII. For example, the C standard says the following (C11, 5.2.1/3):

Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

Using anything outside this set would be invalid. A compiler may accept more, of course, and most do. But if you want portable code that works on any platform, it is good to restrict yourself.

Another problem is file encoding. It can happen that two files of the same program are saved with different encodings (for whatever reason). Visually you will see the character É in both, but at compile or run time the compiler/interpreter may see different identifiers there. You end up with a rather difficult error to track down, since the error message will not help.
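
To make the trap concrete: the same visible É occupies different bytes depending on the encoding. A quick sketch in Python, used here just to show the bytes:

```python
# U+00C9 (É) in two common encodings:
print('É'.encode('utf-8'))    # b'\xc3\x89' (two bytes)
print('É'.encode('latin-1'))  # b'\xc9' (one byte)

# A tool that reads the Latin-1 bytes as if they were UTF-8
# no longer sees an É at all:
print(b'\xc9'.decode('utf-8', errors='replace'))  # '�' (U+FFFD)
```

Two source files that look identical in the editor can therefore contain different byte sequences, which is how the "same" identifier ends up not matching.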

One language with broad support for non-ASCII characters in code is Ruby. The parser and other tools were built with this in mind, and there is no set limiting the allowed characters. This leaves room for some interesting things, as Peter Cooper's article Unicode Whitespace Shenanigans for Rubyists demonstrates:

Using a Unicode space character (the same as HTML's &nbsp;):


[image: Ruby code where the no-break space is accepted as part of an identifier]

It is not seen as a space; it becomes part of the identifier. This lets you write something as confusing as this:


[image: deliberately confusing Ruby code built with no-break spaces]

Since there is a plethora of space characters to choose from:


[image: table of Unicode space characters (source: rubyinside.com)]

Using Unicode in a codebase leaves room for absurdities like these and for bugs that are very complicated to track down. Another clear problem is copying and pasting the code between different tools: you never know what might happen.
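
For contrast, a language that restricts identifiers to Unicode letter, digit and connector categories rejects this kind of trick outright. Checking with Python, whose identifier rules follow those categories:

```python
# U+00A0 (no-break space) is neither whitespace nor a valid
# identifier character for the Python tokenizer, so the Ruby
# trick above is simply a syntax error here:
try:
    exec('foo\u00a0bar = 1')
except SyntaxError as e:
    print('rejected:', type(e).__name__)  # rejected: SyntaxError
```

This supports the point made in the comments below: the problem is less Unicode itself than which characters a language chooses to admit into identifiers.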

Technical problems aside, there is always the question of (human) language. If it is a large project, or one that may become open source, it is always recommended to write the code in English, doing away with Unicode.

In a small project with a team of few developers, there is plenty of room for rules to be defined and conventions to be created. If everyone agrees, there is no reason not to. Just remember to always weigh the pros and cons of adopting this style.


One case that I have seen happen, and that I consider valid in a way, is when writing tests. In many frameworks you define a function/member/method that will be a block of asserts to be executed. On a failure, the name of that function is usually displayed on screen as the name of the test that failed. Since this is a function you never call explicitly, using Unicode in its name can be interesting: it makes the error output much more readable.
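
A minimal sketch of that idea in Python, in the style of pytest, which reports failing tests by function name; `validar_nome` and the test name are illustrative, not from any real project:

```python
def validar_nome(nome):
    # Hypothetical function under test: a name is valid if non-blank.
    return bool(nome.strip())

def test_validação_rejeita_nome_vazio():
    # Nobody ever types this name at a call site; it only shows up
    # in the test report, where readability matters most.
    assert validar_nome('') is False

test_validação_rejeita_nome_vazio()
print('ok')
```

A failure report reading "test_validação_rejeita_nome_vazio FAILED" tells you exactly what broke, which is the readability benefit described above.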

  • 4

    In this Ruby example, IMHO the fault is entirely Ruby's, not Unicode's: if in an "ASCII only" language an identifier is [\w_]+ (i.e. letters, digits and underscore), then in a "Unicode" language an identifier should be [\p{L}\p{Nd}\p{Pc}]+ (likewise). Allowing any kind of character as part of an identifier is asking for trouble...

  • 2

    "For once", the answers here are better than the ones found in English :)

  • @bigown almost gave me a downvote after the second answer, haha
