Static string being created with the wrong encoding

5

Hello,

When creating a string in a Java class (for example: String t = "Ola Java!"), it seems that the compiler is choosing the 'wrong' encoding to interpret the bytes in the source file and generate the String (the 'right' encoding would be UTF-8, which is the encoding I'm using in the sources).

To illustrate the error, I ran the following test:

String t = "ã";
log.debug("t: " + t);
log.debug("t.length(): " + t.length());
log.debug("t.getBytes().length: " + t.getBytes().length);
log.debug("t.getBytes(utf-8).length: " + t.getBytes("utf-8").length);
log.debug("t.getBytes(UTF-8).length: " + t.getBytes("UTF-8").length);
log.debug("t.getBytes(ISO-8859-1).length: " + t.getBytes("ISO-8859-1").length);

(The logging mechanism I use is commons-logging with log4j support, but the same test can be done with System.out.)

The result was as follows:

t: ã
t.length(): 2
t.getBytes().length: 4
t.getBytes(utf-8).length: 4
t.getBytes(UTF-8).length: 4
t.getBytes(ISO-8859-1).length: 2

The first line could be explained by some conversion problem when the string is written to the log file. But the other lines make the problem clear. The second line (t.length()) shows that the String was created with two characters, not one. In other words, when the string was created, the two bytes that represent the character in UTF-8 were treated as two separate characters (as they would be in some other encoding, such as ISO-8859-1).

I could look for some way to force the encoding when the compiler interprets a static string, but I don't think that is a good approach... is there any way to do this? Or to tell the compiler which encoding should be used when interpreting static strings in the sources?
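The numbers in the output above can be reproduced without involving the compiler at all. The following is a minimal sketch, assuming the wrong charset was ISO-8859-1 (the question only suggests this); the class and method names are made up for illustration. It takes the UTF-8 bytes of "ã" and decodes them with the wrong charset, which is roughly what javac does when it reads a UTF-8 source file under a different encoding.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodedLiteral {

    // Encodes the text as UTF-8 and decodes those bytes as ISO-8859-1,
    // imitating a reader that assumes the wrong source encoding.
    // ISO-8859-1 is an assumption here; the question does not confirm it.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E3 is "a with tilde"; the escape keeps this file pure ASCII,
        // so the demo does not depend on how this source itself is read.
        String correct = "\u00e3";
        String broken = misdecode(correct);

        System.out.println("correct.length(): " + correct.length());
        System.out.println("broken.length(): " + broken.length());
        System.out.println("broken as UTF-8 bytes: "
                + broken.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("broken as ISO-8859-1 bytes: "
                + broken.getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```

The misdecoded string has length 2, takes 4 bytes in UTF-8 and 2 bytes in ISO-8859-1, which matches the 2 / 4 / 4 / 2 pattern reported in the question.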

  • 3

    Ok, I solved the problem. There is a compiler option to force the encoding it uses to read the files. The option is -encoding (it also exists in the ant javac task). Setting the value to utf-8 worked. By the way, the compiler uses the OS default encoding as its own default, and not the default adopted in the classes (UTF-8)... nothing like writing the question down to get a fresh view of it!

  • 1

    Add your solution as the answer to the question...

  • Okay, I'll do it later. Right now I'm being summoned to a barbecue!...

2 answers

3

It really was a problem with how the sources were compiled.

Javac was assuming that all source files were in one and the same encoding, which was not UTF-8. Perhaps by default javac uses the OS's standard encoding.

To solve the problem, I used the javac -encoding option, which defines which encoding should be assumed when reading the sources (the same option exists in the ant javac task).

3


From the javac documentation:

-encoding encoding

Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.

That is, if not specified explicitly, the compiler will use the system default. If you think about it, it makes some sense: this is the charset that editors use by default, and it’s the charset that Java applications use by default -- and javac is a Java application.

Of course, specifying -encoding on the command line solves it. If you are using a build system (ant, Maven, Gradle, etc.), specify this option there to ensure that files are treated the same way on any platform.

If you are not using a build system (you should! :), you can use the JAVA_TOOL_OPTIONS environment variable, putting something like -Dfile.encoding=UTF8 in it.
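If the compilation is driven from Java code rather than from the command line or a build tool, the same -encoding flag can be passed through the standard javax.tools API. This is only a sketch: the class name CompileWithEncoding and its compile method are made up for illustration, and it assumes a full JDK is available (on a plain JRE ToolProvider.getSystemJavaCompiler() returns null).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileWithEncoding {

    // Compiles one source file, telling the compiler explicitly which
    // encoding the file is written in. Returns the compiler's exit code
    // (0 means success). Class files are written next to the source.
    public static int compile(Path source, String encoding) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler found (is this a JRE?)");
        }
        Path outputDir = source.toAbsolutePath().getParent();
        // Equivalent to: javac -encoding <encoding> -d <outputDir> <source>
        return compiler.run(null, null, null,
                "-encoding", encoding,
                "-d", outputDir.toString(),
                source.toString());
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: CompileWithEncoding <source file>");
            return;
        }
        int exitCode = compile(Paths.get(args[0]), "UTF-8");
        System.out.println("javac exit code: " + exitCode);
    }
}
```

The argument list handed to compiler.run is the same one that would be typed after javac on the command line, so the flag behaves as described in the documentation quoted above.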
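To check which default a JVM on a given machine actually picks up (and therefore what a Java tool falls back to when no encoding is specified), something like the following can help. It is a small diagnostic sketch, and the class and method names are invented for illustration; it only reads the file.encoding property mentioned above and the JVM's default charset.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Builds a one-line summary of the encoding defaults this JVM sees.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

Running it once with and once without JAVA_TOOL_OPTIONS set is a quick way to see whether the variable is taking effect.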

Finally, there is a way to convert your files, whatever their encoding, into ASCII. The JDK comes with a program called native2ascii. It converts a file in the system encoding, or in an encoding that you specify with -encoding, into an ASCII file that uses the \uxxxx syntax to represent any non-ASCII characters. For example:

Daniels-MacBook-Pro:debug-service dsobral$ cat Teste.java 
class Teste {
    public String test = "Teste de codificação"
}
Daniels-MacBook-Pro:debug-service dsobral$ native2ascii Teste.java
class Teste {
    public String test = "Teste de codifica\u00e7\u00e3o"
}

In this case, since I did not specify an output file, it printed to the console. I have never actually used this program (I use build systems! :), but in simple tests it seems to accept the output file being the same as the input. Still, I would try it with large files before relying on that.

  • Interesting, this native2ascii! I don't know if I will ever use it, but it's good to know that something like this exists... thanks!
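The transformation shown in the transcript is simple enough to sketch in a few lines of Java, which can be handy on JDKs that no longer ship the tool (as far as I know, native2ascii was removed in JDK 9). This is an approximation of what the tool does to the text, not its actual implementation, and the AsciiEscaper name is invented for illustration.

```java
public class AsciiEscaper {

    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // made of four lowercase hex digits; ASCII characters pass through.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal is written with escapes so this file stays pure ASCII.
        String original = "Teste de codifica\u00e7\u00e3o";
        System.out.println(escapeNonAscii(original));
    }
}
```

For the string from the transcript this prints the same escaped text that native2ascii produced there. Working on UTF-16 code units means characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is the form Java source expects.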
