Java String byte array with negative numbers

I’m having trouble figuring out the encoding of a string.

The input is:

São Paulo

Reading the original content is outside my control, because the text reaches me through a Lua wrapper for Java.

On my side, I have already tried the following brute-force approach and did not find the correct conversion:

byte[] bytes1 = entrada.getBytes();
System.out.println(Arrays.toString(bytes1));
System.out.println(new String(bytes1));
System.out.println(new String(bytes1, StandardCharsets.UTF_8));
System.out.println(new String(bytes1, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes1, StandardCharsets.US_ASCII));

byte[] bytes2 = entrada.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes2));
System.out.println(new String(bytes2));
System.out.println(new String(bytes2, StandardCharsets.UTF_8));
System.out.println(new String(bytes2, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes2, StandardCharsets.US_ASCII));

byte[] bytes3 = entrada.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.toString(bytes3));
System.out.println(new String(bytes3));
System.out.println(new String(bytes3, StandardCharsets.UTF_8));
System.out.println(new String(bytes3, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes3, StandardCharsets.US_ASCII));

byte[] bytes4 = entrada.getBytes(StandardCharsets.US_ASCII);
System.out.println(Arrays.toString(bytes4));
System.out.println(new String(bytes4));
System.out.println(new String(bytes4, StandardCharsets.UTF_8));
System.out.println(new String(bytes4, StandardCharsets.ISO_8859_1));
System.out.println(new String(bytes4, StandardCharsets.US_ASCII));

And I got the following output, all of it wrong:

[83, -29, -81, -96, 80, 97, 117, 108, 111]
S㯠Paulo
S㯠Paulo
S㯠Paulo
S���Paulo

[83, -29, -81, -96, 80, 97, 117, 108, 111]
S㯠Paulo
S㯠Paulo
S㯠Paulo
S���Paulo

[83, 63, 80, 97, 117, 108, 111]
S?Paulo
S?Paulo
S?Paulo
S?Paulo

[83, 63, 80, 97, 117, 108, 111]
S?Paulo
S?Paulo
S?Paulo
S?Paulo

Can anyone help me? I thank you in advance.

1 answer

If entrada is a String, it has already been decoded, and trying to convert it again will not help.

It seems to me that you are trying to convert the String to bytes and then the bytes back to a String. This does not work: once the original input has been turned into a String, the bytes have already been decoded, and in general they are no longer the same as they were originally.
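A minimal sketch of why the round trip cannot help. It assumes the String held in entrada is 'S', U+3BE0, "Paulo", which is inferred from the UTF-8 bytes printed in the question rather than confirmed by it; the class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Assumed content of entrada, inferred from the printed bytes E3 AF A0 (U+3BE0).
    static final String ENTRADA = "S\u3BE0Paulo";

    public static void main(String[] args) {
        byte[] utf8 = ENTRADA.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = ENTRADA.getBytes(StandardCharsets.ISO_8859_1);

        // UTF-8 can represent U+3BE0, so it becomes three bytes.
        System.out.println(Arrays.toString(utf8));   // [83, -29, -81, -96, 80, 97, 117, 108, 111]
        // ISO-8859-1 cannot, so the encoder substitutes '?' (63).
        System.out.println(Arrays.toString(latin1)); // [83, 63, 80, 97, 117, 108, 111]

        // Decoding with the same charset only returns what was already in the String.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(ENTRADA)); // true
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));          // S?Paulo
    }
}
```

Both arrays match the ones in the question, which suggests the damage was already present in the String before any of these conversions ran.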

When you call entrada.getBytes() without arguments, Java uses the platform's default charset. Judging by your output, that default is apparently UTF-8, which is why the first block is identical to the explicit UTF-8 one and tells you nothing new.
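To check which charset the no-argument call uses on a given machine, something like the following sketch can be run. The class and method names are made up for illustration; Charset.defaultCharset() is the standard API for reading the platform default.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True when getBytes() and getBytes(default charset) produce the same bytes.
    static boolean sameAsDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // The printed name depends on the JVM version and system settings.
        System.out.println(Charset.defaultCharset());
        // \u00E3 is the letter a with tilde, written as an escape to avoid source-encoding issues.
        System.out.println(sameAsDefault("S\u00E3o Paulo"));
    }
}
```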

The negative numbers are normal, since the primitive type byte in Java is signed and ranges from -128 to +127. Any byte value from 128 to 255 (0x80 to 0xFF) is therefore printed as a negative number.

The following code decodes the byte array with every charset that Java supports in the current environment:
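For inspection it is often easier to look at the bytes as unsigned hex. A small sketch using the array printed in the question (the helper name toHex is made up for illustration):

```java
class ByteHexDemo {
    // Formats each byte as two uppercase hex digits; "& 0xFF" maps -128..127 to 0..255.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = { 83, -29, -81, -96, 80, 97, 117, 108, 111 };
        System.out.println(toHex(b)); // 53 E3 AF A0 50 61 75 6C 6F
    }
}
```

Here -29, -81 and -96 are simply 0xE3, 0xAF and 0xA0.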

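import java.nio.charset.Charset;
import java.util.Map;
import java.util.SortedMap;

// Bytes printed in the question; decode them with every installed charset.
byte[] b = new byte[] { 83, -29, -81, -96, 80, 97, 117, 108, 111 };
SortedMap<String, Charset> charsets = Charset.availableCharsets();
for (Map.Entry<String, Charset> entry : charsets.entrySet()) {
    System.out.printf("%s: %s%n", entry.getKey(), new String(b, entry.getValue()));
}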

I tested this on a Mac and no encoding was able to decode the o of São, which indicates to me that the bytes are already corrupted and that the problem lies at some earlier point.

You should ask whoever is on the "other side" for a specification of which encoding is used, and make sure the implementation follows what has been defined.

Another approach would be to receive the input directly as a byte array, or in some other format that does not decode the bytes before they reach code you control.
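One possible explanation, offered only as a hypothesis that is consistent with the printed bytes and not something the question confirms: if the sender emitted "ão " in ISO-8859-1 (E3 6F 20) and an upstream decoder read it as a three-byte UTF-8 sequence without validating the continuation bytes, the result would be exactly U+3BE0, whose UTF-8 form is E3 AF A0. The sketch below only checks that arithmetic; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LenientUtf8Hypothesis {
    // Combines three bytes as a UTF-8 three-byte sequence WITHOUT checking that
    // the second and third bytes have the required 10xxxxxx form.
    static int lenientDecode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        // E3 6F 20 is "ão " in ISO-8859-1.
        int codePoint = lenientDecode3(0xE3, 0x6F, 0x20);
        System.out.printf("U+%04X%n", codePoint); // U+3BE0

        // Re-encoding that code point as UTF-8 gives the bytes seen in the question.
        byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // [-29, -81, -96]
    }
}
```

If this is what happened, the o and the space were merged into a single character, which would explain why no charset can bring them back.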
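As a sketch of that approach: once the raw bytes arrive untouched and the charset is agreed on, a single explicit decode is enough. The two arrays below are hypothetical examples of what an uncorrupted sender could transmit for "São Paulo"; they are not taken from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RawBytesDemo {
    // "São Paulo" encoded as ISO-8859-1: one byte per character (hypothetical input).
    static final byte[] LATIN1 = { 0x53, (byte) 0xE3, 0x6F, 0x20, 0x50, 0x61, 0x75, 0x6C, 0x6F };
    // "São Paulo" encoded as UTF-8: the a with tilde takes two bytes, C3 A3 (hypothetical input).
    static final byte[] UTF8 = { 0x53, (byte) 0xC3, (byte) 0xA3, 0x6F, 0x20, 0x50, 0x61, 0x75, 0x6C, 0x6F };

    // Decodes raw bytes with the charset agreed on with the sender.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String fromLatin1 = decode(LATIN1, StandardCharsets.ISO_8859_1);
        String fromUtf8 = decode(UTF8, StandardCharsets.UTF_8);
        // Both decode to the same text when the matching charset is used.
        System.out.println(fromLatin1.equals(fromUtf8)); // true
    }
}
```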

  • Indeed, because the o of Paulo should have the same code as the o of São. The data is already corrupted in the input. The best thing the author could do would be to show the input byte by byte in hex, for analysis.

  • Thank you very much for your attention. I agree with you, and I suspected as much. I will see what I can do about the reading on the Lua side.
