What happens when I convert String to an array of bytes?

I have:

String msg = "Texto a ser encriptado";
byte array[] = msg.getBytes();

What happens when I convert the string to an array of bytes?

I’m concerned about an application I’m developing. I need to send encrypted data from a server to an embedded system. My concern is the character encoding, since the embedded application’s firmware is written in C and uses char to store the received data. Will I run into encoding problems?

If necessary, I can post code snippets from my application.

  • 2

    If you have a Unicode string and you convert it to and from the same specific encoding (say, UTF-8), the resulting byte sequence will always be the same. One detail: make sure your platform’s "getBytes" function lets you specify the encoding, and avoid methods that use the system’s default encoding, because that might not be portable. And obviously, don’t use broken methods.

  • Avelino, did my answer or @mgibsonbr’s satisfy you? If so, could you accept one of them? If not, is there anything you would still like clarified?

  • 1

    @Victorstafusa, I ended up accepting your answer because it showed how I should use .getBytes() more appropriately. But both answers were very useful and very good (I gave +1 to both). Sorry for the delay: I’m working on my TCC and needed more time to go through them calmly.

2 answers

5


Well, here’s what the Javadoc for the getBytes() method says:

getBytes

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform’s default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.

Returns:

The resultant byte array

Since:

JDK1.1

That is, this method will not serve your purpose! And yes, you may have encoding problems.

One way to improve the situation a little is to use getBytes(String charsetName) or getBytes(Charset charset), which let you choose a well-defined charset instead of relying on the platform’s default. Even so, when the String cannot be encoded in the specified charset, the Javadoc leaves the behavior of getBytes(String charsetName) unspecified, while getBytes(Charset charset) is documented to always substitute the charset’s default replacement bytes for unmappable characters. Normally this is enough if you ensure that every String used can always be encoded in the chosen charset, or if you simply don’t care about the ones that can’t.

With that, your code would be something like this:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class ConversaoSimples {

    // main method, only here to test the converte method.
    public static void main(String[] args) {
        String minhaString = "João comeu pão com feijão.";
        byte[] bytes = converte(minhaString);
        System.out.println(Arrays.toString(bytes));
        try {
            System.out.println(new String(bytes, "UTF-8"));
        } catch (UnsupportedEncodingException x) {
            throw new AssertionError(x);
        }
    }

    // Method that performs the conversion.
    private static byte[] converte(String str) {
        try {
            return str.getBytes("UTF-8");
        } catch (UnsupportedEncodingException x) {
            throw new AssertionError(x);
        }
    }
}

Here’s the output:

[74, 111, -61, -93, 111, 32, 99, 111, 109, 101, 117, 32, 112, -61, -93, 111, 32, 99, 111, 109, 32, 102, 101, 105, 106, -61, -93, 111, 46]
João comeu pão com feijão.
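
As a variation on the example above, here is a minimal sketch that assumes Java 7 or later, where java.nio.charset.StandardCharsets is available. Passing a Charset constant instead of a charset name means there is no checked UnsupportedEncodingException to handle. The class name ConversaoComCharset and the helper desconverte are only illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ConversaoComCharset {

    // Encodes the string as UTF-8; no checked exception to handle.
    static byte[] converte(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String desconverte(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // main method, only here to try the two helpers.
    public static void main(String[] args) {
        byte[] bytes = converte("João comeu pão com feijão.");
        System.out.println(Arrays.toString(bytes));
        System.out.println(desconverte(bytes));
    }
}
```

The bytes should be the same as those printed by ConversaoSimples, since both versions encode to UTF-8.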

However, if it matters to you that strings which cannot be encoded in the specified charset produce byte arrays with loosely defined contents, it is best to use the CharsetEncoder class. Since it is an abstract class (and it makes no sense for you to subclass it), you get instances through the newEncoder() method of the Charset class. Charset is itself abstract, and to get an instance of it you use the static method forName(String charsetName).

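After acquiring the CharsetEncoder instance, you may want to reconfigure it with some of its methods that return CharsetEncoder. These methods modify the CharsetEncoder and return it. Note that a freshly created encoder reports errors: encode(CharBuffer) throws a CharacterCodingException when it meets a character it cannot encode. If you would rather have unknown characters replaced (by a question mark, in most charsets), configure that with onUnmappableCharacter(CodingErrorAction.REPLACE).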
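
To make the encoder configuration concrete, here is a small sketch that contrasts an unconfigured encoder with one set to replace unmappable characters. It uses US-ASCII only because that charset cannot represent "ã". The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncoderConfigSketch {

    // Unconfigured encoder: returns false if encode() throws because some
    // character cannot be represented in US-ASCII.
    static boolean codificaEstrito(String str) {
        CharsetEncoder ce = StandardCharsets.US_ASCII.newEncoder();
        try {
            ce.encode(CharBuffer.wrap(str));
            return true;
        } catch (CharacterCodingException x) {
            return false;
        }
    }

    // Encoder configured with REPLACE: unmappable characters become the
    // encoder's replacement bytes. The result is decoded again for display.
    static String codificaComSubstituicao(String str) {
        CharsetEncoder ce = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bb = ce.encode(CharBuffer.wrap(str));
            return StandardCharsets.US_ASCII.decode(bb).toString();
        } catch (CharacterCodingException x) {
            // Not expected once both error actions are set to REPLACE.
            throw new IllegalStateException(x);
        }
    }

    public static void main(String[] args) {
        System.out.println(codificaEstrito("pão"));
        System.out.println(codificaComSubstituicao("pão"));
    }
}
```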

Once the CharsetEncoder is properly configured, use one of its encode methods. The simplest is encode(CharBuffer in), which receives a CharBuffer and returns a ByteBuffer. To get a CharBuffer instance, use the static method wrap(CharSequence csq), passing your String as the argument.

After obtaining the ByteBuffer, you can use the array() method to get an array of bytes.

The array of bytes obtained from the ByteBuffer may come with a few zeros at the end, because the encoder allocates the buffer from an estimate of the encoded size, so its capacity can exceed the number of bytes actually written. These zeros are the unfilled part of the allocated buffer, and you will need to remove them from the end of the array.

Putting it all together, the code gets complicated:

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Arrays;

public class ConversaoComplicada {

    // main method, only here to test the converte method.
    public static void main(String[] args) {
        String minhaString = "João comeu pão com feijão.";
        byte[] bytes = converte(minhaString);
        System.out.println(Arrays.toString(bytes));
        try {
            System.out.println(new String(bytes, "UTF-8"));
        } catch (UnsupportedEncodingException x) {
            throw new AssertionError(x);
        }
    }

    // Method that performs the conversion.
    private static byte[] converte(String str) {
        try {
            Charset c = Charset.forName("UTF-8");
            CharsetEncoder ce = c.newEncoder();

            // If you want to modify the CharsetEncoder, do it here. For example:
            // ce.replaceWith(...);

            CharBuffer cb = CharBuffer.wrap(str);
            ByteBuffer bb = ce.encode(cb);
            byte[] b = bb.array();
            return cortaZeros(b);
        } catch (CharacterCodingException x) {
            throw new AssertionError(x);
        }
    }

    private static byte[] cortaZeros(byte[] array) {
        if (array[array.length - 1] != 0) return array;
        int inicio = 0, fim = array.length;
        while (inicio != fim && inicio != fim - 1) {
            int m = (inicio + fim) / 2;
            if (array[m] == 0) {
                fim = m;
            } else {
                inicio = m;
            }
        }
        int tamanho = array[inicio] == 0 ? inicio : inicio + 1;
        byte[] resultado = new byte[tamanho];
        System.arraycopy(array, 0, resultado, 0, tamanho);
        return resultado;
    }
}

Here’s the output:

[74, 111, -61, -93, 111, 32, 99, 111, 109, 101, 117, 32, 112, -61, -93, 111, 32, 99, 111, 109, 32, 102, 101, 105, 106, -61, -93, 111, 46]
João comeu pão com feijão.

Oh, and I’m assuming that your String contains no zero (NUL) character, especially at the end. If it does, the algorithm above will get even more complicated.
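
One possible way around the zero-trimming step, offered as a sketch rather than as part of the answer’s code: after encode(), the bytes actually written are the ones between the buffer’s position and its limit, so you can copy exactly bb.remaining() bytes instead of searching for trailing zeros. That also keeps any zero characters the String itself contains. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ConversaoSemCortarZeros {

    // Encodes to UTF-8 and copies only the bytes the encoder actually wrote.
    static byte[] converte(String str) {
        CharsetEncoder ce = StandardCharsets.UTF_8.newEncoder();
        try {
            ByteBuffer bb = ce.encode(CharBuffer.wrap(str));
            // remaining() = limit - position = number of encoded bytes.
            byte[] b = new byte[bb.remaining()];
            bb.get(b);
            return b;
        } catch (CharacterCodingException x) {
            // Raised for input that UTF-8 cannot encode, such as an unpaired surrogate.
            throw new IllegalArgumentException(x);
        }
    }

    public static void main(String[] args) {
        // The trailing \u0000 is part of the string and must be preserved.
        System.out.println(Arrays.toString(converte("pão\u0000")));
    }
}
```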

  • 3

    +1, especially for the last part, very detailed. However, I would recommend that anyone who is not supporting a legacy system simply avoid character encodings that are not able to represent all the Unicode characters.

4

A "string" is an abstract data type representing a finite sequence of "characters". The implementation of this type, and even the semantic interpretation of what a "character" is, varies from language to language and from platform to platform.

I could go on about this, but to keep the answer simple let’s assume that your language has proper Unicode support and provides methods to convert strings to and from well-defined character encodings. Then we have the following:

  • A well-defined sequence of Unicode code points is represented in memory by the string type. Whether this type represents it internally as code units, code points, bytes, etc. doesn’t matter. Whether it represents characters outside the BMP as one or two characters doesn’t matter either. What matters is that each string represents "unambiguously" (I don’t even know if that word exists) a valid sequence of code points.

  • A character encoding represents each code point as a well-defined sequence of bytes. Some encodings have ambiguities that need to be resolved (such as the order of the bytes that represent each character), but as far as I know all of them have a one-to-one correspondence between code point and byte sequence.

    In particular, UTF-8 is designed for a high degree of compatibility with ASCII-based systems: 1 ASCII character becomes 1 byte with the same representation (including the null terminator), while 1 non-ASCII character becomes two or more bytes encoded in such a way that none of them can be confused with an ASCII character, their order matters, and it is evident which is the "first" byte and which are the others.

So, when converting from the string’s memory representation to a byte representation, the same string will always generate the same byte sequence, and the same byte sequence will always produce the same original string.
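
The properties described above can be checked with a short Java sketch. It assumes strings without unpaired surrogates, since String.getBytes substitutes a replacement byte for those and the round trip would then fail. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class RoundTripSketch {

    // True if encoding to UTF-8 and decoding again yields the original string.
    static boolean idaEVolta(String str) {
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(str);
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int tamanhoUtf8(String str) {
        return str.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if every encoded byte has its high bit set, meaning that no byte
    // could be mistaken for an ASCII character.
    static boolean semBytesAscii(String str) {
        for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
            if (b >= 0) {
                return false; // Java bytes are signed: 0..127 is the ASCII range
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(idaEVolta("João comeu pão com feijão."));
        System.out.println(tamanhoUtf8("A") + " " + tamanhoUtf8("ã") + " " + tamanhoUtf8("€"));
        System.out.println(semBytesAscii("ã€"));
    }
}
```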

Okay, what about the C code specifically? According to those two questions on SOen, the ANSI C standard does not require the type char to have exactly 8 bits, only that it be a type capable of representing all the characters in a given set, called the "execution character set". What exactly that set is, I can’t say.

I have no experience with C, nor could I interpret 100% of the information in the linked questions, but I think it is safe to assume that an "array of chars" in C is not necessarily able to represent any and every sequence of Unicode code points. If your embedded application needs to manipulate strings, you need to find out what facilities your environment offers for handling character encodings and Unicode.
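
On the sending (Java) side, one defensive option is to check what you are about to send before it reaches the firmware. The sketch below rests on two assumptions that are not in the question: that the firmware can only be trusted with plain ASCII text, and that it expects a NUL-terminated byte string. Both helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PayloadParaC {

    // True if every character of the string can be represented in US-ASCII.
    static boolean soAscii(String str) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(str);
    }

    // UTF-8 bytes followed by a single 0 byte (assumed C-style terminator).
    static byte[] comTerminadorNulo(String str) {
        byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        // copyOf pads the extra position with zero.
        return Arrays.copyOf(bytes, bytes.length + 1);
    }

    public static void main(String[] args) {
        System.out.println(soAscii("Texto a ser encriptado"));
        System.out.println(soAscii("João"));
        System.out.println(Arrays.toString(comTerminadorNulo("A")));
    }
}
```

If the payload is encrypted before it is sent, the ciphertext is arbitrary binary data and may itself contain zero bytes, so a terminator like this would only make sense for the plaintext.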

On this page (in English) there is a demonstration of the various facilities for handling Unicode strings in the most popular languages. But if you are particularly interested in C, it might be worth opening a separate question asking for the correct way to handle strings on your specific platform (this is important: otherwise the question becomes too broad, or, even if it is "answerable", the answers will not necessarily apply to your particular case).
