How do I convert a code point to UTF-16?


I have a 32-bit integer representing a Unicode code point and would like to convert this single character into its UTF-16 representation, that is, one or more 16-bit integers.

1 answer

The 16-bit Unicode Transformation Format (UTF-16) is defined in Section 2.5 of the Unicode Standard, as well as in RFC 2781. It works like this:

  1. Let U be the code point value you want to encode. If U is less than 65,536, emit it unchanged as a single 16-bit code unit.
  2. If U is greater than or equal to 65,536, compute U' = U - 65536. By the rules of Unicode, this U' will have its 12 most significant bits equal to zero (since the last valid code point is 0x10FFFF).
  3. Output two 16-bit code units (a surrogate pair), in order (a worked example follows this list):
    1. The first has its six most significant bits equal to 1101 10 and its ten least significant bits equal to the ten most significant bits of U'.
    2. The second has its six most significant bits equal to 1101 11 and its ten least significant bits equal to the ten least significant bits of U'.
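As an illustration, here is how these steps work out for the code point U+1F600 (one possible value above 65,536):

    U  = 0x1F600
    U' = 0x1F600 - 0x10000 = 0xF600
    high ten bits: 0xF600 >> 10   = 0x03D  ->  0xD800 | 0x03D = 0xD83D
    low ten bits:  0xF600 & 0x3FF = 0x200  ->  0xDC00 | 0x200 = 0xDE00

So U+1F600 is encoded as the surrogate pair 0xD83D 0xDE00.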

In C:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the UTF-16 encoding of `codepoint` to `out`, using the
   machine's native byte order for each 16-bit code unit. */
void
utf_16(uint32_t codepoint, FILE * out) {
    uint32_t U;
    uint16_t W;

    assert(codepoint <= 0x10FFFF);
    if (codepoint < 0x10000) {
        /* BMP code point: emitted as a single 16-bit code unit. */
        W = (uint16_t) codepoint;
        fwrite(&W, sizeof(W), 1, out);
    } else {
        /* Supplementary code point: subtract 0x10000 and split the
           remaining 20 bits into a surrogate pair. */
        U = codepoint - 0x10000;
        W = 0xD800 | (U >> 10);      /* high surrogate: ten most significant bits of U' */
        fwrite(&W, sizeof(W), 1, out);
        W = 0xDC00 | (U & 0x3FF);    /* low surrogate: ten least significant bits of U' */
        fwrite(&W, sizeof(W), 1, out);
    }
}
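
A minimal usage sketch, assuming it is appended to the same file as the function above (the output file name output.u16 is just an illustration):

int main(void) {
    FILE *out = fopen("output.u16", "wb");
    if (out == NULL)
        return 1;

    utf_16(0x0041, out);   /* 'A'     -> single code unit 0x0041       */
    utf_16(0x1F600, out);  /* U+1F600 -> surrogate pair 0xD83D 0xDE00  */

    fclose(out);
    return 0;
}

Note that fwrite stores each code unit in the machine's native byte order, so on a little-endian machine this produces UTF-16LE; for a fixed byte order you would need to swap bytes explicitly or prepend a byte order mark.
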
  • That’s the answer! It worked perfectly.
