How to transform a string into a character array?

13

Is it possible to transform a string into a character array?

The only thing I found was the method .split(param).

I'd like to convert a string into an array of characters, one character at each index.

I would like 'oi' to become var[0] = 'o', var[1] = 'i'.

How can I do this?

  • 2

    But if split already does that, what more do you want?

4 answers

17


To split a string into characters, you can use '' (the empty string) as the separator.

var string = 'oi';
var array = string.split(''); // ["o", "i"]

7

Besides the split('') that Sergio recommended, it is also possible to access each character directly by index, without converting to an array at all. For example:

var str = 'teste';
str[2] === 'e' // false
str[2] === 's' // true
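As a small sketch of index access without building an array (the variable names are illustrative, not from the answer), a plain loop can visit each position. Note that indexing, like split(''), works on UTF-16 code units, so it is only safe when every character fits in one unit.

```javascript
// Illustrative sketch: visit each position of a string without converting it.
const word = 'teste';
const seen = [];
for (let i = 0; i < word.length; i++) {
    // word[i] and word.charAt(i) return the same one-unit string
    seen.push(word[i]);
}
console.log(seen.join('-')); // t-e-s-t-e
```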

3

It depends on what you mean by "character".

The other answers work very well in the "ASCII world", but nowadays the definition of a character has become so complicated that I think it is worth exploring some possibilities.

That is not to say that the other answers are wrong, only that there are more cases to consider than have already been presented. In many situations you won't have to worry about what comes next (but I still think it's worth knowing, because if you ever need it, it's there).

But first, just to add one more option to the other answers, you can use Array.from:

console.log(Array.from('oi'));
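Array.from also accepts a mapping function as its second argument, which can save a separate map call. A minimal sketch (the upper-casing is only an illustration, not something the question asked for):

```javascript
// Convert to an array and transform each element in one step.
const upper = Array.from('oi', c => c.toUpperCase());
console.log(upper); // ["O", "I"]
```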


Now let’s go to the cases where all this fails:

function test(s) {
    console.log(`Testing string ${s}:`);
    console.log('split=', s.split(''));
    console.log('spread=', [...s]);
    console.log('Array.from=', Array.from(s));
}

test('\u00E1');  // NFC: works, every array has a single element ("á")
test('a\u0301'); // NFD: oops, every array has 2 elements

What happens is that the first "á" is normalized in NFC form and the second in NFD form. In short, the character "á" can be represented in two ways: NFC or "composed" (a single code point, "á") and NFD or "decomposed" (two code points: the letter "a" without an accent, followed by the combining accent itself). On screen both look identical, but if you inspect the underlying data you will see that one of them actually has two "characters" (which are combined into one when displayed). None of the methods above (split, spread, Array.from) combines them when assembling the array, which is why in the second case each array has two elements.

An alternative is to normalize to NFC using the method normalize:

console.log('Array.from=', Array.from(stringQualquer.normalize('NFC')));

Unless, of course, you actually want the accents separated from the letters, in which case you normalize to NFD.
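To make the two forms visible, here is a minimal sketch that builds them with escape sequences (the names nfc and nfd are just illustrative) and compares them before and after normalize:

```javascript
// Build both forms with escapes so the difference survives copy/paste.
const nfc = '\u00E1';   // "á" as one code point (LATIN SMALL LETTER A WITH ACUTE)
const nfd = 'a\u0301';  // "a" followed by COMBINING ACUTE ACCENT

console.log(nfc === nfd);                              // false
console.log(nfc.length, nfd.length);                   // 1 2
console.log(nfd.normalize('NFC') === nfc);             // true
console.log(Array.from(nfd.normalize('NFC')).length);  // 1
console.log(Array.from(nfc.normalize('NFD')).length);  // 2
```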

But that still doesn’t solve every case...


Emojis

Are emojis characters? That discussion is beside the point here, but the fact is that today we can have strings like this:

let poo = '💩';
console.log(poo);

Yes, an emoji directly in the code. And in this case, split no longer works:

function test(s) {
    console.log(`Testing string ${s}:`);
    console.log('split=', s.split(''));
    console.log('spread=', [...s]);
    console.log('Array.from=', Array.from(s));
}

test('💩');

Both spread and Array.from generate an array containing only one element: the "character" 💩 (PILE OF POO). But split generates an array with two elements. This is because JavaScript internally stores strings in UTF-16 (or UCS-2), and characters whose code point is greater than 0xFFFF end up being "decomposed" into two parts (the so-called surrogate pair).

In the case of PILE OF POO, its code point is 0x1F4A9, which in UTF-16 is converted to the surrogate pair 0xD83D and 0xDCA9, and these are the values that are in the array generated by split:

console.log('💩'.split('').map(c => c.codePointAt(0).toString(16))); // ["d83d", "dca9"]
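For the curious, the surrogate pair arithmetic can be sketched as follows. The helper name toSurrogatePair is hypothetical (it is not a built-in), and the sketch assumes its input is a code point above 0xFFFF:

```javascript
// Hypothetical helper (not a built-in): split a code point above 0xFFFF
// into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;     // leaves a 20-bit value
    const high = 0xD800 + (offset >> 10);   // top 10 bits
    const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return [high, low];
}

const pair = toSurrogatePair(0x1F4A9);
console.log(pair.map(n => n.toString(16))); // ["d83d", "dca9"]

// Cross-check against the code units the engine actually stores.
const pileOfPoo = String.fromCodePoint(0x1F4A9);
console.log(pileOfPoo.charCodeAt(0) === pair[0], pileOfPoo.charCodeAt(1) === pair[1]); // true true
```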

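This happens because split('') breaks the string apart by UTF-16 code units, whereas the string iterator used by spread and Array.from walks it by code points.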

In the end, it doesn't matter whether emojis are considered characters or not. The fact is, if you have a string containing emojis and you want to generate an array from it, which is "better": that each element of the array is an emoji, or that the emojis are broken into surrogate pairs? Of course the answer is still "it depends" (there may be a use case where you need to know the surrogate pairs), but I believe that in most cases you will probably want an array of emojis.

"So just use spread or Array.from, and normalize to NFC, which all works, right?"

Wrong

Usually an emoji corresponds to a single code point (like PILE OF POO, which is 0x1F4A9), but this is not always the case. Family emojis, for example, are combinations of other emojis.

For example, a family with father, mother and 2 daughters is actually a combination of the man emoji, the woman emoji and two girl emojis. To join them, the character ZERO WIDTH JOINER is used (also called just ZWJ; these sequences of emojis joined by ZWJ are called Emoji ZWJ Sequences).

That is, the emoji of "family with father, mother and 2 daughters" is actually a sequence of seven code points: U+1F468 (MAN), U+200D (ZERO WIDTH JOINER), U+1F469 (WOMAN), U+200D (ZERO WIDTH JOINER), U+1F467 (GIRL), U+200D (ZERO WIDTH JOINER) and U+1F467 (GIRL).

This sequence of code points can be displayed in different ways. If the system/program used recognizes this sequence, a single family image is shown:

[Image: family emoji with father, mother and 2 daughters]

But if this sequence is not supported, the emojis are shown next to each other:

[Image: family emoji with the members shown side by side, for systems that do not support the family emoji as a single image]

And in that case, none of the methods already seen above works:

function test(s) {
    console.log(`Testing string ${s}:`);
    console.log('split=', s.split(''));
    console.log('spread=', [...s]);
    console.log('Array.from=', Array.from(s));
}

test(String.fromCodePoint(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467));

Notice that split generated an array with 11 elements. That's because each emoji (man, woman and the 2 girls) was broken into a surrogate pair, totaling 8 elements; adding the three ZWJs gives 11. Spread and Array.from, in turn, returned arrays with 7 elements each (the four emojis plus the three ZWJs).

Remember that ZWJ is not used only with emojis. Several other writing systems use it in some of their "characters", for example क्‍ष (in Devanagari, used in India), which consists of 4 code points (one of which is ZWJ):

let s = 'क्‍ष';
console.log([...s].map(c => c.codePointAt(0).toString(16))); // ["915", "94d", "200d", "937"]

These sequences of code points that are interpreted as a single unit are called grapheme clusters. Older JavaScript environments have no native way to get them, in which case the way out is to use a library; newer runtimes provide Intl.Segmenter for this.
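Where the runtime ships Intl.Segmenter (recent browsers and Node.js versions do, but treat that as an assumption about your environment), splitting by grapheme cluster can be sketched like this. The helper name toGraphemes is hypothetical, and the exact boundaries for some scripts depend on the Unicode data bundled with the runtime:

```javascript
// Sketch, assuming the runtime provides Intl.Segmenter.
// toGraphemes is a hypothetical helper name, not a built-in.
function toGraphemes(s) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    // segment() returns an iterable of { segment, index, input } objects
    return Array.from(segmenter.segment(s), part => part.segment);
}

const family = String.fromCodePoint(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467);
console.log(toGraphemes(family).length);    // 1
console.log(toGraphemes('a\u0301').length); // 1
console.log(toGraphemes('oi'));             // ["o", "i"]
```

If Intl.Segmenter is missing, a library remains the fallback.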


Anyway, which method to use will depend on each case. If you know that your strings only have ASCII characters, for example, you don’t have to worry about any of this. But if you have accents, emojis and characters in other languages, then you should think about what you really need, and use the most appropriate method for each case.

2

In newer versions of ECMAScript, you can also use the spread operator, which converts the string (an iterable) into an array.

Like this:

const string = 'oi';
const array = [...string];

console.log(array);
