How to split a string into a list of characters in Python?

3

How can I convert a string (a word, for example) into a list?

Input:
a='carro'
Output:
['c', 'a', 'r', 'r', 'o']

I know that using split gives a list containing the whole word; I would like to know how to turn a single word into a list of its characters.

  • Just call the list constructor, passing the desired string. Since strings are also sequences of strings of length 1, that’s exactly what happens: list(a) -> ['c', 'a', 'r', 'r', 'o']

3 answers

13

This returns a list with the character at each position:

list(a)

See working on Coding Ground
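To make the contrast with split from the question concrete, here is a small sketch (the variable names are only illustrative):

```python
a = 'carro'

# split() with no arguments splits on whitespace, so a single word stays whole
whole_word = a.split()

# list() iterates over the string and collects each character
chars = list(a)

print(whole_word)  # ['carro']
print(chars)       # ['c', 'a', 'r', 'r', 'o']
```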

4

Complementing the other answer, there are some cases where using only list may not be enough. See this example:

print(list('pá'))
print(list('pá'))

Although it may not look like it, the two lines above are different, as this code prints the following:

['p', 'á']
['p', 'a', '́']

Note: If you are testing the above code, copy and paste it, because if you type it directly, it may not give the desired effect. More details on this throughout the answer.

This happens because, put very briefly, Unicode defines two ways of representing the letter "a" with an acute accent:

  1. composed - as the single code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE) (á)
  2. decomposed - as a combination of two code points (in this order):
       • U+0061 (LATIN SMALL LETTER A) (a)
       • U+0301 (COMBINING ACUTE ACCENT)

The first form is called NFC, and the second, NFD. For more details about normalization, what a code point is, etc., see the linked questions. In this case, the string in the first line of the above code is in NFC, and the one in the second line is in NFD. In the NFD output, the second element is the letter "a" without an accent, and the third element is the "acute accent" character itself, which, depending on how it is rendered, can appear almost "stuck" to the closing quote and be nearly imperceptible.
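Because the two forms look identical on screen, it can help to write them with escape sequences and inspect the code points. A minimal sketch, assuming the names nfc and nfd are just illustrative labels:

```python
import unicodedata

nfc = 'p\u00e1'    # "pá" with the single precomposed code point U+00E1
nfd = 'pa\u0301'   # "pá" as the letter a followed by U+0301 (combining acute accent)

for label, s in (('NFC', nfc), ('NFD', nfd)):
    # unicodedata.name() returns the official Unicode name of each code point
    names = [unicodedata.name(ch) for ch in s]
    print(label, len(s), names)

# The two strings render the same but do not compare equal
print(nfc == nfd)  # False
```

Writing the strings with escapes also avoids the copy-and-paste problem, since the form no longer depends on the keyboard or editor.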

The catch is that, when shown on the screen, the NFC and NFD forms end up being rendered the same way, so there is no apparent difference between the two lines of code above: both show the letter "á". That is why you should copy and paste the above code to test it; if you type it directly from the keyboard, you will most likely get only one of the forms and the two lines will end up identical.

What happens is that list, when it receives an iterable, builds a list in which each element of the iterable becomes an element of the list. Strings are iterable in Python, and when we iterate through a string we are actually iterating through its code points. Hence the difference when the string is in NFD.

And since the question does not specify where the strings come from, this is a situation that can happen: the data may come from a file, from an HTTP request, or from a user who copied and pasted text from somewhere that was in NFD (and, as it is rendered the same way, did not notice the difference), etc. That is, it can happen without you realizing it (as happened, for example, in this other question).


How to solve?

For simple texts - especially if they are in Portuguese - simply normalize to NFC:

from unicodedata import normalize

print(list(normalize('NFC', 'pá')))

This guarantees that even if the string is in NFD, the "a without accent" and "accent" characters will be "joined" into one (the "á").
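As a quick check of that claim, the sketch below builds both forms with escape sequences (so it does not depend on copy and paste) and compares them before and after normalization:

```python
from unicodedata import normalize

nfc = 'p\u00e1'    # precomposed á
nfd = 'pa\u0301'   # a + combining acute accent

print(nfc == nfd)                    # False: different code points
print(normalize('NFC', nfd) == nfc)  # True: NFD input is composed into U+00E1
print(normalize('NFC', nfc) == nfc)  # True: NFC input is left unchanged

normalized_list = list(normalize('NFC', nfd))
print(len(normalized_list))          # 2
```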

Note that this is not restricted to the Portuguese language and also works for other types of "characters":

# yes, an emoji directly in the code (emojis also have code points defined by Unicode, so it works the same way)
print(list(normalize('NFC', '堆積'))) # ['', '堆', '積']

But that doesn’t always work, of course. Not every character has a precomposed form, so normalization to NFC will not always produce a single code point. For example:

from unicodedata import normalize

s = 'ẛ̣'
# even after normalizing to NFC, the result still has two code points
print(list(normalize('NFC', s))) # ['ẛ', '̣']

Or else:

# family emoji
s = ''.join(map(chr, [0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467]))
print(s) # if your browser supports it, this shows the family emoji 👨‍👩‍👧‍👧
print(list(normalize('NFC', s))) # ['👨', '\u200d', '👩', '\u200d', '👧', '\u200d', '👧']

Family emojis are actually formed by several different emojis, joined by Zero Width Joiner characters (the \u200d shown above). So even if we normalize to NFC, the resulting list will still have separate code points.
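To see which pieces make up the family emoji, one option is to print the Unicode name of each code point. A small sketch using the same code points as above (the variable names are illustrative):

```python
import unicodedata

family = ''.join(map(chr, [0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467]))

# One name per code point, in order
names = [unicodedata.name(ch) for ch in family]

for ch, name in zip(family, names):
    print(f'U+{ord(ch):04X}', name)
```

This lists a man, a woman and two girls, each pair separated by ZERO WIDTH JOINER.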

And that doesn’t just apply to emojis: characters from other writing systems have the same property (they are formed by more than one code point and, even after normalizing to NFC, still have more than one):

print(list(normalize('NFC', 'नु'))) # ['न', 'ु']

A set of code points that together form "one thing" is called a Grapheme Cluster. If the goal is to have a list of Grapheme Clusters (since their code points, taken separately, do not have exactly the same meaning), there is no direct way to do it, and the way out is to resort to external libraries.

An example is the module grapheme:

# Note: third-party module - install it from https://pypi.org/project/grapheme/
from grapheme import graphemes

def to_list(s):
    return list(graphemes(s))

# family emoji
s = ''.join(map(chr, [0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467]))
print(to_list(s)) # ['👨\u200d👩\u200d👧\u200d👧']
print(to_list('नु')) # ['नु']
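If installing a third-party package is not an option, a rough approximation can be sketched with unicodedata alone. The helper below (rough_clusters, a hypothetical name) is an assumption-laden simplification and not the Unicode segmentation algorithm (UAX #29): it only attaches combining marks to the preceding character and glues together whatever is linked by a Zero Width Joiner, so it will get other cases (flags, Hangul jamo, etc.) wrong.

```python
import unicodedata

ZWJ = '\u200d'  # Zero Width Joiner

def rough_clusters(s):
    """Group code points into approximate grapheme clusters (simplified, not UAX #29)."""
    clusters = []
    join_next = False  # True when the previous code point was a ZWJ
    for ch in s:
        # General categories Mn, Mc and Me are the combining marks
        is_mark = unicodedata.category(ch).startswith('M')
        if clusters and (is_mark or ch == ZWJ or join_next):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
        join_next = (ch == ZWJ)
    return clusters

family = ''.join(map(chr, [0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f467]))
print(len(rough_clusters(family)))       # 1
print(len(rough_clusters('pa\u0301')))   # 2: 'p' and 'a' + combining accent
print(len(rough_clusters('\u0928\u0941')))  # 1: Devanagari letter + vowel sign
```

For anything beyond a quick experiment, a library that implements the full segmentation rules remains the safer choice.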

The difference, of course, is that each list element will not always contain a single code point. But this is because, thanks to all the possibilities that Unicode brings us, the definition of "character" has become more complex. Is an emoji a character (in the sense of being "a single symbol shown on the screen, which has its own meaning")? The á in NFD has two code points but is shown as a single character, so is it acceptable to have more than one code point in each element of the list? Or do I want a list in which each element is a code point, regardless of normalization?

For further discussion, see the links already mentioned. In any case, the "right" solution depends on what you need: if your text will not have grapheme clusters, for example, you do not need to worry about this last part and can just normalize to NFC. Each case is different.

  • 2

    Dude, I like informative and thorough answers. But in this case, the person asked how to get to the bakery from the corner, and you gave every way of doing it, including by plane and ocean liner (go to a port city, buy a cruise ticket). I think common sense has to prevail. In this case it would be worth asking a question worthy of this answer, one that actually involves Unicode and grapheme clusters, and then pasting this same answer there: I think that would be more constructive. The answer here is list(a) and that’s it.

  • (To illustrate: my most-voted answer on the English S.O. is just one of those, where someone asks "how to count occurrences of a substring within a string" and I answer in 3 lines saying to use the .count method. I have a more complete answer to the same question (for substrings that overlap), but I left it as a separate answer. It’s a case where common sense says that the simple answer will help more than 90% of the people who land there.)

  • 2

    @jsbueno The question speaks of "separating the characters of a string" without specifying what can be in the string (of course the given example has a simple string, but I found it important to emphasize that strings are not limited to this, and generalized to "anything"). Not least because many people do not even conceive of the things I mention in the answer, so I thought it valid to show them. The answer to the specific example in the question may be list(a), but if you consider the title (which does not specify anything), I think it is valid to discuss Unicode and everything else...

  • 2

    ...because then the definition of "character" becomes not so clear, "separating the characters" can mean several things, and the correct solution will depend on each case. I agree that a separate, more specific question might have been better, but my fear is that it would end up closed as a dup, so in the end I preferred to answer right here.

  • 4

    This definition of character is a subject for two book chapters (not one), and it’s easy to see that it’s not what’s troubling the OP. And it would not be a "dup": a question focused on understanding what a Unicode character is and how sequences work is much wider than this one, and even extends beyond the programming language. What you wrote here is something that arguably 95% of devs don’t know, and it’s very important, but it’s a more advanced topic than "how to separate characters from a string in a list", when the OP’s example doesn’t even include accented characters.

  • 2

    @jsbueno the example presented has its merit and cannot be ignored from the point of view of character processing: print(len(list('pá')) != len(list('pá')))

  • 1

    @jsbueno I have always found many of your answers excellent precisely because they are more than a "try this" and go beyond what was asked. That’s what I’ve tried to do here, keeping in mind that content should be useful to anyone, not just the OP. So I find it irrelevant whether or not the OP was concerned with the implications of Unicode (and since we are speculating about what he wants/thinks, maybe he is in the 95% that you mentioned, and so did not mention anything in the question because he did not even imagine this could exist) :-) In fact it does not matter, the answers should not serve only the OP...

  • 1

    If we are going to discuss how to separate the characters of a string, I think it is relevant to point out (even if superficially) that the definition of "character" may not be that simple. I did not intend to exhaust the subject (which in fact needs a book to do so) but at least to mention that it exists, so that whoever wants to can look for more information. For this I put links to other questions, which in turn have links to many other sources (this will be my last comment here, but if you want, we can continue on meta)


0

It is possible to convert a string into a list. To do this, simply pass the string as the argument to list:

a = list(input('Type a word: '))
print('The created list is: {}'.format(a))

This code takes any string and then converts it to a list.

  • 1

    That’s right, Augusto. I got carried away with the answer and forgot this detail. In Python, everything received by input is already a string, unless you convert it to another type (int, float, etc.). Now yes, the algorithm is perfect.

  • What’s the difference from what the other answer presented?

  • @Luiz Felipe, good evening. The difference is that my answer is more general, since it captures any string instead of working with just the word carro.

  • 2

    The other answer also works for any string. It just didn’t put the input, but actually it doesn’t even make a difference, because the string could come from anywhere (input, file, HTTP request, etc.), that the solution would be the same: use list. Therefore, in fact there is no difference to the other answer...

  • @hkotsubo, good evening! Thank you .... and, thank you for the remark. My intention is to share knowledge. I’m not here competing with anyone. Hug.

  • 3

    It is not a question of competing, but of having redundant answers (which adds nothing to the site, because the idea is not to have multiple answers with basically the same solution). New answers are welcome, as long as they show other ways of doing it, which was not the case here (using list had already been suggested in the other answer, and adding input does not make it different, since the solution in the end is the same)

  • @hkotsubo, I fully understand your position and intention. I am here with the same intention as you. I will seek to quintuple my performance. Peace to all.

  • You could also show another syntax in your answer, using unpacking: print([*"Hello World"]).
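For reference, the unpacking form mentioned in the comment above gives the same result as list. A brief sketch comparing the spellings (variable names are illustrative):

```python
s = 'Hello World'

via_list = list(s)                    # constructor
via_unpack = [*s]                     # iterable unpacking inside a list display
via_comprehension = [ch for ch in s]  # explicit loop over the characters

print(via_list == via_unpack == via_comprehension)  # True
print(len(via_unpack))                              # 11, the space counts too
```

All three iterate over the string's code points, so the Unicode caveats discussed in the other answer apply equally to each.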
