How to get the first word of a text in Python?

Viewed 2,416 times

0

frase = str("Eu como abacaxi").split()

How do I show only the first word of the phrase ("Eu" in this case)?

  • 3

    Wouldn't it be frase[0]?

  • 2

    split returns a list, so just take its first element with frase[0], as already said. Another detail: calling str on something that is already a string is redundant and unnecessary; you can simply write "Eu como abacaxi".split()

  • 2

    Or just first, rest = frase.split(maxsplit=1), but this does not solve the problem if there is punctuation. For example, with frase = 'eu, eu mesmo e irene', the first word would be 'eu,', with the comma

2 answers

3

How to do it depends a lot on what your string looks like and on what you consider a "word".


In your specific case, just take the first element of the list returned by split:

frase = "Eu como abacaxi".split()
primeira_palavra = frase[0]

# or simply
primeira_palavra = "Eu como abacaxi".split()[0]

Detail: text in quotes in the code is already a string, so str("texto") is redundant and unnecessary.


Limitations

The solution above is very limited. As already pointed out in the comments, it no longer works if the phrase has any punctuation:

primeira_palavra = "Eu, você e ele comemos abacaxi".split()[0]
print(primeira_palavra) # Eu,

In that case, the first word ends up being "Eu,", with the comma as part of the "word". This is where we need to define what exactly a word is (it can't be "anything other than whitespace", because that is what you get when you call split with no arguments).

And the first element may also not be a word (e.g.: if the string is "- lorem ipsum", the first "word" will be the hyphen).

One solution is to consider that words are just "sequences of consecutive letters". In this case, just use the isalpha method to check which characters of the string are letters:

frase = " - 123... Eu, você e ele comemos abacaxi!"
inicio = fim = None
for i, c in enumerate(frase):
    letra = c.isalpha()
    if inicio is None and letra:
        inicio = i # start of the word
    elif inicio is not None and fim is None and not letra:
        fim = i # end of the word
        break # leave the loop
else: # reached the end of the string without finding a non-letter character
    fim = len(frase)

if inicio is not None and fim is not None:
    primeira_palavra = frase[inicio:fim]
    print(primeira_palavra)
else:
    print('The sentence contains no word')

I use enumerate to iterate over the characters of the string together with their indexes. At each iteration of the for, the variable c is one of the characters of the string, and i is its index.

Initially I look for the first character which is a letter to find the initial index. From there I go forward until I find a character that is not a letter, indicating that the word has already ended, in which case I keep the final index and close the loop with break.

Notice the else that belongs to the for. It runs only if the for is not interrupted by a break, which in this case means I reached the end of the string without finding a non-letter character (that is, either the string has no word, or its only word ends exactly at the end of the string). In this case, we take everything up to the end of the string.
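To see the for/else behaviour in isolation, here is a minimal sketch. The function name indice_fim is made up for this illustration and is not part of the code above:

```python
def indice_fim(texto):
    """Index of the first non-letter in texto, or len(texto) if there is none.

    Hypothetical helper, used only to illustrate for/else.
    """
    for i, c in enumerate(texto):
        if not c.isalpha():
            break  # leaving through break skips the else block
    else:
        # runs only when the loop ended without hitting break
        i = len(texto)
    return i


print(indice_fim("ab!cd"))  # 2
print(indice_fim("abcd"))   # 4
print(indice_fim(""))       # 0
```

The else also runs when the string is empty, since a loop with zero iterations never reaches a break.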

After the for, we check whether the string has a word and, if so, take the first one using the indexes found (slicing syntax gives the substring between the start and end indexes).

This solution deals with cases where the entire string is a single word, in addition to cases where it has no word (it may be for example "123", or "@!#").
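If you prefer something more compact, the same "first run of consecutive letters" idea can probably be written with itertools. This is just an alternative sketch, not the loop above; the function name primeira_palavra and the choice of returning None when there is no word are my own assumptions:

```python
from itertools import dropwhile, takewhile


def primeira_palavra(frase):
    """Return the first run of consecutive letters in frase, or None.

    Hypothetical helper: same definition of "word" as the loop above
    (str.isalpha), written with itertools instead of indexes.
    """
    # skip everything before the first letter
    resto = dropwhile(lambda c: not c.isalpha(), frase)
    # then keep characters for as long as they are letters
    palavra = ''.join(takewhile(str.isalpha, resto))
    return palavra or None


print(primeira_palavra(" - 123... Eu, você e ele comemos abacaxi!"))  # Eu
print(primeira_palavra("abacaxi"))  # abacaxi
print(primeira_palavra("@!#"))      # None
```

It has the same limitation discussed next: a hyphen ends the word.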


Compound words

There is only one problem: the previous solution does not handle compound words, such as "beija-flor" (hummingbird), since the hyphen is not a letter and isalpha returns False for it.

You could adapt the above code to accept a hyphen, as long as the characters immediately before and after are letters. But there is another alternative, which is to use regular expressions (regex), through the module re:

import re

frase = " - 123... Beija-flor, come abacaxi!"
regex = re.compile(r'\b[^\W\d_]+(-[^\W\d_]+)*\b')
match = regex.search(frase)
if match:
    primeira_palavra = match.group()
    print(primeira_palavra) # Beija-flor
else:
    print('The sentence contains no word')

For the word, many would probably use the shortcut \w, which does match all letters. The catch is that it also matches digits and the character _. If you do not want "123" and "abc_def" to count as words, the digits and the _ have to be excluded from the expression.

For that we use a negated character class: [^\W\d_]+. It matches everything that is not \W, not \d and not _. \W is the opposite of \w (that is, everything that is not a letter, digit or _). So everything that is not \W is the same as \w, except that I am also excluding the digits (\d) and the _ itself. Only letters are left.

In the end, it is a way of saying "I only want letters", and it is better than [a-zA-Z] because it also matches accented letters. The quantifier + takes one or more occurrences.
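A quick way to compare the three character classes is to run them over the same text. The sample string below is made up for this illustration:

```python
import re

texto = "maçã_123 café"  # made-up sample text

com_w = re.findall(r'\w+', texto)            # letters, digits and _
so_ascii = re.findall(r'[a-zA-Z]+', texto)   # unaccented letters only
so_letras = re.findall(r'[^\W\d_]+', texto)  # letters, accented or not

print(com_w)      # ['maçã_123', 'café']
print(so_ascii)   # ['ma', 'caf']
print(so_letras)  # ['maçã', 'café']
```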

Then comes a group with a hyphen followed by one or more letters, and this group may occur zero or more times (indicated by *). This covers words with more than one hyphen (such as "pão-de-ló", sponge cake, although I think it lost its hyphens after the orthographic reform, but anyway).

All of this is wrapped in the shortcut \b, which indicates a "word boundary" (a position with an alphanumeric character before it and a non-alphanumeric character after it, or vice versa); otherwise, in cases like "12abc34", the regex would consider "abc" to be a word.

Thus, compound words are also considered.
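To check the effect of the hyphen group and of \b, you can wrap the same regex in a small function and try a few inputs. The function name and the test strings other than the first one are assumptions of mine, just for illustration:

```python
import re

# same pattern as above: letters, optionally followed by "-letters" groups
regex = re.compile(r'\b[^\W\d_]+(-[^\W\d_]+)*\b')


def primeira_palavra(frase):
    """Return the first match of the pattern in frase, or None (hypothetical helper)."""
    match = regex.search(frase)
    return match.group() if match else None


print(primeira_palavra(" - 123... Beija-flor, come abacaxi!"))  # Beija-flor
print(primeira_palavra("12abc34 def"))           # def  ("abc" is rejected by \b)
print(primeira_palavra("abc_def"))               # None (_ is not a boundary)
print(primeira_palavra("guarda-chuva- aberto"))  # guarda-chuva
```

In the last case the trailing hyphen is left out, because the group requires at least one letter after each hyphen.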


Of course, you can always complicate it further. What about words with an apostrophe (such as "olho-d'água")? In this case we have to include (\'[^\W\d_]+)? in the regex (an apostrophe followed by one or more letters, where the ? indicates that this whole group is optional):

regex = re.compile(r'\b[^\W\d_]+(\'[^\W\d_]+)?(-[^\W\d_]+(\'[^\W\d_]+)?)*\b')
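As a quick check of this version, here is a sketch that runs it over a few strings. The helper name and the sample strings are my own choices for illustration; note that only the plain ' character is covered, not the typographic apostrophe:

```python
import re

regex = re.compile(r'\b[^\W\d_]+(\'[^\W\d_]+)?(-[^\W\d_]+(\'[^\W\d_]+)?)*\b')


def primeira_palavra(frase):
    """Return the first match of the pattern in frase, or None (hypothetical helper)."""
    match = regex.search(frase)
    return match.group() if match else None


print(primeira_palavra("olho-d'água limpo"))  # olho-d'água
print(primeira_palavra("d'água"))             # d'água
print(primeira_palavra("'aspas' simples"))    # aspas
```

In the last case the quotes are left out, since an apostrophe only counts when it has letters on both sides.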

Unicode

It can always get more complicated. See the example below:

import re

frase = "sa\u0301bio da montanha"  # "sábio" with á written as "a" + U+0301 (NFD)
regex = re.compile(r'\b[^\W\d_]+(\'[^\W\d_]+)?(-[^\W\d_]+(\'[^\W\d_]+)?)*\b')
match = regex.search(frase)
if match:
    primeira_palavra = match.group()
    print(primeira_palavra)
else:
    print('The sentence contains no word')

The result is:

sa

What happened is that the string "sábio da montanha" is in NFD (one of the normalization forms defined by Unicode). Basically, the character á ("a" with acute accent) is decomposed into two characters: the letter "a" without an accent and the accent itself (visually you cannot tell the difference, because it is still displayed as á). Since the accent character is not a letter, it is not considered part of the word (there are more detailed explanations of normalization here, here and here; although the links do not deal specifically with Python, the idea is the same).
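Since the two forms look identical on screen, it helps to print the code points. A small sketch using only the standard unicodedata module (the variable names are mine):

```python
from unicodedata import name, normalize

nfc = normalize('NFC', "sábio")  # á as a single code point
nfd = normalize('NFD', nfc)      # á as "a" + combining accent

print(len(nfc), len(nfd))  # 5 6
print(nfc == nfd)          # False

# first three characters of the NFD form, with their names
for c in nfd[:3]:
    print(f'U+{ord(c):04X}', name(c), c.isalpha())
# U+0073 LATIN SMALL LETTER S True
# U+0061 LATIN SMALL LETTER A True
# U+0301 COMBINING ACUTE ACCENT False
```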

One option is to normalize to NFC (using the unicodedata module), because then the "a" and the accent are combined into a single character (á), which the regex does treat as a letter:

from unicodedata import normalize

frase = "sa\u0301bio da montanha"  # NFD: "a" + U+0301
frase = normalize('NFC', frase)
# ... etc. (the rest is the same)

This way the word "sábio" is found by the regex.


Alternative: module regex

If you want, you can install the regex module, an excellent extension of the re module that lets you use Unicode properties. This matters because not every string in NFC has the accent "merged" into the letter, so the normalization-based solution above will not always work.

In this case, we use \p{L}\p{M}* for a "letter" (\p{L} is any character that Unicode defines as a letter, and \p{M} covers characters such as accents and others that can be applied to a letter, the so-called combining characters). You can see the full list on this page (\p{L} covers all categories starting with "L", and \p{M} all those starting with "M").

It would look like this:

import regex
r = regex.compile(r'\b(\p{L}\p{M}*)+(\'(\p{L}\p{M}*)+)?(-(\p{L}\p{M}*)+(\'(\p{L}\p{M}*)+)?)*\b')

# the rest of the code is the same
match = r.search(frase)
# etc.
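If installing a third-party module is not an option, roughly the same "letter followed by combining marks" idea can be sketched with the standard library only, using unicodedata.category. This is a different technique from the regex module, not a replacement for it: the helper below is hypothetical and ignores hyphens, apostrophes and word boundaries. The q + tilde string is a made-up example of a sequence that, as far as I know, has no precomposed form in NFC:

```python
import unicodedata


def primeira_palavra(frase):
    """First run of "a letter, then letters or combining marks", or None.

    Hypothetical helper: no hyphens, apostrophes or word boundaries.
    """
    inicio = None
    for i, c in enumerate(frase):
        categoria = unicodedata.category(c)  # e.g. 'Ll', 'Lu', 'Mn', 'Nd'
        if inicio is None:
            if categoria.startswith('L'):
                inicio = i  # a word can only start with a letter
        elif not categoria.startswith(('L', 'M')):
            return frase[inicio:i]  # neither letter nor mark: word ended
    return frase[inicio:] if inicio is not None else None


nfd = "sa\u0301bio da montanha"
print(primeira_palavra(nfd) == "sa\u0301bio")  # True

# NFC does not merge q + COMBINING TILDE into one character
print(len(unicodedata.normalize('NFC', "q\u0303")))  # 2
print(primeira_palavra("123 q\u0303ue!") == "q\u0303ue")  # True
```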

Finally, notice how complicated the solution can get depending on what you consider a "word". And we are limiting ourselves to the Portuguese definition: there are languages (such as Japanese and Chinese) in which whole sentences can be written with no spaces between words. \p{L} does match the characters of those writing systems, but if the sentence has no spaces you would probably need a solution specific to each language (example). If you want to restrict yourself to our alphabet, you can replace \p{L} with \p{Script=Latin}.

Of course, for simpler texts like yours, split may already be enough, as long as you also handle the empty string (""): in that case split returns an empty list, and accessing the first element raises an error. This does not happen with the other solutions, which correctly report that there is no word. Still, I thought it was worth expanding a little on the problem of "finding the first word of a text" for less obvious cases.
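For the split version, handling the empty case could look like the sketch below. The function name and the choice of returning None are assumptions of mine; maxsplit=1 is the suggestion from the comments and only avoids splitting the rest of the text:

```python
def primeira_palavra(frase):
    """First whitespace-separated chunk of frase, or None if there is none.

    Hypothetical helper; punctuation stays attached to the word.
    """
    partes = frase.split(maxsplit=1)  # at most: [first chunk, rest]
    return partes[0] if partes else None


print(primeira_palavra("Eu como abacaxi"))  # Eu
print(primeira_palavra(""))                 # None
print(primeira_palavra("   "))              # None
```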

1


Do this:

primeira_palavra = str("Eu como abacaxi").split()[0]

The split() method splits the string on the given separator; since none was given, it splits on whitespace.

After splitting, it returns a list like this: ['Eu', 'como', 'abacaxi']

So you can use the zero index to take the first element, which will then be the first word.
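The difference between passing a separator and passing none shows up when the text has extra spaces. A small sketch with a made-up string:

```python
frase = "  Eu   como abacaxi "  # made-up input with extra spaces

sem_separador = frase.split()     # runs of whitespace count as one separator
com_separador = frase.split(" ")  # every single space is a separator

print(sem_separador)  # ['Eu', 'como', 'abacaxi']
print(com_separador)  # ['', '', 'Eu', '', '', 'como', 'abacaxi', '']
```

With an explicit separator, index zero would be an empty string here, so calling split() with no arguments is the safer choice for this task.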
