0
frase = str("Eu como abacaxi").split()
How do I show only the first word of the phrase ("Eu" in this case)?
3
How to do it depends a lot on what your string looks like and on what you consider to be a "word".
In your specific case, just take the first element of the list returned by split:
frase = "Eu como abacaxi".split()
primeira_palavra = frase[0]
# ou simplesmente
primeira_palavra = "Eu como abacaxi".split()[0]
A detail: text in quotes in the code is already a string, so str("texto") is redundant and unnecessary, because "texto" is already a string.
The solution above is very limited, as already pointed out in the comments: if the phrase has any punctuation, it no longer works:
primeira_palavra = "Eu, você e ele comemos abacaxi".split()[0]
print(primeira_palavra) # Eu,
In that case, the first word ends up being Eu, with the comma as part of the "word". This is where we have to define what exactly a word is (it can't be "anything other than whitespace", because that is what you are assuming when you call split with no arguments).
The first element may also not be a word at all (e.g. if the string is "- lorem ipsum", the first "word" will be the hyphen).
One solution is to consider that words are just "consecutive sequences of letters". In this case, just use the isalpha method to check which characters of the string are letters:
frase = " - 123... Eu, você e ele comemos abacaxi!"
inicio = fim = None
for i, c in enumerate(frase):
    letra = c.isalpha()
    if inicio is None and letra:
        inicio = i  # start of the word
    elif inicio is not None and fim is None and not letra:
        fim = i  # end of the word
        break  # leave the loop
else:  # reached the end of the string without finding a non-letter after the word
    fim = len(frase)
if inicio is not None and fim is not None:
    primeira_palavra = frase[inicio:fim]
    print(primeira_palavra)
else:
    print('A frase não contém nenhuma palavra')
I use enumerate to iterate over the characters of the string along with their indexes. At each iteration of the for, the variable c will be one of the characters in the string, and i will be its index.
First I look for the first character that is a letter, to find the start index. From there I move forward until I find a character that is not a letter, which indicates that the word has ended; in that case I store the end index and exit the loop with break.
Notice that there is an else that belongs to the for. It runs if the for is not interrupted by a break, which in this case means that I reached the end of the string without finding the end of a word (that is, either there is no word, or the only word runs to the very end of the string). In this case, we take everything up to the end of the string.
After the for we check whether the string has a word and, if so, take the first one using the indexes found earlier (using slicing syntax to take the part of the string between the start and end indexes).
This solution deals with cases where the entire string is a single word, in addition to cases where it has no word (it may be for example "123", or "@!#").
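To make those edge cases easier to try out, here is a minimal sketch that wraps the same idea in a function. The name primeira_palavra and the choice of returning None when there is no word are assumptions of this sketch, not part of the code above.

```python
def primeira_palavra(frase):
    """Return the first run of letters in frase, or None if there is none."""
    inicio = None
    for i, c in enumerate(frase):
        if c.isalpha():
            if inicio is None:
                inicio = i  # start of the first word
        elif inicio is not None:
            return frase[inicio:i]  # first non-letter after the word started
    # reached the end of the string: either no word, or the word runs to the end
    return frase[inicio:] if inicio is not None else None


for texto in [" - 123... Eu, você e ele", "abacaxi", "123", "@!#", ""]:
    print(repr(texto), "->", primeira_palavra(texto))
```

The first two calls should give "Eu" and "abacaxi", and the last three None.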
There is only one problem: the previous solution does not handle compound words, such as "beija-flor" (hummingbird), since the hyphen is not a letter and isalpha returns False for that character.
You could adapt the above code to accept a hyphen, as long as the characters immediately before and after it are letters. But there is another alternative, which is to use regular expressions (regex), via the re module:
import re
frase = " - 123... Beija-flor, come abacaxi!"
regex = re.compile(r'\b[^\W\d_]+(-[^\W\d_]+)*\b')
match = regex.search(frase)
if match:
    primeira_palavra = match.group()
    print(primeira_palavra)  # Beija-flor
else:
    print('A frase não contém nenhuma palavra')
For the word, many would probably use the shortcut \w, which does match all letters. But it also matches digits and the character _. If you do not want to consider "123" and "abc_def" as words, you have to exclude the digits and the _ from the expression.
For that we use a negated character class: [^\W\d_]+. It matches everything that is not \W, not \d and not _. \W is the opposite of \w (that is, everything that is not a letter, a digit or _). So everything that is not \W is the same as \w, except that I am also excluding the digits (\d) and the _ itself. So only letters are left.
In the end it is just a way of saying "I only want letters", and this shortcut is better than [a-zA-Z] because it also covers accented letters. The quantifier + matches one or more occurrences.
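A quick way to see the difference between the three character classes is to run them over the same text. This is only an illustrative sketch; the sample string is made up for the comparison.

```python
import re

texto = "abc_def 123 você"

# \w also accepts digits and the underscore
print(re.findall(r'\w+', texto))        # ['abc_def', '123', 'você']

# [^\W\d_] keeps only the letters, including accented ones
print(re.findall(r'[^\W\d_]+', texto))  # ['abc', 'def', 'você']

# [a-zA-Z] misses the accented letter
print(re.findall(r'[a-zA-Z]+', texto))  # ['abc', 'def', 'voc']
```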
Then we have a part made of a hyphen followed by one or more letters, and this part may occur zero or more times (indicated by *). This covers cases with more than one hyphen (such as "pão-de-ló", although I think it lost its hyphens after the orthographic reform, but anyway).
All of this is wrapped in the shortcut \b, which indicates a "word boundary" (a position that has an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). Without it, in cases like "12abc34" the regex would consider "abc" to be a word.
Thus, compound words are also handled.
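The effect of \b can be checked by comparing the expression with and without it. The names com_borda and sem_borda below are just labels chosen for this sketch.

```python
import re

com_borda = re.compile(r'\b[^\W\d_]+(-[^\W\d_]+)*\b')
sem_borda = re.compile(r'[^\W\d_]+(-[^\W\d_]+)*')

# without \b, the letters in the middle of "12abc34" are taken as a word
print(sem_borda.search("12abc34").group())    # abc

# with \b there is no boundary between "2" and "a", so nothing matches
print(com_borda.search("12abc34"))            # None

# when the letters stand on their own, both versions find them
print(com_borda.search("12 abc 34").group())  # abc
```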
Of course, as always, you can complicate things further. What about words with an apostrophe (such as "olho-d'água")? In this case we have to include (\'[^\W\d_]+)? in the regex (an apostrophe followed by one or more letters; the ? indicates that this whole part is optional):
regex = re.compile(r'\b[^\W\d_]+(\'[^\W\d_]+)?(-[^\W\d_]+(\'[^\W\d_]+)?)*\b')
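As a rough check, the same expression can be run over a few phrases. The test phrases are made up for this sketch, and the pattern is written with double quotes only so that the apostrophe does not need escaping.

```python
import re

regex = re.compile(r"\b[^\W\d_]+('[^\W\d_]+)?(-[^\W\d_]+('[^\W\d_]+)?)*\b")

for frase in ["olho-d'água limpo", "pão-de-ló, por favor", "123 beija-flor!"]:
    match = regex.search(frase)
    print(match.group() if match else None)
```

With these inputs the matches should be olho-d'água, pão-de-ló and beija-flor.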
You can always complicate things further. See the example below:
import re
frase = "sábio da montanha"
regex = re.compile(r'\b[^\W\d_]+(\'[^\W\d_]+)?(-[^\W\d_]+(\'[^\W\d_]+)?)*\b')
match = regex.search(frase)
if match:
    primeira_palavra = match.group()
    print(primeira_palavra)
else:
    print('A frase não contém nenhuma palavra')
The result is:
sa
What happened is that the string "sábio da montanha" is in NFD (one of the normalization forms defined by Unicode). Basically, the character á ("a" with an acute accent) is decomposed into two characters: the letter "a" without an accent and the accent itself (visually you cannot tell the difference, because it is always displayed as á). Since the accent character is not a letter, it is not considered part of the word (there is a more detailed explanation of normalization here, here and here; although the links do not talk specifically about Python, the idea is the same).
One option is to normalize to NFC (using the unicodedata module), because then the "a" and the accent are combined into a single character (the á), which the regex does treat as a letter:
frase = "sábio da montanha"
from unicodedata import normalize
frase = normalize('NFC', frase)
# ... etc. (the rest is the same)
Thus the word "sábio" is found by the regex.
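Since the two forms look identical on screen, a sketch with explicit escape sequences makes the difference visible. The names nfd and nfc are chosen for this example, and a shortened letters-only pattern is used instead of the full expression.

```python
import re
from unicodedata import normalize

regex = re.compile(r'\b[^\W\d_]+\b')

nfd = "sa\u0301bio"          # "a" followed by U+0301 (combining acute accent)
nfc = normalize('NFC', nfd)  # the pair becomes the single character U+00E1

print(len(nfd), len(nfc))                         # 6 5
print(regex.search(nfd).group())                  # sa
print(regex.search(nfc).group() == "s\u00e1bio")  # True
```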
regex
If you want, you can install the regex module, an excellent extension of the re module. With it you can use Unicode properties. This matters because not every string in NFC will have the accent character combined with the letter, so the solution above that relies on normalization will not always work.
In this case, we use \p{L}\p{M}* for a "letter" (\p{L} is any character that Unicode defines as a letter, and \p{M} covers characters such as accents and other marks that can be applied to a letter, the so-called combining characters). You can see the full list on this page (\p{L} covers all categories starting with "L", and \p{M} all those starting with "M").
So it would be:
import regex
r = regex.compile(r'\b(\p{L}\p{M}*)+(\'(\p{L}\p{M}*)+)?(-(\p{L}\p{M}*)+(\'(\p{L}\p{M}*)+)?)*\b')
# the rest of the code is the same
match = r.search(frase)
# etc.
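To see why normalization alone may not be enough, here is a small sketch using only the standard library. The example is an assumption chosen for illustration (it is not from the text above): a "q" followed by a combining tilde, a pair for which Unicode has no single precomposed character.

```python
import re
from unicodedata import normalize

# "q" + U+0303 (combining tilde): NFC has nothing to combine this pair into
texto = normalize('NFC', "q\u0303ua")

print(len(texto))  # 4, the combining mark is still a separate character
print(re.search(r'\b[^\W\d_]+\b', texto).group())  # q
```

The re module stops at the combining mark, which is the situation that \p{L}\p{M}* is meant to cover.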
Finally, see how the solution can get complicated depending on what you consider a "word". And we are only dealing with the definition for Portuguese, since there are languages (such as Japanese and Chinese) in which whole sentences can be written with no spaces between words. \p{L} does match the characters of those scripts, but if there are no spaces in the sentence you would probably need language-specific solutions (example). If you want to limit yourself to the Latin alphabet, you can replace \p{L} with \p{Script=Latin}.
Of course, for simpler texts like yours, split may already be enough, as long as you also handle the case where the string is empty (""): then split returns an empty list, and trying to access the first element raises an error. That does not happen with the other solutions, which correctly detect that there is no word. But I thought it was worth expanding a little on the problem of "finding the first word of a text" to cover the less obvious cases.
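One possible way to guard against the empty-list case is sketched below. The function name primeira_com_split and the choice of returning None are assumptions of this sketch.

```python
def primeira_com_split(frase):
    partes = frase.split()
    # split() returns [] for an empty or whitespace-only string,
    # so check before indexing to avoid an IndexError
    return partes[0] if partes else None


print(primeira_com_split("Eu como abacaxi"))  # Eu
print(primeira_com_split(""))                 # None
print(primeira_com_split("   "))              # None
```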
1
Do:
primeira_palavra = str("Eu como abacaxi").split()[0]
The split() method splits the string on the given separator; since none was given, it splits on whitespace.
After splitting, it returns a list like this: ['Eu', 'como', 'abacaxi']
So you can use index zero to get the first element, which will be the first word.
Wouldn't it be frase[0]?
– anonimo
split returns a list, so just take its first element with frase[0], as already said. Another detail: using str on a string is redundant and unnecessary; you can just do "Eu como abacaxi".split()
– hkotsubo
Or just first, rest = frase.split(maxsplit=1), but this does not solve the problem if there is punctuation, for example in frase = 'eu, eu mesmo e irene', where the first word would be 'eu,', with the comma
– Woss
@Woss I’m glad that "There should be one-- and preferably only one --obvious way to do it", right? :-)
– hkotsubo