Since your variable text contains the excerpt "A letra z sozinha nao sera retornado na busca.", I'm assuming that is the behavior you need.
So one way to do it is to use alternation: the character |, which means "or":
import re
text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoologico. A letra z sozinha nao sera retornado na busca."
palavras = re.findall(r'\b(z[a-z]+|[a-z]+z|[a-z]+z[a-z]+)\b', text, re.I)
print(palavras) # ['zebra', 'zoologico', 'sozinha']
So the regex has 3 alternatives (separated by |):
- a word that begins with z: z[a-z]+, or
- a word that ends with z: [a-z]+z, or
- a word with z in the middle: [a-z]+z[a-z]+
If you do not want one of these cases, simply remove the corresponding alternative. For example, if you don't want the words that contain a "z" in the middle, and only want the words that start or end with "z":
palavras = re.findall(r'\b(z[a-z]+|[a-z]+z)\b', text, re.I)
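To see which alternative is responsible for which match, here is a small sketch that runs each one on its own. The sample string and the variable names are made up for this illustration.

```python
import re

sample = "zebra traz sozinha z"

# Each alternative on its own, wrapped in \b ... \b as in the full regex
starts_with_z = re.findall(r'\bz[a-z]+\b', sample, re.I)
ends_with_z = re.findall(r'\b[a-z]+z\b', sample, re.I)
z_in_middle = re.findall(r'\b[a-z]+z[a-z]+\b', sample, re.I)

print(starts_with_z)  # ['zebra']
print(ends_with_z)    # ['traz']
print(z_in_middle)    # ['sozinha']
```

Note that the isolated "z" is matched by none of them, because every alternative requires at least one more letter.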
Before and after all this I put \b, which is a shortcut for "word boundary" (a position that has an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). This ensures that the regex picks up an entire word, and prevents it from matching only the "zinha" in the word "sozinha".
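A quick way to see the effect of \b is to try the "starts with z" alternative with and without it. This is only a sketch; the variable names are invented for the example.

```python
import re

# Without \b the regex is free to start in the middle of a word
without_boundary = re.findall(r'z[a-z]+', 'sozinha', re.I)
print(without_boundary)  # ['zinha']

# With \b the match has to begin and end at a word boundary
with_boundary = re.findall(r'\bz[a-z]+\b', 'sozinha', re.I)
print(with_boundary)  # []

whole_word = re.findall(r'\bz[a-z]+\b', 'zebra', re.I)
print(whole_word)  # ['zebra']
```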
I'm also assuming that a "word" is a sequence of letters from a to z. The shortcut \w also matches digits (0 to 9) and the character _, so if you want only letters, use [a-z].
I also used the flag re.I (case insensitive) to match both uppercase and lowercase letters. Without this flag, the regex above would only match lowercase letters.
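The difference the flag makes can be checked with a small sketch like this one (the sample string is made up):

```python
import re

pattern = r'\b(z[a-z]+|[a-z]+z|[a-z]+z[a-z]+)\b'
sample = "Zebra zoo TRAZ"

# Without re.I, both [a-z] and the literal z only match lowercase letters
case_sensitive = re.findall(pattern, sample)
print(case_sensitive)  # ['zoo']

# With re.I, uppercase letters are matched as well
case_insensitive = re.findall(pattern, sample, re.I)
print(case_insensitive)  # ['Zebra', 'zoo', 'TRAZ']
```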
The problem is that [a-z] does not match accented letters. You could change it to something like [a-záâãàéêíî....] (include all the accented characters inside the brackets), or simply use \w (keeping in mind that it also matches digits and _).
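The trade-off between [a-z] and \w can be seen in a sketch like the following. It assumes Python 3, where \w is Unicode-aware for str patterns by default; the sample string is made up.

```python
import re

sample = "zoológico z9 z_a traz"

# [a-z] stops at the accented letter, so "zoológico" is lost entirely
ascii_only = re.findall(r'\b(z[a-z]+|[a-z]+z|[a-z]+z[a-z]+)\b', sample, re.I)
print(ascii_only)  # ['traz']

# \w handles the accent, but also accepts digits and underscores
word_chars = re.findall(r'\b(z\w+|\w+z|\w+z\w+)\b', sample, re.I)
print(word_chars)  # ['zoológico', 'z9', 'z_a', 'traz']
```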
Or you can also use:
palavras = re.findall(r'\b(z[^\W\d_]+|[^\W\d_]+z|[^\W\d_]+z[^\W\d_]+)\b', text, re.I)
In this case, [^....] matches anything that is not inside the square brackets. Inside them we have \W (which is "anything that is not \w"), \d (digits) and _ (the character _ itself). In other words, it is a way of saying "\w, but without the digits and _", which ends up matching all letters, including accented ones.
Another alternative is to use re.split to separate the text into words, and then keep those that contain a "z":
palavras = [ palavra for palavra in re.split(r'\W', text) if len(palavra) > 1 and 'z' in palavra.lower() ]
In the split I use \W: everything that is not a \w (letter, digit or _). If you prefer, you can use [\W\d_] so that digits and _ are also treated as separators.
Then I keep the words that have more than one character (len(palavra) > 1) and that contain a "z". This eliminates the cases where the "z" is isolated. I also use 'z' in palavra.lower() to consider both lowercase and uppercase "z", but if you only want lowercase, use 'z' in palavra.
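To make the intermediate step visible, here is a sketch that prints what re.split returns before filtering. The sample string and the name pieces are made up for the example.

```python
import re

sample = "zebra, z. Traz"

# Splitting on every single non-word character leaves empty strings
# wherever two separators are adjacent (", " and ". ")
pieces = re.split(r'\W', sample)
print(pieces)  # ['zebra', '', 'z', '', 'Traz']

# len(palavra) > 1 drops both the empty strings and the isolated "z"
palavras = [palavra for palavra in pieces if len(palavra) > 1 and 'z' in palavra.lower()]
print(palavras)  # ['zebra', 'Traz']
```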
If you only want the ones that start or end with "z", you can switch to:
palavras = [ palavra for palavra in re.split(r'[\W\d_]', text) if len(palavra) > 1 and (palavra.startswith('z') or palavra.endswith('z')) ]
And again, you can use palavra.lower().startswith('z') (and likewise with endswith) if you want to consider both uppercase and lowercase "z".
Another alternative, using the idea jsbueno gave in the comments, is:
text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoológico. A letra z sozinha nao sera retornado na busca traz."
palavras = re.findall(r'\b(?=\w*z)\w{2,}\b', text, re.I)
print(palavras) # ['zebra', 'zoológico', 'sozinha', 'traz']
The idea is to use a lookahead (the part inside (?=...)) to check whether there is a z after \w* (zero or more alphanumeric characters). That is, whether there is a z at some point in the word.
The detail is that the lookahead only checks whether something exists ahead; it then goes back to where it was, and the rest of the regex is evaluated from there. And the rest of the regex is \w{2,} (two or more alphanumeric characters).
That is, the lookahead ensures that there is a z that is part of a word (at the beginning, middle or end), and \w{2,} ensures that the word has at least two characters, thus discarding the cases of z alone.
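The fact that a lookahead checks without consuming anything can be verified directly. This is only a sketch using re.match on single words; the variable names are made up.

```python
import re

# On its own, the lookahead succeeds but matches an empty string:
# it only confirms that a z exists somewhere ahead
lookahead_only = re.match(r'(?=\w*z)', 'sozinha').group()
print(repr(lookahead_only))  # ''

# \w{2,} then consumes the whole word, starting from the same position
full_word = re.match(r'(?=\w*z)\w{2,}', 'sozinha').group()
print(full_word)  # sozinha

# An isolated z passes the lookahead but fails \w{2,}
lone_z = re.match(r'(?=\w*z)\w{2,}', 'z')
print(lone_z)  # None

# A word without z fails the lookahead itself
no_z = re.match(r'(?=\w*z)\w{2,}', 'busca')
print(no_z)  # None
```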
Note: if you do not have the restriction of ignoring "z alone", the regex is (as pointed out by @fernandosavio in the comments):
palavras = re.findall(r'\b[a-z]*z[a-z]*\b', text, re.I)
That is: zero or more letters, the letter "z", and zero or more letters (remembering that you can swap [a-z] for \w or [^\W\d_], as explained above).
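As a rough comparison of the three character classes in this simpler regex, here is a sketch with a made-up sample string. It assumes Python 3, where \w and \W are Unicode-aware for str patterns.

```python
import re

sample = "z zoológico z9 traz"

# [a-z]: ASCII letters only, so the accented word is missed
ascii_only = re.findall(r'\b[a-z]*z[a-z]*\b', sample, re.I)
print(ascii_only)  # ['z', 'traz']

# \w: accented letters are fine, but digits and _ count as well
word_chars = re.findall(r'\b\w*z\w*\b', sample, re.I)
print(word_chars)  # ['z', 'zoológico', 'z9', 'traz']

# [^\W\d_]: letters only, including accented ones
letters_only = re.findall(r'\b[^\W\d_]*z[^\W\d_]*\b', sample, re.I)
print(letters_only)  # ['z', 'zoológico', 'traz']
```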
And in the solution with split, just remove the length restriction and only check whether the word contains a "z":
palavras = [ palavra for palavra in re.split(r'[\W\d_]', text) if 'z' in palavra.lower() ]
You accepted the answer with regex, and that's fine, since the question explicitly asked for regex. But I strongly suggest that, in your code, you use the version without regex, which simply uses plain Python: letra in palavra. It is much more readable and, for almost all applications, the performance gain from regex in this case (if there is in fact a gain) will be negligible. – jsbueno
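A minimal sketch of what a regex-free version along those lines could look like. It assumes that words are separated by whitespace and that stripping ASCII punctuation from both ends is enough; the function name palavras_com_letra is made up for this example and does not come from the comment.

```python
import string

def palavras_com_letra(text, letra='z'):
    """Return the words longer than one character that contain `letra`."""
    # Split on whitespace, then strip punctuation such as ',' and '.' from the ends
    palavras = (p.strip(string.punctuation) for p in text.split())
    # Compare in lowercase so that both "z" and "Z" are found
    return [p for p in palavras if len(p) > 1 and letra.lower() in p.lower()]

text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoológico. A letra z sozinha nao sera retornado na busca traz."
resultado = palavras_com_letra(text)
print(resultado)  # ['zebra', 'zoológico', 'sozinha', 'traz']
```

Unlike the regex versions, this sketch does not split on punctuation inside a token (for example a hyphenated word stays in one piece), so whether it fits depends on the input.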