Search for words that contain a certain letter

3

I need to use the re library so that, when I pass a letter, it returns every whole word containing that letter.

import re

text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoologico. A letra z sozinha nao sera retornado na busca."

# searching like this, the code returns everything from the z onward
re.findall(r'z\w+', text)
# ['zebra', 'zoologico', 'zinha']

# doing it like this, it returns only words that have the z in the middle
re.findall(r'\w+z\w+', text)
# ['sozinha']

Is there a way to return all the words that contain the letter z?

  • You accepted the answer with regex, which is fine, since the question explicitly asked for regex. But I strongly suggest that in your code you use the version without regex, which simply uses pure Python's letra in palavra. It is much more readable and, for almost all applications, any performance gain from regex in this case (if there is one at all) will be negligible.

3 answers

4


Since your variable text contains the sentence "A letra z sozinha nao sera retornado na busca." ("The letter z alone will not be returned in the search"), I'm assuming that an isolated "z" should not be returned.

So one way to do it is to use alternation: the character |, which means or:

import re

text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoologico. A letra z sozinha nao sera retornado na busca."
palavras = re.findall(r'\b(z[a-z]+|[a-z]+z|[a-z]+z[a-z]+)\b', text, re.I)
print(palavras) # ['zebra', 'zoologico', 'sozinha']

So the regex has 3 alternatives (separated by |):

  • a word that begins with z: z[a-z]+, or
  • a word that ends with z: [a-z]+z, or
  • a word with z in the middle: [a-z]+z[a-z]+

If you do not want one of these cases, simply remove the corresponding alternative. For example, if you don't want the words that contain a "z" in the middle, and only want the words that start or end with "z":

palavras = re.findall(r'\b(z[a-z]+|[a-z]+z)\b', text, re.I)

Before and after all this I put \b, which is a shortcut for word boundary (a position that has an alphanumeric character before it and a non-alphanumeric character after it, or vice versa). This ensures that I am matching an entire word, and prevents the regex from taking only the "zinha" out of the word "sozinha".

I’m also assuming that "word" is a sequence of letters from a to z. The shortcut \w also considers numbers (digits from 0 to 9) and the character _, so if you want to take only the letters, use [a-z].

I also used the flag re.I (case insensitive) to consider uppercase and lowercase letters. Without this flag, the above regex would only consider lower case letters.
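To see what \b and re.I each contribute, here is a minimal sketch on a shortened, made-up string (sample and the result variable names are only for illustration, they are not part of the question):

```python
import re

sample = "zebra sozinha Zoo"

# No boundaries: the match can start in the middle of "sozinha".
without_boundary = re.findall(r'z[a-z]+', sample)

# With \b on both sides, only whole words that start with "z" match.
with_boundary = re.findall(r'\bz[a-z]+\b', sample)

# Adding re.I lets the same pattern match the uppercase "Z" as well.
ignoring_case = re.findall(r'\bz[a-z]+\b', sample, re.I)

print(without_boundary)  # ['zebra', 'zinha']
print(with_boundary)     # ['zebra']
print(ignoring_case)     # ['zebra', 'Zoo']
```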


The problem is that [a-z] does not match accented letters. You could change it to something like [a-záâãàéêíî....] (including all the accented characters inside the brackets), or simply use \w (knowing that it also matches digits and _).

Or you can still use:

palavras = re.findall(r'\b(z[^\W\d_]+|[^\W\d_]+z|[^\W\d_]+z[^\W\d_]+)\b', text, re.I)

Here, [^....] matches everything that is not inside the square brackets. Inside them we have \W (which is "anything that is not \w"), \d (digits) and _ (the _ character itself). In other words, it is a way of saying "\w, but without the digits and _", which ends up matching all the letters, including the accented ones.
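The difference between the three character classes can be checked on a small, made-up string (sample is only for illustration; this assumes Python 3, where \w is Unicode-aware for str patterns by default). For brevity the sketch uses only the "starts with z" alternative:

```python
import re

sample = "zoológico zebra z9"

# [a-z] stops at the accented "ó", so "zoológico" is not matched as a whole word.
ascii_only = re.findall(r'\bz[a-z]+\b', sample)

# \w accepts accented letters, but also digits and "_".
word_chars = re.findall(r'\bz\w+\b', sample)

# [^\W\d_] accepts letters (accented included) but rejects digits and "_".
letters_only = re.findall(r'\bz[^\W\d_]+\b', sample)

print(ascii_only)    # ['zebra']
print(word_chars)    # ['zoológico', 'zebra', 'z9']
print(letters_only)  # ['zoológico', 'zebra']
```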


Another alternative is to use re.split to separate the text into words, and then keep those that have a "z":

palavras = [ palavra for palavra in re.split(r'\W', text) if len(palavra) > 1 and 'z' in palavra.lower() ]

In the split I use \W: everything that is not a \w (letter, digit or _). If you want, you can use [\W\d_] so that digits and _ are also treated as separators.

Then I keep the words that have more than one character (len(palavra) > 1) and that contain a "z". This eliminates the cases where the "z" is isolated. I also use 'z' in palavra.lower() to consider both lowercase and uppercase "z", but if you only want to consider lowercase, use 'z' in palavra.

If you only want the ones that start or end with "z", you can switch to:

palavras = [ palavra for palavra in re.split(r'[\W\d_]', text) if len(palavra) > 1 and (palavra.startswith('z') or palavra.endswith('z')) ]

And again, you can use palavra.lower().startswith('z') if you want to consider uppercase and lowercase "z".
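As a rough sketch of that case-insensitive variant (the string sample is made up for illustration), note that re.split leaves empty strings between consecutive separators, and the len(palavra) > 1 test discards them together with the isolated "z":

```python
import re

sample = "Zebra, sozinha, traz e z"

# Split on anything that is not a letter; consecutive separators yield ''.
pieces = re.split(r'[\W\d_]', sample)

# Keep words longer than one character that start or end with "z",
# comparing in lowercase so that "Zebra" also counts.
palavras = [
    palavra for palavra in pieces
    if len(palavra) > 1
    and (palavra.lower().startswith('z') or palavra.lower().endswith('z'))
]

print(palavras)  # ['Zebra', 'traz']
```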


Another alternative, using the idea jsbueno gave in the comments, is:

text = "Texto de busca por palavras contendo a letra z, como por exemplo zebra, zoológico. A letra z sozinha nao sera retornado na busca traz."
palavras = re.findall(r'\b(?=\w*z)\w{2,}\b', text, re.I)
print(palavras) # ['zebra', 'zoológico', 'sozinha', 'traz']

The idea is to use a lookahead (the part inside (?=...)) to check whether there is a z after \w* (zero or more alphanumeric characters). That is, whether there is a z at some point in the word.

The detail is that the lookahead only checks whether something exists ahead, then goes back to where it was so that the rest of the regex is evaluated. And the rest of the regex is \w{2,} (two or more alphanumeric characters).

That is, the Lookahead ensures that there is a z which is part of a word (may be at the beginning, middle or end), and the \w{2,} ensures that it has at least two characters, thus discarding the cases of z alone.
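Since the question talks about passing a letter, the lookahead version can be wrapped in a small helper. This is only a sketch: the function name words_containing and the sample string are hypothetical, and re.escape is used so that a character with special meaning in regex is treated literally.

```python
import re

def words_containing(letter, text):
    """Return words of 2+ characters that contain `letter` (case-insensitive)."""
    # The lookahead checks that the letter appears somewhere in the word;
    # \w{2,} then consumes the whole word and rejects one-character words.
    pattern = rf'\b(?=\w*{re.escape(letter)})\w{{2,}}\b'
    return re.findall(pattern, text, re.I)

sample = "A letra z, zebra, sozinha, traz"
print(words_containing('z', sample))  # ['zebra', 'sozinha', 'traz']
print(words_containing('a', sample))  # ['letra', 'zebra', 'sozinha', 'traz']
```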


Note: if you do not have the restriction of ignoring the "z alone", the regex is (as pointed out by @fernandosavio in the comments):

palavras = re.findall(r'\b[a-z]*z[a-z]*\b', text, re.I)

Which is "zero or more letters", the letter "z", and zero or more letters (remembering that you can switch [a-z] for \w or [^\W\d_], as explained above).
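A quick check on a made-up string (sample is only for illustration) shows the effect of the * quantifier: the isolated "z" now comes back together with the other words.

```python
import re

sample = "letra z, zebra"

# Zero or more letters on each side, so a lone "z" is also a match.
palavras = re.findall(r'\b[a-z]*z[a-z]*\b', sample, re.I)

print(palavras)  # ['z', 'zebra']
```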

And in the solution with split, just remove the string size restriction and just check if it contains a "z":

palavras = [ palavra for palavra in re.split(r'[\W\d_]', text) if 'z' in palavra.lower() ]
  • Instead of using alternation, wouldn't it be simpler to use the quantifier * instead of + to match zero or more occurrences? In your example it would be [a-z]*z[a-z]*...

  • @fernandosavio No, because that regex also matches the isolated "z": https://regex101.com/r/dGFFUk/2/

  • But isn't that what the OP wants? I ask because of the last sentence: "Would there be any way to return all words that contain the letter z?"

  • @fernandosavio Well, the variable text contains the phrase "The letter z alone will not be returned in the search", so I concluded that he does not want these cases. But I agree that, as the question stands, it is ambiguous.

  • 1

    Really, I hadn’t seen that. Then it gets kind of hard.. hahahaha

  • @fernandosavio I updated the answer with the 2 options, thanks!

  • Can't we use some form of "non-consuming group" to avoid this ugly business of having to combine the three alternatives? I am thinking of a non-consuming group that detects the "z" in the next word, in any position, combined with the group that captures the word, which could then use \w, which already matches accented letters, and so on...

  • @jsbueno You are right, I updated the answer with this alternative, thank you!

1

The filter function can be used to get the same result:

lista = text.lower().split()  # turns the text into a list of lowercase words
f_filtro = lambda x: 'z' in x  # defines the filter function
filtro = filter(f_filtro, lista)  # filters the lowercase words
lista_resultado = list(filtro)  # builds the result as a list

Or we can do the same thing in one line:

lista_resultado = list(filter(lambda x: 'z' in x, text.lower().split()))
  • 1

    A classic case in which Python code without regex is an order of magnitude simpler than the version using regex.

  • 3

    I suggest also incorporating a version with list-comprehension as well, which is even simpler: words = [word for word in text.split() if 'z' in word]

  • It's even more readable, jsbueno! I was concerned about performance, but I used %timeit in the Jupyter console to measure the processing time of each solution, and the list comprehension still performed better. Very good.

  • 1

    map and filter are tools that predate Python's comprehensions. Besides being simpler to read (after someone explains how to read them, of course), comprehensions can be executed with fewer context switches (entering and exiting functions), which makes them more efficient.

0

It may make sense to use regular expressions in order to control the definition and the boundaries of a "word". With a plain split, punctuation stays attached to the words: "Existem no zoo: zebra, ...".split() --> the words with "z" come out as ["zoo:", "zebra,"]

I would write:

import re

# txt is the string to be searched
palavras = [x for x in re.findall(r'\w*[zZ]\w*', txt) if len(x) > 1]
print(palavras)

(OK: I admit that writing all this 'zzzz' makes one a little sleepy)
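To make the point about word boundaries concrete, here is a small comparison on a variation of the sentence above (the string sample is an assumption made for illustration, with an isolated "z" added at the end):

```python
import re

sample = "Existem no zoo: zebra, z"

# Plain split keeps punctuation attached and also returns the lone "z".
com_split = [palavra for palavra in sample.split() if 'z' in palavra.lower()]

# The regex decides what a word is, and the length test drops the lone "z".
com_regex = [x for x in re.findall(r'\w*[zZ]\w*', sample) if len(x) > 1]

print(com_split)  # ['zoo:', 'zebra,', 'z']
print(com_regex)  # ['zoo', 'zebra']
```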
