What is the best way to check if a string has any of these words?

Question

What is the best way to check if a string has any of these words?

Asked 4 years, 2 months ago

Viewed 64 times

3

I have a question, consider the following:

nomes_proibidos = ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']

def verificar_nome(n):
    for nome in nomes_proibidos:
        if n.find(nome) != -1:
            return False

    return True

verificar_nome('Pedro Hulk da Silva')

Is there a better way to do that? I’m in need of doing something similar, only with a much longer word list and I’m concerned about the performance.

1 answer

Browser other questions tagged python string

You are not signed in. Login or sign up in order to post.

by hkotsubo • **55,826** points · Answer 1 · 2021-06-03T14:11:30+00:00

An alternative is to use split to break the string into words, and see if each of the words is in the list:

def verificar_nome(frase):
    for palavra in frase.split():
        if palavra in nomes_proibidos:
            return False
    return True

This solution is "naive" because any phrase a little more complicated already breaks the algorithm. For example, if the phrase is Oi, tudo bem?, the "words" will be Oi, (comma), tudo and bem? (with the interrogation), so if "Hi" were on the list of forbidden words, it would not be found.

So deep down, we need to define what a "word" is, as already discussed here, here and here.

If you just want to consider the simple cases (words are always separated by space, no punctuation, etc) then the above solution would solve. Use find can end up picking up snippets within a word (for example, if the forbidden word is "faggot", using find you would block "sent", which is not the desired).

Actually word filters are a old problem and to this day is not very well solved. But finally...

In the above mentioned links (that, that and that) you will see other ways to break a phrase into words, using different definitions. Choose yours and change the split of the above code by the option of your preference.

If you want to disregard the difference between upper and lower case letters, you can use casefold to normalize the words:

nomes_proibidos = list(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))

def verificar_nome(frase):
    for palavra in frase.split():
        if palavra.casefold() in nomes_proibidos:
            return False
    return True

Of course for simpler cases, use lower or upper would also work, but the documentation says that casefold is more "aggressive" and treats exceptional cases as the German character ß, the capital version of which is "SS" (i.e., for Portuguese texts, lower or upper would be enough).

And finally, instead of a list of words, you can switch to a set, since search operations in this structure are faster than in a list:

# usar um set em vez de lista
nomes_proibidos = set(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))

Of course for a few names, the difference will be inconspicuous. Then test to confirm (but at first, should be faster than a list).

Doing a quick test with the module timeit:

from timeit import timeit

set_nomes_proibidos = set(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))
list_nomes_proibidos = list(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))

def verificar_nome_set(frase):
    for palavra in frase.split():
        if palavra.casefold() in set_nomes_proibidos:
            return False
    return True

def verificar_nome_list(frase):
    for palavra in frase.split():
        if palavra.casefold() in list_nomes_proibidos:
            return False
    return True

# frase com mil palavras, e com a palavra proibida no final
frase = ('abc ' * 1000) + 'Pedro Hulk da Silva'

# executa 10 mil vezes cada teste
params = { 'number' : 10000, 'globals': globals() }
# imprime os tempos em segundos
print(timeit('verificar_nome_set(frase)', **params))
print(timeit('verificar_nome_list(frase)', **params))

In my machine, the solution with set was slightly faster (about 0.3 seconds on average), but this may vary according to the hardware, the sentences and the list of forbidden words, so it’s up to you to test for your real cases and see if it makes a difference.