An alternative is to use split
to break the string into words, and see if each of the words is in the list:
def verificar_nome(frase):
for palavra in frase.split():
if palavra in nomes_proibidos:
return False
return True
This solution is "naive" because any phrase a little more complicated already breaks the algorithm. For example, if the phrase is Oi, tudo bem?
, the "words" will be Oi,
(comma), tudo
and bem?
(with the interrogation), so if "Hi" were on the list of forbidden words, it would not be found.
So deep down, we need to define what a "word" is, as already discussed here, here and here.
If you just want to consider the simple cases (words are always separated by space, no punctuation, etc) then the above solution would solve. Use find
can end up picking up snippets within a word (for example, if the forbidden word is "faggot", using find
you would block "sent", which is not the desired).
Actually word filters are a old problem and to this day is not very well solved. But finally...
In the above mentioned links (that, that and that) you will see other ways to break a phrase into words, using different definitions. Choose yours and change the split
of the above code by the option of your preference.
If you want to disregard the difference between upper and lower case letters, you can use casefold
to normalize the words:
nomes_proibidos = list(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))
def verificar_nome(frase):
for palavra in frase.split():
if palavra.casefold() in nomes_proibidos:
return False
return True
Of course for simpler cases, use lower
or upper
would also work, but the documentation says that casefold
is more "aggressive" and treats exceptional cases as the German character ß
, the capital version of which is "SS" (i.e., for Portuguese texts, lower
or upper
would be enough).
And finally, instead of a list of words, you can switch to a set
, since search operations in this structure are faster than in a list:
# usar um set em vez de lista
nomes_proibidos = set(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))
Of course for a few names, the difference will be inconspicuous. Then test to confirm (but at first, should be faster than a list).
Doing a quick test with the module timeit
:
from timeit import timeit
set_nomes_proibidos = set(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))
list_nomes_proibidos = list(map(str.casefold, ['Corona', 'Clorivaldo', 'Pfizernaldo', 'Hulk', 'Naruto', 'Goku', 'Cloroquina']))
def verificar_nome_set(frase):
for palavra in frase.split():
if palavra.casefold() in set_nomes_proibidos:
return False
return True
def verificar_nome_list(frase):
for palavra in frase.split():
if palavra.casefold() in list_nomes_proibidos:
return False
return True
# frase com mil palavras, e com a palavra proibida no final
frase = ('abc ' * 1000) + 'Pedro Hulk da Silva'
# executa 10 mil vezes cada teste
params = { 'number' : 10000, 'globals': globals() }
# imprime os tempos em segundos
print(timeit('verificar_nome_set(frase)', **params))
print(timeit('verificar_nome_list(frase)', **params))
In my machine, the solution with set
was slightly faster (about 0.3 seconds on average), but this may vary according to the hardware, the sentences and the list of forbidden words, so it’s up to you to test for your real cases and see if it makes a difference.