The documentation for the replace method says:
Return a copy of the string with all occurrences of substring old replaced by new
That is, if you call string.replace('alguma coisa', 'outra coisa'), every occurrence of "alguma coisa" is replaced by "outra coisa". If, instead of 'outra coisa', you pass the empty string ('' or "", with nothing between the quotes), "alguma coisa" is replaced by nothing, which in practice is the same as removing it.
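A minimal sketch of that behaviour, using a sample string invented for this illustration (the variable names are not from the original answer):

```python
# Sample string made up for this illustration
frase = 'alguma coisa aqui, alguma coisa ali'

# Replacing with another string changes every occurrence
trocado = frase.replace('alguma coisa', 'outra coisa')

# Replacing with the empty string removes every occurrence
removido = frase.replace('alguma coisa', '')

print(trocado)         # outra coisa aqui, outra coisa ali
print(repr(removido))  # ' aqui,  ali' (the surrounding spaces remain)
print(frase)           # the original string is left unchanged
```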
So if you must use only replace and you have a list of characters to be removed, there is no way around it: you have to remove them one by one. So:
simbolos_pontuacao = ["'", '"', ",", ".", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'
# for each symbol, remove it from the string
for c in simbolos_pontuacao:
    texto = texto.replace(c, '')
print(texto) # Lorem ipsum dolor sit Amet bla bl etc whiskas sache
This is not very efficient, because every call to replace creates a new string (which is why I need to assign the return value back to the variable texto).
For those who prefer a more functional style, you can also use reduce:
from functools import reduce
texto = reduce(lambda t, s: t.replace(s, ''), simbolos_pontuacao, texto)
Deep down, this does the same thing as the loop, only with more overhead...
Anyway, if you stick to replace, there is not much else you can do. But there are other ways to do it.
Another way is to go through each character of the string and check whether it is in the list. Then I join all the characters that are not in it and build another string:
texto = ''.join(c for c in texto if c not in simbolos_pontuacao)
This should not be very efficient either, because for each character of the string I have to go through the whole list simbolos_pontuacao to check whether the character is one of those to be removed (since the in operator does a linear search on a list, the list is traversed many times).
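One way to avoid the repeated linear searches, sketched here as an assumption rather than something measured in this answer, is to convert the list to a set first, since membership tests on a set are hash lookups. The names simbolos_set and resultado are made up for this example, and the approach only works when every symbol is a single character:

```python
simbolos_pontuacao = ["'", '"', ",", ".", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'

# Build the set once; "c not in simbolos_set" is then a hash lookup
# instead of a scan over the whole list
simbolos_set = set(simbolos_pontuacao)
resultado = ''.join(c for c in texto if c not in simbolos_set)

print(resultado)  # Lorem ipsum dolor sit Amet bla bl etc whiskas sache
```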
Another alternative is to use regex. Since every symbol in the list is a single character, I can join them into one character class: ['",.!:;#@]. This way, the substitution is done in a single pass:
import re
r = re.compile(f'[{"".join(simbolos_pontuacao)}]')
texto = r.sub('', texto)
If any element of the list simbolos_pontuacao had more than one character, I could not use the character class. For example, if the list contained ab, then [ab] would not work, because that means "the letter a or the letter b". In this case, I have to use alternation:
import re
# included an element with more than one character
simbolos_pontuacao = ['...', "'", '"', ",", "!", ":", ";", '#', '@']
texto = 'Lorem: "ipsum" \'dolor\', sit! Amet; bla... bl@ #etc#, whiskas sache!!'
# escape each element and join them with | (alternation)
r = re.compile(f'({"|".join(map(re.escape, simbolos_pontuacao))})')
texto = r.sub('', texto)
print(texto) # Lorem ipsum dolor sit Amet bla bl etc whiskas sache
In this case the regex became (\.\.\.|'|"|,|!|:|;|\#|@). Note that the dot was written as \., since it has a special meaning in regex ("any character except line breaks"). For it to match only the character . itself, I have to escape it with \, and that is what re.escape does.
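One detail worth keeping in mind with alternation: the alternatives are tried from left to right, and the first one that matches wins. The sketch below uses a hypothetical list (not the one from the answer) in which one element is a prefix of another, and shows that sorting by length, longest first, is one possible way to handle that case:

```python
import re

# Hypothetical list in which one element ('#') is a prefix of another ('#!')
simbolos = ['#', '#!']
texto = 'tag #um e #!dois'

# Alternatives are tried left to right, so '#' wins and the '!' is left behind
r1 = re.compile('|'.join(map(re.escape, simbolos)))
sem_ordenar = r1.sub('', texto)
print(sem_ordenar)  # tag um e !dois

# Sorting by length (longest first) lets '#!' be tried before '#'
ordenados = sorted(simbolos, key=len, reverse=True)
r2 = re.compile('|'.join(map(re.escape, ordenados)))
com_ordenar = r2.sub('', texto)
print(com_ordenar)  # tag um e dois
```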
In the first case I did not need to escape anything, because most metacharacters "lose their powers" when they are inside brackets (there I would only need to escape the brackets themselves, plus the few characters that remain special inside a class, such as \, ^ and -).
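If the list did contain characters that stay special inside a character class, a cautious sketch would be to pass each element through re.escape before joining them. The list and text below are hypothetical and are not part of the original example:

```python
import re

# Hypothetical list with characters that are special inside [...]
simbolos = [']', '[', '\\', '^', '-', '.']
texto = 'a[b]c\\d^e-f.g'

# re.escape adds the backslashes needed to keep the class valid
r = re.compile(f'[{"".join(map(re.escape, simbolos))}]')
resultado = r.sub('', texto)
print(resultado)  # abcdefg
```

Escaping characters that did not need it (such as the dot) is harmless here, so applying re.escape to every element keeps the code simple.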
Very good. I confess I have always been too lazy to understand how maketrans works, and I have finally understood it :-) I was also surprised that the regex was not the slowest of all (by the way, checking whether each character is in the list turns out to be worse than I imagined, and the cost of creating a set ends up paying off) – hkotsubo
I wrote the answer thinking that translate would be worth it because of its C implementation and because it apparently does everything in one loop... But the result with set also caught me by surprise. I did not expect that much – fernandosavio