An alternative is to use lookarounds:
import re
resultado = re.sub(r"(?<!\w)([^\w\s]|\d)+(?!\w)", " ", texto)
(?<!\w) is a negative lookbehind, which checks that there is no \w immediately before, and (?!\w) is a negative lookahead, which checks that there is no \w immediately after.
Between them I also use alternation (the |, which means "or") so that digits are matched as well (because \w also matches digits, and by negating it with [^, you were also excluding digits from the replacement).
I also use the quantifier + (one or more occurrences), to match one or more characters that are [^\w\s] or \d. That is, the regex matches these characters, provided there is no \w (letters, digits or _) right before or right after them.
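As a rough illustration of how the lookarounds behave, here is a small sketch. The sample string is hypothetical (it is not from the question) and was chosen so that digits and punctuation appear both on their own and inside a word:

```python
import re

# Hypothetical sample input, used only for illustration.
texto = "abc 123 !!! a1b2 fim"

# Same expression as above: runs of punctuation/digits with no \w on either side.
resultado = re.sub(r"(?<!\w)([^\w\s]|\d)+(?!\w)", " ", texto)

# "123" and "!!!" are each replaced by one space; "a1b2" is untouched,
# because its digits have a letter next to them.
print(repr(resultado))  # 'abc     a1b2 fim'
```

Note that the original spaces around "123" and "!!!" are kept, so the result has a run of five spaces where those two tokens were.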
You could also use split(), as the other answer indicated. The problem is that if the string has separators other than a single space (such as line breaks, tabs, or even several spaces in a row), they will all be replaced by a single space.
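To make the problem concrete, here is a minimal sketch. It assumes the split-based approach rejoins the parts with a single space, and the sample string is hypothetical:

```python
# Hypothetical input mixing a tab, a blank line and several spaces.
texto = "um\tdois\n\ntres   quatro"

# split() with no arguments breaks on any run of whitespace and discards it,
# so the original separators cannot be recovered when joining.
reunido = " ".join(texto.split())

print(repr(reunido))  # 'um dois tres quatro'
```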
An alternative that still uses split is to put the separator in a capture group, so the separators are also returned:
def substituir(s):
    if re.match(r'^\s+$', s):  # if it is a separator, do not replace it
        return s
    # remove the part only if it consists entirely of non-word characters and digits
    return re.sub(r'^[\W\d]+$', '', s)

resultado = ''.join(map(substituir, re.split(r'(\s+)', texto)))
The split is done on \s+ (one or more spaces, tabs, line breaks, etc.). Because it is in parentheses, it forms a capture group, so the separators are returned as well.
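A quick sketch of what re.split returns when the pattern has a capture group (the sample string is hypothetical):

```python
import re

# Hypothetical input mixing a tab, a blank line and several spaces.
texto = "um\tdois\n\ntres   quatro"

# With the parentheses, each separator is kept as its own list item.
partes = re.split(r"(\s+)", texto)
print(partes)  # ['um', '\t', 'dois', '\n\n', 'tres', '   ', 'quatro']

# Joining the parts back gives exactly the original string.
print("".join(partes) == texto)  # True
```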
Then just pass each part resulting from the split to the function substituir, which does nothing if the part is a separator, and otherwise removes the unwanted cases (\W is "everything that is not \w", and \d matches digits). I also use the anchors ^ and $, which indicate respectively the beginning and end of the string, ensuring that I only replace when the entire "word" consists of unwanted characters. If the word is valid, it will not match the regex, and in that case it is returned unchanged.
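Putting it together, here is a self-contained sketch of the split-based version applied to a hypothetical sample string (the input is an assumption made for illustration, not the text from the question):

```python
import re

def substituir(s):
    if re.match(r'^\s+$', s):  # separator: keep it exactly as it is
        return s
    # the anchors ^ and $ make this match only when the whole part is unwanted
    return re.sub(r'^[\W\d]+$', '', s)

# Hypothetical input with a space, a tab, a line break and a double space.
texto = "abc 123\t!!!\na1b2  fim"

resultado = ''.join(map(substituir, re.split(r'(\s+)', texto)))

# "123" and "!!!" are removed; "a1b2" stays, and every separator is preserved.
print(repr(resultado))  # 'abc \t\na1b2  fim'
```

Unlike the lookaround version, nothing is inserted in place of the removed parts, so the separators that surrounded them end up next to each other.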
Very good! Thank you for the reply.
– user20273