Remove numbers and special characters from a text, but not within a word

I would like to remove numbers and/or special characters from a text, but not within a word. Example:

texto = "ol@ mundo, eu bebo H20 e nao fumo cig@rro !#& 123"

The result should be:

ol@ mundo, eu bebo H20 e nao fumo cig@rro

Code:

import re

texto = "ol@ mundo, eu bebo H20 e nao fumo cig@rro !#& 123"
resultado = re.sub(r"[^\w\s]", " ", texto)

print(resultado)

Output:

ol  mundo  eu bebo H20 e nao fumo cig rro     123

2 answers

An alternative is to use lookarounds:

resultado = re.sub(r"(?<!\w)([^\w\s]|\d)+(?!\w)", " ", texto)

(?<!\w) is a negative lookbehind: it checks that there is no \w immediately before the match. (?!\w) is a negative lookahead: it checks that there is no \w immediately after it.

Between them I use alternation (the |, which means "or") so that digits are matched as well. This is needed because \w also matches digits, so by negating it with [^...] you were excluding digits from the replacement.

I also use the quantifier + (one or more occurrences) to match one or more characters that are either [^\w\s] or \d. In other words, the regex matches a run of these characters only if there is no \w (a letter, digit or _) right before or right after it.
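As a rough, self-contained sketch of the lookaround approach described above: the function name remover_isolados and the non-capturing group (?:...) are my own choices for illustration, not part of the original snippet. Each removed run becomes a space, so the result ends with leftover spaces that you may want to strip.

```python
import re

# Same idea as the pattern above, with a non-capturing group (an illustrative tweak).
PADRAO = re.compile(r"(?<!\w)(?:[^\w\s]|\d)+(?!\w)")

def remover_isolados(texto):
    """Replace runs of digits/special characters that touch no word character."""
    return PADRAO.sub(" ", texto)

texto = "ol@ mundo, eu bebo H20 e nao fumo cig@rro !#& 123"
resultado = remover_isolados(texto)

print(repr(resultado))           # trailing spaces remain where "!#&" and "123" were
print(repr(resultado.rstrip()))  # 'ol@ mundo, eu bebo H20 e nao fumo cig@rro'
```

Whether to call rstrip() (or to replace with an empty string instead of a space) depends on how much you care about the original spacing.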


You could also use split(), as the other answer suggests. The problem is that if the string contains whitespace separators other than a single space (such as line breaks, tabs, or runs of several spaces), each of them is replaced by a single space when the words are joined back.
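To make that side effect concrete, here is a minimal sketch; the helper name juntar_com_split and the sample string are made up for illustration.

```python
def juntar_com_split(texto):
    """Split on any whitespace and join with single spaces (hypothetical helper)."""
    return " ".join(texto.split())

texto_com_separadores = "ol@\tmundo,\n\neu  bebo"

# The tab, the blank line and the double space are all collapsed.
print(repr(juntar_com_split(texto_com_separadores)))  # 'ol@ mundo, eu bebo'
```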

An alternative with split is to use capture groups, so the separators are also returned:

def substituir(s):
    if re.match(r'^\s+$', s):  # if it is a separator, do not replace it
        return s
    return re.sub(r'^[\W\d]+$', '', s)

resultado = ''.join(map(substituir, re.split(r'(\s+)', texto)))

The split is done on \s+ (one or more spaces, tabs, line breaks, etc.). Because it is in parentheses, it forms a capture group, so the separators are returned as well.

Then just pass each part returned by the split to the function substituir, which leaves separators untouched and removes the unwanted parts (\W is "anything that is not \w", and \d matches digits). I also use the anchors ^ and $, which mark the beginning and end of the string respectively, so a replacement happens only when the entire "word" consists of unwanted characters. If the word is valid, it does not match the regex and is returned unchanged.
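The following sketch shows what the capture group returns and that tabs and line breaks survive. The wrapper limpar_preservando and the sample text are assumptions of mine, and I use re.fullmatch for the separator test, which should behave like the ^\s+$ check here.

```python
import re

def substituir(parte):
    # Separators are returned unchanged.
    if re.fullmatch(r"\s+", parte):
        return parte
    # Remove the part only if it consists entirely of non-word characters or digits.
    return re.sub(r"^[\W\d]+$", "", parte)

def limpar_preservando(texto):
    """Hypothetical wrapper: clean each part while keeping the original separators."""
    return "".join(substituir(p) for p in re.split(r"(\s+)", texto))

amostra = "ol@\tmundo,\n!#& H20  123 fim"

partes = re.split(r"(\s+)", amostra)
print(partes)
# ['ol@', '\t', 'mundo,', '\n', '!#&', ' ', 'H20', '  ', '123', ' ', 'fim']

print(repr(limpar_preservando(amostra)))
# 'ol@\tmundo,\n H20   fim'
```

The separators around a removed part stay in place, which is why three spaces remain between H20 and fim.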

  • 1

    Very good! Thank you for the reply.

As an alternative to the other answer:

You can first split the text into words and then apply the regex to each one:

import re
texto = "ol@ mundo, eu bebo H20 e nao fumo cig@rro !#& 123"
palavras = texto.split()

palavras = [re.sub(r"[^\w\s]", "", palavra)
    if not re.sub(r"[^\w\s]", "", palavra)
    else palavra
    for palavra in palavras]
print(" ".join(palavras))  # prints "ol@ mundo, eu bebo H20 e nao fumo cig@rro  123"

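If nothing is left of the word after the replacement (if not re.sub(r"[^\w\s]", "", palavra)), the word consisted only of special characters and is replaced by an empty string; otherwise the original word is kept. Note that 123 survives, because \w also matches digits.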
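If digit-only words should go too, one possible variation (not the code from this answer) is to drop every word that consists solely of non-word characters or digits. The function name limpar_palavras is hypothetical, and like any split()-based approach it joins the words with single spaces.

```python
import re

def limpar_palavras(texto):
    """Keep only words that contain at least one character other than digits/specials."""
    palavras = texto.split()
    # fullmatch succeeds only when the whole word is made of \W or \d characters.
    mantidas = [p for p in palavras if not re.fullmatch(r"[\W\d]+", p)]
    return " ".join(mantidas)

texto = "ol@ mundo, eu bebo H20 e nao fumo cig@rro !#& 123"
print(limpar_palavras(texto))  # ol@ mundo, eu bebo H20 e nao fumo cig@rro
```

Filtering the words out, instead of replacing them with empty strings, also avoids the doubled spaces in the joined result.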
