How to remove accents with regular expressions in Python?

Viewed 4,852 times

7

I am writing a regular expression to replace accented and special characters with their plain equivalents.

Example:

á = a 
ç = c
é = e 

But my regex just removes them. Any hints?

import re


string_velha = ("Olá você está ????   ")
string_nova = re.sub(u'[^a-zA-Z0-9: ]', '', string_velha.encode().decode('utf-8'))
print(string_nova)

Output:

Ol voc est 

3 answers

9


A simple approach uses the unicodedata module, included with Python, to decompose each accented character into its base code point plus a combining code point, and then filters out the combining code points to leave a clean string:

import unicodedata
string_velha = "Olá você está????"
string_nova = ''.join(ch for ch in unicodedata.normalize('NFKD', string_velha) 
    if not unicodedata.combining(ch))
print(string_nova)

Output:

Ola voce esta????
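To see why filtering on unicodedata.combining works, here is a small sketch that prints what NFKD normalization produces for a single accented letter. The helper name describe is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return (character name, combining class) for each code point after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in decomposed]

# "\u00e1" is the precomposed letter "á"
for name, combining_class in describe("\u00e1"):
    print(name, combining_class)
# LATIN SMALL LETTER A 0
# COMBINING ACUTE ACCENT 230
```

The base letter has combining class 0 and is kept, while the accent has a non-zero class and is dropped. Characters that have no decomposition, such as ø or ß, pass through this approach unchanged.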

Another way is to use unidecode. This module has to be installed separately, and its purpose is precisely to generate an ASCII representation of Unicode characters. It covers more characters, but it is an external dependency.

import unidecode
string_nova = unidecode.unidecode(string_velha)
print(string_nova)
  • Perfect, that was very clean and succinct code. Thank you!!

1

Look at the function signature re.sub(pattern, repl, string, count=0, flags=0). The second argument, repl, is the string or function used whenever the pattern matches in the original string. In your case, just implement a function that is called for each match, for example:

import re

def repl(match):
    data = {"á": "a", "ç": "c", "ê": "e"}
    # Characters missing from the dictionary are replaced with an empty string.
    return data.get(match.group(0), "")

string_velha = ("Olá você está ????   ")
string_nova = re.sub(u'[^a-zA-Z0-9: ]', repl, string_velha.encode().decode('utf-8'))
print(string_nova)
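If you would rather not maintain the dictionary by hand, one possible variation is to let the callback compute the replacement with unicodedata. This is only a sketch: the names strip_accent and remove_accents and the choice to match every non-ASCII character are my own assumptions, not part of the code above.

```python
import re
import unicodedata

def strip_accent(match):
    # Decompose the matched character and keep only the non-combining code points.
    decomposed = unicodedata.normalize("NFD", match.group(0))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_accents(text):
    # Only non-ASCII characters reach the callback; ASCII passes through untouched.
    return re.sub(r"[^\x00-\x7F]", strip_accent, text)

print(remove_accents("Ol\u00e1 voc\u00ea est\u00e1 ????"))  # Ola voce esta ????
```

Unlike the dictionary version, ASCII punctuation such as "?" is kept, because the pattern never matches it.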

0

Your code replaces every character matched by the regular expression with '' and therefore removes the accented letters entirely instead of converting them.

If you want to replace each accented character with its unaccented counterpart, you can use an ordinary dictionary and do the replacement with it.

import re

# char codes: https://unicode-table.com/en/#basic-latin
accent_map = {
    u'\u00c0': u'A',
    u'\u00c1': u'A',
    u'\u00c2': u'A',
    u'\u00c3': u'A',
    u'\u00c4': u'A',
    u'\u00c5': u'A',
    u'\u00c6': u'A',
    u'\u00c7': u'C',
    u'\u00c8': u'E',
    u'\u00c9': u'E',
    u'\u00ca': u'E',
    u'\u00cb': u'E',
    u'\u00cc': u'I',
    u'\u00cd': u'I',
    u'\u00ce': u'I',
    u'\u00cf': u'I',
    u'\u00d0': u'D',
    u'\u00d1': u'N',
    u'\u00d2': u'O',
    u'\u00d3': u'O',
    u'\u00d4': u'O',
    u'\u00d5': u'O',
    u'\u00d6': u'O',
    u'\u00d7': u'x',
    u'\u00d8': u'O',
    u'\u00d9': u'U',
    u'\u00da': u'U',
    u'\u00db': u'U',
    u'\u00dc': u'U',
    u'\u00dd': u'Y',
    u'\u00df': u'ss',
    u'\u00e0': u'a',
    u'\u00e1': u'a',
    u'\u00e2': u'a',
    u'\u00e3': u'a',
    u'\u00e4': u'a',
    u'\u00e5': u'a',
    u'\u00e6': u'a',
    u'\u00e7': u'c',
    u'\u00e8': u'e',
    u'\u00e9': u'e',
    u'\u00ea': u'e',
    u'\u00eb': u'e',
    u'\u00ec': u'i',
    u'\u00ed': u'i',
    u'\u00ee': u'i',
    u'\u00ef': u'i',
    u'\u00f1': u'n',
    u'\u00f2': u'o',
    u'\u00f3': u'o',
    u'\u00f4': u'o',
    u'\u00f5': u'o',
    u'\u00f6': u'o',
    u'\u00f8': u'o',
    u'\u00f9': u'u',
    u'\u00fa': u'u',
    u'\u00fb': u'u',
    u'\u00fc': u'u'
}

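def accent_remove(m):
  # The pattern also matches code points that are not in the map (\u00de, \u00f0, \u00f7);
  # return those unchanged instead of raising KeyError.
  return accent_map.get(m.group(0), m.group(0))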

string_velha = "Olá você está ????   "
string_nova = re.sub(u'([\u00C0-\u00FC])', accent_remove, string_velha.encode().decode('utf-8'))

print(string_nova)

I put it on Repl.it so you can see it running.
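Since this is a one-to-one character mapping, the same idea can also be sketched without a regex, using str.maketrans and str.translate. The table below is only a small illustrative subset, and the names accent_table and remove_accents_translate are hypothetical.

```python
# Keys must be single code points, so escapes are used to avoid ambiguity.
accent_table = str.maketrans({
    "\u00e1": "a",   # á
    "\u00e7": "c",   # ç
    "\u00e9": "e",   # é
    "\u00ea": "e",   # ê
    "\u00df": "ss",  # ß, a value may be longer than one character
})

def remove_accents_translate(text):
    # Characters that are not in the table are left unchanged.
    return text.translate(accent_table)

print(remove_accents_translate("Ol\u00e1 voc\u00ea est\u00e1 ????"))  # Ola voce esta ????
```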

  • Thank you!! That was exactly it
