separate special characters

Question

Asked 6 years, 7 months ago

Viewed 77 times

I have a text file of this kind:

Olá podes dizer-me quando o 1 passa aqui? ele passa, quando passar o Carlos Alberto.

In Python, I need to remove special characters such as punctuation and numbers, convert uppercase letters to lowercase, replace accented characters with their unaccented equivalents, and split the text into individual characters. Something like this:

o, l, a, p, o, d, e, s, d, i, z, e, r, m, e, q, u, a, n, d, o, o, p, a, s, s, a, a, q, u, i, e, l, e, p, a, s, s, a, q, u, a, n, d, o, p, a, s, s, a, r, o, c, a, r, l, o, s, a, l, b, e, r, t, o

Is there some kind of split, or a way to do all of this with the re module?

This is what I have so far:

import re
import unicodedata

# convert to lowercase
data = ''.join(data).lower()
# remove the numbers
data = re.sub(r'#\d{3}/\d{3}', '', data)

nfkd = unicodedata.normalize('NFKD', data)
dataNova = u"".join([c for c in nfkd if not unicodedata.combining(c)])
dataNovaNova = re.sub(r'[^a-zA-Z0-9 \\]', '', dataNova)

lista = list(dataNovaNova)

where data is a string.

1 answer

by jsbueno • **30,668** points · Answer 1 · 2019-01-13T02:55:14+00:00

Yes, but it has nothing to do with "split". You can convert the whole string to lowercase and use the normalize function from the unicodedata module, as you did, to decompose each accented letter into a base letter followed by separate combining characters.
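To make the decomposition step concrete, here is a minimal sketch; the sample letter and the variable names are chosen only for illustration. It shows that NFKD turns one accented letter into two code points that fall into different Unicode categories.

```python
import unicodedata

# NFKD splits a precomposed accented letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFKD", "á")

# Pair each resulting code point with its Unicode name and general category.
parts = [(unicodedata.name(c), unicodedata.category(c)) for c in decomposed]
print(parts)
# [('LATIN SMALL LETTER A', 'Ll'), ('COMBINING ACUTE ACCENT', 'Mn')]
```

The base letter has category 'Ll' (lowercase letter), while the accent has category 'Mn' (nonspacing mark), so the two can be told apart by category.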

Once that is done, just use a normal filter, written as a list comprehension, to keep only the characters in the Unicode "lowercase letter" category, that is, those for which the function unicodedata.category returns 'Ll'.

import unicodedata

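def normalize_chars(text):
    # Lowercase first, then decompose accented letters into base letter + combining marks
    text = unicodedata.normalize("NFKD", text.lower())
    # Keep only lowercase letters ('Ll'); this drops accents, digits, punctuation and spaces
    return [char for char in text if unicodedata.category(char) == 'Ll']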

The output of this function for the phrase you gave as an example is:

['o', 'l', 'a', 'p', 'o', 'd', 'e', 's', 'd', 'i', 'z', 'e', 'r', 'm', 'e', 'q', 'u', 'a', 'n', 'd', 'o', 'o', 'p', 'a', 's', 's', 'a', 'a', 'q', 'u', 'i', 'e', 'l', 'e', 'p', 'a', 's', 's', 'a', 'q', 'u', 'a', 'n', 'd', 'o', 'p', 'a', 's', 's', 'a', 'r', 'o', 'c', 'a', 'r', 'l', 'o', 's', 'a', 'l', 'b', 'e', 'r', 't', 'o']
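One caveat: the 'Ll' category also covers lowercase letters that NFKD does not decompose, such as "ß" or "ø", so they would pass through that filter. If only the plain letters a to z are wanted, a stricter variant might look like the sketch below. The function name ascii_letters_only and the sample strings are hypothetical, not part of the answer above; the last lines also join the result with commas, in the format shown in the question.

```python
import string
import unicodedata

def ascii_letters_only(text):
    """Hypothetical variant: keep only the letters a-z after normalization."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Membership in ascii_lowercase rejects accents, digits, punctuation,
    # spaces and any non-ASCII letter that has no decomposition.
    return [char for char in decomposed if char in string.ascii_lowercase]

letters = ascii_letters_only("Olá, 1 maçã!")
print(", ".join(letters))
# o, l, a, m, a, c, a
```

Whether dropping a letter like "ß" is acceptable depends on the text being processed; for Portuguese input both versions should give the same result.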

P.S.: I don't know what you plan to use this for, but the extradict package on PyPI has a dictionary type that automatically works with normalized keys, which may suit the use you have in mind. Example of use:

In [3]: import extradict                                                                                 

In [4]: dct = extradict.NormalizedDict()                                                                 

In [5]: dct["maca"] = "Vermelha"                                                                         

In [6]: dct["Maçã"]                                                                                      
Out[6]: 'Vermelha'

The "extradict" package can be installed with pip install extradict. (Disclaimer: I am the author).