How to do an accent-insensitive search in Python?


17

Suppose I have a list of words in Python (if necessary, already sorted according to collation rules):

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

Note that I didn’t use the cedilla (ç) or the tilde (ã) consistently. How can I search this list for "programação" while ignoring accents, so that several variants of the search term return results? E.g.:

buscar(palavras, u"programacao")
buscar(palavras, u"programação")

I Googled "collation search" and found nothing useful. I also looked for "search ignoring accents" in various ways, and even found a solution for MySQL (which confirms that collation is indeed the right path), but nothing for Python (only references on how to sort a list, which by itself does not answer the question). The locale module didn’t offer much help either. How can this be done?

  • In English: http://stackoverflow.com/questions/517923/

  • @bfavaretto interesting approach, I hadn’t thought of that (normalize and remove the diacritics). Too bad it would involve modifying the original array (though in practice, in many cases, that wouldn’t be a problem).

  • It’s an approach I’ve seen used a lot in other languages (e.g. js, php). Unfortunately my Python skills are pretty shallow, so I won’t risk posting an answer.

  • @bfavaretto If you want to give it a try, I’ll help you... :P Otherwise, I’ll post an answer myself later. I was investigating whether it would be possible to do this via binary search and strcoll, but with no result... I’ll try to create a proof-of-concept using your suggestion, but if you want to post even a partial answer, that’s good enough for me.

  • Go ahead and post yours. Who knows, when I get around to studying Python (I can’t find the time!) I might post one better than yours :)

  • If it were in C#, you could send it to the SE team: http://meta.answall.com/questions/449/permit-searchesignoring-accentuao :P

  • @bigown hehe, it was that question on meta that inspired me to post this one... :P

  • @bigown In the case of SOPT it’s relatively simple, just use COLLATE in the query. That would solve it here, but not for just any localized site - the collation would need to be parameterized.


3 answers

8


Based on the comment and reference from @bfavaretto, I was able to put together a proof-of-concept. The solution is to remove the diacritics from both the search list and the search term. To do this, the string is first normalized (NFD) to ensure that the combining characters are represented separately, and then those characters are removed (in the case of accents, cedilla, etc., they have the Unicode category Mn).

I tried to do the removal using the regex module, without success, so I opted for a separate function. The binary search code came from this answer on the English SO.

import unicodedata
from bisect import bisect_left

def remover_combinantes(string):
    # Decompose with NFD so accents become separate combining characters (category 'Mn'), then drop them
    string = unicodedata.normalize('NFD', string)
    return u''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')

# "palavras" is the list from the question, normalized once up front
palavras_norm = [remover_combinantes(x) for x in palavras]

def binary_search(a, x, lo=0, hi=None):   # can't use a to specify default for hi
    hi = hi if hi is not None else len(a) # hi defaults to len(a)   
    pos = bisect_left(a,x,lo,hi)          # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def buscar(lista, palavra):
    return binary_search(lista, remover_combinantes(palavra))

>>> buscar(palavras_norm, u'programacao')
2
>>> buscar(palavras_norm, u'programação')
2
  • Yes, unidecode is the way to go. If you don’t need to retrieve the original word (with the accents and all), consider creating a set: dicionario = set(unidecode.unidecode(p) for p in palavras) and then searching with unidecode.unidecode(busca) in dicionario, as sketched below.
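
A minimal sketch of the set-based idea from the comment above (assuming the unidecode module described in the answer below is installed; the existe name is just illustrative):

import unidecode

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

# Normalize every word once; only membership is kept, the original spellings are lost
dicionario = set(unidecode.unidecode(p) for p in palavras)

def existe(busca):
    # Normalize the search term the same way before looking it up
    return unidecode.unidecode(busca) in dicionario

>>> existe(u"programação")
True
>>> existe(u"programacao")
True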

6

There is an easier solution, which is to install a module that does this work directly. That module is unidecode, which exists both for Python 2 and for Python 3.

If you are on a Unix-like system, the easiest way to install it is to use pip (for Python 2) or pip3 (for Python 3) directly in the terminal, as follows:

  1. pip install unidecode (for Python 2)

  2. pip3 install unidecode (for Python 3)

Here is a practical, complete example using the list you gave:

import unidecode

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

def to_ascii(ls):
    # Replace each item of the list, in place, by its ASCII transliteration
    for i in range(len(ls)):
        ls[i] = unidecode.unidecode(ls[i])

to_ascii(palavras)
print(palavras)

And the output is as follows:

['acentuacao', 'divagacao', 'programacao', 'taxacao']
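
To tie this back to the buscar calls from the question, one possible sketch (not part of the original answer; it simply reuses the already-converted palavras list, and the alvo variable is just illustrative) is to normalize the search term with unidecode as well and look it up:

def buscar(lista, palavra):
    # Normalize the search term the same way and return its index (or -1 if absent)
    alvo = unidecode.unidecode(palavra)
    return lista.index(alvo) if alvo in lista else -1

>>> buscar(palavras, u"programacao")
2
>>> buscar(palavras, u"programação")
2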

For more information about the module, see here or here on the official Python page. If you are interested in modifying or simply viewing the code, here is the repository on GitHub.

For further information, there is at least this post or this one on the English SO that may be useful.

  • It’s been a while since I asked this question; at the time unidecode had already been suggested (see that comment) but, for some reason I don’t remember, I decided not to use it. I need to run some tests... Anyway, it is a good solution, +1

3

You can write a function to remove the accents:

import unicodedata

def remove_accents(input_str):
    # Decompose with NFKD and discard everything that does not fit in ASCII
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    return only_ascii

lista = [remove_accents(i) for i in ['é', 'á']]
'e' in lista

From there, I believe it’s easy to adapt it to your needs!

  • In Python 3, the function’s return value is of type bytes (<class 'bytes'>). However, just call decode() at the end to return a regular string: return only_ascii.decode() (see the sketch after these comments).

  • Thanks for the suggestion, it works, but it has the side effect of deleting any non-ASCII character from the string (e.g. "§"). Not that this type of character is used much in searches, in practice... Anyway, I believe it is a valid solution at least for the Portuguese language.
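
For reference, a minimal Python 3 sketch combining this answer’s function with the decode() suggestion from the comments above (an illustration, not the answer’s original code):

import unicodedata

def remove_accents(input_str):
    # Decompose with NFKD, drop the non-ASCII bytes, and decode back to a regular str
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return nkfd_form.encode('ASCII', 'ignore').decode('ASCII')

lista = [remove_accents(i) for i in ['é', 'á']]
print('e' in lista)  # True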
