How to do an accent-insensitive search in Python?


17

Suppose I have a list of words in Python (if necessary, already sorted according to collation rules):

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

Note that I didn’t use the cedilla (ç) or the tilde (ã) consistently. How can I search this list for "programação" while ignoring accents, so that several variants of the search term return results? E.g.:

buscar(palavras, u"programacao")
buscar(palavras, u"programação")

I Googled "collation search" and found nothing useful. I also looked for "search ignoring accents" in various ways, and even found a solution for MySQL (which confirms that collation is indeed the right path), but nothing for Python (only references on how to sort a list, which by itself does not answer the question). The locale module didn’t offer much help either. How can this be done?

  • In English: http://stackoverflow.com/questions/517923/

  • @bfavaretto interesting approach, I hadn’t thought of that (normalize and remove the diacritics). Too bad it would involve modifying the original array (though in practice, in many cases, that wouldn’t be a problem).

  • It’s an approach I’ve seen used a lot in other languages (e.g. js, php). Unfortunately my Python skills are pretty shallow, so I won’t risk posting an answer.

  • @bfavaretto If you want to give it a try, I’ll help you... :P Otherwise, I’ll post an answer myself later. I was investigating whether it would be possible to do this via binary search and strcoll, but with no result... I’ll try to create a proof-of-concept using your suggestion, but if you want to post even a partial answer, that’s good enough for me.

  • Go ahead and post yours. Who knows, when I get around to studying Python (I can’t find the time!) I might post one better than yours :)

  • If it were in C#, you could send it to the SE team: http://meta.answall.com/questions/449/permit-searchesignoring-accentuao :P

  • @bigown hehe, it was that question on meta that inspired me to post this one... :P

  • @bigown In the case of SOPT it’s relatively simple, just use COLLATE in the query. That would solve it here, but not for just any localized site - the collation would need to be parameterized.


3 answers

8


Based on the comment and reference from @bfavaretto, I was able to put together a proof-of-concept. The solution is to remove the diacritics from both the search list and the search term. To do this, the string is first normalized (NFD) to ensure that the combining characters are represented separately, and then those characters are removed (in the case of accents, cedilla, etc., they have the Unicode category Mn).

I tried to do the removal using the regex module, without success, so I opted for a separate function. The binary search code came from this answer on the English SO.

import unicodedata
from bisect import bisect_left

def remover_combinantes(string):
    # Decompose with NFD so accents become separate combining characters (category 'Mn'), then drop them
    string = unicodedata.normalize('NFD', string)
    return u''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')

# "palavras" is the list from the question, normalized once up front
palavras_norm = [remover_combinantes(x) for x in palavras]

def binary_search(a, x, lo=0, hi=None):   # can't use a to specify default for hi
    hi = hi if hi is not None else len(a) # hi defaults to len(a)   
    pos = bisect_left(a,x,lo,hi)          # find insertion position
    return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end

def buscar(lista, palavra):
    return binary_search(lista, remover_combinantes(palavra))

>>> buscar(palavras_norm, u'programacao')
2
>>> buscar(palavras_norm, u'programação')
2
  • Yes, unidecode is the way to go. If you don’t need to retrieve the original word (with the accents and all), consider creating a set: dicionario = set(unidecode.unidecode(p) for p in palavras) and then searching with unidecode.unidecode(busca) in dicionario, as sketched below.
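
A minimal sketch of the set-based idea from the comment above (assuming the unidecode module described in the answer below is installed; the existe name is just illustrative):

import unidecode

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

# Normalize every word once; only membership is kept, the original spellings are lost
dicionario = set(unidecode.unidecode(p) for p in palavras)

def existe(busca):
    # Normalize the search term the same way before looking it up
    return unidecode.unidecode(busca) in dicionario

>>> existe(u"programação")
True
>>> existe(u"programacao")
True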

6

There is an easier solution, which is to install a module that does this work directly. That module is unidecode, which exists both for Python 2 and for Python 3.

If you are on a Unix-like system, the easiest way to install it is to use pip (for Python 2) or pip3 (for Python 3) directly in the terminal, as follows:

  1. pip install unidecode (for Python 2)

  2. pip3 install unidecode (for Python 3)

Here is a practical, complete example using the list you gave:

import unidecode

palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

def to_ascii(ls):
    # Replace each item of the list, in place, by its ASCII transliteration
    for i in range(len(ls)):
        ls[i] = unidecode.unidecode(ls[i])

to_ascii(palavras)
print(palavras)

And the output is as follows:

['acentuacao', 'divagacao', 'programacao', 'taxacao']
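
To tie this back to the buscar calls from the question, one possible sketch (not part of the original answer; it simply reuses the already-converted palavras list, and the alvo variable is just illustrative) is to normalize the search term with unidecode as well and look it up:

def buscar(lista, palavra):
    # Normalize the search term the same way and return its index (or -1 if absent)
    alvo = unidecode.unidecode(palavra)
    return lista.index(alvo) if alvo in lista else -1

>>> buscar(palavras, u"programacao")
2
>>> buscar(palavras, u"programação")
2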

For more information about the module, see here or here on the official Python page. If you are interested in modifying or simply viewing the code, here is the repository on GitHub.

For further information, there is at least this post or this one on the English SO that may be useful.

  • It’s been a while since I asked this question; at the time unidecode had already been suggested (see that comment) but, for some reason I don’t remember, I decided not to use it. I need to run some tests... Anyway, it is a good solution, +1

3

You can write a function to remove the accents:

import unicodedata

def remove_accents(input_str):
    # Decompose with NFKD and discard everything that does not fit in ASCII
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    return only_ascii

lista = [remove_accents(i) for i in ['é', 'á']]
'e' in lista

From there, I believe it’s easy to adapt it to your needs!

  • In Python 3, the function’s return value is of type bytes (<class 'bytes'>). However, just call decode() at the end to return a regular string: return only_ascii.decode() (see the sketch after these comments).

  • Thanks for the suggestion, it works, but it has the side effect of deleting any non-ASCII character from the string (e.g. "§"). Not that this type of character is used much in searches, in practice... Anyway, I believe it is a valid solution at least for the Portuguese language.
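
For reference, a minimal Python 3 sketch combining this answer’s function with the decode() suggestion from the comments above (an illustration, not the answer’s original code):

import unicodedata

def remove_accents(input_str):
    # Decompose with NFKD, drop the non-ASCII bytes, and decode back to a regular str
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return nkfd_form.encode('ASCII', 'ignore').decode('ASCII')

lista = [remove_accents(i) for i in ['é', 'á']]
print('e' in lista)  # True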
