Suppose I have a list of words, in Python (if necessary, already ordered according to the rules of collation):
palavras = [
u"acentuacao",
u"divagacão",
u"programaçao",
u"taxação",
]
Notice that I didn't use the cedilla (ç) or the tilde (ã) consistently. How can I search this list for "programação" while ignoring accents, so that differently accented spellings of the search term all return results? For example:
buscar(palavras, u"programacao")
buscar(palavras, u"programação")
I Googled for "collation search" and found nothing useful. I also searched for "search ignoring accents", phrased in various ways, and even found a solution for MySQL (which confirms that collation really is the right path), but nothing for Python (only references on how to sort a list, which by itself does not answer the question). The locale module did not offer much help either. How can I do this?
in English: http://stackoverflow.com/questions/517923/
– bfavaretto
@bfavaretto interesting approach, had not thought about it (normalize and remove diacritics). Too bad this would involve modifying the original array (although in practice in many cases this would be no problem).
– mgibsonbr
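A minimal sketch of the normalize-and-remove-diacritics idea discussed above, written so that the original list is left untouched: only temporary normalized copies are compared. The helper name `sem_acentos` is hypothetical; `buscar` and `palavras` are the names used in the question. It assumes exact-word matching and Unicode strings, and relies on `unicodedata` from the standard library.

```python
# -*- coding: utf-8 -*-
import unicodedata


def sem_acentos(texto):
    """Return a copy of texto with combining marks (accents, cedilla, tilde) removed."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposto = unicodedata.normalize("NFD", texto)
    # unicodedata.combining() is non-zero for combining marks, so those are dropped.
    return u"".join(c for c in decomposto if not unicodedata.combining(c))


def buscar(palavras, termo):
    """Return the words in palavras that equal termo once accents are ignored."""
    chave = sem_acentos(termo)
    # Compare normalized copies only; palavras itself is never modified.
    return [p for p in palavras if sem_acentos(p) == chave]


palavras = [
    u"acentuacao",
    u"divagacão",
    u"programaçao",
    u"taxação",
]

print(buscar(palavras, u"programacao"))
print(buscar(palavras, u"programação"))
```

Both calls should return the entry stored as `programaçao`, with its original spelling preserved. This is a linear scan; for a large list the normalized keys could be computed once and stored alongside the original words.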
It's an approach I've seen used a lot in other languages (e.g. JS, PHP). Unfortunately my Python skills are pretty shallow, so I won't risk posting an answer.
– bfavaretto
@bfavaretto If you want to try, I'll help you... :P Otherwise, I'll post an answer myself later. I was investigating whether it would be possible to do this via binary search and strcoll, but without success... I will try to create a proof of concept using your suggestion, but if you want to give a partial answer, that is good enough for me.
– mgibsonbr
Go ahead and post yours. Who knows, once I get to study Python (I can't find the time!) maybe I can post one better than yours :)
– bfavaretto
If it were in C#, you could send it to the SE team: http://meta.answall.com/questions/449/permit-searchesignoring-accentuao :P
– Maniero
@bigown hehe, it was that question on meta that inspired me to post this one... :P
– mgibsonbr
@bigown In the case of SOPT it is relatively simple: just use COLLATE in the query. That would solve it here, but not for every localized site, since the collation would need to be parameterized.
– bfavaretto