Compare two strings with accents in C

4

I have the following problem: I need to compare two strings while ignoring accents, for example:

                                 Étnico | Brasil

Using a normal comparison function, the result is that "Étnico" comes before "Brasil" in the lexicographic order of the words.

I hope my question is clear.

Does anyone have an idea of how to handle this problem?

  • Post your current code so we can see what you are doing and help you get closer to what you need.

  • You have to "play" with character representation (ISO-8859-? , UTF-8, ...) and locale (probably "pt-BR").

  • I had voted to close this as a duplicate, but I withdrew the vote because on second thought it is not necessarily the same thing. :) It may still be useful: http://answall.com/questions/1828/como-fazer-um-algoritmo-fon%C3%A9tico-para-o-portugu%C3%Aas-brasileiro

2 answers

4

Lexicographic order, or collation, is closely tied to the language and alphabet you are using. The separate problem of choosing an appropriate charset has already been solved by Unicode.

For your question I recommend this essential reading:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Approaching the problem in C
The recommendation is always to use a Unicode representation rather than literal characters stored in char. Accented Latin characters, for example, are multi-byte in UTF-8, so a single one cannot be represented correctly in a char (-128 to 127) or even in an unsigned char (0 to 255).

Take as a reference:

É = LATIN CAPITAL LETTER E WITH ACUTE

This is Unicode code point U+00C9, which UTF-8 encodes as the two bytes 0xC3 0x89.
In C it can be held in a single wchar_t, the wide-character type.
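
To see where the bytes c3 89 come from, here is a minimal sketch of UTF-8 encoding for a single code point. The function name utf8_encode is hypothetical (it is not part of the C standard library), and the sketch only follows the standard UTF-8 bit layout:

```c
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns how many were written,
   or 0 if cp is not a valid code point. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* plain ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0x00C9, buf) fills buf with 0xC3 0x89 and returns 2, which is why É does not fit in one char.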

Suppose the question is about receiving an input, converting it and testing it, as you described:

need to compare two strings ignoring the accent


One approach is the example below, which uses the wide-character I/O functions to replace every É with E:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

// Unicode constant held in a wide-character type
const wchar_t E_GRANDE_ACENTO = L'\u00C9';

int main(void)
{
    // use the environment's default locale (normally UTF-8 on Linux)
    setlocale(LC_ALL, "");
    // fputs for the wide-character type
    fputws(L"Enter the string: ", stdout);

    wchar_t wbuff[128];
    // fgets for the wide-character type; stop if nothing could be read
    if (fgetws(wbuff, 128, stdin) == NULL)
        return 1;

    size_t len = wcslen(wbuff);
    for (size_t n = 0; n < len; ++n)
    {
        // replace the accented capital E with a plain E
        if (wbuff[n] == E_GRANDE_ACENTO)
            wbuff[n] = L'E';
    }

    wprintf(L" %ls\n", wbuff);

    return 0;
}


This is only a reference example. For a broader approach to this kind of problem, the API (UNAC) mentioned by @Intruder would be more suitable.

What about the collation of a Unicode stream?
Maybe this is the approach you were expecting. I recommend the ICU (International Components for Unicode) API: it handles sorting with existing collation rules or with a specific ruleset that you declare when creating the collator.

Example of a collator using the ICU API to sort an array of Unicode strings:

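#include <unicode/ucol.h>

UChar *s[] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
  // bubble sort: swapping on UCOL_LESS yields descending order;
  // test for UCOL_GREATER instead to sort ascending
  for(i=listSize-1; i>=1; i--) {
    for(j=0; j<i; j++) {
      // the collator must be passed as the first argument
      if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
        UChar *tmp = s[j];
        s[j] = s[j+1];
        s[j+1] = tmp;
      }
    }
  }
  ucol_close(coll);
}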

3

The answer to this dilemma depends on the focus of the application, as with any application that has to deal with the particularities of a culture (date, time, language, time zone, etc.).

In the specific case of your question, what matters is the language and the encoding in use, because those are the factors that determine which character set you need to handle. This becomes clear when you compare an application that deals with English to one that deals with Brazilian Portuguese. The sets of accented characters are very different, and in English the task would be considerably easier.

The next step is to analyze the encoding in use and make sure that the data (if it does not all come from the same source) ends up in a single format (encoding).
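
As a rough illustration of bringing data into one encoding, the sketch below converts an ISO-8859-1 string to UTF-8. It assumes the input really is ISO-8859-1, where every byte maps directly to the code point of the same value; the function name latin1_to_utf8 is hypothetical, and for other source encodings a conversion library such as iconv would be needed instead.

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to UTF-8.
   Returns the number of bytes written (excluding the terminator),
   or (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return (size_t)-1;

    for (const unsigned char *p = (const unsigned char *)src; *p != '\0'; ++p) {
        /* bytes below 0x80 are ASCII (1 byte), the rest need 2 bytes */
        size_t need = (*p < 0x80) ? 1 : 2;

        /* keep one byte free for the terminator */
        if (out + need + 1 > dst_size)
            return (size_t)-1;

        if (*p < 0x80) {
            dst[out++] = (char)*p;
        } else {
            dst[out++] = (char)(0xC0 | (*p >> 6));
            dst[out++] = (char)(0x80 | (*p & 0x3F));
        }
    }

    dst[out] = '\0';
    return out;
}
```

For the ISO-8859-1 bytes of "Étnico" (0xC9 followed by "tnico"), the output starts with 0xC3 0x89, the UTF-8 form of É.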

If you are programming a web search engine, for example, the content will vary greatly and the work needed to do what you want becomes a project of its own within the software. But if the project analyzes a particular set of documents from a single source, then both the language and the encoding will be very specific and you can solve it more easily.

My suggestion is to start by analyzing the possible data sources and then pick the most suitable long-term solution. At first glance, two solutions are the most direct answers:

1) Create a character-by-character mapping function, which takes the string and returns the value without the accents.

2) Use something ready-made such as unac (http://www.makelinux.net/man/3/U/unac)

unac is a C library that removes accents from characters, regardless of the character set (ISO-8859-15, ISO-CELTIC, KOI8-RU...) as long as iconv(3) is able to convert it into UTF-16 (Unicode).
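
Option 1 could look roughly like the sketch below. It assumes the strings are already wide-character (wchar_t) strings and only maps the accented letters most common in Portuguese; the names strip_accent and compare_ignoring_accents are hypothetical, and the table is not a complete Unicode solution.

```c
#include <wchar.h>

/* Hypothetical helper: map an accented Latin letter to its base letter.
   Covers only part of the Latin-1 Supplement block; anything else is
   returned unchanged. */
wchar_t strip_accent(wchar_t c)
{
    switch (c) {
    case 0x00C0: case 0x00C1: case 0x00C2: case 0x00C3: case 0x00C4: return L'A';
    case 0x00C7: return L'C';
    case 0x00C8: case 0x00C9: case 0x00CA: case 0x00CB: return L'E';
    case 0x00CC: case 0x00CD: case 0x00CE: case 0x00CF: return L'I';
    case 0x00D2: case 0x00D3: case 0x00D4: case 0x00D5: case 0x00D6: return L'O';
    case 0x00D9: case 0x00DA: case 0x00DB: case 0x00DC: return L'U';
    case 0x00E0: case 0x00E1: case 0x00E2: case 0x00E3: case 0x00E4: return L'a';
    case 0x00E7: return L'c';
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB: return L'e';
    case 0x00EC: case 0x00ED: case 0x00EE: case 0x00EF: return L'i';
    case 0x00F2: case 0x00F3: case 0x00F4: case 0x00F5: case 0x00F6: return L'o';
    case 0x00F9: case 0x00FA: case 0x00FB: case 0x00FC: return L'u';
    default: return c;
    }
}

/* Compare two wide strings after stripping accents from each character.
   Returns a negative value, zero, or a positive value, like wcscmp.
   The comparison is still case-sensitive. */
int compare_ignoring_accents(const wchar_t *a, const wchar_t *b)
{
    for (;; ++a, ++b) {
        wchar_t ca = strip_accent(*a);
        wchar_t cb = strip_accent(*b);
        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == L'\0')
            return 0;
    }
}
```

With this, compare_ignoring_accents(L"\u00C9tnico", L"Brasil") is positive, so "Étnico" sorts after "Brasil" just as "Etnico" would.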

A good piece of material on this: http://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html
