Lexicographic order (collation) depends heavily on the language and alphabet in use. The related question of choosing an appropriate character set, on the other hand, has already been settled by Unicode.
For background, I recommend this essential reading:
The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
Approaching the problem in C
The recommendation is always to use a Unicode representation rather than literal characters stored in a char. In UTF-8, for example, accented Latin characters are multi-byte, so they cannot be represented correctly in a single char (-128 to 127), or even in an unsigned char (0 to 255).
Using as a reference:
É = LATIN CAPITAL LETTER E WITH ACUTE
This is the Unicode code point U+00C9, encoded in UTF-8 as the hexadecimal bytes c3 89, which occupy 2 bytes.
To hold it as a single unit, it would have to be represented by a wchar_t, the wide-character type.
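To make the byte count concrete, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes. The names `E_ACUTE_UTF8` and `utf8_codepoint_count` are hypothetical, chosen only for this illustration, and the function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* "É" (U+00C9) written out as its two UTF-8 bytes, c3 89. */
static const char E_ACUTE_UTF8[] = "\xC3\x89";

/* Hypothetical helper: counts code points in a NUL-terminated string,
 * assuming it is valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx, so every byte that does NOT match it starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With this, `sizeof E_ACUTE_UTF8 - 1` is 2 (two bytes of storage), while `utf8_codepoint_count(E_ACUTE_UTF8)` is 1 (one character), which is why indexing a UTF-8 string char by char does not line up with the letters you see.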
Suppose the question is about reading an input, converting it and testing it, as you described:
need to compare two strings ignoring the accent
One approach would be like the example below, which uses the wide-character I/O functions to replace every É with E:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

// Unicode constant represented as a wide character
const wchar_t E_GRANDE_ACENTO = L'\u00C9';

int main(void)
{
    // Use the environment's default locale (normally UTF-8 on Linux)
    setlocale(LC_ALL, "");
    // fputs for wide characters
    fputws(L"Enter the string: ", stdout);
    wchar_t wbuff[128];
    // fgets for wide characters; stop if nothing could be read
    if (fgetws(wbuff, 128, stdin) == NULL)
        return 1;
    size_t len = wcslen(wbuff);
    for (size_t n = 0; n < len; ++n)
    {
        if (wbuff[n] == E_GRANDE_ACENTO)
            wbuff[n] = L'E';
    }
    wprintf(L" %ls\n", wbuff);
    return 0;
}
This is only a reference example. For a broader approach to this kind of problem, the API suggested by @Intruder (UNAC) would be more advisable.
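As a middle ground between replacing a single character and adopting a full library, the replacement loop can be generalized with a lookup table. The sketch below is not UNAC: the names `accent_map`, `ACCENT_TABLE`, `strip_accent` and `compare_ignoring_accents` are hypothetical, and the table is an assumption that covers only the precomposed accented letters common in Portuguese.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical mapping from an accented letter to its base letter. */
struct accent_map {
    wchar_t accented;
    wchar_t base;
};

/* Deliberately incomplete: only letters common in Portuguese. */
static const struct accent_map ACCENT_TABLE[] = {
    { L'\u00C0', L'A' }, { L'\u00C1', L'A' }, { L'\u00C2', L'A' }, { L'\u00C3', L'A' },
    { L'\u00C7', L'C' }, { L'\u00C9', L'E' }, { L'\u00CA', L'E' }, { L'\u00CD', L'I' },
    { L'\u00D3', L'O' }, { L'\u00D4', L'O' }, { L'\u00D5', L'O' }, { L'\u00DA', L'U' },
    { L'\u00E0', L'a' }, { L'\u00E1', L'a' }, { L'\u00E2', L'a' }, { L'\u00E3', L'a' },
    { L'\u00E7', L'c' }, { L'\u00E9', L'e' }, { L'\u00EA', L'e' }, { L'\u00ED', L'i' },
    { L'\u00F3', L'o' }, { L'\u00F4', L'o' }, { L'\u00F5', L'o' }, { L'\u00FA', L'u' }
};

/* Returns the base letter for a known accented letter, or c unchanged. */
wchar_t strip_accent(wchar_t c)
{
    size_t i;

    for (i = 0; i < sizeof ACCENT_TABLE / sizeof ACCENT_TABLE[0]; ++i) {
        if (ACCENT_TABLE[i].accented == c)
            return ACCENT_TABLE[i].base;
    }
    return c;
}

/* Compares two wide strings after folding accents. Returns a negative
 * value, zero, or a positive value, like wcscmp. Case is NOT ignored. */
int compare_ignoring_accents(const wchar_t *a, const wchar_t *b)
{
    wchar_t fa, fb;

    assert(a != NULL && b != NULL);
    while (*a != L'\0' && strip_accent(*a) == strip_accent(*b)) {
        ++a;
        ++b;
    }
    fa = strip_accent(*a);
    fb = strip_accent(*b);
    return (fa > fb) - (fa < fb);
}
```

A table like this only handles precomposed characters. Text that uses combining marks (for example E followed by U+0301) would need normalization first, which is one reason a dedicated library is the more robust choice.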
What about the collation of a Unicode stream?
Maybe this is the approach you were expecting. I recommend the ICU API (International Components for Unicode): it handles sorting either with existing collation rules or with a specific rule set declared when you create the collator.
Example: a collator using the ICU API to sort an array of Unicode strings.
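#include <unicode/ucol.h>

UChar *s[] = { /* list of Unicode strings */ };
uint32_t listSize = sizeof(s)/sizeof(s[0]);
UErrorCode status = U_ZERO_ERROR;
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
    /* Bubble sort. The swap happens when s[j] sorts before s[j+1], so the
       result is in descending order; test for UCOL_GREATER for ascending. */
    for(i=listSize-1; i>=1; i--) {
        for(j=0; j<i; j++) {
            /* the collator is the first argument of ucol_strcoll */
            if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
                UChar *tmp = s[j];
                s[j] = s[j+1];
                s[j+1] = tmp;
            }
        }
    }
    ucol_close(coll);
}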
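Where ICU is not available, the C standard library offers a more limited alternative. The sketch below swaps in `wcscoll` and `qsort` in place of the ICU collator. The names `compare_wide_strings` and `sort_by_locale` are hypothetical, and the resulting order depends entirely on the locale data installed on the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* qsort callback: each element is a pointer to a wide string.
 * wcscoll compares according to the LC_COLLATE category of the
 * current locale. */
static int compare_wide_strings(const void *lhs, const void *rhs)
{
    const wchar_t *a = *(const wchar_t *const *)lhs;
    const wchar_t *b = *(const wchar_t *const *)rhs;

    return wcscoll(a, b);
}

/* Hypothetical helper: sorts an array of wide strings in place,
 * ascending, using the collation rules of the current locale. */
void sort_by_locale(const wchar_t **items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof items[0], compare_wide_strings);
}
```

The caller would normally run `setlocale(LC_COLLATE, "")` (or name a locale explicitly) before sorting. Without that call the program stays in the "C" locale, where the comparison falls back to plain code point order and accented letters sort after all unaccented ones.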
Post your current code so we can see how you are doing it, and so we can help you get close to what you need.
– Maniero
You have to "play" with the character representation (ISO-8859-?, UTF-8, ...) and the locale (probably "pt-BR").
– pmg
I had come to vote to close this as a duplicate, but I withdrew the vote because, on second thought, it doesn't necessarily seem to be the same thing. :) It may still be useful, though: http://answall.com/questions/1828/como-fazer-um-algoritmo-fon%C3%A9tico-para-o-portugu%C3%Aas-brasileiro
– Luiz Vieira