Before getting to your problem itself, a short introduction to Unicode:
Normalization
In Unicode, each character* has a unique numeric code (called a code point; read this article to understand the details). But some characters can be represented in different ways, as defined by the normalization forms. Without going into too much detail, this basically means that some characters can be represented by different sequences of code points.
An example is the character Á (the uppercase letter A with an acute accent), which can be represented in two ways:
- as the single code point U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE); in Unicode, the value of a code point is written in the form "U+xxxx", where "xxxx" is the value in hexadecimal
- as two code points:
  - U+0041 (LATIN CAPITAL LETTER A)
  - U+0301 (COMBINING ACUTE ACCENT)
We can see this using Java code. First I create a method that converts a String to a given normalization form and then prints its code points:
import java.text.Normalizer;

// converts the String to the given normalization form and prints its code points
public void showCodePoints(String str, Normalizer.Form forma) {
    String s = Normalizer.normalize(str, forma);
    System.out.printf("Code points of string '%s' in %s\n", s, forma);
    s.codePoints().forEach(cp -> {
        System.out.printf(" - U+%04X %s\n", cp, Character.getName(cp));
    });
}
Let’s test this method with Á in the forms NFC and NFD:
String s = "Á";
showCodePoints(s, Normalizer.Form.NFC);
showCodePoints(s, Normalizer.Form.NFD);
The output is:
Code points of string 'Á' in NFC
 - U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
Code points of string 'Á' in NFD
 - U+0041 LATIN CAPITAL LETTER A
 - U+0301 COMBINING ACUTE ACCENT
Note that in NFC the String contains only the code point U+00C1, while in NFD it contains the code points U+0041 and U+0301. Both, when printed, are displayed as Á.
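One practical consequence is that two Strings that look identical on screen may not be equal for equals(). The sketch below is only an illustration (the class name NormalizedCompare and the helper equalsNormalized are made up for this example): it compares the two representations of Á directly, and then again after bringing both to the same form.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Hypothetical helper: compares two Strings after normalizing both to NFC.
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00C1";     // Á as a single code point
        String decomposed = "A\u0301";  // A followed by the combining acute accent

        System.out.println(composed.equals(decomposed));            // plain comparison
        System.out.println(equalsNormalized(composed, decomposed)); // comparison after NFC
        System.out.println(composed.length() + " vs " + decomposed.length());
    }
}
```

The plain comparison fails because the underlying code points differ, even though both Strings render as Á. Normalizing both sides to the same form (NFC here, but NFD would work just as well) makes them comparable.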
Trying to find out what happened
You said you’re receiving the String with the value C301LCULO. Assuming that the original text should be CÁLCULO, it looks like the String was in (or was converted to) NFD for some reason, but the A was lost and only the accent arrived. More precisely, what arrived was the accent's code point value (U+0301), which was turned into text and placed in the String; that would explain the 301.
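To make this hypothesis concrete, here is a purely hypothetical sketch of a faulty step that would produce exactly that value. I don't know what the real source does; the class LostBaseLetterDemo and the method brokenExport are invented for illustration only and are not taken from your system.

```java
import java.text.Normalizer;

class LostBaseLetterDemo {
    // Hypothetical buggy conversion, invented for illustration:
    // it decomposes the text (NFD) and, on finding a combining mark,
    // discards the base letter before it and writes the mark's hex value as text.
    static String brokenExport(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                if (out.length() > 0) {
                    out.setLength(out.length() - 1); // the base letter is lost
                }
                out.append(Integer.toHexString(c).toUpperCase()); // U+0301 becomes "301"
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "C\u00C1LCULO" is CÁLCULO written with an escape to avoid source-encoding issues
        System.out.println(brokenExport("C\u00C1LCULO"));
    }
}
```

This is just one way the symptom could arise; the point is that a decomposed String plus a careless conversion is enough to turn Á into 301.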
So the problem must be at the source (whoever generated and sent this String). If you can, investigate the problem there (and fix it, of course). If you can’t change the source, the only option is a workaround (a "gambiarra"). Without more details on how the String is generated, that’s all I can say...
About your solution
In your answer you use the regex "C+[\\d+]+LCULO". I will comment on it a little and suggest an improvement.
The C+ part means "one or more letters C" (that is what the + means). That is, if the String starts with CCCCC, the regex will accept it. If you want only a single letter C, remove the + from there.
[\\d+], on the other hand, is not what it looks like. Brackets define a character class, which matches any one of the characters inside them. For example, [ab] means "the letter a or the letter b". Therefore [\\d+] means "a digit from 0 to 9 (\d) or the character +". This happens because inside the brackets the + "loses its powers" and becomes an ordinary character with no special meaning.
But the + after the brackets has not lost its powers, so [\\d+]+ means "one or more occurrences of a digit or +". In other words, your regex will accept Strings such as CCCC+++++LCULO. See this regex working here.
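If you want to check this locally, a small test along these lines should do. The class RegexComparison and its helper methods are names I made up for this example; the first pattern is the one from your answer, and the second is a stricter variant (exactly one C followed only by digits) used for comparison.

```java
import java.util.regex.Pattern;

class RegexComparison {
    // Regex from the answer: one or more C, then one or more of (digit or '+')
    static final Pattern LOOSE = Pattern.compile("C+[\\d+]+LCULO");
    // Stricter variant: exactly one C followed by one or more digits
    static final Pattern STRICT = Pattern.compile("C\\d+LCULO");

    // matches() requires the whole String to match the pattern
    static boolean loose(String s) { return LOOSE.matcher(s).matches(); }
    static boolean strict(String s) { return STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = {"C301LCULO", "CCCC+++++LCULO", "CCCCC301LCULO", "C+LCULO"};
        for (String s : samples) {
            System.out.printf("%-16s loose=%b strict=%b%n", s, loose(s), strict(s));
        }
    }
}
```

Only C301LCULO is accepted by both patterns; the other three samples are accepted only by the original regex. Keep in mind that matches() tests the entire String, whereas replaceAll looks for the pattern anywhere inside it.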
If you just want the letter "C" followed by one or more digits, simply use C\\d+ (see the difference here). Another point: using Optional to check the regex seems a bit excessive to me. You can get the same result with a simple replaceAll:
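public static String trataNomeArquivo(String nomeArquivo) {
    // replace the corrupted "C<digits>LCULO" with the intended word
    nomeArquivo = nomeArquivo.toUpperCase().replaceAll("C\\d+LCULO", "CALCULO");
    // decompose accented characters (NFD), then drop everything that is not ASCII
    return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "").toUpperCase();
}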
If the String does not match the regex C\\d+LCULO, replaceAll makes no substitution and the String is left unchanged, so it is fine to call it directly (there is no need to check whether the regex found a match, or to test whether the Optional value is null, etc.).
Of course, if the code is always 301, just use replaceAll("C301LCULO", "CALCULO").
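As a quick sanity check of the replaceAll approach, here is a self-contained sketch. The wrapper class FileNameFix and the sample file names are assumptions of mine; the method body is the replaceAll version discussed in this answer.

```java
import java.text.Normalizer;

class FileNameFix {
    // Same logic as the replaceAll version: fix "C<digits>LCULO",
    // then strip accents by decomposing (NFD) and removing non-ASCII characters.
    static String trataNomeArquivo(String nomeArquivo) {
        nomeArquivo = nomeArquivo.toUpperCase().replaceAll("C\\d+LCULO", "CALCULO");
        return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "").toUpperCase();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, chosen only to exercise each branch
        System.out.println(trataNomeArquivo("c301lculo.txt"));  // corrupted name
        System.out.println(trataNomeArquivo("C\u00C1LCULO"));   // CÁLCULO, correctly accented
        System.out.println(trataNomeArquivo("relat\u00F3rio")); // relatório, regex does not apply
    }
}
```

The third input shows the case where the regex finds nothing: replaceAll simply returns the String as it was, and only the accent removal takes effect.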
* The term "character" itself is a confusing concept. Many think that a code point is equal to a character, but in fact it is more complicated than that.
To learn more about Unicode and normalization, read here, here and here.
What normalize does (depending on the chosen form) is break the character into N others: Ã becomes A and ~. This is done by the root code of a3 won’t be aE – rray