Normalizer or another way to format a String

-1

I have this method:

public static String trataNomeArquivo(String nomeArquivo) {
    return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "").toUpperCase();
}

It removes accents and so on, but if I receive a String such as DEMONSTRATIVO DE C301LCULO, it cannot format it into DEMONSTRATIVO DE CALCULO.

I also ran tests with URLDecoder (UTF-8, ISO-8859-1), and that did not work either.

  • 1

    What normalize does (depending on the form chosen) is break a character into N others: Ã becomes A and ~. This is done from the character's own code, so a 3 will never become an E.

2 answers

2


Before talking about your problem itself, a short introduction on Unicode:

Normalization

In Unicode, each character* has a unique numeric code (called a code point; read this article to understand the details). But some characters can be represented in more than one way, as defined by the normalization forms. Without going into too much detail, this basically means that the same character can be represented by different sequences of code points.

An example is the character Á (uppercase A with an acute accent), which can be represented in two ways:

  1. as the single code point U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE). In Unicode, the value of a code point is written in the form "U+xxxx", where "xxxx" is the value in hexadecimal
  2. as two code points:
    • U+0041 (LATIN CAPITAL LETTER A)
    • U+0301 (COMBINING ACUTE ACCENT)

We can see this using Java code. First I create a method that converts a String to a given form and then prints its code points:

import java.text.Normalizer;

// converts the String to the given form and prints its code points
public static void showCodePoints(String str, Normalizer.Form form) {
    String s = Normalizer.normalize(str, form);
    System.out.printf("Code points of the string '%s' in %s\n", s, form);
    s.codePoints().forEach(cp -> {
        System.out.printf(" - U+%04X %s\n", cp, Character.getName(cp));
    });
}

Let’s test this method with Á, in the NFC and NFD forms:

String s = "Á";
showCodePoints(s, Normalizer.Form.NFC);
showCodePoints(s, Normalizer.Form.NFD);

The output is:

Code points of the string 'Á' in NFC
 - U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
Code points of the string 'Á' in NFD
 - U+0041 LATIN CAPITAL LETTER A
 - U+0301 COMBINING ACUTE ACCENT

Note that in the NFC form the String has only the code point U+00C1, while in the NFD form it has the code points U+0041 and U+0301. Both, however, are displayed as Á when printed.
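One practical consequence is that two Strings that look identical on screen may not be equal according to equals. The sketch below illustrates this; the class name NormalizationEqualityDemo and the helper sameText are made up for this example and are not part of any library.

```java
import java.text.Normalizer;

class NormalizationEqualityDemo {
    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // Á as a single code point (NFC)
        String decomposed = "A\u0301"; // A followed by COMBINING ACUTE ACCENT (NFD)

        System.out.println(composed.equals(decomposed));                       // false
        System.out.println(composed.length() + " vs " + decomposed.length()); // 1 vs 2
        System.out.println(sameText(composed, decomposed));                    // true
    }
}
```

Normalizing both sides to the same form before comparing is what makes the comparison reliable; NFC was chosen here arbitrarily, and NFD would work just as well.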


Trying to find out what happened

You said you are receiving the String with the value C301LCULO. Assuming the original text should be CÁLCULO, it looks like the String was in NFD (or was converted to it) for some reason, but the A was lost and only the accent arrived. More precisely, what arrived was the accent's code (U+0301), which was turned into text and placed in the String; that would explain the 301.

So the problem must be at the source (whoever generated and sent this String). If you can, investigate the problem there (and fix it, of course). If you cannot change the source, the only way is to resort to a "gambiarra" (a kludge). Without more details on how the String is generated, that’s all I can say...
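To see why the method from the question cannot repair this value, compare a String containing a real accent with one containing the literal digits 301. This is only a sketch; the class name StripAccentsDemo is made up for this example, and the method body is copied from the question.

```java
import java.text.Normalizer;

class StripAccentsDemo {
    // Same logic as the method in the question.
    static String trataNomeArquivo(String nomeArquivo) {
        return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "")
                .toUpperCase();
    }

    public static void main(String[] args) {
        // A real accent: NFD splits Á into A + U+0301, and the regex drops U+0301.
        System.out.println(trataNomeArquivo("DEMONSTRATIVO DE C\u00C1LCULO"));
        // prints DEMONSTRATIVO DE CALCULO

        // The characters 3, 0 and 1 are plain ASCII digits: nothing is decomposed or removed.
        System.out.println(trataNomeArquivo("DEMONSTRATIVO DE C301LCULO"));
        // prints DEMONSTRATIVO DE C301LCULO
    }
}
```

By the time the text reaches this method, the digits carry no trace of having once been an accent, so no normalization form can turn them back into a letter.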


About your solution

In your answer you use the regex "C+[\\d+]+LCULO". I will comment a little on it and suggest an improvement.

The C+ part means "one or more letters C" (that is what the + means). So if the String starts with CCCCC, the regex will accept it. If you want only a single letter C, remove the + from there.

[\\d+], on the other hand, is not what it looks like. Brackets define a character class, which matches any one of the characters listed inside it. For example, [ab] means "the letter a or the letter b". Therefore [\\d+] means "one digit from 0 to 9 (\d) or the character +". This happens because inside the brackets the + "loses its powers" and becomes an ordinary character with no special meaning.

But the + after the brackets has not lost its powers, so [\\d+]+ means "one or more occurrences of a digit or +". In other words, your regex will accept Strings such as CCCC+++++LCULO. See here this regex working.

If you just want the letter "C" followed by one or more digits, then simply use C\\d+ (see here the difference). Another point: using Optional to check the regex seems a bit excessive to me. You can get the same result with a simple replaceAll:
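The behavior of this regex can be checked with a few inputs. The sketch below is only illustrative; the class name CharacterClassDemo and the helper matchesOriginal are made up for this example.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // The regex under discussion: inside the brackets, + is a literal plus sign.
    static final Pattern ORIGINAL = Pattern.compile("C+[\\d+]+LCULO");

    // Hypothetical helper: true if the whole string matches the regex.
    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOriginal("C301LCULO"));      // true (the intended case)
        System.out.println(matchesOriginal("CCCC+++++LCULO")); // true (probably not intended)
        System.out.println(matchesOriginal("C3+0+1LCULO"));    // true (digits and + can be mixed)
        System.out.println(matchesOriginal("CLCULO"));         // false (needs at least one digit or +)
    }
}
```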

public static String trataNomeArquivo(String nomeArquivo) {
    nomeArquivo = nomeArquivo.toUpperCase().replaceAll("C\\d+LCULO", "CALCULO");
    return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "").toUpperCase();
}

If the String does not match the regex C\\d+LCULO, replaceAll makes no substitution and the String is left unchanged, so it is fine to call it directly (there is no need to check whether the regex found a match, or to test whether the Optional is empty, etc.).


Of course, if the code is always 301, just use replaceAll("C301LCULO", "CALCULO").
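A quick way to confirm this is to run the replacement on Strings that do and do not match. This is a minimal sketch; the class name ReplaceAllDemo and the helper fix are made up for this example and isolate only the replaceAll step.

```java
class ReplaceAllDemo {
    // Hypothetical helper: only the replacement step, without the normalization.
    static String fix(String s) {
        return s.toUpperCase().replaceAll("C\\d+LCULO", "CALCULO");
    }

    public static void main(String[] args) {
        // Matches C followed by digits and LCULO, so the substitution happens.
        System.out.println(fix("DEMONSTRATIVO DE C301LCULO")); // DEMONSTRATIVO DE CALCULO

        // No match: the String comes back unchanged, with no extra checks needed.
        System.out.println(fix("RELATORIO ANUAL"));            // RELATORIO ANUAL

        // Repeated C and + signs are no longer accepted by the stricter regex.
        System.out.println(fix("CCCC+++++LCULO"));             // CCCC+++++LCULO
    }
}
```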


* The term "character" itself is a confusing concept. Many think that a code point is equal to a character, but in fact it is more complicated than that.

To learn more about Unicode and normalization, read here, here and here.

1

I had not noticed that these numbers do not correspond to any letter, so there is probably no Java function to format this. What I did was a real kludge:

private static final Pattern PATTERN_CALCULO = Pattern.compile("C+[\\d+]+LCULO");

public static String trataNomeArquivo(String nomeArquivo) {
    String group = Optional.ofNullable(nomeArquivo)
            .map(String::toUpperCase)
            .map(PATTERN_CALCULO::matcher)
            .filter(Matcher::find)
            .map(Matcher::group)
            .orElse(null);

    if(Objects.nonNull(group)) nomeArquivo = nomeArquivo.toUpperCase().replaceAll(group, "CALCULO");

    return Normalizer.normalize(nomeArquivo, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "").toUpperCase();
}
