Question about Normalizer + regex

Asked

Viewed 154 times

6

Could someone explain the code below to me?

returnStr = Normalizer.normalize(returnStr, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");

returnStr initially holds a URL.

  • The \p has a negated form, \P, so the regex could simply be "\\P{ASCII}", with no need for the [^ ]

  • but why "\\p" and not just "\p"? I didn't understand that part of the regex

  • Because it is a string, and in a string the \ character is an escape, so after the string is converted to a regex the result is [^\p{ASCII}], which makes it valid. If there were only one \, the result would be [^p{ASCII}], which would be wrong.

3 answers

3


Analyzing the expression [^\\p{ASCII}]

  • It is inside a String that will be compiled into a regex, and \ in a String is an escape character, so after the conversion the result is [^\p{ASCII}].
  • [] defines a character class; [^ ] is a negated class, i.e. instead of matching only what is in the class, it matches everything that is not.
  • \p, like \w, \d and \x, is a shorthand; it exists so you can write less: instead of [A-Za-z0-9_], just \w
  • {ASCII} is a parameter for the \p; its support depends on the regex library of the language/engine being used.
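To illustrate the points above, a quick test (the class name and sample strings are just for illustration):

```java
public class RegexDemo {
    public static void main(String[] args) {
        // [^\p{ASCII}] is a negated class: it matches what is NOT ASCII
        System.out.println("café".replaceAll("[^\\p{ASCII}]", "?")); // caf?

        // \w is a shorthand for [A-Za-z0-9_]
        System.out.println("a_1!".replaceAll("\\w", "*")); // ***!
    }
}
```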

Note

Using regex101, under Quick Reference > Meta sequences we find \p.

It matches a Unicode character with the property passed as parameter.

On that page some supported properties are listed; however, compared to the ASCII table, not all of them match, because ASCII only standardizes codes 0-127. Beyond that there is no single absolute standard, so it depends on the engine being used; PCRE, for example, from what I've seen, doesn't support {ASCII} with \p.

Addendum

The \p has a negated form, \P, so the expression could be written in the String simply as \\P{ASCII}, with no need for the negated class [^ ].
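A quick check that the two forms behave the same (the string "maçã" is chosen just as an example):

```java
public class NegationDemo {
    public static void main(String[] args) {
        String s = "maçã";
        // negated class vs. negated shorthand: both remove the non-ASCII characters
        System.out.println(s.replaceAll("[^\\p{ASCII}]", "")); // ma
        System.out.println(s.replaceAll("\\P{ASCII}", ""));    // ma
    }
}
```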

1

The other answers focused on explaining the regex, but did not explain the Normalizer, nor how the two work together.


Regex

Just to complement/reaffirm what has already been covered in detail by the other answers: the shortcut \\p{ASCII} is defined in the documentation as follows:

\p{ASCII}   All ASCII:[\x00-\x7F]

That is, it is equivalent to all characters of the ASCII table (code points between zero and 127, written in hexadecimal as \x00 and \x7F). To understand what a code point is, read here.
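A minimal check of that equivalence (the characters A and é are chosen arbitrarily, one inside and one outside the 0-127 range):

```java
public class AsciiCheck {
    public static void main(String[] args) {
        for (char c : "Aé".toCharArray()) {
            // \p{ASCII} matches exactly the range \x00-\x7F
            System.out.printf("U+%04X -> %b%n", (int) c,
                    String.valueOf(c).matches("\\p{ASCII}"));
        }
        // U+0041 -> true
        // U+00E9 -> false
    }
}
```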

Remembering that the backslash is written as \\ because of language escape rules.

The [^ ], in turn, is a negated character class (it matches anything that is not inside the brackets). That is why the expression matches any character that is not ASCII (and, as already stated in another answer, the expression [^\\p{ASCII}] could be replaced by \\P{ASCII} - with a capital "P").


Normalizer

The Normalizer, in turn, applies Unicode normalizations to a string. To understand it in detail, I suggest reading here, here and here (and, to dig deeper into the subject, you can read all the related questions).

In short, Unicode defines that some characters have more than one way of being represented. For example, á (the letter "a" with an acute accent) can be represented in two ways:

  1. composed - as the code point U+00E1 (LATIN SMALL LETTER A WITH ACUTE) (á)
  2. decomposed - as a combination of two code points (in this order): U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT)

The first form is called NFC (Canonical Composition) and the second NFD (Canonical Decomposition). The two forms above are considered "canonically equivalent" ways of representing the letter "a" with an acute accent. That is, they are two ways of representing the same thing.
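The difference between the two forms can be seen by inspecting the code points (a small sketch using java.text.Normalizer):

```java
import java.text.Normalizer;

public class NfdDemo {
    public static void main(String[] args) {
        String nfc = "\u00E1"; // á in composed form (one code point)
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(nfd.length()); // 2
        nfd.chars().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // U+0061 (a) followed by U+0301 (combining acute accent)
    }
}
```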

Therefore, Normalizer.normalize(returnStr, Normalizer.Form.NFD) converts the string to the NFD form. If it has accented characters, they are decomposed as described above: á is decomposed into a plus the acute accent, õ into o plus the tilde, ç into c plus the cedilla, and so on.

And as the characters representing the accents/cedilla are not in the ASCII value range, the replaceAll ends up removing them, leaving only the letters without accents/cedilla.

So what that code does is remove the accents and the cedilla (á, â, ã and à become a, ç becomes c, etc.). Ex.:

String str = "opção";

// prints "opcao"
System.out.println(Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

It is worth remembering that the regex removes any character that is not ASCII, which means that letters from other languages, among other characters such as emojis, will also be removed:

String str = "時opção😀";

// besides the accents, it also removes the Japanese character and the emoji, and prints "opcao"
System.out.println(Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));

Not to mention the thousands of "special" characters that Unicode defines, such as the various different types of spaces, hyphens, quotation marks, etc. Many of them - most of them, actually - are outside the ASCII table and therefore would also be removed.


But of course, it all depends on the strings you have. If you only have Portuguese text, for example, it would not be that problematic. But if you have other non-ASCII characters that should not be removed, then you would have to adjust the regex case by case. One example is to remove only the accents, using the Unicode category they belong to:

Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("\\p{M}", "")

In this case, \\p{M} matches only the categories Mark, Spacing Combining; Mark, Enclosing; and Mark, Nonspacing, which contain the accent and cedilla characters (not only the ones used in Portuguese; click the links to see the full list). This way, emojis and letters from other alphabets are preserved. Anyway, this is already a bit outside the scope of the question; it is only to show that regex is not magic and often needs adjustments that vary with the context (even more so when these Unicode properties are involved).
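Applying \\p{M} to the earlier example, the Japanese character survives while the combining marks produced by NFD are stripped:

```java
import java.text.Normalizer;

public class MarkDemo {
    public static void main(String[] args) {
        String s = "時opção";
        // \p{M} removes only the combining marks (accents, cedilla, etc.)
        String out = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        System.out.println(out); // 時opcao
    }
}
```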

1

Analyzing the regular expression: this is a negated character class. It replaces every occurrence of a non-ASCII character with the empty string (i.e., removes it).

  1. Take the URL
  2. Remove any non-ASCII character.
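The two steps above, sketched with a hypothetical URL (exemplo.com is just a placeholder):

```java
import java.text.Normalizer;

public class UrlDemo {
    public static void main(String[] args) {
        String url = "https://exemplo.com/opção"; // 1. take the URL
        // 2. decompose accents with NFD, then remove every non-ASCII character
        String ascii = Normalizer.normalize(url, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
        System.out.println(ascii); // https://exemplo.com/opcao
    }
}
```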

ASCII is a character table, containing letters, numbers and symbols, each with a corresponding numeric code.

  • The answer is right, but I think it would be better if it explained why it matches ASCII characters, noting that \p is a "shortcut", like \x, which I believe is the OP's real question.
