Regex is not checking the accented character, even though it is in the expression

I have the following Python regex to filter my application text input:

import re
good_chars_regexp = re.compile(r"^[A-Za-záéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ00-9\,\.\-\?\"\'\’\!\“\s\;\:\“\”\–\‘\’\’\/\\]+$", re.IGNORECASE)

So phrases like 'Olá amigos da Internet!' and 'Dúvida sobre Python' should pass.

When I perform the following check:

lista = ['Olá amigos da Internet!', 'Dúvida sobre Python', '@StackOverFlow']
for l in lista:
    print(re.match(good_chars_regexp, l) is not None)

I get: False, True, False. The first sentence, 'Olá amigos da Internet!', should return True.

The intention of my regex is to allow accented words, numbers and the special characters I have specified, such as !, ?, etc.

1 answer


You have been bitten by Unicode! But let's take it step by step.


First, I copied and pasted the word Olá from your code and ran the following:

from unicodedata import name
for s in 'Ola\u0301':  # 'Olá' exactly as pasted from the question; for each character in the string
    print(f'{s} {ord(s):4X} {name(s)}')

That is, for each character of the string, I print the character itself, its code point (in hexadecimal) and its Unicode name. The result was:

O   4F LATIN CAPITAL LETTER O
l   6C LATIN SMALL LETTER L
a   61 LATIN SMALL LETTER A
́  301 COMBINING ACUTE ACCENT

Yes, 4 characters, and the last one is the acute accent.

What happened is that this string is in NFD, one of the normalization forms defined by Unicode. The details are worth reading up on, but in short, the character á (letter "a" with an acute accent) can be represented in two ways:

  • as a single code point, U+00E1 (LATIN SMALL LETTER A WITH ACUTE);
  • as two code points, the letter "a" (U+0061) followed by the accent (U+0301, COMBINING ACUTE ACCENT).

The first form is known as NFC and the second as NFD.

The problem is that both forms are rendered the same way on screen (in the vast majority of fonts, if not all of them), and you only notice the difference if you inspect the code points and check what is actually in the string. So the regex does not match this string, because the combining accent is not in the list of valid characters.
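To make the difference concrete, here is a minimal sketch that builds both forms of the word with explicit escapes, so the result does not depend on how the text was copied. The names composed, decomposed and code_points are only illustrative:

```python
composed = "Ol\u00e1"      # NFC: "á" as a single code point
decomposed = "Ola\u0301"   # NFD: "a" followed by COMBINING ACUTE ACCENT


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 3 4
print(code_points(composed))           # ['U+004F', 'U+006C', 'U+00E1']
print(code_points(decomposed))         # ['U+004F', 'U+006C', 'U+0061', 'U+0301']
```

Both strings look identical when printed, but they compare as different and have different lengths.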

One way to solve this is to convert the string to NFC using unicodedata.normalize. The letter "a" and the accent are then combined into the single character á:

from unicodedata import name, normalize
for s in normalize('NFC', 'Olá'):
    print(f'{s} {ord(s):4X} {name(s)}')

See the difference:

O   4F LATIN CAPITAL LETTER O
l   6C LATIN SMALL LETTER L
á   E1 LATIN SMALL LETTER A WITH ACUTE

Another detail is that, inside brackets, several characters do not need to be escaped with \. And since you used the re.IGNORECASE flag, you do not need to list both upper and lower case letters in the expression, because the flag already considers both (i.e., you can write the regex in lower case only, or in upper case only).
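A quick way to convince yourself of both points is the sketch below. The patterns are reduced versions I made up for illustration, not your full expression, and the accented letters are written as escapes so they are guaranteed to be in NFC:

```python
import re

# The same punctuation class, with and without backslashes.
escaped = re.compile(r"^[\,\.\?\!\;\:]+$")
unescaped = re.compile(r"^[,.?!;:]+$")

samples = ["?!", "...", ";:,", "a?"]
same = all(
    (escaped.match(s) is None) == (unescaped.match(s) is None)
    for s in samples
)
print(same)  # True: both patterns accept and reject the same samples

# Only lower case letters are listed; re.IGNORECASE also covers upper case.
lower_only = re.compile("^[a-z\u00e1\u00e9\u00ed\u00f3\u00fa]+$", re.IGNORECASE)
print(lower_only.match("\u00c1GUA") is not None)  # True ("ÁGUA")
print(lower_only.match("\u00e1gua") is not None)  # True ("água")
```

Inside brackets, the characters that still need care are the backslash itself, the closing bracket, a ^ in first position and a - between two characters.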

The compiled expression (returned by re.compile) also has a match method, which you can call directly (instead of re.match(good_chars_regexp, etc), you can write just good_chars_regexp.match(etc)):

import re
from unicodedata import normalize

lista = ['Olá amigos da Internet!', 'Dúvida sobre Python', '@StackOverFlow']
good_chars_regexp = re.compile(r"^[a-záéíóúâêîôãõç0-9,.\-?\"'’!“\s;:“”\–‘’’/\\]+$", re.IGNORECASE)

for l in lista:
    # normalize to NFC first, so that letter + combining accent becomes a single character
    print(good_chars_regexp.match(normalize('NFC', l)) is not None)

The output is:

True
True
False

If you are willing to install an external module, an alternative is the regex module, which has more features than the re module. One that might help in this case is support for Unicode properties:

import regex
good_chars_regexp = regex.compile(r"^([0-9,.\-?\"'’!“\s;:“”\–‘’’/\\]|\p{Script=Latin}\p{M}?)+$", regex.IGNORECASE)
for l in lista: # no need to normalize anymore
    print(good_chars_regexp.match(l) is not None)

This way, the regex accepts either digits and the other listed characters (dot, comma, hyphen, quotes, etc), or \p{Script=Latin}\p{M}?.

Here, \p{Script=Latin} matches any character of the Latin script (which may be too broad if you only want texts in Portuguese), and \p{M} matches the "Mark" categories (all the general categories whose name starts with "M"), which include the acute accent. The ? right after it makes the mark optional (that is, we can have just the letter, or the letter followed by the accent, in case the string is in NFD).
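If you want to check which characters fall into the "Mark" categories without installing anything, the standard unicodedata module exposes the general category of each character. A small sketch (the helper is_mark is a name I made up for the example):

```python
from unicodedata import category, name


def is_mark(ch):
    """True if ch is in one of the "Mark" general categories (Mn, Mc, Me)."""
    return category(ch).startswith("M")


# a, á (precomposed), combining acute accent, combining tilde, combining cedilla
for ch in ["a", "\u00e1", "\u0301", "\u0303", "\u0327"]:
    print(f"U+{ord(ch):04X} {category(ch)} {is_mark(ch)} {name(ch)}")
```

The letters come out as Ll (lowercase letter), while the three combining characters come out as Mn (nonspacing mark), which is why \p{M} covers them.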


Note: this regex does not check for words. For example, the string !!!,,," " is considered valid. This is a little outside the scope of the question, but if the idea is, for example, to verify that the text has at least one letter, an additional check is needed.
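As one possible way to add such a check, the sketch below combines a reduced character class with a test for at least one letter using str.isalpha. The function is_valid and the shortened pattern are assumptions made for the example, not part of your code:

```python
import re
from unicodedata import normalize

# Reduced character class for illustration: letters (some accented, written
# as escapes), digits, whitespace and a few punctuation marks.
allowed = re.compile(
    r'^[a-z\u00e1\u00e9\u00ed\u00f3\u00fa\u00e3\u00f5\u00e70-9,.!?"\s]+$',
    re.IGNORECASE,
)


def is_valid(text):
    """Accept text made only of allowed characters and with at least one letter."""
    text = normalize("NFC", text)
    if allowed.match(text) is None:
        return False
    return any(ch.isalpha() for ch in text)


print(is_valid('!!!,,," "'))   # False: only punctuation and whitespace
print(is_valid("Ol\u00e1!"))   # True
print(is_valid("Ola\u0301!"))  # True: NFD input is normalized first
```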


Finally, another option, a little more complicated, which works regardless of whether the string is in NFC or NFD and does not require normalization, would be:

good_chars_regexp = re.compile(r"^([a-záéíóúâêîôãõç0-9,.\-?\"'’!“\s;:“”\–‘’’/\\]|[aeiou]\u0301|[aeio]\u0302|[ao]\u0303|c\u0327)+$", re.IGNORECASE)

In this case, I accept either the accented letters (áéí...) or the plain letters followed by the respective accent. For that I used Unicode escapes (\u followed by the hexadecimal code of each character) with the codes of the acute accent, circumflex, tilde and cedilla (each preceded by the letters that can carry them). Thus, the regex handles both the NFC and the NFD cases.
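To check that this approach really accepts both forms, you can run something like the sketch below. It uses a shortened character class (an assumption, to keep the example small) and generates the NFC and NFD versions of each phrase with normalize, only for testing; the names pattern, accepts and phrases are illustrative:

```python
import re
from unicodedata import normalize

# Shortened character class for illustration; accented letters are written
# as \u escapes, which the re module understands inside the pattern.
pattern = re.compile(
    r"^([a-z\u00e1\u00e9\u00ed\u00f3\u00fa\u00e2\u00ea\u00ee\u00f4\u00e3\u00f5\u00e7"
    r"0-9,.!?\s]"
    r"|[aeiou]\u0301|[aeio]\u0302|[ao]\u0303|c\u0327)+$",
    re.IGNORECASE,
)


def accepts(text):
    return pattern.match(text) is not None


phrases = ["Ol\u00e1 amigos da Internet!", "D\u00favida sobre Python", "A\u00e7\u00e3o"]
for phrase in phrases:
    nfc = normalize("NFC", phrase)
    nfd = normalize("NFD", phrase)
    # Both forms should be accepted without normalizing before the match.
    print(accepts(nfc), accepts(nfd))

print(accepts("@StackOverFlow"))  # False: "@" is not in the allowed set
```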

  • 1

    Thank you for the reply! Very complete and full of references!
