Regular expression that recognizes Portuguese words (accented) using Python

Develop the function frequency(), which takes a string as input, calculates the frequency of each word in the string, and returns a dictionary that maps each word in the string to its frequency. You must use a regular expression (regex) to get the list of all words in the string.

What I did:

from re import findall
def frequency(string):
    padrao = "[a-zA-Z]+"
    palavras = findall(padrao,string)
    dicionario = {}
    for w in palavras:
        if w in dicionario:
            dicionario[w] += 1
        else:
            dicionario[w] = 1
    return dicionario


string = "Eu desconfio que houve uma sabotagem, exatamente para manchar a gestão eficiente que está sendo feita na Cedae, preparando ela para o leilão"
print(frequency(string))

Output:

{'Eu': 1, 'desconfio': 1, 'que': 2, 'houve': 1, 'uma': 1, 'sabotagem': 1, 'exatamente': 1, 'para': 2, 'manchar': 1, 'a': 1, 'gest': 1, 'o': 3, 'eficiente': 1, 'est': 1, 'sendo': 1, 'feita': 1, 'na': 1, 'Cedae': 1, 'preparando': 1, 'ela': 1, 'leil': 1}

You can see in the output that the regular expression [a-zA-Z]+ does not match the accented characters of Portuguese words.

How can I improve the regex so that accented words are captured correctly?

2 answers

An alternative is to use the shortcut \w, which in Python 3 already matches accented letters by default:

from re import findall
def frequency(string):
    palavras = findall(r"\w+", string)
    dicionario = {}
    for w in palavras:
        if w in dicionario:
            dicionario[w] += 1
        else:
            dicionario[w] = 1
    return dicionario


string = "Eu desconfio que houve uma sabotagem, exatamente para manchar a gestão eficiente que está sendo feita na Cedae, preparando ela para o leilão"
print(frequency(string))

Note, though, that \w is very broad: it matches letters from other scripts (such as Japanese, Arabic, Cyrillic, etc.), and it also matches digits and the underscore character _ (that is, it will consider "123" and "a_b" to be words).

If you only want the letters, you can use:

palavras = findall(r"[^\W\d_]+", string)

This is a negated character class: it matches everything that is not listed between [^ and ]. Here we have \W (everything that is not \w), \d (digits), and the character _ itself. That is, it matches only the letters that \w already matched, ignoring digits and _.

For both of the above cases, the output is:

{'Eu': 1, 'desconfio': 1, 'que': 2, 'houve': 1, 'uma': 1, 'sabotagem': 1, 'exatamente': 1, 'para': 2, 'manchar': 1, 'a': 1, 'gestão': 1, 'eficiente': 1, 'está': 1, 'sendo': 1, 'feita': 1, 'na': 1, 'Cedae': 1, 'preparando': 1, 'ela': 1, 'o': 1, 'leilão': 1}
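On strings containing digits or underscores, however, the two patterns do differ. A quick comparison (the sample string is made up for illustration):

```python
import re

texto = "abc 123 a_b café"

# \w+ also matches digits and the underscore
print(re.findall(r"\w+", texto))        # ['abc', '123', 'a_b', 'café']

# [^\W\d_]+ keeps only the letters, splitting on "_" and skipping "123"
print(re.findall(r"[^\W\d_]+", texto))  # ['abc', 'a', 'b', 'café']
```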

Detail: Unicode normalization

As an addendum, test with this string (copy and paste it instead of typing it directly):

string = "leilão leilão"
print(frequency(string))

The result will be:

{'leilão': 1, 'leila': 1, 'o': 1}


This is because one of the occurrences of "leilão" is normalized to the NFD form. For more details, I suggest reading about Unicode normalization, but basically Unicode defines that some accented letters can be represented in more than one way, and in the NFD form letters like ã are decomposed into 2 characters (the a and the ~).

So the regex can no longer detect that it is all one word, since the ~ is not a letter, digit, or _, and therefore \w ignores this character.

A solution for this case would be to normalize to NFC (so the 2 characters are "united" into one, i.e. the a and the ~ become the ã, which the regex can detect):

from re import findall
import unicodedata as uc

def frequency(string):
    # normalize to NFC before extracting the words
    palavras = findall(r"[^\W\d_]+", uc.normalize('NFC', string))
    # the rest of the function stays the same

It is not clear where the strings that the function will parse come from, but it is possible that they are in NFD (and as the example above shows, you cannot tell visually; the difference only appears when the program manipulates the string, and it can produce inconsistent results if not handled).
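The NFD/NFC difference can be seen directly with unicodedata (the string below is normalized explicitly so the example works regardless of how this page encodes the literal):

```python
import re
import unicodedata as uc

nfc = uc.normalize("NFC", "leilão")   # "ã" as a single code point
nfd = uc.normalize("NFD", nfc)        # "a" followed by a combining tilde
print(len(nfc), len(nfd))             # 6 7 - same text, different lengths

# the combining tilde is not matched, so the NFD word is split in two
print(re.findall(r"[^\W\d_]+", nfd))                       # ['leila', 'o']

# after normalizing back to NFC, the word is whole again
print(re.findall(r"[^\W\d_]+", uc.normalize("NFC", nfd)))  # ['leilão']
```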


If you want to restrict yourself to words of the Portuguese language only, you can also do:

palavras = findall(r"[a-záéíóúçâêôãõà]+", uc.normalize('NFC', string), re.I)

I included the accented letters and the cedilla (I have never seen a Portuguese word with a circumflex on the i, so I did not include î, but if needed just add it). I also used the re.I flag so the regex considers both uppercase and lowercase letters (that way I don't need to add A-ZÁÉÍ... to the regex).


It is worth remembering that Portuguese has compound words, so the hyphen should be included in the regex (but there must be at least a few letters before and after it). An alternative would be:

palavras = findall(r"[a-záéíóúçâêôãõà]+(?:-[a-záéíóúçâêôãõà]+)*", uc.normalize('NFC', string), re.I)

Thus, we have a sequence of letters (as in the previous example), followed by "hyphen + letters", and this "hyphen + letters" sequence can occur zero or more times (indicated by the * quantifier). The parentheses use the (?: syntax so that this is a non-capturing group (without the ?:, it would be a capturing group, and the documentation says that findall returns only the groups when they are present - that is, it would return only the part matched by the parentheses; by using a non-capturing group I guarantee that the whole matches are returned).

Thus, the string can contain words like "beija-flor" ("hummingbird"), which will be counted as a single word (with the previous regex, "beija" and "flor" would be considered separate words).
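A quick check of both patterns on a sample string (made up for illustration) containing hyphenated words:

```python
import re

letras = r"[a-záéíóúçâêôãõà]+"
texto = "o beija-flor e o bem-te-vi"

# without the hyphen part, each compound word is split into pieces
print(re.findall(letras, texto, re.I))
# ['o', 'beija', 'flor', 'e', 'o', 'bem', 'te', 'vi']

# with the optional "hyphen + letters" group, each compound counts as one word
print(re.findall(f"{letras}(?:-{letras})*", texto, re.I))
# ['o', 'beija-flor', 'e', 'o', 'bem-te-vi']
```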


Finally, for the case mentioned in the comments (that letters after the apostrophe should be ignored), an alternative is:

palavras = findall(r"(?<!')[a-záéíóúçâêôãõà]+(?:-[a-záéíóúçâêôãõà]+)*", uc.normalize('NFC', string), re.I)

Now I use a negative lookbehind (the (?<!') part), which checks that the letters are not preceded by an apostrophe. Thus, it ignores the "'s" in "Coleridge's" (only "Coleridge" counts as a word; the "s" will not be counted and will not even appear in the results). Remember that without the lookbehind, the "s" would be counted as a separate word.
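The effect of the lookbehind, side by side (sample string taken from the comments):

```python
import re

letras = r"[a-záéíóúçâêôãõà]+"
texto = "Coleridge's Ancient Mariner"

# without the lookbehind, the "s" after the apostrophe is a separate match
print(re.findall(letras, texto, re.I))
# ['Coleridge', 's', 'Ancient', 'Mariner']

# with the lookbehind, letters right after an apostrophe are skipped
print(re.findall(f"(?<!'){letras}", texto, re.I))
# ['Coleridge', 'Ancient', 'Mariner']
```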


But if you want "Coleridge’s" to be a single word, just use:

palavras = findall(r"[a-záéíóúçâêôãõà]+(?:[-'][a-záéíóúçâêôãõà]+)*", uc.normalize('NFC', string), re.I)

The difference is that now I use [-'] (a hyphen or an apostrophe) to mark compound words.

Of course, you can swap [a-záéí...] for \w or [^\W\d_].


The disadvantage of these expressions for compound words is having to repeat the part that matches the letters, but this guarantees that no word starts or ends with ' or a hyphen. The repetition can be avoided like this:

letras = r"[a-záéíóúçâêôãõà]+"
palavras = findall(f"{letras}(?:[-']{letras})*", uc.normalize('NFC', string), re.I)

So you only need to change the definition of "letter" once.


Finally - not directly related to regex - you could also build the result like this:

for w in palavras:
    dicionario[w] = dicionario.get(w, 0) + 1

This works because dictionaries have the get method, which can return a default value if the key does not exist (here, if the key w does not exist, it returns zero).
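A minimal counting loop using get (the word list is made up for illustration):

```python
palavras = ["que", "para", "que", "para", "que"]
dicionario = {}
for w in palavras:
    # get returns 0 when the key is absent, so no if/else is needed
    dicionario[w] = dicionario.get(w, 0) + 1
print(dicionario)  # {'que': 3, 'para': 2}
```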

Or you can use a Counter, which does exactly what you need:

import re
import unicodedata as uc
from collections import Counter

def frequency(string):
    letras = r"[a-záéíóúçâêôãõà]+"
    # normalize to NFC, then extract words (including compounds with - or ')
    palavras = re.findall(f"{letras}(?:[-']{letras})*", uc.normalize('NFC', string), re.I)
    return Counter(palavras)


The easiest way to accept accented characters is this:

[A-zÀ-ú]+ // accepts lowercase and uppercase characters
[A-zÀ-ÿ]+ // as above, but including letters with a diaeresis (also includes [ ] ^ \ × ÷)
[A-Za-zÀ-ÿ]+ // as above, but without including [ ] ^ \
[A-Za-zÀ-ÖØ-öø-ÿ]+ // as above, but without including [ ] ^ \ ×
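To see why the bare A-z range is risky, compare the first and third variants on a sample string (made up for illustration):

```python
import re

# "A-z" spans ASCII codes 65-122, which also covers "[", "\", "]", "^", "_", "`"
print(re.findall(r"[A-z]+", "a_b [x]"))        # ['a_b', '[x]']

# listing A-Z and a-z separately excludes those punctuation characters
print(re.findall(r"[A-Za-zÀ-ÿ]+", "a_b [x]"))  # ['a', 'b', 'x']
```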

so your code can become:

 # -*- coding: utf-8 -*-
from re import findall
def frequency(string):
    padrao = "[A-Za-zÀ-ÿ]+"
    palavras = findall(padrao,string)
    dicionario = {}
    for w in palavras:
        if w in dicionario:
            dicionario[w] += 1
        else:
            dicionario[w] = 1
    return dicionario


string = "Eu desconfio que houve uma sabotagem, na gestão exatamente para manchar a gestão eficiente que está sendo feita na Cedae, preparando ela para o leilão"
print(frequency(string))

Output:

{'exatamente': 1, 'a': 1, 'feita': 1, 'eficiente': 1, 'sendo': 1, 'para': 2, 'manchar': 1, 'na': 2, 'ela': 1, 'está': 1, 'o': 1, 'preparando': 1, 'sabotagem': 1, 'houve': 1, 'desconfio': 1, 'uma': 1, 'gestão': 2, 'Eu': 1, 'que': 2, 'Cedae': 1, 'leilão': 1}

Example with special characters in the text contents:

# -*- coding: utf-8 -*-
from re import findall
def frequency(string):
    padrao = "[A-Za-zÀ-ÿ^\']+"
    palavras = findall(padrao,string)
    dicionario = {}
    for w in palavras:
        if w in dicionario:
            dicionario[w] += 1
        else:
            dicionario[w] = 1
    return dicionario


string = "[Coleridge's \"Ancient Mariner.\"]"
print(frequency(string))

Output:

{'Mariner': 1, 'Ancient': 1, "Coleridge's": 1}
  • Isn’t a "+" missing at the end? I tested here and it matched letters, not words

  • Yes, I’ve updated my answer now.

  • I picked another text that has [Coleridge’s "Ancient Mariner."] in the middle... And now? I just want the words...

  • @Eds In this case, is Coleridge's a single word? Or are "Coleridge" and "s" different "words"?

  • @hkotsubo the ’s can be ignored

  • @Brunovdutra: You forgot the "+" at the end of the example regex. Could you please edit?

  • You can use the following regex [A-Za-zà-"']+ to disregard the character (').

  • @Eds I updated my answer with the option to ignore the ’s (and another option to consider "Coleridge’s" a single word)

  • @Eds updated it for you.
