Regex for full sentence in capital letters


3

I'm going through the first column of the body of an HTML table, extracting the text from each cell with JavaScript and testing it against a regex.

The regex should identify which strings (the content of each cell) consist exclusively of words written entirely in capital letters. Special characters, punctuation marks and mathematical symbols should be ignored, but accented letters count as letters.

I am having trouble finding the right regex for the sample cases below, which are representative of the whole data set. On the left is the text extracted from the table; on the right is the result that JavaScript's .test() should return:

(+/-) Provisão Despesas Administrativas - DPVAT ---------- false
PRÊMIOS EMITIDOS(-) PLANOS DE APOSENTADORIA -------------- true
(+) Prêmios - Riscos Vigentes Não Emitidos --------------- false
(+) OUTRAS RECEITAS E DESPESAS OPERACIONAIS -------------- true
(=) LUCRO LÍQUIDO / PREJUÍZO ----------------------------- true
CIRCULANTE ----------------------------------------------- true
(-) Redução ao valor recuperável ------------------------- false
Operações com Resseguradoras ----------------------------- false
ATIVOS DE RESSEGURO E RETROCESSÃO - PROVISÕES TÉCNICAS --- true
Prêmios Diferidos - PPNG --------------------------------- false
**********TOTAL DO ATIVO********** ----------------------- true

However, the regex I use in the isUpperCase() function below does not handle these cases correctly. For example, it returns true for "Prêmios Diferidos - PPNG", when it should return false.

Function 1 below tests the string against the regex, and Function 2 applies formatting to the row if all the words in the cell are uppercase. Both run correctly, apart from the regex limitation.

Function 1 - isUpperCase()

const isUpperCase = ( str ) => {
    // Regex the text is tested against
    // FIXME: matches any run of 3 uppercase letters, so mixed-case text also passes
    return (/[A-ZÁÀÂÃÉÈÍÏÓÔÕÖÚÇÑ]{3}/).test(str);
}

Function 2 - addFormatting()

const addFormatting = ( table ) => {
    let tableElement = document.querySelector('.' + table);
    let cellCollection = tableElement.querySelectorAll('tbody tr > td:first-child');

    cellCollection.forEach( function ( cell ) {
        if( isUpperCase( cell.innerText ) ) {
            cell.parentNode.setAttribute(
                "style",
                `color: rgb(0, 92, 169);
                padding-left: -15px;
                font-weight: bold;`
            )
        } else {
            cell.style.paddingLeft = "32px";
        }
    });
}

I need help finding a regex that handles the cases listed above; that would be enough for my use case.

2 answers

6


One solution is to remove everything that is not a letter, and then test whether what is left contains only uppercase letters:

function isUpperCase(s) {
    return /^[A-Z]+$/.test(s.normalize('NFD').replace(/[^A-Za-z]/g, ''));
}

[ 
   '(+/-) Provisão Despesas Administrativas - DPVAT', 
   'PRÊMIOS EMITIDOS(-) PLANOS DE APOSENTADORIA', 
   '(+) Prêmios - Riscos Vigentes Não Emitidos', 
   '(+) OUTRAS RECEITAS E DESPESAS OPERACIONAIS', 
   '(=) LUCRO LÍQUIDO / PREJUÍZO', 
   'CIRCULANTE', 
   '(-) Redução ao valor recuperável', 
   'Operações com Resseguradoras', 
   'ATIVOS DE RESSEGURO E RETROCESSÃO - PROVISÕES TÉCNICAS', 
   'Prêmios Diferidos - PPNG', 
   '**********TOTAL DO ATIVO**********'
].forEach(s => {
    console.log(`${s} = ${isUpperCase(s)}`);
});

I used normalize to convert the string to NFD form. The details of how it works are more involved, but basically characters such as á (the letter a with an accent) are "broken" (decomposed) in two: the unaccented letter a and the accent itself.
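To make the decomposition concrete, here is a minimal sketch (the variable and helper names are only illustrative) that compares the composed and decomposed forms of á:

```javascript
// "á" as a single precomposed code point (U+00E1).
const composed = '\u00E1';
// NFD splits it into the base letter plus a combining accent.
const decomposed = composed.normalize('NFD');

// Illustrative helper: list the code points of a string in hexadecimal.
const toHex = (s) => [...s].map((c) => c.codePointAt(0).toString(16));

console.log(composed.length, toHex(composed));     // 1 [ 'e1' ]
console.log(decomposed.length, toHex(decomposed)); // 2 [ '61', '301' ]
```

The two strings look the same on screen, but only the decomposed one exposes a plain a that a range such as [A-Za-z] can match.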

Then I use replace to remove [^A-Za-z] (anything that is not a letter from a to z, uppercase or lowercase). I actually replace these characters with '' (an empty string), which is the same as removing them, and the g flag ensures that all occurrences are removed. Thanks to NFD normalization, accents are deleted too: the á was split into a and ´, and the accent is removed. Therefore, what is left contains only uppercase and lowercase letters.

Then I use test with the regex ^[A-Z]+$: the anchors ^ and $ mark the beginning and end of the string respectively, and [A-Z]+ matches one or more capital letters, so the whole string must consist of capital letters only.

That is, the normalize along with replace eliminates accents and anything that is not a letter, and then I test if what is left has only uppercase letters.
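As a rough illustration of the intermediate values, the sketch below prints what is left after normalize and replace for a few inputs. The helper name lettersOnly is hypothetical and only used here to expose the intermediate string:

```javascript
// Hypothetical helper: returns only the letters A-Z / a-z that remain
// after NFD normalization (accents and every other character are dropped).
function lettersOnly(s) {
    return s.normalize('NFD').replace(/[^A-Za-z]/g, '');
}

const samples = [
    '(=) LUCRO LÍQUIDO / PREJUÍZO',
    'Prêmios Diferidos - PPNG',
    '(+/-) 123',
];

for (const s of samples) {
    const letters = lettersOnly(s);
    console.log(`"${s}" -> "${letters}" -> ${/^[A-Z]+$/.test(letters)}`);
}
// "(=) LUCRO LÍQUIDO / PREJUÍZO" -> "LUCROLIQUIDOPREJUIZO" -> true
// "Prêmios Diferidos - PPNG" -> "PremiosDiferidosPPNG" -> false
// "(+/-) 123" -> "" -> false
```

Note the last case: a string without any letters becomes empty, and since + requires at least one character, the result is false.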


There is another option (which does not yet work in all browsers), which is to use Unicode property escapes:

// *** Not supported by Firefox and IE at the time of writing ***
function isUpperCase(s) {
    return /^\p{Lu}+$/u.test(s.replace(/\P{L}/ug, ''));
}

[ 
   '(+/-) Provisão Despesas Administrativas - DPVAT', 
   'PRÊMIOS EMITIDOS(-) PLANOS DE APOSENTADORIA', 
   '(+) Prêmios - Riscos Vigentes Não Emitidos', 
   '(+) OUTRAS RECEITAS E DESPESAS OPERACIONAIS', 
   '(=) LUCRO LÍQUIDO / PREJUÍZO', 
   'CIRCULANTE', 
   '(-) Redução ao valor recuperável', 
   'Operações com Resseguradoras', 
   'ATIVOS DE RESSEGURO E RETROCESSÃO - PROVISÕES TÉCNICAS', 
   'Prêmios Diferidos - PPNG', 
   '**********TOTAL DO ATIVO**********'
].forEach(s => {
    console.log(`${s} = ${isUpperCase(s)}`);
});

The logic is the same as in the previous regex, but now I use Unicode properties. Here I remove everything that is not a letter (\P{L}) and check whether what is left consists only of capital letters (\p{Lu}). In this case you do not need to normalize the string, but remember that the Unicode categories are very comprehensive and include letters from other writing systems (such as Japanese, Arabic, Cyrillic, Greek, etc). If you don't want to be that comprehensive and only want the Latin alphabet, use the previous solution (or the alternative below, without regex).

Note that in this case the regex must have the u flag set, so that the Unicode property escapes are recognized correctly.

It is worth remembering that, according to the documentation, Firefox and IE did not yet support this syntax at the time of writing (which may be another reason not to use it; I left it here only as additional information).
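To see how much broader the Unicode version is, here is a small comparison sketch. The names isUpperAscii and isUpperUnicode are made up for this example so both versions can live side by side, and it assumes an engine that supports \p{...} with the u flag:

```javascript
// ASCII-based version (normalize + replace), as in the first solution.
const isUpperAscii = (s) =>
    /^[A-Z]+$/.test(s.normalize('NFD').replace(/[^A-Za-z]/g, ''));

// Unicode-property version, as in the second solution.
const isUpperUnicode = (s) =>
    /^\p{Lu}+$/u.test(s.replace(/\P{L}/gu, ''));

const latin = 'LUCRO LÍQUIDO';
const greek = '\u0391\u0398\u0397\u039D\u0391';          // Greek capitals
const cyrillic = '\u041C\u041E\u0421\u041A\u0412\u0410'; // Cyrillic capitals

for (const s of [latin, greek, cyrillic]) {
    console.log(s, isUpperAscii(s), isUpperUnicode(s));
}
// LUCRO LÍQUIDO true true
// ΑΘΗΝΑ false true
// МОСКВА false true
```

The ASCII version rejects the Greek and Cyrillic strings because nothing is left after the replace, while the Unicode version accepts them as uppercase.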


Regex-free

You can also solve this without regex, just by testing the characters of the string one by one:

function isUpperCase(s) {
    s = s.normalize('NFD');
    let hasUpper = false; // tracks whether at least one uppercase letter was found
    for (let i = 0; i < s.length; i++) {
        let c = s.codePointAt(i);

        // uppercase letter (A-Z): record that one was found
        if (65 <= c && c <= 90) hasUpper = true;

        // lowercase letter (a-z): return false right away
        if (97 <= c && c <= 122) return false;
    }

    return hasUpper;
}

[ 
   '(+/-) Provisão Despesas Administrativas - DPVAT', 
   'PRÊMIOS EMITIDOS(-) PLANOS DE APOSENTADORIA', 
   '(+) Prêmios - Riscos Vigentes Não Emitidos', 
   '(+) OUTRAS RECEITAS E DESPESAS OPERACIONAIS', 
   '(=) LUCRO LÍQUIDO / PREJUÍZO', 
   'CIRCULANTE', 
   '(-) Redução ao valor recuperável', 
   'Operações com Resseguradoras', 
   'ATIVOS DE RESSEGURO E RETROCESSÃO - PROVISÕES TÉCNICAS', 
   'Prêmios Diferidos - PPNG', 
   '**********TOTAL DO ATIVO**********'
].forEach(s => {
    console.log(`${s} = ${isUpperCase(s)}`);
});

I again convert the string to NFD form to separate the accents from the letters, and then use codePointAt to get each code point of the string (a code point is the numeric value that Unicode assigns to a character). To simplify, the letters A to Z have code points with the same values as in the ASCII table, so the check just compares against those values.

If there is any lowercase letter, I return false right away. If at least one uppercase letter was found, I return true (if none was found, the return value is false). Any character that is not a letter is ignored (thanks to NFD normalization, the accent is ignored too: the letter á is decomposed into a and ´, so the letter is considered but the accent is not).
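The sketch below is a variation of the same loop, written with for...of (which iterates by code point), followed by a few edge cases. The name isUpperCaseLoop is only used here to keep it apart from the function above:

```javascript
// Variation of the loop above using for...of instead of an index.
function isUpperCaseLoop(s) {
    let hasUpper = false;
    for (const ch of s.normalize('NFD')) {
        const c = ch.codePointAt(0);
        if (97 <= c && c <= 122) return false;   // a-z: not all uppercase
        if (65 <= c && c <= 90) hasUpper = true; // A-Z: remember we saw one
    }
    return hasUpper;
}

console.log(isUpperCaseLoop('TOTAL DO ATIVO 2020')); // true
console.log(isUpperCaseLoop('(+/-) 123'));           // false
console.log(isUpperCaseLoop(''));                    // false
console.log(isUpperCaseLoop('PREJUÍZO'));            // true
console.log(isUpperCaseLoop('Prejuízo'));            // false
```

Digits and symbols never change the result, and a string without any letter yields false because hasUpper is never set.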

  • 1

    Excellent! I loved your explanation too! Thank you so much!

2

The answer by @hkotsubo is very good and covers every situation, including text that may contain accented letters that are not part of Portuguese (e.g. ¡, ö, ă etc.).

But I would like to offer a slightly simpler solution for texts that are only in Portuguese, where you do not need .normalize() or .replace(), using the regex:

/[a-zà-ü]/

This expression covers the characters:

a-z -> lowercase "a" to "z"
à-ü -> à, á, â, ã, ä, å, æ, ç, è, é, ê, ë, ì, í, î,
       ï, ð, ñ, ò, ó, ô, õ, ö, ÷, ø, ù, ú, û, ü

Many of these characters are not part of Portuguese, but since I want to cover à through ü as they appear in the Unicode table, it is simpler to use à-ü than to split the range. The ü (u with diaeresis) was removed from modern Brazilian Portuguese, but I included it in case it still shows up.
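For reference, a small sketch can enumerate what the range à-ü (U+00E0 to U+00FC) actually covers. The tighter range at the end is a hypothetical alternative, not something the approach requires:

```javascript
// Build the list of characters covered by the range à-ü.
const first = 0xE0; // à
const last = 0xFC;  // ü
const covered = [];
for (let cp = first; cp <= last; cp++) {
    covered.push(String.fromCodePoint(cp));
}

console.log(covered.length);    // 29
console.log(covered.join(' '));

// The division sign sits inside the range, so it is treated as "lowercase".
console.log(/[a-zà-ü]/.test('10 ÷ 2'));    // true

// Hypothetical tighter range that skips ÷ (U+00F7).
console.log(/[a-zà-öø-ü]/.test('10 ÷ 2')); // false
```

Whether the extra characters matter depends on the data; for cells that never contain ÷, the simpler à-ü range behaves the same.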

In the if that calls isUpperCase(), negate the check by adding ! before the function name:

if( !isUpperCase( cell.innerText ) ) {
    ↑

The return /[a-zà-ü]/.test(str); returns true if the text contains any of these lowercase characters, including all the accented letters of Brazilian Portuguese; that is, it indicates that there is at least one lowercase letter in the text. Because the check is negated, the if branch still handles the all-uppercase case and the else branch the mixed-case one, so the bodies of the two branches stay where they were.

Testing:

const isUpperCase = ( str ) => {
    // Returns true when the text contains at least one lowercase letter
    return /[a-zà-ü]/.test(str);
}

const addFormatting = ( table ) => {
    let tableElement = document.querySelector('.' + table);
    let cellCollection = tableElement.querySelectorAll('tbody tr > td:first-child');

    cellCollection.forEach( function ( cell ) {
        if( !isUpperCase( cell.innerText ) ) {
            // no lowercase letter found: the cell is all uppercase
            cell.nextElementSibling.textContent = "true"; // delete this line (test output only)
            cell.parentNode.setAttribute(
                "style",
                `color: rgb(0, 92, 169);
                padding-left: -15px;
                font-weight: bold;`
            )
        } else {
            cell.nextElementSibling.textContent = "false"; // delete this line (test output only)
            cell.style.paddingLeft = "32px";
        }
    });
}

addFormatting("tabela");
<table border="1" class="tabela">
   <tr>
      <td>
         (+/-) Provisão Despesas Administrativas - DPVAT
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         PRÊMIOS EMITIDOS(-) PLANOS DE APOSENTADORIA
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         (+) Prêmios - Riscos Vigentes Não Emitidos
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         (+) OUTRAS RECEITAS E DESPESAS OPERACIONAIS
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         (=) LUCRO LÍQUIDO / PREJUÍZO
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         CIRCULANTE
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         (-) Redução ao valor recuperável
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         Operações com Resseguradoras
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         ATIVOS DE RESSEGURO E RETROCESSÃO - PROVISÕES TÉCNICAS
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         Prêmios Diferidos - PPNG
      </td>
      <td>
      </td>
   </tr>
   <tr>
      <td>
         **********TOTAL DO ATIVO**********
      </td>
      <td>
      </td>
   </tr>
</table>

  • Just one detail: the range à-ü also covers other "strange" characters, such as æ and ÷, among others, although it is better to cover too much than too little. It also does not handle the cases where the accents are in NFD. I would rename the function, because now it does the opposite of isUpperCase :-) Anyway, +1

  • 1

    Haha... You are quite the perfectionist, and I admire that. I am a bit of one too, but not as much, haha. Thanks, friend!
