Capitalize JavaScript text, ignoring abbreviations

10

I have some JavaScript code that capitalizes text and handles a few exceptions.

However, I would like to handle a few more cases: ignoring abbreviations, that is, words with a period before or after the first letter, and ignoring some Roman numerals.

Any suggestions? Here is my code so far:

function testaCapitalize() {
  var texto1 = "alto dO cruzeiro";
  var texto2 = "joão paulo II";
  var texto3 = "N.S. aparecida";
  var texto4 = "N.S. DAS GRAÇAS";
  document.getElementById('resultado').innerHTML =
    capitalize(texto1) +
    "<br/>" +
    capitalize(texto2) +
    "<br/>" +
    capitalize(texto3) +
    "<br/>" +
    capitalize(texto4);
}

function capitalize(texto) {

  texto = texto.toLowerCase().replace(/(?:^|\s)\S/g, function(capitalize) {
    return capitalize.toUpperCase();
  });
  // prepositions as they appear after capitalization
  var PreposM = ["Da", "Do", "Das", "Dos", "A", "E", "De", "DE"];
  // lowercase replacements
  var prepos = ["da", "do", "das", "dos", "a", "e", "de", "de"];

  for (var i = PreposM.length - 1; i >= 0; i--) {
    texto = texto.replace(RegExp("\\b" + PreposM[i].replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&') + "\\b", "g"), prepos[i]);
  }

  return texto;
}
<input type="button" onclick="testaCapitalize()" value='Testa'>
<div id='resultado'>

</div>

  • Judging by the array items, I suppose you are from Juazeiro/BA, Marlucio! Your question, the answers and the comments here helped me a lot. Thanks!

2 answers

12


It is probably simpler to use split to break the text into words, check each word separately, and then put everything back together at the end.

This is easier than trying to write one giant regex that handles all cases at once (it is possible, but I don't think it is worth the hassle).

Assuming there is always a single space separating the words, one way to solve it would be:

function abreviacao(s) {
    return /^([A-Z]\.)+$/.test(s);
}

function numeralRomano(s) {
    return /^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/.test(s);
}

function capitalize(texto) {
    let prepos = ["da", "do", "das", "dos", "a", "e", "de" ];
    return texto.split(' ') // split the text into words
       .map((palavra) => { // for each word
           if (abreviacao(palavra) || numeralRomano(palavra)) {
                return palavra;
           }

           palavra = palavra.toLowerCase();
           if (prepos.includes(palavra)) {
                return palavra;
           }
           return palavra.charAt(0).toUpperCase() + palavra.slice(1);
       })
       .join(' '); // join the words back together
}

let textos = ["alto dO cruzeiro", "joão paulo II", "N.S. aparecida", "N.S. DAS GRAÇAS"];
textos.forEach(t => console.log(capitalize(t)));

split returns an array containing the words. Next, I use map to replace each word with its "capitalized" equivalent.

If the word is an abbreviation or a Roman numeral, it is returned without modification. If it is one of the prepositions, it is converted to lowercase.

If none of the exceptions apply, the word is lowercased and only its first letter is capitalized. Finally, I use join to combine all the words into a single string, separated by spaces.

The result of the above code is:

Alto do Cruzeiro
João Paulo II
N.S. Aparecida
N.S. das Graças
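
The code above assumes a single space between words. As a minimal sketch of one way to relax that assumption (not part of the original answer), the variant below splits on a regex with a capturing group, so runs of whitespace are kept in the array and restored unchanged. The name capitalizePreservandoEspacos is hypothetical, and the two helper functions are repeated only so the snippet runs on its own.

```javascript
// Same helpers as in the answer, repeated so this snippet is self-contained.
function abreviacao(s) {
    return /^([A-Z]\.)+$/.test(s);
}

function numeralRomano(s) {
    return /^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/.test(s);
}

// Hypothetical variant: because the separator regex has a capturing group,
// split() keeps each whitespace run as its own array element.
function capitalizePreservandoEspacos(texto) {
    const prepos = ["da", "do", "das", "dos", "a", "e", "de"];
    return texto.split(/(\s+)/)
        .map((parte) => {
            // empty strings and whitespace runs are returned untouched
            if (parte === "" || /^\s+$/.test(parte)) {
                return parte;
            }
            if (abreviacao(parte) || numeralRomano(parte)) {
                return parte;
            }
            const palavra = parte.toLowerCase();
            if (prepos.includes(palavra)) {
                return palavra;
            }
            return palavra.charAt(0).toUpperCase() + palavra.slice(1);
        })
        .join(""); // separators are already in the array
}

console.log(capitalizePreservandoEspacos("N.S.   DAS graças")); // N.S.   das Graças
```

Joining with an empty string works here because the whitespace pieces are part of the array, so tabs and repeated spaces come back exactly as they were typed.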

For the abbreviation I used the regex /^([A-Z]\.)+$/: a capital letter followed by a period, which may repeat several times (e.g. "A.", "A.B.", "A.B.C." are all considered abbreviations). The anchors ^ and $ (the beginning and end of the string, respectively) ensure that the word contains nothing else.

But if you only want to check that the first letter is followed by a period, regardless of what comes after it, you can switch to /^[A-Z]\./ (or /^[A-Z]\./i to accept both upper and lower case letters).
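
To make the difference between the three patterns concrete, here is a small comparison. The variable names (estrita, inicio, inicioI) and the sample strings are illustrative assumptions, not part of the original answer.

```javascript
const estrita = /^([A-Z]\.)+$/; // only "capital letter + period" groups, nothing else
const inicio = /^[A-Z]\./;      // capital letter + period at the start, anything after
const inicioI = /^[A-Z]\./i;    // same as above, ignoring case

for (const s of ["N.S.", "S.Paulo", "n.s.", "NS"]) {
    console.log(s, estrita.test(s), inicio.test(s), inicioI.test(s));
}
// N.S. true true true
// S.Paulo false true true
// n.s. false false true
// NS false false false
```

Which one fits depends on the data: the strict version leaves "S.Paulo" to be processed as an ordinary word, while the looser ones return it unchanged.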

For Roman numerals, I took the regex from this answer. In principle you could just check whether the string contains only the characters I, V, X, L, C, D and M, but the regex also checks the quantities and the order in which they appear, rejecting invalid cases such as XM (see here for a few more examples).
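
A few test strings show what this regex accepts and rejects. This is only an illustrative sketch; the constant name romano and the sample inputs are assumptions chosen for the example.

```javascript
const romano = /^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;

console.log(romano.test("II"));   // true
console.log(romano.test("XIV"));  // true
console.log(romano.test("XM"));   // false: invalid order
console.log(romano.test("IIII")); // false: too many repetitions
console.log(romano.test("ii"));   // false: the regex is case-sensitive
console.log(romano.test(""));     // true: every part of the pattern is optional
console.log(romano.test("MIX"));  // true: also a valid numeral (1009)
```

Two consequences are worth keeping in mind: an all-caps word that happens to be a valid numeral, such as "MIX", would be returned unchanged by capitalize, and the empty string also matches, so a caller that cares about that case may want to check for it separately.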

  • 2

    Wow! Who wrote this giant regex? I have never seen one this size. By the way, good answer, as always. :)

  • 2

    @Luizfelipe If you are talking about that one, it was me :-) I confess I only wrote it to find out whether it was possible; I don't think I would use it in production. It also does not work in JavaScript: some things would have to change, and it would still be a monster. Anyway, there are worse regexes than this one: https://answall.com/a/384502/112052, https://stackoverflow.com/a/3845829

  • 1

    The preposition array has a repeated "de".

  • 1

    @Isac Corrected, thank you!

5

To make this easier, you can use the capitalize-pt-br library: https://www.npmjs.com/package/capitalize-pt-br

If you use npm, install it as follows:

npm install --save capitalize-pt-br

If you do not use npm, you can download the library and add it to the project by importing the script manually.

Using the library would look like this:

const capitalize = require('capitalize-pt-br')

capitalize('HELLO WORLD')

This call would return the following:

Hello World

If a word needs to stay lowercase, you can pass it in the second parameter of the function, as follows:

const capitalize = require('capitalize-pt-br')

capitalize('HELLO WORLD', ['world'])

This call would return the following:

Hello world

If you need to keep one word lowercase and another uppercase, you can use the second and third parameters of the function, as follows:

const capitalize = require('capitalize-pt-br')

capitalize('HELLO WORLD', ['world'], ['hello'])

This call would return the following:

HELLO world

  • 1

    Understood, but this does not solve the issue of the periods in abbreviations.

  • 2

    @Marluciopires You can create a file with the following structure https://github.com/ranpa/capitalize-pt-br/blob/master/src/keep-uppercase.js and pass the acronyms you want to keep uppercase. If there is a need to automate the capitalization of such acronyms, the use of REGEX
