Function does not mark repeated words matched by a regex

Question

Asked 5 years, 4 months ago

Viewed 112 times

4

I got some string:

"novamente mais brevemente uma vez claro, claramente demente, igualmente. novamente."

I run a JavaScript regex to get the words ending in "mente":

var target = $("#content").text();
var exp = /\w+mente/g; // regex ok
var resultado = null;

while (resultado = exp.exec(target)) {
   marcarTexto_adverbio(resultado); // function that wraps the match in a <mark> tag
};

The variable resultado receives all the correct values, including the duplicated "novamente" at the end.

But the function that inserts the <mark> tag does not mark the second "novamente", i.e. it does not mark repeated words:

function marcarTexto_adverbio(target) {
    $("#content").html(function (_, html) {
        return html.replace(target, '<mark>' + target + '</mark>')
    });
}

Is there a logic error in this function? (The loop does produce every match in resultado, with all the words the regex finds.)

2 answers

2

First, your regex does not match only words that end in "mente"; it matches any run of at least one word character followed by "mente", even in the middle of a longer word:

let s = 'seus dementes, a mente engana frequentemente, plante sementes';
console.log(s.match(/\w+mente/g)); // [ "demente", "frequentemente", "semente" ]

Note in the example above that, although the text contains the words "dementes" and "sementes", the result contains "demente" and "semente". That is, your replacement would produce <mark>demente</mark>s and <mark>semente</mark>s, which is not quite what you need.

This happens because the regex matches \w+ (one or more letters, digits or _) followed by "mente", but nothing in it guarantees that no other letter comes after.

To avoid this and match only the words that actually end in "mente", use the shortcut \b, which indicates a word boundary: a position with an alphanumeric character before it and a non-alphanumeric character after it, or vice versa:

let s = 'seus dementes, a mente engana frequentemente, plante sementes';
console.log(s.match(/\b\w+mente\b/g)); // [ "frequentemente" ]

The word "mente" on its own is also not matched, because \w+ requires at least one character before "mente". If you also want to match the word "mente", change it to \w*.
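A small sketch of the difference between the two quantifiers, reusing the sample string from above (the variable names are only illustrative):

```javascript
const frase = 'seus dementes, a mente engana frequentemente, plante sementes';

// \w+ requires at least one word character before "mente"
const withPlus = frase.match(/\b\w+mente\b/g);
// \w* also accepts zero characters, so the bare word "mente" matches too
const withStar = frase.match(/\b\w*mente\b/g);

console.log(withPlus); // [ 'frequentemente' ]
console.log(withStar); // [ 'mente', 'frequentemente' ]
```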

Another detail is that you are rewriting the element's entire HTML. Although this works in many cases, it will not always do what you expect, because HTML is much more complex than a regex is able to handle.

I took the liberty of adapting the other answer to illustrate some problems that may occur:

function marcarTexto_adverbio(target) {
    // show the resulting HTML in the console
    $("#content").html(function (_, html) {
        let novoHTML = html.replace(new RegExp(target, "g"), '<mark>' + target + '</mark>');
        console.log(novoHTML);
        return novoHTML;
    });
}

function teste() {
  let target = $("#content").text();
  let exp = /\b\w+mente\b/g;
  let resultado = null;

  let palavrasReplace = new Map();
  while (resultado = exp.exec(target)) {
    const palavra = resultado[0];
    if (!palavrasReplace.has(palavra)) {
      marcarTexto_adverbio(palavra); // function that wraps the word in a <mark> tag
      palavrasReplace.set(palavra, true);
    }
  };
}

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<p id="content">novamente
 <a href="www.novamente.com">link</a>
 <img src="novamente.gif" alt="mostra novamente a imagem">
 <span>tem comentário aqui<!-- novamente --></span></p>

<button onclick="teste()">Mark</button>

I changed the function marcarTexto_adverbio to display the final HTML in the console. Note that the output was:

<mark>novamente</mark>
 <a href="www.<mark>novamente</mark>.com">link</a>
 <img src="<mark>novamente</mark>.gif" alt="mostra <mark>novamente</mark> a imagem">
 <span>tem comentário aqui<!-- <mark>novamente</mark> --></span>

That is, the href of the link, the src and the alt of the image, and even the text inside the comment all had their contents unduly altered.

Using regex this way, without regard for the element's HTML structure, can lead to catastrophic results. The regex only works if the element contains nothing but plain text (or if no word that occurs in the text also occurs inside HTML attributes, or anywhere else outside a text node's content).

The solution is a little more complicated, because every text node has to be broken into several nodes, some of which will be mark elements while others remain text nodes. For example, the text "Aconteceu novamente hoje" ("It happened again today"), which in HTML is a single text node, has to be broken into 3 nodes: two text nodes for "Aconteceu " and " hoje", and a mark element for "novamente". And if there are other tags inside the element, the same function must be called recursively so that it also handles the inner elements.

It would look something like this:

function markWords(element) {
    let e = document.createElement('div');
    // Iterate over a snapshot: childNodes is a live list, and moving a child
    // into `e` while iterating it directly would skip the following sibling.
    for (let child of Array.from(element.childNodes)) {
        if (child.nodeType == Node.TEXT_NODE) {
            // split with a capturing group keeps the matched words in the array
            child.nodeValue.split(/(\b\w+mente\b)/g).forEach(s => {
                if (! /^\w+mente$/.test(s)) {
                    e.appendChild(document.createTextNode(s));
                } else {
                    let novo = document.createElement('mark');
                    novo.appendChild(document.createTextNode(s));
                    e.appendChild(novo);
                }
            });
        } else e.appendChild(markWords(child)); // recurse into nested nodes
    }
    element.innerHTML = e.innerHTML;
    return element;
}

function teste() {
    markWords(document.querySelector('#content'));
    // only to show the generated HTML; you can delete this when using it on your page
    console.log(document.querySelector('#content').innerHTML);
}

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<p id="content">novamente pois novamente o
 <a href="www.novamente.com">link</a>
 <img src="novamente.gif" alt="mostra novamente a imagem">
 <span>tem comentário aqui<!-- novamente --></span>
 <span>antigamente, sementes, demente, novamente</span> <span>e novamente</span> fim.</p>

<button onclick="teste()">Mark</button>

The result is the correct HTML, with only the matching words changed (HTML comments and attributes are correctly preserved):

<mark>novamente</mark> pois <mark>novamente</mark> o
 <a href="www.novamente.com">link</a>
 <img src="novamente.gif" alt="mostra novamente a imagem">
 <span>tem comentário aqui<!-- novamente --></span>
 <span><mark>antigamente</mark>, sementes, <mark>demente</mark>, <mark>novamente</mark></span> <span>e <mark>novamente</mark></span> fim.
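To see the splitting step in isolation, here is a minimal sketch that works on a plain string instead of DOM nodes. The helper name splitIntoParts and the { type, value } shape are hypothetical, chosen only for illustration; each entry corresponds to one node that the text node would be broken into.

```javascript
// Hypothetical helper: describes the nodes that one text node would become.
function splitIntoParts(text) {
  return text
    .split(/(\b\w+mente\b)/g)      // the capturing group keeps the matches
    .filter(part => part !== '')   // drop empty strings produced at the edges
    .map(part => ({
      type: /^\w+mente$/.test(part) ? 'mark' : 'text',
      value: part,
    }));
}

const parts = splitIntoParts('Aconteceu novamente hoje');
console.log(parts);
// [ { type: 'text', value: 'Aconteceu ' },
//   { type: 'mark', value: 'novamente' },
//   { type: 'text', value: ' hoje' } ]
```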

One last detail: the shortcut \w matches letters, digits and the character _. If you want to consider only letters (including accented ones), see some options here.
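One possible option, sketched here as an assumption rather than as the approach the answer had in mind, is to use Unicode property escapes (\p{L} with the u flag) together with lookarounds in place of \b. This needs an engine with ES2018 support, and the sample string is made up for illustration:

```javascript
const sample = 'cortêsmente 123mente x_mente finalmente';

// ASCII-only \w: "ê" is not a word character, so the match starts after it,
// and digits and underscores are accepted as part of the word
const asciiMatches = sample.match(/\b\w+mente\b/g);

// Letters only (accents included), with no letter right before or after
const letterOnly = /(?<!\p{L})\p{L}+mente(?!\p{L})/gu;
const letterMatches = sample.match(letterOnly);

console.log(asciiMatches);  // [ 'smente', '123mente', 'x_mente', 'finalmente' ]
console.log(letterMatches); // [ 'cortêsmente', 'finalmente' ]
```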

I created a text editor in a div with contentEditable = true, so I won't have tags inside it, only plain text. I even took care to strip characters and tags when the user pastes with Ctrl+V, using clipboardData.getData('text/plain'). And I only run the regex in this div. Of course, the mark tags are inserted/deleted with JS, followed by normalize to fix some textNode behaviors that came up.

– Luke Negreiros

2020/04/11 at 00:11
@Lukenegreiros All right. Anyway, the answers on [pt.so] should also serve anyone who visits the site in the future, so I thought it was worth showing a more general case, because regex + HTML can be a dangerous combination if the scenario is not restricted (as yours is).

– hkotsubo

2020/04/11 at 00:48

by Daniel Mendes • **6,211** points · Answer 1 · 2020-04-10T03:01:23+00:00

The problem is that replace, the way it is being used, performs only a single substitution: it replaces the first occurrence it finds and stops. Since you have two identical words, it replaces exactly the same occurrence twice.

If you inspect the HTML, you will see that the first "novamente" ends up inside two mark tags.
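The effect can be reproduced on a plain string, without the DOM. This is only a sketch with a shortened sample text:

```javascript
let html = 'novamente e novamente';

// The regex loop finds "novamente" twice, so replace is called twice,
// but a string pattern only ever replaces the first occurrence
for (const palavra of ['novamente', 'novamente']) {
  html = html.replace(palavra, '<mark>' + palavra + '</mark>');
}

console.log(html); // <mark><mark>novamente</mark></mark> e novamente
```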

You can use replace with a regex and the g flag; that way the substitution is applied to every occurrence found.

Your function marcarTexto_adverbio would look more or less like this:

function marcarTexto_adverbio(target) {
    $("#content").html(function (_, html) {
        return html.replace(new RegExp(target, "g"), '<mark>' + target + '</mark>')
    });
}

This creates the mark tags around every occurrence, but since there are repeated words, some may end up with two mark tags or even more.
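A short sketch of that double tagging on a plain string (wrapAll is a hypothetical stand-in for the jQuery-based function):

```javascript
// Hypothetical stand-in for marcarTexto_adverbio, working on a string
const wrapAll = (html, palavra) =>
  html.replace(new RegExp(palavra, 'g'), '<mark>' + palavra + '</mark>');

let marked = 'novamente e novamente';
marked = wrapAll(marked, 'novamente'); // called for the first match
marked = wrapAll(marked, 'novamente'); // called again for the repeated match

console.log(marked);
// <mark><mark>novamente</mark></mark> e <mark><mark>novamente</mark></mark>
```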

To fix this, we can keep the words that have already been replaced in a Map. See an example:

var palavrasReplace = new Map();

while (resultado = exp.exec(target)) {
  const palavra = resultado[0];

  if (!palavrasReplace.has(palavra)) {
    marcarTexto_adverbio(palavra); // function that wraps the word in a <mark> tag
    palavrasReplace.set(palavra, true);
  }
};

See the full example:

function marcarTexto_adverbio(target) {
    $("#content").html(function (_, html) {
        return html.replace(new RegExp(target, "g"), '<mark>' + target + '</mark>')
    });
}

function teste() {
  var target = $("#content").text();
  var exp = /\w+mente/g; // regex ok
  var resultado = null;

  var palavrasReplace = new Map();

  while (resultado = exp.exec(target)) {
    const palavra = resultado[0];

    if (!palavrasReplace.has(palavra)) {
      marcarTexto_adverbio(palavra); // function that wraps the word in a <mark> tag
      palavrasReplace.set(palavra, true);
    }
  };
}

<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<p id="content">novamente mais brevemente uma vez claro, claramente demente, igualmente. novamente.</p>

<button onclick="teste()">Mark</button>
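As an alternative sketch, not part of the approach above, a single replace call with a global regex and the $& replacement pattern marks every occurrence exactly once, so neither the exec loop nor the Map is needed. The helper name markAdverbs is hypothetical, and the sketch assumes the input is plain text with no tags, attributes or characters that need HTML escaping:

```javascript
// Hypothetical helper, assuming `text` is plain text without any markup
function markAdverbs(text) {
  // $& in the replacement string stands for the matched substring
  return text.replace(/\b\w+mente\b/g, '<mark>$&</mark>');
}

const input = 'novamente mais brevemente uma vez claro, claramente demente, igualmente. novamente.';
const output = markAdverbs(input);
console.log(output);
```

Because the whole text is scanned in one pass, a word that was already wrapped is never visited again.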

Documentation:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map