What is the right way to search for and replace words in HTML?

I am building a small text editor in the browser. I am using an editable div with the following content:

<div id="content" class="content_editor" contenteditable="true" onpaste="limpar_formatacao()">
<p>
O meu texto tema a palavra nascer e nascer e também nasceram.
</p> 
<p>
Lorem Ipsum não é simplesmente um texto randômico mas tem nascer no meio.
</p>
</div>

I need a function that corrects the text by wrapping every word repeated within a sentence in a <mark> tag. The result should look like this (for example, for all occurrences of the word 'nascer', "to be born"):

<p>
O meu texto tema a palavra <mark>nascer</mark> e <mark>nascer</mark> e também nasceram.
</p>
<p>
Lorem Ipsum não é simplesmente um texto randômico mas tem nascer no meio. Uma última frase.
</p>

Considerations:

  • Note that a sentence in which the word is not repeated (i.e., it appears only once) is not marked!
  • I can't simply go by <p>, because the text is split by sentence, not by paragraph.

So I wrote a function that:

  • gets all the text
  • creates an array of sentences:
var textEditor = $("#content").text();
var frases = textEditor.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
  • initializes an array of unique, non-repeated words
  • splits the sentence into words
  • loops over each word
    • if it is longer than 3 letters
    • checks whether it already exists in the array
      • if not: adds it
      • if yes: marks both occurrences of the word
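
The steps above can be sketched as a pure function that works on one sentence at a time and never touches the DOM. This is only a minimal sketch: the name findRepeatedWords and the minLength parameter are hypothetical, and it assumes that a "word" is a run of Unicode letters.

```javascript
// Hypothetical helper: returns the words longer than `minLength` letters
// that occur more than once in a single sentence.
function findRepeatedWords(sentence, minLength = 3) {
  // \p{L}+ matches runs of Unicode letters, so accented words stay whole
  // and punctuation is dropped without a separate clean-up step.
  const words = sentence.toLowerCase().match(/\p{L}+/gu) || [];
  const seen = new Set();
  const repeated = new Set();
  for (const word of words) {
    if (word.length <= minLength) continue; // skip short words
    if (seen.has(word)) repeated.add(word); // second or later occurrence
    else seen.add(word);
  }
  return [...repeated];
}

console.log(
  findRepeatedWords("O meu texto tema a palavra nascer e nascer e também nasceram.")
);
// logs [ 'nascer' ]
```

Collecting the repeated words first and marking them afterwards keeps the detection separate from the HTML rewriting.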

It isn't elegant, but the whole function looks like this:

    // ------- repeated-word algorithm (repeticao_palavra) ------- //
    var textEditor = $("#content").text();
    // split into sentences
    var frases = textEditor.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);

    frases.forEach((frase) => {
        frase = frase.trimStart(); // clean up the start of each sentence
        frase = cleanPunctuation(frase); // strip punctuation so it does not interfere with the comparison

        var palavras_unicas = []; // array of unique, non-repeated words
        var palavras = frase.split(' '); // split into words

        // loop over each word:
        palavras.forEach((palavra) => {
            palavra = palavra.toLowerCase();

            if (palavra.length > 3) { // if longer than 3 letters
                // check whether it already exists in the array
                if (palavras_unicas.indexOf(palavra) === -1) { // if not: add it
                    palavras_unicas.push(palavra);
                }
                else {
                    frase = marcar_palavra_repetida(frase, palavra); // if yes: wrap it in <mark>
                }
            }
        });
    });


The function marcar_palavra_repetida() ended up like this:

function marcar_palavra_repetida(frase, palavra) {

    let str = document.getElementById("content").innerHTML;
    var fraseHTML = frase.replaceAll(palavra, '<mark>' + palavra + '</mark>');

    document.getElementById("content").innerHTML = str.replace(frase, fraseHTML);

    return str.replace(frase, fraseHTML); // this return creates a recursion that marks more than one repeated word in the same sentence
}

What is the best way to find a word in HTML and wrap it in <mark> tags?

  • I have already been advised against using regex, so I am reading up on it and considering the alternatives, even though the sentence splitting is done with regex.
  • The function marks correctly with the <mark> tag, but it catches every occurrence that contains 'nascer' (as an example), including derived forms and plurals.

It produces this:

<p>
Uma frase com a palavra <mark>nascer</mark> e <mark>nascer</mark> de novo, mas pega também <mark>nascer</mark>am e <mark>nascer</mark>ão na mesma frase.
</p>
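
One way to avoid marking 'nasceram' and 'nascerão' is to replace only whole-word matches. The sketch below is a hedged illustration, not the definitive fix: markWholeWord is a hypothetical name, and it assumes a word boundary means "not adjacent to a Unicode letter, digit or underscore". It uses lookarounds instead of \b because \b in JavaScript is based on \w, which covers only ASCII letters, digits and underscore.

```javascript
// Hypothetical helper: wraps only whole-word occurrences of `word` in <mark>.
function markWholeWord(text, word) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Boundary: not preceded or followed by a letter, digit or underscore.
  // "i" also matches capitalized occurrences; "$&" keeps their original case.
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}_])${escaped}(?![\\p{L}\\p{N}_])`,
    "giu"
  );
  return text.replace(pattern, "<mark>$&</mark>");
}

console.log(markWholeWord("nascer e nascer, mas também nasceram e nascerão", "nascer"));
// <mark>nascer</mark> e <mark>nascer</mark>, mas também nasceram e nascerão

// For comparison, \b finds no boundary before an accented first letter:
console.log(/\bágua\b/.test("a água")); // false
```

This sketch is meant for plain text. Applied directly to innerHTML, it could also match a word inside a tag name or attribute value, so it is safer to run it on the text of individual text nodes.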
  • When I asked the question, I thought it contained a minimal, verifiable and logical example of what I'm doing. I have read the usage policy and the how-to-ask guide several times, and I still don't manage to ask questions properly for the Stack Overflow community.

  • Yes, of course! I edited the question and tried to state things more precisely. And yes, the mark tag has classes added and removed, and even gets an href...

  • Couldn't you adapt the solution from here? Just include one more check: if the match returns more than one occurrence, you do the replacement (sorry, I didn't stop to look at the question in more detail, but at a quick glance it seems to be almost the same problem).

  • Would it be this: https://jsfiddle.net/9j3x7fug/1/ ? Of course, you still need to split one p into several sentences, but the basic idea is this (check whether the word occurs 2 or more times in the sentence).

  • @Lukenegreiros, I think I understand the question now. The regular expression is applied to the text within each paragraph; I had understood that a regex would be applied to the whole div. Before I publish an answer, check whether this code solves the problem: https://jsfiddle.net/ez9yj1ok/ cc @hkotsubo

  • @hkotsubo, something strange happens when you click several times in your solution published on jsfiddle.

  • @Augusto Vasques the behavior is the same. I tried to use a regex built from a variable, but it did not respect the word boundaries. Anyway... I'll run some tests later and read the code more carefully. Publish the answer!

  • Now see this variation of the text with your solution applied: https://jsfiddle.net/sqfm1nda/ Note the word 'Exército' (Army): it also catches the <br>. I still need to work on that code.

  • The problem is that \b does not account for accented characters; it would have to be adjusted as follows: https://jsfiddle.net/4x5hjmoa/1/ As for the <br>, the link I sent earlier handles that case, because it keeps the child tags. So it would be a matter of combining the solutions. It is also unclear where a sentence ends. If there are two different sentences in the same p, does that count or not? E.g. <p>Viu tudo? Quase tudo.</p> has 2 sentences, so is "tudo" duplicated or not?

  • 'tudo' would not be duplicated, because duplication is counted per sentence, not per paragraph. I got a satisfactory result, but as you said, \b does not handle accented characters. I did not know that. Ugh, what a pain. This regex, new RegExp(`\b${word}\b`, 'g'), does not match the word 'água' (water), for example, but it does match 'exército' (army), which has its accent in the middle.

  • Well, an alternative would be texto.split(/(?<=[.?!])/) to break the text into sentences, and in the replace you would have to check that the word is not already inside <mark> before replacing it (something like /(?<!<mark>)palavra(?!<mark>)/), because I think that is the problem with my link. Still, I think it is better to adapt the first link I sent, which preserves the structure of the child tags (the problem is combining this with the sentence splitting, since the same sentence can span several tags and the duplicated word can appear inside the inner tags, i.e., it is very complicated).
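
The sentence-splitting idea from the last comments can be tried on plain text as follows. This is a rough sketch under stated assumptions: splitSentences and countWord are hypothetical names, and every '.', '?' or '!' is treated as the end of a sentence, so abbreviations and ellipses would be split as well.

```javascript
// Split after ., ? or ! while keeping the punctuation with its sentence.
function splitSentences(text) {
  return text
    .split(/(?<=[.?!])/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Count whole-word occurrences of `word` in one sentence (Unicode letters).
function countWord(sentence, word) {
  const words = sentence.toLowerCase().match(/\p{L}+/gu) || [];
  return words.filter((w) => w === word.toLowerCase()).length;
}

const sentences = splitSentences("Viu tudo? Quase tudo.");
console.log(sentences); // [ 'Viu tudo?', 'Quase tudo.' ]
console.log(sentences.map((s) => countWord(s, "tudo"))); // [ 1, 1 ]
```

Because 'tudo' occurs once in each sentence, it is not a duplicate under the per-sentence rule, even though both sentences share the same p.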

No answers
