Splitting text with a regular expression in JavaScript

0

I’m trying to split a string using a regular expression in JavaScript, but I’m not getting the expected result.

I have the following string:

var texto = "texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final";

I need to split this text into an array that looks like this:

0: texto inicial
1: <div>Texto dentro da DIV</div>
2: Texto fora da DIV
3: <p>Texto dentro do P</p>
4: texto final

This is the code I’m trying:

var regex = new RegExp(".*[(<.*>.*<\/.*>)].*", "g");
var blocosTexto = texto.match(regex);

console.log(blocosTexto);
  • 1

    Do not use regex to parse HTML. Please read "Parsing Html The Cthulhu Way". If you do not read English, click with the left mouse button on the page and translate it to Portuguese (the same goes for the links suggested in that article).

  • 3

    By placing something between square brackets, you define a character class, so [(<.*>.*<\/.*>)] means "the character (, or <, or ., or *, etc." (only one of them) - see here. Anyway, regex is not the best way, as already said. It may even "work" for simple cases, but as soon as the HTML gets a little more complex, the regex starts turning into a "monster".
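To illustrate the comment above about square brackets, here is a minimal sketch that can be run in a browser console or in Node.js. The names `charClass`, `original`, and `matches` are only illustrative; the pattern and the string are the ones from the question.

```javascript
// Everything inside [...] is a set of single characters, not a sub-pattern.
const charClass = /[(<.*>)]/;

console.log(charClass.test("<"));   // true: "<" is one of the listed characters
console.log(charClass.test("*"));   // true: "*" is literal inside a class
console.log(charClass.test("div")); // false: none of the listed characters occur

// The pattern from the question: greedy ".*" on both sides of a single
// character taken from the class, so it swallows the whole line at once.
const texto = "texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final";
const original = new RegExp(".*[(<.*>.*<\/.*>)].*", "g");
const matches = texto.match(original);

console.log(matches.length);       // 1
console.log(matches[0] === texto); // true: the single match is the entire string
```

This is why the attempt in the question returns one big block instead of five pieces.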

1 answer

5

Depending on the complexity of the HTML tags contained in your string, doing this with regex can quickly stop being trivial. There is also near consensus that regular expressions should not be used to parse strings containing HTML.

It may sound absurd, but if you are working with HTML, using an HTML parser might not be a bad idea. Here is an example using the DOMParser API, available in browsers:

const htmlStr = 'texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final';

const parser = new DOMParser();
const doc = parser.parseFromString(htmlStr, 'text/html');

const arr = Array.from(doc.body.childNodes).map((node) => {
  // Text nodes have no outerHTML, so read their textContent instead.
  const text = node.nodeType === Node.TEXT_NODE
    ? node.textContent
    : node.outerHTML;

  return text.trim();
});

console.log(arr);

If you’re in an environment that doesn’t natively support DOMParser (such as Node.js), you can use a package that provides it, such as jsdom.

Using a parser like this will, in most cases (especially the more complex ones), be better than dealing with regular expressions, which may not be fully suited to the task. The advantage is that you have a much more robust API to build on as the complexity of the HTML in the string grows.

  • Hello @Luiz Felipe, thank you for the quick reply. I am working on a text editor, and I need control over the text content to break the page correctly and show a preview to the user. There may be another way to handle this, but this is the one I found.

  • 1

    DOMParser (and alternatives such as jsdom) are good for solving this kind of problem. However, if performance is a concern, they may not be ideal (although regex may not be either; run proper tests to check). In that case, another option is to write a small, simple parser aimed exclusively at your requirements.

  • 3

    You can use the document's own DOM to do that parsing. For example: https://jsfiddle.net/u4c3phdt/

  • True, @bfavaretto! :) May I steal the idea of using nodeType instead of instanceof? The latter seems less performant to me, and I had forgotten about nodeType... :)

  • 1

    Yes, you can, no problem
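One of the comments above suggests writing a small parser aimed only at your own requirements. A minimal sketch of that idea follows. The function `splitTopLevel` and the variable `sample` are hypothetical names, not part of any library. The sketch assumes well-formed markup with no void or self-closing tags (`<br>`, `<img>`), no HTML comments, and no `>` inside attribute values. It uses a regex only to locate individual tags and tracks nesting with a counter.

```javascript
// Hypothetical helper: splits a string into top-level text runs and elements.
// Assumptions: well-formed markup, no void/self-closing tags, no comments,
// and no ">" inside attribute values.
function splitTopLevel(html) {
  const blocks = [];
  const tagPattern = /<(\/?)([a-zA-Z][\w-]*)[^>]*>/g;
  let depth = 0;      // number of elements currently open
  let blockStart = 0; // index where the current block began
  let match;

  // Store a trimmed slice, skipping whitespace-only pieces.
  const push = (from, to) => {
    const piece = html.slice(from, to).trim();
    if (piece) blocks.push(piece);
  };

  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    if (!isClosing) {
      if (depth === 0) {
        // Text before a top-level opening tag becomes its own block.
        push(blockStart, match.index);
        blockStart = match.index;
      }
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) {
        // The top-level element just closed: emit it whole.
        push(blockStart, tagPattern.lastIndex);
        blockStart = tagPattern.lastIndex;
      }
    }
  }

  push(blockStart, html.length); // trailing text, if any
  return blocks;
}

const sample = "texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final";
console.log(splitTopLevel(sample));
```

For the string from the question this yields the five blocks listed there. Anything outside the stated assumptions (attributes containing `>`, void elements, comments) is better left to DOMParser or jsdom.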
