Splitting text with a regular expression in JavaScript

Question

Asked 4 years, 1 month ago

Viewed 108 times

I’m trying to split a string using a regular expression in JavaScript, but I’m not getting the expected result.

I have the following string:

var texto = "texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final";

I need to split this text into an array that looks like this:

0: texto inicial
1: <div>Texto dentro da DIV</div>
2: Texto fora da DIV
3: <p>Texto dentro do P</p>
4: texto final

This is the code I’m trying:

var regex = new RegExp(".*[(<.*>.*<\/.*>)].*", "g");
var blocosTexto = texto.match(regex);

console.log(blocosTexto);

Do not use regex to parse HTML. Please read Parsing Html The Cthulhu Way. If you cannot read English, click with the left mouse button on the page and translate it to Portuguese (the same goes for the links suggested by that article).

– Augusto Vasques

2020/12/03 at 17:51

By placing something between brackets, you are defining a character class, so [(<.*>.*<\/.*>)] means "the character (, or <, or ., or *, etc." (only one of them) - see here. Anyway, regex is not the best approach, as already said. It may even "work" for simple cases, but as soon as the HTML gets a little more complicated, the regex starts turning into a "monster".

– hkotsubo

2020/12/03 at 17:56
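
To illustrate the comment above, here is a minimal sketch and not a recommended solution. It first shows that a bracket expression matches a single character, then tries a pattern that happens to work for the exact string in the question. The names `simplePattern` and `splitSimpleHtml` are hypothetical, and the pattern assumes flat, non-nested tags on a single line.

```javascript
// Inside [...], characters such as ( . * ) are literals, and the class
// matches exactly ONE character from the set.
const charClass = /[(<.*>)]/g;
const classMatches = '<div>'.match(charClass);
console.log(classMatches); // [ '<', '>' ]

// Hypothetical pattern for the simple case only: either an element whose
// closing tag repeats the opening tag name (backreference \1), or a run of
// characters that contains no "<".
const simplePattern = /<(\w+)[^>]*>.*?<\/\1>|[^<]+/g;

function splitSimpleHtml(str) {
  return (str.match(simplePattern) || [])
    .map((part) => part.trim())
    .filter((part) => part.length > 0); // drop whitespace-only pieces
}

const texto = "texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final";
const blocos = splitSimpleHtml(texto);
console.log(blocos);
// [ 'texto inicial', '<div>Texto dentro da DIV</div>', 'Texto fora da DIV',
//   '<p>Texto dentro do P</p>', 'texto final' ]

// Nesting the same tag is already enough to break it: the lazy .*? stops at
// the first </div>, and the leftover closing tag is split incorrectly.
const nested = splitSimpleHtml('<div>a<div>b</div>c</div>');
console.log(nested); // [ '<div>a<div>b</div>', 'c', '/div>' ]
```

The nested case is where the "monster" starts: handling it with regex alone would require tracking tag depth, which is what an HTML parser already does.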

1 answer

by Luiz Felipe • **32,886** points · Answer 1 · 2020-12-03T17:42:34+00:00

Depending on the complexity of the HTML tags in your string, doing this with regex may stop being trivial. There is also near consensus that regular expressions should not be used to parse strings containing HTML.

It may sound absurd, but if you are working with HTML, using an HTML parser might not be a bad idea. Here is an example using the DOMParser API, available in browsers:

const htmlStr = 'texto inicial <div>Texto dentro da DIV</div> Texto fora da DIV <p>Texto dentro do P</p> texto final';

const parser = new DOMParser();
const doc = parser.parseFromString(htmlStr, 'text/html');

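// Walk the top-level nodes of <body>: text nodes contribute their text,
// element nodes contribute their full markup (outerHTML).
const arr = Array.from(doc.body.childNodes).map((node) => {
  const text = node.nodeType === Node.TEXT_NODE
    ? node.textContent
    : node.outerHTML;

  // Remove the whitespace left around each piece.
  return text.trim();
});

console.log(arr);
// ["texto inicial", "<div>Texto dentro da DIV</div>", "Texto fora da DIV",
//  "<p>Texto dentro do P</p>", "texto final"]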

If you’re in an environment that doesn’t natively support DOMParser (such as Node.js), you can use a package that provides it, such as jsdom.
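
One way to keep the logic independent of the environment is to write it against a DOM-like `body` node and let the caller decide where that node comes from. The helper below is a sketch under that assumption. The name `splitTopLevelNodes` is hypothetical and not part of any library. The jsdom lines are shown only as comments because they require installing the `jsdom` package.

```javascript
// Numeric node type value defined by the DOM standard (Node.ELEMENT_NODE).
// Using the number avoids depending on a global `Node`, which Node.js lacks.
const ELEMENT_NODE = 1;

// Hypothetical helper: returns the top-level pieces of a parsed <body>.
// Elements contribute their outerHTML; any other node (text, comment)
// contributes its textContent. Whitespace-only pieces are dropped.
function splitTopLevelNodes(body) {
  return Array.from(body.childNodes)
    .map((node) => {
      const raw = node.nodeType === ELEMENT_NODE
        ? node.outerHTML
        : (node.textContent || '');
      return raw.trim();
    })
    .filter((piece) => piece.length > 0);
}

// Browser usage (DOMParser is built in):
//   const doc = new DOMParser().parseFromString(htmlStr, 'text/html');
//   console.log(splitTopLevelNodes(doc.body));
//
// Node.js usage, assuming the jsdom package is installed:
//   const { JSDOM } = require('jsdom');
//   const dom = new JSDOM(htmlStr);
//   console.log(splitTopLevelNodes(dom.window.document.body));
```

Unlike the snippet above, this version also filters out empty strings, which is a design choice for this sketch and not something the original answer does.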

Using a parser like this will, in most cases (especially the more complex ones), be better than dealing with regular expressions, which may not be fully suited to the task. The advantage is that you get a much more robust API to build on as the complexity of the HTML in the string grows.