How to select a full xml/html tag with Regular Expression even if there are identical tags internally?

I am trying to process a JavaScript string with a regular expression (regex):

Given the input um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>., I would like to capture each complete <b> tag, with all its contents up to its matching closing </b>. The expected result is <b>negrito<b>negrito interno</b>externo</b> and <b>negrito</b>.

But I am failing to account for the fact that a tag can contain the same tag inside it. The closest I have come is the result below, which does not handle an identical inner tag: the first match is <b>negrito<b>negrito interno</b> instead of <b>negrito<b>negrito interno</b>externo</b>:

var entrada = 'um <b data-remove>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var regex = /<(b)>.*?<\/\1>/g;

// clear the DOM before printing
document.body.innerHTML = "";

entrada.replace(regex, function(match) {
  console.log(match);
  // print to the DOM
  document.body.appendChild(document.createTextNode(match));
  document.body.appendChild(document.createElement("br"));
  return match;
});
body {
  white-space: pre;
  font-family: monospace;
}

My regex knowledge is limited and has practically hit its ceiling here. So I await a precious hint from some regex expert, or a "Forget it, it's not possible with regex =(".

Edit 2 Expected solution:

What I am looking for, and do not know how to write, is something that counts the opening tags it encounters and skips closing tags until it reaches the one that matches the first opening tag.
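The counting idea described above could be sketched roughly as follows. This is only an illustration, not code from the original post: the helper name findOutermostTags is hypothetical, and it assumes tagName is a plain tag name (no regex metacharacters) and that attribute values never contain a ">" character. A regex is used only to locate individual opening and closing tags; the nesting itself is tracked with a counter.

```javascript
// Hypothetical helper (not from the original post): returns every outermost
// <tagName>...</tagName> span found in `input`, counting nested openings.
// Assumes `tagName` is a plain tag name and attributes contain no ">".
function findOutermostTags(input, tagName) {
  // Matches one opening tag (optionally with attributes) or one closing tag.
  // Group 1 is "/" for a closing tag and "" for an opening tag.
  var tagPattern = new RegExp('<(/?)' + tagName + '(?:\\s[^>]*)?>', 'gi');
  var results = [];
  var depth = 0;   // how many openings are currently unclosed
  var start = -1;  // index where the current outermost tag begins
  var m;

  while ((m = tagPattern.exec(input)) !== null) {
    if (m[1] === '') {
      // Opening tag: remember the position only if it is the outermost one.
      if (depth === 0) {
        start = m.index;
      }
      depth++;
    } else if (depth > 0) {
      // Closing tag: it only completes a match when the counter returns to 0.
      depth--;
      if (depth === 0) {
        results.push(input.slice(start, tagPattern.lastIndex));
      }
    }
    // A closing tag seen while depth is 0 is stray and simply ignored.
  }
  return results;
}

var entrada = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
var found = findOutermostTags(entrada, 'b');
console.log(found);
```

With the sample input from the question, found should hold the two expected strings, <b>negrito<b>negrito interno</b>externo</b> and <b>negrito</b>. Malformed markup (unclosed tags, ">" inside attribute values) is outside what this sketch handles.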

If anything is unclear, leave a comment!

Edit 1: My real case for better understanding of the problem:

This real example is only meant to show the context in which I use the function, and why I cannot do this with jQuery or any other parser on the browser's DOM: I need to leave the DOM intact so that the CSS is applied correctly. Only after converting the CSS to inline styles can I remove what was there solely for the browser to render correctly, and then obtain my expected template.

$(function() {
  $('#btnGenerateHtmlMail').click(function(ev) {
    var $report = $('#report');
    convertCssToInlineStyle($report);
    var reportHtml = $report.html();
    reportHtml = reportHtml
      /* remove class attribute */
      .replace(/class=('|").*?\1/g, "")
      /* remove id attribute */
      .replace(/id=('|").*?\1/g, "")
      /* remove comments html */
      .replace(/<!--.*?-->/g, "")
      /* remove tab, enter and whitespace */
      .replace(/\s\s+/g, ' ')
// ----->>>   // this is my problem case; in this example it causes no problem because there are no identical tags inside the tr, but I know it would be a bug, and I want to fix it to make the tool generic
      .replace(/<(tr) data-remove="true".*?>.*?<\/\1>/g, function replacer(match) {
        console.log(match);
        return match.match(/{{.*?}}/g);
      });
    $('#result').text(reportHtml);
  });
});


/* Methods irrelevant to the problem */

function getCssDeclared($elem) {
  var sheets = document.styleSheets,
    o = {};
  for (var i in sheets) {
    var rules = sheets[i].rules || sheets[i].cssRules;
    for (var r in rules) {
      if ($elem.is(rules[r].selectorText)) {
        o = $.extend(o, css2json(rules[r].style), css2json($elem.attr('style')));
      }
    }
  }
  return o;
}

function css2json(css) {
  var s = {};
  if (!css)
    return s;
  if (css instanceof CSSStyleDeclaration) {
    for (var i in css) {
      if ((css[i]).toLowerCase) {
        s[(css[i]).toLowerCase()] = (css[css[i]]);
      }
    }
  } else if (typeof css == "string") {
    css = css.split("; ");
    for (var i in css) {
      var l = css[i].split(": ");
      s[l[0].toLowerCase()] = (l[1]);
    }
  }
  return s;
}

function convertCssToInlineStyle($root) {
  $root.each(function() {
    var $item = $(this);

    var style = getCssDeclared($item);
    $item.css(style);

    // recurse into the children
    convertCssToInlineStyle($item.children());
  });
}
table {
	border-collapse: collapse;
	border-spacing: 0;
	-webkit-box-sizing: border-box;
	-moz-box-sizing: border-box;
	box-sizing: border-box;
	width: 100%;
}

table td, table th {
	padding: 8px;
	padding-top: 3px;
	padding-bottom: 3px;
	line-height: 1.428571429;
	border: 1px solid #ddd;
}

table > tfoot {
	font-weight: bold;
	text-align: center;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js"></script>
<div id="report">
  <table>
    <thead>
      <tr data-remove="true">
        <th>{{theadContent}}</th>
      </tr>
    </thead>
    <tbody>
      <tr data-remove="true">
        <th>{{tbodyContent}}</th>
      </tr>
    </tbody>
    <tfoot>
      <tr data-remove="true">
        <th>{{tfootContent}}</th>
      </tr>
    </tfoot>
  </table>
</div>
<div id="tools">
  <button id="btnGenerateHtmlMail">
    Gerar HTML E-mail
  </button>
  <div contenteditable="true" id="result" style="width: 99%;resize: none;border: 1px solid #ccc;padding: 0.5%;"></div>
</div>

Note: In this (real) example the problem does not occur because there are no identical tags inside the tr, but I know it would be a bug that I want to fix to make the tool generic.

  • 1

    Is this code server-side (Node.js, io.js), or will it run in the browser itself?

  • 1

    @ctgPi, in the browser itself, client-side.

  • 2

    Condense your question. It has a lot of unnecessary information. Show only your well-specified question and what you have tried. Just explain that you need it to be done with regex. There is no need to justify why.

  • 1

    @Guill, I had left out my real example and the answers ended up very far from the solution, so I added the nearly complete problem to make the context clear. But I will see whether I can remove things from my actual code that are irrelevant to the problem.

  • 1

    @Guill, yes, it works for any number of inner occurrences (I have already edited and removed), but it is not what is expected; look at this example of your solution applied to my real case. I am starting to think it is not possible =(.

5 answers

3

The problem is that you are trying to process a language that is not regular (HTML) with regular expressions. The solution is to write a recursive function that cleans the tree, something like:

var attributeWhiteList = ['style'];  // attributes you want to keep
var elementWhiteList = ['#text', 'TABLE', 'THEAD', 'TBODY', 'TFOOT', 'TR', 'TH', 'TD', 'P', 'B', 'DIV'];  // elements you want to keep

function cleanHTMLForEMail(node) {
    if (node.nodeName === '#text') {
        // here you can edit node.textContent to strip whitespace
        return node;
    }

    // list the attributes
    var attributeNames = [];
    for (var i = 0; i < node.attributes.length; i++) {
        attributeNames.push(node.attributes[i].name);
    }

    // remove every attribute that is not in the whitelist
    for (var i = 0; i < attributeNames.length; i++) {
        if (attributeWhiteList.indexOf(attributeNames[i]) === -1) {
            node.removeAttribute(attributeNames[i]);
        }
    }

    // list the children
    var children = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        children.push(node.childNodes[i]);
    }

    // remove every child that is not in the whitelist
    // and clean the ones that are
    for (var i = 0; i < children.length; i++) {
        if (elementWhiteList.indexOf(children[i].nodeName) === -1) {
            node.removeChild(children[i]);
        } else if (children[i].nodeName === 'TR' && children[i].dataset.remove === 'true') {
            node.removeChild(children[i]);
        } else {
            node.replaceChild(cleanHTMLForEMail(children[i]), children[i]);
        }
    }

    return node;
}

Edited: JSFiddle, fixing small implementation errors and working on the example you wanted to solve.

  • 1

    I may not have understood your solution, but what I want is for this example's structure (which works) and another structure with a <tr> inside a <tr> to produce the same result. Is the problem understandable? And I did not understand how your solution helps.

  • 1

    If I understand correctly, you want to "peel" the table rows (remove the <tr> and keep its filling)? For tags that have data-remove="true", you only want the text contained within them?

  • 1

    Okay, I saw in the other comment what data-remove="true" means. Between the {{ and }}, is there only plain text, or can there be tags? If there can be tags, does anything prevent you from replacing {{ and }} with, e.g., <div data-preserve="true"> and </div>?

  • 1

    Your first comment is perfect, that’s exactly what I want.

  • Like this, then? I only changed the line that handles the <tr> in the code above.

  • Regarding your second comment: between the {{ and the }} there can only be text, at most something like {{cliente.nome}} (with the dot). As for the markup, it will be consumed by a Java framework that will replace, for example, {{cliente.nome}} with "Fernando". Understood?

  • I get it. Look at the second JSFiddle, then; I think it solves your problem.

  • Hmm, apparently it worked; +1 for now. I will analyze it and, if that is the case, mark it as the answer. See it working here.

3


As said by @ctgPi:

HTML is not a regular language and therefore cannot be processed by a regular expression.

Therefore it is necessary to write functions to perform HTML processing.

Here is a code sample you can build on (it uses regular expressions).

// String containing your HTML
var string = '<table><thead><tr data-remove="true"><th></th><th><th>{{theadContent}}</th></th><th></th></tr></thead><tbody><tr data-remove="true"><th>{{tbodyContent}}</th></tr></tbody><tfoot><tr data-remove="true"><th>{{tfootContent}}</th></tr></tfoot></table>';

// Convert the string into a jQuery object
var $element = $(string);

// Iterate over the marked roots, performing the necessary replacements
$('*[data-remove=true]', $element).each(function(index) {
  $(this).replaceWith($(this).html().replace(/.*?(\{\{[^\}]*\}\}).*/, '$1'));
});

// Convert the jQuery object back into a string
var string_processada = $element.get(0).outerHTML;

// Print to the screen
$('body').text(string_processada);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js"></script>

It cuts off the DOM branches whose root has the attribute data-remove with the value true, leaving only the part enclosed in "{{" and "}}".

It may contain bugs that I did not notice.

1

  • 1

    That is not my case: I am processing the DOM as a string; I use $('body').html() and process it with regex. The application is an API that converts CSS to inline styles (for HTML mail), and this is a final treatment, applied when the HTML already has inline styles and is already in string format. Even so, I appreciate the attention and the willingness to help.

  • 1

    Sorry, I still do not understand what you need this solution for. Is it to check that all HTML tags have been closed and to validate that the HTML structure is correct?

  • 1

    Neither. In the real example in the question you can see the application working correctly, which in this particular case it does because there are no <tr> tags inside other <tr> tags. What I do is remove the entire structure marked with data-remove="true", keeping the inner content marked between {{ and }}. I use the data-remove="true" mark so that the HTML, when rendered before the CSS is converted to inline styles, is not modified, as demonstrated here.

1

You can (and in my opinion should) use the browser parser itself:

buffer = document.createElement('div');
buffer.innerHTML = 'um <b>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.';
console.log(buffer.querySelectorAll('b'));

If you only want the b elements at the first level, you can create two divs, one inside the other, and search only for div > b.

(from what I have tested, this seems to be immune to XSS as long as you discard the resulting node and do not insert it directly into the document)

  • I will post another answer to the updated question.

1

Basically, this regex works for your problem. It will fail on some samples, but it serves your case.

\<(minhaTag)(?: .*?)?\>(?:[^\<]|\<(.*?)\>[^\<]*\<\/\2\>)*\<\/\1\>

Regex Tested Here

Replace minhaTag with the name of the desired tag. This regex will match the outermost element of the specified tag, along with its contents. The element can contain attributes.

Tips:

  • Be careful with the operators *? and *; study their differences.

  • Remember to make the . class include \n (new line) via the single-line modifier (s), if that modifier is supported.

  • Use the global modifier (g) if you want all outermost tags of the specified name in the sample (see the link above).
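As a small illustration of the first two tips (a sketch only; the sample strings and variable names are made up for this example), the snippet below contrasts the greedy * with the lazy *? and shows the effect of the s modifier. Note that the s flag only exists in JavaScript engines that support ES2018; the [\s\S] form is a common alternative where it is unavailable.

```javascript
// Two sibling <b> elements on one line (illustrative sample, not from the post).
var sample = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the end of the line and backtracks to the LAST </b>.
var greedy = sample.match(/<b>.*<\/b>/)[0];

// Lazy: .*? stops at the FIRST </b> it can reach.
var lazy = sample.match(/<b>.*?<\/b>/)[0];

// Content that spans two lines.
var multiline = '<b>line 1\nline 2</b>';

// Without the s modifier, "." does not match "\n", so the match fails.
var withoutS = /<b>.*?<\/b>/.test(multiline);

// With the s modifier (ES2018), "." also matches "\n".
var withS = /<b>.*?<\/b>/s.test(multiline);

// Alternative for engines without the s modifier: [\s\S] matches any character.
var portable = /<b>[\s\S]*?<\/b>/.test(multiline);

console.log(greedy);   // <b>one</b> and <b>two</b>
console.log(lazy);     // <b>one</b>
console.log(withoutS, withS, portable); // false true true
```

Neither quantifier counts nesting: the lazy form still stops at the first closing tag even when an identical tag was opened inside, which is the behavior described in the question.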

  • 1

    HTML is not a regular language and therefore cannot be processed by a regular expression.

  • 2

    @ctgPi, the question at no point treats var entrada = 'um <b data-remove>negrito<b>negrito interno</b>externo</b> aqui <b>negrito</b> <i>italico</i>.'; as actual HTML content, but rather as a string with a certain pattern, so in my opinion your comments on the answers are not valid. This is the answer that has come closest to the solution so far.

  • 1

    My comment is a mathematical statement; a theorem, to be more precise. The var entrada is, yes, a string, which represents a fragment of HTML; it is not the pipe, it is the representation of the pipe. It is perfectly valid to use strings to represent HTML when it is being passed from one side to another (in this case, via email); it is wrong to manipulate the string when what you want is to manipulate the underlying HTML.

  • 1

    @Guill, with your latest edit we are back to square one; look at what happens when adding a <tr> inside another. This is looking more complicated than I thought. =(

  • Here it worked as expected. Use the chat link in the comment on the question, to avoid polluting the comments further.
