Regular expression to detect nested structures in a template

Question

Regular expression to detect nested structures in a template

Asked 7 years, 10 months ago

Viewed 311 times

17

I’m trying to create a template engine using Javascript. The syntax will be more or less similar to that of Laravel (Blade), with some modifications.

I’m at the part of creating the expressions. The first one I’m doing is the if, but right away I faced a problem.

First of all, I would like to make it clear that I created a template in a separate file, which has the content viewed through the module fs. In order to be able to run in the browser, I put all the string returned within a constant template:

const template = `
<div class="wrapper">
  <p>Um paráfrafo qualquer fora do if.</p>
  @if (true)
    <p>
      Um parágrado dentro do if.
      <strong>Tags aninhadas.</strong>
    </p>
  @endif
</div>
`;

I created the following regular expression to search within the view:

/@if(?:\s+|)\(([\s\S]*?)\)([\s\S]*?)@endif/gi

And I’m using the following code to "run" the expression if:

const renderView = (template, data = {}) => {
  const IF_REGEX = /@if(?:\s+|)\(([\s\S]*?)\)([\s\S]*?)@endif/gi;
  template = template.replace(IF_REGEX, (template, conditional, content) => {
    if (! eval(conditional)) return '';
    return content;
  });

  return template;
};

Works perfectly when there’s only one block if (no nesting of blocks).

However, when I nest two blocks:

<div class="wrapper">
  <p>Parágrafo fora dos ifs.</p>

  @if (false)
    <div>Oi?</div>
    @if (true)
    <span>Span...</span>
    @endif
  @endif
</div>

It doesn’t work as expected:

How can I correct this error? I imagine it is in regular expression.

3

Don’t use regex for that sort of thing, a good explanation for this

– Lucas

2018/03/21 at 18:09
So what should I use to create this kind of thing? What do Handlebars.JS or EJS use to create this? I want to create my template-engine. I need to know how to search HTML.

– Luiz Felipe

2018/03/21 at 18:15
why not look at the handlebars source code?

– user8545

2018/03/21 at 22:29
It doesn’t help, I didn’t get what I’m looking for.

– Luiz Felipe

2018/03/21 at 23:53
I edited the message, added more details...

– Luiz Felipe

2018/03/23 at 15:48
Regular expression cannot handle recursive nesting. What you need to use there is a parser.

– BrunoRB

2018/03/23 at 17:47
How can I do this using Javascript? .-.

– Luiz Felipe

2018/03/23 at 20:05
1

I’m afraid it’s a little more complicated than you expected. That’s basic compiler theory, you need a lexer that breaks the text into tokens, then you take these tokens and build a syntactic tree with a parser, this tree is what gives you the hirarquica structure with the conditional ones you want. For JS recommend the JISON. If you’ve never heard of any of this really will get a little lost, but unfortunately there aren’t many shortcuts, give a search on lexers and parsers.

– BrunoRB

2018/03/24 at 18:37

Show 3 more comments

2 answers

4

^{As already said in the comments, do not use regex, use a parser (or adapt an existing, or build a). Regex is not the right tool, especially when there are nested structures. Throughout the answer, I hope you understand the reasons, so let’s go...}

To another answer may have worked for your specific case, but what if your template have another @if then it no longer works:

function renderView(template, data = {}) {
  const IF_REGEX = /@if(?:\s+|)\(([\s\S]*?)\)([\s\S]*?)+@endif/gi;
  template = template.replace(IF_REGEX, (template, conditional, content) => {
    if (! eval(conditional)) return '';
    return content;
  });

  return template;
}

const template = `
<div class="wrapper">
  <p>Parágrafo fora dos ifs.</p>

  @if (true)
    <div>Oi?</div>
    @if (false)
    <span>Não devo ser renderizado</span>
    @endif
  @endif

  <p>Estou entre os if's</p>

  @if (true)
    <div>estou em outro if</div>
  @endif
</div>
`;

console.log(renderView(template));

The result is:

<div class="wrapper">
  <p>Parágrafo fora dos ifs.</p>

   
</div>

Which is not quite what it should be - from what I understand, it should also have rendered the tags <div>Oi?</div>, <p>Estou entre os if's</p> and <div>estou em outro if</div>.

This happens because the regex took everything from the first @if to the last @endif. Simply add a quantifier to the snippet ([\s\S]*?) only makes it possible to repeat itself over and over again, but as [\s\S] corresponds to anything, regex may even pick up occurrences of @endif, if you think it necessary.

And by putting the quantifier applied to the capture group, did the content be empty (see here, Group 2 content is empty). Then the callback past to replace ends up returning empty, eliminating all the stretch between the first @if and the last @endif.

In fact, use ([\s\S]*?)+ is kind of "weird" because when using *? you are indicating that [\s\S] should be repeated as few times as possible (see here and here to better understand how the *?), but to surround everything with +, you are saying that it can be repeated as many times as possible (i.e., the default behavior of the quantifier +, which is to be "greedy", cancels the "Azy" behavior of the *?, then in practice use only [\s\S]* would give in the same - generating even the same problem already mentioned above - with the advantage that the regex gets a little faster, because when placing nested quantifiers, you increases the possibilities to be tested; already using only one, there are fewer possibilities and regex has fewer cases to test).

Another detail is that (?:\s+|) means "one or more \s, or nothing", then can be changed to \s* (zero or more \s).

So how do I get the if's separately? With a single regex that does everything at once, it is not possible (maybe even if we use recursive regex with subroutines, but are features that Javascript does not support).

Another alternative is to treat @if's from the inside out (first I check the most internal, which has no other @if in), treat them, updating the template and then I’ll repeat this process until there are no more @if:

function renderView(template, data = {}) {
  const IF_REGEX = /@if\s*\(((?:[\s\S](?!@if))+?)\)((?:[\s\S](?!@if))+?)@endif/gi;

  while (template.includes('@if')) { // enquanto tem @if, substitui
    template = template.replace(IF_REGEX, function(template, conditional, content) {
      if (! eval(conditional)) return '';
      return content;
    });
  }

  return template;
}

const template = `
<div class="wrapper">
  <p>Parágrafo fora dos ifs.</p>
  @if (true)
    <div>Eu serei renderizado</div>
    @if (false)
    <span>Não devo ser renderizado</span>
    @endif
  @endif
  <p>Estou entre os if's</p>
  @if (true)
    <div>estou em outro if</div>
    @if (true)
      <span>Estou em um if aninhado</span>
      @if (true)
        <span>Estou em outro if aninhado</span>
        @if (false)
          <span>Não serei renderizado</span>
        @endif
      @endif
    @endif
  @endif
</div>

@if (true)<div>Eu também apareço</div>@endif
`;

console.log(renderView(template));

The regex uses the Lookahead negative [\s\S](?!@if), that checks if it is a character that does not have @if soon after (and all this is repeated once or more). In fact, I changed the quantifier * for +, for the * means "zero or more occurrences", which indicates that it could have something like @if(). Already trading for + (one or more occurrences), ensures that there has to be at least one character inside (but I don’t check if there are only spaces, for example, so @if ( ) would still be accepted).

Finally, the Lookahead ensures that I will only take the @if who has no other @if inside it. Then I update the template with the result of the evaluation of @if, and I keep checking if there’s any left @if to be analyzed. When no more template rendered.

The result is (eliminating blank lines for easy viewing):

<div class="wrapper">
  <p>Parágrafo fora dos ifs.</p>
    <div>Eu serei renderizado</div>
  <p>Estou entre os if's</p>
    <div>estou em outro if</div>
      <span>Estou em um if aninhado</span>
        <span>Estou em outro if aninhado</span>
</div>
<div>Eu também apareço</div>

But of course it is still a naive implementation. If the template has @if commented (may have? ), he will try to evaluate (which would not occur with a parser, since the comment would be detected and correctly ignored). In addition, regex uses the flag i, then the template may have things like @IF and @EndIf. If that’s what you really need, you can leave the flag, otherwise it is better to remove it.

Another point is that depending on the case, a lot of processing can be done for no reason. For example, if the template something like that:

@if (condicao_falsa)
  @if (bla1)
    @if (bla2)
      @if (bla3)
        @if (bla4)
        @endif
      @endif
    @endif
  @endif
@endif

The algorithm starts by evaluating bla4, afterward bla3, etc. But if the conditions of all @if's internal return true and only the condition of the outermost first returns false, all the interiors will have been rendered for no reason. In a parser well implemented, the first condition would be evaluated first, and if it were false, nor would need to evaluate the most internal. But as we have seen that it is not possible to guarantee with regex that we always manage to catch the @if external, is one more disadvantage that you will have to accept, if you use regex instead of parser.

Not to mention that the regex’s own performance will not be those things, for she first begins the search on @if external, until it detects that it has another @if in. Then she does the backtracking and try from the second @if, and after the third, etc., until finding the innermost. And how is a loop, will do all this process again (starts from the most external, finds an internal, starts from that, etc.), for each level of nesting. To templates very large and with many @if'nested, will eventually become very inefficient, and perhaps even unfeasible.

There are also the cases of template be in error (for example, a @if internal does not have the closure, so the regex will end up going to the @endif from the outside, thing that a parser would not do because it would detect that lacked close one of them), and many other situations that regex is not able to detect (or even is, but ends up getting so complicated that it is not worth it).

Browser other questions tagged javascript regex

You are not signed in. Login or sign up in order to post.

by Sam • **79,597** points · Answer 1 · 2018-03-25T04:24:09+00:00

The problem is in your regex which will select only until the final first occurrence in @endif, ignoring the second (or too).

To resolve, add the quantifier + which will allow selecting how many occurrences preceded by @endif whatever there is:

@if(?:\s+|)\(([\s\S]*?)\)([\s\S]*?)+@endif
                                   ↑
                             quantificador

See the print on regex.: