Regex to pick up everything in parentheses in certain situations


I need to create a regex that extracts from source code everything passed as a parameter to a function, but only for certain functions. I have a regex that captures everything in parentheses, but as soon as I add the lookbehind, everything breaks.

Regex:

(?<=Html->script|cell|element|Html->css)([^-]+)[^\(']+(\([^\)]+\))

Cases in which it should capture the function parameters:

$this->element('element.ctp');
$this->cell('teste.ctp');
$this->Html->script([
    'teste.js'
]);
$this->Html->script([
    'teste.js',
    'teste2.js',
]);
$this->Html->script(['teste.js']);
$this->Html->script('teste.js');
$this->Html->css(['teste.css']);
$this->Html->css('teste.css');
  • What language or tool/engine are you using? It is important to say, because each one implements regex in its own way, and what works in one may not work in another. Even more so for a complex case like this one (for which, I would say, regex is not the best solution).

  • The regex will be written in TypeScript and will read PHP code. I want to create a plugin for VS Code.

  • In that case, it is worth remembering that lookbehind is not yet supported by all browsers (at the time of writing, Firefox and Safari do not support it). I think there are some libraries that polyfill it, but I still don’t think it’s worth it. I still think it’s better to look for a parser. For example, this one looks promising (but I haven’t tested it): https://github.com/glayzzle/php-parser

  • This one really does seem much better, and it has already expanded what I could do in the plugin. I’ll test it, thank you!

1 answer


What you want is a parser, not a regex. Look for one specific to the language in question and use it; it will take less work than a regex (and even if it takes "more" work, it will still pay off, because regex is not the most suitable tool for this task).

Regex may work for simpler cases, but source code is generally more complex and has several situations that a regex can’t detect (or can detect, but only by becoming so complicated that it’s not worth it).

For example, in your case, something that could work is:

(?<=Html->script|cell|element|Html->css)\(([^\)]+)\)

I moved the parentheses out of the capture group so that it captures only the parameters, and removed the other parts that didn’t seem to make sense: [^-]+ is one or more characters that are not a hyphen, and [^\(']+ is one or more characters that are neither ( nor '. In other words, your regex required at least two characters before the first (. The problem was not the lookbehind itself, but the requirement of those characters before the first (.

The regex above captures everything between the parentheses.
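As a rough illustration of how the regex above could be applied in TypeScript, here is a minimal sketch. The helper name extractParams and the sample string are hypothetical, and the pattern is built with the RegExp constructor (an arbitrary choice) instead of a regex literal. It assumes a runtime that supports lookbehind, such as a recent Node.js.

```typescript
// The pattern from the answer; the "g" flag lets exec() walk through every match.
const paramPattern = new RegExp(
  "(?<=Html->script|cell|element|Html->css)\\(([^)]+)\\)",
  "g"
);

// Hypothetical helper: returns the text captured between the parentheses
// of each matching call.
function extractParams(source: string): string[] {
  const found: string[] = [];
  paramPattern.lastIndex = 0; // reset, since a global regex keeps state between calls
  let match: RegExpExecArray | null;
  while ((match = paramPattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}

const sample = [
  "$this->element('element.ctp');",
  "$this->Html->script(['teste.js']);",
  "$this->Html->css('teste.css');",
  "$this->other('ignored.js');", // not one of the target functions
].join("\n");

console.log(extractParams(sample));
// [ "'element.ctp'", "['teste.js']", "'teste.css'" ]
```

Note that the captured text still contains the quotes and brackets, so any further cleanup is left to the calling code.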


But as I said, programming languages accept more complex expressions than that. What if your code has something like:

$this->Html->css(outraFuncaoQueRetornaOcss());

The regex captures only the stretch outraFuncaoQueRetornaOcss(, leaving the last ) out. That’s because [^\)]+ only matches characters that are not ), so it stops as soon as it finds one. You would then have to use something like a recursive regex to check for balanced parentheses, which is so complicated that it’s not worth it, and many languages don’t even support that feature.
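To make the nested-call problem concrete, the sketch below compares the simple regex with a hand-written depth counter. This is only one possible workaround, not something proposed in the answer: the helper balancedArgs is hypothetical, it assumes openIndex points at a ( character, and it deliberately ignores parentheses inside strings or comments, so it is still far from a real parser.

```typescript
// Hypothetical helper: returns the text between the "(" at openIndex and its
// matching ")", or null when the parentheses never balance.
// Assumption: source.charAt(openIndex) is "(".
function balancedArgs(source: string, openIndex: number): string | null {
  let depth = 0;
  for (let i = openIndex; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth === 0) {
        return source.slice(openIndex + 1, i);
      }
    }
  }
  return null;
}

const line = "$this->Html->css(outraFuncaoQueRetornaOcss());";

// The simple regex stops at the first ")".
const simple = new RegExp("(?<=Html->css)\\(([^)]+)\\)").exec(line);
console.log(simple ? simple[1] : null); // outraFuncaoQueRetornaOcss(

// The depth counter keeps going until the parentheses balance.
console.log(balancedArgs(line, line.indexOf("("))); // outraFuncaoQueRetornaOcss()
```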

What if the snippet is commented out? It’s not clear which language this is (it looks like PHP), but anyway:

// $this->Html->css('teste.css');

/* ou comentário multi-linha
$this->Html->script('teste.js');
$this->Html->css(['teste.css']);
*/

The regex does not detect this and wrongly picks up the snippets above (a parser, on the other hand, would detect the comments and ignore those lines without any problem). You could even write a regex that detects comments, but is it worth adding something like that to an expression that is already far from simple?
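The false positives can be reproduced with a short TypeScript sketch. The variable names are hypothetical, and the input is just the commented-out snippet from above joined into one string.

```typescript
const pattern = new RegExp(
  "(?<=Html->script|cell|element|Html->css)\\(([^)]+)\\)",
  "g"
);

// Every call below is inside a comment, so ideally nothing should be extracted.
const commented = [
  "// $this->Html->css('teste.css');",
  "/* ou comentário multi-linha",
  "$this->Html->script('teste.js');",
  "$this->Html->css(['teste.css']);",
  "*/",
].join("\n");

const falsePositives: string[] = [];
let hit: RegExpExecArray | null;
while ((hit = pattern.exec(commented)) !== null) {
  falsePositives.push(hit[1]);
}

// The regex has no notion of comments, so all three calls are reported.
console.log(falsePositives.length); // 3
```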

And comments can be even more treacherous:

$this->Html->script([
    'teste.js', // algum comentário
    'teste2.js', // outro comentário
]);

The function parameter is the array ['teste.js', 'teste2.js'], but the regex treats the comments as part of it. Good luck writing a regex that detects this and removes the comments correctly. It may be possible, but it will be so complicated that it is not worth it (whereas a parser would simply ignore the comments).

Anyway, there are too many situations to detect, and many of them are far from trivial to handle with regex. In the end, you would have to write a "mini-parser" with regex, which is not a very smart solution (for learning purposes it may be worth trying, but not for a solution to be used in real systems). Regex is cool (I like it a lot), but it is not always the best solution.
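A quick check in TypeScript shows the comments ending up inside the captured group. This is only a sketch with hypothetical variable names, using the same snippet as input.

```typescript
const arrayCall = [
  "$this->Html->script([",
  "    'teste.js', // algum comentário",
  "    'teste2.js', // outro comentário",
  "]);",
].join("\n");

const found = new RegExp("(?<=Html->script)\\(([^)]+)\\)").exec(arrayCall);
const captured = found ? found[1] : "";

// The capture runs from "[" to "]", comments included.
console.log(captured.indexOf("// algum comentário") !== -1); // true
console.log(captured.indexOf("// outro comentário") !== -1); // true
```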

  • Since I want to create a plugin for VS Code, I believe the regex you gave me already meets my needs, because I will still process what it extracts in the code. I will still test more situations and study parsers, though. Thank you!
