How to use Regex to represent a more complex BNF?

Asked

Viewed 131 times

-1

I need to create a program with Regex that represents the following BNF’s:

<list> ::= <element><list> | <element>
<element> ::= <letter><digit>
<letter> ::= A | B 
<digit> ::= 1|2

For the above BNF I tried to generate a program but I believe it is not so correct:

import re

r = re.compile('^[A-B-1-2]+$')

print(r.match('A1'))
print(r.match('A2B1'))

<exp> ::= <exp> + <exp> | <term>
<term> ::= <term> * <term> | <factor>
<factor> ::= ( <exp> ) | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 

And I also need to implement a program with Regex representing the BNF above.

  • 3

    As the site is a repository for a wide audience, it is important to always explain objectively the difficulty encountered in advancing the solution, so that the answers can focus on a very specific problem, instead of providing "complete ready solution"even if it wasn’t intentional. To better enjoy the site, understand and avoid closures and negativations worth understanding What is the Stack Overflow and read the Stack Overflow Survival Guide (summarized) in Portuguese.

  • 2

    The trivial is to use an analyzer generator like the Pybison or Lark instead of the native regular expression library.

  • 1

    A regular language can be expressed by a regular expression. Your BNF is a context-free grammar and therefore cannot be represented by a regular expression.

1 answer

2


In accordance with informed in the comments, regex is not the ideal tool for this (there are suggested more suitable options), and on this subject I also suggest you read here and here. That said, let’s go to the regex.


The first BNF says that "element" repeats once or more. In turn, an "element" is a sequence of a letter A or B, followed by a digit 1 or 2.

The problem is you used [A-B-1-2], meaning "letter A or B, or hyphen, or 1 or 2". That is, it even takes valid cases like "A1" and "A2B1" but also takes invalid cases like "----" and "1212A". If the idea is to always make an alternate letter and digit, and this sequence can be repeated several times, the regex should be:

r = re.compile('^([AB][12])+$')

So I have [AB] (letter A or B), followed by [12] (digit 1 or 2), and this sequence repeats once or more.


The second is a little more complicated and think which is not possible with regex, as its definition is recursive (and without a base case):

  • "factor" can be an "Exp" in parentheses
  • "Exp" can be a "term"
  • "term" can be a "factor"
  • "factor" can be an "Exp" in parentheses
  • ...

Of course there is recursive regex, but Python does not support natively (and frankly, it is such a complicated business that I question whether it would be worth using).

But just for the record, I used the module regex (which has recursive regex support) and tried to implement this little "monster":

import regex

r = regex.compile(r"""(?(DEFINE)
       (?<digit>     (?: [0-9] ) )
       (?<factor>    (?: \( (?&exp) \) | (?&digit) ) )
       (?<term>      (?: (?&term) \* (?&term) | (?&factor) ) )
       (?<exp>       (?: (?&exp) \+ (?&exp) | (?&term)  ) )
    )
    ^ (?&exp) $
""", regex.VERBOSE)

Basically she tries to use subroutines, which allows the same subexpression to be used recursively.

But as it turned around, there was a MemoryError. I tested this same regex on the regex101.com site and a "engine error occurred" (see here):

The match was Halted because your Expression contains a recursive loop. This Means either that the Whole Pattern, or a subpattern, has been called recursively for the Second time at the same position in the Subject string.

That is, I think it is not possible, precisely because it has a recursive definition (an expression that may have terms, that may have factors, that may be other expressions, etc.) and without a base case on which it can end.

  • Thank you so much, you’ve helped me enough! As always!

  • It seems that the module regex recursion accepted. Worth testing. Taken from documentation Recursive and repeated Patterns are supported.

  • @Paulomarques The last example uses the module regex and the expression has subroutines (which is one of the recursion types of regex). But it didn’t work because the definition doesn’t have a base case and ends up becoming an infinite recursion (from what I understand). But then I’ll see if there’s another way - although I don’t think with regex at all...

  • 1

    Apparently a parser would have to be built...

  • 1

    @Paulomarques Yes, even options were suggested in the comments - updated the answer to make it clearer that regex is not the ideal tool, thank you!

  • It seems to me that to suddenly run in a better way, if I used the nltk library in Python, it would work for this BNF.

  • @Camillamarques, NLTK It’s like using a meteor to kill an insect. A parser generator as already suggested is enough and suitable for your problem as you want to analyze a context-free formal grammar. Before moving forward on this topic it is necessary to know formal language theory, computer theory and compilers

  • @Camillamarques, on the site you find a good collection on the theme [tag:formal languages], [tag:theory-of-computing], [tag:design-of-language], [tag:compilers], [tag:parser],...

Show 3 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.