How to use Regex to represent a more complex BNF?

Question

How to use Regex to represent a more complex BNF?

Asked 4 years, 8 months ago

Viewed 131 times

-1

I need to create a program with Regex that represents the following BNF’s:

<list> ::= <element><list> | <element>
<element> ::= <letter><digit>
<letter> ::= A | B 
<digit> ::= 1|2

For the above BNF I tried to generate a program but I believe it is not so correct:

import re

r = re.compile('^[A-B-1-2]+$')

print(r.match('A1'))
print(r.match('A2B1'))

<exp> ::= <exp> + <exp> | <term>
<term> ::= <term> * <term> | <factor>
<factor> ::= ( <exp> ) | <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

And I also need to implement a program with Regex representing the BNF above.

3

As the site is a repository for a wide audience, it is important to always explain objectively the difficulty encountered in advancing the solution, so that the answers can focus on a very specific problem, instead of providing "complete ready solution"even if it wasn’t intentional. To better enjoy the site, understand and avoid closures and negativations worth understanding What is the Stack Overflow and read the Stack Overflow Survival Guide (summarized) in Portuguese.

– Bacco

2020/11/15 at 00:44
2

The trivial is to use an analyzer generator like the Pybison or Lark instead of the native regular expression library.

– Augusto Vasques

2020/11/15 at 00:57
1

A regular language can be expressed by a regular expression. Your BNF is a context-free grammar and therefore cannot be represented by a regular expression.

– anonimo

2020/11/15 at 13:30

1 answer

Browser other questions tagged python regex

You are not signed in. Login or sign up in order to post.

by hkotsubo • **55,826** points · Answer 1 · 2020-11-15T00:40:21+00:00

^{In accordance with informed in the comments, regex is not the ideal tool for this (there are suggested more suitable options), and on this subject I also suggest you read here and here. That said, let’s go to the regex.}

The first BNF says that "element" repeats once or more. In turn, an "element" is a sequence of a letter A or B, followed by a digit 1 or 2.

The problem is you used [A-B-1-2], meaning "letter A or B, or hyphen, or 1 or 2". That is, it even takes valid cases like "A1" and "A2B1" but also takes invalid cases like "----" and "1212A". If the idea is to always make an alternate letter and digit, and this sequence can be repeated several times, the regex should be:

r = re.compile('^([AB][12])+$')

So I have [AB] (letter A or B), followed by [12] (digit 1 or 2), and this sequence repeats once or more.

The second is a little more complicated and think which is not possible with regex, as its definition is recursive (and without a base case):

"factor" can be an "Exp" in parentheses
"Exp" can be a "term"
"term" can be a "factor"
"factor" can be an "Exp" in parentheses
...

Of course there is recursive regex, but Python does not support natively (and frankly, it is such a complicated business that I question whether it would be worth using).

But just for the record, I used the module regex (which has recursive regex support) and tried to implement this little "monster":

import regex

r = regex.compile(r"""(?(DEFINE)
       (?<digit>     (?: [0-9] ) )
       (?<factor>    (?: \( (?&exp) \) | (?&digit) ) )
       (?<term>      (?: (?&term) \* (?&term) | (?&factor) ) )
       (?<exp>       (?: (?&exp) \+ (?&exp) | (?&term)  ) )
    )
    ^ (?&exp) $
""", regex.VERBOSE)

Basically she tries to use subroutines, which allows the same subexpression to be used recursively.

But as it turned around, there was a MemoryError. I tested this same regex on the regex101.com site and a "engine error occurred" (see here):

The match was Halted because your Expression contains a recursive loop. This Means either that the Whole Pattern, or a subpattern, has been called recursively for the Second time at the same position in the Subject string.

That is, I think it is not possible, precisely because it has a recursive definition (an expression that may have terms, that may have factors, that may be other expressions, etc.) and without a base case on which it can end.