How to validate a numerical expression with a regular expression?

In my PHP 7 application I need to validate a numerical expression with a regular expression. I almost succeeded, but I ran into a problem: non-significant (leading) zeros within the expression.

I can validate the numerical expression below:

10 + ( 10 * 10 ) - 20

Using the following regular expression :

$cRegex  = '/^' ;
//          '|------|----|-----|---|-----|----|---|
$cRegex .= '([-+\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?)?' ;
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 

$cRegex .= '$/' ;

My problem is that when the numerical expression 690/09 appears, the result is true when it should be false, because the correct form would be 690/9. In PHP 7, the "09" in the numerical expression causes a problem.

That’s why I’m asking for help to improve my regular expression so I can detect it.


I got it working this way:

$cRegex  = '/^' ;
//          '|---------|----|---------------|---|-----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?([1-9]{1}[0-9]+)[.]?[0-9]+[\)]?[\)]?)' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
$cRegex .= '$/' ;

I tested with the following numerical expressions :

  • +(690.91)+1 - Validated
  • +(690/2)+20.01+(10*3)-010 - Correctly rejected
  • +(690/2)+20.01+(10*3)-10 - Validated
  • 690 - Validated
  • 690/09 - Correctly rejected
  • 690/9 - Validated

Note: I removed the whitespace from the expressions.

  • Can your expression be anything, or will it always be in this format? Are you familiar with reverse Polish notation? Your problem might be easier to solve with it.

  • Since the content comes from a MySQL database, it can be anything, so I need to know whether it is a number or a numerical expression in order to choose which path to take.

  • As for Polish notation, I don't really know it, but I would like to apply it in PHP 7. Can you give me an example?

1 answer

Since you said that the expression can be "anything", I’m guessing it can have more than one pair of nested parentheses, for example:

690.91+(1.3*(4-7/(3+6.4)))

Your regex can’t detect that, because of the nested parentheses. In fact, it has other problems as well. For example, you cared so much about validating the more complex cases (such as +(690.91)) that you ended up missing the simplest ones: 1+1 is rejected.

Some details of your regex:

  • [\(] is the same as \( (which, by the way, is the same as [(], because inside brackets many characters do not need to be escaped with \). In any case, if you want to match only one character, you don’t need the brackets, so to check for an opening parenthesis, just use \(.
    • the same goes for [.], which can be written simply as \.
    • brackets are useful when more than one character is possible (e.g. [ab] means "the letter a or the letter b"), but when you just want to match a single character, they are unnecessary
  • the quantifier {1} means "exactly one occurrence", but by default anything you write in a regex already means exactly one occurrence of it. So [1-9]{1} is the same as [1-9].
  • in the first part of the regex you used ([1-9]{1}[0-9]+)[.]?[0-9]+ (a digit from 1 to 9, followed by one or more digits from 0 to 9, followed by an optional dot, followed by one or more digits from 0 to 9). That is, this part only validates numbers that have at least 3 digits (if there is no dot), or at least 2 digits before the dot. That’s why the regex does not validate 1+1.
    • in the other parts you use ? instead of +, which makes some pieces optional (so the second number can have fewer than 3 digits, as in 690+1)
  • You left both the opening and the closing parentheses optional. This means that your regex accepts expressions that have an opening parenthesis without the corresponding closing one, or a closing parenthesis without the opening one.
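To make the points above concrete, here is a small sketch that runs the question's second regex against a few inputs. The regex is rebuilt with str_repeat() purely for brevity (the variable names $first and $rest are mine, not from the question); the pieces themselves are copied from the question.

```php
<?php
// The question's second regex, rebuilt compactly: one mandatory first operand
// followed by five optional ones (same pieces as in the question).
$first  = '([-+*\/\(]?[\(]?([1-9]{1}[0-9]+)[.]?[0-9]+[\)]?[\)]?)';
$rest   = '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?';
$cRegex = '/^' . $first . str_repeat($rest, 5) . '$/';

$results = [];
foreach (['690+1', '1+1', '(690+1', '690+1)'] as $exp) {
    // preg_match() returns 1 on a match and 0 otherwise
    $results[$exp] = preg_match($cRegex, $exp) === 1;
    echo $exp, ' => ', $results[$exp] ? 'accepted' : 'rejected', PHP_EOL;
}
```

Here 1+1 is rejected because the first operand needs at least 3 digits, while (690+1 and 690+1) are both accepted even though their parentheses are unbalanced.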

Correcting/improving...

For the numbers, you could use something like -?(?=[1-9]|0(?!\d))\d+(?:\.\d+)?.

It starts with an optional minus sign (-?). Then we have a lookahead, the part inside (?= ... ), which checks whether what comes next is a digit from 1 to 9, or a zero that is not followed by another digit (the | character means "or", and 0(?!\d) ensures that there is no digit right after the zero). Thus the expression can contain the number zero alone (0), but not 09, for example.

Then we have \d+ (one or more digits), optionally followed by a dot and more digits (so we can have 10 and 10.1).

If you want the regex to also accept numbers such as .123 (which is another way of writing 0.123), just change it to -?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+). This means that it accepts numbers in the form already explained above, or a dot followed by one or more digits.
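A quick way to check the number pattern on its own is the sketch below. The ^ and $ anchors are my addition, so that each test string has to be a number in its entirety; the rest is the pattern described above.

```php
<?php
// Number pattern from the answer, anchored with ^ and $ (anchors added here
// so that each test string must be a number in its entirety).
$number = '/^-?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)$/';

$results = [];
foreach (['0', '9', '10.1', '-3.5', '.123', '09', '010', '1.'] as $candidate) {
    $results[$candidate] = preg_match($number, $candidate) === 1;
    echo str_pad($candidate, 6), $results[$candidate] ? 'ok' : 'nok', PHP_EOL;
}
```

The first five candidates pass; 09 and 010 fail because of the lookahead, and 1. fails because a dot must be followed by at least one digit.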


As for the arithmetic expression itself, it is not enough to chain a lot of sub-expressions and make them optional. It is necessary to check, among other things, whether the parentheses are balanced (for each opening parenthesis there is a corresponding closing one).

The other problem is that the expression can have several levels of nested parentheses, so it would not be enough to do as you did (several alternatives in sequence), because there are too many possibilities: expressions without parentheses, with 1 pair of parentheses in each operand, with several nested pairs in each operand, etc.

Not to mention that your regex limits the expression to only 6 operands (e.g. +690.91+1-2-1-3-4). If we add one more (as in +690.91+1-2-1-3-4-1), it is no longer validated. (This case could be solved by changing the ? at the end of each group to * (zero or more occurrences), but that still would not solve the other problems already mentioned.)
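The operand limit can be checked with a sketch like the one below. As an assumption for brevity, the "starred" variant uses a single optional group followed by *, which should be equivalent to putting * on each of the five groups; the variable names are illustrative.

```php
<?php
// Pieces copied from the question's second regex.
$first = '([-+*\/\(]?[\(]?([1-9]{1}[0-9]+)[.]?[0-9]+[\)]?[\)]?)';
$group = '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)';

$fixed   = '/^' . $first . str_repeat($group . '?', 5) . '$/'; // as in the question
$starred = '/^' . $first . $group . '*$/';                     // "?" replaced by "*"

$results = [];
foreach (['+690.91+1-2-1-3-4', '+690.91+1-2-1-3-4-1', '1+1'] as $exp) {
    $results[$exp] = [
        'fixed'   => preg_match($fixed, $exp) === 1,
        'starred' => preg_match($starred, $exp) === 1,
    ];
    echo $exp, ': fixed=', $results[$exp]['fixed'] ? 'ok' : 'nok',
         ' starred=', $results[$exp]['starred'] ? 'ok' : 'nok', PHP_EOL;
}
```

The 7-operand expression passes only with the starred variant, and 1+1 is still rejected by both, which is one of the problems that remain.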

The solution, in this case, is to use recursive patterns and subroutines (regex deliberately copied from here and adapted to the case in question):

$regex = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<sign>      (?: [-+] ))
       (?<expr>      (?: (?&term) (?: \s* [-+] \s* (?&expr))? ))
       (?<term>      (?: (?&factor) (?: \s* [/*] \s* (?&term))? ))
       (?<factor>    (?: (?&number) |
                         (?&sign) \s* (?&factor) |
                        \( \s* (?&expr) \s* \)))
    )
    ^ \s* (?&expr) \s* $
}x';

That regex is quite complex. The first section (inside the (?(DEFINE) ... ) block) defines subroutines. Basically, named "subexpressions" are created. The syntax (?<name> ... ) defines a subroutine, and the syntax (?&name) applies the corresponding regex at that point.

For example, the first subroutine is called "number" (its definition is inside the part delimited by (?<number> ... )), and it corresponds to the regex that checks a number (the same one explained above). In the other subroutines we then see (?&number) being used; at that point the corresponding regex is applied.

Then we have the subroutine "sign", which matches the sign ([-+], a minus or plus sign). Next we define the subroutines "expr", "term" and "factor":

  • an "expr" can have a "term" alone, or added/subtracted to another "expr"
  • a "term" can be a "factor" alone, or multiplied/divided by another "term"
  • a "factor" can be a "number", or a "factor" preceded by a "sign", or an "expr" between parentheses

Note that the structure is recursive (so the regex can handle several levels of nested parentheses and expressions of any size). In many places I also use \s* (zero or more whitespace characters), so the regex allows spaces in the expression.
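If the recursion is hard to picture, a much smaller example may help. The sketch below is not part of the arithmetic regex; it is a stripped-down pattern (the subroutine name "group" is made up for this illustration) that only checks that parentheses are balanced, using the same DEFINE and (?&name) mechanics.

```php
<?php
// A subroutine that calls itself: a "group" is an opening parenthesis, then
// any mix of non-parenthesis characters and nested groups, then a closing one.
$balanced = '{
    (?(DEFINE)
        (?<group> \( (?: [^()] | (?&group) )* \) )
    )
    ^ (?&group) $
}x';

$results = [];
foreach (['(1+(2*3))', '()', '((1)', '(1))'] as $text) {
    $results[$text] = preg_match($balanced, $text) === 1;
    echo $text, ' => ', $results[$text] ? 'balanced' : 'not balanced', PHP_EOL;
}
```

The arithmetic regex works the same way, except that "factor" reaches itself indirectly through "expr" and "term".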

After the DEFINE block comes the regex itself: ^ \s* (?&expr) \s* $. The markers ^ and $ are, respectively, the beginning and the end of the string. Then we have optional whitespace at the beginning and at the end, and between them the expression.

Another important point is that I use the x modifier (at the end of the string), which makes the regex ignore line breaks and whitespace in the pattern. This allows you to write it the way shown above (with multiple spaces and line breaks, which keeps it more organized and a little easier to read). If I didn’t use x, the whole regex would have to be written on a single line with no spaces, which would make it even harder to read and understand.

Another detail is that instead of delimiting the regex with /, I used braces ({}). With this, slashes inside the regex do not need to be written as \/ (granted, there is only one slash in the regex, but I personally prefer to minimize the number of \ whenever possible).
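As a small illustration of both points (the x modifier and the brace delimiters), the sketch below writes the same toy pattern twice. The pattern itself is made up for this comparison: two integers separated by / or *.

```php
<?php
// Compact form: "/" delimiters, so the slash inside the class is escaped.
$compact = '/^\d+\s*[\/*]\s*\d+$/';

// Same pattern with brace delimiters and the x modifier: whitespace in the
// pattern is ignored and "#" starts a comment that runs to the end of the line.
$extended = '{
    ^ \d+          # first operand
    \s* [/*] \s*   # operator, optionally surrounded by whitespace
    \d+ $          # second operand
}x';

$same = true;
foreach (['690/9', '690 * 9', '690+9', '690//9'] as $exp) {
    $a = preg_match($compact, $exp);
    $b = preg_match($extended, $exp);
    $same = $same && ($a === $b);
    echo $exp, ': compact=', $a, ' extended=', $b, PHP_EOL;
}
```

Both forms should give the same result for every input; only the readability changes.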

Testing the regex:

$list = array('+(690.91)+1', '+(690/2)+20.01+(10*3)-010', '+(690/2)+20.01+(10*3)-10', '690',
  '690/09', '690/9', '1+1', '690 + 1', '10+(10*10)-20', '690.91+(1.3*(4-7/(3+6.4)))',
  '690.91+(01.3*(4-7/(3+6.4)))', '.24+3', '+(690.91+1', '+690.91+1-2-1-3-4-1',
  '690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))');
foreach ($list as $exp) {
    echo $exp. '='. (preg_match($regex, $exp) ? 'ok' : 'nok'), PHP_EOL;
}

It agrees with your examples, with the bonus of validating the cases that your regex cannot handle (nested parentheses, 1+1, expressions with spaces, etc.):

+(690.91)+1=ok
+(690/2)+20.01+(10*3)-010=nok
+(690/2)+20.01+(10*3)-10=ok
690=ok
690/09=nok
690/9=ok
1+1=ok
690 + 1=ok
10+(10*10)-20=ok
690.91+(1.3*(4-7/(3+6.4)))=ok
690.91+(01.3*(4-7/(3+6.4)))=nok
.24+3=ok
+(690.91+1=nok
+690.91+1-2-1-3-4-1=ok
690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))=ok
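Since you mentioned in the comments that you need to tell a plain number from a numerical expression, one possible way to reuse the same subroutines is sketched below. The function name classifyValue() and its return strings are hypothetical; the DEFINE block is the one from the regex above.

```php
<?php
// Hypothetical helper (not part of the original regex): classify a value as a
// plain number, an arithmetic expression, or neither.
function classifyValue(string $value): string
{
    // Same DEFINE block as in the answer's regex.
    $define = '(?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<sign>      (?: [-+] ))
       (?<expr>      (?: (?&term) (?: \s* [-+] \s* (?&expr))? ))
       (?<term>      (?: (?&factor) (?: \s* [/*] \s* (?&term))? ))
       (?<factor>    (?: (?&number) |
                         (?&sign) \s* (?&factor) |
                        \( \s* (?&expr) \s* \)))
    )';

    // Try the narrower rule first: the whole string is a single number.
    if (preg_match('{' . $define . ' ^ \s* (?&number) \s* $}x', $value) === 1) {
        return 'number';
    }
    // Otherwise check whether it is a full arithmetic expression.
    if (preg_match('{' . $define . ' ^ \s* (?&expr) \s* $}x', $value) === 1) {
        return 'expression';
    }
    return 'other';
}

foreach (['690', '-1.5', '690/9', '690/09', 'hello'] as $value) {
    echo $value, ' => ', classifyValue($value), PHP_EOL;
}
```

Checking "number" before "expr" matters, because every number is also a valid expression.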

But maybe regex is not the best solution for your case. Have you looked for a parser built specifically for arithmetic expressions? Nice as it is, regex is not always the best tool. It is also worth remembering that the regex only validates the expression; it does not calculate its value (for that, a better solution would be to use specific functions/APIs).
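To give an idea of what a hand-written parser could look like, here is a minimal recursive-descent sketch that validates and evaluates in one pass. Everything in it (the class name ExpressionEvaluator, its methods, the exception messages) is made up for illustration and is not from any library. It follows the same expr/term/factor structure as the regex, but uses loops so that 1-2-3 is evaluated left to right.

```php
<?php
final class ExpressionEvaluator
{
    private $src;
    private $pos = 0;

    public function evaluate(string $src): float
    {
        $this->src = $src;
        $this->pos = 0;
        $value = $this->expr();
        if ($this->peek() !== '') {
            throw new InvalidArgumentException("Unexpected input at offset {$this->pos}");
        }
        return $value;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (($op = $this->peek()) === '+' || $op === '-') {
            $this->pos++;
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (($op = $this->peek()) === '*' || $op === '/') {
            $this->pos++;
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0) {
                throw new InvalidArgumentException('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := number | ('+' | '-') factor | '(' expr ')'
    private function factor(): float
    {
        $c = $this->peek();
        if ($c === '+' || $c === '-') {
            $this->pos++;
            $value = $this->factor();
            return $c === '-' ? -$value : $value;
        }
        if ($c === '(') {
            $this->pos++;
            $value = $this->expr();
            if ($this->peek() !== ')') {
                throw new InvalidArgumentException("Missing ')' at offset {$this->pos}");
            }
            $this->pos++;
            return $value;
        }
        // Same number rule as the regex: no leading zeros; ".123" is allowed.
        $number = '/\G(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)/';
        if (preg_match($number, $this->src, $m, 0, $this->pos) === 1) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Expected a number at offset {$this->pos}");
    }

    // Skip whitespace and return the next character ('' at end of input).
    private function peek(): string
    {
        while ($this->pos < strlen($this->src) && ctype_space($this->src[$this->pos])) {
            $this->pos++;
        }
        return $this->pos < strlen($this->src) ? $this->src[$this->pos] : '';
    }
}

$evaluator = new ExpressionEvaluator();
echo $evaluator->evaluate('10 + ( 10 * 10 ) - 20'), PHP_EOL; // 90
```

Invalid input such as 690/09 or +(690.91+1 raises an InvalidArgumentException, so a try/catch around evaluate() would double as the validation step. Treat this as a starting point rather than a finished parser.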

  • No doubt my experience with regex is basic; I’m just getting started. Your solution is very good, and so is the regex lesson. I will apply it in my project and try to fully understand what you explained, for future implementations.

  • Before posting on this subject I did some research and even bought a book, but given the complexity, and having seen your solution, I am happy to leave this on record here so that other people can benefit from what was explained. I’ll look into the hint about a "parser specific for arithmetic expressions". Thank you for the support.

  • @Marciosouza Two regex sites that I like are this one and this one; they have very good tutorials with plenty of examples. As for books, I recommend this one (which is quite advanced and goes deep into the subject) and this one. Regex is an endless subject, so happy studying :-)
