Since you said that the expression can be "anything", I assume it can contain more than one pair of nested parentheses, for example:
690.91+(1.3*(4-7/(3+6.4)))
Your regex can't detect that, because of the nested parentheses. In fact, it has other problems as well. For example, you focused so much on validating the more complex cases (such as +(690.91)) that it ended up failing the simplest ones, such as 1+1.
Some details about your regex:
- [\(] is the same as \( (which, by the way, is also the same as [(], because inside brackets many characters do not need to be escaped with \). Either way, to match a single character you don't need the brackets, so to check for an opening parenthesis just use \(.
- The same goes for [.], which can be written simply as \.
- Brackets are useful when more than one character is possible (e.g. [ab] means "the letter a or the letter b"), but when you just want to match a single character they are unnecessary.
- The quantifier {1} means "exactly one occurrence", but by default anything you put in a regex already means one occurrence of it. So [1-9]{1} is the same as [1-9].
- In the first part of the regex you used ([1-9]{1}[0-9]+)[.]?[0-9]+ (a digit from 1 to 9, followed by one or more digits from 0 to 9, followed by an optional dot, followed by one or more digits from 0 to 9). That is, this part only validates numbers that have at least 3 digits (if there is no dot) or at least 2 digits before the dot. That's why the regex does not validate 1+1.
- In the other parts you use ? instead of +, which makes some parts optional (so the second number can have fewer than 3 digits, as in 690+1).
- You left both the opening and the closing parentheses optional. This means your regex accepts expressions that have an opening parenthesis without the corresponding closing one, or a closing parenthesis without the opening one.
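The points above can be checked directly with preg_match. This is only a small sketch: the sample strings are illustrative, and the ^ and $ anchors were added just so that each sample has to match as a whole.

```php
<?php
// [\(], \( and [(] all match a literal opening parenthesis.
foreach (['/[\(]/', '/\(/', '/[(]/'] as $pattern) {
    var_dump(preg_match($pattern, '(1+1)')); // int(1) for each pattern
}

// {1} is redundant: both patterns match exactly one digit from 1 to 9.
var_dump(preg_match('/^[1-9]{1}$/', '7') === preg_match('/^[1-9]$/', '7')); // bool(true)

// The first part of the original regex, anchored here only for the test.
$firstPart = '/^([1-9]{1}[0-9]+)[.]?[0-9]+$/';
foreach (['1', '12', '123', '1.5', '12.5'] as $number) {
    echo $number, ' => ', preg_match($firstPart, $number) ? 'ok' : 'nok', PHP_EOL;
}
```

Only 123 and 12.5 are reported as ok, which is why an operand such as the 1 in 1+1 is rejected.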
Correcting/improving...
For the numbers, you could use something like -?(?=[1-9]|0(?!\d))\d+(?:\.\d+)?
.
It starts with an optional minus sign (-?). Then comes a lookahead, the part inside (?= ... ), which checks whether what follows is a digit from 1 to 9 or a zero that is not followed by another digit (the character | means "or", and 0(?!\d) ensures that there is no digit after the zero). Thus the expression can contain the number zero alone (0), but not 09, for example.
Then we have \d+ (one or more digits), optionally followed by a dot and one or more digits (so we can have 10 and 10.1).
If you want the regex to also accept numbers such as .123 (which is another way of writing 0.123), just change it to -?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)
This means that it accepts numbers in the form already explained above, or a dot followed by one or more digits.
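To compare the two number patterns, here is a small sketch. The variable names $strict and $withLeadingDot are just illustrative, and the ^ and $ anchors were added so that each sample is tested as a whole.

```php
<?php
// Anchored with ^ and $ so each sample must match as a whole.
$strict = '/^-?(?=[1-9]|0(?!\d))\d+(?:\.\d+)?$/';
$withLeadingDot = '/^-?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)$/';

foreach (['0', '09', '10', '10.1', '-3.5', '0.5', '.123'] as $number) {
    printf(
        "%-5s strict: %-3s with leading dot: %s\n",
        $number,
        preg_match($strict, $number) ? 'ok' : 'nok',
        preg_match($withLeadingDot, $number) ? 'ok' : 'nok'
    );
}
```

Both patterns reject 09, and only the second one accepts .123.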
Then, for the arithmetic expression itself, it is not enough to write a lot of sub-expressions and make them optional. You also need to check, among other things, that the parentheses are balanced (for each opening parenthesis there is a corresponding closing one).
The other problem is that the expression can have several levels of nested parentheses, so it is not enough to do what you did (listing several alternatives one after another), because there are too many possibilities: expressions without parentheses, with 1 pair of parentheses in each operand, with several nested pairs in each operand, etc.
Not to mention that your regex limits the expression to only 6 operands (e.g. +690.91+1-2-1-3-4). If we add one more (as in +690.91+1-2-1-3-4-1), it is no longer validated. (This case could be solved by changing the ? at the end of each group to * (zero or more occurrences), but that still would not solve the other problems already mentioned.)
The solution, in this case, is to use recursive patterns and subroutines (regex deliberately copied from here and adapted to the case in question):
$regex = '{
    (?(DEFINE)
        (?<number> (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
        (?<sign>   (?: [-+] ))
        (?<expr>   (?: (?&term) (?: \s* [-+] \s* (?&expr))? ))
        (?<term>   (?: (?&factor) (?: \s* [/*] \s* (?&term))? ))
        (?<factor> (?: (?&number) |
                       (?&sign) \s* (?&factor) |
                       \( \s* (?&expr) \s* \)))
    )
    ^ \s* (?&expr) \s* $
}x';
This regex is quite complex. The first part (inside the (?(DEFINE) ... ) block) defines subroutines. Basically, named "subexpressions" are created. The syntax (?<name> ... ) defines a subroutine, and the syntax (?&name) is replaced by the corresponding regex.
For example, the first subroutine is called "number" (its definition is within the section delimited by (?<number>
), and it corresponds to the regex that verifies a number (the same already mentioned above). Then in the other subroutines we see the use of (?&number)
- this part is replaced by the corresponding regex.
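To see the subroutine mechanism in isolation, here is a smaller, hypothetical example (not part of the original regex): it defines only "number" and reuses it to validate a comma-separated list of numbers.

```php
<?php
// Hypothetical example: a comma-separated list of numbers, reusing one subroutine.
$regex = '{
    (?(DEFINE)
        (?<number> -? (?= [1-9] | 0(?!\d) ) \d+ (?: \.\d+ )? )
    )
    ^ (?&number) (?: \s* , \s* (?&number) )* $
}x';

foreach (['10', '10, 2.5, -3', '10, 09', '10,'] as $list) {
    echo $list, ' => ', preg_match($regex, $list) ? 'ok' : 'nok', PHP_EOL;
}
```

The definition is written once and called twice with (?&number), so a change to the number rule only has to be made in one place.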
Then we have the subroutine "sign", which matches the sign ([-+], a minus or plus sign). Next we define the subroutines "factor", "term" and "expr":
- an "expr" can be a "term" alone, or a "term" added to/subtracted from another "expr"
- a "term" can be a "factor" alone, or a "factor" multiplied/divided by another "term"
- a "factor" can be a "number", or a "factor" preceded by a "sign", or an "expr" in parentheses
Note that the structure is recursive (so the regex can handle several levels of nested parentheses and expressions of any size). And in many places I use \s* (zero or more whitespace characters), so the regex allows spaces in the expression.
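The recursion is easier to see in a stripped-down case. The sketch below is a hypothetical example that only checks whether parentheses are balanced; the subroutine name "group" is made up for the illustration.

```php
<?php
// Hypothetical, stripped-down example: only checks that parentheses are balanced.
$balanced = '{
    (?(DEFINE)
        # "(" followed by any mix of non-parenthesis characters and nested groups, then ")"
        (?<group> \( (?: [^()]++ | (?&group) )* \) )
    )
    ^ (?&group) $
}x';

foreach (['()', '(1+(2*3))', '((1)', '(1))'] as $text) {
    echo $text, ' => ', preg_match($balanced, $text) ? 'ok' : 'nok', PHP_EOL;
}
```

Because "group" calls itself, the same short definition covers any nesting depth, which a fixed list of optional parentheses cannot do.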
After the block DEFINE
, there is the regex itself: ^ \s* (?&expr) \s* $
. The markers ^
and $
are, respectively, the beginning and end of the string. Then we have optional spaces at the beginning and end, and in the middle of them we have the expression.
Another important point is that I use the x modifier (at the end of the string), which makes the regex ignore line breaks and whitespace. This lets you write it the way shown above (with multiple spaces and line breaks, which keeps it more organized and a little easier to read). Without the x, the whole regex above would have to be written on a single line with no spaces, which would make it even harder to read and understand.
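A small sketch of the effect of the x modifier, using a simpler decimal-number pattern chosen just for illustration (the variable names are hypothetical):

```php
<?php
// The same pattern written compactly and in extended (x) mode.
$compact  = '/^\d+(?:\.\d+)?$/';
$extended = '/
    ^
    \d+            # integer part
    (?: \. \d+ )?  # optional fractional part
    $
/x';

foreach (['10', '10.5', '10.'] as $text) {
    var_dump(preg_match($compact, $text) === preg_match($extended, $text)); // bool(true)
}

// With x, a literal space in the pattern is ignored, so this matches "11", not "1 1".
var_dump(preg_match('/^1 1$/x', '11'));  // int(1)
var_dump(preg_match('/^1 1$/x', '1 1')); // int(0)
```

This is also why the main regex uses \s* wherever spaces in the input are allowed: a plain space in the pattern would simply be dropped.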
Another detail is that instead of delimiting the regex with /, I used curly braces ({}). This way, slashes inside the regex do not need to be written as \/ (granted, there is only one slash in the regex, but I personally prefer to minimize the number of \ whenever possible).
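For comparison, a short sketch of the same pattern with both kinds of delimiters (the pattern and variable names are only illustrative):

```php
<?php
// With / as the delimiter, a literal slash has to be escaped.
$withSlashes = '/^\d+\/\d+$/';
// With {} as the delimiters, the slash can be written as is.
$withBraces  = '{^\d+/\d+$}';

var_dump(preg_match($withSlashes, '690/9')); // int(1)
var_dump(preg_match($withBraces, '690/9'));  // int(1)
```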
Testing the regex:
$list = array('+(690.91)+1', '+(690/2)+20.01+(10*3)-010', '+(690/2)+20.01+(10*3)-10', '690',
    '690/09', '690/9', '1+1', '690 + 1', '10+(10*10)-20', '690.91+(1.3*(4-7/(3+6.4)))',
    '690.91+(01.3*(4-7/(3+6.4)))', '.24+3', '+(690.91+1', '+690.91+1-2-1-3-4-1',
    '690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))');
foreach ($list as $exp) {
    echo $exp . '=' . (preg_match($regex, $exp) ? 'ok' : 'nok'), PHP_EOL;
}
It agrees with your examples, with the bonus of validating the cases that your regex cannot (nested parentheses, 1+1, expressions with spaces, etc.):
+(690.91)+1=ok
+(690/2)+20.01+(10*3)-010=nok
+(690/2)+20.01+(10*3)-10=ok
690=ok
690/09=nok
690/9=ok
1+1=ok
690 + 1=ok
10+(10*10)-20=ok
690.91+(1.3*(4-7/(3+6.4)))=ok
690.91+(01.3*(4-7/(3+6.4)))=nok
.24+3=ok
+(690.91+1=nok
+690.91+1-2-1-3-4-1=ok
690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))=ok
But maybe regex is not the best solution for your case. Have you looked for a parser made specifically for arithmetic expressions? Nice as it is, regex is not always the best tool. It is also worth remembering that the regex only validates the expression; it does not calculate its value (and for that, a better solution would be to use specific functions/APIs).
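As a rough idea of what such a parser could look like, here is a minimal recursive-descent sketch that follows the same expr/term/factor structure and also computes the value. The class name ExprParser and its evaluate method are hypothetical, written for this illustration and not taken from any library; it assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: a tiny recursive-descent evaluator using the same
// expr/term/factor structure as the regex. Loops are used instead of right
// recursion so that 2-3-4 is evaluated left to right.
final class ExprParser
{
    private $src;
    private $pos = 0;

    private function __construct(string $src)
    {
        $this->src = $src;
    }

    public static function evaluate(string $src): float
    {
        $parser = new self($src);
        $value = $parser->expr();
        if ($parser->peek() !== '') {
            throw new InvalidArgumentException("Unexpected input at offset {$parser->pos}");
        }
        return $value;
    }

    // Skips whitespace and returns the next character ('' at end of input).
    private function peek(): string
    {
        while (isset($this->src[$this->pos]) && ctype_space($this->src[$this->pos])) {
            $this->pos++;
        }
        return $this->src[$this->pos] ?? '';
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (($op = $this->peek()) === '+' || $op === '-') {
            $this->pos++;
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (($op = $this->peek()) === '*' || $op === '/') {
            $this->pos++;
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0.0) {
                throw new DivisionByZeroError('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := number | ('+' | '-') factor | '(' expr ')'
    private function factor(): float
    {
        $c = $this->peek();
        if ($c === '+' || $c === '-') {
            $this->pos++;
            $value = $this->factor();
            return $c === '-' ? -$value : $value;
        }
        if ($c === '(') {
            $this->pos++;
            $value = $this->expr();
            if ($this->peek() !== ')') {
                throw new InvalidArgumentException('Missing closing parenthesis');
            }
            $this->pos++;
            return $value;
        }
        // Same number rule as the regex: no leading zeros, optional fraction, or ".123".
        $number = '/\G(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)/';
        if (preg_match($number, $this->src, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Unexpected input at offset {$this->pos}");
    }
}

var_dump(ExprParser::evaluate('10+(10*10)-20')); // float(90)
```

Invalid input such as +(690.91+1 throws an InvalidArgumentException instead of returning a value, so the same code can serve both to validate and to calculate.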
Can your expression be anything, or will it always be in this format? Are you familiar with reverse Polish notation? Maybe your problem would be easier to solve with it.
– Woss
Since the content comes from a mysql database, it can be anything, so I need to know if it is a number or numerical expression to change the path.
– Marcio Souza
As for Polish notation, I'm not familiar with it, but I would like to apply it in PHP 7. Can you give me an example?
– Marcio Souza