Regular expression from the start of a word until the next number

2

I need a regular expression to capture groups from a string.

Let’s imagine the following string:

(a >3) and ( b + c = 4 and M < 45) and (d + e = 6 and M between 40 and 60) and z>10

I need to capture the groups M < 45 and M between 40 and 60 (shown in bold in the original post). Note that the first one runs from M up to the first number found: M < 45.

However, the same rule does not work for the second case, which contains between 40 and 60: it would stop at M between 40.

Can someone help me build a regular expression that captures both cases?

I don’t know much about regular expressions, so I am asking for the community’s help.

2 answers

3

Regular expressions require well-defined rules. Since you did not define a rule for when the expression should keep looking for numbers, and only provided an example, the best I can suggest is something like this: /M\D*[\d\.]+\s*[a-zA-Z]*\s*[\d\.]+/


Breaking it down:

M is the first character of the sequence

\D* is an optional sequence of non-digit characters (0 or more), so >, >= and between are all valid sequences

[\d\.]+ is a sequence of digits or dots (1 or more). I am allowing for the possibility of a float.

\s*[a-zA-Z]*\s* is an optional sequence of spaces, followed by an optional sequence of letters, followed by another optional sequence of spaces. Note that all three are optional.

The final [\d\.]+ is a second sequence of digits or dots (1 or more).


Usage:

$exp = '/M\D*[\d\.]+\s*[a-zA-Z]*\s*[\d\.]+/';
$str = '(a >3) and ( b + c = 4 and M < 45) and (d + e = 6 and M between 40 and 60) and z>10';

preg_match_all($exp, $str, $match);
$palavras_encontradas = $match[0];

See working here.

  • Hello, I noticed something: if the expression is M < 4 instead of M < 45, it is not matched. It only works when there are at least two digits. For example, M < 4 is not found, M < 45 is found, M < 100 is found. Any suggestions?

  • @Bruno Have you seen my answer below? It works for this case: http://ideone.com/lTrEpf and http://ideone.com/n1lk05

  • One detail (which I also often forget): inside the brackets the dot does not need to be escaped, so it could be just [\d.]+

3


To begin with, regex may not be the most suitable tool, because what you probably need is a parser for logical expressions (or something similar). Just look at how complicated a regex can become, depending on the cases we want to handle.


The other answer already shows a regex that is not very simple, and yet it relies on some assumptions (and consequently has some possible problems):

  • it assumes that there are no variables whose names have more than one letter, that is, that nothing like Mx > 10 exists (because it matches these cases as well)
  • the regex for numbers ([\d\.]+) is a heavily simplified version, since it considers .... and 1.2.3.4 to be valid "numbers". So M > ... is considered a valid expression
  • the part that handles the text between the two numbers (\s*[a-zA-Z]*\s*) accepts any sequence of letters, or even no letters at all (only spaces), since this whole part uses the quantifier *, which means "zero or more occurrences", so there could even be "zero letters" between the numbers
  • there is also the fact that it only recognizes numbers with at least two digits (M > 1 is not recognized)
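
The four points above can be checked with a short sketch. The helper name firstRegexMatch and the sample inputs are my own assumptions for illustration; only the regex itself comes from the other answer:

```php
<?php
// Regex from the other answer, copied unchanged.
const FIRST_REGEX = '/M\D*[\d\.]+\s*[a-zA-Z]*\s*[\d\.]+/';

// Hypothetical helper: returns the matched text, or null when nothing matches.
function firstRegexMatch(string $input): ?string
{
    return preg_match(FIRST_REGEX, $input, $m) === 1 ? $m[0] : null;
}

// Hypothetical inputs, one per limitation listed above.
$samples = [
    'Mx > 10', // variable name with more than one letter
    'M > ...', // dots only, no digits at all
    'M > 1 2', // no word between the two numbers
    'M > 1',   // a single digit
];

foreach ($samples as $s) {
    $found = firstRegexMatch($s);
    echo $s, ' => ', $found === null ? 'no match' : "matched \"$found\"", PHP_EOL;
}
```

The first three inputs are matched in full even though they are probably not what was intended, while the last one, which looks perfectly valid, is not matched.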

Of course, that doesn’t mean it is totally wrong (except for the last item, maybe). If you can guarantee that your inputs will always be valid expressions, there is no problem in using a simpler regex. It’s always important to balance complexity and convenience: if the regex matches everything you want, and also ignores everything you don’t want, there is no need to complicate it for nothing.

That being said, if you want to be stricter, you can use a more restrictive regex (the price to pay is increased complexity):

$regex = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<variable>  (?: \bM\b ) )
       (?<clause>    (?: 
                         (?&variable) \s* [<>=] \s* (?&number) |
                         (?&variable) \s+ between \s+ (?&number) \s+ and \s+ (?&number)
                     ))
    )
    (?&clause)
}x';

$str = '(a > 3) and ( b + c = 4 and M < 45) and (d + e = 6 and M between 40 and 60) and z>10';
if (preg_match_all($regex, $str, $matches)) {
    foreach ($matches[0] as $m) {
      echo $m. PHP_EOL;
    }
}

The (?(DEFINE) ... ) block serves to define subroutines. Each section of the form (?<name> ... ) defines a subroutine: a specific "sub-regex". We can then write (?&name) to use the expression defined by the subroutine "name", and so avoid repeating the same regex several times.

Note, for example, the regex for the number (which accepts negative numbers and decimal places, rejects values like 000, among other cases), and see how many times it is used in the "clause" subroutine. It would be possible to write this regex without subroutines, but the regex for numbers would have to be repeated 3 times, and the result would be huge and unreadable.
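
To see what the "number" subroutine accepts on its own, here is a minimal sketch. The ^ and $ anchors, the helper name isNumber and the sample inputs are assumptions I added for this test; the subroutine body is copied from the regex above:

```php
<?php
// The "number" subroutine from the answer, called between ^ and $ so that
// the whole input must be a number (the anchors are added only for this test).
const NUMBER_REGEX = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
    )
    ^ (?&number) $
}x';

// Hypothetical helper: true when the whole input is accepted as a number.
function isNumber(string $input): bool
{
    return preg_match(NUMBER_REGEX, $input) === 1;
}

// Hypothetical sample inputs.
foreach (['45', '-3.14', '.5', '0', '000', '1.2.3.4', '...'] as $s) {
    echo $s, ' => ', isNumber($s) ? 'number' : 'not a number', PHP_EOL;
}
```

With these inputs, 45, -3.14, .5 and 0 are accepted, while 000, 1.2.3.4 and ... are rejected.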

I also used the x flag, which makes the engine ignore spaces and line breaks in the pattern and allows the regex to be written this way, making it a little less confusing. Without this flag, I would have to write everything on a single line with no spaces, and it would be unreadable.

Basically, the regex defines the variable as \bM\b. The shortcut \b (word boundary) marks a "boundary between words" (a position with an alphanumeric character before and a non-alphanumeric character after, or vice versa). This guarantees that I only match the variable "M", and not "Mx" or "AM".
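
A quick way to see the effect of \b is a sketch like the one below, where the helper name hasVariableM and the sample inputs are hypothetical:

```php
<?php
// Hypothetical helper: true when the input contains the standalone variable M.
function hasVariableM(string $input): bool
{
    return preg_match('/\bM\b/', $input) === 1;
}

// Hypothetical sample inputs.
foreach (['M < 45', '(M=1)', 'Mx > 10', 'AM = 3'] as $s) {
    echo $s, ' => ', hasVariableM($s) ? 'M found' : 'M not found', PHP_EOL;
}
```

Digits and the underscore also count as word characters for \b, so names such as M1 or M_x are rejected in the same way as Mx.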

In the "clause" subroutine I define that it can be two different things, using alternation (the character |, which means "or"). That is, a "clause" can be one of two options:

  • "M operator number": where the operator is [<>=] (greater than, less than, or equal to). If you also want to accept >= and <=, you can replace this part with (?:[<>]=?|=).
  • "M between number and number": here I reuse the subroutine that defines the number, and make it clear that the text can only be "between number and number". It is better to be more specific, to avoid cases like the ones mentioned above (accepting any text between the numbers, or none at all)
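
As a sketch of the >= and <= variant mentioned in the first item, the operator can be moved into its own subroutine. The subroutine name "operator", the helper name findClausesWithOperators and the test string are my assumptions; the rest is the same regex as above:

```php
<?php
// Same structure as the regex above, with the operator pulled into a
// hypothetical "operator" subroutine that also accepts >= and <=.
const CLAUSE_REGEX = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<variable>  (?: \bM\b ) )
       (?<operator>  (?: [<>]=? | = ) )
       (?<clause>    (?:
                         (?&variable) \s* (?&operator) \s* (?&number) |
                         (?&variable) \s+ between \s+ (?&number) \s+ and \s+ (?&number)
                     ))
    )
    (?&clause)
}x';

// Hypothetical helper: returns every full clause found in the input.
function findClausesWithOperators(string $input): array
{
    preg_match_all(CLAUSE_REGEX, $input, $matches);
    return $matches[0]; // index 0 holds the full matches
}

// Hypothetical test string covering each operator.
$str = 'M <= 45 and M >= 10 and M = 7 and M < 3 and M between 40 and 60';
foreach (findClausesWithOperators($str) as $clause) {
    echo $clause, PHP_EOL;
}
```

Each of the five clauses in the test string is printed on its own line.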

Therefore, the regex will only match these cases (assuming that M will not be part of arithmetic expressions such as "M + 1 > 3", because then we would have to handle those cases as well).

The result is:

M < 45
M between 40 and 60

Of course, if the expressions are always well formed and the numbers are only integers, you can simplify to:

$regex = '/\bM\b(\s*[<>=]\s*\d+|\s+between\s+\d+\s+and\s+\d+)/';

It basically follows the same logic as the previous one, but condensed into a single regex. The only significant difference is the check for numbers, which uses only \d+ (one or more digits). Breaking the regex into parts, we have:

  • \bM\b: the variable "M" (using \b to avoid names like "Mx" or "AM")
  • an alternation with two options:
    • \s*[<>=]\s*\d+: zero or more spaces (\s*), followed by the operator (you can replace [<>=] with (?:[<>]=?|=) if you also want to accept >= and <=), followed by zero or more spaces and a number, or
    • \s+between\s+\d+\s+and\s+\d+: one or more spaces (\s+), "between", spaces, number, spaces, "and", spaces, number
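
A minimal usage sketch of this simplified regex follows. The helper name findClauses and the second test string are hypothetical; the regex and the first string come from the posts above:

```php
<?php
// Simplified regex from the answer, copied unchanged.
const SIMPLE_REGEX = '/\bM\b(\s*[<>=]\s*\d+|\s+between\s+\d+\s+and\s+\d+)/';

// Hypothetical helper: returns every full clause found in the input.
function findClauses(string $input): array
{
    preg_match_all(SIMPLE_REGEX, $input, $matches);
    return $matches[0]; // index 0 holds the full matches, index 1 the group
}

// The string from the question.
$str = '(a >3) and ( b + c = 4 and M < 45) and (d + e = 6 and M between 40 and 60) and z>10';
print_r(findClauses($str));

// Single-digit numbers are matched as well.
print_r(findClauses('M < 4 and M = 100'));
```

For the question's string this returns M < 45 and M between 40 and 60. The second call returns M < 4 and M = 100, so the single-digit case raised in the comments on the other answer is covered.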

If you want "between" and "and" to be case insensitive (whether they are uppercase or lowercase), it is possible to use:

$regex = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<variable>  (?: \bM\b ) )
       (?<clause>    (?: 
                         (?&variable) \s* [<>=] \s* (?&number) |
                         (?&variable) \s+ (?i) between \s+ (?&number) \s+ and (?-i) \s+ (?&number)
                     ))
    )
    (?&clause)
}x';

The (?i) marker says "from here on, the regex is case insensitive", so I put it just before "between". Right after "and" I turn that mode off again with (?-i). That is, only the part "between number and" is affected (and since the number regex contains no letters, it makes no difference whether that part is case insensitive or not).

I didn’t make the whole regex case insensitive so that it only considers the variable "M" (ignoring variables called "m").
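
The effect of (?i) and (?-i) can be checked with a small sketch. The helper name matchClause and the sample inputs are assumptions; the regex is the one shown above:

```php
<?php
// Regex from the answer: only "between ... and" is case insensitive.
const CI_REGEX = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<variable>  (?: \bM\b ) )
       (?<clause>    (?:
                         (?&variable) \s* [<>=] \s* (?&number) |
                         (?&variable) \s+ (?i) between \s+ (?&number) \s+ and (?-i) \s+ (?&number)
                     ))
    )
    (?&clause)
}x';

// Hypothetical helper: returns the first clause found, or null.
function matchClause(string $input): ?string
{
    return preg_match(CI_REGEX, $input, $m) === 1 ? $m[0] : null;
}

// Hypothetical sample inputs.
foreach (['M BETWEEN 40 AND 60', 'M Between 1 And 2', 'm between 40 and 60'] as $s) {
    echo $s, ' => ', matchClause($s) ?? 'no match', PHP_EOL;
}
```

The first two inputs are matched in full, while the last one is rejected because the variable itself is still case sensitive.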

  • Thank you for your contribution :)

  • Just one more question: it only matches M < 45, but I would also need M <= 45 or M >= 45. Thanks

  • @Bruno This is mentioned in the answer: "If you want to consider also >= and <=, you can exchange this excerpt for (?:[<>]=?|=)"

  • :) Thank you very much
