Using REGEX in PHP to capture any number that is not within single quotes

4

I have been studying regex for some time and now I have a problem: I need to capture all the numbers in a string, including decimals, except those inside single quotes.

I’m creating a kind of viewer for PHP code in order to learn how to use regex better. I have the following regex working; it returns all the decimal numbers in a given string:

preg_match_all('/(\d+\.\d+)/', $text, $matches, PREG_SET_ORDER, 0);

What I would like is to return not only the decimals, but every number that is not inside single quotes. Any idea how I could do that? I would appreciate any light on this, because I’m totally in the dark: I’ve tried several regex combinations and none of them worked. I always run my tests on regex101.com.

NOTE: I can already return all the numbers INSIDE single quotes, just not the ones outside them:

preg_match_all('/(\'(\d+)\')/', $text, $matches, PREG_SET_ORDER, 0);

2 answers

7


Although it is possible to write a regex (possibly a very complicated one) involving lookaheads and lookbehinds, I find it easier to use a small "trick" that relies on capture groups.

Basically, if you have a string like this:

$texto = "123 abc '456' def789'112' ghi";

From what I understand, you want to capture only 123 and 789, because those numbers are not in single quotes ('). Then you could use an expression like this:

preg_match_all("/\'\\d+\'|(\\d+)/", $texto, $matches);

This regex uses alternation (|) to say you want one thing or the other. The two alternatives are:

  1. a number in single quotes: '\d+', or
  2. a number without quotes, wrapped in parentheses to form a capture group: (\d+)

Note that some backslashes are themselves escaped (written as \\) because the regex is inside a double-quoted PHP string.

With that, each match of the regex falls into one of 2 cases:

  • if the number is between single quotes, it matches the first alternative
  • otherwise, it matches the second alternative

In the first case the capture group is not filled; in the second case it is.

So to get the numbers that are not in single quotes, just check whether the capture group is filled. To get the array back in a format that makes this check easier, we can use the PREG_SET_ORDER option:

$texto = "123 abc '456' def789'112' ghi";
preg_match_all("/\'\\d+\'|(\\d+)/", $texto, $matches, PREG_SET_ORDER, 0);
var_dump($matches);

This code produces the following output:

array(4) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "123"
    [1]=>
    string(3) "123"
  }
  [1]=>
  array(1) {
    [0]=>
    string(5) "'456'"
  }
  [2]=>
  array(2) {
    [0]=>
    string(3) "789"
    [1]=>
    string(3) "789"
  }
  [3]=>
  array(1) {
    [0]=>
    string(5) "'112'"
  }
}

Notice that for the matches that fall in the second case (number is not between single quotes), the array has 2 elements. The first is the whole match, and the second is the capture group (in this case they are equal, but depending on the expression they may differ).

When the number is in quotes, the respective array has only one element, because the capture group was not filled.

Then just loop over the array of matches and check which of the inner arrays has the capture group set (i.e., check whether its size is greater than 1):

foreach ($matches as $m) {
    if (count($m) > 1) { // capture group filled (number is not in quotes)
        echo $m[1]. "\n";
    }
}

The output of this foreach is:

123
789
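
The steps above can be wrapped in a small helper. This is only a sketch: the function name numbersOutsideQuotes is hypothetical (it does not come from the tutorial), and it assumes, like the example string, that quoted values contain digits only.

```php
<?php
// Hypothetical helper (the name is illustrative): returns the numbers that are
// NOT wrapped in single quotes, using the alternation + capture group trick.
function numbersOutsideQuotes(string $text): array
{
    // First alternative consumes quoted numbers without capturing them;
    // second alternative captures unquoted numbers in group 1.
    preg_match_all('/\'\d+\'|(\d+)/', $text, $matches, PREG_SET_ORDER);

    $numbers = [];
    foreach ($matches as $m) {
        // With PREG_SET_ORDER, index 1 is absent when the quoted branch matched.
        if (isset($m[1])) {
            $numbers[] = $m[1];
        }
    }
    return $numbers;
}

print_r(numbersOutsideQuotes("123 abc '456' def789'112' ghi"));
```

Using isset($m[1]) instead of count($m) > 1 is just a stylistic choice here; both check the same thing for a pattern with a single capture group.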

If you want numbers with decimals, just replace \d+ with \d+\.\d+ (which inside the double-quoted string would be \\d+\\.\\d+), or with whatever other expression you are using to capture the numbers.

If the decimal places are optional, for example, you can use \d+(?:\.\d+)?. It is not the specific focus of the question, but validating numbers can get complicated, because it all depends on which cases you want to consider.
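
As a rough illustration of the optional-decimal variant, the sketch below applies \d+(?:\.\d+)? to both alternatives. The sample string and the variable name $outside are made up for this example. Changing both sides matters: if only the capture group were changed, '2.5' would no longer be consumed by the quoted alternative and 2.5 would be captured.

```php
<?php
$texto = "1.5 abc '2.5' def 7 '8'";

// Same alternation as before, with \d+(?:\.\d+)? on both sides, so that
// '2.5' is consumed as a quoted number and 1.5 is captured as a whole.
preg_match_all('/\'\d+(?:\.\d+)?\'|(\d+(?:\.\d+)?)/', $texto, $matches, PREG_SET_ORDER);

$outside = [];
foreach ($matches as $m) {
    if (count($m) > 1) { // capture group filled: number is not in quotes
        $outside[] = $m[1];
    }
}
print_r($outside);
```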


As @fernandosavio pointed out in the comments, it is also possible to delimit the string with single quotes, so that \ does not need to be written as \\:

preg_match_all('/\'\d+\'|(\d+)/', $texto, $matches, PREG_SET_ORDER, 0);

See an example here.


This "trick" was based on this tutorial.
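
The examples above assume that a quoted value contains only digits. If the quotes may contain other text (as in PHP source code), one possible variation is to let the first alternative consume the entire single-quoted string. This goes beyond the original example and is only a sketch: the sample string is invented, and escaped quotes such as \' inside a string are not handled.

```php
<?php
$text = "price 10 'item 42' tax 0.5 'v1.2'";

// '[^']*' consumes a whole single-quoted string (escaped quotes are NOT handled);
// any number left outside the quotes is captured in group 1.
preg_match_all('/\'[^\']*\'|(\d+(?:\.\d+)?)/', $text, $matches, PREG_SET_ORDER);

$outside = [];
foreach ($matches as $m) {
    if (isset($m[1])) { // only set when the unquoted alternative matched
        $outside[] = $m[1];
    }
}
print_r($outside);
```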

  • 1

    Really perfect. It gave the result I was hoping for without any problems. By the way, excellent explanation of everything involved. I will now read the external links more carefully to go deeper into the subject and improve my skills. I am using preg_match_all("/\'\d+(?:\.\d+)?\'|(\d+(?:\.\d+)?)/", $texto, $matches, PREG_SET_ORDER, 0); Thank you again.

  • 1

    @Diegoborges Two sites that I find very good for learning regex are this and this <- The latter even covers some very advanced topics, and that is where I got this solution. And get ready to read a lot, because regex is an endless subject! The more I study, the more I realize I do not know the half of it.. :-)

  • 2

    Just a hint: when writing a regex in PHP it is easier to use single quotes, because single-quoted strings do not interpret escape sequences such as \t. Ex.: echo "0\t1"; // prints 0, a tab, then 1 and echo '0\t1'; // prints 0\t1 literally

  • @fernandosavio Thanks for the tip. It’s been a long time since I’ve programmed professionally in PHP (just a few scripts and Stack Overflow answers from time to time), so I’m pretty "rusty" in this language :-) I updated the answer, thanks!

  • 1

    I’d like to be in your situation. hahahaha

  • 2

    That was a great lesson! Regex is a really nice topic to practice and discuss.

2

I created a REGEX that I believe meets your need:

(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)

Test it with:

12,2 12.1021 14 '51' '1' '23323' 12

The only rule for it to work is that the numbers are separated by spaces.

Here is the explanation, @Guilhermenascimento:

(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)
  ^       ^            ^       ^
  .       .            .       ................ must be the end of the string or be followed by whitespace
  .       .            ............... matches plain integers
  .       .
  .       ................. matches numbers that contain (. or ,) followed by more digits
  .
  ............... positive lookbehind (whitespace before) or the start of the string

The numbers that will be matched are:

12,2 
12.1021 
14
12

The code would be:

$string = "12,2 12.1021 14 '51' '1' '23323' 12";

preg_match_all("/(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)/", $string, $output_array);

print_r($output_array);
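
To illustrate the rule that the numbers must be separated by spaces, the sketch below runs the same pattern against the test string from this answer and against the string from the other answer, where 789 is glued to other characters. The variable names $pattern, $spaced and $glued are chosen only for this example.

```php
<?php
// Single-quoted PHP string, so \s and \d reach the regex engine unchanged.
$pattern = '/(?<=\s|^)(\d+[.,]{1}\d+|\d+)+(?=\s|$)/';

// Numbers separated by spaces: the unquoted ones are matched.
preg_match_all($pattern, "12,2 12.1021 14 '51' '1' '23323' 12", $spaced);

// "789" is glued to "def" and to a quote, so the lookarounds reject it.
preg_match_all($pattern, "123 abc '456' def789'112' ghi", $glued);

print_r($spaced[0]); // full matches for the first string
print_r($glued[0]);  // full matches for the second string
```

In the second string only 123 is matched, which is why this approach fits input where every number stands alone between spaces.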

See it working
