Operator "AND" in Regex

I’m trying to use an AND operator in a regex. I have looked at that link, but it did not help with my case. For example, given the text:

"FORT/310117/200826/12979898000170"

I want to return the numbers with exactly 6 digits and the number with exactly 14 digits. If either one is missing, I do not want anything returned (hence the need for an "AND" operator). That is, if the input is:

"FORT/310117/200826" or "FORT/12979898000170"

nothing should be returned.

The regex I’m using is built with OR, so it matches whenever it finds 6 or 14 digits. How can I change it to AND?

(\b\d{6}\b)|(\b\d{14}\b)

  • Which programming language are you using?

  • I’m using PHP.

  • Would an example in JavaScript help?

2 answers

8


An alternative is to use lookaheads:

^(?=.*\b\d{6}\b)(?=.*\b\d{14}\b)

Each lookahead is delimited by (?= and ). The idea is that the lookahead searches for the expression inside it and, if that expression is found, the engine goes back to where it was and continues evaluating the rest of the regex.
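To make the "goes back to where it was" part concrete, here is a minimal sketch with a made-up string and simplified patterns (they are illustrations only, not the regex from this answer). It shows that a lookahead checks what lies ahead without consuming any characters:

```php
<?php
// Illustrative string and patterns, chosen only for this demonstration.
$str = "abc123";

// The regex is just the anchor plus a lookahead: the lookahead finds a digit
// somewhere ahead, but nothing is consumed, so the overall match is empty.
preg_match('/^(?=.*\d)/', $str, $onlyLookahead);

// After the lookahead succeeds, matching resumes at the start of the string,
// so [a-z]+ still sees "abc".
preg_match('/^(?=.*\d)[a-z]+/', $str, $lookaheadThenLetters);

var_dump($onlyLookahead[0]);        // string(0) ""
var_dump($lookaheadThenLetters[0]); // string(3) "abc"
```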

In this case, the expression within the first lookahead (?=.*\b\d{6}\b) means:

  • .*: zero or more characters (any character)
  • \b: a word boundary, that is, a position in the string that delimits a word (there is an alphanumeric character or _ before it and a non-alphanumeric character after it, or vice versa).
  • \d{6}: exactly 6 digits
  • \b: another word boundary. With this, we indicate that the 6 digits are delimited by non-alphanumeric characters (so there is no risk of matching part of a sequence longer than 6 digits).

This lookahead comes right after the anchor ^, which means "beginning of the string". That is, if the lookahead finds a 6-digit sequence somewhere in the string, the engine goes back to where it was (in this case, the beginning of the string) and continues evaluating the rest of the regex.

In this case, the rest of the regex is another lookahead, very similar to the first, but looking for a sequence of exactly 14 digits.

That is, this regex first checks for a 6-digit sequence. Since \d{6} is inside a lookahead, the engine goes back to the beginning of the string and evaluates the second lookahead, searching for a 14-digit sequence.

If either lookahead fails, the regex as a whole fails. Example:

$regex = '/^(?=.*\b\d{6}\b)(?=.*\b\d{14}\b)/';

echo preg_match($regex, "FORT/310117/200826/12979898000170") . "\n"; // 1
echo preg_match($regex, "FORT/310117/200826/") . "\n"; // 0
echo preg_match($regex, "FORT/12979898000170") . "\n"; // 0

preg_match returns 1 if the regex finds a match and 0 if it does not. Above, it returned 1 only for the string that contains both sequences (6 and 14 digits). For the strings with only one of them, it returned 0.

This regex also does not treat a 7-digit sequence as a 6-digit one, for example:

// has a 14-digit sequence, but no 6-digit one (only a 7-digit one)
echo preg_match($regex, "FORT/3101176/12979898000170") . "\n"; // 0

If the delimiter is always /, you can replace \b with the slash:

$regex = '/^(?=.*\/\d{6}(\/|$))(?=.*\/\d{14}(\/|$))/';

The difference in this case is that the slash must be escaped and written as \/ (so that it is not confused with the regex delimiters at the beginning and end of the pattern). Also, after the sequence of digits I put (\/|$), which means "a slash or the end of the string ($)" (with \b this is not necessary, because \b already treats the end of the string as a word boundary).
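As a side note, PHP lets you pick a different pattern delimiter, which avoids the escaping altogether. The sketch below is a variation on the regex above (the choice of # and the variable names are mine, not part of the original answer) and runs it against the same kinds of input:

```php
<?php
// Same idea as the slash-based regex above, but using "#" as the pattern
// delimiter so that "/" does not need to be escaped.
$regexSlash = '#^(?=.*/\d{6}(/|$))(?=.*/\d{14}(/|$))#';

$inputs = [
    "FORT/310117/200826/12979898000170", // both sequences present
    "FORT/310117/200826",                // no 14-digit sequence
    "FORT/12979898000170",               // no 6-digit sequence
    "FORT/3101176/12979898000170",       // 7 digits do not count as 6
];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = preg_match($regexSlash, $input);
    echo $input . " => " . $results[$input] . "\n";
}
```

Only the first input should give 1; the other three give 0.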


"Split" instead of a do-it-all super-regex

The regex above only validates whether the string has those digit sequences. But if you want to get the digits themselves, I find it easier to split the string and go through the parts one by one:

$partes = explode("/", "FORT/310117/200826/12979898000170");
$temSequenciaDe6 = false;
$temSequenciaDe14 = false;
foreach ($partes as $parte) {
    if (preg_match('/^\d{6}$/', $parte)) {
        $temSequenciaDe6 = true;
        echo $parte . "\n";
    } else if (preg_match('/^\d{14}$/', $parte)) {
        $temSequenciaDe14 = true;
        echo $parte . "\n";
    }
}

In this case the expressions are simpler. ^\d{6}$, for example, checks that the part has exactly 6 digits from start (^) to end ($), while ^\d{14}$ checks for exactly 14 digits from start to end.

This code prints the numbers you want (only the parts with 6 or 14 digits), and the booleans $temSequenciaDe6 and $temSequenciaDe14 indicate which ones exist in the string. To know whether the string has both, just do:

if ($temSequenciaDe6 && $temSequenciaDe14) {
    // string has both 6-digit and 14-digit sequences
}
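
Since the loop above prints each number as soon as it is found, one way to return nothing unless both kinds of sequence exist is to collect the parts first and decide at the end. A minimal sketch, assuming a helper named extractSequences (a hypothetical name, not part of the original answer):

```php
<?php
// Hypothetical helper (not from the answer): collects the 6- and 14-digit parts
// and returns them only when at least one of each was found.
function extractSequences(string $texto, string $separador = "/"): array
{
    $encontrados = [];
    $temSequenciaDe6 = false;
    $temSequenciaDe14 = false;

    foreach (explode($separador, $texto) as $parte) {
        if (preg_match('/^\d{6}$/', $parte)) {
            $temSequenciaDe6 = true;
            $encontrados[] = $parte;
        } elseif (preg_match('/^\d{14}$/', $parte)) {
            $temSequenciaDe14 = true;
            $encontrados[] = $parte;
        }
    }

    // The "AND": keep the results only if both kinds of sequence exist.
    return ($temSequenciaDe6 && $temSequenciaDe14) ? $encontrados : [];
}

print_r(extractSequences("FORT/310117/200826/12979898000170")); // three numbers
print_r(extractSequences("FORT/310117/200826"));                // empty array
```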

It is also possible to split the string using a regex as the criterion, in case / is not the only separator. Just use preg_split:

$partes = preg_split('/[\/ ]/', "FORT 310117 200826 12979898000170");

In this case, the regex is used to split the string. Since I used [\/ ], both the / and the space are used as delimiters (note that there is a space before the ]).

If separators are always a single character, just add all possible separators inside the brackets. For example, [\/ ,\-] accepts the /, a space, a comma, or a hyphen as separator. Put in all the characters you need.

If a separator has more than one character, it is better to use alternation (|). For example, ( |\/|xyz) would use a space, the slash, or the string xyz as separator. If separators are always a single character (and not a string of 2 or more), I find brackets easier.
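
A small sketch of the alternation case, using an invented input string that mixes the three separators. The PREG_SPLIT_NO_EMPTY flag in the second call is an optional extra I am adding here: it drops the empty pieces that appear when two separators are adjacent.

```php
<?php
// Invented input mixing the three separators from the example: space, "/" and "xyz".
$partes = preg_split('/ |\/|xyz/', "FORT 310117/200826xyz12979898000170");
print_r($partes); // FORT, 310117, 200826, 12979898000170

// With adjacent separators, PREG_SPLIT_NO_EMPTY discards the empty pieces.
$semVazios = preg_split('/ |\/|xyz/', "FORT//310117", -1, PREG_SPLIT_NO_EMPTY);
print_r($semVazios); // FORT, 310117
```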


Other alternatives

I couldn’t come up with a single regex that does both (validates that there are 6- and 14-digit sequences and gets the numbers). But if you don’t want to use preg_split, you can use the first regex above (with 2 lookaheads) to validate the string, and then use the regex below to get the numbers:

$str = "FORT/310117/200826/12979898000170";
if (preg_match('/^(?=.*\b\d{6}\b)(?=.*\b\d{14}\b)/', $str)) { 
    preg_match_all('/\b(\d{6}(\d{8})?)\b/', $str, $matches, PREG_SET_ORDER);
    foreach($matches as $m) {
        echo $m[0] . "\n";
    }
}

The first regex (with the lookaheads) validates whether the string contains at least one 6-digit sequence and one 14-digit sequence. Then the second regex gets all the numbers in the string that match these sequences.

The excerpt (\d{6}(\d{8})?) means "6 digits, optionally followed by 8 more digits": the ? after (\d{8}) makes that part optional. That is, this regex matches both 6 and 14 digits (and the \b before and after ensures that it will not pick up any extra digits). And since the previous regex (with the lookaheads) already ensured that both sequences exist, I do not need to check that again. The output of this code is:

310117
200826
12979898000170

One more option

Another option is to use a regex to capture the 6-digit groups (ensuring there is at least one 14-digit group), and then another regex to do the reverse (capture the 14-digit groups, ensuring there is at least one of 6).

The first regex is:

(?|(\b\d{6}\b)(?=.+?\b\d{14}\b)|(?<=\b\d{14}\b).+?(\b\d{6}\b))

The excerpt (\b\d{6}\b)(?=.+?\b\d{14}\b) means:

  • (\b\d{6}\b): 6 digits (within parentheses to form a capture group)
  • (?=.+?\b\d{14}\b): lookahead that checks for a 14-digit sequence somewhere ahead

And the excerpt (?<=\b\d{14}\b).+?(\b\d{6}\b) means:

  • (?<=\b\d{14}\b): lookbehind that checks for a 14-digit sequence right before the current position.
  • .+?: any characters between that 14-digit sequence and the 6-digit one. The + means "one or more" and the ? means that it takes the minimum number of characters needed to satisfy the expression
  • (\b\d{6}\b): 6 digits

That is, the whole regex checks for a 14-digit sequence before or after the 6-digit sequence.

I also use (?|, which is a branch reset. Since the 6-digit group appears twice in the expression, there are 2 possible capture groups. Without the branch reset, I would have to check which of the 2 groups was filled; with it, I guarantee the capture is always in group 1.
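
To see what the branch reset changes, here is a minimal sketch with deliberately tiny patterns (the single-letter alternatives are illustrations, not the regex from this answer):

```php
<?php
// Without branch reset: each alternative has its own group number,
// so a match on the second alternative lands in group 2.
preg_match('/(a)|(b)/', 'b', $withoutReset);

// With branch reset (?|...): both alternatives share group number 1.
preg_match('/(?|(a)|(b))/', 'b', $withReset);

var_dump($withoutReset[1]); // string(0) "" (group 1 did not participate)
var_dump($withoutReset[2]); // string(1) "b"
var_dump($withReset[1]);    // string(1) "b"
```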

Then I can use the same logic to capture the groups of 14 digits, and use the lookaheads and lookbehinds to check whether there is at least one 6-digit group before or after:

$str = "FORT/310117/200826/12979898000170";
// gets the 6-digit groups (checking whether there is a 14-digit group before or after)
preg_match_all('/(?|(\b\d{6}\b)(?=.+?\b\d{14}\b)|(?<=\b\d{14}\b).+?(\b\d{6}\b))/', $str, $matches, PREG_SET_ORDER);
foreach($matches as $m) {
    echo $m[1] . "\n";
}

// gets the 14-digit groups (checking whether there is a 6-digit group before or after)
preg_match_all('/(?|(\b\d{14}\b)(?=.+?\b\d{6}\b)|(?<=\b\d{6}\b).+?(\b\d{14}\b))/', $str, $matches, PREG_SET_ORDER);
foreach($matches as $m) {
    echo $m[1] . "\n";
}

The first foreach takes the numbers with 6 digits, and the second takes the numbers with 14. The lookaheads and lookbehinds guarantee that it will only take one of the sequences if the other exists (it only takes 6 digits if there is at least one of 14, and vice versa).

The output is:

310117
200826
12979898000170

I tried to combine the 2 regexes above into one, but it ended up skipping the second number (200826), see on Ideone. I haven’t figured out the reason yet, but anyway, this is the closest I’ve come to a single regex that gets all the numbers and validates whether there is at least one 6-digit sequence and one 14-digit sequence.

  • Hello! Great explanation, thank you very much, it helped a lot! However, I need the numbers, and they do not always come separated by slashes. That’s why I’m using a regular expression instead of explode; otherwise I think I could just call strlen after the explode. But I have many cases, such as "FORT 310117 200826 12979898000170"

  • @Christian In this case just use preg_split. I updated the answer with an example.

  • @Christian I put a few more options in the answer. If you have any other cases not mentioned that the regex does not cover, please edit the question and ask there.

3

You can use preg_match_all with your own regex, but naming the groups. For example, give the 6-digit group the name d6 and the 14-digit group the name d14:

$string = "FORT/310117/200826/12979898000170";
// or $string = "FORT 310117 200826 12979898000170";
preg_match_all("/(?<d6>\b\d{6}\b)|(?<d14>\b\d{14}\b)/", $string, $matches);

That is, if no 6-digit sequence is found, the d6 group will be an array with only empty entries; the same applies to the d14 group.

Then you use preg_grep on both groups to check whether there is any entry in the array that is not empty:

$d6 = preg_grep('/.{1,}/', $matches['d6']);
$d14 = preg_grep('/.{1,}/', $matches['d14']);

I just used the regex .{1,}, which keeps the array entries that have at least 1 character.

Now just write a simple if checking that both variables are truthy (that is, they have data). If one of them is falsy (has no data), the body of the if does not run:

if($d6 && $d14){
   var_dump($d6);
   var_dump($d14);
}

The result of $d6 will be:

array(2) {
  [0]=>
  string(6) "310117"
  [1]=>
  string(6) "200826"
}

and of $d14 will be:

array(1) {
  [2]=>
  string(14) "12979898000170"
}

Since the index of $d14 is [2], you can convert it into a string with:

$d14 = implode('', $d14);

To get the values of $d6, use $d6[0] and $d6[1].
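
If you would rather not depend on where each entry happens to land, one possible variation (my own sketch, not what this answer uses) is to filter out the empty entries and reindex with array_values, so both arrays start at index 0. The arrow function requires PHP 7.4 or later.

```php
<?php
$string = "FORT/310117/200826/12979898000170";
preg_match_all('/(?<d6>\b\d{6}\b)|(?<d14>\b\d{14}\b)/', $string, $matches);

// Keep only the non-empty entries, then renumber the keys from 0.
$isNotEmpty = fn(string $s): bool => $s !== '';
$d6 = array_values(array_filter($matches['d6'], $isNotEmpty));
$d14 = array_values(array_filter($matches['d14'], $isNotEmpty));

if ($d6 && $d14) {
    echo implode(" / ", $d6) . " / " . $d14[0] . "\n";
}
```

With this input, $d14[0] holds the 14-digit number, and the same code also handles strings with more or fewer than two 6-digit sequences.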

Code:

$string = "FORT/310117/200826/12979898000170";
preg_match_all("/(?<d6>\b\d{6}\b)|(?<d14>\b\d{14}\b)/", $string, $matches);

$d6 = preg_grep('/.{1,}/', $matches['d6']);
$d14 = preg_grep('/.{1,}/', $matches['d14']);

if($d6 && $d14){

   $d6_1 = $d6[0];
   $d6_2 = $d6[1];
   $d14 = implode('', $d14);

   echo $d6_1 ." / ". $d6_2 ." / ". $d14;
   // output: 310117 / 200826 / 12979898000170

}
