Regex - Selecting the first occurrence of a block of digits

2

I have a String like this:

987654 original4343 - Co 123456 asd.pdf

The groups of numbers in this String may vary, but what I need is to get the group that has 6 digits.

So I did this regex:

.*([0-9]{6}).*

The problem is that it returns the last occurrence, but I need the first. For the example above, it returns: 123456

But I need it to return: 987654

I tried using \b and other alternatives I found, but none of them gave me the result I wanted.

2 answers

4


Your regex matched the last occurrence of the 6 digits because the * quantifier is greedy: it takes the longest possible sequence of characters that satisfies the expression.

And since you used .*, and the dot means "any character" (any one, including digits¹), the following happens:

  • .* matches the stretch 987654 original4343 - Co (a sequence of zero or more characters)
  • [0-9]{6} matches the stretch 123456 (a sequence of 6 digits)

The regex could even have matched 987654, but since .* is greedy, it tries to grab the longest string it can. And because the dot can be any character, including digits, 987654 ends up being "swallowed" by .* (since the regex then found another 6-digit sequence that lets it satisfy its "greed").
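To see the greedy behavior in isolation, here is a minimal sketch that runs the original regex next to a variant with a lazy quantifier (.*?). The lazy variant is only an illustration and is not the solution proposed in this answer; the class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns capture group 1 when the whole input matches the regex, otherwise null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "987654 original4343 - Co 123456 asd.pdf";
        // Greedy: the leading .* gives back as little as possible, so the LAST 6-digit run is captured.
        System.out.println(firstGroup(".*([0-9]{6}).*", s));  // 123456
        // Lazy: the leading .*? grows as little as possible, so the FIRST 6-digit run is captured.
        System.out.println(firstGroup(".*?([0-9]{6}).*", s)); // 987654
    }
}
```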


In this case you don’t need the .* at all. You can use just the 6-digit sequence, since by default the Matcher starts searching at the beginning of the String and goes on until it finds something:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "987654 original4343 - Co 123456 asd.pdf";
Pattern p = Pattern.compile("[0-9]{6}");
Matcher m = p.matcher(s);
if (m.find()) { // check whether a match was found
    System.out.println(m.group());
}

It is important to check whether find() returns true. If the String has no 6-digit sequence (that is, if find() returns false) and you call group() right after, the result is a java.lang.IllegalStateException.
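A small sketch of that check, assuming you would rather get an empty result than an exception when nothing is found. The SafeFirstMatch class and its methods are hypothetical helpers, not part of the original code; the second method only exists to show the exception being thrown.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeFirstMatch {
    private static final Pattern SIX_DIGITS = Pattern.compile("[0-9]{6}");

    // Returns the first run of 6 digits, or an empty Optional when there is none.
    static Optional<String> firstSixDigits(String input) {
        Matcher m = SIX_DIGITS.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    // Calls group() without checking find(), and reports whether it threw.
    static boolean groupThrowsWithoutMatch(String input) {
        Matcher m = SIX_DIGITS.matcher(input);
        m.find(); // result ignored on purpose
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstSixDigits("987654 original4343 - Co 123456 asd.pdf"));
        System.out.println(firstSixDigits("only 12345 here"));
        System.out.println(groupThrowsWithoutMatch("only 12345 here"));
    }
}
```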

The code above prints 987654. Since the regex no longer has .*, nothing consumes the first number before the digits are checked. Using only [0-9]{6}, it simply matches the first 6 digits in a row that it finds.


Caution

But there’s still one detail. If we have the String:

1234567 abc 987654

It starts with a 7-digit number, but with the code above the result will be 123456 (since those are 6 digits in a row, and your regex didn’t say whether another digit may come after them). If that’s what you need, you can use the code above. But if you only want 6-digit sequences that have no other digit before or after them, the code changes a little:

String s = "1234567 abc 987654";
Pattern p = Pattern.compile("(?<=^|[^0-9])[0-9]{6}(?=[^0-9]|$)");
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group());
}

This prints 987654. I added a few things to the expression to make sure it only takes 6 digits that don’t have any other digits before or after. Let’s go through it piece by piece:

The anchor ^ means "start of the string". The character | means "or", and [^0-9] is "any character that is not a digit from 0 to 9". That is, ^|[^0-9] means "the beginning of the string or a character that is not a digit".

I put everything inside a lookbehind (indicated by (?<=). It is used to check if something exists before the current position. In this case, it can be the beginning of the string or some character that is not a digit.

Then I put [0-9]{6}, which is what I really want to get.

Finally, we have a lookahead (indicated by (?=), which is similar to the lookbehind: it checks whether something exists after the current position. Inside it we have [^0-9]|$, that is, a character that is not a digit or the end of the string ($).

That is, the regex takes 6-digit sequences only when they have no other digit immediately before or after. The code above prints 987654.
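As a side note, the same "no digit before or after" condition can also be written with negative lookarounds, (?<! and (?!. This is a sketch of an alternative formulation, not the regex used in this answer, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaround {
    // (?<![0-9]) : there is no digit right before the current position
    // (?![0-9])  : there is no digit right after the current position
    private static final Pattern EXACTLY_SIX =
            Pattern.compile("(?<![0-9])[0-9]{6}(?![0-9])");

    // Returns the first run of exactly 6 digits, or null when there is none.
    static String firstExactlySix(String input) {
        Matcher m = EXACTLY_SIX.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstExactlySix("1234567 abc 987654")); // 987654
        System.out.println(firstExactlySix("1234567"));            // null
    }
}
```

The negative form needs no special case for the start or end of the string, because "no digit here" is also true when there is no character at all.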


You said you tried \b, but it means a "word boundary", that is, a position in the string that has an alphanumeric character before it and a non-alphanumeric character after it (or vice versa). So in cases like a123456 there is no match: a letter comes right before the digits, and the position between a letter and a digit is not a word boundary. Example:

String s = "1234567 abc a987654 111222";
Pattern p = Pattern.compile("\\b[0-9]{6}\\b");
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group());
}

This code prints 111222. The a987654 was not matched because there is an a before the digits (so the position before the 9 does not correspond to \b).

With (?<=^|[^0-9])[0-9]{6}(?=[^0-9]|$), as we have seen, the code returns 987654. So choose the regex that best suits your data and what you need to extract from it.
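To make the choice easier, here is a small sketch that runs the three expressions discussed so far against the same string. The CompareBoundaries class and its first method are hypothetical names used only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompareBoundaries {
    // Returns the first match of the regex in the input, or null when there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "1234567 abc a987654 111222";
        // Any 6 digits in a row, even inside a longer number
        System.out.println(first("[0-9]{6}", s));                          // 123456
        // 6 digits with no other digit before or after
        System.out.println(first("(?<=^|[^0-9])[0-9]{6}(?=[^0-9]|$)", s)); // 987654
        // 6 digits delimited by word boundaries
        System.out.println(first("\\b[0-9]{6}\\b", s));                    // 111222
    }
}
```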


A few more details

Another alternative is to use the shorthands \d (which corresponds to [0-9]) and \D (which corresponds to [^0-9]). Remember that inside Java string literals the \ character must be escaped and written as \\:

Pattern p = Pattern.compile("(?<=^|\\D)\\d{6}(?=\\D|$)");

And if you want to get all the occurrences (and not just the first), just use a loop:

while (m.find()) { // while there are 6-digit sequences, print them
    System.out.println(m.group());
}
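If you would rather keep the matches than print them, the same loop can fill a list. This is a minimal sketch; the AllSixDigitRuns class and the findAll method are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllSixDigitRuns {
    private static final Pattern SIX_DIGITS =
            Pattern.compile("(?<=^|\\D)\\d{6}(?=\\D|$)");

    // Collects every run of exactly 6 digits, in the order they appear.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = SIX_DIGITS.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("987654 original4343 - Co 123456 asd.pdf")); // [987654, 123456]
    }
}
```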

And if you are going to process several different strings, you don’t need to create the Pattern and the Matcher every time. Just create them once and reset the Matcher at each iteration:

String[] strings = {
    "987654 original4343 - Co 123456 asd.pdf",
    "1234567 abc 987654",
    "1234567 abc a987654 111222" };
Pattern p = Pattern.compile("(?<=^|\\D)\\d{6}(?=\\D|$)");
Matcher m = p.matcher("");
for (String s : strings) {
    m.reset(s); // reset the Matcher with another string
    System.out.println("Testing " + s);
    while (m.find()) {
        System.out.println("- found: " + m.group());
    }
}

The output is:

Testing 987654 original4343 - Co 123456 asd.pdf
- found: 987654
- found: 123456
Testing 1234567 abc 987654
- found: 987654
Testing 1234567 abc a987654 111222
- found: 987654
- found: 111222

(1): Actually, by default the dot matches any character except line breaks. But it is possible to make it match line breaks too, using the DOTALL flag.
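A quick sketch of the difference, using Pattern.DOTALL as the second argument of Pattern.compile. The DotAllDemo class and the two-line sample string are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns true when the regex, compiled with the given flags, matches the whole input.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "987654\n123456";
        // Default: the dot stops at the line break, so .* cannot cover the whole input.
        System.out.println(matchesWhole(".*", 0, twoLines));              // false
        // DOTALL: the dot also matches the line break.
        System.out.println(matchesWhole(".*", Pattern.DOTALL, twoLines)); // true
    }
}
```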

2

With the regex .*([0-9]{6}).*, in addition to the group capturing the last number, as detailed in @hkotsubo’s answer, the quantifier in .* matches as many characters as possible before and after the 6-digit number, so the match as a whole contains the entire String.
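The difference between the whole match and the capture group can be checked with a short sketch like this one. The WholeMatchVsGroup class and its method are hypothetical, written only to show group() next to group(1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeMatchVsGroup {
    // Returns { whole match, capture group 1 } for the original regex, or null when nothing matches.
    static String[] wholeAndGroup(String input) {
        Matcher m = Pattern.compile(".*([0-9]{6}).*").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = wholeAndGroup("987654 original4343 - Co 123456 asd.pdf");
        System.out.println(r[0]); // 987654 original4343 - Co 123456 asd.pdf
        System.out.println(r[1]); // 123456
    }
}
```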

The simplest way I see to bring the numbers is this:

  ([0-9]{6}) // Match a group of 6 digits

Depending on what you really need, you may have to analyze each match the regex finds, or make the regex more complex to filter more specifically.

If you only need the first occurrence, just do this:

   String str = "987654 original4343 - Co 123456 asd.pdf";

   Pattern regex = Pattern.compile("([0-9]{6})");
   Matcher pesquisa = regex.matcher(str);
   if (pesquisa.find()) { // group() throws IllegalStateException if there was no match
       System.out.println(pesquisa.group()); // first occurrence
   }

But if you want all the numbers found, you can do this:

   Pattern regex = Pattern.compile("([0-9]{6})");    
   Matcher pesquisa = regex.matcher(str);    
   while (pesquisa.find()) {    
        System.out.println(pesquisa.group());
   }

Just don’t forget to import the classes:

import java.util.regex.*;

It is worth highlighting this comment from @hkotsubo:

That .* is tricky. Some more radical authors say to "never" use it, which I think is a bit exaggerated. Of course you should use it carefully, but if you know what you’re doing, no problem at all.

  • 2

    "I don’t know how this REGEX .*([0-9]{6}).* brought only the last number" - It brought only the last one because the first .* is greedy, so it grabs as much as it can.

  • Indeed, regarding the .*, it is what @Isac said. I posted an answer with an explanation detailing this, and a few more remarks :-)

  • @hkotsubo yes... I ended up expressing myself badly.

  • @Isac I expressed myself badly... What I meant by "I don’t know how this REGEX[.. ]" is that it doesn’t take JUST the last number. It takes the WHOLE String. I ran tests in Java and, the way that regex was written, it pulled in everything!

  • Anyway, I edited the answer to make it clearer. =)

  • 1

    It brings the whole string because of the second .* after the digits. "Greed" applies to both: the first .* takes as much as possible before the digits (which makes it pick the last 6-digit sequence), and the second takes everything after those 6 digits (i.e., it goes to the end of the string). The result is the entire string... That .* is tricky. Some more radical authors say to "never" use it, which I think is a bit exaggerated. Of course you should use it carefully, but if you know what you’re doing, no problem at all.

  • @hkotsubo yes... I will add what you said.
