Regular Expression White Space

Question

Asked 10 years, 11 months ago

Viewed 10,545 times

2

With the regular expression (^DOC)*[0-9] I can capture all the numbers after the "DOC" sequence. However, when I test it against this text:

TEXT TEXT TEXT DOCUMENT:240010 24/09/2014

It returns "24001024092014": the date comes along too. How do I capture the numeric sequence and stop at the first space, so that the space and everything after it are left out of the match? I want to capture only the document number.

Here is the Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Teste {

    public static void main(String args[]){

        String CAPTURAR_SOMENTE_NUMEROS_APOS_PALAVRA_DOC = "(^DOC)*\\d+ ";

        Pattern pattern = Pattern.compile(CAPTURAR_SOMENTE_NUMEROS_APOS_PALAVRA_DOC);

        Matcher matcher = pattern.matcher("TEXTO TEXTO TEXTO TEXTO DOCUMENTOLEGAL:240010 24/09/2014 ");

        while(matcher.find()){
            System.out.printf(matcher.group());
        }

    }
}

What code are you using to apply this regex? matches, lookingAt, find... P.S. Are you sure this regex works? In my understanding, (^DOC)* means "The sequence 'DOC' at the beginning of the line, zero or more times", and [0-9] means "a single digit".

– mgibsonbr

2014/09/26 at 14:26
Yes, it works. I found another way using the following regex: "(^DOC)*\\d+ ", but it returns the number along with the space. I'm a layman when it comes to regular expressions and need to study them better.

– Alexandre Hideki

2014/09/26 at 14:36
What I mean is that (^DOC)* is irrelevant: you could replace your entire regex with [0-9] and still get the same result (try your original regex with TEXTO TEXTO 123 TEXTO TEXTO DOCUMENTO:240010 24/09/2014; it will pick up the 123 that comes before DOC). Similarly, your second suggestion can be replaced by "\\d+ " and nothing else. If you want me to explain further, please post the snippet of Java code you are using to apply this regex to a string.

– mgibsonbr

2014/09/26 at 14:45
True!! If you can clear up this doubt for me, I'd be grateful, mgibsonbr.

– Alexandre Hideki

2014/09/26 at 14:52
I posted an answer. Unfortunately I don't know of any good regex tutorial to recommend, but if you click the tag [tag:regex] and then "Learn more..." or "info" you will find some useful references on the subject.

– mgibsonbr

2014/09/26 at 15:44

3 answers

2

The method Matcher.find looks for occurrences of a regular expression in a string. That is, it returns any substring that matches the regex you are looking for. If you want to extract the number that comes right after DOC, there are two ways: capture groups and lookarounds.
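To see why the original regex returns every digit, including the date, here is a minimal sketch that lists what Matcher.find reports. The class name FindDemo and the helper findAll are made up for this illustration; the regex and the input text are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {

    // Hypothetical helper: collects every substring that Matcher.find() reports.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "TEXT TEXT TEXT DOCUMENT:240010 24/09/2014";

        // "(^DOC)*" is allowed to match zero times, so only "[0-9]" decides the match:
        // every single digit in the text is reported, one at a time.
        System.out.println(findAll("(^DOC)*[0-9]", text));

        // A bare "[0-9]" reports exactly the same matches.
        System.out.println(findAll("[0-9]", text));
    }
}
```

Printing the matches back to back, as the loop in the question does, glues all those single digits together, which is where "24001024092014" comes from.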

Capture Groups

The capture group approach is the simplest, as demonstrated in Rodrigo Rigotti's answer: you write out the text you want to match, and put parentheses around the parts that interest you. Simple example:

DOCUMENTO:([0-9]+)
DOCUMENTO:(\d+)

This matches the string DOCUMENTO: and any sequence of digits that follows it, and nothing else. The sequence of digits, being inside a capture group, can be accessed through the method group(int):

matcher.find();
System.out.println(matcher.group(1)); // Gets the first (in this case, the only) capture group; group(0) would be the whole match
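For reference, a self-contained sketch of the capture group approach might look like the following. The class name CaptureGroupDemo and the method documentNumber are hypothetical; the regex is the DOCUMENTO:(\d+) shown above. In java.util.regex, group(0) is the entire match and capture groups are numbered from 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {

    private static final Pattern DOCUMENT = Pattern.compile("DOCUMENTO:(\\d+)");

    // Hypothetical helper: returns the digits right after "DOCUMENTO:",
    // or null when the text contains no such sequence.
    static String documentNumber(String text) {
        Matcher matcher = DOCUMENT.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(1); // first capture group: the digits only
    }

    public static void main(String[] args) {
        String text = "TEXTO TEXTO TEXTO TEXTO DOCUMENTO:240010 24/09/2014";

        Matcher matcher = DOCUMENT.matcher(text);
        if (matcher.find()) {
            System.out.println(matcher.group(0)); // whole match: DOCUMENTO:240010
            System.out.println(matcher.group(1)); // capture group: 240010
        }
    }
}
```

Because \d+ only matches digits, the match stops at the space before the date, so nothing from the date is included.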

If the word DOCUMENTO can vary - for example, accepting both DOCUMENTO and DOC - you can mark the suffix as optional (through the operator ?):

DOC(UMENTO)?:(\d+)

But note that in doing so you have created a new capture group: the suffix becomes group 1, and the numbers you want become group 2 (group 0 is always the whole match). If you want to prevent the suffix from being captured, you can use (?:regex) instead of (regex):

DOC(?:UMENTO)?:(\d+)
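The effect on group numbering can be checked with Matcher.groupCount(). The sketch below is only an illustration; the class name GroupNumberingDemo and both helper methods are invented for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {

    // Hypothetical helper: number of capture groups declared in a regex
    // (group 0, the whole match, is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: digits after "DOC:" or "DOCUMENTO:", or null if absent.
    static String numberAfterDoc(String text) {
        Matcher matcher = Pattern.compile("DOC(?:UMENTO)?:(\\d+)").matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // With a capturing suffix there are two groups, and the digits are group 2.
        System.out.println(groupCount("DOC(UMENTO)?:(\\d+)"));   // 2
        // With a non-capturing suffix the digits are the only group, group 1.
        System.out.println(groupCount("DOC(?:UMENTO)?:(\\d+)")); // 1

        System.out.println(numberAfterDoc("TEXTO DOC:240010 24/09/2014"));
        System.out.println(numberAfterDoc("TEXTO DOCUMENTO:240010 24/09/2014"));
    }
}
```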

Finally, if you have other possible entries - as in your example, DOCUMENTOLEGAL - you can adjust the regex accordingly, or even accept any sequence of letters after DOC:

DOC\w*:(\d+)

Just be careful not to match more than you want (\w accepts any letter, digit or underscore; regex* accepts zero or more occurrences of regex).
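As an illustration of that caution, the sketch below compares DOC\w*:(\d+) with a stricter variant. The stricter pattern \bDOC[A-Z]*:(\d+) and the input DOCS_2014:99 are assumptions made up for this example, not something taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelVariantsDemo {

    // Hypothetical helper: first capture group of the first match, or null.
    static String firstGroup(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String loose = "DOC\\w*:(\\d+)";       // any letters, digits or underscores after DOC
        String strict = "\\bDOC[A-Z]*:(\\d+)"; // assumption: labels are uppercase letters only

        String expected = "TEXTO DOCUMENTOLEGAL:240010 24/09/2014";
        System.out.println(firstGroup(loose, expected));  // 240010
        System.out.println(firstGroup(strict, expected)); // 240010

        // A made-up input where \w* swallows more than a label:
        // "S_2014" is accepted as part of the word, so 99 is captured.
        String surprising = "DOCS_2014:99";
        System.out.println(firstGroup(loose, surprising));  // 99
        System.out.println(firstGroup(strict, surprising)); // null
    }
}
```

Which variant is appropriate depends on the labels that can actually appear in the input.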

Lookarounds

Lookarounds are parts of a regex that are checked but do not become part of the final result. A lookahead tries to match what comes ahead, while a lookbehind tries to match what comes before. This technique is sometimes useful, but it is more complicated (and does not work the same way in every regular expression implementation), so I suggest avoiding it when possible. As an example, your expression would look like this:

(?<=DOCUMENTO:)\d+

That is, "take a sequence of digits, but make sure it is preceded by DOCUMENTO:, without including that prefix in the match". The disadvantage of this technique is that not everything can be placed in a lookbehind: in particular, many implementations require the regex inside it to have a fixed length. That is a problem in your case, since you need to check for DOC as well as DOCUMENTO, etc.
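A minimal Java sketch of the lookbehind approach follows. The class name LookbehindDemo and the helper firstMatch are hypothetical. The second pattern relies on java.util.regex accepting a lookbehind whose length is bounded rather than strictly fixed; other regex engines may reject it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {

    // Hypothetical helper: the whole first match (group 0), or null.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Fixed-length lookbehind: the prefix is checked but is not part of the match,
        // so no capture group is needed.
        System.out.println(
            firstMatch("(?<=DOCUMENTO:)\\d+", "TEXTO DOCUMENTO:240010 24/09/2014"));

        // Bounded-length lookbehind (accepted by Java; an assumption for other engines):
        // covers both "DOC:" and "DOCUMENTO:".
        String either = "(?<=DOC(?:UMENTO)?:)\\d+";
        System.out.println(firstMatch(either, "TEXTO DOC:240010 24/09/2014"));
        System.out.println(firstMatch(either, "TEXTO DOCUMENTO:240010 24/09/2014"));
    }
}
```

An open-ended prefix such as DOC\w*: has no upper bound on its length, so for that case the capture group version remains the safer choice.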

Sensational!! Thanks for the brief and rich explanation!! You're right. Maybe I misunderstood the concept of regex, but I think I can at least partially solve my problem now. Thank you!

– Alexandre Hideki

2014/09/26 at 16:03

by Rodrigo Rigotti • **12,139** points · Answer 1 · 2014-09-26T14:25:26+00:00

1

A suggestion:

DOCUMENTO:(\d+)\s*(\d+)\/(\d+)\/(\d+)

Example of implementation:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    public static void main(String args[]) {

      String line = "TEXTO TEXTO TEXTO TEXTO DOCUMENTO:240010 24/09/2014";
      String pattern = "DOCUMENTO:(\\d+)\\s*(\\d+)\\/(\\d+)\\/(\\d+)";

      Pattern r = Pattern.compile(pattern);

      Matcher m = r.matcher(line);
      if (m.find()) {
         // group(0) is the whole match; group(1) is the document number,
         // and groups 2 to 4 are the day, month and year.
         System.out.println(m.group(1));
      } else {
         System.out.println("No results.");
      }
   }
}

Your regex is fine, but the OP's goal is to not include the date in the result, i.e. the date groups (2, 3 and 4) can be ignored.

– mgibsonbr

2014/09/26 at 14:28
Rodrigo, very good tip, thank you!! However, I put only "DOC" in the regex because the text can contain either "DOC" or "DOCUMENT", so the idea is to take every digit after the sequence "DOC" until a space is found. Would (^DOC)(\d+)\s(\d+)/(\d+)/(\d+) solve my problem?

– Alexandre Hideki

2014/09/26 at 14:40
1

@Alexandrehideki I don't know what you think (^DOC)* does, but the correct way to do what you ask is DOC(UMENTO)?: - i.e. "the sequence 'DOC', optionally followed by 'UMENTO', followed by a colon".

– mgibsonbr

2014/09/26 at 14:47
1

As I didn't have much time to research and study regex, at first I thought that (^DOC)* would make the regex find this sequence, negate it, and take the values I defined from that point on. However, you showed me above that it is irrelevant, because I had not tested the regex with numbers before the word DOC. When I ran it for the first time and it returned only the numbers, I concluded that it was correct!

– Alexandre Hideki

2014/09/26 at 14:55
@Alexandrehideki There is a kind of pattern matching called glob, widely used in file systems and the like. That's probably what you thought regular expressions were. But in fact, regex is a totally different language... :)

– mgibsonbr

2014/09/26 at 15:50

by Pedro Rangel • **2,747** points · Answer 2 · 2014-09-26T14:32:05+00:00

String[] vetor = texto.replaceFirst("^.*DOC\\w*:", "").split("\\s+");

Assuming texto holds the line from the question, the resulting array will have two positions: vetor[0] = 240010 and vetor[1] = 24/09/2014. Just take vetor[0], which is the one of interest.