The method Matcher.find looks for occurrences of a regular expression in a string. That is, it returns any substring that matches the regex you are looking for. If you want to extract the number that comes right after a DOC prefix, there are two ways: capture groups and lookarounds.
Capture Groups
Capture groups are the simplest approach, as shown in Rodrigo Rigotti's answer: you write out the text you want to match, and you put parentheses around the parts you are actually interested in. A simple example, in two equivalent spellings:
DOCUMENTO:([0-9]+)
DOCUMENTO:(\d+)
This matches the string DOCUMENTO: and any sequence of digits that follows it, and nothing else. The sequence of digits, being inside a capture group, can be accessed through the method group(int):
if (matcher.find()) {
    System.out.println(matcher.group(1)); // Gets the first (here, the only) capture group; group(0) is the whole match
}
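For context, here is a minimal self-contained sketch of the same idea. The class name DocNumberExtractor and the helper extract are made up for illustration, and the sample line is the one used in the comments further down. In Java, group(0) is the entire match and the first capture group is group(1).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class DocNumberExtractor {
    // Compiled once; group 1 is the run of digits after the literal prefix.
    private static final Pattern DOC_NUMBER = Pattern.compile("DOCUMENTO:(\\d+)");

    static Optional<String> extract(String text) {
        Matcher matcher = DOC_NUMBER.matcher(text);
        // find() scans for the first occurrence anywhere in the input.
        if (matcher.find()) {
            return Optional.of(matcher.group(1)); // group(0) would include "DOCUMENTO:"
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String line = "TEXTO TEXTO 123 TEXTO TEXTO DOCUMENTO:240010 24/09/2014";
        System.out.println(extract(line).orElse("no match")); // prints 240010
    }
}
```

Checking the result of find() before calling group avoids the IllegalStateException that group throws when no match has been found.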
If the word DOCUMENTO can vary, for example if you accept both DOCUMENTO and DOC, you can mark the suffix as optional (through the ? operator):
DOC(UMENTO)?:(\d+)
But note that in doing so you have created a new capture group, which will be group 1, and the numbers you want will now be group 2 (group 0 is always the entire match). If you want to prevent the first group from being captured, you can use (?:regex) instead of (regex):
DOC(?:UMENTO)?:(\d+)
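The effect on group numbering can be checked with a small sketch like the one below. The class and method names are invented for the example. Since group 0 is always the whole match in Java, the digits sit in group 2 when the suffix is captured and in group 1 when it is not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two spellings of the optional suffix.
class OptionalSuffixDemo {
    static String withCapturingSuffix(String text) {
        Matcher m = Pattern.compile("DOC(UMENTO)?:(\\d+)").matcher(text);
        // Group 1 is the optional "UMENTO" (null if absent); the digits are group 2.
        return m.find() ? m.group(2) : null;
    }

    static String withNonCapturingSuffix(String text) {
        Matcher m = Pattern.compile("DOC(?:UMENTO)?:(\\d+)").matcher(text);
        // (?:...) groups without capturing, so the digits are group 1 again.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(withCapturingSuffix("DOCUMENTO:240010"));    // 240010
        System.out.println(withCapturingSuffix("DOC:42"));              // 42
        System.out.println(withNonCapturingSuffix("DOCUMENTO:240010")); // 240010
        System.out.println(withNonCapturingSuffix("DOC:42"));           // 42
    }
}
```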
Finally, if you have other possible prefixes, as in your example DOCUMENTOLEGAL, you can adjust the regex accordingly, or even accept any sequence of letters after DOC:
DOC\w*:(\d+)
Just be careful not to match more than you want (\w accepts any letter, digit or underscore; regex* accepts zero or more occurrences of regex).
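To make the warning concrete, the sketch below compares DOC\w*:(\d+) with a stricter variant. The stricter pattern DOC[A-Z]*:(\d+) is only an assumption about what the input might look like (uppercase ASCII letters after DOC), not something taken from the question, and the input MYDOC_9x:77 is an invented string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the STRICT pattern is an illustrative assumption.
class PrefixWidthDemo {
    // Pattern from the text: any word characters may follow DOC.
    static final Pattern LOOSE = Pattern.compile("DOC\\w*:(\\d+)");
    // Assumed stricter variant: only uppercase ASCII letters may follow DOC.
    static final Pattern STRICT = Pattern.compile("DOC[A-Z]*:(\\d+)");

    static String firstNumber(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber(LOOSE, "DOCUMENTOLEGAL:555"));  // 555
        System.out.println(firstNumber(LOOSE, "MYDOC_9x:77"));         // 77 (digits and underscore slip through \w*)
        System.out.println(firstNumber(STRICT, "DOCUMENTOLEGAL:555")); // 555
        System.out.println(firstNumber(STRICT, "MYDOC_9x:77"));        // null
    }
}
```

Which variant is appropriate depends on how much you know about the prefixes that can appear in your data.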
Lookarounds
Lookarounds are parts of a regex that are checked but do not become part of the final result. A lookahead tries to match the text ahead, while a lookbehind tries to match the text behind. This technique is sometimes useful, but it is more complicated (and does not work the same way in all regular expression implementations), so I suggest avoiding it when possible. By way of example, your expression would look like this:
(?<=DOCUMENTO:)\d+
That is, "take a sequence of digits, but make sure that this sequence is preceded by DOCUMENTO:, without including it in the match". The disadvantage of this technique is that not everything can be placed in a lookbehind; in particular, many implementations require the regex inside it to have a fixed length. In your case that is a problem, since you need to check for DOC as well as for DOCUMENTO, etc.
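As a rough sketch of the lookbehind version in Java (the class and method names are invented for the example), the whole match is already just the digits, so group() with no argument is enough:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the fixed-prefix lookbehind.
class LookbehindDemo {
    // The lookbehind asserts the prefix without consuming it.
    private static final Pattern AFTER_DOCUMENTO = Pattern.compile("(?<=DOCUMENTO:)\\d+");

    static String numberAfterPrefix(String text) {
        Matcher m = AFTER_DOCUMENTO.matcher(text);
        // group() is the entire match, which here contains only the digits.
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "TEXTO TEXTO 123 TEXTO TEXTO DOCUMENTO:240010 24/09/2014";
        System.out.println(numberAfterPrefix(line)); // prints 240010
    }
}
```

The 123 earlier in the line is skipped because it is not preceded by DOCUMENTO:.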
What code are you using to apply this regex? matches, lookingAt, find...? P.S. Are you sure this regex works? As I understand it, (^DOC)* means "the sequence 'DOC' at the beginning of the line, zero or more times", and [0-9] means "a single digit".
– mgibsonbr
Yes, it works. I found another way using the following regex: "( DOC)* d+ ", but it returns the number along with the space. I am a layman when it comes to regular expressions. I need to study them better.
– Alexandre Hideki
What I mean is that (^DOC)* is irrelevant: you could replace your entire regex with [0-9] and you would still get the same result (try your original regex with TEXTO TEXTO 123 TEXTO TEXTO DOCUMENTO:240010 24/09/2014, and it will pick up the 123 that comes before DOC). Similarly, your second suggestion can be replaced by \\d+ and nothing else. If you want me to explain further, please post the snippet of Java code you are using to apply this regex to a string.
– mgibsonbr
True! If you could clear this up for me, I would be grateful, mgibsonbr.
– Alexandre Hideki
I posted an answer. Unfortunately I don't know of any good regex tutorial to recommend, but if you click the [tag:regex] tag and then "Learn more..." or "info" you will find some useful references on the subject.
– mgibsonbr