Regex to select only a number within certain Strings

I have a multi-line log that generates multiple events; however, I need to capture only the number that comes after the dots and right before the "ms", i.e. "509833", "780414", etc.

2020-04-23 15:21:10,602 INFO  ecp-1-2089600 25000 Execution Info
+[Job_TransformGIM].............................................................509833 ms. Invocations 1
|-- [INIT]........................................................................2297 ms. Invocations 1
|--+[RUN].......................................................................507380 ms. Invocations 1
   |-- [AGENTtoRESOURCE]...........................................................125 ms. Invocations 1
   |-- [SCRIPTtoRESOURCE]..........................................................172 ms. Invocations 1
2020-04-23 15:40:38,347 INFO  ecp-1-2089600 25000 Execution Info
+[Job_TransformGIM].............................................................285409 ms. Invocations 1
|-- [INIT]........................................................................1875 ms. Invocations 1
|--+[RUN].......................................................................283362 ms. Invocations 1
   |-- [AGENTtoRESOURCE]............................................................93 ms. Invocations 1
   |-- [SCRIPTtoRESOURCE]...........................................................93 ms. Invocations 1
   |-- [ENDPOINTtoRESOURCE].........................................................78 ms. Invocations 1
   |-- [SWITCHtoRESOURCE]...........................................................78 ms. Invocations 1
2020-04-23 15:21:10,602 INFO  ecp-1-2089600 25000 Execution Info
+[Job_TransformGIM]...........................................................54509833 ms. Invocations 1
|-- [INIT]........................................................................2297 ms. Invocations 1
|--+[RUN].......................................................................507380 ms. Invocations 1
   |-- [AGENTtoRESOURCE]...........................................................125 ms. Invocations 1

I’ve set up the following regex, but I think there must be a better way to do it:

^\d+\-\d+\-\d+\s+\d+:\d+:\d+,\d+\s+\w+\s+\w+\-\d+\-\d+\s+\d+\s+\w+\s+\w+\s+\+\[\w+_\w+\]\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.(?P<teste>\d+\s+)
  • Corrected: the application is written in Java and the log is multi-line, but I only need the value on the Job_TransformGIM line, i.e. "509833", "285409". The logs are huge, and an event like this is generated every minute.

1 answer


An alternative is \[\w+\]\.+(\d+)\s+ms\..

Since the shorthand \w already matches letters, digits, and the _ character, you don’t need \w+_\w+. In fact, that form requires a _ to be present, whereas plain \w+ also accepts names made only of letters (unless the intention really is to match only names that contain at least one _, of course).
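To make the difference concrete, here is a minimal sketch comparing the two forms against names taken from the log. The class and constant names are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

class WordShorthandDemo {
    // \w+_\w+ requires at least one underscore with word characters on both sides
    static final Pattern WITH_UNDERSCORE = Pattern.compile("\\w+_\\w+");
    // \w+ accepts any run of letters, digits, and underscores
    static final Pattern PLAIN = Pattern.compile("\\w+");

    public static void main(String[] args) {
        for (String name : new String[] {"Job_TransformGIM", "INIT", "RUN"}) {
            System.out.println(name
                    + " | with underscore: " + WITH_UNDERSCORE.matcher(name).matches()
                    + " | plain: " + PLAIN.matcher(name).matches());
        }
    }
}
```

Only Job_TransformGIM satisfies the underscore form, while all three names satisfy plain \w+, which is why \w+ is the one that also reaches the [INIT] and [RUN] lines.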

For the dots, I used \.+ (one or more occurrences of the . character), since their number varies and you have no way of knowing it in advance (the number of digits in the value that follows changes how many dots there are). This keeps the regex simpler and seems to cover all the cases.

Then I match the digits (\d+), followed by one or more spaces (\s+), followed by ms. (with the dot escaped as \.). The digits are in parentheses to form a capturing group, so I can retrieve their value later. You had used a named group ((?P<teste>...)), but I didn't find a name necessary, because the regex has only one group and I can refer to it by number (it is the first pair of parentheses in the regex, so it is group 1).

Not to mention that the syntax (?P<teste> is invalid in Java. Named-group syntax varies between languages, and according to the documentation, in Java it is just (?<teste>.
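If a named group is still preferred, a minimal sketch of the Java form could look like this. The class and method names are hypothetical, and the sample line is shortened (fewer dots) for readability.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Java writes a named group as (?<name>...), not (?P<name>...)
    static final Pattern PATTERN =
            Pattern.compile("\\[\\w+\\]\\.+(?<teste>\\d+)\\s+ms\\.");

    // Returns the captured digits, or null when the line has no match
    static String extract(String line) {
        Matcher m = PATTERN.matcher(line);
        return m.find() ? m.group("teste") : null;
    }

    public static void main(String[] args) {
        // Shortened sample line; the real log has many more dots
        System.out.println(extract("+[Job_TransformGIM]....509833 ms. Invocations 1"));
    }
}
```

Matcher.group(String) looks the group up by name, and the same group is still reachable as group(1).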

Anyway, assuming that all the text is in a string, it would look like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

...
String texto = "2020-04-23 15:21:10,602 INFO  ecp-1-2089600 25000 Execution Info\n"
                + "+[Job_TransformGIM].............................................................509833 ms. Invocations 1\n"
                + "|-- [INIT]........................................................................2297 ms. Invocations 1\n"
                + "|--+[RUN].......................................................................507380 ms. Invocations 1\n"
                + "   |-- [AGENTtoRESOURCE]...........................................................125 ms. Invocations 1\n"
                + "   |-- [SCRIPTtoRESOURCE]..........................................................172 ms. Invocations 1\n"
                + "2020-04-23 15:40:38,347 INFO  ecp-1-2089600 25000 Execution Info\n"
                + "+[Job_TransformGIM].............................................................285409 ms. Invocations 1\n"
                + "|-- [INIT]........................................................................1875 ms. Invocations 1\n"
                + "|--+[RUN].......................................................................283362 ms. Invocations 1\n"
                + "   |-- [AGENTtoRESOURCE]............................................................93 ms. Invocations 1\n"
                + "   |-- [SCRIPTtoRESOURCE]...........................................................93 ms. Invocations 1\n"
                + "   |-- [ENDPOINTtoRESOURCE].........................................................78 ms. Invocations 1\n"
                + "   |-- [SWITCHtoRESOURCE]...........................................................78 ms. Invocations 1\n"
                + "2020-04-23 15:21:10,602 INFO  ecp-1-2089600 25000 Execution Info\n"
                + "+[Job_TransformGIM]...........................................................54509833 ms. Invocations 1\n"
                + "|-- [INIT]........................................................................2297 ms. Invocations 1\n"
                + "|--+[RUN].......................................................................507380 ms. Invocations 1\n"
                + "   |-- [AGENTtoRESOURCE]...........................................................125 ms. Invocations 1";
Matcher matcher = Pattern.compile("\\[\\w+\\]\\.+(\\d+)\\s+ms\\.").matcher(texto);
while (matcher.find()) {
    System.out.println(matcher.group(1)); // get group 1
}

Note that, inside a Java string literal, the \ character must be written as \\.

The Matcher goes through the string looking for occurrences of the regex. Each time it finds one, just take group 1, which contains the desired number. The output is:

509833
2297
507380
125
172
285409
1875
283362
93
93
78
78
54509833
2297
507380
125

Now, if you are processing the file line by line, you can do this:

String[] linhas = ...; // lines of the file
Matcher matcher = Pattern.compile("\\[\\w+\\]\\.+(\\d+)\\s+ms\\.").matcher("");
for (String linha : linhas) {
    if (matcher.reset(linha).find()) {
        System.out.println(matcher.group(1));
    }
}

This assumes that each line has only one occurrence of the value in milliseconds. If a line may have more than one occurrence, replace the if with a while.
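For very large logs, a sketch that reads one line at a time instead of holding an array could look like the following. It is only an illustration under assumptions: the class and method names are hypothetical, a StringReader with shortened lines stands in for the real file, and for an actual file the reader could come from Files.newBufferedReader(path).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineByLineDemo {
    static final Pattern PATTERN = Pattern.compile("\\[\\w+\\]\\.+(\\d+)\\s+ms\\.");

    // Collects every captured value, reading the source one line at a time
    static List<String> extractAll(Reader source) {
        List<String> values = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(""); // one Matcher, reset for each line
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                matcher.reset(line);
                while (matcher.find()) { // while, so several matches on one line are all kept
                    values.add(matcher.group(1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened sample lines; the real log has many more dots
        String sample = "|-- [INIT]....2297 ms. Invocations 1\n"
                + "|--+[RUN]....507380 ms. Invocations 1";
        System.out.println(extractAll(new StringReader(sample))); // prints [2297, 507380]
    }
}
```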


If you want to turn the value into a number, just do:

while (matcher.find()) {
    int ms = Integer.parseInt(matcher.group(1));
    // use the value of ms for whatever you need
}

Or, if there may be values above 2,147,483,647 (the maximum value an int supports):

while (matcher.find()) {
    long ms = Long.parseLong(matcher.group(1));
}

A long supports values up to 9,223,372,036,854,775,807. Since your values are execution durations, they are very unlikely to exceed this limit (that many milliseconds corresponds to more than 290 million years).
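The difference between the two parsers can be sketched as follows. The class and method names are hypothetical, and 2147483648 is a made-up value one above the int limit, not something taken from the log.

```java
class ParseMillisDemo {
    // Parses the captured digits as a long, which covers any realistic duration
    static long parseMillis(String digits) {
        return Long.parseLong(digits);
    }

    // True when the value does not fit in an int, i.e. Integer.parseInt would throw
    static boolean overflowsInt(String digits) {
        return parseMillis(digits) > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(overflowsInt("54509833"));   // false: largest value in the sample log
        System.out.println(overflowsInt("2147483648")); // true: hypothetical value above the int limit
        try {
            Integer.parseInt("2147483648");
        } catch (NumberFormatException e) {
            System.out.println("Integer.parseInt rejected the value");
        }
    }
}
```

Note that Integer.parseInt does not silently wrap around; it throws NumberFormatException when the digits do not fit in an int.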


If you are absolutely sure that a sequence of digits followed by "ms." does not occur anywhere else in the file (only in the places you want to capture), you can simplify the regex to "(\\d+)\\s+ms\\." (or "(\\d+) ms\\.", if there is always exactly one space between the number and the text "ms."). The rest of the code is the same.
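Going in the opposite direction, the question's comment says that only the value on the Job_TransformGIM line is wanted. One possible way to get that, sketched here as an assumption rather than as part of the original answer, is to replace \w+ with the literal job name. The class and method names are hypothetical, and the sample lines are shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JobTotalDemo {
    // Literal job name instead of \w+, so only the top-level job line matches
    static final Pattern JOB_TOTAL =
            Pattern.compile("\\[Job_TransformGIM\\]\\.+(\\d+)\\s+ms\\.");

    // Returns the duration of every Job_TransformGIM line found in the text
    static List<Long> jobTotals(String text) {
        List<Long> totals = new ArrayList<>();
        Matcher m = JOB_TOTAL.matcher(text);
        while (m.find()) {
            totals.add(Long.parseLong(m.group(1)));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Shortened sample lines; the real log has many more dots
        String sample = "+[Job_TransformGIM]....509833 ms. Invocations 1\n"
                + "|-- [INIT]....2297 ms. Invocations 1\n"
                + "+[Job_TransformGIM]....285409 ms. Invocations 1\n"
                + "|--+[RUN]....283362 ms. Invocations 1";
        System.out.println(jobTotals(sample)); // prints [509833, 285409]
    }
}
```

If other job names can appear in the log, the literal could be swapped for an alternation or for the original \w+_\w+ form, depending on which names should count.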
