Regex to split a string at each top-level closing parenthesis: what is the simplest way to do it?


1

Hello, I am working with user-agent strings and would like some help.

I want to separate a value from a string that has this pattern:

Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>

Existing examples among thousands:

type 1:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36

type 2:

5.0 (Linux; Android 5.0.2; MotoE2(4G-LTE) Build/LXI22.50-53.8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.76 Mobile Safari/537.36

type 3:

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0

Form of separation:

Split the string at every closing of an outer parenthesis, regardless of how many there are, following the user-agent pattern posted above;

or

In other words, whenever there is a run of text followed by an opening parenthesis, text inside it, and a closing parenthesis, the string should be split right after the closing of that first parenthesis, and so on for each following occurrence;

Rule to be applied: if a parenthesis is nested inside one that was opened earlier, it must not cause a split the way its parent does!

Examples of expected results with the values posted above:

type 1:

array (
0 => 'Mozilla/5.0 (X11; Linux x86_64) ',
1 => 'AppleWebKit/537.36 (KHTML, like Gecko) ',
2 => 'Chrome/73.0.3683.86 Safari/537.36'
);

type 2:

array (
0 => '5.0 (Linux; Android 5.0.2; MotoE2(4G-LTE) Build/LXI22.50-53.8) ',
1 => 'AppleWebKit/537.36 (KHTML, like Gecko) ',
2 => 'Chrome/47.0.2526.76 Mobile Safari/537.36'
);

type 3:

array (
0 => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) ',
1 => 'Gecko/20100101 Firefox/66.0'
);

I think I tried to explain too much earlier, so I made this big edit to try to make things clearer. Please feel free to ask...

  • 2

    Perhaps it is not so simple because the format varies a lot, and particularly one parenthesis within another is complicated (although it is not impossible). But I don’t understand, do you want to validate whether the string is in the right format or extract specific data from it? Anyway, did you try it with the get_browser function? https://www.php.net/manual/en/function.get-browser.php

  • @hkotsubo I just want to split the values I posted as examples at each closing of a first-level parenthesis. I didn’t know the get_browser function; I’m taking a look. Thank you very much!

1 answer

1


Creating a regex that interprets all (or even several) user agents is a very difficult task, since the format is very open, and covering all cases seems unfeasible to me.

An alternative is to use the get_browser function. To use it, you must set the browscap directive in your php.ini file:

browscap = /path/to/file/browscap.ini

The browscap.ini file must also be present on the machine. An example can be downloaded at this link.

Once configured, just pass the string:

$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36';
$result = get_browser($user_agent, true);

The second parameter indicates whether an array will be returned (if true is passed) or an object (if false is passed, which is also the default when the parameter is omitted).

The returned data is varied and it is up to you to evaluate the respective values and see if it is what you need.
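As a rough sketch of how the result could be inspected: the keys browser, version and platform are ones a typical browscap.ini provides, but which keys exist depends on the file you installed, so treat them as assumptions. The helper name describeUserAgent is hypothetical; it returns null when browscap is not configured instead of letting get_browser emit a warning.

```php
<?php
// Hypothetical helper: wraps get_browser() and returns null when the
// browscap directive is not configured (get_browser() would warn otherwise).
function describeUserAgent(string $userAgent): ?array
{
    // browscap has to be set in php.ini; it cannot be set at runtime.
    if (!ini_get('browscap')) {
        return null;
    }
    $info = get_browser($userAgent, true); // true => associative array
    return is_array($info) ? $info : null;
}

$ua = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0';
$info = describeUserAgent($ua);

if ($info === null) {
    echo 'browscap is not configured', PHP_EOL;
} else {
    // These keys are assumed to exist in the installed browscap.ini.
    echo $info['browser'] ?? '?', ' ', $info['version'] ?? '?',
        ' on ', $info['platform'] ?? '?', PHP_EOL;
}
```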


Regex

As already said, a regex that interprets the whole user-agent string is very complicated. The hardest part is checking parentheses inside other parentheses. Parsing the string with a loop would be a lot easier than using regex, but for the record, a regex version follows.

The solution below does not check the whole string, only up to the first pair of parentheses, which is what was asked in the question. In addition, the string is assumed to have the format "name/version (text)":

function parseUserAgent($user_agent) {
    if (preg_match('{^([^/]*)/?(\d+\.\d+)\s+(\(([^)(]+|(?3))*+\))}', $user_agent, $matches)) {
        echo($matches[1]).PHP_EOL; // "Mozilla", or empty
        echo($matches[2]).PHP_EOL; // version
        $p = preg_replace('/^\((.*)\)$/', '$1', $matches[3]);
        echo($p).PHP_EOL; // content inside the first pair of parentheses
    }
}

parseUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36');
parseUserAgent('5.0 (Linux; Android 5.0.2; MotoE2(4G-LTE) Build/LXI22.50-53.8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.76 Mobile Safari/537.36');

Detail: I used { and } to delimit the regex (instead of slashes, as in /...regex.../). This lets me use slashes inside the regex without escaping them with \. I chose { and } because they are not used in this regex, so I save a few keystrokes and make the regex a little less difficult to read.
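A small sketch of the difference, using a shortened pattern that is not the one from the answer: both calls match the same text, but the slash-delimited version has to escape the slash inside the pattern.

```php
<?php
$ua = 'Mozilla/5.0 (X11; Linux x86_64)';

// Slash delimiters: the literal slash inside the pattern must be escaped.
$withSlashes = preg_match('/^Mozilla\/(\d+\.\d+)/', $ua, $a);

// Brace delimiters: the slash needs no escaping.
$withBraces = preg_match('{^Mozilla/(\d+\.\d+)}', $ua, $b);

var_dump($withSlashes === 1 && $withBraces === 1); // did both patterns match?
var_dump($a[1] === $b[1]);                         // is the captured version the same?
```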

The first part is simple:

  • ^ is the anchor for "start of string", so I guarantee that the regex won’t pick up something in the middle of the string by accident.
  • ([^/]*)/?: zero or more characters that are not /, to capture all the text before the first slash, followed by an optional slash. This matches the "Mozilla/" chunk, for example, but can also match the empty string. I could also use [a-zA-Z]* to limit this part to letters only, but I am being deliberately simplistic here (if you know the strings will always be user agents, you can afford to simplify the regex, since the risk of false positives is low).
  • (\d+\.\d+): the version number, defined as "one or more digits, followed by a dot, followed by one or more digits"

Note that both the part before the slash and the version are in parentheses. These form capture groups, which let me retrieve those parts later, as was done in the code above ($matches[1] holds what was captured by the first group, $matches[2] what was captured by the second group, and so on).
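To see the capture groups in isolation, here is a minimal sketch that uses only this first part of the regex. The input strings are shortened for illustration, and because the rest of the pattern is left out, this fragment on its own is less strict than the full regex.

```php
<?php
// Only the first part of the regex: optional name, optional slash, version.
$pattern = '{^([^/]*)/?(\d+\.\d+)}';

preg_match($pattern, 'Mozilla/5.0 (X11; Linux x86_64)', $m1);
// $m1[0] is the whole match, $m1[1] the name, $m1[2] the version.
print_r($m1);

preg_match($pattern, '5.0 (Linux; Android)', $m2);
// Here the name group is the empty string.
print_r($m2);
```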

Then we have one or more whitespace characters (\s+), and then we get to the tricky part.


To be able to check pairs of nested parentheses, the way out was to resort to a recursive regex. Breaking it down:

  • \( and \): the parentheses themselves (opening and closing)
  • [^)(]+: one or more characters that are neither ) nor (
  • |: means "or"

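This whole section is wrapped in a pair of parentheses: (\(([^)(]+|(?3))*+\)). Since it is the third capture group in the regex, I can refer to it recursively using (?3). This means that (?3) behaves as if it were replaced, recursively, by the entire subexpression within these parentheses. The *+ after the inner group is a possessive quantifier: it repeats zero or more times without giving back what it has matched, which avoids needless backtracking.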

I mean, I might have:

  • an opening of parentheses followed by:
    • one or more characters that are not parentheses, or
    • an opening of parentheses followed by:
      • one or more characters that are not parentheses, or
      • an opening of parentheses followed by:
        • one or more characters that are not parentheses, or
        • an opening of parentheses followed by:
          • and so on...
        • followed by a closing of parentheses
      • followed by a closing of parentheses
    • followed by a closing of parentheses
  • followed by a closing of parentheses

This ensures that we will have well-formed parenthesis sequences.
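The recursive part can be tried on its own. The sketch below is a variation, not the exact regex from the answer: the balanced-parentheses group is the first group here, so the recursion is written (?1) instead of (?3), the inner group is non-capturing, and the sample string is shortened.

```php
<?php
// Group 1 is the balanced-parentheses pattern; (?1) recurses into it.
$balanced = '{(\((?:[^)(]+|(?1))*+\))}';

$text = '5.0 (Linux; MotoE2(4G-LTE) Build/LXI22) AppleWebKit/537.36 (KHTML, like Gecko)';
preg_match_all($balanced, $text, $m);

// $m[1] holds each top-level parenthesized chunk; the nested "(4G-LTE)"
// stays inside the first one instead of being matched on its own.
print_r($m[1]);
```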

Then I retrieve the groups. In the third group, the parentheses are also part of what was captured, so I remove them with preg_replace; whether to remove them or not is up to you.

The output is:

Mozilla
5.0
X11; Linux x86_64

5.0
Linux; Android 5.0.2; MotoE2(4G-LTE) Build/LXI22.50-53.8
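For the split described in the question, the loop approach mentioned earlier could look roughly like the sketch below. The function name splitAtTopLevelParens is hypothetical, and the code assumes that a chunk ends right after a closing parenthesis that brings the nesting depth back to zero, keeping the spaces that follow it as in the question's expected arrays.

```php
<?php
// Hypothetical helper: splits after each closing parenthesis at depth 0.
function splitAtTopLevelParens(string $userAgent): array
{
    $parts = [];
    $current = '';
    $depth = 0;
    $length = strlen($userAgent);

    for ($i = 0; $i < $length; $i++) {
        $char = $userAgent[$i];
        $current .= $char;

        if ($char === '(') {
            $depth++;
        } elseif ($char === ')' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                // Keep the spaces that follow the group, as in the question's examples.
                while ($i + 1 < $length && $userAgent[$i + 1] === ' ') {
                    $current .= $userAgent[++$i];
                }
                $parts[] = $current;
                $current = '';
            }
        }
    }

    if ($current !== '') {
        $parts[] = $current; // trailing text after the last group
    }
    return $parts;
}

print_r(splitAtTopLevelParens('5.0 (Linux; Android 5.0.2; MotoE2(4G-LTE) Build/LXI22.50-53.8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.76 Mobile Safari/537.36'));
```

Unbalanced input is not rejected: an unmatched closing parenthesis is simply copied into the current chunk, and anything left after the last group becomes the final element.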

  • It was great, but it did not work quite as expected. I marked it as the accepted answer anyway, and we can correct it... Using only preg_match('{^([^/]*)/?(\d+\.\d+)\s+(\(([^)(]+|(?3))*+\))}', $user_agent, $matches), it returns an array with the values separated almost right: it has 5 values, from 0 to 4. Could it return the whole string separated, as in the examples in the question? I will edit the last one to make it clearer!

  • 1

    @Robertcezar You could do preg_match('{^([^/]*)/?(\d+\.\d+)\s+(\((?:[^)(]+|(?3))*+\))}', $user_agent, $matches), which removes one element from the array (the (?: makes the chunk inside the parentheses a non-capturing group, so it is not returned in the array). But position 0 of the array cannot be eliminated, because it always holds the whole chunk that was matched by the regex. The only way is to handle the array data manually, as I did...

  • 1

    I understand the situation now; I just need to adapt the logic and it is done. It was a great help, @hkotsubo.
