I’m having trouble picking a value between two strings in PHP using REGEX

$exemplo = "Olá, como vocês estão? Eu queria pegar o valor entre esses colchetes {# Olá mundo} e que esse aqui {#Teste} não interferisse, e me retornasse:";

I’d like the code to return:

[
   [0] => (string) "Olá mundo",
   [1] => (string) "Teste"
]

My code:

preg_match_all('/\{#(.*)\}/', $exemplo, $results);
var_dump($results);

But it returns:

 array(2) { [0]=> array(1) { [0]=> string(39) "{# Olá mundo} e que esse aqui {#Teste}" } [1]=> array(1) { [0]=> string(37) "# Olá mundo} e que esse aqui {#Teste" } }
  • 4

    Try /\{#[^}]*)\}/ - [^}]* means "anything but }". Another possibility is /\{#(.*?)\}/ - *? means "as little as possible"; *? is the "non-greedy" version of *

  • 1

    The first did not work; however, the second (/{#(.*?)}/, with *?) worked perfectly. Thank you very much!

  • If Cypherpotato’s answer solves your problem, you can mark it as accepted by clicking the green check mark next to it

2 answers

4

You’re almost there; it just needs a small addition:

\{#(.*?)\}

The character ? right after a * means "as few as possible", in contrast to * alone, which means "as many as possible". After this group the regex looks for the next part of the pattern, which is the }.

Another possibility is to capture the content inside the { ... } without the braces, with:

(?<=\{).+?(?=\})

This will capture # Olá mundo and #Teste in the string used.
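
To make the difference between the two patterns concrete, here is a minimal sketch using the string from the question. The variable names $comGrupo and $comLookaround are just illustrative choices, not something from the original post.

```php
<?php
$exemplo = "Olá, como vocês estão? Eu queria pegar o valor entre esses colchetes {# Olá mundo} e que esse aqui {#Teste} não interferisse, e me retornasse:";

// Lazy quantifier with a capture group: group 1 holds the text after "{#".
preg_match_all('/\{#(.*?)\}/', $exemplo, $comGrupo);

// Lookbehind/lookahead: the whole match (index 0) leaves out the braces,
// but keeps the "#", because only "{" is inside the lookbehind.
preg_match_all('/(?<=\{).+?(?=\})/', $exemplo, $comLookaround);

print_r($comGrupo[1]);      // " Olá mundo" and "Teste"
print_r($comLookaround[0]); // "# Olá mundo" and "#Teste"
```

With the capture group you read the result from index 1; with the lookarounds there is no group, so the result is at index 0.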

  • I got the following error: Warning: preg_match_all(): Delimiter must not be alphanumeric or backslash in. However, I’ve solved my problem anyway, thank you so much!

  • 1

    @Sydo did you remember to put the / around it? Note that the / / are not part of the regex; they are the delimiters. In your question you have /\{#(.*)\}/, but the regex itself is \{#(.*)\} - you just need to wrap it in / or in another character not used in the expression

  • You’re right, it worked once I added them. Thanks again, and thanks also @Cypherpotato

  • Hello Cypher, I have another question, if it’s no bother: could you suggest a regex that captures the content inside the PHP tag? <?php valor ?>

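  • @Sydo You could try something like /\<\?php(.*?)\?\>/ms with preg_match_all (PHP’s preg functions do not accept the g flag; preg_match_all already finds every match). Note that you need the single-line flag (s) so the dot also matches line breaks.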
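
As a small illustration of the delimiter point discussed in the comments above, the sketch below runs the same regex with two different delimiters. The shortened test string and the names $comBarra and $comTil are assumptions made for the example.

```php
<?php
$exemplo = "{# Olá mundo} e {#Teste}";

// The regex itself is \{#(.*?)\} in both calls; only the delimiter changes.
// "~" works here because the pattern contains no "~". Using "#" as the
// delimiter would be a poor choice, since the pattern itself contains "#".
preg_match_all('/\{#(.*?)\}/', $exemplo, $comBarra);
preg_match_all('~\{#(.*?)\}~', $exemplo, $comTil);

var_dump($comBarra[1] === $comTil[1]); // bool(true)
```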

2


The problem with your regex is that you used the dot, which in regex means "any character" (except line breaks), together with the quantifier * (meaning "zero or more occurrences").

By default, quantifiers are "greedy" and try to match as many characters as possible. And since the dot matches any character (including { and }, if the regex finds it necessary), it ends up matching more text than you would like (in this case, from the first { to the last } it finds).

One of the solutions has already been pointed out in another answer: use .*? to make the quantifier "lazy", so it matches as few characters as possible - that is, it only goes as far as the first } it finds (more details on lazy quantifiers here and here).
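
The greedy and lazy behaviors described above can be compared side by side. This is only a sketch: the short test string and the names $greedy and $lazy are made up for the illustration.

```php
<?php
$texto = "a {#um} b {#dois} c";

// Greedy: .* runs to the end and backs up to the LAST "}".
preg_match_all('/\{#(.*)\}/', $texto, $greedy);

// Lazy: .*? stops at the FIRST "}" after each "{#".
preg_match_all('/\{#(.*?)\}/', $texto, $lazy);

print_r($greedy[1]); // one element: "um} b {#dois"
print_r($lazy[1]);   // two elements: "um" and "dois"
```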


But I would like to suggest another solution (which has already been given in the comments, but I’m going to go into more detail anyway).

Instead of using the dot (which matches any character), you can be more specific and indicate exactly what you need. In this case, you do not want any character, but rather "any character that is not }". For that we use a negated character class: [^}] (meaning "any character other than }"). Notice I didn’t have to escape it as \}, because inside brackets many metacharacters do not need the \. The code would look like this:

$exemplo = "Olá, como vocês estão? Eu queria pegar o valor entre esses colchetes {# Olá mundo} e que esse aqui {#Teste} não interferisse, e me retornasse:";
preg_match_all('/\{#([^}]+)\}/', $exemplo, $results);
foreach ($results[1] as $texto) {
    echo $texto.PHP_EOL;
}

Notice I switched the quantifier * for +. The * means "zero or more occurrences", which means that if there is nothing between the # and the }, the regex would also accept it (and an empty string would appear in the results). The + means "one or more occurrences", requiring at least one character between the # and the }.

I also put [^}]+ in parentheses to form a capture group. And since it is the first pair of parentheses in the regex, this part will be in group 1 (which in turn will be at position 1 of the array $results). Traversing the array $results[1], I get all the captured texts. The output is:

 Olá mundo
Teste

Notice that it includes the space before "Olá mundo", since the regex takes everything between the # and the }.
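
To see the difference between * and + mentioned above, here is a minimal sketch with a string that contains an empty {#}. The test string and the names $comAsterisco and $comMais are assumptions for the example.

```php
<?php
$texto = "{#} e {#abc}";

// With *, the empty "{#}" also matches and yields an empty string.
preg_match_all('/\{#([^}]*)\}/', $texto, $comAsterisco);

// With +, at least one character is required, so "{#}" is skipped.
preg_match_all('/\{#([^}]+)\}/', $texto, $comMais);

print_r($comAsterisco[1]); // "" and "abc"
print_r($comMais[1]);      // only "abc"
```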


One detail is that this regex also matches texts like {#abc{xyz} (in this case, it captures "abc{xyz"). If you don’t want it to accept the { between the c and the x, you can include { in the negated character class, giving /\{#([^{}]+)\}/ - now I’m saying I want [^{}] (anything that is neither { nor }).

Another detail (which in this case I admit is nothing more than micro-optimization, but still worth mentioning) is that using [^{}] makes the regex a little faster than using the dot. Compare here and here the number of steps of each regex. Obviously, for a few small strings the difference will be insignificant (probably milliseconds or even less), but when processing a large amount of data, depending on the strings and the regex, using .*? can lead to catastrophic results.

This happens because the lazy quantifier, though convenient, has its price. Upon finding .*?\}, the regex first tries to find a match with zero characters (i.e., it checks whether the next one is already the }). If not, it tries with one character and checks whether the second is }. If not, it tries with two characters and checks whether the third is }, and so on, until it finds a }. This back-and-forth is called backtracking, and depending on the case it can be very costly. With [^{}]+, on the other hand, the regex can advance "without fear" - and without backtracking - because it will stop as soon as it finds a { or }. By avoiding the backtracking and ensuring that it stops at the right point, I make it more efficient (this explains the difference in steps in the links above - and as the text grows, the ratio between the step counts grows further, see here and here).

I understand that using .* seems easier, and for simpler cases "it works", but only use it if you really want "anything". That is not the case in the question, because what you actually wanted was any character other than the braces.
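
The effect of adding { to the negated class can be checked with a short sketch. The second test string ({#abc{#xyz}) and the variable names are additions made for illustration only.

```php
<?php
// [^}] accepts a "{" inside the captured text.
preg_match_all('/\{#([^}]+)\}/', '{#abc{xyz}', $aceitaChave);

// [^{}] rejects it: after "abc" the regex needs "}" but finds "{",
// and no other "{#" exists, so there is no match at all.
preg_match_all('/\{#([^{}]+)\}/', '{#abc{xyz}', $rejeitaChave);

// When a second "{#" appears, [^{}] matches only the innermost block.
preg_match_all('/\{#([^{}]+)\}/', '{#abc{#xyz}', $interno);

print_r($aceitaChave[1]);  // "abc{xyz"
print_r($rejeitaChave[1]); // empty array
print_r($interno[1]);      // "xyz"
```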


Another difference is that the dot by default does not pick up line breaks. That is, if the text has something like:

{#texto em
várias
linhas}

A regex with .*? does not match it, because the dot does not match line breaks. But it is possible to change this behavior by adding the s flag to the regex:

// with the "s" flag, the dot also matches line breaks
preg_match_all('/\{#(.*?)\}/s', $exemplo, $results);
                            ^ here

With [^{}], on the other hand, line breaks are already matched, without needing the s flag. In this case, if you do not want to match line breaks, just include them in the negated character class:

// include \n and \r so line breaks are not matched
preg_match_all('/\{#([^{}\n\r]+)\}/', $exemplo, $results);
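
The four variants discussed in this last part can be compared on a single multi-line input. This is a sketch; the test string (which adds a single-line block for contrast) and the variable names are assumptions.

```php
<?php
// Double quotes so that \n becomes a real line break in the text.
$texto = "{#texto em\nvárias\nlinhas} e {#uma linha}";

preg_match_all('/\{#(.*?)\}/', $texto, $ponto);             // dot, no flag
preg_match_all('/\{#(.*?)\}/s', $texto, $pontoS);           // dot with "s"
preg_match_all('/\{#([^{}]+)\}/', $texto, $classe);         // negated class
preg_match_all('/\{#([^{}\n\r]+)\}/', $texto, $classeSemQuebra);

print_r($ponto[1]);           // only "uma linha"
print_r($pontoS[1]);          // the multi-line text and "uma linha"
print_r($classe[1]);          // same result as with the "s" flag
print_r($classeSemQuebra[1]); // only "uma linha"
```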
