How to extract all [[xxx]] occurrences from a string in PHP?

I have a text with placeholders such as [[xx]], [[ccvf]], [[dfg]], etc. The text inside each placeholder is arbitrary, and the number of placeholders varies.

Given the following text, how can I get an array with all the placeholders?

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed aliquet imperdiet sapien, [[xx]] vitae luctus augue convallis quis. Pellentesque felis eros, dignissim vitae [[ccvf]] dignissim sed, porta a leo. Duis tincidunt, ex sit amet sollicitudin vehicula, nibh velit ultrices ipsum, at feugiat enim arcu et [[dfg]] enim.

Desired result:

$placeholders = ['xx','ccvf','dfg']

2 answers

1


In your answer (which is not wrong), you use the U flag, which makes the * quantifier lazy instead of greedy. It is the same as using /\[\[(.*?)\]\]/: using .*? without the U flag has the same effect as using .* with the flag, and both work perfectly. I would just like to propose an alternative.
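The equivalence between the two lazy forms can be checked with a minimal sketch like the one below. The sample text in $text is hypothetical; only the patterns come from this discussion. A greedy version is included for contrast.

```php
<?php
// Hypothetical sample text, used only for illustration.
$text = "a [[xx]] b [[ccvf]] c";

// Lazy matching through the U flag (inverts the greediness of quantifiers).
preg_match_all('/\[\[(.*)\]\]/U', $text, $withFlag);

// Lazy matching through the explicit .*? quantifier, without the flag.
preg_match_all('/\[\[(.*?)\]\]/', $text, $withLazy);

// Greedy matching for contrast: .* runs up to the last ]] in the string.
preg_match_all('/\[\[(.*)\]\]/', $text, $greedy);

var_dump($withFlag[1] === $withLazy[1]); // bool(true)
print_r($withLazy[1]);                   // xx, ccvf
print_r($greedy[1]);                     // a single capture: "xx]] b [[ccvf"
```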

Although .*? (or .* with the U flag, which is the same thing) works, lazy quantifiers have a price. Basically, the regex has to go back and forth several times in the string to find an excerpt that satisfies it. And since the dot matches any character, the number of possibilities the regex needs to check can grow exponentially, depending on the case (this does not happen with your regex, but it is still important to know this and not use .* automatically everywhere).

With regex, it is best to say exactly what you want and what you do not want. The dot matches any character, but do you really want "anything" between [[ and ]]?

An alternative would be to use '/\[\[([^\[\]]*)\]\]/'. I replaced the dot with [^\[\]], which is a negated character class. That is, it matches any character that is not one of those listed between [^ and ]. In this case the list is \[\], so it matches any character that is neither [ nor ]. So I do not even need the U flag, because the * quantifier will stop when it finds the first ] (or another [, which avoids cases like [[[, which I understood should not occur).
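As a minimal sketch of this alternative applied to the question, the code below uses a shortened version of the sample text (an assumption made to keep the example small) and takes the capture group as the desired array. The variable names are illustrative.

```php
<?php
// Shortened, hypothetical version of the text from the question.
$str = "Sed aliquet [[xx]] vitae, dignissim [[ccvf]] sed, arcu et [[dfg]] enim.";

// Negated character class: anything except [ and ] between [[ and ]].
preg_match_all('/\[\[([^\[\]]*)\]\]/', $str, $matches);

// Index 0 holds the full matches, index 1 holds the first capture group.
$placeholders = $matches[1];

print_r($placeholders); // xx, ccvf, dfg
```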

The performance difference in this case is not large, but the negated character class is slightly faster: compare the number of steps taken by the version with .* against the second version. Of course, for a few small strings, the difference will be imperceptible. But there is another difference between these solutions.

If the string has an incomplete placeholder (for example, [[abc] - with a ] missing, perhaps due to a typo - or [[abc - with no closing at all), the regex with .* ends up capturing more characters than it should (since the dot matches any character, including [ and ] themselves, the dot will take a [ or ] as part of the match if the regex needs it to). Example:

$str = "Lorem ipsum [[xx] et [[dfg]] abc [[ops abc [[xyz]]";
preg_match_all('/\[\[(.*)\]\]/U', $str, $pat_array);
var_dump($pat_array);
preg_match_all('/\[\[([^\[\]]*)\]\]/', $str, $pat_array);
var_dump($pat_array);

The output of this code is:

array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(16) "[[xx] et [[dfg]]"
    [1]=>
    string(17) "[[ops abc [[xyz]]"
  }
  [1]=>
  array(2) {
    [0]=>
    string(12) "xx] et [[dfg"
    [1]=>
    string(13) "ops abc [[xyz"
  }
}
array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(7) "[[dfg]]"
    [1]=>
    string(7) "[[xyz]]"
  }
  [1]=>
  array(2) {
    [0]=>
    string(3) "dfg"
    [1]=>
    string(3) "xyz"
  }
}

Note that the first regex, with .*, ends up capturing xx] et [[dfg and ops abc [[xyz, because it cannot detect that xx is followed by only one ] and that ops has no closing brackets. And since the dot matches any character, the regex keeps advancing through the string until it finds an occurrence of ]]. So it ends up capturing more than it should.

When I use [^\[\]]*, the regex stops when it finds a [ or ]; if the closing ]] is not found at that point, the match attempt fails and the regex continues searching from other positions in the string. So it only finds the placeholders that have both the opening ([[) and the closing (]]), ignoring the other cases.

In addition, the second regex is more efficient and takes less time to detect and ignore these problems. Compare the number of steps of the first regex with that of the second. Again, for a few small strings the performance difference will not be large, and if all strings have properly delimited placeholders (that is, if a closing ] is never missing), this problem will not occur.


If you want to be even more specific, you can use a regex that matches exactly what the placeholder is allowed to contain. If it can only have letters, for example, just use '/\[\[([a-zA-Z]*)\]\]/'. The excerpt [a-zA-Z] matches any letter from a to z, uppercase or lowercase.

Another detail is that * means "zero or more occurrences", so it may end up matching the string [[]]. If you want to require at least one character between [[ and ]], change it to '/\[\[([a-zA-Z]+)\]\]/', because + means "one or more occurrences" (i.e., there must be at least one letter between the brackets). Another option is to use explicit bounds such as '/\[\[([a-zA-Z]{3,20})\]\]/' (between 3 and 20 letters) or '/\[\[([a-zA-Z]{3,})\]\]/' (at least 3 letters, no maximum limit). Adapt according to what you need.
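To see how the quantifier choice changes what is accepted, here is a small sketch that runs three of these patterns over the same input. The test string in $sample is made up for illustration.

```php
<?php
// Hypothetical input: an empty placeholder, two valid ones, and one with a digit.
$sample = "[[]] [[ab]] [[abc]] [[a1]]";

// * allows zero letters, so the empty placeholder [[]] is matched too.
preg_match_all('/\[\[([a-zA-Z]*)\]\]/', $sample, $zeroOrMore);

// + requires at least one letter.
preg_match_all('/\[\[([a-zA-Z]+)\]\]/', $sample, $oneOrMore);

// {3,20} requires between 3 and 20 letters.
preg_match_all('/\[\[([a-zA-Z]{3,20})\]\]/', $sample, $bounded);

print_r($zeroOrMore[1]); // "", ab, abc
print_r($oneOrMore[1]);  // ab, abc
print_r($bounded[1]);    // abc
```

In all three cases [[a1]] is skipped, because the digit is not part of [a-zA-Z].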

1

I found the answer:

$str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed aliquet imperdiet sapien, [[xx]] vitae luctus augue convallis quis. Pellentesque felis eros, dignissim vitae [[ccvf]] dignissim sed, porta a leo. Duis tincidunt, ex sit amet sollicitudin vehicula, nibh velit ultrices ipsum, at feugiat enim arcu et [[dfg]] enim.";

preg_match_all ("/\[\[(.*)\]\]/U", $str, $pat_array);

Result:

array:2 [
  0 => array:3 [
    0 => "[[xx]]"
    1 => "[[ccvf]]"
    2 => "[[dfg]]"
  ]
  1 => array:3 [
    0 => "xx"
    1 => "ccvf"
    2 => "dfg"
  ]
]
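As the dump shows, index 0 holds the full matches and index 1 holds the captured text. If one entry per match is more convenient, preg_match_all also accepts the PREG_SET_ORDER flag. The sketch below assumes a shortened sample string, and the names $sets and $placeholders are illustrative.

```php
<?php
// Shortened, hypothetical version of the sample text.
$str = "sapien, [[xx]] vitae [[ccvf]] dignissim et [[dfg]] enim.";

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all('/\[\[(.*)\]\]/U', $str, $sets, PREG_SET_ORDER);

// Each entry is [full match, capture 1]; collect column 1 from every entry.
$placeholders = array_column($sets, 1);

print_r($sets[0]);      // [[xx]], xx
print_r($placeholders); // xx, ccvf, dfg
```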
