Traverse string with PHP and return searched values

I am developing a script that locates certain tags in a string and returns all of those found.

The tag pattern is [exibebanner id="300"], where the value inside id="..." is always an integer.

The script below does the search and returns all the values of id found:

    $str = '[exibebanner id="300"] <br/>é simplesmente uma simulação de texto da indústria tipográfica  <br/><b>[exibebanner id="40"]</b> e de impressos, e vem sendo utilizado desde o século XVI, quando um impressor desconhecido pegou um<br/> <b>[exibebanner id="90"]</b>';   


preg_match_all('/[exibebanner [^]]*id=["|\']([^"|\']+)/i', $str, $matches);

foreach ($matches[1] as $key=>$value) {
    echo PHP_EOL . $value;
}

Output:

300 40 90

The return I need is as follows:

[exibebanner id="300"] [exibebanner id="40"] [exibebanner id="90"]

With these values I can apply a function that finds these occurrences inside the string and swaps each one for a function call like: add_banner([exibebanner id="40"])

I have no experience with regular expressions and I need guidance.

2 answers

1


I changed the part of the regex that matches the value of id and it worked. The \d matches only digits, and the quantifier {1,} means one or more occurrences.

preg_match_all('/(\[exibebanner id="\d{1,}"\])/i', $str, $matches);

A suggestion: when working with regex, test it on sites like https://regexr.com/

Look at this case here: https://regexr.com/4pcqb

  • For "one or more occurrences", the most common (and simplest) option is +, so it would be \d+. The {x,} quantifier is meant for when x is greater than 1, because in that case there is no shorter alternative.

1

First I think it’s worth explaining why your regex didn’t work.

Basically, square brackets [] have a special meaning in regex: they define a character class. For example, [abc] is a regular expression meaning "the letter a, or the letter b, or the letter c" (exactly one of them, and any of the three will do). Order does not matter, so [abc], [bac] and [cab] are equivalent.
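As a quick check of the character class behavior, here is a minimal sketch. The sample text 'a-b-c-d' and the variable names are only illustrative assumptions.

```php
<?php
// Three spellings of the same character class: order inside [] is irrelevant.
$patterns = ['/[abc]/', '/[bac]/', '/[cab]/'];

$results = [];
foreach ($patterns as $pattern) {
    // Each match is a single character taken from the class.
    preg_match_all($pattern, 'a-b-c-d', $found);
    $results[$pattern] = implode(',', $found[0]);
}

// All three patterns find the same characters: a, b and c, but never d.
print_r($results);
```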

That is, in your regex, the [exibebanner [^] means "the letter e, or the letter x, or the letter i ..., or a space, or the character [, or the character ^". Note that the letters e, b, and n appear more than once, which is redundant. What matters, though, is that this whole stretch corresponds to a single character (any one of those listed inside the brackets). For the regex to treat [ and ] as literal characters, we must escape them with \ (that is, write them as \[ and \]).

Then we have ]*. There is a catch here. Since there is no matching opening bracket, the engine treats this as a literal ] character (that is, in this case you don't need to escape it with \). And the quantifier * means "zero or more occurrences". So a single ] matches, several ]]]] also match, and even no ] at all matches (after all, it is zero or more occurrences of ]).
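To see the unpaired ] acting as a literal character, here is a small sketch. The pattern /x]*y/ and the test strings are made up for illustration and are not part of the question.

```php
<?php
// Outside a character class, an unpaired ] is a literal character in PCRE.
// The pattern reads: "x", then zero or more "]", then "y".
$pattern = '/x]*y/';

$none    = preg_match($pattern, 'xy');      // zero ] between x and y
$one     = preg_match($pattern, 'x]y');     // a single ]
$several = preg_match($pattern, 'x]]]]y');  // several ]

// preg_match returns 1 when the pattern matches and 0 when it does not.
var_dump($none, $one, $several);
```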

All of this explains how this regex matches. The [exibebanner [^] captured the blank space before id= (note that there is a space inside the brackets, so it is one of the characters in this class). The ]* part, in turn, did not consume any character (because * also accepts zero occurrences), and then the rest of the regex matched from id= onward.
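You can confirm this by looking at the full matches in $matches[0] instead of only the group. The sketch below uses a shortened version of the sample text (an assumption, just to keep the output small) with the regex from the question left unchanged.

```php
<?php
// Shortened version of the sample text, used here only for brevity.
$str = '[exibebanner id="300"] text <b>[exibebanner id="40"]</b>';

// The regex from the question, unchanged.
preg_match_all('/[exibebanner [^]]*id=["|\']([^"|\']+)/i', $str, $matches);

// Full matches: each one starts with the space consumed by the character class.
var_dump($matches[0]);

// Capture group 1: only the id values, which is what the question was printing.
var_dump($matches[1]);
```

Each full match is just the space, id=, the opening quote and the digits, which is why the tag itself never shows up in the result.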

So the first thing to do is escape the brackets with \. There are other details to improve; one alternative would be:

preg_match_all('/\[exibebanner id=["\']\d+["\']\]/i', $str, $matches);

foreach ($matches[0] as $value) {
    echo PHP_EOL . $value;
}

I removed the $key from the foreach because it was not being used. The output is:

[exibebanner id="300"]
[exibebanner id="40"]
[exibebanner id="90"]

I changed other details in your regex.

First the quotation marks (right after id=): you had used ["|\'], but as we've seen before, brackets define a character class, so this snippet matches any of the characters ", | and '. This is a very common mistake in regex: the character | is normally used for alternation, but inside brackets it "loses its powers" and becomes an ordinary character. So if you had something like id=|300, the regex would find a match. That is why I removed the | and left only ["\'].

Then you had used [^"|\'], which is a negated character class (the ^ right after the [ means I want the characters that are not in the list). That is, it matches any character other than ", | and '. This means that if the text has id="abc" or id="!@#$%&", the regex will also find a match.
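A short sketch of the | problem, using the id=|300 case mentioned above. The two patterns are reduced to the relevant part, and the variable names are only illustrative.

```php
<?php
$text = 'id=|300';

// Inside brackets, | is an ordinary character, so this class accepts ", | or '.
$withPipe = preg_match('/id=["|\']\d+/', $text);

// Without the |, only the two quote characters are accepted.
$withoutPipe = preg_match('/id=["\']\d+/', $text);

// The first pattern matches the malformed text, the second does not.
var_dump($withPipe, $withoutPipe);
```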

Since you said the id is always an integer, be more specific and use the shorthand \d, which matches only digits. I also used the quantifier +, meaning "one or more occurrences" (the other answer used {1,}, which is equivalent, but + is more usual for this case; {x,} or {x,y} makes more sense for other quantities, such as {3,} to indicate "at least 3 occurrences").
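The difference between the negated class and \d can be compared side by side. This is only a sketch: the three patterns are reduced to the id part, and the sample values are the ones mentioned in this answer.

```php
<?php
$samples = ['id="300"', 'id="abc"', 'id="!@#$%&"'];

$loose  = '/id=["\'][^"\']+/';  // any character except quotes
$strict = '/id=["\']\d+/';      // digits only
$braces = '/id=["\']\d{1,}/';   // same meaning as \d+

$report = [];
foreach ($samples as $sample) {
    $report[$sample] = [
        'loose'  => preg_match($loose, $sample),
        'strict' => preg_match($strict, $sample),
        'braces' => preg_match($braces, $sample),
    ];
}

// The loose pattern accepts all three samples; the other two accept only id="300".
print_r($report);
```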

Another detail is that you used the i flag (the letter i right after the second /, at the end of the regex), which makes the regex case insensitive, that is, it does not distinguish between upper and lower case letters. So if the text has [EXIBEbanner Id="123"], it will also be found. If you want only lowercase letters to be accepted, simply remove the i from the expression.

Also note that I removed the parentheses, as they are not needed here. You can simply use $matches[0], which holds every full match found. Parentheses form capture groups, and in your case, since there was only one pair of parentheses, it was the first group, and so its contents were available in $matches[1]. But here the regex already matches the whole stretch you need, so there is no need to create a group.
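The effect of the i flag can be checked with the mixed-case tag from the paragraph above. This is a minimal sketch, and the variable names are only illustrative.

```php
<?php
$text = '[EXIBEbanner Id="123"]';

// With the i flag, letter case is ignored.
$insensitive = preg_match('/\[exibebanner id=["\']\d+["\']\]/i', $text);

// Without it, the pattern only accepts the exact lowercase spelling.
$sensitive = preg_match('/\[exibebanner id=["\']\d+["\']\]/', $text);

var_dump($insensitive, $sensitive);
```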
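If you ever need both the full tag and the numeric id, a group around \d+ gives you the two at once. The sketch below is a variation of the regex above, not something the question requires, and it uses a shortened sample text for brevity.

```php
<?php
// Shortened version of the sample text, used here only for brevity.
$str = '[exibebanner id="300"] text <b>[exibebanner id="40"]</b>';

// The parentheses around \d+ create group 1; index 0 is always the full match.
preg_match_all('/\[exibebanner id=["\'](\d+)["\']\]/i', $str, $matches);

// $matches[0]: the complete tags.
var_dump($matches[0]);

// $matches[1]: only the digits captured by the group.
var_dump($matches[1]);
```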


But the above regex still has a problem.

Since you used ["\'] for the quotation marks, I assume the text can have both double and single quotes. However, the regex does not check whether the character used to open is the same as the one used to close. This means it finds a match in cases such as [exibebanner id="90'].

To avoid this problem, we can use a capture group (this time it is necessary), along with backreferences:

preg_match_all('/\[exibebanner id=(["\'])\d+\1\]/i', $str, $matches);
// the rest of the code stays the same

I placed the parentheses around the opening quote (right after id=), which creates the first capture group. Then I use the backreference \1, meaning "the same text that was captured in group 1". In this case, it will be the character that matched the opening quote (" or '). This ensures that the regex only matches cases where the opening and closing quotes are the same character, ignoring cases like id="90'.

Of course, if the data is controlled and you "know" that the quotes are always correct (there are no cases like id="90'), there is no need for this much rigor. The ideal is to be as specific as possible, while finding a balance between the complexity of the regex and its accuracy on the data being analyzed.
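To compare the two versions, the sketch below runs both regexes against tags with matching and mismatched quotes. The sample tags and variable names are illustrative assumptions.

```php
<?php
$samples = [
    '[exibebanner id="90"]',    // matching double quotes
    "[exibebanner id='90']",    // matching single quotes
    '[exibebanner id="90\']',   // opens with " and closes with '
];

$noBackref   = '/\[exibebanner id=["\']\d+["\']\]/i';
$withBackref = '/\[exibebanner id=(["\'])\d+\1\]/i';

$report = [];
foreach ($samples as $sample) {
    $report[] = [
        'tag'         => $sample,
        'noBackref'   => preg_match($noBackref, $sample),
        'withBackref' => preg_match($withBackref, $sample),
    ];
}

// Only the version with \1 rejects the tag with mismatched quotes.
print_r($report);
```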
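As for swapping each tag for the result of a function, one possible approach is preg_replace_callback, which replaces every match with whatever the callback returns. This is only a sketch: the add_banner below is a hypothetical stand-in, since the real function's signature and output are not shown in the question, and I am assuming it receives the numeric id.

```php
<?php
// Hypothetical stand-in for the add_banner function mentioned in the question;
// its real signature and output are unknown.
function add_banner(int $id): string
{
    return "<!-- banner {$id} -->";
}

// Shortened version of the sample text, used here only for brevity.
$str = '[exibebanner id="300"] text <b>[exibebanner id="40"]</b>';

// Group 1 is the opening quote (reused by \1), group 2 is the numeric id.
$result = preg_replace_callback(
    '/\[exibebanner id=(["\'])(\d+)\1\]/i',
    function (array $m): string {
        return add_banner((int) $m[2]);
    },
    $str
);

echo $result, PHP_EOL;
```

With this approach there is no need to collect the tags first and search for them again: finding and replacing happen in a single pass.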
