Fetch words between quotation marks through regular expression

4

The code below gives me this notice:

Notice: Undefined offset: 1

Code:

<?php
$matches = array(); 
# Run the expression
$string1 = 'string(10) "CURITIBA" string(11) "SP"'; 
$pattern = '/""(.*)""/'; 
preg_match($pattern , $string1, $matches);
echo $matches[1];
?>

I would like my echo to output this:

Curitiba
SP
  • This is happening because your $pattern is incorrect. Let’s wait for someone who knows regular expressions to help you with this.

  • OK, I don’t understand much about regular expressions, so I’ll wait :)

  • Ideally you would edit the question title to something like "Get words in quotes through regular expression", because the Notice: Undefined offset: 1 is just a consequence of the array $matches not having index 1.

  • thanks for the tip, I’ll change!

2 answers

5


It is because your regex is not finding anything in the string. Nothing in the string sits between doubled double quotes (""...""); each name you want is enclosed in a single pair of double quotes. You should also use preg_match_all() to find all occurrences, because preg_match() only finds the first one.

In the regex you also need the ? character (which makes the quantifier lazy) after the *, so that it returns only the text inside each pair of quotation marks; otherwise it matches from the first quotation mark to the last one (see the question about lazy quantifiers). It would be:

$pattern = '/"(.*?)"/';

With preg_match_all(), index 1 of $matches (group 1) holds a subarray with the two values, one per index:

echo $matches[1][0];  // CURITIBA
echo $matches[1][1];  // SP

Check on IDEONE

There is no need to create the array with $matches = array();, because preg_match_all() fills $matches with an array by itself.
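
To make the greedy versus lazy difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names $greedy and $lazy are only mine, for illustration:

```php
<?php
// Sample input taken from the question.
$string1 = 'string(10) "CURITIBA" string(11) "SP"';

// Greedy: .* runs from the first quotation mark to the last one.
preg_match_all('/"(.*)"/', $string1, $greedy);

// Lazy: .*? stops at the nearest closing quotation mark.
preg_match_all('/"(.*?)"/', $string1, $lazy);

echo $greedy[1][0], PHP_EOL;               // CURITIBA" string(11) "SP
echo implode(PHP_EOL, $lazy[1]), PHP_EOL;  // CURITIBA, then SP on the next line
```

The greedy version produces a single match that swallows everything between the outermost quotes, while the lazy one produces the two separate values.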

  • Sam, could you explain what each regex character does, or point to some quality material for studying these expressions? Thank you!

  • Okay, very good, friend. I appreciate your comment, it helped me a lot!

  • Hi @Victorcarnaval. Dude, I don’t think it’s necessary in this case to explain each character of the regex. Since the asker already had the regex ready and was only missing the ?, I just put a link to a question I asked about that character, and everything is explained in the answers there.

  • @Victorcarnaval If a new, different regex had to be created, then it would be necessary to explain what it does in detail.

  • I’ll read your question, thank you!

  • @Victorcarnaval About study material, two sites that I like a lot (and that I always cite in my answers) are this one and this one. And if you want to get to the bottom of it, there is this book: https://www.amazon.com/dp/0596528124/

  • Thanks for the content, @hkotsubo.

3

Just to complement, an alternative is:

$string1 = 'string(10) "CURITIBA" string(11) "SP"';
preg_match_all('/"([^"]+)"/', $string1, $matches);
foreach($matches[1] as $m) {
    echo $m.PHP_EOL;
}

The difference from the other answer is that the regex is "([^"]+)":

  • at the beginning and end we have the quotation marks
  • in the middle we have [^"], which is a negated character class. Basically, it means any character that is not a "
  • the quantifier + means "one or more occurrences". It is different from *, which means "zero or more occurrences". That is, if you use *, the regex also accepts cases where there is nothing between the quotes. Using +, I only get the cases where there is at least one character between them (see the difference here and here). Use whichever makes the most sense to you.
  • the parentheses form a capture group, so the array of matches has a position to store the parts that correspond to the parentheses (in this case it is $matches[1], because it’s the first pair of parentheses, so it’s the first capture group, which is at index 1)

The result is:

CURITIBA
SP
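
The difference between + and * described above can be sketched like this. The input string is a made-up example of mine (it is not in the question), chosen so that it contains an empty pair of quotes:

```php
<?php
// Hypothetical input: one quoted value and one empty pair of quotes.
$input = '"SP" and ""';

// * accepts zero characters between the quotes.
preg_match_all('/"([^"]*)"/', $input, $star);

// + requires at least one character between the quotes.
preg_match_all('/"([^"]+)"/', $input, $plus);

var_export($star[1]); // two entries: 'SP' and the empty string
echo PHP_EOL;
var_export($plus[1]); // one entry: 'SP'
echo PHP_EOL;
```

One caveat: with +, an empty pair followed by more text and another quote can pair its closing quote with the next opening one, so test against your real data.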

The other difference is that [^"]+ is a little more efficient than .*?. This is because the dot matches any character (any at all, including quotation marks, which is why the ? is needed so that the * quantifier does not take more characters than it should; see the difference here and here). And since it can match any character, including quotation marks if necessary, the regex ends up testing many more possibilities until it finds the match (the lazy quantifier, as *? is called, is very useful, but it has its price).

With [^"]+, on the other hand, the regex can advance without hesitation, because it no longer matches any character, only characters other than ". That guarantees the regex stops when it finds a ". This makes it more efficient; just compare the number of steps here and here.

Obviously, for small strings and few runs, it doesn’t make that much difference (maybe the gain is milliseconds or even less). But for larger strings, or for processes where the regex runs many times, it starts to matter (compare here and here). Note that the biggest difference is in the cases where the regex fails because the quotation marks are never closed: the dot generates many more possibilities to be tested, and the regex tests all of them until it finds a match or realizes that there is none.

Another difference is that by default the dot matches any character except line breaks, whereas [^"] also matches line breaks. So if the string contains a line break between the quotes, only the second regex finds a match (compare here and here). In that case, though, you can just use the s flag: '/"(.+?)"/s', which makes the dot match line breaks too.
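
A small sketch of the line break behaviour, using an input I made up for illustration (a quoted value split over two lines):

```php
<?php
// Hypothetical input: the quoted text contains a line break.
$input = "\"SAO\nPAULO\"";

// Dot without the s flag: cannot cross the "\n", so there is no match.
$dot = preg_match('/"(.+?)"/', $input, $m1);

// Negated class: "\n" is not a quote, so it is matched like any other character.
$negated = preg_match('/"([^"]+)"/', $input, $m2);

// Dot with the s flag: now the dot also matches "\n".
$dotAll = preg_match('/"(.+?)"/s', $input, $m3);

var_dump($dot, $negated, $dotAll); // int(0), int(1), int(1)
```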


If you want to be more specific, you can use something like:

preg_match_all('/"([A-Z]+)"/', $string1, $matches);

Now the regex only matches cases where there are uppercase letters between the quotes ([A-Z]+ is "one or more letters from A to Z"). It would make a difference if you had cases like "123" and wanted to ignore them, for example.
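
As a quick illustration of that last case, here is a sketch with an extended version of the question’s string. The extra "123" entry is my own addition, just to show it being skipped:

```php
<?php
// Hypothetical input: the question's string plus a numeric value in quotes.
$input = 'string(10) "CURITIBA" string(3) "123" string(2) "SP"';

// Only runs of uppercase letters between quotes are captured.
preg_match_all('/"([A-Z]+)"/', $input, $matches);

echo implode(PHP_EOL, $matches[1]), PHP_EOL; // CURITIBA, then SP; "123" is ignored
```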

Using .* seems easier, but you don’t always want "anything". Often you have a well-defined set of characters that you want to consider (or ignore), and in general it is better for the expression to say exactly what you want and what you don’t want.


Note: your regex had two quotes at the beginning and two at the end, which is why it couldn’t find anything.
