Regex to pick sequence of letters or numbers

Question

Asked 5 years, 3 months ago

I have a website where I embed a lot of YouTube videos, so to make it easier I thought about writing {youtube 54sd3}, for example, and having it replaced at output time with:

<iframe width="560" height="315" src="https://www.youtube.com/embed/54sd3" frameborder="0"></iframe>

I found this solution, but it only works with numbers. Which regex accepts all characters?

$content='{youtube 123}';
$pattern = '#\{youtube ([0-9][a-Z]+)\}#i';
preg_match_all($pattern, $content, $matches);

2 answers

by hkotsubo • **55,826** points · Answer 1 · 2020-04-13T16:01:15+00:00

Your regex matched only a single digit ([0-9]) followed by one or more letters. On top of that, [a-Z] (with a capital "Z") is an error: in ASCII, "a" (97) comes after "Z" (90), so the range is out of order and the pattern fails to compile.
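To see the failure for yourself, here is a minimal sketch that runs the pattern from the question. The @ operator is only there to silence the compilation warning so the return value can be inspected; the variable names are mine.

```php
<?php
// The pattern from the question; [a-Z] is an out-of-order range.
$broken = '#\{youtube ([0-9][a-Z]+)\}#i';

// @ suppresses the warning; on a pattern that does not compile,
// preg_match_all() returns false instead of a match count.
$result = @preg_match_all($broken, '{youtube 123}', $matches);

var_dump($result); // bool(false)
```

A return value of false (as opposed to 0) means the call itself failed, not that nothing matched.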

Anyway, if you want it to accept letters or numbers, just use a single character class with both ranges: [0-9a-z] (note that I used a lowercase "z"). Since you used the i flag, the regex already matches both uppercase and lowercase letters (if you only want lowercase, remove the i at the end of the pattern).
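A small sketch of what the i flag changes, using a sample string of my own (not from the question):

```php
<?php
// Hypothetical sample input with one lowercase and one mixed-case code.
$content = '{youtube 54sd3} and {youtube AbC123}';

// With the i flag, [0-9a-z] also matches uppercase letters.
preg_match_all('#\{youtube ([0-9a-z]+)\}#i', $content, $withFlag);

// Without the flag, only digits and lowercase letters are accepted.
preg_match_all('#\{youtube ([0-9a-z]+)\}#', $content, $withoutFlag);

print_r($withFlag[1]);    // both codes: 54sd3 and AbC123
print_r($withoutFlag[1]); // only 54sd3
```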

Finally, to extract the occurrences of {youtube blablabla} and turn them into the desired tag:

$content='texto {youtube 123} blablabla {youtube 54sd3} xyz';
$pattern = '#\{youtube ([0-9a-z]+)\}#i';
if (preg_match_all($pattern, $content, $matches)) {
    foreach ($matches[1] as $valor) {
        echo "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/{$valor}\" frameborder=\"0\"></iframe>".PHP_EOL;
    }
}

Output:

<iframe width="560" height="315" src="https://www.youtube.com/embed/123" frameborder="0"></iframe>
<iframe width="560" height="315" src="https://www.youtube.com/embed/54sd3" frameborder="0"></iframe>

Or, if you want to directly replace the string itself:

echo preg_replace('#\{youtube ([0-9a-z]+)\}#i',
    '<iframe width="560" height="315" src="https://www.youtube.com/embed/$1" frameborder="0"></iframe>',
    $content);

Output:

texto <iframe width="560" height="315" src="https://www.youtube.com/embed/123" frameborder="0"></iframe> blablabla <iframe width="560" height="315" src="https://www.youtube.com/embed/54sd3" frameborder="0"></iframe> xyz

In both cases I put the part corresponding to the code in parentheses, because they form a capture group. This makes that part available in $matches[1] (in the case of preg_match_all) and in $1 (in the case of preg_replace), because it is the first pair of parentheses in the regex and therefore corresponds to group 1.
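To make the group numbering concrete, this sketch prints both the full matches and what group 1 captured, using the same sample string as above:

```php
<?php
$content = 'texto {youtube 123} blablabla {youtube 54sd3} xyz';
preg_match_all('#\{youtube ([0-9a-z]+)\}#i', $content, $matches);

// $matches[0]: the text matched by the whole pattern.
print_r($matches[0]); // {youtube 123} and {youtube 54sd3}

// $matches[1]: only what the first pair of parentheses captured.
print_r($matches[1]); // 123 and 54sd3
```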

Another detail: since the regex is case insensitive, it also accepts cases like {YOUTube abCD123}. If you want the text "youtube" to always be lowercase, with only the code allowed to mix uppercase and lowercase, remove the i flag and add the uppercase letters to the class: '#\{youtube ([0-9a-zA-Z]+)\}#'.
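A quick check of the case-sensitive variant, with a test string I made up for illustration:

```php
<?php
// No i flag: "youtube" must be lowercase, the code may mix cases.
$pattern = '#\{youtube ([0-9a-zA-Z]+)\}#';

// Hypothetical input: the second tag spells "youtube" in mixed case.
$content = '{youtube abCD123} {YOUTube abCD123}';

preg_match_all($pattern, $content, $matches);
print_r($matches[0]); // only {youtube abCD123}
```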

The other answer suggested using a dot, but there is a problem: the regex may end up matching more characters than desired, since the dot matches any character (including }, spaces, etc.). That is, if I do:

$content='texto {youtube 123} blablabla {youtube 54sd3} xyz';
$pattern = '#\{youtube (.+)\}#i';
if (preg_match_all($pattern, $content, $matches)) {
    foreach ($matches[1] as $valor) {
        echo "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/{$valor}\" frameborder=\"0\"></iframe>".PHP_EOL;
    }
}

The result will be:

<iframe width="560" height="315" src="https://www.youtube.com/embed/123} blablabla {youtube 54sd3" frameborder="0"></iframe>

Notice that the src received everything from 123 up to the last }. This is because the dot matches any character (including }), and quantifiers such as + are "greedy": they take as many characters as possible (so the regex takes more than it "should").

But in this case you don't want "any character", you want "a sequence of letters and numbers", so it's better to be more specific and put in the regex exactly what you want.
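As a rough comparison, the sketch below runs three variants side by side. The lazy quantifier (.+?) is not something used above; I include it only as a commonly suggested alternative, and the $odd string is a made-up input that shows why the specific class is still stricter.

```php
<?php
$content = 'texto {youtube 123} blablabla {youtube 54sd3} xyz';

// Greedy dot: a single match that runs up to the last }.
preg_match_all('#\{youtube (.+)\}#i', $content, $greedy);

// Lazy dot: stops at the first }, but still accepts any character.
preg_match_all('#\{youtube (.+?)\}#i', $content, $lazy);

// Specific class: accepts only letters and digits.
preg_match_all('#\{youtube ([0-9a-z]+)\}#i', $content, $specific);

print_r($greedy[1]);   // one item: 123} blablabla {youtube 54sd3
print_r($lazy[1]);     // 123 and 54sd3
print_r($specific[1]); // 123 and 54sd3

// Hypothetical input that is clearly not a video code.
$odd = '{youtube not a code!}';
$lazyAccepts     = preg_match('#\{youtube (.+?)\}#i', $odd);        // 1
$specificAccepts = preg_match('#\{youtube ([0-9a-z]+)\}#i', $odd);  // 0
```

The lazy version fixes the over-matching, but it would still turn something like {youtube not a code!} into an iframe, which the character class rejects.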

by Thomas Erich Pimentel • **3,059** points · Answer 2 · 2020-04-13T15:12:46+00:00

I will not address whether this practice is the best, or whether this is correct.

Answering the question:

Which regex accepts all characters?

.   Wildcard: matches any single character except \n.

Source: Regular expression language - quick reference

Taking the given expression: {youtube 54sd3}

Assuming the expression has one or more characters after youtube, we can use:

[{youtube ]{9}.+[}]{1}

For testing: rule