Regex - Operator "." - Meta character capture

5

Situation

I was running a regex search for a specific word, inválido, and decided to use the pattern inv.lido. I knew the word was in the test string, but the pattern returned no match.

Testing

// Shorthand used below: vr = var_dump, pr = print_r

$string = 'até, atenção, Hipótese, você, português, café, órgão';

vr(preg_match('~at.~', $string, $match));
pr($match);
vr(preg_match('~aten..o~', $string, $match));
pr($match);
vr(preg_match('~Hip.tese~', $string, $match));
pr($match);
vr(preg_match('~voc.~', $string, $match));
pr($match);
vr(preg_match('~portugu.s~', $string, $match));
pr($match);
vr(preg_match('~caf.~', $string, $match));
pr($match);
vr(preg_match('~.rg.o~', $string, $match));
pr($match);

Out

int(1)
Array([0] => at�)

int(0)
Array()

int(0)
Array()

int(1)
Array([0] => voc�)

int(0)
Array()

int(1)
Array([0] => caf�)

int(0)
Array()

Question

As you can see, it did not capture the words, except for a few, and even in the ones it did capture I don’t know what the � is, because even with utf8_decode or utf8_encode it does not give me the correct character.

From what little I know of C and binary, I suppose it has to do with these characters being made up of two 8-bit units; however, they are present in the ASCII table, and as far as I know regex follows the ASCII table.

Why did this happen?

  • 2

    Accented characters do not belong to the ASCII character set. Some "extended ASCII" sets (there is more than one), rarely used nowadays in favor of Unicode, do represent accented characters in a single byte, but as far as I know this is not what PHP or any other regex library I know of follows.

  • In fact, I replaced "." with the respective character in hex, "\x82" and "\x88", and it did not match either, even though the operator "." means any character except \n, unless the s modifier is set.

  • Yes. But I suspect (I’m not experienced with PHP) that it is only considering the first byte of the UTF-8 representation of each accented letter; at least this interpretation is consistent with the result you’re getting. See my answer below for more details.

2 answers

5


PHP regular expressions do not support Unicode by default, unless you use the flag u:

preg_match('~aten..o~', $string, $match);
print_r($match);

Array ( )

preg_match('/aten..o/u', $string, $match);
print_r($match);

Array ( [0] => atenção )

Example on ideone.
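As a rough check, the seven patterns from the question can be rerun with the u flag. This is only a sketch: it assumes the script file is saved as UTF-8, and the $patterns and $results names are made up for the illustration.

```php
<?php
// Assumption: this file (and therefore the string literal) is encoded as UTF-8.
$string = 'até, atenção, Hipótese, você, português, café, órgão';

// The same patterns used in the question, without delimiters.
$patterns = ['at.', 'aten..o', 'Hip.tese', 'voc.', 'portugu.s', 'caf.', '.rg.o'];

$results = [];
foreach ($patterns as $p) {
    // The trailing "u" makes PCRE treat both pattern and subject as UTF-8,
    // so "." consumes one whole character instead of one byte.
    $found = preg_match('~' . $p . '~u', $string, $match);
    $results[$p] = ($found === 1) ? $match[0] : null;
    echo $p, ' => ', var_export($results[$p], true), PHP_EOL;
}
```

With the flag in place, every pattern should find its word, including aten..o and .rg.o, which failed in the question.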

As for the results you are getting (e.g. at�): the reason is that accented characters are usually represented by more than one byte, for example in UTF-8 encoding. A pattern such as:

at.

without the u flag will match 3 bytes: the first is a, the second is t, and the third is the first byte of é. Since that lone byte is not a valid ASCII character (nor a complete UTF-8 sequence), print_r has no way to represent it, so a � is shown instead. As for the pattern:

aten..o

when applied to the word atenção, it matches the first . with the first byte of ç and the second . with the second byte of ç; when it then tries to match the o against the first byte of ã, it cannot, and the match fails.

With the u flag enabled, the engine takes whole characters (not just bytes) into account when matching, so the first dot matches ç, the second matches ã, and the result is as expected.
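The byte-versus-character behavior described above can be made visible by dumping the matches as hex. This is a minimal sketch assuming a UTF-8 source file; the variable names are illustrative only.

```php
<?php
// Assumption: this file is saved as UTF-8, so é is the two bytes C3 A9.
$word = 'até';

// Without "u": "." consumes a single byte, so the match stops in the middle of é.
preg_match('~at.~', $word, $byteMatch);
// With "u": "." consumes the whole character.
preg_match('~at.~u', $word, $charMatch);

echo bin2hex($byteMatch[0]), PHP_EOL; // 6174c3   (a, t, first byte of é)
echo bin2hex($charMatch[0]), PHP_EOL; // 6174c3a9 (a, t, both bytes of é)

// In byte mode, ç and ã need two dots each, which is why four dots work.
$twoDots  = preg_match('~aten..o~', 'atenção');   // 0: "o" is compared with a byte of ã
$fourDots = preg_match('~aten....o~', 'atenção'); // 1: four dots cover the four bytes
var_dump($twoDots, $fourDots);
```

The four-dot pattern only works by accident of the encoding, so the u flag remains the more reliable choice.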

  • Very good answer; in fact, when I tested with ~aten....o~ it captured the correct word. I did not know about this peculiarity of the u flag, and I always thought PHP followed ASCII exactly.

  • 2

    @Guilhermelautert Maybe you are confusing ASCII with Unicode. I suggest this Joel Spolsky article, which is very good; just don’t be scared by the title, hehe!

  • 1

    I’ll do this, thanks for the article.

4

As @mgibsonbr already mentioned, by default PHP's preg functions do not support Unicode in regular expressions.

In addition to the solution already presented, another option is to use the regular expression functions of the Multibyte String extension.

Example:

$str = 'inválido';

var_dump(mb_ereg_match('inv.lido', $str)); // bool(true)

Note:

According to this answer on SOEN, the mb_ereg_* functions are not marked as deprecated, so it is fine to use them.
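To apply the same idea to the string from the question, here is a small sketch using mb_ereg, which searches the whole string and fills an array with the match. It assumes the mbstring extension (with its regex support) is loaded and that the file is saved as UTF-8; the variable names are illustrative only.

```php
<?php
// Assumptions: mbstring with regex support is available; file saved as UTF-8.
mb_regex_encoding('UTF-8'); // set the encoding explicitly rather than rely on defaults

$string = 'até, atenção, Hipótese, você, português, café, órgão';

// mb_ereg takes the pattern without delimiters and searches anywhere in the subject.
$found = mb_ereg('aten..o', $string, $regs);
var_dump((bool) $found); // cast because the return type differs between PHP versions
print_r($regs);          // $regs[0] holds the matched text

// mb_ereg_match only tests for a match at the beginning of the string,
// so it succeeds for 'inválido' but not for a word in the middle of $string.
$prefix   = mb_ereg_match('inv.lido', 'inválido');
$anchored = mb_ereg_match('aten..o', $string);
var_dump($prefix, $anchored);
```

Because mb_ereg_match is anchored at the start, mb_ereg is probably the closer equivalent of preg_match for searching within a longer text.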

  • ereg functions have been deprecated since PHP 5.3 http://php.net/manual/en/function.ereg.php

  • 1

    Oops, ereg is different from mb_ereg. Note that no E_DEPRECATED is raised. I think you made a mistake.

  • 1

    Note that the mb_ereg function is not marked as deprecated.

  • 1

    I also don’t understand why they deprecated ereg_* and not mb_ereg_* along with it, lol; it is quite contradictory, not to mention that preg_* is, or should be, the standard library for this. Imagine if tomorrow they deprecate all the functions that have i .... standard standard! +1.

  • I added the note, with the SOEN answer, that the Multibyte ereg functions are not deprecated.
