Situation
I was running a regex search for a specific word, inválido, but I decided to use inv.lido as the pattern. I knew the word was in the test string, but nothing was returned.
Testing
vr = var_dump
pr = print_r
$string = 'até, atenção, Hipótese, você, português, café, órgão';
vr(preg_match('~at.~', $string, $match));
pr($match);
vr(preg_match('~aten..o~', $string, $match));
pr($match);
vr(preg_match('~Hip.tese~', $string, $match));
pr($match);
vr(preg_match('~voc.~', $string, $match));
pr($match);
vr(preg_match('~portugu.s~', $string, $match));
pr($match);
vr(preg_match('~caf.~', $string, $match));
pr($match);
vr(preg_match('~.rg.o~', $string, $match));
pr($match);
Output
int(1)
Array([0] => at�)
int(0)
Array()
int(0)
Array()
int(1)
Array([0] => voc�)
int(0)
Array()
int(1)
Array([0] => caf�)
int(0)
Array()
Question
As you can see, it did not capture most of the words, and even for the ones it did capture I don't know what the � is, because even with utf8_decode or utf8_encode I do not get the correct character back.
From what little I know of C and binary, I suppose it has to do with the fact that these characters are made up of two 8-bit blocks. However, they are present in the ASCII table, and as far as I know regex follows the ASCII table.
Why did this happen?
Accented characters do not belong to the ASCII character set. Some extended ASCII sets (there is more than one), which are rarely used nowadays in favor of Unicode, do represent accented characters in a single byte, but as far as I know neither PHP nor any other regex library I am aware of follows them.
– mgibsonbr
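To make the comment above concrete, here is a minimal sketch that prints the bytes behind "é" in two encodings. It assumes the test string is UTF-8; the variable names are illustrative, and the letter is written as explicit byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Assumption: the question's text is UTF-8, where "é" is the two bytes C3 A9.
$eUtf8 = "\xC3\xA9";
// For comparison: "é" in ISO-8859-1, one of the single-byte extended sets.
$eLatin1 = "\xE9";

// Collect the numeric value of every byte involved.
$bytesUtf8 = array_map('ord', str_split($eUtf8));
$bytesLatin1 = array_map('ord', str_split($eLatin1));

echo strlen($eUtf8), ' byte(s) in UTF-8: ', bin2hex($eUtf8), "\n";
echo strlen($eLatin1), ' byte(s) in ISO-8859-1: ', bin2hex($eLatin1), "\n";

// ASCII stops at 127, so none of these bytes is an ASCII character.
echo 'smallest byte value: ', min(array_merge($bytesUtf8, $bytesLatin1)), "\n";
```

Every byte is above 127, which is why the accented letters cannot be in the ASCII table in either encoding.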
In fact, I replaced "." with the respective character in hexadecimal, "\x82" and "\x88", and it still did not match, even though the "." operator means any character except \n, unless the s modifier is set.
– Guilherme Lautert
Yes. But I suspect (I'm not experienced with PHP) that it is only considering the first byte of the UTF-8 representation of each accented letter. At least, this interpretation is consistent with the result you're getting. See my answer below for more details.
– mgibsonbr
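The first-byte interpretation can be checked with a short sketch. It assumes the subject is UTF-8 and uses PCRE's u pattern modifier, which makes PHP treat both pattern and subject as UTF-8. The variable names are illustrative.

```php
<?php
// Assumption: the subject is UTF-8, written as explicit bytes so the
// example does not depend on the encoding of this file.
$string = "caf\xC3\xA9, \xC3\xB3rg\xC3\xA3o";   // "café, órgão"

// Without /u, "." consumes one byte: only the lead byte C3 of "é" is taken,
// which is what shows up as the broken character in the output.
preg_match('~caf.~', $string, $byteMatch);
echo bin2hex($byteMatch[0]), "\n";              // 636166c3

// ".rg.o" fails: "ã" is two bytes, but a single "." covers only one of them.
$byteCount = preg_match('~.rg.o~', $string);
var_dump($byteCount);                           // int(0)

// With /u, "." consumes a whole UTF-8 code point.
preg_match('~caf.~u', $string, $charMatch);
echo bin2hex($charMatch[0]), "\n";              // 636166c3a9
$charCount = preg_match('~.rg.o~u', $string);
var_dump($charCount);                           // int(1)
```

This reproduces the pattern in the question: words ending in an accented letter match but return a truncated character, while words with an accented letter in the middle do not match at all.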