With preg_split you can use as the criterion the regex you already have (spaces, provided they are not preceded by "de", "do", "da", etc.) and add the first word as another possibility:
$string = "Wallace de Souza Vizerra";
var_dump(preg_split('/(^\w+\s)|(?<!de|da|do|dos|das)\s/', $string, -1, PREG_SPLIT_NO_EMPTY));
I use alternation (the character |, which means "or") to indicate that the split should consider two possibilities:
- (^\w+\s): the first word, that is, the beginning of the string (^), followed by one or more letters, digits or _ (\w+), followed by a space (\s), or
- (?<!de|da|do|dos|das)\s: a space, provided it is not preceded by "de", "da", "do", "dos" or "das". The (?<! starts a negative lookbehind, which checks that something does not exist before the current position.
Thus the split is done on spaces or on the first word. The output is:
array(2) {
[0]=>
string(8) "de Souza"
[1]=>
string(7) "Vizerra"
}
I had to use the flag PREG_SPLIT_NO_EMPTY, otherwise the first position of the array would hold an empty string.
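To see what the flag changes, the same split can be run with and without it. This is only a small sketch; the variable names $pattern, $withEmpty and $withoutEmpty are mine, chosen for illustration:

```php
<?php
$string = "Wallace de Souza Vizerra";
$pattern = '/(^\w+\s)|(?<!de|da|do|dos|das)\s/';

// Without the flag: the first word is consumed as a separator,
// so an empty string is left before it at index 0.
$withEmpty = preg_split($pattern, $string);

// With the flag: empty pieces are dropped from the result.
$withoutEmpty = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($withEmpty);    // "", "de Souza", "Vizerra"
var_dump($withoutEmpty); // "de Souza", "Vizerra"
```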
Alternative: match instead of split
But it is also possible to get the array the way you want using preg_match_all.
Deep down, split and match are two sides of the same coin:
- in the split I say what I do not want in the final result (spaces, provided they are not preceded by "de", "da", "do", etc., or the first word)
- in the match I say what I do want (the words except the first, but if a word is preceded by "de", "da", etc., then I take that particle along with the next word).
So the regex would be:
$string = "Wallace de Souza Vizerra";
if (preg_match_all('/(?:^\w+\s\K)?(?:d(?:e|[ao]s?)\s)?\w+/', $string, $matches)) {
    var_dump($matches[0]);
}
All parentheses start with (?: to form non-capturing groups. If you use parentheses without the ?:, they become capture groups and each group is returned separately in the array $matches. Since the groups do not interest me (only the whole match), I use non-capturing groups so that the array does not hold more than I need.
The excerpt d(?:e|[ao]s?)\s is another way of saying that you can have "de", "da", "do", "das" or "dos". I use alternation to say that after the letter d there are two possibilities:
- the letter e
- a character class ([ao]) indicating the letter a or the letter o, followed by an optional s (s?).
I didn't use this in the preg_split example because inside a lookbehind the regex engine does not accept variable-length patterns (which is what the ? produces), so a lookbehind only works with fixed-length alternatives.
All this is followed by a space (\s; note that this shortcut also matches line breaks and other whitespace characters). That is, this section matches "de", "do", "da", "dos" or "das", followed by a space. And the ? right after it makes the whole section optional (since not every word is preceded by one).
Then we have \w+, which matches one or more letters, digits or the character _.
And at the beginning comes the big trick to ignore the first word: the excerpt ^\w+\s matches the start of the string (the anchor ^), followed by \w+ (one or more letters, digits or _) and a space (\s). Then comes the escape \K, which according to the documentation discards whatever has been matched up to that point.
That is, the regex finds the first word followed by a space (^\w+\s), but then reaches the \K, which drops that word from the match, so it is not part of the final result.
So the first word is consumed and then discarded, and the rest of the regex picks up from the second word onwards. The catch is that only the second word has the excerpt ^\w+\s\K before it, so I make that excerpt optional (by putting the ? right after it). Thus, on reaching the second word, the \K discards what came before (in this case, the first word), and from the third word on nothing is discarded.
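The effect of \K is easier to see in isolation. A minimal sketch, using a simplified regex of my own (just the first word followed by one more word) and illustrative variable names $withK and $withoutK:

```php
<?php
$string = "Wallace de Souza Vizerra";

// With \K: the first word must be there, but it is dropped from the match.
preg_match('/^\w+\s\K\w+/', $string, $withK);

// Without \K: the first word stays in the match.
preg_match('/^\w+\s\w+/', $string, $withoutK);

var_dump($withK[0]);    // string(2) "de"
var_dump($withoutK[0]); // string(10) "Wallace de"
```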
For "Wallace de Souza Vizerra", the output of the preg_match_all code is:
array(2) {
[0]=>
string(8) "de Souza"
[1]=>
string(7) "Vizerra"
}
I only take $matches[0] because preg_match_all creates an array of arrays. In this case the regex has no capture groups, so the outer array has a single position, holding another array with the results you need. So just take $matches[0].
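To illustrate how the groups affect $matches, here is a sketch comparing the regex above with a variant of my own in which every (?: is replaced by a plain parenthesis. The names $nonCapturing and $capturing are only for illustration:

```php
<?php
$string = "Wallace de Souza Vizerra";

// Non-capturing groups: only the full matches, at index 0.
preg_match_all('/(?:^\w+\s\K)?(?:d(?:e|[ao]s?)\s)?\w+/', $string, $nonCapturing);

// Same regex with capture groups: one extra sub-array per group.
preg_match_all('/(^\w+\s\K)?(d(e|[ao]s?)\s)?\w+/', $string, $capturing);

var_dump(count($nonCapturing)); // int(1)
var_dump(count($capturing));    // int(4): full match + 3 groups

// The full matches at index 0 are the same in both cases.
var_dump($capturing[0] === $nonCapturing[0]); // bool(true)
```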
Another test:
$string = 'Fulano da Silva dos Santos das Dores Teixeira de Carvalho etc blablabla do fim do mundo';
if (preg_match_all('/(?:^\w+\s\K)?(?:d(?:e|[ao]s?)\s)?\w+/u', $string, $matches)) {
    var_dump($matches[0]);
}
Output:
array(9) {
[0]=>
string(8) "da Silva"
[1]=>
string(10) "dos Santos"
[2]=>
string(9) "das Dores"
[3]=>
string(8) "Teixeira"
[4]=>
string(11) "de Carvalho"
[5]=>
string(3) "etc"
[6]=>
string(9) "blablabla"
[7]=>
string(6) "do fim"
[8]=>
string(8) "do mundo"
}
As already said, \w matches letters, digits and the character _ (that is, 1_23 is also considered a "word"). If you want to be more specific, you can use [a-zA-Z] to match only letters, but this will not match accented letters. An alternative is to use Unicode character properties:
$string = "Fábio de Souza Lázaro do Patrocínio";
if (preg_match_all('/(?:^\p{L}+\s\K)?(?:d(?:e|[ao]s?)\s)?\p{L}+/u', $string, $matches)) {
    var_dump($matches[0]);
}
\p{L} matches any letter defined by Unicode (including letters from other scripts, such as Japanese, Arabic, etc). Do not forget the u flag (after the second slash delimiting the regex), otherwise \p{L} does not work correctly.
The output is:
array(3) {
[0]=>
string(8) "de Souza"
[1]=>
string(7) "Lázaro"
[2]=>
string(14) "do Patrocínio"
}
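The difference between [a-zA-Z] and \p{L} shows up on a single accented word. A short sketch, assuming the source file is saved as UTF-8 (the names $word, $ascii and $unicode are only for illustration):

```php
<?php
$word = "Lázaro"; // assumes this file is encoded as UTF-8

// [a-zA-Z] does not match "á", so the word is broken in two.
preg_match_all('/[a-zA-Z]+/', $word, $ascii);

// \p{L} with the u flag matches the accented letter as well.
preg_match_all('/\p{L}+/u', $word, $unicode);

var_dump($ascii[0]);   // "L", "zaro"
var_dump($unicode[0]); // "Lázaro"
```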
match vs split
In this case, though, the split seems "easier" (or less complicated) to me than the match. Anyway, it is interesting to know that you can get the same results with both approaches. If the criterion for one is too difficult, sometimes the other is easier.
Another detail is that the solutions above only consider the case where there is exactly one space between words. If the string has runs of spaces, for example "Wallace de   Souza Vizerra", simply change the regex to accept one or more spaces (\s+):
$string = "Wallace de   Souza Vizerra";
if (preg_match_all('/(?:^\w+\s+\K)?(?:d(?:e|[ao]s?)\s+)?\w+/', $string, $matches)) {
    var_dump($matches);
}
Output:
array(1) {
[0]=>
array(2) {
[0]=>
string(10) "de   Souza"
[1]=>
string(7) "Vizerra"
}
}
With preg_split, however, it does not work if you use \s+:
$string = "Wallace de   Souza Vizerra";
var_dump(preg_split('/(^\w+\s+)|(?<!de|da|do|dos|das)\s+/', $string, -1, PREG_SPLIT_NO_EMPTY));
Output:
array(3) {
[0]=>
string(3) "de "
[1]=>
string(5) "Souza"
[2]=>
string(7) "Vizerra"
}
I kept trying to find a solution for the split, but so far without success. That is, this is a case in which the match ended up being "easier" than the split.
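One possible workaround, which sidesteps the problem rather than solving it in the regex itself, is to collapse each run of whitespace into a single space before splitting. This is only a sketch under that assumption; $normalized and $parts are illustrative names, and note that the result then holds "de Souza" with a single space:

```php
<?php
$string = "Wallace de   Souza Vizerra"; // three spaces after "de"

// Collapse each run of whitespace into a single space first...
$normalized = preg_replace('/\s+/', ' ', trim($string));

// ...then the single-space split from the beginning works again.
$parts = preg_split(
    '/(^\w+\s)|(?<!de|da|do|dos|das)\s/',
    $normalized,
    -1,
    PREG_SPLIT_NO_EMPTY
);

var_dump($parts); // "de Souza", "Vizerra"
```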
I got the first result! With the regular expression /(?<!de|da|do|dos|das)\W+/
– Wallace Maxters