preg_split to separate words, but ignoring some


Asked 10 years, 1 month ago


2

I need a regular expression that splits a string, more specifically a full name of a person, and transforms it into an array of words.

$string = "Wallace de Souza Vizerra";

$array = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

['Wallace', 'de', 'Souza', 'Vizerra']

However, when de, da, do, das or dos occur, they must not be separated from the following word:

['Wallace', 'de Souza', 'Vizerra']

Could someone who knows regular expressions help me and explain how the regular expression used in the answer works?

If there is a way, I would also like the regular expression to remove the first word.

That is to say:

$string = "Wallace de Souza Vizerra";

Should return:

['de Souza', 'Vizerra']

I got the first result with the regular expression /(?<!de|da|do|dos|das)\W+/

– Wallace Maxters

2015/07/03 at 20:24

2 answers

5

It works, but preg_split itself cannot remove the first item from the resulting array, and no other function can do it inline either, so I used array_shift (unset($array[0]) would also work).

separa_palavras.php

<?php

$nome_completo = 'Wallace de Souza Vizerra dos Santos';

$resultado = preg_split('/(?<!de|da|do|dos|das)[\s]/i', $nome_completo);
$nome_removido = array_shift($resultado);
reset($resultado); // meant to reindex the keys (redundant: array_shift already reindexes)

var_export($resultado);

Output

array (
  0 => 'de Souza',
  1 => 'Vizerra',
  2 => 'dos Santos',
)

1

Buddy, array_shift already reindexes the array. You don't need the reset.

– Wallace Maxters

2015/07/06 at 11:46

That isn't what happened when I tried it, or I didn't notice, hahaha, but thank you.

– Felipe Douradinho

2015/07/06 at 14:02
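
To illustrate the point discussed in these comments, here is a minimal sketch comparing array_shift with unset($array[0]). The variable names are only illustrative, and the input array is written by hand to stand in for the preg_split result.

```php
<?php
// Pieces as preg_split would return them for "Wallace de Souza Vizerra".
$pieces = ['Wallace', 'de Souza', 'Vizerra'];

// array_shift removes the first element and renumbers the keys from 0.
$shifted = $pieces;
$removed = array_shift($shifted);

// unset removes the element but leaves the remaining keys (1 and 2) untouched.
$unsetCopy = $pieces;
unset($unsetCopy[0]);

var_export($shifted);                 // keys 0 and 1
echo PHP_EOL;
var_export($unsetCopy);               // keys 1 and 2
echo PHP_EOL;
var_export(array_values($unsetCopy)); // array_values renumbers the keys
echo PHP_EOL;
```

So with array_shift no extra call is needed, while after unset a call to array_values is the usual way to get consecutive keys again.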


by hkotsubo • **55,826** points · Answer 1 · 2020-03-03T12:58:45+00:00

With preg_split you can use the regex you already have as the criterion (spaces, provided they are not preceded by "de", "do", "da", etc.) and add the first word as another possibility:

$string = "Wallace de Souza Vizerra";
var_dump(preg_split('/(^\w+\s)|(?<!de|da|do|dos|das)\s/', $string, -1, PREG_SPLIT_NO_EMPTY));

I use alternation (the character |, which means "or") to indicate that the split should consider two possibilities:

(^\w+\s): the first word, that is, the beginning of the string (^), followed by one or more letters, numbers or _ (\w+), followed by a space (\s), or
(?<!de|da|do|dos|das)\s: a space, provided that it is not preceded by "de", "da", "do", "dos" or "das". The (?<! indicates a negative lookbehind, which checks that something does not exist before the current position

Thus the split is done in spaces or in the first word. The output is:

array(2) {
  [0]=>
  string(8) "de Souza"
  [1]=>
  string(7) "Vizerra"
}

I had to use the flag PREG_SPLIT_NO_EMPTY, otherwise the first position of the array would have an empty string.
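
A small sketch of what happens without the flag, using the same pattern and string as above (the variable names are only illustrative): the first word is itself a separator at position 0, so the piece before it is an empty string.

```php
<?php
$string  = "Wallace de Souza Vizerra";
$pattern = '/(^\w+\s)|(?<!de|da|do|dos|das)\s/';

// Default flags: the empty piece before the first separator is kept.
$withEmpty = preg_split($pattern, $string);

// PREG_SPLIT_NO_EMPTY drops empty pieces from the result.
$withoutEmpty = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

var_export($withEmpty);
echo PHP_EOL;
var_export($withoutEmpty);
echo PHP_EOL;
```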

Alternative: match instead of split

But it is also possible to get the array the way you want it using preg_match_all.

Deep down, split and match are two sides of the same coin:

in the split I say what I don't want to be part of the final result (spaces, provided they are not preceded by "de", "da", "do", etc., or the first word)
in the match I say what I do want (words except the first, but if a word is preceded by "de", "da", etc., then I take that particle along with the next word).

So the regex would be:

$string = "Wallace de Souza Vizerra";
if (preg_match_all('/(?:^\w+\s\K)?(?:d(?:e|[ao]s?)\s)?\w+/', $string, $matches)) {
    var_dump($matches[0]);
}

All parentheses start with (?: to form non-capturing groups. If you use parentheses without the ?:, they become capture groups and each group is returned separately in the array $matches. Since the groups do not interest me (only the whole match), I use non-capturing groups so that the array doesn't hold more than I need.
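
As a rough illustration of the difference, the sketch below runs a simplified version of the pattern (without the part that skips the first word) twice, once with capture groups and once with non-capturing groups. The shortened input string and the variable names are assumptions made only for this example.

```php
<?php
$string = "de Souza Vizerra";

// Plain parentheses: each group adds one more sub-array to the result.
preg_match_all('/(d(e|[ao]s?)\s)?\w+/', $string, $capturing);

// (?: ... ) groups: only the whole match is returned, in position 0.
preg_match_all('/(?:d(?:e|[ao]s?)\s)?\w+/', $string, $nonCapturing);

echo count($capturing), PHP_EOL;    // whole match plus one entry per group
echo count($nonCapturing), PHP_EOL; // whole match only
var_export($nonCapturing[0]);
echo PHP_EOL;
```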

The excerpt d(?:e|[ao]s?)\s is another way of saying that you can have "de", "da", "do", "das" or "dos". I use alternation to say that after the letter d there are two possibilities:

the letter e
a character class ([ao]) indicating the letter a or the letter o, followed by an optional s (s?).

(Note: I didn't use this in the preg_split example because a lookbehind does not accept variable-length patterns, and the ? makes the length variable, so inside a lookbehind only fixed-length alternatives work.)

All this is followed by a space (\s; note that this shortcut also matches line breaks and other whitespace characters). That is, this section matches "de", "do", "da", "dos" or "das", followed by a space. And the ? right after it makes the whole section optional (as not every word will be preceded by one of them).
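
A quick sketch of both points, using only the particle-plus-word part of the pattern. The input, with a tab and a line break in place of ordinary spaces, is made up for this example.

```php
<?php
// A tab after "de" and a line break after "dos": \s accepts both.
$string = "de\tSouza Vizerra dos\nSantos";

preg_match_all('/(?:d(?:e|[ao]s?)\s)?\w+/', $string, $matches);

// "Vizerra" has no particle before it, so the optional group is simply skipped.
var_export($matches[0]);
echo PHP_EOL;
```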

Then we have \w+, which takes one or more occurrences of letters, numbers or the character _.

And at the beginning we have the big trick to ignore the first word: first the excerpt ^\w+\s matches the start of the string (the anchor ^), followed by \w+ (one or more letters, numbers or _), followed by a space (\s). Then we have the escape sequence \K, which according to the documentation discards everything matched up to that point.

That is, the regex finds the first word followed by a space (^\w+\s), but then finds the \K, which causes this word to be discarded from the match, so it will not be part of the final result.

This causes the first word to be matched and then discarded. The rest of the regex then matches from the second word onwards. The detail is that only the second word will have the excerpt ^\w+\s\K before it, so I make this excerpt optional (putting the ? right after it). Thus, upon finding the second word, the \K discards what came before (in this case, the first word), and from the third word on, it discards nothing.
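
The effect of \K can be seen in isolation with a minimal sketch that matches only the first two words of the same string, with and without \K (the variable names are only illustrative):

```php
<?php
$string = "Wallace de Souza Vizerra";

// Without \K the match contains the first word, the space and the second word.
preg_match('/^\w+\s\w+/', $string, $plain);

// With \K everything consumed before it is dropped from the reported match.
preg_match('/^\w+\s\K\w+/', $string, $withK);

var_export($plain[0]);
echo PHP_EOL;
var_export($withK[0]);
echo PHP_EOL;
```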

The output of the code above is:

array(2) {
  [0]=>
  string(8) "de Souza"
  [1]=>
  string(7) "Vizerra"
}

I only take $matches[0] because preg_match_all creates an array of arrays. Since this regex has no capture groups, the outer array has only one position, which holds the array with the results you need. So just take $matches[0].

Another test:

$string = 'Fulano da Silva dos Santos das Dores Teixeira de Carvalho etc blablabla do fim do mundo';
if (preg_match_all('/(?:^\w+\s\K)?(?:d(?:e|[ao]s?)\s)?\w+/u', $string, $matches)) {
    var_dump($matches[0]);
}

Output:

array(9) {
  [0]=>
  string(8) "da Silva"
  [1]=>
  string(10) "dos Santos"
  [2]=>
  string(9) "das Dores"
  [3]=>
  string(8) "Teixeira"
  [4]=>
  string(11) "de Carvalho"
  [5]=>
  string(3) "etc"
  [6]=>
  string(9) "blablabla"
  [7]=>
  string(6) "do fim"
  [8]=>
  string(8) "do mundo"
}

As already said, \w matches letters, numbers and the character _ (that is to say, 1_23 is also considered a "word"). If you want to be more specific, you can use [a-zA-Z] to match only letters, but this will not match accented letters. So an alternative is to use Unicode character properties:

$string = "Fábio de Souza Lázaro do Patrocínio";
if (preg_match_all('/(?:^\p{L}+\s\K)?(?:d(?:e|[ao]s?)\s)?\p{L}+/u', $string, $matches)) {
    var_dump($matches[0]);
}

\p{L} matches any letter defined by Unicode (including letters from other alphabets, such as Japanese, Arabic, etc.). Do not forget to use the u flag (after the second slash delimiting the regex), otherwise the string is not treated as UTF-8 and \p{L} does not handle accented letters correctly.

The output is:

array(3) {
  [0]=>
  string(8) "de Souza"
  [1]=>
  string(7) "Lázaro"
  [2]=>
  string(14) "do Patrocínio"
}
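
To make the difference between [a-zA-Z] and \p{L} concrete, here is a small sketch on a single accented name. It assumes the source file is saved as UTF-8, and the variable names are only illustrative.

```php
<?php
$name = "Lázaro"; // assumes this file is saved as UTF-8

// [a-zA-Z] does not include "á", so the name is cut in two around it.
preg_match_all('/[a-zA-Z]+/', $name, $ascii);

// \p{L} with the u flag treats "á" as a letter like any other.
preg_match_all('/\p{L}+/u', $name, $unicode);

var_export($ascii[0]);
echo PHP_EOL;
var_export($unicode[0]);
echo PHP_EOL;
```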

match vs split

In this case, the split seems "easier" (or less complicated) to me than the match. Anyway, it's interesting to know that you can get the same results with both approaches. If the criterion for one is too difficult, sometimes it is easier to use the other.

Another detail is that the solutions above only consider the case where there is exactly one space between words. If the string is, for example, "Wallace    de   Souza   Vizerra", simply change the regex to accept one or more spaces (\s+):

$string = "Wallace    de   Souza   Vizerra";
if (preg_match_all('/(?:^\w+\s+\K)?(?:d(?:e|[ao]s?)\s+)?\w+/', $string, $matches)) {
    var_dump($matches);
}

Output:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(10) "de   Souza"
    [1]=>
    string(7) "Vizerra"
  }
}

However, preg_split doesn't work if you simply switch to \s+:

$string = "Wallace    de   Souza   Vizerra";
var_dump(preg_split('/(^\w+\s+)|(?<!de|da|do|dos|das)\s+/', $string, -1, PREG_SPLIT_NO_EMPTY));

Output:

array(3) {
  [0]=>
  string(3) "de "
  [1]=>
  string(5) "Souza"
  [2]=>
  string(7) "Vizerra"
}

I kept trying to find a solution for the split, but so far without success. That is, this is a case in which the match ended up being "easier" than the split.
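
One possibility that is not part of the answer above, offered only as a sketch: the split breaks after "de " because the engine, after being refused at the first space, retries at the second space, which is no longer directly preceded by "de". Adding \s as one more alternative in the lookbehind prevents the separator from starting in the middle of a run of spaces. The variable names are only illustrative.

```php
<?php
$string = "Wallace    de   Souza   Vizerra";

// Same pattern as above, plus \s in the lookbehind: a separator may only
// start at the first space of a run, and not right after a particle.
$pattern = '/(^\w+\s+)|(?<!de|da|do|dos|das|\s)\s+/';

$result = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

var_export($result);
echo PHP_EOL;
```

As with the match version, the spaces between the particle and the following word are kept exactly as they appear in the input.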