Split a string that contains punctuation

4

I’m trying to split the following string:

Eu irei amanhã à casa. E tu vens?

to get the following result in a PHP array:

array(
    [0] => Eu
    [1] => irei
    [2] => amanhã
    [3] => à
    [4] => casa
    [5] => .
    [6] => E
    [7] => tu
    [8] => vens
    [9] => ?
)

I appreciate any help.

3 answers

9


If the only separators were spaces, explode would be enough:

$partes = explode( ' ', $todo );
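For illustration, here is a minimal sketch of what explode alone does with the sentence from the question. The variable names follow the answer; the sample string is copied from the question.

```php
<?php
$todo = 'Eu irei amanhã à casa. E tu vens?';

// Splitting on spaces alone keeps punctuation attached to the preceding word.
$partes = explode(' ', $todo);
print_r($partes);
```

The fifth and last elements come out as "casa." and "vens?", which is why the punctuation needs separate handling.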

A solution, depending on what you want, would be to force a space before the characters you want to treat as isolated:

$todo = str_replace( array( '.', ',', '?' ), array( ' .', ' ,', ' ?' ), $todo );
$partes = explode( ' ', $todo );

See working on IDEONE.

Note that I hard-coded the valid separators directly in the replace; if you want to supply them as a string, it is worth writing a more elaborate function.
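As a rough sketch of such a function, the separators could be passed as a single string and expanded into a replacement map. The name splitKeepingSeparators and its parameters are hypothetical, not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): isolates each character listed in
// $separators by putting a space in front of it, then splits on spaces.
function splitKeepingSeparators(string $text, string $separators): array
{
    // Break the separator list into characters in a multibyte-safe way.
    $chars = preg_split('//u', $separators, -1, PREG_SPLIT_NO_EMPTY);

    // Build the replacement map: '.' => ' .', ',' => ' ,', ...
    $map = [];
    foreach ($chars as $char) {
        $map[$char] = ' ' . $char;
    }

    // strtr with an array does not re-scan text it has already replaced.
    $spaced = strtr($text, $map);

    // Drop the empty pieces produced by consecutive spaces.
    return array_values(array_filter(explode(' ', $spaced), 'strlen'));
}

$partes = splitKeepingSeparators('Eu irei amanhã à casa. E tu vens?', '.,?');
print_r($partes);
```

With the sentence from the question this yields ten elements, with "." and "?" as separate entries.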

If you prefer to treat all non-alphanumeric symbols as separate tokens, you can use a regex and solve it in one line:

preg_match_all('~\w+|[^\s\w]+~u', $todo, $partes ); 

See working on IDEONE.
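A short sketch of how the result can be read back: preg_match_all fills its third argument by reference, and since this pattern has no capture groups the tokens are in index 0. The second input, 'Vens?!', is an assumed example to show that consecutive symbols are kept together.

```php
<?php
$todo = 'Eu irei amanhã à casa. E tu vens?';

// preg_match_all returns the number of matches and fills $partes by reference.
$count = preg_match_all('~\w+|[^\s\w]+~u', $todo, $partes);

// With no capture groups, $partes[0] holds the full matches in order.
$tokens = $partes[0];
print_r($tokens);

// Consecutive symbols stay together as one token, because of [^\s\w]+.
preg_match_all('~\w+|[^\s\w]+~u', 'Vens?!', $other);
print_r($other[0]);
```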

Beyond that, depending on your criteria, you might add spaces both before and after the symbols and then remove the doubled spaces. The intent of this answer was only to give an initial direction.
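One possible way to do that, sketched under the assumption that only ".", "," and "?" count as symbols: pad each symbol on both sides, then split on runs of whitespace so the doubled spaces disappear. The input below is a made-up variant of the question's sentence with the spaces around punctuation removed.

```php
<?php
// Assumed sample input: punctuation glued to the words on both sides.
$todo = 'Eu irei amanhã,à casa.E tu vens?';

// Put a space on both sides of each chosen symbol...
$spaced = preg_replace('~([.,?])~u', ' $1 ', $todo);

// ...then split on any run of whitespace, discarding empty pieces.
$partes = preg_split('~\s+~u', $spaced, -1, PREG_SPLIT_NO_EMPTY);
print_r($partes);
```

Using PREG_SPLIT_NO_EMPTY avoids a separate pass to trim the string or collapse double spaces.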

  • Wonderful! Thank you ;)

  • cool! (+1)... preg_match_all : php is better than I thought!

  • 1

    @Jjoao it is only a wrapper: the implementation is PCRE, and PHP merely exposes the functionality of that third-party library. It would be the same in Harbour or any other language that uses that library.

7

A more general approach would be to use regex to solve the problem.

$string = "Eu irei amanhã à casa. E tu vens?";

/*
    Adds a space at every boundary in the string
    E.g.: start and end of words, punctuation, etc.
    The 'u' modifier makes the string be treated as Unicode
*/
$resultado = preg_replace('/\b/u', ' ', $string);

// Builds an array, using as delimiter a regex that matches any whitespace
$resultado = preg_split('/\s+/', trim($resultado));

var_dump($resultado);

output:

array(10) {
  [0]=>
  string(2) "Eu"
  [1]=>
  string(4) "irei"
  [2]=>
  string(7) "amanhã"
  [3]=>
  string(2) "à"
  [4]=>
  string(4) "casa"
  [5]=>
  string(1) "."
  [6]=>
  string(1) "E"
  [7]=>
  string(2) "tu"
  [8]=>
  string(4) "vens"
  [9]=>
  string(1) "?"
}

I put together this example code to illustrate.
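One thing worth knowing about the boundary approach is that it also splits inside hyphenated words and decimal numbers, since "-" and "." sit between word boundaries. A small sketch, with a hypothetical wrapper name and an assumed sample input:

```php
<?php
// Hypothetical wrapper around the two steps of this answer.
function tokenizeOnBoundaries(string $text): array
{
    // Insert a space at every word boundary (Unicode-aware thanks to 'u').
    $spaced = preg_replace('/\b/u', ' ', $text);

    // Split on runs of whitespace after trimming the outer spaces.
    return preg_split('/\s+/u', trim($spaced));
}

$tokens = tokenizeOnBoundaries('vou-me 12.2');
print_r($tokens);
```

Here "vou-me" becomes three tokens and "12.2" becomes three as well, which may or may not be what you want.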

  • Very practical. Very good! ;)

3

Based 99.9% on @Bacco's answer... shamelessly!

(That opening sentence, in its original Portuguese, is also the input $t tokenized below.)

// $t holds the text to tokenize; the full matches end up in $ps[0]
preg_match_all('~\b\w[\w\-.*#]*\w\b|\w|\.\.\.|[,.:;()[\]?!]|\S~u', $t, $ps);
print_r($ps);

Array
(
    [0] => Array
        (
            [0] => Baseando-me
            [1] => 99.9
            [2] => %
            [3] => na
            [4] => resposta
            [5] => do
            [6] => @
            [7] => bacco
            [8] => ...
            [9] => descaradamente
            [10] => !
        )

)

Update: in practice, tokenizing a text into its elements can get complex, because text is not made of simple words only...

A somewhat more robust approach follows (I used the ~ux modifiers so the pattern can be spread over several lines with comments):

preg_match_all('~
         https?://\S+             ## URL
       | \d+/\d+/\d+              ## date
       | \b\w [\w\-.*#]* \w\b     ## vou-me 12.2 f.html
       | \w                       ## single-character word
       | \.\.\.                   ## ellipsis ...
       | [!?]+                    ## ???   ?!
       | [,.:;()[\]]              ## single punctuation mark
       | \S                       ## any other non-space character
       ~ux', $todo, $partes );
print_r($partes);

Warning: (not tested...)
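A quick way to try the pattern is to run it on a sentence that exercises several branches. The sketch below uses a made-up input and the example.com placeholder domain, so treat it as a sanity check rather than a test suite.

```php
<?php
// Assumed sample input covering a hyphenated word, a date, a URL,
// an ellipsis and a run of ? and !.
$todo = 'Vou-me em 10/12/2015 ver http://example.com/f.html ... ok?!';

preg_match_all('~
         https?://\S+             ## URL
       | \d+/\d+/\d+              ## date
       | \b\w [\w\-.*#]* \w\b     ## compound words and numbers
       | \w                       ## single-character word
       | \.\.\.                   ## ellipsis
       | [!?]+                    ## runs of ! and ?
       | [,.:;()[\]]              ## single punctuation mark
       | \S                       ## any other non-space character
       ~ux', $todo, $partes);

$tokens = $partes[0];
print_r($tokens);
```

Note that the URL branch relies on \S+, so punctuation written immediately after a URL, with no space in between, would be absorbed into the URL token.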

  • I liked the addition. It would be nice to describe the differences in the answer so that people who do not know regex can follow along (for example, mention that you implemented compounds joined by dot and hyphen, detection of ellipses, etc.).

  • @Bacco: I think I’ve caused more confusion...

  • It is much better with the explanation; it makes the changes easier to understand.
