Get most used words from a string

4

I have a long string:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.

How can I get the 4 most repeated words in this string?

4 answers

5


You need 3 steps:

str_word_count(): counts the words in a string. When 1 is passed as the second argument, it returns an array containing all the words instead.

array_count_values(): returns a new array in which the values of the original array become keys, and each key's value is the number of times it occurred.

arsort(): sorts the array so that the highest values come first, keeping each key associated with its value.
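To see what each step contributes, here is a minimal sketch that runs str_word_count(), array_count_values() and arsort() one at a time. The `$sample` text and the variable names are made up for illustration; the intermediate values are shown as comments.

```php
<?php
// Hypothetical sample text, chosen only for illustration
$sample = 'one two two three three three';

// Step 1: split the text into an array of words
$words = str_word_count($sample, 1);
// ['one', 'two', 'two', 'three', 'three', 'three']

// Step 2: map each distinct word to its number of occurrences
$counts = array_count_values($words);
// ['one' => 1, 'two' => 2, 'three' => 3]

// Step 3: sort by count, descending, keeping the word keys
arsort($counts);
// ['three' => 3, 'two' => 2, 'one' => 1]

print_r($counts);
```

The sample has no ties on purpose, so the sorted order does not depend on how equal counts are arranged.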

Example:

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.';
$palavras = array_count_values(str_word_count($string, 1));
arsort($palavras);
var_dump($palavras);

This gives:

array(64) {
  ["quis"]=>
  int(3)
  ["tristique"]=>
  int(2)
  ["varius"]=>
  int(2)
  ["a"]=>
  int(2)
  ["eleifend"]=>
  int(2)
  ["et"]=>
  int(2)
  ["libero"]=>
  int(2)
  ["felis"]=>
  int(2)
  ["eget"]=>
  int(2)
  etc...
  • @user3163662 There are different ways. How do you want to use the result: one by one, or within an array? Does the number of times each word appeared matter, or only that it is among the first X?

  • You’re too quick to answer... or I’m getting old... I ended up using the same thing as you, but I did a beautiful job to deal with it :P

  • @Zuul: Here in Sweden it is cold, we have to move our fingers fast to warm up :)

3

Essentially, we need to break the text into an array of words. Then we count the repetitions, sort the result from the most repeated word to the least repeated, and finally keep only the first X.

For this we will use the PHP function str_word_count() to split the given text into an array of words, array_count_values() to count how many times each word occurs in that array, arsort() to sort the array in descending order without losing the key association, and finally array_slice() to keep only the desired number of words:

/**
 * Most Repeated Words
 * Based on the text received, return the first X
 * most repeated words
 *
 * @param string $texto The text to evaluate
 * @param integer $quantidade The number of words to return
 *
 * @return array Array with the most repeated words
 */
function palavrasMaisRepetidas($texto="", $quantidade=4) {

  $palavras = array_count_values(str_word_count($texto, 1));

  arsort($palavras);

  return array_slice($palavras, 0, $quantidade);
}

Example:

$texto = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.";

var_dump(palavrasMaisRepetidas($texto, 4));

Result:

array(4) {
  ["quis"]=>
  int(3)
  ["tristique"]=>
  int(2)
  ["varius"]=>
  int(2)
  ["a"]=>
  int(2)
}

See the example on Ideone.
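Note that the function above is case-sensitive, so "Lorem" and "lorem" are counted as two different words. A minimal sketch of a case-insensitive variant follows, assuming PHP 7 or later for the type declarations. The name `topWordsIgnoringCase` is hypothetical, and strtolower() is only reliable here for ASCII text.

```php
<?php
/**
 * Hypothetical variant of palavrasMaisRepetidas() that ignores letter case.
 */
function topWordsIgnoringCase(string $text, int $limit = 4): array
{
    // Lower-case first so that "Lorem" and "lorem" share one key
    $counts = array_count_values(str_word_count(strtolower($text), 1));

    // Sort by count, descending, keeping the word keys
    arsort($counts);

    // Keep only the first $limit entries
    return array_slice($counts, 0, $limit, true);
}

// Made-up input: "lorem" appears 3 times in mixed case, "ipsum" twice
$top = topWordsIgnoringCase('Lorem ipsum lorem dolor LOREM ipsum', 2);
print_r($top);
```

With this input the result holds `lorem => 3` followed by `ipsum => 2`.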

2

The answers from Sergio and Zuul probably perform better, but here is a didactic solution that uses strtok to break the text into words and counts them manually. This solution is case-insensitive.

<?php
$texto = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.";
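// Map of upper-cased word => number of occurrences
$frequencias = array();
// Characters that separate words
$separadores = " .,;:!?/\"'()[]{}\n\r\t";
// The first call receives the text; strtok() returns false when no tokens remain
$palavra = strtok($texto, $separadores);
while ($palavra !== false) {
    // Upper-case the key so that "Lorem" and "lorem" count as the same word
    $chave = strtoupper($palavra);
    if (array_key_exists($chave, $frequencias)) {
        $frequencias[$chave]++;
    } else {
        $frequencias[$chave] = 1;
    }
    // Later calls take only the separators and continue on the same text
    $palavra = strtok($separadores);
}
// Sort by frequency, descending, keeping the word keys
arsort($frequencias);
print_r($frequencias);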

http://ideone.com/xlpD4b

Result:

Array
(
    [QUIS] => 3
    [ET] => 2
    [VARIUS] => 2
    [IN] => 2
    [TRISTIQUE] => 2
    [A] => 2
    [LIBERO] => 2
    [LOREM] => 2
    [ELEIFEND] => 2
    [NUNC] => 2
    [FELIS] => 2
    [EGET] => 2
    [DOLOR] => 2
    [SIT] => 2
    [AMET] => 2
    [LEO] => 1
    [TEMPOR] => 1
    ...
)
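Many words tie at 2 occurrences, so which of them land in the "top 4" depends on how arsort() orders equal values (before PHP 8.0 that order was not guaranteed). If a predictable result matters, one option is to break ties alphabetically. The sketch below assumes PHP 7 or later for the `<=>` operator and uses a small made-up `$counts` array in place of `$frequencias`.

```php
<?php
// Hypothetical counts, standing in for $frequencias above
$counts = ['ET' => 2, 'QUIS' => 3, 'A' => 2, 'LEO' => 1, 'IN' => 2];

// Sort by count descending; break ties alphabetically by word
uksort($counts, function ($a, $b) use ($counts) {
    if ($counts[$a] !== $counts[$b]) {
        return $counts[$b] <=> $counts[$a];
    }
    return strcmp($a, $b);
});

// Keep only the first 4 entries, preserving the word keys
$top4 = array_slice($counts, 0, 4, true);
print_r($top4);
```

Here `$top4` contains QUIS (3), then A, ET and IN (2 each), in that order.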

1
