How to return the most common words from a text with PHP?

Question

Asked 11 years, 1 month ago

I would like to know how best to return the most frequent occurrences of substrings in a string containing text. Example:

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

And the output:

array(
    "PHP" => 2,
    "de" => 2,
    //...
);

The idea is to get back an array with the most frequently used words in a given string.

I am currently using the substr_count() function, but it only works if you pass in the word to be counted, i.e. I would need to know the words of the text in advance and check them one by one.

Is there any other way to do this?

What is the best solution? The question remains...

– Jorge B.

2014/06/30 at 09:00

Yes, I will try to set up a performance test as soon as I have time to evaluate the best result, but all solutions are very interesting.

– Kazzkiq

2014/06/30 at 12:53

In PHP every string is considered an array.

– Ivan Ferrer

2015/08/17 at 15:30

3 answers

Try it like this:

print_r(array_count_values(str_word_count($texto, 1, "óé")));

Result:

Array (
   [Hoje] => 1
   [nós] => 1
   [vamos] => 1
   [falar] => 1
   [de] => 2
   [PHP] => 2
   [é] => 1
   [uma] => 1
   [linguagem] => 1
   [criada] => 1
   [no] => 1
   [ano] => 1
)

To understand how array_count_values() works, see the PHP manual.
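
The counts above come back in order of first appearance, not by frequency. A minimal sketch of how the most common words could be pulled out is shown here. The topWords() helper and its $limit parameter are hypothetical names introduced only for illustration, and the "óé" list still has to be extended for other accented characters.

```php
<?php
// Hypothetical helper (not part of the answer): return the $limit most frequent words.
function topWords(string $texto, int $limit = 5): array
{
    // "óé" is the same extra character list used in the answer.
    $contagem = array_count_values(str_word_count($texto, 1, "óé"));

    // Sort by count, highest first, keeping the word => count association.
    arsort($contagem);

    // Keep only the first $limit entries.
    return array_slice($contagem, 0, $limit, true);
}

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";
print_r(topWords($texto, 2));
```

For the sample text this keeps "de" and "PHP", each with a count of 2. The relative order of words with equal counts may differ between PHP versions.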

Edit

A smarter solution (language independent)

With the previous solution it is necessary to list every UTF-8 special character that may appear in the text (as was done with ó and é).

The following solution is a little more involved, but it removes the need to list the special characters.

$text = str_replace(".","", "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...");
$namePattern = '/[\s,:?!]+/u';
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
$wordsArray2 = array_count_values($wordsArray);
print_r($wordsArray2);

In this solution I use a regular expression to split the text into words and then array_count_values() to count them. The result is:

Array 
( 
  [Hoje] => 1 
  [nós] => 1 
  [vamos] => 1 
  [falar] => 1 
  [de] => 2 
  [PHP] => 2 
  [é] => 1 
  [uma] => 1 
  [linguagem] => 1 
  [criada] => 1 
  [no] => 1 
  [ano] => 1 
)

This solution also does the job. However, the periods must be removed before the text is split into words; otherwise the result will contain both words with a trailing . and words without it. For example:

  ...
  [PHP.] => 1 
  [PHP] => 1 
  ...

Counting words is never as simple as it looks. You need to know the string whose words you want to count well before settling on a definitive solution.
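
As a possible variation on the regular expression approach (not the method used in this answer), the words can be matched directly instead of splitting on separators. The sketch below assumes the text is valid UTF-8 and uses the Unicode property \p{L} ("any letter"), so neither the accented characters nor the punctuation have to be listed.

```php
<?php
// Sketch only: assumes $text is valid UTF-8.
$text = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// Collect every run of Unicode letters; punctuation, digits and
// whitespace all act as separators.
preg_match_all('/\p{L}+/u', $text, $matches);

// Count the occurrences and sort by count, highest first.
$wordCounts = array_count_values($matches[0]);
arsort($wordCounts);

print_r($wordCounts);
```

The trade-off is that numbers such as 1987 are dropped and hyphenated words such as bem-vindo are split in two. If those should count as words, the pattern could be widened, for example to /[\p{L}\p{N}'-]+/u.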

What is the "óé" string at the end for?

– Kazzkiq

2014/06/29 at 19:59

@Kazzkiq It is there so that "nós" is not split into two words and "é" is not discarded. With this solution you would have to list every non-ASCII character so that accented words are not broken into pieces. It works well when you want to ignore periods, commas, numbers, and everything else that is not a word, but you have to assemble the full set of accented characters. @Sergio's solution is the better fit if you want to count arbitrary strings, whether or not they are words or numbers (1987, bem-vindo, etc.).

– Bacco

2014/06/29 at 20:13

@Kazzkiq The new solution still has the problem of cleaning the text first. However, I consider this approach a little smarter, and it makes iterating over all the words simple.

– anmaia

2014/06/29 at 20:34

by Sergio • **133,294** points · Answer 1 · 2014-06-29T19:31:43+00:00

My "handmade" approach would be:

$texto = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";

$palavras = explode(' ', $texto);
echo count($palavras); // 91
$ocorrencias = array();

foreach ($palavras as $palavra) {
    // Start at 0 for unseen words to avoid an "undefined array key" warning.
    $ocorrencias[$palavra] = ($ocorrencias[$palavra] ?? 0) + 1;
}

arsort($ocorrencias);
var_dump($ocorrencias);

Result:

array(69) { 
    ["the"]=> int(6) 
    ["Lorem"]=> int(4) 
    ["of"]=> int(4) 
    ["Ipsum"]=> int(3) 
    ["and"]=> int(3) 
    ["a"]=> int(2) 
    // etc

The advantage of this alternative is that I only need to split on spaces.

You can also add a line like this before the explode():

$texto = preg_replace('/[,.?!;]+/', '', $texto);

to strip commas, periods, and so on, depending on what you are looking for.

by Olimon F. • **1,173** points · Answer 2 · 2014-06-30T02:06:43+00:00

My solution

This solution is a little more robust: it separates each word and cleans it thoroughly. Once a word has been processed and accepted, it is added to a new array, which is then sorted by number of occurrences.

<?php
$texto = "Hoje nós vamos falar de PHP! mas o que é PHP?? 
PHP é uma linguagem criada no ano de ...";

/* Split the text on spaces (raw, unfiltered) */
$palavras_raw = explode(" ", $texto);

// Characters to be removed
$ignorar = 
[".", ",", "!", ";", ":", "(", ")", "{", "}", "[", "]", "<", ">",
"?", "|", "\\", "/"];

// Array for the cleaned words.
$palavrasTratadas = array();

/* Build a new array of words, now cleaned */
$palavras_raw_count = count($palavras_raw);
for ($i=0;$i<$palavras_raw_count;++$i) {
    $palavraAtual = $palavras_raw[$i];
    $palavraAtual = trim($palavraAtual);
    if (!empty($palavraAtual)) {
        $palavraTratada = str_replace($ignorar, "", $palavraAtual);
        $palavraTratada = strtolower($palavraTratada);
        if (!empty($palavraTratada)) {
            // Start at 0 for unseen words to avoid an "undefined array key" warning.
            $palavrasTratadas[$palavraTratada] = ($palavrasTratadas[$palavraTratada] ?? 0) + 1;
        }
    }
}

// Sort by number of occurrences, highest first.
arsort($palavrasTratadas);

// DEBUG
print_r($palavrasTratadas);

It splits the text on spaces and removes the special characters listed in the $ignorar array. It then normalizes each word to prevent errors and unexpected results, and stores it in the $palavrasTratadas array. Note that it does NOT distinguish uppercase from lowercase, because someone may start a sentence with a capitalized Hoje and then use hoje in the rest of the text. However, PHP's strtolower() function is designed for English (it handles only ASCII letters reliably), so it does not convert Á to á, for example.
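
One way around the strtolower() limitation, assuming the mbstring extension is loaded and the text is UTF-8, is mb_strtolower(). The sketch below is a compact variation on the code above, not the author's version: contarPalavras() is a hypothetical name, and it splits on any run of whitespace instead of a single space.

```php
<?php
// Sketch assuming the mbstring extension is loaded and $texto is UTF-8.
// contarPalavras() is a hypothetical helper name.
function contarPalavras(string $texto): array
{
    // Same list of characters to strip as in the answer above.
    $ignorar = [".", ",", "!", ";", ":", "(", ")", "{", "}", "[", "]",
                "<", ">", "?", "|", "\\", "/"];
    $contagem = [];

    // Split on runs of whitespace so newlines and double spaces are handled.
    foreach (preg_split('/\s+/u', $texto, -1, PREG_SPLIT_NO_EMPTY) as $palavra) {
        // Strip punctuation, then lowercase with multibyte support (Á becomes á).
        $palavra = mb_strtolower(str_replace($ignorar, "", $palavra), 'UTF-8');
        if ($palavra !== "") {
            $contagem[$palavra] = ($contagem[$palavra] ?? 0) + 1;
        }
    }

    // Sort by number of occurrences, highest first.
    arsort($contagem);
    return $contagem;
}

print_r(contarPalavras("Água é vida. ÁGUA, água!"));
```

Here "Água", "ÁGUA" and "água" are all counted under the key "água". The comparison with "" is used instead of empty() so that a word consisting of the single character 0 is not discarded.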