How to hide short words with PHP

Viewed 109 times

-1

I am developing a tool that searches for keywords in a text and compares them with those registered by the client. If there is a match, it sends an e-mail to the client.

My question is this: I want to discard short words such as "of", "with", etc.

Example: for "Locação de veículos Tipo Ônibus" (roughly "Rental of vehicles, type bus"), the intended result is "Locação, veículos, Tipo, Ônibus".

2 answers

1

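<?php
$titulo   = 'Locacao de onibus';
$palavras = explode(' ', $titulo);   // split the title into words

// Words to hide (the comparison is case-sensitive)
$ocultar = ['de', 'para', 'com'];

// Keep only the words that are not in $ocultar, then rebuild the string
$resultado = array_diff($palavras, $ocultar);
$resultado = implode(' ', $resultado);

echo $resultado; // Locacao onibus
?>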
  • It worked perfectly, thank you! :)

  • What if the string contains another word to be removed that is not in the $ocultar array? What if the word is capitalized? Wouldn't it be better to create a filter that uses the word length as a criterion, rather than comparing the whole word?

  • Take $titulo and convert it to lowercase first with mb_strtolower().

  • What if the word contains an accent? What if the word is followed by a comma? What if it isn't in the array? Do you suggest including every Portuguese word in $ocultar? Again, I think this solution could be rethought.

  • @user140828, no solution will be perfect. Solving all these issues would take something extremely elaborate, such as NLTK (for Python) or Prose (for Go), which are tools for processing human language. I don't believe any answer here will attempt to "rewrite" these libraries, and recommending such tools for such a limited scope does not seem reasonable; nor do I know how efficient they are for this purpose. Anyway, you can add a new answer too.
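
The concerns raised in these comments (capitalization, accents, a trailing comma) can be partly addressed without a full NLP library. The following is only a sketch, not the answerer's code: the function name filtrarPalavras is hypothetical, it assumes UTF-8 input, PHP 7.4+ (for arrow functions) and the mbstring extension, and it still depends on a hand-maintained word list.

```php
<?php
// Hypothetical helper (not from the thread): removes listed words regardless
// of letter case or surrounding punctuation.
function filtrarPalavras(string $titulo, array $ocultar): array
{
    // Split on any run of characters that are neither letters nor digits,
    // so "veículos," yields "veículos". Falls back to [] if the split fails.
    $palavras = preg_split('/[^\p{L}\p{N}]+/u', $titulo, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Lowercase the list once so the comparison below ignores case.
    $ocultar = array_map('mb_strtolower', $ocultar);

    $resultado = array_filter(
        $palavras,
        fn (string $p): bool => !in_array(mb_strtolower($p), $ocultar, true)
    );

    return array_values($resultado); // reindex from 0
}

echo implode(' ', filtrarPalavras('Locação DE veículos, para Ônibus', ['de', 'para', 'com']));
// prints: Locação veículos Ônibus
```

Splitting on non-letters also breaks hyphenated words apart, which may or may not be what you want.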

0


Since the question says

such as "of", "with", etc.

Apparently you want to remove prepositions, not just "short words", so you could create an array with the most common prepositions (...):

$lista = ["a", "o", "as", "os", "com", "em", "por", "per", "ante", "contra", "entre", "sem", "após", "de", "para", "sob", "até", "desde", "perante", "sobre"];

If there are many texts, you might consider checking which words are the most frequent, since the most frequent words are precisely ones like "of", "to", "in"...
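
As a rough illustration of that frequency check, the sketch below counts words across a few sample texts. The function name contarPalavras and the sample strings are made up for this example; it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: counts how often each word occurs across several texts.
function contarPalavras(array $textos): array
{
    $contagem = [];
    foreach ($textos as $texto) {
        // Lowercase first, then split on anything that is not a letter or digit.
        $palavras = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($texto), -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach ($palavras as $p) {
            $contagem[$p] = ($contagem[$p] ?? 0) + 1;
        }
    }
    arsort($contagem); // highest count first, keys (the words) preserved
    return $contagem;
}

// Sample data, invented for illustration.
$textos = ['Locação de veículos', 'Aluguel de carros com motorista', 'Venda de peças para ônibus'];
print_r(array_slice(contarPalavras($textos), 0, 3, true));
```

With these samples, "de" comes out on top with a count of 3. On a real corpus the top of the list would be a starting point for the word list, to be reviewed by hand.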


So you could just use a loop to remove them:

$texto = "Locação de veículos Tipo Ônibus";
$lista = ["a", "o", "as", "os", "com", "em", "por", "per", "ante", "contra", "entre", "sem", "após", "de", "para", "sob", "até", "desde", "perante", "sobre"];

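// Split on spaces, then drop every word found in the list.
$palavras = explode(" ", $texto);
foreach ($palavras as $i => $s) {
    // mb_strtolower() also lowercases accented letters such as "Ó".
    if (in_array(mb_strtolower($s), $lista, true)) {
        unset($palavras[$i]);
    }
}

// array_filter() removes empty entries left by consecutive spaces.
echo implode(', ', array_filter($palavras));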

Result:

Locação, veículos, Tipo, Ônibus

You could probably also use some kind of preg_replace or similar, but the result would be the same.
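
For reference, a preg_replace variant might look like the sketch below. It is not part of the original answer: the function name removerPreposicoes is hypothetical, the word list is shortened for illustration, and it assumes UTF-8 input and PHP 7.4+.

```php
<?php
// Hypothetical helper: removes the listed words with a single regular expression.
function removerPreposicoes(string $texto, array $lista): string
{
    // Escape each word and join them into one alternation, e.g. de|para|com
    $alternativas = implode('|', array_map(fn ($p) => preg_quote($p, '/'), $lista));

    // (?<!\S) and (?!\S) require whitespace or a string boundary on both sides,
    // so "de" is removed but the "de" inside "desde" is left alone.
    // Flags: i = ignore case, u = treat pattern and subject as UTF-8.
    $padrao = '/(?<!\S)(?:' . $alternativas . ')(?!\S)\s*/iu';

    // preg_replace() returns null on error; fall back to the original text.
    return trim(preg_replace($padrao, '', $texto) ?? $texto);
}

$limpo = removerPreposicoes("Locação de veículos Tipo Ônibus", ["de", "para", "com", "em", "por"]);
echo implode(', ', preg_split('/\s+/', $limpo));
// prints: Locação, veículos, Tipo, Ônibus
```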


If you want to "hide short words", based precisely on their length, you could use a simple:

echo preg_replace("/(^|\s)[a-z]{1,3}\s/", " ", "Locação de veículos Tipo Ônibus");

This assumes that "short" means up to 3 characters (the {1,3} quantifier). It may work, but there are side effects: many meaningful words have 3 characters or fewer (in Portuguese, for example, "ego", "réu", "ter", "vir", "vim", "dor", "luz", "voo", "voz"...), and some prepositions have more than 3 characters (such as "para").
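
Note that [a-z] matches only unaccented lowercase ASCII letters, so a short word such as "até" or "De" would not be caught by that pattern. A length-based filter that counts characters rather than bytes could be sketched as follows; the function name removerCurtas and its $limite parameter are hypothetical, and the code assumes UTF-8 input, PHP 7.4+ and the mbstring extension.

```php
<?php
// Hypothetical helper: keeps only the words longer than $limite characters.
function removerCurtas(string $texto, int $limite = 3): string
{
    // Split on whitespace; punctuation stays attached to its word.
    $palavras = preg_split('/\s+/u', trim($texto), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // mb_strlen() counts characters, so "até" counts as 3 even though it is 4 bytes.
    $longas = array_filter($palavras, fn (string $p): bool => mb_strlen($p) > $limite);

    return implode(' ', $longas);
}

echo removerCurtas("Locação de veículos Tipo Ônibus");
// prints: Locação veículos Tipo Ônibus
```

The same side effects apply: removerCurtas("dor e luz") returns an empty string, because every word in it has 3 characters or fewer.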

  • Initially I had thought of hiding all the short words, but as you said yourself, there would be side effects, because important short words and acronyms would be lost. My problem was only the prepositions. It helped me a lot, thank you! :D
