Higher frequency of a string

Asked

Viewed 1,417 times

2

I have a text where I am treating several things. Now I need to get the 3 words that are most repeated in the entire text. How can I do this? What is the best solution?

I thought about storing in a list but I don’t know how to put the word and an accountant. I also don’t know if it would be a "smart" way to do?

2 answers

4


Assuming you’ve already done the job of separating text into words, the data structure you’re looking for is the java.util.Map (also known as "dictionary"). Its function is to precisely map an object (the "key") to another object (the "value") - or in your case, the word to the number of times it appears.

Map<String,Integer> quantas = new HashMap<String,Integer>();

quantas.put("foo", 1); // Diz que a palavra "foo" apareceu 1 vez

int x = quantas.get("foo"); // Pega o número de vezes que ela apareceu
quantas.put("foo", x+1);    // Atualiza o valor (sobrescreve)

// Enumera todas as palavras do dicionário
for ( Map.Entry<String,Integer> par : quantas.entrySet() )
    System.out.println(par.getKey() + " apareceu " + par.getValue() + " vezes.");

If you are using Java 8, you can do this even more simply by using Amble:

for ( String palavra : listaPalavras )
    quantas.compute(palavra, (k, v) -> v == null ? 1 : v+1);

(i.e. if the word is not yet in the dictionary - v == null - check that she appeared 1 time; otherwise increment 1 in the number of times she appeared)

As for a "clever" way of finding the N words that most appeared, I suggest a priority queue (PriorityQueue) where:

  • The comparator used in creating order pairs (palavra, nº ocorrências) by increasing order of the number of occurrences;
  • Once created and populated the Map quantas, you go through it by adding elements to this priority queue, keeping its size limited to the number you want (i.e. remove the smaller ones when the queue grows beyond the point you want - so the performance of the algorithm will be good, as it will avoid comparing elements that do not matter).

I won’t post a complete example because <Rant> even the simplest tasks in Java take 10x more code than in a decent language</Rant>. But if you find it difficult to use the above method you may ask that I explain it better. Or, if you’re not worried about performance and just want a simple, straightforward method, put these pairs on a list and sort them.

Updating: it wasn’t so bad, but boy, what’s missing that an inference of guys doesn’t make...

int max = 3;

PriorityQueue<Map.Entry<String,Integer>> fila = 
    new PriorityQueue<Map.Entry<String,Integer>>(max+1, new Comparator<Map.Entry<String,Integer>>() {
        public int compare(Map.Entry<String,Integer> a, Map.Entry<String,Integer> b) {
            return a.getValue() < b.getValue() ? -1 :
                   a.getValue() > b.getValue() ? 1 :
                   a.getKey().compareTo(b.getKey()); // Desempate
        }
    });

for ( Map.Entry<String,Integer> entry : quantas.entrySet() ) {
    fila.add(entry);
    if ( fila.size() > max )
        fila.poll(); // Remove o menor
}
  • I don’t quite understand how it does the query of how many times the word has appeared: int x = how many.get("foo"); //inside the quotes can I pass any word that it will return the value? how many.put("foo", x+1); // if 0, it will do 0+1?

  • 1

    @That’s right. And do not need to put a literal, can be a variable too: String palavra = "bar"; int x = quantas.get(palavra);. One only has to be careful if the word does not yet exist in the dictionary - the result of the get will be null (and will provoke a NullPointerException by doing the cast for int). So it might be good to test before if the word is already there (if ( quantas.containsKey(palavra) ) { ... }) and, if not, put it in. The second part is also right - whatever the value of x, you can do an operation with it and save the result back on Map.

  • Are you sure how many.containsKey(word) works? Because it always returns null to me.

  • 1

    @Peace Yes, it works... See an example in ideone

  • I managed to make Map<> work the way you said... but I didn’t understand how to join what I did with the priority queue. You can explain to me better? :)

  • 1

    If you filled out the Map, then you have a collection of Map.Entry (i.e. pairs of String and Integer). To get the 3 most appearing words, simply sort a list or array of Map.Entry increasingly by Integer, and take the last three elements, but that’s kind of a waste, because ordering is n*log(n) and you will throw away most of the elements. The idea of the priority queue is to precisely eliminate this waste. You add 3 elements into it, then every new element you add you remove the smallest. Example in the ideone

  • I understood.. Now another question, how can I sort the values. In the Ideone example it returns me: [b=20, c=30, e=25].. How can I make him return me: [c=30, ,e=25 b=20] ?

  • 1

    @Peace Put in a list and order. Or - if you want to reuse the already written comparator - order in ascending order even and then invert. I updated the example. If you have any further questions, I suggest you open up a new question before it becomes a "chameleon question"... :P

  • Just one last question: pq the number 4 here: Priorityqueue<Map.Entry<String,Integer>> queue = new Priorityqueue<Map.Entry<String,Integer>>(4, comp);

  • 1

    @Peace This number is not required, it says what is the initial capacity of the queue. As we know that the most elements it will contain will be 4 (the 3 you want and 1 more you are testing against all others) so putting this value ensures you will have enough "space" for them [without the data structure having to resize to accommodate others] and doesn’t waste memory, by just allocating the space that will really be needed. It’s a micro-optimization, and normally it wouldn’t even be necessary, but as the builder who gets a Comparator request this parameter, I gave the most indicated

Show 5 more comments

2

  • Create a HashMap of String and integers scan the text by adding words to the map if they are not in the same.

    The advantage of the map is precisely that it is a structure that operates with keys (in this case the words) and values (integers).
    This way just increment the value, if it is already on the map.

  • Check between all map entries which have the highest frequency.

Sample code:

import java.util.HashMap;
import java.util.Map;

public class FreqPalavra {
    private static final String LOREM_IPSUM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. Mauris vitae nisi at sem facilisis semper ac in est."
            + "Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper ultricies. Curabitur ornare, ligula semper consectetur sagittis, nisi diam iaculis velit, id fringilla sem nunc vel mi. Nam dictum, odio nec pretium volutpat, arcu ante placerat erat, non tristique elit urna et turpis. Quisque mi metus, ornare sit amet fermentum et, tincidunt et orci. Fusce eget orci a orci congue vestibulum. Ut dolor diam, elementum et vestibulum eu, porttitor vel elit. Curabitur venenatis pulvinar tellus gravida ornare. Sed et erat faucibus nunc euismod ultricies ut id justo. Nullam cursus suscipit nisi, et ultrices justo sodales nec. Fusce venenatis facilisis lectus ac semper. Aliquam at massa ipsum. Quisque bibendum purus convallis nulla ultrices ultricies. Nullam aliquam, mi eu aliquam tincidunt, purus velit laoreet tortor, viverra pretium nisi quam vitae mi. Fusce vel volutpat elit. Nam sagittis nisi dui."
            + "Suspendisse lectus leo, consectetur in tempor sit amet, placerat quis neque. Etiam luctus porttitor lorem, sed suscipit est rutrum non. Curabitur lobortis nisl a enim congue semper. Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt. Phasellus eget dolor sit amet ipsum dapibus condimentum vitae quis lectus. Aliquam ut massa in turpis dapibus convallis. Praesent elit lacus, vestibulum at malesuada et, ornare et est. Ut augue nunc, sodales ut euismod non, adipiscing vitae orci. Mauris ut placerat justo. Mauris in ultricies enim. Quisque nec est eleifend nulla ultrices egestas quis ut quam. Donec sollicitudin lectus a mauris pulvinar id aliquam urna cursus. Cras quis ligula sem, vel elementum mi. Phasellus non ullamcorper urna."
                    .replaceAll("[.,]", "");

    public static void main(String[] args) {
        Map<String, Integer> mapaFreq = new HashMap<>();
        // Cria o mapa de Frequências
        for (String palavra : LOREM_IPSUM.split("\\s+")) {
            if (!mapaFreq.containsKey(palavra)) {
                mapaFreq.put(palavra, 1);
            } else {
                mapaFreq.put(palavra, 1 + mapaFreq.get(palavra));
            }
        }
        // Arrays para armazenar os 3 valores mais frequentes.
        String[] palavrasMaisFrequentes = new String[3];
        int[] freqPalavras = new int[3];
        //Percorre todos os valores do mapa
        for (Map.Entry<String, Integer> entrada : mapaFreq.entrySet()) {
            //Se achar algo mais frequente que a primeira posição
            if (entrada.getValue() >= freqPalavras[0]) {
                freqPalavras[0] = entrada.getValue();
                palavrasMaisFrequentes[0] = entrada.getKey();

            } else {
                if (entrada.getValue() >= freqPalavras[1]) {
                    freqPalavras[1] = entrada.getValue();
                    palavrasMaisFrequentes[1] = entrada.getKey();
                } else if (entrada.getValue() >= freqPalavras[2]) {
                    freqPalavras[2] = entrada.getValue();
                    palavrasMaisFrequentes[2] = entrada.getKey();
                }
            }
//          System.out.println(entrada.getKey() + "/" + entrada.getValue()); imprime todo o mapa
        }
        for (int i = 0; i < freqPalavras.length; i++) {
            System.out.println(i + 1 + " palavra: " + palavrasMaisFrequentes[i]
                    + " \nFrequência: " + freqPalavras[i]
                    + "\n------------------------\n");
        }

    }
}

You can see this example running in the ideone.

Observing

To simplify the above code does not pass the value to the other positions of the array (which is a logic error, because if a term is more frequent it should update the cascade array)

  • Example:
freqPalavras[]     |   palavrasMaisFrequentes[]
      13           |          "estouro"
      11           |          "da"
      09           |          "pilha"

If a word "Batman" often 14 appears the new order should be:

freqPalavras[]     |   palavrasMaisFrequentes[]
      14           |          "batman"
      13           |          "estouro"
      11           |          "da"

But in the above program it would be:

freqPalavras[]     |   palavrasMaisFrequentes[]
      14           |          "batman"
      11           |          "da"
      09           |          "pilha"
  • 2

    "To simplify the above code..." it is precisely because of this complexity that a priority list would break a branch... if only it were easier to use it. But from what I see, Java does not have a generic data structure to represent [orderly] pairs. Anyway, +1 for the full example, despite this limitation.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.