Recognize word repeats in String

Viewed 5,853 times

19

I have a text inside a StringBuffer and I need to check and mark the words that appear more than once. At first I used a circular queue of 10 positions, because I am only interested in words repeated within a "radius" of 10 words.
It is worth noting that repeated words should be marked only if they are within a radius of 10 words of each other. If they are more than 10 words apart, they should not be marked.
The method Contem returns "null" if there is no repetition, or the word that is repeated. string is just the variable containing the full text.

StringBuffer stringProximas = new StringBuffer();
String has = "";
Pattern pR = Pattern.compile("[a-zA-Zà-úÀ-Ú]+");
Matcher mR = pR.matcher(string);
while(mR.find()){
  word = mR.group();
  nextWord.Inserir(word); // insert into the list
  has = nextWord.Contem(); // checks whether the list contains equal words
  // an if to check whether has is null or not
  // and here the repeated word is marked, if has is not null
  mR.appendReplacement(stringProximas, "");
  stringProximas.append(has);
}
public void Inserir(String palavra){
    if(this.list[9].equals("null")){
        if(this.list[0].equals("null")){
            this.list[this.fim]=palavra;
        }else{
            this.fim++;
            this.list[this.fim] = palavra;
        }
    }else{
        // wraps the fim pointer around to position 0
        if(this.inicio == 0 && this.fim == 9){
            this.inicio++;
            this.fim = 0;
            this.list[this.fim] = palavra;
        }else if(this.inicio == 9 && this.fim == 8){ // wraps the inicio pointer around to position 0
            this.inicio = 0;
            this.fim++;
            this.list[this.fim] = palavra;
        }else{
            this.inicio++;
            this.fim++;
            this.list[this.fim] = palavra;                    
        }
    }
}
public String Contem() throws Exception{
    for(int i=0;i<this.list.length;i++){
        for(int j=i+1;j<this.list.length;j++){
            if(this.list[i].equals(this.list[j]) && (!this.list[i].equals("null") || !this.list[j].equals("null"))){
                // do not pick up the same repetition more than once
                if(!this.list[i].equals("?")){
                    this.list[i] = "?"; // this will probably be removed
                    return this.list[j];
                }
            }
        }
    }
    return "null";
}

My big problem: when I find a repeated word, I can only mark the second occurrence. Even though the first occurrence is still in the queue, the variable word holds the second one, and because of the while loop I cannot go back and mark the first.

I’m using this text as an example:
Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.
The method should return, for example (marked with brackets here, but that is not necessarily how the words must be marked):
Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado.

  • 1

    Can you provide the code of the methods Inserir and Contem and of the variables string, stringProximas and has, so we can try to replicate how your code works and help you?

  • I edited it, check it out @Victor

  • 1

    How do you initialize the variable list? I am not fond of that "null" in quotes, nor of that "?".

  • @Victor this is only a test case. It does not matter much and will not be the final version. The list is declared in the class like this: private String list[];, remembering that it has size 10.

  • take a look at the issue I posted, I made some edits and now I think it’s the way you need it!

8 answers

16

Solution:

Using regular expressions, this can be solved with small, very expressive code and few ifs; in fact, only 1 if and only 1 loop:

public String assinalaRepetidas(String texto, String marcadorInicio, 
                                            String marcadorFim, int qtdPalavrasAnalisar) {

    String palavraInteiraPattern = "\\p{L}+"; 
    Pattern p = Pattern.compile(palavraInteiraPattern);
    Matcher matcher = p.matcher(texto);

    ArrayList<String> palavras = new ArrayList<String>();
    ArrayList<String> palavrasRepetidas = new ArrayList<String>();
    
    while (matcher.find() && palavras.size() < qtdPalavrasAnalisar) {
        
        String palavra = matcher.group();

        if (palavras.contains(palavra) && !palavrasRepetidas.contains(palavra)) {
            texto = texto.replaceAll(
                    String.format("\\b%s\\b", palavra), 
                    String.format("%s%s%s", marcadorInicio, palavra, marcadorFim));

            palavrasRepetidas.add(palavra);
        }
        palavras.add(palavra);
    }
    return texto;
}

And that’s all!

Below, some explanation and also the consumer code.

Explaining the solution:

I used a regular expression to get every word in the text, ignoring spaces, parentheses, symbols, commas and other punctuation that are not real words. The regular expression that does this in Java for accented text (any Unicode letter) is \p{L}+.
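
To make the difference concrete, here is a minimal sketch comparing \p{L}+ with the plain \w+ on accented text. The names WordRegexDemo and findAll are made up for this illustration and are not part of the answer's code; the accented letters are written as \u escapes only so that the snippet does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to compare the two regexes.
class WordRegexDemo {

    // Returns every match of the given regex in the text, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

For the text "é necessário", findAll("\\p{L}+", text) yields the two words intact, while findAll("\\w+", text) yields the fragments "necess" and "rio", because by default \w in Java covers only ASCII letters, digits and the underscore.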

In the same loop in which I get the words found by the regular expression, I replace each repeated word with itself wrapped in the markers.

The consumer code (a unit test) looks like this:

@Test
public void assinalaPrimeirasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], é necessário ser esperto. O nosso [dia] a [dia] é complicado.";
    
  assertEquals(esperado, new AnalisaTexto().assinalaRepetidas(texto, "[", "]", 10));
}

Although the question describes wanting only the first 10 words, the expected result example seems to consider all of them. So I added a signature that dispenses with the "radius" of words to analyze:

public String assinalaPalavrasRepetidas(String texto, String marcadorInicio, String marcadorFim) {
    return assinalaRepetidas(texto, marcadorInicio, marcadorFim, Integer.MAX_VALUE);
}

Using this other method, since more than 10 words are analyzed, the "é" is also identified as repeated:

@Test
public void assinalaTodasPalavrasRepetidas() {
  String texto = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";
  String esperado = "Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado.";
    
  assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}

Finally, note that I also used regular expressions when replacing the words with their marked equivalents. Note the regex in the call to texto.replaceAll. Without it, a part of another word that happened to coincide would also be marked. For example, "servir" would be flagged as "[ser]vir".

The test that proves the effectiveness of this small care is:

@Test
public void assinalaApenasPalavraInteira() {
    
    String texto = "Hoje em dia, pode ser necessário servir ao ser esperto.";
    String esperado = "Hoje em dia, pode [ser] necessário servir ao [ser] esperto.";
    
    assertEquals(esperado, new AnalisaTexto().assinalaPalavrasRepetidas(texto, "[", "]"));
}
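
The whole-word care discussed above can be illustrated in isolation. This is only a sketch: the class and method names are hypothetical, and instead of the \b used in the answer it uses negative lookarounds on \p{L}, an assumption made here so that the example does not depend on how \b treats accented letters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class contrasting substring and whole-word replacement.
class WholeWordDemo {

    // Naive version: replaces every occurrence, even inside other words.
    static String markAnywhere(String text, String word) {
        return text.replace(word, "[" + word + "]");
    }

    // Whole-word version: the match must not be preceded or followed by a letter.
    static String markWholeWord(String text, String word) {
        String regex = "(?<!\\p{L})" + Pattern.quote(word) + "(?!\\p{L})";
        return text.replaceAll(regex, Matcher.quoteReplacement("[" + word + "]"));
    }
}
```

With the text "pode ser servir" and the word "ser", markAnywhere returns "pode [ser] [ser]vir", whereas markWholeWord returns "pode [ser] servir".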
  • 3

    Best answer so far, simple and covers all points! I would just suggest using a Set instead of a List, it makes no difference if the text and/or qtdPalavrasAnalisar is small, but if both are large the solution goes from quadratic to linear (at least in the case of many non-repeating words). In this case, the condition of the while would have to be adapted, of course.

  • @Caffé the problem with the 10-word radius is that your solution analyzes only the first 10 words of the string, not a 10-word radius around each word.

  • the words have to be marked if they are repeated within a 10-word radius; beyond that, they must not be marked

  • 1

    @Peace You have just explained a problem very clearly. How about updating the question so that it gets this clarity? Also consider unchecking the answer as accepted so the question can attract answers again, or consider posting a new question.

  • @Caffé I changed the question and unmarked the answer as accepted. Still, I hope you can solve this for me, because your answer was the most complete.

  • @Peace I published a new answer for this "new" requirement, because doing it from scratch produced completely different code that would mischaracterize this answer.

10

I decided to reimplement it from scratch. The reasons:

  • Not depending on custom list implementations;
  • Not depending on Strings with special, arbitrary values;
  • Not needing hardcoded numbers or complicated arithmetic;
  • Not needing regular expressions where a simpler approach does the job;
  • Delegating to Java itself the decision of what is a letter, according to its implementation of the Unicode standard.

And here is the result, with explanatory comments in the code:

import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * @author Victor
 */
public class BuscaPalavras {

    private static final Locale PT_BR = new Locale("PT", "BR");

    public static Set<String> palavrasRepetidas(int raio, String texto) {
        // Repeated words already found within the radius. A Set eliminates duplicates automatically.
        Set<String> palavrasRepetidas = new LinkedHashSet<>(50);

        // List holding the most recent words found. Its maximum size equals the radius.
        List<String> ultimasPalavras = new LinkedList<>();

        // Holds the word being built as the characters are read.
        StringBuilder palavra = null;

        // Turns the whole text into an array.
        char[] caracteres = texto.toCharArray();
        int tamanho = caracteres.length;

        // Iterates over each array position up to one past the last (important).
        for (int i = 0; i <= tamanho; i++) {
            // If character i of the text is a letter, appends it to the StringBuilder.
            // At the position past the last one, this if is skipped and the else-if is reached.
            if (i < tamanho && Character.isLetter(caracteres[i])) {
                // Creates the StringBuilder if the word is starting now.
                if (palavra == null) palavra = new StringBuilder(20);
                palavra.append(caracteres[i]);

            // Otherwise, if a word has just ended...
            } else if (palavra != null) {
                // Takes it out of the StringBuilder and converts it to upper case.
                String novaPalavra = palavra.toString().toUpperCase(PT_BR);

                // If it is one of the most recent words (within the radius), adds it to the repeated words.
                if (ultimasPalavras.contains(novaPalavra)) palavrasRepetidas.add(novaPalavra);

                // Slides the list of most recent words forward.
                if (ultimasPalavras.size() >= raio) ultimasPalavras.remove(0);
                ultimasPalavras.add(novaPalavra);

                // The word is finished. Back to null so that another word can start later.
                palavra = null;
            }
        }
        return palavrasRepetidas;
    }

    // To test the method palavrasRepetidas.
    public static void main(String[] args) {
        String texto = "O rato roeu a roupa do rei de Roma e a rainha roeu o resto."
                + " Quem mafagafar os mafagafinhos bom amafagafigador será."
                + " Será só imaginação? Será que nada vai acontecer? Será que é tudo isso em vão?"
                + " Será que vamos conseguir vencer? Ô ô ô ô ô ô, YEAH!"
                + " O pato faz Quack-quack!"
                + " Quem é que para o time do Pará?";

        System.out.println(palavrasRepetidas(10, texto));
    }
}

Output of the main method:

[A, ROEU, SERÁ, QUE, Ô, QUACK, O]
  • that’s not the point. It’s not just a matter of collecting the words that repeat. I need the words to be marked in the original text. I’m editing the question.

  • 1

    Could this be an XY problem?

  • @Patrick no.. the problem is clear. It is marking words that repeat in a text.

10

The code is self-explanatory. Basically what I did was:

  1. Create a list of the repeated words;
  2. Go through this list and search the original text for each word in it;
  3. Replace the word with itself, wrapped in the marking.

I marked the text with <b></b> around the repeated words.

Once you have the list of repeated words it becomes easy, because the function replace() does almost everything: it searches for the words in the original text and replaces them with the marked version.

Main

Ideone ide = new Ideone();
StringBuffer texto = new StringBuffer("Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.");
List<String> palavrasRepetidas = ide.pegaRepetidas(texto);

// Wrap every occurrence of each repeated word in <b></b>.
// Starting from the plain text also covers the case of an empty list.
String saida = texto.toString();
for (String palavra : palavrasRepetidas) {
  saida = saida.replace(palavra, "<b>" + palavra + "</b>");
}
System.out.println(saida);

pegaRepetidas()

/** Returns a list with the words that appear more than once in the text */
private static List<String> pegaRepetidas(StringBuffer texto) {
    String textoFormatado = texto.toString().replaceAll("[,.!]", ""); // Removes periods, commas and exclamation marks
    StringTokenizer st = new StringTokenizer(textoFormatado);

    List<String> palavrasRepetidas = new ArrayList<>();

    while (st.hasMoreTokens()) {
        String palavra = st.nextToken();
        if (contaPalavra(palavra, textoFormatado) > 1) { // If the word appears more than once
            if ( !palavrasRepetidas.contains(palavra) ) { // If it is not yet in the list of repeated words
                palavrasRepetidas.add(palavra);
            }
        }
    }

    return palavrasRepetidas;
}

contaPalavra()

/** Returns the number of times 'palavra' appears in 'texto' */
private static int contaPalavra(String palavra, String texto) {
    StringTokenizer st = new StringTokenizer(texto);
    int count = 0;
    while (st.hasMoreTokens()) {
        if (st.nextToken().compareTo(palavra) == 0) {
            count++;
        }
    }

    return count;
}

See it working on ideone: http://ideone.com/n8xLlo

8

I’m not very good at Java, so I accept corrections to this code, but I think I’ve arrived at the result you want:

See it running here on Ideone:

http://ideone.com/8YvDnp

import java.util.*;
import java.lang.*;
import java.io.*;

enum TokenKind
{
    WordSeparator,
    Word
}

class Token
{
    int _start;
    int _end;
    String _text;
    TokenKind _kind;

    public String getText()
    {
        return _text;
    }

    public void setText(String value)
    {
        _text = value;
    }

    public void setStart(int value)
    {
        _start = value;
    }

    public void setEnd(int value)
    {
        _end = value;
    }

    public int getStart()
    {
        return _start;
    }

    public int getEnd()
    {
        return _end;
    }

    public TokenKind getKind()
    {
        return _kind;
    }

    public void setKind(TokenKind value)
    {
        _kind = value;
    }
}

class LinearRepeatSearchLexer
{
    StringBuffer _text;
    int _position;
    char _peek;

    public LinearRepeatSearchLexer(StringBuffer text)
    {
        _text = text;
        _position = 0;
        _peek = (char)0;
    }

    public Token nextToken()
    {
        Token ret = new Token();
        char peek = PeekChar();

        if(isWordSeparator(peek))
        {
            ret.setStart(_position);
            readWordSeparator(ret);
            ret.setEnd(_position - 1);
            return ret;
        }
        else if(isLetterOrDigit(peek))
        {
            ret.setStart(_position);
            readWord(ret);
            ret.setEnd(_position - 1);
            return ret;
        } 
        else if(peek == (char)0)
        {
            return null;
        }
        else
        {
            // TODO:
            //  unidentified characters
            //  or you can simplify readWord
            return null;
        }
    }

    void readWordSeparator(Token token)
    {
        char c = (char)0;
        StringBuffer tokenText = new StringBuffer();
        while(isWordSeparator(c = PeekChar()))
        {
            tokenText.append(c);
            MoveNext(1);
        }
        token.setText(tokenText.toString());
        token.setKind(TokenKind.WordSeparator);
    }

    void readWord(Token token)
    {
        char c = (char)0;
        StringBuffer tokenText = new StringBuffer();
        while(isLetterOrDigit(c = PeekChar()))
        {
            tokenText.append(c);
            MoveNext(1);
        }
        token.setText(tokenText.toString());
        token.setKind(TokenKind.Word);
    }

    boolean isWordSeparator(char c)
    {
        // TODO: other separators here
        return c == ' ' ||
            c == '\t' ||
            c == '\n' ||
            c == '\r' ||
            c == ',' ||
            c == '.' || 
            c == '-' || 
            c == ';' || 
            c == ':' ||
            c == '=' ||
            c == '>';
    }

    boolean isLetterOrDigit(char c)
    {
        // TODO: other letters here
        return (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            (c >= 'à' && c <= 'ú') ||
            (c >= 'À' && c <= 'Ú') ||
            c == '_';
    }

    char PeekChar()
    {
        if(_position < _text.length())
            return _text.charAt(_position);
        return (char)0;
    }

    void MoveNext(int plus)
    {
        _position += plus;
    }
}

class LinearRepeatSearch
{
    StringBuffer _text;
    int _radius;

    public LinearRepeatSearch(StringBuffer text, int radius)
    {
        _text = text;
        _radius = radius;
    }

    public LinearRepeatSearch(String text, int radius)
    {
        this(new StringBuffer(text), radius);   
    }

    public List<Token> getRepeatedWords()
    {
        // read all the tokens
        ArrayList<Token> ret = new ArrayList<Token>();
        ArrayList<Token> readed = new ArrayList<Token>();
        LinearRepeatSearchLexer lexer = new LinearRepeatSearchLexer(_text);
        Token token = null;
        while((token = lexer.nextToken()) != null)
        {
            if(token.getKind() == TokenKind.Word)
                readed.add(token);
        }

        // find repetitions based on the radius
        // PERF:
        //      the performance of this loop can be improved,
        //      since it performs repeated comparisons
        int size = readed.size();
        for(int x = 0; x < size; x++)
        {
            Token a = readed.get(x);
            for(int y = Math.max(0, x - _radius); y < size && (y - x) < _radius; y++)
            {
                if(x == y) continue;
                Token b = readed.get(y);
                if(a.getText().equals(b.getText()))
                {
                    ret.add(a);
                    break;
                }
            }
        }

        return ret;
    }
}

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // your code goes here

        StringBuffer input = new StringBuffer("Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.");
        StringBuffer output = new StringBuffer();
        LinearRepeatSearch searcher = new LinearRepeatSearch(input, 10);
        List<Token> spans = searcher.getRepeatedWords();
        int listSize = spans.size();
        int position = 0;
        for(int x = 0; x < listSize; x++)
        {
            Token item = spans.get(x);
            output.append(input.substring(position, item.getStart()));
            output.append("<b>");
            output.append(item.getText());
            output.append("</b>");
            position = item.getEnd() + 1;
        }
        if(position < input.length())
        {
            output.append(input.substring(position));
        }
        System.out.println(output.toString());
    }
}

Code result:

Hoje em <b>dia</b>, <b>é</b> necessário ser esperto. O nosso <b>dia</b> a <b>dia</b> <b>é</b> complicado.

7

A very simple solution*, using split and two sets:

public static String marcarRepetidas(String s, String prefixo, String sufixo) {
    Set<String> palavras = new HashSet<String>();
    Set<String> palavrasRepetidas = new HashSet<String>();

    // Finds the set of repeated words
    for ( String palavra : s.split("[^a-zA-Zà-úÀ-Ú]+") ) {
        palavra = palavra.toLowerCase();
        if ( palavra.length() > 0 && palavras.contains(palavra) )
            palavrasRepetidas.add(palavra);
        palavras.add(palavra);
    }

    // Marks each of these words in the text (wrapping them in a prefix and suffix)
    for ( String palavra : palavrasRepetidas )
        s = s.replaceAll("(?<![a-zA-Zà-úÀ-Ú])(?iu)(" + palavra + ")(?![a-zA-Zà-úÀ-Ú])",
                         prefixo + "$1" + sufixo);

    // In Java 8 it is simpler (a single call to replaceAll):
    // String juncao = String.join("|", palavrasRepetidas);
    // s = s.replaceAll("(?<![a-zA-Zà-úÀ-Ú])(?iu)(" + juncao + ")(?![a-zA-Zà-úÀ-Ú])",
    //                  prefixo + "$1" + sufixo);

    return s;
}

Example on Ideone. The replaceAll at the end deserves an explanation: before replacing a word in the text, it is important to make sure that it really is a whole word, and not a substring of another word. For this I used two negative lookarounds, one to check that it is not preceded by a letter and another to check that it is not followed by one. The (?iu) makes the match ignore capitalization, and the capture group lets the word be replaced by its marked version without changing its capitalization. Example:

Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado. Zé. Diafragma. Dia. É. Dia.

Output (with "[" and "]" as prefix and suffix):

Hoje em [dia], [é] necessário ser esperto. O nosso [dia] a [dia] [é] complicado. Zé. Diafragma. [Dia]. [É]. [Dia].
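
The effect of the (?iu) flags and of the capture group can be checked in isolation. The following is only a sketch with hypothetical names (CaseInsensitiveDemo, mark, matchesWithFlags); it reuses the letter class from the answer, written with \u escapes so that it does not depend on the source file encoding, and it quotes the word with Pattern.quote, which the answer's code does not do.

```java
import java.util.regex.Pattern;

// Hypothetical helper class illustrating (?iu) and the "$1" replacement.
class CaseInsensitiveDemo {

    // Same letter class as in the answer: a-z, A-Z, à-ú and À-Ú.
    static final String LETTER = "[a-zA-Z\u00e0-\u00fa\u00c0-\u00da]";

    // Wraps whole-word, case-insensitive matches of 'word' in brackets.
    // "$1" re-inserts the text exactly as found, so capitalization is kept.
    static String mark(String text, String word) {
        String regex = "(?<!" + LETTER + ")(?iu)(" + Pattern.quote(word) + ")(?!" + LETTER + ")";
        return text.replaceAll(regex, "[$1]");
    }

    // True if 'input' matches 'literal' under the given inline flags, e.g. "(?i)" or "(?iu)".
    static boolean matchesWithFlags(String flags, String literal, String input) {
        return Pattern.matches(flags + Pattern.quote(literal), input);
    }
}
```

Here mark("Dia. dia. Diafragma.", "dia") returns "[Dia]. [dia]. Diafragma.". As for the flags, matchesWithFlags("(?i)", "é", "É") is false, because (?i) alone only folds ASCII letters, while matchesWithFlags("(?iu)", "é", "É") is true.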

* This answer aims at simplicity, not efficiency; a "manual" method (i.e. one where every costly API call, such as the regexes, is replaced by an explicit loop and then optimized, taking advantage of StringBuilder, etc.) could perform better if that requirement is important in your particular case.

  • 1

    Excellent point about capitalization! That detail escaped me. My answer suffers from not recognizing that "é" is equal to "É". And although detecting that is simple with regex, there is still the matter of replacing "É" with "[É]" rather than with "[é]". That is: it will take more code to solve the problem. As for what you mentioned about (?i) not being effective, try (?iu) (Unicode). See: https://blogs.oracle.com/xuemingshen/entry/case_insensitive_matching_in_java

  • @Caffé worked perfectly! Updated answer

5

Hello. I made a very simple implementation.

Essentially, the words are counted and a map is populated with each word and its number of occurrences. I used the word as the key and the count as the value.

Then I iterate over the map, replacing in the text the words that occur more than once.

In Java 8 the code would be cleaner, and the implementation can certainly be improved.

See the code:

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.StringTokenizer;


public class WordCount {
    Map<String, Integer> counter = new HashMap<String, Integer>();

    /**
     * @param args
     */
    public static void main(String[] args) {
        new WordCount().count();
    }

    private void count() {
        String string = "Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";

        StringTokenizer token = new StringTokenizer(string, " .,?:"); // characters that are not of interest

        while (token.hasMoreTokens()) {
            String s = token.nextToken();
            count(s);
            System.out.println(s);
        }

        System.out.println(counter);
        print(string);
    }

    private void count(String s) {
        Integer i = this.counter.get(s);
        this.counter.put(s, i == null ? 0 : ++i);
    }

    private void print(String s) {
        for (Entry<String, Integer> e : this.counter.entrySet()) {
            if (e.getValue() > 0) {
                s = s.replaceAll(e.getKey(), String.format("<b>%s</b>", e.getKey()));
            }
        }

        System.out.println(s);
    }
}
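
As a rough idea of what the counting step might look like in Java 8, here is a sketch based on Map.merge. The names WordCountSketch and countWords are hypothetical, and two things differ from the code above by assumption: the words are split on any run of non-letters (\P{L}+) instead of a fixed delimiter list, and the map stores the real number of occurrences, so a repeated word is one whose count is greater than 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical Java 8 variant of the counting step.
class WordCountSketch {

    // Counts how many times each word occurs; a word is a maximal run of letters.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.split("\\P{L}+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For "dia a dia, e dia" this gives a count of 3 for "dia" and 1 for both "a" and "e".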

5


Algorithm:

In this solution I break the whole text into tokens. Each token is either a word or whatever lies between words (spaces, punctuation and other symbols).

Then I go through the tokens to see whether each one is a word that already exists among the last words read; the number of last words to compare against is limited by the specified radius.

If the token matches a repeated word within the radius, I mark both the token I am reading now and the repeated word that was already there.

Finally I go through all the tokens again, reconstructing the original text and wrapping in markers the words whose tokens were marked as repeated.

Code:

public static String assinalaPalavrasRepetidasEmUmRaio(String texto,
        String marcadorInicio, String marcadorFim, int qtdPalavrasRaio) {

    List<Token> tokens = new ArrayList<Token>();
    List<Token> palavrasNoRaio = new ArrayList<Token>();
    String palavraeNaoPalavraPattern = "\\p{L}+|[^\\p{L}]+";
    Matcher matcher = Pattern.compile(palavraeNaoPalavraPattern).matcher(texto);

    while (matcher.find()) {
        Token token = new Token(matcher.group());
        tokens.add(token);
        if (token.isPalavra() && palavrasNoRaio.contains(token)) {
            palavrasNoRaio.get(palavrasNoRaio.indexOf(token)).assinala();
            token.assinala();
        }
        if (token.isPalavra()) {
            palavrasNoRaio.add(token);
        }
        if (palavrasNoRaio.size() > qtdPalavrasRaio) {
            palavrasNoRaio.remove(0);
        }
    }
    StringBuilder textoReconstruido = new StringBuilder();
    for (Token token : tokens) {
        if (token.isAssinalado()) {
            textoReconstruido.append(marcadorInicio + token + marcadorFim);
        } else {
            textoReconstruido.append(token);
        }
    }
    return textoReconstruido.toString();
}

Token class:

As noted in the above code, the Token itself knows whether or not it is a word, and also has a flag indicating whether it has been flagged.

class Token {
    private final String texto;
    private final boolean isPalavra;
    private boolean isAssinalado;

    public Token(String texto) {
        isPalavra = texto.matches("\\p{L}+");
        this.texto = texto;
    }
    public boolean isPalavra() {
        return isPalavra;
    }
    public void assinala() {
        isAssinalado = true;
    }
    public boolean isAssinalado() {
        return isAssinalado;
    }
    @Override
    public int hashCode() {
        return texto.hashCode();
    }
    @Override
    public boolean equals(Object obj) {
        if (obj == null || !(obj instanceof Token)) {
            return false;
        }
        return texto.equalsIgnoreCase(((Token)obj).texto);
    }
    @Override
    public String toString() {
        return texto;
    }
}

The methods hashCode and equals are not called directly by my code. equals is used by the Java implementation of List.contains and List.indexOf to decide whether an item is the one being searched for. hashCode is not consulted by ArrayList; it is overridden together with equals by convention, and it would speed up lookups only in hash-based collections such as HashSet or HashMap.

There are several techniques for writing a hash code that helps performance. In this case I simply return the hash code of the text, because the text is what I compare in equals to tell whether one token is the same as another. Note that even if hashCode returned zero for every Token, the search would still work; only performance would be at stake. The subject of hash codes is worth an in-depth read.
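
One detail is worth a small illustration: when equals ignores capitalization, the hash code should ignore it too, otherwise two equal objects (such as "É" and "é") can have different hash codes and a hash-based collection may fail to find them. The sketch below uses a hypothetical class, CaseInsensitiveWord, not the Token class above, and folds each character in the same way String.equalsIgnoreCase does.

```java
// Hypothetical class showing an equals/hashCode pair that both ignore case.
class CaseInsensitiveWord {
    private final String text;

    CaseInsensitiveWord(String text) {
        this.text = text;
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof CaseInsensitiveWord
                && text.equalsIgnoreCase(((CaseInsensitiveWord) obj).text);
    }

    // Folds every char to lower(upper(c)), the last comparison equalsIgnoreCase
    // falls back to, so words that are equal always produce the same hash code.
    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < text.length(); i++) {
            h = 31 * h + Character.toLowerCase(Character.toUpperCase(text.charAt(i)));
        }
        return h;
    }

    @Override
    public String toString() {
        return text;
    }
}
```

With this pair, new CaseInsensitiveWord("É") and new CaseInsensitiveWord("é") are equal and share a hash code, so they are found both by ArrayList.contains, which relies on equals, and by HashSet.contains, which relies on hashCode first.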

Consumer code:

And this is the unit test:

@Test
public void assinalaRepetidasEmUmRaio() {

  String texto = "Dia! É bom ser esperto, não é mesmo? O nosso dia a dia é complicado.";

  String esperado = "Dia! [É] bom ser esperto, não [é] mesmo? O nosso [dia] a [dia] é complicado.";

  String obtido = ProcessadorTexto.assinalaPalavrasRepetidasEmUmRaio(texto, "[", "]", 5);

  assertEquals(esperado, obtido);
}

Note that even words with different capitalization are detected as repeated ("É" and "é"), which was a deficiency of my first answer, brought to my attention by @mgibsonbr's answer. What does the trick here is the method Token.equals, which is used to check whether the token is already in the list of words (palavrasNoRaio.contains(token)).

Notice also that the first word "Dia" and the last word "é" were not marked, because their closest repetitions are at a distance greater than the specified radius.
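
The distance rule behind this observation can also be sketched without a list of recent words, by remembering the index of the last occurrence of each word in a map. This is an alternative illustration, not the method used above; the names (RadiusMarkerSketch, mark) are hypothetical, and it assumes that two occurrences are "within the radius" when the difference between their word indexes is at most the radius.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: marks words repeated within a radius using last-seen indexes.
class RadiusMarkerSketch {

    static String mark(String text, int radius, String open, String close) {
        Matcher m = Pattern.compile("\\p{L}+").matcher(text);
        List<int[]> spans = new ArrayList<>();      // {start, end} of each word
        List<Boolean> marked = new ArrayList<>();   // whether each word must be marked
        Map<String, Integer> lastIndex = new HashMap<>();

        while (m.find()) {
            int idx = spans.size();
            spans.add(new int[] {m.start(), m.end()});
            marked.add(false);
            // Only the most recent previous occurrence matters: if it is
            // too far away, every earlier occurrence is farther still.
            Integer prev = lastIndex.put(m.group().toLowerCase(Locale.ROOT), idx);
            if (prev != null && idx - prev <= radius) {
                marked.set(prev, true);
                marked.set(idx, true);
            }
        }

        // Rebuild the text, wrapping the marked words in the markers.
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < spans.size(); i++) {
            int[] s = spans.get(i);
            out.append(text, pos, s[0]);
            if (marked.get(i)) out.append(open);
            out.append(text, s[0], s[1]);
            if (marked.get(i)) out.append(close);
            pos = s[1];
        }
        return out.append(text, pos, text.length()).toString();
    }
}
```

For example, mark("one two one three four five six one", 3, "[", "]") returns "[one] two [one] three four five six one": the last "one" is five words away from its nearest repetition and stays unmarked, while with a radius of 10 all three occurrences are marked.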

On the regular expressions used:

The regular expression (regex) I used to find every Unicode word is \p{L}+, because the simple \w+ in Java gets lost with accented words.

And the regex I used to get everything else that is not a word is the negation of the other expression, namely [^\p{L}]+. This is because in Java the non-word regex \W+ also matches accented characters.

And to get all tokens at once (words and non-words) I combined the two regular expressions with the symbol | (pipe), which can be read as "one or the other"; for example, X|Y finds both the matches of X and the matches of Y.
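
A minimal sketch of this alternation on its own, using hypothetical names (TokenizeSketch, tokenize): because every character is either a letter or not, the two alternatives together consume the whole text, so joining the tokens gives back the original string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class showing the word / non-word alternation.
class TokenizeSketch {

    // Splits the text into alternating word and non-word tokens, losing no characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+|[^\\p{L}]+").matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For "dia a dia!" the tokens are "dia", " ", "a", " ", "dia" and "!", and String.join("", tokens) reproduces the input.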

4

Okay, okay, okay, it’s not Java...

#!/usr/bin/perl
use strict;
use utf8::all;

my %conta;
my $s="Hoje em dia, é necessário ser esperto. O nosso dia a dia é complicado.";

for($s=~ m{(\w+)}g ) { $conta{$_}++ }
$s =~ s{(\w+)}{ if($conta{$1}>1){ "<b>$1</b>"} else {"$1"}}eg;

print $s;
