How to check whether an ArrayList contains two identical items and remove them?

2

I am tokenizing a TXT file.

I need the code to store all the tokens in an ArrayList, but the list must not contain any duplicate tokens.

I would like to know how to remove duplicate tokens, or how to check whether a token already exists and, in that case, not add it.

My current code:

for (org.cogroo.text.Token token : sentence.getTokens()) { // list of tokens

    token.getStart(); token.getEnd(); // characters where the token starts and ends
    token.getLexeme(); // the token's text (the word it isolates, e.g. "clinico")
    token.getLemmas(); // an array with the possible lemmas for the lexeme+postag pair
    token.getPOSTag(); // morphological class according to context (e.g. prp, adj, n (noun))
    token.getFeatures(); // gender, number, tense, etc.
    contadorTokens++;
    System.out.println(expandirAcronimos(token.getLexeme()) + "_" + token.getPOSTag() + "_" + token.getFeatures()); // prints the word with its tag
    gravarArq.println(token.getLexeme() + "_" + token.getPOSTag() + "_" + token.getFeatures()); // writes each tokenized word to the txt file
    gravarArquivo.println(token.getPOSTag() + "_" + token.getFeatures()); // writes each token to the "Tokens.txt" file

    listaTokens.add(token.getPOSTag()); // ADDS the tags to a list

    for (int s = 0; s < listaTokens.size(); s++) { // ITERATES OVER THE LIST
        if (!listaTokens.equals(token.getPOSTag())) {

        }
    }
}
  • 1

    mightduck, for your own sanity and to make things easier for whoever helps you, it is essential to indent the code logically. A good IDE helps with this. . . . Maybe mgibson's answer already solves it, but it is missing the closing brace of the first for...

1 answer

3


To store elements without repetition, the ideal choice is a "set" data type rather than a "list". I suggest HashSet, or LinkedHashSet if the insertion order of the tokens must be preserved:

import java.util.HashSet;
import java.util.Set;

Set conjuntoTokens = new HashSet(); // Can be generic, i.e. Set<Type>

for (org.cogroo.text.Token token : sentence.getTokens()) { // list of tokens
    ...

    //listaTokens.add(token.getPOSTag()); // ADDS the tags to a list
    boolean mudou = conjuntoTokens.add(token.getPOSTag()); // adds the tags to the set
                                                           // instead of the list
    if ( !mudou ) {
        ... // The element already existed in the set
    }
}

listaTokens.addAll(conjuntoTokens); // adds all the elements of the set to the list
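
As a self-contained illustration of the same idea, here is a minimal sketch that removes duplicates while keeping first-seen order. It assumes the tags are plain String values; the class name TagDeduplication, the method distinctTags and the sample tags are hypothetical and only stand in for the values returned by token.getPOSTag().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TagDeduplication {

    // Returns the distinct tags in the order they were first seen.
    public static List<String> distinctTags(List<String> tags) {
        Set<String> seen = new LinkedHashSet<>(); // LinkedHashSet keeps insertion order
        for (String tag : tags) {
            seen.add(tag); // has no effect if the tag is already in the set
        }
        return new ArrayList<>(seen); // copy the set back into a list
    }

    public static void main(String[] args) {
        // Hypothetical tags standing in for token.getPOSTag() values.
        List<String> tags = Arrays.asList("n", "adj", "n", "prp", "adj");
        System.out.println(distinctTags(tags)); // prints [n, adj, prp]
    }
}
```

With a plain HashSet the result would contain the same three tags, but their order would not be guaranteed.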
  • But if I add this to a set, how will I check whether there are two equal items in the set?

  • @mightduck If the element you add already exists in the set, the set does not change (that is what makes it a set). You can check whether this happened through the return value of add. See my updated answer.

  • I don't understand what this if (!mudou) does. When I write boolean mudou = ..., does it already check whether the element exists in the set and add it only if it doesn't?

  • @mightduck More or less, yes. The "set" data type itself guarantees that no value is stored twice. When you call add, it checks whether the element already exists and adds it only if it does not. It then returns a boolean telling you whether the set changed as a result of the add. If the element did not exist yet, it is added and the set changes (true). If it already existed, it is not added and the set does not change (false). You do not need to use this return value if you do not want to; the if I put at the end was just an example.
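
To make the return value of add concrete, here is a small sketch that uses it to count how many tags were repeats. It assumes String tags; the class name AddReturnValue, the method countRepeats and the sample values are hypothetical and not part of the original code.

```java
import java.util.HashSet;
import java.util.Set;

public class AddReturnValue {

    // Counts how many entries in tags repeat an earlier entry.
    public static int countRepeats(String[] tags) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (String tag : tags) {
            boolean changed = seen.add(tag); // true only when the tag was not yet in the set
            if (!changed) {
                repeats++; // the tag was already present, so the set stayed the same
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        String[] tags = {"n", "adj", "n", "n"}; // hypothetical tag values
        System.out.println(countRepeats(tags)); // prints 2
    }
}
```

If you only need the distinct values, the if can be dropped and the return value ignored.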
