How to remove duplicate elements from a C list?

Question

How to remove duplicate elements from a C list?

Asked 7 years, 4 months ago

Viewed 2,607 times

2

I have a contact list and need to be removed duplicates, I was trying to solve using the code below:

Elemento* remove_duplicados(Elemento* agenda)
{
    Elemento* a = NULL; //POnteiro do Elemento anterior
    Elemento* p;
    Elemento* q;

    for(p = agenda; p != NULL; p = p -> prox){//Analisa do primeiro elem até o ultimo
        for(q = p -> prox; q != NULL; q = q -> prox){//Analisa proximo elem de p até o fim
            if(0 != strcmp(p -> info.nome, q -> info.nome)){
                a = q; 
            }
        }

        if(a == NULL){ //Remove elemento do início e aponta para o 2º
            agenda = p -> prox;
        } 
        else
            a -> prox = p -> prox; //Ponteiro anterior levará para o prox

        free(p); //Libera o espaço
    }
    return agenda;
}

The problem is that when executing the program, when arriving at this stage it closes :/

Merge a list but do not put repeat results

– Jefferson Quesado

2017/11/30 at 18:11

1 answer

Browser other questions tagged c chain-list

You are not signed in. Login or sign up in order to post.

by Isac • **24,736** points · Answer 1 · 2017-12-01T01:16:34+00:00

Problems

There are some flaws in the logic that is implemented:

You are not removing the various repeated successively as you only use the a at the end of the second for
Can’t do free(p) and then use on for to move on to the next with p = p -> prox
The advance agenda = p -> prox it is not necessary to remove duplicates from the first rather than the first itself. This makes your instruction should be agenda->prox = p->prox or p->prox =q->prox since it stops a was NULL p must be exactly in the next element

Fixing the problems

Using the logic and variables you have I suggest you do so:

Elemento* remove_duplicados(Elemento* agenda){
    Elemento* a;
    Elemento* p;
    Elemento* q;

    for(p = agenda; p != NULL; p = p -> prox ){
        a = NULL; //a cada varrimento a começa a NULL

        for(q = p -> prox; q != NULL; ){ //sem incremento
            if(0 == strcmp(p -> info.nome, q -> info.nome)){ //teste de igual
                if(a == NULL){ //remove do inicio
                    p -> prox = q -> prox;
                }
                else{ //ou do meio/fim
                    a -> prox = q -> prox;
                }

                Elemento* remover = q;
                q = q -> prox;
                free(remover);
            }
            else { //so atualiza o anterior quando não é igual
                a = q;
                q = q->prox;
            }
        }
    }

    return agenda;
}

I withdrew the comments I had in order to highlight the ones I put, and with the explanations associated with the exchanges I made.

See this code working on Ideone

Refactoring for better readability

The names of variables that have are very little suggestive and make it difficult to read. You should always try to give representative names of the content that the variables have. In addition, the first element is never changed as duplicates are removed forward, so the type of return is not necessary either, and should be void.

Taking this into account we can make the function much clearer:

void remove_duplicados(Elemento* agenda){
    Elemento *anterior, *corrente, *seguinte;

    for(corrente = agenda; corrente != NULL; corrente = corrente -> prox ){
        anterior = NULL;

        for(seguinte = corrente -> prox; seguinte != NULL; ){
            if(0 == strcmp(corrente -> info.nome, seguinte -> info.nome)){
                if(anterior == NULL)
                    corrente -> prox = seguinte -> prox;
                else
                    anterior -> prox = seguinte -> prox;

                Elemento* remover = seguinte;
                seguinte = seguinte -> prox;
                free(remover);
            }
            else {
                anterior = seguinte;
                seguinte = seguinte->prox;
            }
        }
    }
}

Efficiency with Hash Table

If there is concern regarding efficiency, the solution presented by you, which was the one I took, can bring problems if the amount of data is large, since it is a quadratic solution, with time complexity in the order of O(n²).

There are several ways to improve the solution but with a hash table and ensuring few collisions we achieve a solution in the order of O(n).

We start with the structure of each hash table entry and the respective array that represents it:

typedef struct entrada {
    struct info *dados;
    struct entrada *prox;
} entrada;

#define TAMANHO_HASH 100
entrada *hash_nomes[TAMANHO_HASH];

To find the input for a given value we need the hash, which in this case we can use as strings:

unsigned long hash(char *str){
    unsigned long hash = 5381;
    int c;
    while (c = *str++)
        hash = ((hash << 5) + hash) + c;

    return hash;
}

Create a good function of hash is in itself a science, and will be reflected in the performance of the algorithm as it will create many or few collisions depending on how it is built. For simplicity I used an already known.

The function that remove_duplicados is now further simplified as it is based entirely on the hash to know if the current value already exists:

void remove_duplicados(Elemento* agenda){
    Elemento *corrente = agenda, *anterior = NULL;

    while (corrente != NULL){
        if (adiciona_se_inexistente(&corrente->info) == 0){ //0 não adicionou por já existir
            anterior->prox = corrente->prox;
            Elemento* remover = corrente;
            corrente = corrente -> prox;
            free(remover);
        }
        else {
            anterior = corrente;
            corrente = corrente -> prox;
        }
    }
}

Here the function was used adiciona_se_inexistente to add to the table of hash if there is no:

int adiciona_se_inexistente(struct info* dados){
    unsigned long key = hash(dados->nome) % TAMANHO_HASH; //achar a entrada

    entrada *corrente = hash_nomes[key];
    while (corrente != NULL){ //verificar se já existe
        if (strcmp(corrente->dados->nome, dados->nome) == 0){
            return 0;
        }
        corrente = corrente->prox;
    }

    //não existe adiciona a nova entrada
    entrada *nova = malloc(sizeof(entrada));
    nova->prox = hash_nomes[key];
    nova->dados = dados;
    hash_nomes[key] = nova;
    return 1;
}

The table of hash has to be initialized with all entries NULL, what I did using the function memset.

Remarks:

I implemented the hash with internal collision list instead of open addressing, which ultimately uses more space, but is potentially simpler to implement.
This solution is much more performance efficient if we ensure we have few collisions, something that depends on the function of hash and the size of the table. However it will use more memory than its original solution. Note that the size of 100 that I chose for the table is good for the amount of names I used (5), but if it has more names it can already generate many collisions, so it is something that has to adapt to your needs.
It is also worth remembering that if you only need this function at one point in the program and you are sure that it will not be needed later, you need to free up the space of all filled entries in the hash, with free.

Example of this implementation in Ideone