Data Comparison in Files [C]

Question

Data Comparison in Files [C]

Asked 7 years, 9 months ago

Viewed 1,174 times

1

I want to make an algorithm that takes a name in an X file and see if this name exists in another Y file, if it does not exist it writes that name in a third Z file. (Basically an algorithm that points the missing in a presence list). I tried to do this in the algorithm below but it didn’t work, someone could point out errors in the logical structure that prevent it from working?

#include <stdio.h>
#include <string.h>

typedef struct tipo_nome{
    char nomePessoa[50];
} nome;

void main(){
    int i, k=0;
    nome nome1, nome2;
    FILE *file1, *file2, *file3;
    file1 = fopen("listaCompleta.txt", "r");
    file2 = fopen("listaPresenca.txt", "r");
    file3 = fopen("listaFaltantes.txt", "a+");

    do{
        fread(&nome1, sizeof(nome), 1, file1);
        do{
            fread(&nome2, sizeof(nome), 1, file2);
            if(strcmp(nome1.nomePessoa, nome2.nomePessoa)==0){
                k=1;
            }
        }while(!feof(file2) && k==0);

        if(k==0){
            fwrite(&nome1, sizeof(nome), 1, file3);
        }
        k=0;    
    }while(!feof(file1));

}

Do you really need to be in [tag:C]? A shell script would solve your life much more easily

– Jefferson Quesado

2017/08/18 at 03:40
Yes, unfortunately it has to be in C

– Marcos Rodrigues

2017/08/18 at 03:45
I can write an algorithm here with very low efficiency, but it fits your case perfectly. To optimize, I’m already putting alternatives in the answer like sort the data sets or use an efficient search structure, like a hash or tree, but I don’t dwell too much on the subject because your focus here is not speed.

– Jefferson Quesado

2017/08/18 at 04:00
You also asked for an algorithm, so I’m taking the liberty of putting the idea, without much focus on the C itself

– Jefferson Quesado

2017/08/18 at 04:02

2 answers

Browser other questions tagged c filing-cabinet comparison

You are not signed in. Login or sign up in order to post.

by Lacobus • **13,510** points · Answer 1 · 2017-08-18T19:53:18+00:00

Here is a possible solution to your problem:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>


#define LINHA_MAX_TAM  (100)
#define NOME_MAX_TAM   (50)


typedef struct pessoa_s
{
    char nome[ NOME_MAX_TAM + 1 ];
} pessoa_t;


int carregar_pessoas( const char * arq, pessoa_t ** pp, int * qtd )
{
    FILE * pf = NULL;
    char linha[ LINHA_MAX_TAM + 1 ] = {0};
    pessoa_t * p = NULL;
    int n = 0;

    pf = fopen( arq, "r" );

    if(!pf)
        return -1;

    while( fgets( linha, LINHA_MAX_TAM, pf ) )
    {
        linha[ strcspn(linha, "\n") ] = 0;
        n++;
        p = realloc( p, n * sizeof(pessoa_t) );
        strncpy( p[n-1].nome, linha, NOME_MAX_TAM );
    }

    fclose(pf);

    *qtd = n;
    *pp = p;

    return 0;
}


int gravar_pessoas( const char * arq, pessoa_t * p, int qtd )
{
    FILE * pf = NULL;
    int i = 0;

    pf = fopen( arq, "w" );

    if(!pf)
        return -1;

    for( i = 0; i < qtd; i++ )
        fprintf( pf, "%s\n", p[i].nome );

    fclose( pf );

    return 0;
}


int processar_ausencias( pessoa_t * pc, int qtdc, pessoa_t * pp, int qtdp, pessoa_t ** pa, int * qtda )
{
    int i = 0;
    int j = 0;
    int presente = 0;
    pessoa_t * p = NULL;
    int n = 0;

    for( i = 0; i < qtdc; i++ )
    {
        presente = 0;

        for( j = 0; j < qtdp; j++ )
        {
            if( !strcmp( pc[i].nome, pp[j].nome ) )
            {
                presente = 1;
                break;
            }
        }

        if(!presente)
        {
            n++;
            p = realloc( p, n * sizeof(pessoa_t) );
            strncpy( p[n-1].nome, pc[i].nome, NOME_MAX_TAM );
        }
    }

    *pa = p;
    *qtda = n;

    return 0;
}


int main( void )
{
    pessoa_t * lst_completa = NULL;
    int qtd_completa = 0;

    pessoa_t * lst_presenca = NULL;
    int qtd_presenca = 0;

    pessoa_t * lst_ausencia = NULL;
    int qtd_ausencia = 0;

    carregar_pessoas( "listaCompleta.txt", &lst_completa, &qtd_completa );
    carregar_pessoas( "listaPresenca.txt", &lst_presenca, &qtd_presenca );

    processar_ausencias( lst_completa, qtd_completa, lst_presenca, qtd_presenca, &lst_ausencia, &qtd_ausencia );

    gravar_pessoas( "listaFaltantes.txt", lst_ausencia, qtd_ausencia );

    free(lst_ausencia);
    free(lst_completa);
    free(lst_presenca);

    return 0;
}

listaCompleta.txt

JOSE
JOAO
ANTONIO
FRANCISCO
LUIZ
CARLOS
PEDRO
PAULO
MANOEL
LUCAS

listPresenca.txt

JOSE
JOAO
LUIZ
PEDRO
PAULO
MANOEL
LUCAS

listFaltantes.txt

ANTONIO
FRANCISCO
CARLOS

by Jefferson Quesado • **22,370** points · Answer 2 · 2017-08-18T04:44:16+00:00

You want to create the following set C:

To do so, you need to answer two questions:

How to identify e belonging to A?
How to identify e not belonging to B?

As in our case A is a file then e belongs to A if it is the result of reading the file. So with this we can answer the first question.

Now, how do we know e does not belong to B? The irrelevant transaction entails the following:

If B is a common set, it means that I will always need to compare with all of its elements to make sure that e does not belong to B. But we don’t need to stick to common sets, I can have ordered sets, hash maps or tree names, all these alternatives allow us to make optimizations in the amount of comparisons made. An efficient search data structure will reduce the number of operations performed.

For the scope of this answer, I’m not optimising the amount of comparisons. I’m also taking into account that the file that contains the set B is relatively small, with a few megabytes at most.

As the whole B is always consulted as a whole, and the whole A only has of importance the generation of the next element, in the initialization of my algorithm I will pre-load fully B. The initialization of the set A, in this case, it will be simply open the file.

The general idea is more or less the following:

inicializa conjunto A
inicializa conjunto B
faça:
    pegue _e_ o próximo elemento de A
    se _e_ não pertence a B:
        adiciona _e_ em C
enquanto não chegou ao fim do conjunto A

The initialization of A would only open the file. If we were to try to optimize comparisons between A and B, we could use some data structure in A as an ordered vector.

inicialização conjunto A:
    abre arquivo "listaCompleta"

The initialization of the set B here is its complete reading. As we do not know its total dimension a priori, we can use a chained list, whose node contains a structure nome and a pointer to the next element of the linked list:

inicialização conjunto B:
    nodo_lista *conjuntoB = NULL
    abre arquivo "listaPresenca"
    faça:
        nodo_lista *novoElemento = malloc(sizeof(nodo_lista))
        novoElemento->next = conjuntoB
        lê do arquivo "listaPresenca" no endereço &(novoElemento->valor)
        se a leitura deu certo:
            conjuntoB = novoElemento
    enquanto não der fim do arquivo "listaPresenca"
    fecha arquivo "listaPresenca"

Take the next element and just give one fread the way you did:

pegue _e_ o próximo elemento de A:
    nome _e_
    lê do arquivo "listaCompleta" no endereço &_e_

The relevance of e in B is being treated by observing the whole set B, then we need to do a full iteration:

_e_ pertence a B?
    nodo_lista *elementoB = conjuntoB
    enquanto elementoB != NULL:
        se elementoB->valor é igual a _e_:
            retorna "_e_ pertence a B"
        elementoB = elementoB->next
    retorna "_e_ não pertence a B"

The comparison between two elements of the type nome is comparing the field nomePessoa of the two objects using strcmp:

_a_ é igual a _b_?
    retorna strcmp(_a_.nomePessoa, _b_.nomePessoa) == 0

Add element e to the whole C can be put the new element in a list or write directly to the file with fwrite. If the alternative of putting yourself on a list is used, after completing the analysis of all A, it is necessary to write these items in the file.

Differences between our approaches

Basically, I’m filling in the set B in working memory, while you try to keep using external memory (such as HD). The problem in your approach is that, before reviewing again an item of the set A, would be necessary to reposition the reading of the file representing the set B for the start again. A simple fseek right after reading the first file, forcing the second file back to the beginning, would sometimes return to the beginning of the set iteration B.

As its alternative implies a quadratic amount of readings to the external memory, I did not think it was a practical solution. Thus, using a linear amount of readings to external memory, I soon fill the whole set B.