Find equal records in two databases(csv)

Asked

Viewed 399 times

0

In two files .csv, how can I know whether or not a particular user is in the files ?

These files have a different number of columns, but they have in common some fields (columns), for example:

NOME, CNS, CPF, PIS, SEXO, NOME DA MAE, NOME DO PAI ...

The problem is that there are null fields in these columns. Therefore, there is the problem of knowing if the user exists or not in the databases. In short I have two systems, X and Y, how to know if a certain user belongs to the two systems ?

import csv 
arquivo  = ['cidade_social.csv' , 'gestantes_prenatal.csv']
f1 = open(arquivo[0], 'r').readlines()
f2 = open(arquivo[1], 'r').readlines()
fNome = open('saida.csv', 'a')
for _ in range(2):
for row in f1:
    if row not in f2:
        fNome.write(row)
f1, f2 = f2, f1

I tried to that, but I think this is more for reading lines and not comparing fields.

  • 1

    I believe you are looking to mess with data science. It gives a research on the Pandas. Eh a library (powerful mt) specific for analysis and reading tables, columns.. I believe you will find the solution with this.

  • In this case, you would use pandas to create two tables (one for each file), from this you would stir as needed: if you want to take the blank gaps (null or n/a), look for some word or number inside the file... anyway. Da a researched about

  • I tried this: import pandas as pd dataset = pd.read_csv('citade_social.csv') print(dataset.head()) print(dataset['NAME']) print(dataset.SEX) print(dataset[['NAME','SEX','CNS']]) print(dataset.describe() Tranquil, I can do this for the other csv, but then, how to compare them and write in a new dataframe, the corresponding?

  • It’s like you take the values of a column and read them.. if Voce wants to see if it has repeated name between two tables (probably in the same column - ex: name and name (duplicated names) the easiest way is to create a new dataframe with two columns, a column will receive the names of the two files (column merge into one) and also put the respective indises (where they are located in the table), thus vc uses: df["column"]. value_counts() (where: df is your created dataframe and "column" eh the column with all names) and it will display all names with a num (one of the times it appears)

  • I could tell?

1 answer

1


Come on,

I will assume the premise that in both tables there is at least one common field that is always filled, in my example I will say that this field is the NAME

So assuming the files are:

CSV 1

inserir a descrição da imagem aqui

CSV2

CSV2

We will use pandas to make the comparison.

1 - Import the library

import pandas as pd

2 - Let’s load the files:

df1 = pd.read_csv('C:/TESTES/CSV1.csv', delimiter=';')
df2 = pd.read_csv('C:/TESTES/CSV2.csv', delimiter=';')

From here there are several paths, I’ll talk about two that may or may not fit what you need.

...after uploading the files..

path 1

Concatenate the name column between the two dataframes, then display the duplicates with the function duplicated()

df_aux = pd.concat([df1['NOME'],df2['NOME']])
df_aux[df_aux.duplicated()]

---------Upshot----------

inserir a descrição da imagem aqui

path 2

Here we add a column to dataframe 1 that says in 0 or 1 whether there is a duplicate value in df2.

df1 = df1.assign(Existe_DF2=df1.NOME.isin(df2.NOME).astype(int))

---------Upshot----------

inserir a descrição da imagem aqui

  • 1

    Perfect! That’s right! Thank you so much! Show!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.