Inner Join with two keys with python

Asked

Viewed 2,625 times

0

I have two tables, one for consultation and one for examinations, both with the date of execution and the code of the beneficiary. Want to make a third table with people who have had exams and consultations on the same date.

If it’s in SQL would do so:

 Select t1.cod_benf,
        t1.data_execucao,
        t1.cod_serviço,
        t1.valor_pago,
        t2.cod_serviço,
        t2.valor_pago
From base_consulta as t1 inner join base_exame as t2
      on (t1.cod_benf = t2.cod_benf 
      and t1.data_execucao = t2.data_execucao);

How do I do it in python?

BASES: inserir a descrição da imagem aqui

  • And you can’t just run this statement as a string?

  • Although your question is seemingly simple to solve, you should provide as much information as possible so that the community can reproduce your problem and validate the tests. Provide an input data sample (may be the result of df.head() of both Dfs) and how the data output should be based on that sample

  • I added a base image.

  • 1

    If you put some data or data files that whoever answers can copy and paste to solve the problem also makes it easier. How is the question obliges, after all, who will answer to create a data set to provide an answer.

2 answers

2


Using the data you provided and python3:

1) Building the tables:

#Constrói os dicionários com os dados fornecidos para as bases:

consulta = {'cod_benefic' : ['123456','635241','968574'],
'cod_servico' : ['101010','101010','101010'],
'dat_execucao' : ['02/01/2016','03/01/2016','04/01/2016'],
'valor' : [65,65,65]}

exame = {'cod_benefic' : ['123456','123456','635241','635241','968574'],
'cod_servico' : ['401231','401250','401261','401250','401261'],
'dat_execucao' : ['02/01/2016','02/01/2016','03/01/2016','03/01/2016','04/01/2016'],
'valor' : [18,10,5,10,5]}
#import o pacote pandas
import pandas as pd

# transforma o dicionario em pandas dataframe
base_consulta = pd.DataFrame(data=consulta)
base_exame = pd.DataFrame(data=exame)

#transforma a data do tipo string para datetime
base_consulta['dat_execucao'] = pd.to_datetime(base_consulta['dat_execucao'])
base_exame['dat_execucao'] = pd.to_datetime(base_exame['dat_execucao'])

# pode usar df.dtypes para verificar os tipos das variaveis do seu df(dataframe)
# df.head() para mostrar algumas linhas do seu df
#base_consulta
    cod_benefic cod_servico dat_execucao  valor
0      123456      101010   2016-02-01     65
1      635241      101010   2016-03-01     65
2      968574      101010   2016-04-01     65
#base_exame
  cod_benefic cod_servico dat_execucao  valor
0      123456      401231   2016-02-01     18
1      123456      401250   2016-02-01     10
2      635241      401261   2016-03-01      5
3      635241      401250   2016-03-01     10
4      968574      401261   2016-04-01      5

2) After the creation of the tables we go to what matters: make the tables Join.

  • To make the tables Join we will use the pandas package.
#Faz o join das tabelas
df3 = pd.merge(base_consulta,base_exame,how='inner', on=['cod_benefic','dat_execucao'],suffixes=('_t1', '_t2'))

#Mostra as primeiras linhas
df3.head()

  cod_benefic cod_servico_t1 dat_execucao  valor_t1 cod_servico_t2  valor_t2
0      123456         101010   2016-02-01        65         401231        18
1      123456         101010   2016-02-01        65         401250        10
2      635241         101010   2016-03-01        65         401261         5
3      635241         101010   2016-03-01        65         401250        10
4      968574         101010   2016-04-01        65         401261         5

For more information check out the documentation of pandas.(Link documentation Pandas)

0

Having the tables in Pandas dataframes, just use the Python index syntax - the [ ], placing inside the brackets a Boolean expression that can use one or more columns of any dataframe (at this point they are seen as 'series'), provided that the index of the series is the same (that is, the special column that works as index - may be a sequential numbering, but in that case it could be the "cod_benefic" code itself. ). The values in the initial dataframe for which the expression used are true are selected.

First a simpler example - a dataframe with a column whose values are simply a numerical sequence - I use the index syntax to select lines whose numbers are divisible by 3 (this is the rest of the division by 3 - %3- is equal to 0):

In [23]: df = pd.DataFrame({"numeros":range(0,20)},)                                                                  

In [24]: df[df["numeros"] % 3 == 0]                                                                                   
Out[24]: 
    numeros
0         0
3         3
6         6
9         9
12       12
15       15
18       18

Once we understand the selection mechanism, we will create two dataframes with two columns with values that in a dataframe are a subset of other and perform a Join. (Note that in this example, we have the same benefit codes, and dates that are equal in only 50% of the cases between the different tables)

Data two dataframes:

In [73]: df1                                                                                                          
Out[73]: 
   cod_benef       datas  codigo_1
0        272  2019-04-23       100
1        141  2019-04-12       101
2        104  2019-04-28       102
3        203  2019-04-14       103
4        143  2019-04-29       104
5        112  2019-04-29       105
6        259  2019-04-19       106
7        281  2019-04-17       107
8        180  2019-04-24       108
9        175  2019-04-22       109

In [74]: df2                                                                                                          
Out[74]: 
   cod_benef       datas  codigo_2
0        272  2019-05-01       200
1        141  2019-04-19       201
2        104  2019-04-28       202
3        203  2019-05-01       203
4        143  2019-04-16       204
5        112  2019-04-29       205
6        259  2019-04-23       206
7        281  2019-04-17       207
8        180  2019-04-23       208
9        175  2019-04-22       209

We specify the condition of Join as a series resulting from comparison operations and others between the columns of the two frames. Operators should be used & for and and | for or, and pairs of extra parentheses, since the precedence of those operators is greater than that of the comparators ==.

This same condition should be used in both dataframes from which we take values, so it is best to keep it in a separate variable:

In [75]: join_cond = (df1["cod_benef"] == df2["cod_benef"]) & (df1["datas"] == df2["datas"])     

And then just assemble the dataframe with the desired results with pandas.concat, using the above set of conditions to select the rows of each input Dataframe, and the index operator, after selecting the rows, to select the desired columns of each Dataframe in the final result:

In [76]: df3 = pd.concat((df1[join_cond], df2[join_cond][["codigo_2"]] ), axis=1)     

And voila:

Out[77]: 
   cod_benef       datas  codigo_1  codigo_2
2        104  2019-04-28       102       202
5        112  2019-04-29       105       205
7        281  2019-04-17       107       207
9        175  2019-04-22       109       209

Just note that this is not "The Python", but the way to do it with Pandas. If the data were in Python in another data structure, the form of selection would be different- the Pandas by its nature ends up displaying a own way of thinking and solving the problems of what happens in pure Python.

  • I think it is worth mentioning that this solution will only work if the 2 Dataframes have the same number of lines.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.