Having the tables in Pandas dataframes, just use the Python index syntax - the [ ]
, placing inside the brackets a Boolean expression that can use one or more columns of any dataframe (at this point they are seen as 'series'), provided that the index of the series is the same (that is, the special column that works as index - may be a sequential numbering, but in that case it could be the "cod_benefic" code itself. ). The values in the initial dataframe for which the expression used are true are selected.
First a simpler example - a dataframe with a column whose values are simply a numerical sequence - I use the index syntax to select lines whose numbers are divisible by 3 (this is the rest of the division by 3 - %3- is equal to 0):
In [23]: df = pd.DataFrame({"numeros":range(0,20)},)
In [24]: df[df["numeros"] % 3 == 0]
Out[24]:
numeros
0 0
3 3
6 6
9 9
12 12
15 15
18 18
Once we understand the selection mechanism, we will create two dataframes with two columns with values that in a dataframe are a subset of other and perform a Join. (Note that in this example, we have the same benefit codes, and dates that are equal in only 50% of the cases between the different tables)
Data two dataframes:
In [73]: df1
Out[73]:
cod_benef datas codigo_1
0 272 2019-04-23 100
1 141 2019-04-12 101
2 104 2019-04-28 102
3 203 2019-04-14 103
4 143 2019-04-29 104
5 112 2019-04-29 105
6 259 2019-04-19 106
7 281 2019-04-17 107
8 180 2019-04-24 108
9 175 2019-04-22 109
In [74]: df2
Out[74]:
cod_benef datas codigo_2
0 272 2019-05-01 200
1 141 2019-04-19 201
2 104 2019-04-28 202
3 203 2019-05-01 203
4 143 2019-04-16 204
5 112 2019-04-29 205
6 259 2019-04-23 206
7 281 2019-04-17 207
8 180 2019-04-23 208
9 175 2019-04-22 209
We specify the condition of Join as a series resulting from comparison operations and others between the columns of the two frames. Operators should be used &
for and
and |
for or
, and pairs of extra parentheses, since the
precedence of those operators is greater than that of the comparators ==
.
This same condition should be used in both dataframes from which we take values, so it is best to keep it in a separate variable:
In [75]: join_cond = (df1["cod_benef"] == df2["cod_benef"]) & (df1["datas"] == df2["datas"])
And then just assemble the dataframe with the desired results with pandas.concat
, using the above set of conditions to select the rows of each input Dataframe, and the index operator, after selecting the rows, to select the desired columns of each Dataframe in the final result:
In [76]: df3 = pd.concat((df1[join_cond], df2[join_cond][["codigo_2"]] ), axis=1)
And voila:
Out[77]:
cod_benef datas codigo_1 codigo_2
2 104 2019-04-28 102 202
5 112 2019-04-29 105 205
7 281 2019-04-17 107 207
9 175 2019-04-22 109 209
Just note that this is not "The Python", but the way to do it with Pandas. If the data were in Python in another data structure, the form of selection would be different- the Pandas by its nature ends up displaying a own way of thinking and solving the problems of what happens in pure Python.
And you can’t just run this statement as a string?
– Costamilam
Although your question is seemingly simple to solve, you should provide as much information as possible so that the community can reproduce your problem and validate the tests. Provide an input data sample (may be the result of
df.head()
of both Dfs) and how the data output should be based on that sample– Terry
I added a base image.
– Cosme Franco
If you put some data or data files that whoever answers can copy and paste to solve the problem also makes it easier. How is the question obliges, after all, who will answer to create a data set to provide an answer.
– jsbueno