Python - DataFrame - Create a new DataFrame by comparing two other DataFrames

0

I have a question and would like your help. I have two DataFrames and need to check whether certain columns match between them; when they match, I need to store that record in another DataFrame. In other words, I need to create a new DataFrame from the comparison of the other two. The example uses df1 and df2, which must be compared on four criteria (gender, age, race and schooling, stored in the columns gender, age, raca and escolaridade), and df3 should hold the records for which the comparison was true. In the example below, df3 would be formed by the record at index 0 of df1 and the record at index 0 of df2, since they are equal on those criteria.

import pandas as pd

df1 = pd.DataFrame({"gender": ['f','m','f'],
                    "age": [11,22,39],
                    "raca": ['C','nC','nC'],
                    "escolaridade": ['F','S','M'],
                    "var1": ["yes", "yes", "no"],
                    "var2": ["no", "yes", "yes"],
                    "var3": ["no", "no", "no"],
                    "classe": ["no", "yes", "no"]})

df2 = pd.DataFrame({"gender": ['f','f','m'],
                    "age": [11,22,40],
                    "raca": ['C','C','nC'],
                    "escolaridade": ['F','M','M'],
                    "var1": ["yes", "yes", "no"],
                    "var2": ["no", "no", "yes"],
                    "var3": ["no", "no", "yes"],
                    "classe": ["yes", "yes", "no"]})


  • 3

    Do you have a sample of the DataFrames and a [mcve] of your attempt, even if it doesn't work? Your problem can probably be solved with DataFrame.merge(), but without seeing the logic of what you're doing it's hard to say.

  • Lili, since we don’t have the real case here, take a look at this link

1 answer

1

I believe the merge method will solve the problem:

import pandas as pd

df3 = pd.merge(df1, df2, on=['x', 'y', 'z', 'w'])

For more details see the documentation here

Note: any other column whose name appears in both DataFrames receives the suffix _x for the first DataFrame and _y for the second. Custom suffixes can be passed to pd.merge if necessary: pd.merge(df1, df2, on=[col1, col2, col3], suffixes=['_df1', '_df2'])

See the example below:

>>> import pandas as pd

>>> df1 = pd.DataFrame({"x": [1,2,3],
                        "y": [1,2,3], 
                        "z": [1,2,3], 
                        "w": [1,2,3], 
                        "outra":["primeiro", "df", "esquerda"]})

>>> df2 = pd.DataFrame({"x": [1,2,3], 
                        "y": [1,2,3], 
                        "z": [1,2,3], 
                        "w": [3,3,3], 
                        "outra":["segundo", "df", "direita"]})

>>> df1
   x  y  z  w     outra
0  1  1  1  1  primeiro
1  2  2  2  2        df
2  3  3  3  3  esquerda

>>> df2
   x  y  z  w    outra
0  1  1  1  3  segundo
1  2  2  2  3       df
2  3  3  3  3  direita

>>> df3 = pd.merge(df1, df2, on=["x", "y", "z", "w"], suffixes=["_df1", "_df2"])

>>> df3
   x  y  z  w outra_df1 outra_df2
0  3  3  3  3  esquerda   direita
>>>
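Applied to the sample data from the question, the same call might look like the sketch below. It assumes the four criteria are the columns gender, age, raca and escolaridade, and that an inner join (the default for pd.merge) is what is wanted; the name keys is only a helper introduced here.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"gender": ["f", "m", "f"],
                    "age": [11, 22, 39],
                    "raca": ["C", "nC", "nC"],
                    "escolaridade": ["F", "S", "M"],
                    "var1": ["yes", "yes", "no"],
                    "var2": ["no", "yes", "yes"],
                    "var3": ["no", "no", "no"],
                    "classe": ["no", "yes", "no"]})

df2 = pd.DataFrame({"gender": ["f", "f", "m"],
                    "age": [11, 22, 40],
                    "raca": ["C", "C", "nC"],
                    "escolaridade": ["F", "M", "M"],
                    "var1": ["yes", "yes", "no"],
                    "var2": ["no", "no", "yes"],
                    "var3": ["no", "no", "yes"],
                    "classe": ["yes", "yes", "no"]})

# Columns that must be equal in both DataFrames (helper name, not from the question).
keys = ["gender", "age", "raca", "escolaridade"]

# Inner join: keeps only the key combinations that exist in both DataFrames.
# Non-key columns present in both get the suffixes _df1 and _df2.
df3 = pd.merge(df1, df2, on=keys, suffixes=("_df1", "_df2"))

print(df3)
```

Only the record at index 0 of each DataFrame agrees on all four keys, so df3 ends up with a single row. If several records in df2 match one record in df1, merge returns one row per pairing.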

I hope it helps.

  • The problem is that I need to keep the identification variable, because it matters for my problem. Besides the comparison variables, each DataFrame has 51 more variables. And I need to "take" the records that meet certain conditions, for example, being of the same sex, the same age and sharing other characteristics. I then need to create a DataFrame with these records.

  • The key columns keep their original names. Remember that you can use dataframe.columns to see the column names of a DataFrame. Filtering can be done with dataframe[(CONDITION_HERE)]

  • 1

    If you give an example of the two dataframes it will be easier to help you with this.

  • I've been asking for this example since the first comment.

  • df1 has 200 records and 50 variables. df2 has 1700 records and 50 variables. I need to compare df1 and df2 and check which records (between df1 and df2) have the same gender, age, schooling and race. The records that share these characteristics must be stored in another DataFrame. However, I need to preserve all variables, that is, the 50 variables. Thank you very much for the support.

  • Just a sample of three rows from each DataFrame plus the structure. Otherwise there is no way to answer.

  • @Augustovasques, perhaps the explanation above makes it clearer. Thanks.

  • I understand that the data may be confidential. In that case, use what I posted in the answer as the basis for your specific case. Think of x as gender, y as age, z as schooling and w as race.
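On the concern raised in the comments about keeping the identification of each record and all of its variables, one possible approach is to turn each DataFrame's index into an ordinary column before merging, then use those identifiers to filter the original DataFrames with dataframe[(condition)]. This is only a sketch: it assumes the default integer index is the identifier, uses a trimmed copy of the question's sample data, and the names id_df1, id_df2, left, right and matches are hypothetical.

```python
import pandas as pd

# Trimmed copy of the question's sample data (key columns plus "classe").
df1 = pd.DataFrame({"gender": ["f", "m", "f"],
                    "age": [11, 22, 39],
                    "raca": ["C", "nC", "nC"],
                    "escolaridade": ["F", "S", "M"],
                    "classe": ["no", "yes", "no"]})

df2 = pd.DataFrame({"gender": ["f", "f", "m"],
                    "age": [11, 22, 40],
                    "raca": ["C", "C", "nC"],
                    "escolaridade": ["F", "M", "M"],
                    "classe": ["yes", "yes", "no"]})

keys = ["gender", "age", "raca", "escolaridade"]

# Move each default index into a regular column so it survives the merge.
# "id_df1" and "id_df2" are hypothetical names chosen for this sketch.
left = df1.reset_index().rename(columns={"index": "id_df1"})
right = df2.reset_index().rename(columns={"index": "id_df2"})

# One row per matching pair, carrying both identifiers and every other column.
matches = left.merge(right, on=keys, suffixes=("_df1", "_df2"))

# Boolean filters: the original rows (all columns, unchanged names)
# that have at least one match in the other DataFrame.
df1_matched = df1[df1.index.isin(matches["id_df1"])]
df2_matched = df2[df2.index.isin(matches["id_df2"])]

print(matches[["id_df1", "id_df2"]])
print(df1_matched)
```

The matches DataFrame records which record of df1 pairs with which record of df2, while df1_matched and df2_matched keep the original columns without suffixes. If the real data already has an identification column, it can be used directly and the reset_index step skipped.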
