How to merge these 2 dataframes and keep the latest timestamp for repeated records?

I have 2 dataframes that I would like to merge

DF1:

   MRN  Encounter ID First Name Last Name  Birth Date       Admission D/T  \
0    1          1234       John       Doe  01/02/1999  04/12/2002 5:00 PM   
1    2          2345     Joanne       Lee  04/19/2002  04/19/2002 7:22 PM   
2    3          3456  Annabelle     Jones  01/02/2001  04/21/2002 5:00 PM   

         Discharge D/T          Update D/T  
0  04/13/2002 10:00 PM  04/24/2002 6:00 AM  
1   04/20/2002 6:22 AM  04/24/2002 6:00 AM  
2   04/23/2002 2:53 AM  04/24/2002 6:00 AM 

DF2:

   MRN  Encounter ID First Name Last Name  Birth Date       Admission D/T  \
0   20           987      Jerry     Jones  01/02/1988  05/01/2002 2:00 PM   
1    2          2345     Cosmia       Lee  04/19/2002  04/19/2002 7:22 PM   
2    3          3456  Annabelle     Jones  01/02/2001  04/21/2002 5:00 PM   

        Discharge D/T          Update D/T  
0  05/02/2002 9:00 PM  05/17/2002 6:00 AM  
1  04/20/2002 6:22 AM  05/17/2002 6:00 AM  
2  04/23/2002 2:53 AM  05/17/2002 6:00 AM 

The two dataframes share some records, such as the row at index 2 in each:

   MRN  Encounter ID First Name Last Name  Birth Date       Admission D/T       Discharge D/T
2    3          3456  Annabelle     Jones  01/02/2001  04/21/2002 5:00 PM  04/23/2002 2:53 AM

where all values are equal except "Update D/T" (in df2 the value is more recent: 05/17/2002 6:00 AM).

Is it possible to merge the 2 dataframes and, for the repeated records, keep the most recent value of Update D/T?

1 answer

If the real data looks like this, the solution would be to carry out the following steps:

  1. Merge the two dataframes on the columns MRN, Encounter ID, First Name, Last Name and Birth Date:
df_merged = pandas.merge(df1, df2, on=["MRN", "Encounter ID", "First Name", "Last Name", "Birth Date"])

By default, pandas adds a suffix to every column that exists in both dataframes but is not part of on=. Here that affects Admission D/T, Discharge D/T and Update D/T; for example, Update D/T becomes 'Update D/T_x' for the df1 data and 'Update D/T_y' for the df2 data.

  2. Create the column Update D/T in df_merged containing the higher of 'Update D/T_x' and 'Update D/T_y'. Something like:
df_merged['Update D/T'] = np.where(df_merged['Update D/T_x'] > df_merged['Update D/T_y'], df_merged['Update D/T_x'], df_merged['Update D/T_y'])

Note: the condition passed to np.where can be any boolean array, including one returned by a function. For the comparison to be chronological, the columns must hold datetimes (for example after pandas.to_datetime), not strings.

  3. Delete the suffixed columns (drop returns a new dataframe, so assign the result):
df_merged = df_merged.drop(['Update D/T_x', 'Update D/T_y'], axis=1)

Admission D/T and Discharge D/T can be handled the same way, or added to on= if they always match.
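Putting the three steps together, here is a minimal sketch applied to Update D/T, the column the question asks about. The rows are copied from the question, but only the key columns and Update D/T are included to keep it short. The datetime format string is an assumption based on how the sample values are written.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (Admission/Discharge columns left out for brevity).
df1 = pd.DataFrame({
    "MRN": [1, 2, 3],
    "Encounter ID": [1234, 2345, 3456],
    "First Name": ["John", "Joanne", "Annabelle"],
    "Last Name": ["Doe", "Lee", "Jones"],
    "Birth Date": ["01/02/1999", "04/19/2002", "01/02/2001"],
    "Update D/T": ["04/24/2002 6:00 AM"] * 3,
})
df2 = pd.DataFrame({
    "MRN": [20, 2, 3],
    "Encounter ID": [987, 2345, 3456],
    "First Name": ["Jerry", "Cosmia", "Annabelle"],
    "Last Name": ["Jones", "Lee", "Jones"],
    "Birth Date": ["01/02/1988", "04/19/2002", "01/02/2001"],
    "Update D/T": ["05/17/2002 6:00 AM"] * 3,
})

# Parse the timestamps so that ">" compares dates, not strings.
# The format is assumed from the sample values (month/day/year, 12-hour clock).
fmt = "%m/%d/%Y %I:%M %p"
for df in (df1, df2):
    df["Update D/T"] = pd.to_datetime(df["Update D/T"], format=fmt)

# Step 1: inner merge on the identifying columns.
keys = ["MRN", "Encounter ID", "First Name", "Last Name", "Birth Date"]
df_merged = pd.merge(df1, df2, on=keys)

# Step 2: keep the later of the two suffixed timestamps.
df_merged["Update D/T"] = np.where(
    df_merged["Update D/T_x"] > df_merged["Update D/T_y"],
    df_merged["Update D/T_x"],
    df_merged["Update D/T_y"],
)

# Step 3: drop the suffixed helper columns.
df_merged = df_merged.drop(["Update D/T_x", "Update D/T_y"], axis=1)

print(df_merged)
```

Keep in mind that pandas.merge defaults to an inner join, so only records present in both dataframes survive; with this sample that is just the Annabelle Jones row, because MRN 2 has a different First Name in each dataframe. Passing how="outer" would keep the unmatched rows, but the side that is missing then holds NaT and would need handling before the comparison.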

I hope it helps
