Another way of doing
Loading library and creating test dataframe
import pandas as pd
df = pd.DataFrame({"cpf": [1,2,3,2,3,3,4,1,1,4], "data":["01/01/2021", "01/02/2021", "01/03/2021", "01/06/2021", "01/04/2021", "01/08/2021", "01/02/2021", "01/03/2021", "01/03/2021", "01/07/2021"]})
df
cpf data
0 1 01/01/2021
1 2 01/02/2021
2 3 01/03/2021
3 2 01/06/2021
4 3 01/04/2021
5 3 01/08/2021
6 4 01/02/2021
7 1 01/03/2021
8 1 01/03/2021
9 4 01/07/2021
Dates are as string as string
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cpf 10 non-null int64
1 data 10 non-null object
Converting string dates to type datetime
df["data"] = pd.to_datetime(df["data"], format='%d/%m/%Y')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cpf 10 non-null int64
1 data 10 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 288.0 bytes
Function to calculate date difference
def first_last(df):
return df["data"].iloc[-1] - df["data"].iloc[0]
Generating date differences
new_column = df.sort_values(by=["cpf", "data"]).groupby("cpf").apply(first_last)
df_diferenca = new_column.to_frame().rename(columns={0: "dif"})
Generating Dataframe with Differences
df_final = pd.merge(df, df_diferenca, on="cpf")
The result will be
cpf data dif
0 1 2021-01-01 59 days
1 1 2021-03-01 59 days
2 1 2021-03-01 59 days
3 2 2021-02-01 120 days
4 2 2021-06-01 120 days
5 3 2021-03-01 153 days
6 3 2021-04-01 153 days
7 3 2021-08-01 153 days
8 4 2021-02-01 150 days
9 4 2021-07-01 150 days
In case you want to delete lines whose difference is greater than 90 days
df_final = df_final[df_final['dif'] <= pd.Timedelta(90, 'D')]
The result by then will be:
df_final
cpf data dif
0 1 2021-01-01 59 days
1 1 2021-03-01 59 days
2 1 2021-03-01 59 days
To delete the duplicates
# Mantendo o último
df_final = df_final.drop_duplicates(subset=["cpf"], keep='last')
cpf data dif
2 1 2021-03-01 59 days
# Mantendo o primeiro
df_final = df_final.drop_duplicates(subset=["cpf"], keep='first')
cpf data dif
0 1 2021-01-01 59 days
Note, make sure the dataframe is ordered.
Finally, delete the difference column
del df_final['dif']
Dude, you saved the day, it worked right here. agr I’ll just find a way to concatenate the equal numbers. vlw
– Gabriel Schumacher
@Gabrielschumacher Good! If my answer satisfied your question, consider accept it. This can help other users find the answer with the same problem :)
– Terry