Compare lines Dataframa Pandas

Asked

Viewed 41 times

0

import pandas as pd
from IPython.core.display import display

df = pd.read_csv('plan.csv', sep=';') ## le a planilha

display(df)

df = df.sort_values('cpf') ## organiza a planilha de ordem de cpf

display(df)

df_2 = pd.concat(g for _, g in df.groupby('cpf') if len(g)>1)

display(df_2)

The above code takes a csv spreadsheet (with 2 columns: 'Cpf' and 'data') and puts it in a pandas DF, then it sorts according to 'Cpf' and at the end creates another dataframe only with the cpfs that repeat in the list. In the end, I have something like this:
inserir a descrição da imagem aqui

What I need now is a way to take the equal numbers and compare the dates to see if they have a difference > 90 days and add that information in a new column. In the case of 'Cpf' 9, it repeats 3 times, so I would need to compare the dates '01/01/2021, 01/04/2021, 01/05/2021' and, in a new column, write if there is more than 90 days difference between the first date and the last. I’m working on this simplified code, but the original has over 20,000 lines.

2 answers

0

I do not know if I understood correctly what is the expected output, but I believe that to calculate the difference of the smallest date in each number with the other dates, you will need to use the groupby with transform to identify the first/smallest date in each Cpf, and then subtract the dates within the DF with this group, thus:

df['data'] = pd.to_datetime(df['data'], dayfirst=True)

g = df.groupby('cpf')['data'].transform('min')
df['diff > 90'] = (df['data'] - g).dt.days > 90

df.head(10)
#saida:
    cpf data        diff > 90
0   1   2021-02-01  False
1   1   2021-02-01  False
2   6   2021-01-01  False
3   6   2021-02-01  False
4   9   2021-01-01  False
5   9   2021-04-01  False
6   9   2021-05-01  True

Step by step

The return of groupby got the shortest date for each CPF

df.groupby('cpf')['data'].transform('min')
#saida:

0   2021-02-01
1   2021-02-01
2   2021-01-01
3   2021-01-01
4   2021-01-01
5   2021-01-01
6   2021-01-01
Name: data, dtype: datetime64[ns]

And to find the difference in dates in days, the following was used:

(df['data'] - g).dt.days
#saida: 

0      0
1      0
2      0
3     31
4      0
5     90
6    120
Name: data, dtype: int64

As you search for differences greater than 90 days, is added at the end of the one commands > 90 that will return a Boolean array that is assigned to the new DF column

(df['data'] - g).dt.days > 90
#saida:

0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: data, dtype: bool
  • 1

    Dude, you saved the day, it worked right here. agr I’ll just find a way to concatenate the equal numbers. vlw

  • @Gabrielschumacher Good! If my answer satisfied your question, consider accept it. This can help other users find the answer with the same problem :)

0


Another way of doing

Loading library and creating test dataframe

import pandas as pd

df = pd.DataFrame({"cpf": [1,2,3,2,3,3,4,1,1,4], "data":["01/01/2021", "01/02/2021", "01/03/2021", "01/06/2021", "01/04/2021", "01/08/2021", "01/02/2021", "01/03/2021", "01/03/2021", "01/07/2021"]})

df
   cpf        data
0    1  01/01/2021
1    2  01/02/2021
2    3  01/03/2021
3    2  01/06/2021
4    3  01/04/2021
5    3  01/08/2021
6    4  01/02/2021
7    1  01/03/2021
8    1  01/03/2021
9    4  01/07/2021

Dates are as string as string

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cpf     10 non-null     int64
 1   data    10 non-null     object

Converting string dates to type datetime

df["data"] = pd.to_datetime(df["data"], format='%d/%m/%Y')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cpf     10 non-null     int64
 1   data    10 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 288.0 bytes

Function to calculate date difference

def first_last(df):
    return df["data"].iloc[-1] - df["data"].iloc[0]

Generating date differences

new_column = df.sort_values(by=["cpf", "data"]).groupby("cpf").apply(first_last)

df_diferenca = new_column.to_frame().rename(columns={0: "dif"})

Generating Dataframe with Differences

df_final = pd.merge(df, df_diferenca, on="cpf")

The result will be

   cpf       data      dif
0    1 2021-01-01  59 days
1    1 2021-03-01  59 days
2    1 2021-03-01  59 days
3    2 2021-02-01 120 days
4    2 2021-06-01 120 days
5    3 2021-03-01 153 days
6    3 2021-04-01 153 days
7    3 2021-08-01 153 days
8    4 2021-02-01 150 days
9    4 2021-07-01 150 days

In case you want to delete lines whose difference is greater than 90 days

df_final = df_final[df_final['dif'] <= pd.Timedelta(90, 'D')]

The result by then will be:

df_final

   cpf       data     dif
0    1 2021-01-01 59 days
1    1 2021-03-01 59 days
2    1 2021-03-01 59 days

To delete the duplicates

# Mantendo o último
df_final = df_final.drop_duplicates(subset=["cpf"], keep='last')

   cpf       data     dif
2    1 2021-03-01 59 days


# Mantendo o primeiro
df_final = df_final.drop_duplicates(subset=["cpf"], keep='first')

   cpf       data     dif
0    1 2021-01-01 59 days

Note, make sure the dataframe is ordered.

Finally, delete the difference column

del df_final['dif']

Browser other questions tagged

You are not signed in. Login or sign up in order to post.