Python - Count the number of incidences of an event in a time window

I have a table with basically names, dates and groups like:

Index  Name    Date        Group
1      Joseph  01/01/2020  A
2      Joseph  01/01/2020  B
3      Joseph  03/02/2020  A
4      Joseph  01/03/2020  A
5      Joseph  01/05/2020  A
6      Maria   02/02/2020  B

I want to add two columns to this table: one counting how many times that name appeared in the previous 3 months (90 days), and another counting how many times that name appeared in group A in the same period, in both cases not counting the row being analyzed. That is:

Index  Name    Date        Group  90 days any event  90 days event A
1      Joseph  01/01/2020  A      0                  0
2      Joseph  01/01/2020  B      1                  1
3      Joseph  03/02/2020  A      2                  1
4      Joseph  01/03/2020  A      3                  2
5      Joseph  01/05/2020  A      2                  2
6      Maria   02/02/2020  B      0                  0

Anybody got any ideas? I tried using groupby a few times, and something strange happened. Example:

ds2 = ds1.groupby('Nome')

And the result generated was: [image]

1 answer

The description below shows how to do this for any group. Let's take it step by step:

Preparing the base

Importing libraries

import pandas as pd
import random

Creating the Dataframe

dias = 365

dti = pd.date_range("2019-01-01", periods=dias, freq="D")

names = ["José", "Maria", "João", "Teresa"]

df = pd.DataFrame({"nome": [random.choice(names) for _ in range(dias)], "data": dti})

Data in df

       nome       data
0      José 2019-01-01
1      José 2019-01-02
2    Teresa 2019-01-03
3      José 2019-01-04
4      José 2019-01-05
..      ...        ...
360    José 2019-12-27
361    José 2019-12-28
362    João 2019-12-29
363  Teresa 2019-12-30
364    José 2019-12-31

[365 rows x 2 columns]

Creating column to aid counting

df["dummy"] = 1

Counting

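def f(x, t):
    # For each row, count this person's rows dated strictly between (date - t) and date.
    # inclusive="neither" replaces inclusive=False; boolean values are deprecated since pandas 1.3.
    return x.apply(lambda y: x.loc[x['data'].between(y['data'] - t, y['data'], inclusive="neither"), 'dummy'].sum(), axis=1)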


df['30 dias'] = df.groupby('nome', group_keys=False).apply(f, pd.Timedelta(30, unit='d'))
df['90 dias'] = df.groupby('nome', group_keys=False).apply(f, pd.Timedelta(90, unit='d'))

The DataFrame will look something like:

       nome       data  dummy  30 dias  90 dias
0      José 2019-01-01      1        0        0
1      José 2019-01-02      1        1        1
2    Teresa 2019-01-03      1        0        0
3      José 2019-01-04      1        2        2
4      José 2019-01-05      1        3        3
..      ...        ...    ...      ...      ...
360    José 2019-12-27      1        6       24
361    José 2019-12-28      1        7       25
362    João 2019-12-29      1        8       21
363  Teresa 2019-12-30      1        7       22
364    José 2019-12-31      1        8       25

[365 rows x 5 columns]
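
To see exactly which rows the window counts, here is a minimal sketch on a small fixed table instead of the random one. The name count_in_window and the demo data are made up for illustration, and it assumes pandas 1.3 or later (for inclusive="neither"). Because both ends of the window are excluded, a row dated exactly 30 days earlier and other rows on the same day are not counted.

```python
import pandas as pd

def count_in_window(x, t):
    # Same logic as f above: sum the dummy column over rows whose date lies
    # strictly between (current date - t) and the current date.
    return x.apply(
        lambda y: x.loc[
            x["data"].between(y["data"] - t, y["data"], inclusive="neither"),
            "dummy",
        ].sum(),
        axis=1,
    )

# Hypothetical demo data: one person, four events, two of them on the same day.
demo = pd.DataFrame({
    "nome": ["Teresa"] * 4,
    "data": pd.to_datetime(["2019-01-01", "2019-01-15", "2019-01-31", "2019-01-31"]),
    "dummy": 1,
})

counts = count_in_window(demo, pd.Timedelta(30, unit="d"))
print(counts.tolist())  # [0, 1, 1, 1]
```

For both rows on 2019-01-31, only 2019-01-15 is counted: 2019-01-01 sits exactly on the excluded lower bound, and the other row on 2019-01-31 sits on the excluded upper bound.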

Delete the dummy column (optional)

del df["dummy"]

Checking the result

for index, row in df.iterrows():
    if row['nome'] == 'Teresa':
        print(f'{row["data"]} => {row["30 dias"]} => {row["90 dias"]}')

The output will be something like:

2019-01-03 00:00:00 => 0 => 0
2019-01-14 00:00:00 => 1 => 1
2019-01-21 00:00:00 => 2 => 2
2019-01-24 00:00:00 => 3 => 3
2019-01-25 00:00:00 => 4 => 4
2019-01-31 00:00:00 => 5 => 5
2019-02-01 00:00:00 => 6 => 6
2019-02-03 00:00:00 => 6 => 7    <----- MONTH CHANGE
2019-02-04 00:00:00 => 7 => 8
2019-02-16 00:00:00 => 7 => 9
2019-03-01 00:00:00 => 5 => 10   <----- MONTH CHANGE
(...)
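
The same check can also be done without an explicit loop, by filtering with a boolean mask. This is only a sketch: the small df below is a hypothetical stand-in for the real one, with the counts already filled in.

```python
import pandas as pd

# Hypothetical stand-in for the df built above, counts already computed.
df = pd.DataFrame({
    "nome": ["José", "Teresa", "Teresa"],
    "data": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-14"]),
    "30 dias": [0, 0, 1],
    "90 dias": [0, 0, 1],
})

# Keep only Teresa's rows and the columns of interest.
teresa = df.loc[df["nome"] == "Teresa", ["data", "30 dias", "90 dias"]]
print(teresa.to_string(index=False))
```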

Update

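If you also have a column with the group (named grupo here), add it to the filter. This still uses the dummy column, so run it before deleting it:

>>> def f(x, t, g):
...     # Test the group of the rows being counted (x), not the group of the current row (y).
...     return x.apply(lambda y: x.loc[(x['data'].between(y['data'] - t, y['data'], inclusive="neither")) & (x['grupo'] == g), 'dummy'].sum(), axis=1)
...
>>> df['90 dias A'] = df.groupby('nome', group_keys=False).apply(f, pd.Timedelta(90, unit='d'), "A")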
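
For the sample table in the question, note that row 2 is expected to count row 1, which has the same date. A window that excludes both ends, as above, would not count it. The sketch below is one possible way to reproduce the expected table, under these assumptions: dates are day/month/year, rows are already sorted by date within each name, and "previous" means rows that appear earlier in the table and are at most 90 days older. The helper name count_previous is made up for illustration.

```python
import pandas as pd

# The question's sample table; dates are assumed to be day/month/year.
df = pd.DataFrame(
    {
        "Name": ["Joseph", "Joseph", "Joseph", "Joseph", "Joseph", "Maria"],
        "Date": ["01/01/2020", "01/01/2020", "03/02/2020",
                 "01/03/2020", "01/05/2020", "02/02/2020"],
        "Group": ["A", "B", "A", "A", "A", "B"],
    },
    index=range(1, 7),
)
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

def count_previous(rows, window, group=None):
    """Hypothetical helper: per row, count earlier rows (table order) within `window`."""
    counts = []
    for pos in range(len(rows)):
        earlier = rows.iloc[:pos]  # rows above the current one
        keep = earlier["Date"] >= rows["Date"].iloc[pos] - window
        if group is not None:
            keep &= earlier["Group"] == group  # only count rows of that group
        counts.append(int(keep.sum()))
    return pd.Series(counts, index=rows.index)

window = pd.Timedelta(90, unit="d")
by_name = [rows for _, rows in df.groupby("Name")]

# Assigning a Series aligns on the index, so row order in df is preserved.
df["90 days any event"] = pd.concat([count_previous(r, window) for r in by_name])
df["90 days event A"] = pd.concat([count_previous(r, window, "A") for r in by_name])
print(df)
```

The Python loop keeps the rule easy to read; on a large table it would be slow, and a different approach would be needed.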

I hope it helps
