Filter a pandas DataFrame by a specific column and the last four dates of a set of dates


I have two DataFrames in pandas:

  • DataFrame A - contains the available advertising time for each program
DFTmpDisp

      sinal   cod_programa      data              tempo_disponivel
0      SP1         XXXX         25/07/2020         10
1      BH          XXXX         01/08/2020        300
2      NAC         XXXX         08/08/2020        118
3      SP1         XXXX         15/08/2020        106
4      SP1         XXXX         22/08/2020          6
2287    BH         ZZZZ         24/08/2020        218
2288    BH         ZZZZ         25/08/2020         66
2289    BH         ZZZZ         26/08/2020         44
2290    BH         ZZZZ         27/08/2020        190
2291    BH         ZZZZ         28/08/2020        202

For DataFrame A, I converted the date to the day of the week, which gives:

DFTmpDisp['data'] = pd.to_datetime(DFTmpDisp['data'], format='%d/%m/%Y').dt.dayofweek

 sinal   cod_programa      data              tempo_disponivel
0      SP1         XXXX           5                10
1      BH          XXXX           5               300
2      NAC         XXXX           5               118
3      SP1         XXXX           5               106
4      SP1         XXXX           5                 6
2287    BH         ZZZZ           0               218
2288    BH         ZZZZ           1                66
2289    BH         ZZZZ           2                44
2290    BH         ZZZZ           3               190
2291    BH         ZZZZ           4               202
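As a quick sanity check (a minimal sketch, not part of the original question), `dt.dayofweek` maps Monday to 0 and Sunday to 6, so the Saturday dates above become 5:

```python
import pandas as pd

# dt.dayofweek maps Monday=0 ... Sunday=6:
# 25/07/2020 is a Saturday (5), 24/08/2020 is a Monday (0).
s = pd.to_datetime(pd.Series(["25/07/2020", "24/08/2020"]), format="%d/%m/%Y")
print(s.dt.dayofweek.tolist())  # [5, 0]
```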

  • DataFrame B - audience information for programs that have already aired


DFTmpAud
     sinal cod_programa data_exibicao data_inicio_programa  media_audiencia
2278   SP1         XXXX    23/05/2020  02/05/2020 19:01:00        325104.42
2280   SP1         XXXX    09/05/2020  02/05/2020 19:01:00        255104.42
2275    BH         XXXX    02/05/2020  02/05/2020 19:01:00        455104.42
6220   SP1         XXXX    16/05/2020  02/05/2020 19:05:00        118375.44
3345    NAC        XXXX    16/05/2020  02/05/2020 19:05:00       1053999.80
2296   SP1         XXXX    30/05/2020  02/05/2020 19:05:00       1299650.50
8873   SP1         XXXX    02/05/2020  02/05/2020 19:05:00      14091402.82
276     BH         ZZZZ    01/05/2020  01/05/2020 11:30:00        150998.68
8536   NAC         ZZZZ    01/05/2020  01/05/2020 11:30:00       7846891.62
5871   SP1         ZZZZ    01/05/2020  01/05/2020 11:30:00        841810.51
6329    DF         ZZZZ    01/05/2020  01/05/2020 11:30:00        123071.38
237     RJ         ZZZZ    01/05/2020  01/05/2020 11:30:00        581696.70

For DataFrame B, I sorted by display date in ascending order:

DFTmpAud = DFTmpAud.sort_values('data_exibicao') 

I converted the date columns to datetime; they were originally in year-month-day format:

DFTmpAud['data_exibicao'] = pd.to_datetime(DFTmpAud['data_exibicao'], format='%Y-%m-%d')

DFTmpAud['data_inicio_programa'] =  pd.to_datetime(DFTmpAud['data_inicio_programa'], format='%Y-%m-%d %H:%M:%S') 

Problem: I need to calculate the median audience of the last four airings of each program, grouped by sinal, cod_programa, and day of the week. The data for the calculation is in DataFrame B (DFTmpAud). The result should go into a new column in DataFrame A, holding the value for each day. Since I only need the last 4 dates, I thought of using groupby, grouping by sinal, program, and date, but I could not find a way to keep only the last 4 dates. The filter would therefore be sinal, program, and the 4 most recent display dates.

Observation, to calculate the value for SP1: filter by sinal = SP1, cod_programa = XXXX, day 5 (Saturday) -> all display dates fall on a Saturday, which is day 5. Sort from the most recent date to the oldest.

With the data from DataFrame B, 5 records should be found for day 5:

2296 SP1 XXXX 30/05/2020 02/05/2020 19:05:00  1299650.50
2278 SP1 XXXX 23/05/2020 02/05/2020 19:01:00   325104.42
6220 SP1 XXXX 16/05/2020 02/05/2020 19:05:00   118375.44
2280 SP1 XXXX 09/05/2020 02/05/2020 19:01:00   255104.42
8873 SP1 XXXX 02/05/2020 02/05/2020 19:05:00 14091402.82

Take the 4 most recent dates (from highest to lowest) from the result above.

The result should be:

     sinal cod_programa data_exibicao data_inicio_programa  media_audiencia
2296   SP1         XXXX    30/05/2020  02/05/2020 19:05:00       1299650.50
2278   SP1         XXXX    23/05/2020  02/05/2020 19:01:00        325104.42
6220   SP1         XXXX    16/05/2020  02/05/2020 19:05:00        118375.44
2280   SP1         XXXX    09/05/2020  02/05/2020 19:01:00        255104.42

Add the values in media_audiencia, which gives 1998234.78, and divide by 4 (since there are 4 rows): 1998234.78 / 4 = 499558.695.

For sinal = BH, cod_programa = XXXX, day 5 (Saturday), only 1 record should be found, so the value is simply the media_audiencia for that sinal and program on day 5.

2275 BH XXXX 02/05/2020 02/05/2020 19:01:00 455104.42

Expected result DataFrame:

      sinal cod_programa  data  tempo_disponivel     mediana
0       SP1         XXXX     5                10  499558.695
2275     BH         XXXX     5               300   455104.42
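The computation described above can be sketched as follows (an illustration using the question's column names and a small subset of the data, not a definitive implementation): sort by date, keep the last four rows per (sinal, cod_programa, dia_semana) group, and average media_audiencia.

```python
import pandas as pd

# Small subset of DataFrame B from the question.
dfB = pd.DataFrame({
    "sinal": ["SP1"] * 5 + ["BH"],
    "cod_programa": ["XXXX"] * 6,
    "data_exibicao": ["30/05/2020", "23/05/2020", "16/05/2020",
                      "09/05/2020", "02/05/2020", "02/05/2020"],
    "media_audiencia": [1299650.50, 325104.42, 118375.44,
                        255104.42, 14091402.82, 455104.42],
})
dfB["data_exibicao"] = pd.to_datetime(dfB["data_exibicao"], format="%d/%m/%Y")
dfB["dia_semana"] = dfB["data_exibicao"].dt.dayofweek

# Keep the 4 most recent dates per (sinal, cod_programa, dia_semana),
# then average media_audiencia within each group.
last4 = (dfB.sort_values("data_exibicao")
            .groupby(["sinal", "cod_programa", "dia_semana"])
            .tail(4))
medias = (last4.groupby(["sinal", "cod_programa", "dia_semana"])["media_audiencia"]
               .mean()
               .reset_index(name="mediana"))
print(medias)  # SP1/XXXX/5 -> 499558.695, BH/XXXX/5 -> 455104.42
```

The resulting `medias` frame could then be merged into DFTmpDisp on `['sinal', 'cod_programa', 'dia_semana']` to produce the new column.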

  • I will read your post calmly later, but since it involves filtering by a list of dates, see if that answer helps.

  • Based on the two DataFrame samples you posted, what would be the expected output? If you want to calculate the average of the last 4 events of each program, it would be worth posting a sample larger than 5 rows so that your exact situation can be reproduced.

  • Hi Terry, I edited the question. I think it is clearer now.

  • Hi Paulo, I edited the question with the result I expect. I think it is clearer now. Thank you.

  • Hi. If any of the answers helped or solved your problem, please consider accepting it as correct.

2 answers


UPDATED FROM THE COMMENTS

I hope I understand the problem.

Let’s cut to the chase:

Creating DataFrame A

>>> import pandas as pd

>>> dfA = pd.DataFrame({"sinal": ["SP1", "BH", "NAC", "SP1", "SP1", "BH", "BH", "BH", "BH", "BH"],
                   "cod_programa": ["XXXXX", "XXXXX", "XXXXX", "XXXXX", "XXXXX", "ZZZZZ", "ZZZZZ", "ZZZZZ", "ZZZZZ", "ZZZZZ"],
                   "data": ["23/05/2020", "01/05/2020", "02/05/2020", "01/05/2020", "02/05/2020", "02/05/2020", "03/05/2020", "01/05/2020", "02/05/2020", "03/05/2020"],
                   "tempo_disponivel": [10, 300, 118, 106, 6, 218, 66, 44, 190, 202]})

>>> dfA
  sinal cod_programa        data  tempo_disponivel
0   SP1        XXXXX  23/05/2020                10
1    BH        XXXXX  01/05/2020               300
2   NAC        XXXXX  02/05/2020               118
3   SP1        XXXXX  01/05/2020               106
4   SP1        XXXXX  02/05/2020                 6
5    BH        ZZZZZ  02/05/2020               218
6    BH        ZZZZZ  03/05/2020                66
7    BH        ZZZZZ  01/05/2020                44
8    BH        ZZZZZ  02/05/2020               190
9    BH        ZZZZZ  03/05/2020               202

Converting the date column from string to datetime

>>> dfA["data"] = pd.to_datetime(dfA['data'], format="%d/%m/%Y")

>>> dfA
  sinal cod_programa       data  tempo_disponivel
0   SP1        XXXXX 2020-05-23                10
1    BH        XXXXX 2020-05-01               300
2   NAC        XXXXX 2020-05-02               118
3   SP1        XXXXX 2020-05-01               106
4   SP1        XXXXX 2020-05-02                 6
5    BH        ZZZZZ 2020-05-02               218
6    BH        ZZZZZ 2020-05-03                66
7    BH        ZZZZZ 2020-05-01                44
8    BH        ZZZZZ 2020-05-02               190
9    BH        ZZZZZ 2020-05-03               202

Creating DataFrame B

>>> dfB = pd.DataFrame({"sinal": ["SP1", "SP1", "BH", "SP1", "NAC", "SP1", "SP1", "BH", "NAC", "SP1", "DF", "RJ"],
...                     "cod_programa": ["XXXX", "XXXX", "XXXX", "XXXX", "XXXX", "XXXX", "XXXX", "ZZZZ", "ZZZZ", "ZZZZ", "ZZZZ", "ZZZZ"],
...                     "data_exibicao": ["23/05/2020", "09/05/2020", "02/05/2020", "16/05/2020", "16/05/2020", "30/05/2020", "02/05/2020", "01/05/2020", "01/05/2020", "01/05/2020", "01/05/2020", "01/05/2020"],
...                     "data_inicio_programa": ["02/05/2020 19:01:00", "02/05/2020 19:01:00", "02/05/2020 19:01:00", "02/05/2020 19:05:00", "02/05/2020 19:05:00", "02/05/2020 19:05:00", "02/05/2020 19:05:00", "01/05/2020 11:30:00", "01/05/2020 11:30:00", "01/05/2020 11:30:00", "01/05/2020 11:30:00", "01/05/2020 11:30:00"],
...                     "media_audiencia": [325104.42, 255104.42, 455104.42, 118375.44, 1053999.80, 1299650.50, 14091402.82, 150998.68, 7846891.62, 841810.51, 123071.38, 581696.70]})

Converting the dates

>>> dfB["data_exibicao"] = pd.to_datetime(dfB['data_exibicao'], format="%d/%m/%Y")

>>> dfB["data_inicio_programa"] = pd.to_datetime(dfB['data_inicio_programa'], format="%d/%m/%Y %H:%M:%S")

>>> dfB
   sinal cod_programa data_exibicao data_inicio_programa  media_audiencia
0    SP1         XXXX    2020-05-23  2020-05-02 19:01:00        325104.42
1    SP1         XXXX    2020-05-09  2020-05-02 19:01:00        255104.42
2     BH         XXXX    2020-05-02  2020-05-02 19:01:00        455104.42
3    SP1         XXXX    2020-05-16  2020-05-02 19:05:00        118375.44
4    NAC         XXXX    2020-05-16  2020-05-02 19:05:00       1053999.80
5    SP1         XXXX    2020-05-30  2020-05-02 19:05:00       1299650.50
6    SP1         XXXX    2020-05-02  2020-05-02 19:05:00      14091402.82
7     BH         ZZZZ    2020-05-01  2020-05-01 11:30:00        150998.68
8    NAC         ZZZZ    2020-05-01  2020-05-01 11:30:00       7846891.62
9    SP1         ZZZZ    2020-05-01  2020-05-01 11:30:00        841810.51
10    DF         ZZZZ    2020-05-01  2020-05-01 11:30:00        123071.38
11    RJ         ZZZZ    2020-05-01  2020-05-01 11:30:00        581696.70

Collecting the existing signals

>>> sinais = list(set(dfB["sinal"]))

>>> sinais
['NAC', 'BH', 'DF', 'RJ', 'SP1']

Picking the most recent dates per signal

>>> datas = []
>>> for sinal in sinais:
...     dts = dfA[(dfA["sinal"] == sinal)].nlargest(4, "data")
...     tmp = [row["data"] for _, row in dts.iterrows()]
...     datas.append((sinal, tmp))
...

>>> datas
[('NAC', [Timestamp('2020-05-02 00:00:00')]), ('BH', [Timestamp('2020-05-03 00:00:00'), Timestamp('2020-05-03 00:00:00'), Timestamp('2020-05-02 00:00:00'), Timestamp('2020-05-02 00:00:00')]), ('DF', []), ('RJ', []), ('SP1', [Timestamp('2020-05-23 00:00:00'), Timestamp('2020-05-02 00:00:00'), Timestamp('2020-05-01 00:00:00')])]
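For reference, the same "four most recent dates per signal" selection can be written without an explicit loop (a sketch with made-up data, following the same idea as the loop above):

```python
import pandas as pd

# Keep the 4 most recent dates per "sinal" without iterating:
# sort ascending, then take the last 4 rows of each group.
dfA = pd.DataFrame({
    "sinal": ["BH"] * 5 + ["SP1"] * 2,
    "data": pd.to_datetime(["01/05/2020", "02/05/2020", "03/05/2020",
                            "04/05/2020", "05/05/2020",
                            "23/05/2020", "02/05/2020"], format="%d/%m/%Y"),
})
ultimas = dfA.sort_values("data").groupby("sinal").tail(4)
print(ultimas)  # BH keeps 02-05/05 (drops 01/05); SP1 keeps both rows
```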

Note: numeric_only=True already anticipates future versions of pandas. The default for this parameter is None, which in the future would also bring the median of dates (datetime64 and datetime64tz).

Computing the medians per signal and date

>>> for item in datas:
...     sinal = item[0]
...     dts = item[1]
...     for dt in dts:
...         d = f"{dt.year}-{dt.month}-{dt.day}"
...         mediana = dfB[(dfB["sinal"] == sinal) & (dfB["data_exibicao"] == d)].median(numeric_only=True)
...         print(sinal, d, mediana["media_audiencia"])

BH 2020-5-3 nan
BH 2020-5-3 nan
BH 2020-5-2 455104.42
BH 2020-5-2 455104.42
NAC 2020-5-2 nan
SP1 2020-5-23 325104.42
SP1 2020-5-2 14091402.82
SP1 2020-5-1 841810.51

Note: since I created the DataFrames by hand, some dates do not match, but they serve as a reference.

I hope I’ve helped

  • Hi Paulo. That cleared things up a bit, but there are 2 DataFrames, A and B. The one you built is A, which should get another column with the median calculated from the data in B. There are several records per cod_programa; SP1 is one example, and we also have HUCK, NBRA, and several others. I thought of something like a loop over DataFrame A to read each cod_programa, pick up the signal and the most recent dates (these dates always fall on the same day of the week, e.g. day 5, which is Saturday in pandas' dayofweek format; if it were day 4 it would be Friday, and that would be another average).

  • Take the average per day, which can be day 0, 1, 2, 3, 4, 5, or 6, and then insert the result into a new column in DataFrame A for the same signal, same cod_programa, and same day of the week. I hope I explained it better. Thank you.

  • Which columns does DataFrame B have?

    • DataFrame B - audience information for programs that have already aired:
      DFTmpAud
       sinal cod_programa data_exibicao data_inicio_programa media_audiencia
      2278 SP1 XXXX 23/05/2020 02/05/2020 19:01:00 325104.42
      2280 SP1 XXXX 09/05/2020 02/05/2020 19:01:00 255104.42
      2275 BH XXXX 02/05/2020 02/05/2020 19:01:00 455104.42
      6220 SP1 XXXX 16/05/2020 02/05/2020 19:05:00 118375.44
      3345 NAC XXXX 16/05/2020 02/05/2020 19:05:00 1053999.80


  • There is a more complete version in the question; the character limit here is restrictive, but it is all specified there.

  • I updated the post, hope it helps.

  • Thank you, Paulo. I had initially thought of a for loop too; it's just that I'm still getting familiar with this pandas world...

  • Well, I'm glad I could help; if you think the answer is relevant, just mark it as accepted.




First, it will be necessary to use shift() and fillna() to fill in the audience of programs that have only one occurrence.

To calculate the average for these programs, the rolling function will be used together with groupby, and since we are only interested in the value of the last occurrence, we will also use drop_duplicates.

df = DFTmpAud.copy()

g = df.groupby(['sinal', 'cod_programa'], sort=False)
df['media_audiencia_temp'] = g['media_audiencia'].shift()

df['media_audiencia_temp'] = df['media_audiencia_temp'].fillna(df['media_audiencia'])

g2 = df.groupby(['sinal', 'cod_programa'], sort=False)
grouper = g2['media_audiencia_temp'].rolling(4, min_periods=1).mean().reset_index(level=[0,1])
grouper['media_audiencia_temp'] = grouper['media_audiencia_temp'].round(2)

grouper = grouper.drop_duplicates(subset=['sinal', 'cod_programa'], keep = 'last')

grouper.head(10)
        sinal   cod_programa    media_audiencia_temp
276     BH      ZZZZ            150998.68
8536    NAC     ZZZZ            7846891.62
5871    SP1     ZZZZ            841810.51
6329    DF      ZZZZ            123071.38
237     RJ      ZZZZ            581696.70
8873    SP1     XXXX            499558.70
2275    BH      XXXX            455104.42
3345    NAC     XXXX            1053999.80

Now just join this DataFrame with DFTmpDisp using merge:

DFTmpDisp.merge(grouper, on =['sinal','cod_programa'])
#output:
    sinal   cod_programa    data    tempo_disponivel    media_audiencia_temp
0   SP1     XXXX            5       10                  499558.70
1   SP1     XXXX            5       106                 499558.70
2   SP1     XXXX            5       6                   499558.70
3   BH      XXXX            5       300                 455104.42
4   NAC     XXXX            5       118                 1053999.80
5   BH      ZZZZ            0       218                 150998.68
6   BH      ZZZZ            1       66                  150998.68
7   BH      ZZZZ            2       44                  150998.68
8   BH      ZZZZ            3       190                 150998.68
9   BH      ZZZZ            4       202                 150998.68

The output has several occurrences of the same program because DFTmpDisp has several rows for the same sinal and cod_programa; these can also be removed with drop_duplicates.
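For example (with hypothetical rows), keeping one row per (sinal, cod_programa) pair:

```python
import pandas as pd

# Hypothetical merged result with repeated (sinal, cod_programa) pairs;
# drop_duplicates keeps the first row of each pair by default.
res = pd.DataFrame({"sinal": ["SP1", "SP1", "BH"],
                    "cod_programa": ["XXXX", "XXXX", "XXXX"],
                    "media_audiencia_temp": [499558.70, 499558.70, 455104.42]})
dedup = res.drop_duplicates(subset=["sinal", "cod_programa"])
print(len(dedup))  # 2
```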

  • Hi Terry, I will test it, but I think that's exactly it. I understand your logic. Thank you so much.

  • Thanks Terry. I just made some adjustments to the groupby because, beyond sinal and cod_programa, I also needed to group by date. I converted the date into day of the week, grouped, and got the expected result. Thank you so much for your help.
