Using groupby in a dataframe

Question

Asked 6 years, 3 months ago

I have a DataFrame with 60 columns, but for this case I only need 3:

    ID          DT_DATE     NR_PRICE
0   22828949    2019-02-26  453.00
1   22828949    2019-02-22  453.00
2   22828949    2019-02-18  453.00
3   22828949    2019-02-05  453.00
4   22828950    2019-02-26  189.00
5   22828950    2019-02-24  189.00
6   22828950    2019-02-19  189.00
7   22828950    2019-02-14  189.00
8   22828950    2019-02-01  411.05

For each ID, I need to list the first, penultimate and last dates with their respective prices. I tried to do it this way:

def custom(series):
    min_date = list(series)[0]
    pen_date = list(series)[-2]
    max_date = list(series)[-1]

    return min_date,pen_date,max_date

def get_price(series):
    price_a = list(series)[0]
    price_c = list(series)[-2]
    price_b = list(series)[-1]

    return price_a,price_c,price_b

dfb=df.groupby(["ID"],as_index=False).agg({"DT_DATE":custom,"NR_PRICE":get_price})

When I run it, the following error message appears:

"IndexError: list index out of range"

Has anyone run into this?

1 answer

by Sidon • **6,563** points · Answer 1 · 2019-03-11T22:02:42+00:00

For this problem I wouldn't use groupby. Here is my solution:

I built an example from the data in your question (I used only 6 rows). First I create a DataFrame (df0), sorting it by the date column. Then I copy its contents into a second one (df1), dropping the rows with repeated dates. Finally I create a third (df2) containing only what you asked for: the first, penultimate and last dates.

import pandas as pd

dados = [['22828949', '2019-02-26', '453.00'],
 ['22828949', '2019-02-22', '453.00'],
 ['22828949', '2019-02-18', '453.00'],
 ['22828949', '2019-02-05', '453.00'],
 ['22828950', '2019-02-26', '189.00'],
 ['22828950', '2019-02-24', '189.00']]

# Build the DataFrame (note the sort on the date column)
df0 = pd.DataFrame(dados, columns=['ID', 'DT_DATE', 'NR_PRICE']).sort_values(by=['DT_DATE'])

# Drop rows with duplicated dates, keeping the last occurrence
df1 = df0.drop_duplicates(subset='DT_DATE', keep='last')

# Extract the first, penultimate and last records
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df2 = pd.concat([df1.head(1), df1.tail(2)])

# Show the result
print(df2)

Output:

         ID     DT_DATE NR_PRICE
3  22828949  2019-02-05   453.00
5  22828950  2019-02-24   189.00
4  22828950  2019-02-26   189.00

See working on repl.it.
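
If the result is needed per ID, as the question asks, one possible variant is sketched below. It is not part of the solution above, just an illustration under some assumptions: the dates are ISO-formatted strings (so they sort chronologically), and the ID 99999999 is a hypothetical single-row group added only to show the short-group case. The IndexError in the question most likely comes from a group like that, since list(series)[-2] has no element to return when the group has one row.

```python
import pandas as pd

# Rows from the question, plus one hypothetical single-row ID (99999999)
df = pd.DataFrame({
    "ID": [22828949] * 4 + [22828950] * 5 + [99999999],
    "DT_DATE": ["2019-02-26", "2019-02-22", "2019-02-18", "2019-02-05",
                "2019-02-26", "2019-02-24", "2019-02-19", "2019-02-14",
                "2019-02-01", "2019-02-10"],
    "NR_PRICE": [453.00, 453.00, 453.00, 453.00,
                 189.00, 189.00, 189.00, 189.00, 411.05, 100.00],
})

# Sort so that, inside each ID, rows run from oldest to newest date
df_sorted = df.sort_values(["ID", "DT_DATE"])
groups = df_sorted.groupby("ID")

pos = groups.cumcount()                      # 0-based position inside each ID
size = groups["DT_DATE"].transform("size")   # number of rows of that ID

# Keep the first row and the last two rows of every ID.
# Groups with one or two rows simply keep all their rows, with no IndexError.
mask = (pos == 0) | (pos >= size - 2)
result = df_sorted[mask]

print(result)
```

Each ID with three or more rows yields exactly three rows (first, penultimate, last); shorter groups yield whatever rows they have, so it is worth deciding up front how those should be reported.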