How to bring more fields in the Pandas groupby, without necessarily having to use them in the grouping?

Asked

Viewed 285 times

0

Good afternoon, guys. I’m messing with the Pandas library in Python and I’d like to ask you a question, if possible. I have a list of values and need to reduce this list to just one record per date. The criterion is that the third field (counter) is the largest of its date. To do this, I used the following code:

import pandas as pd

lista = [("2020-11-16", "803", 4),
    ("2020-11-16", "801", 18),
    ("2020-11-16", "802", 20),
    ("2020-11-17", "801", 12),
    ("2020-11-17", "802", 6)]

df = pd.DataFrame(lista)
df.columns=['data', 'codigo', 'contador']
df2 = df.groupby('data')['contador'].max()
for i in df2.items():
    data = i[0]
    contador = i[1]
    print(data, contador)

#O resultado obtido foi o seguinte e está correto: 
['2020-11-16', 20]
['2020-11-17', 12]

My doubt would be if it is possible to bring together in this result the column 'code', without it passing by grouping. If I put it in the groupby I will have multiple records per date and it will be wrong, because I can only have one record per date. Would there be any way to do this, whether for pandas or otherwise? Thanks from now on.

2 answers

2

One solution would be to create a 'filter':

filtro = df.groupby('data')['contador'].max()

And then use the isin of the pandas:

df[df['contador'].isin(filtro)].reset_index(drop = True)

Exit:

    data        codigo  contador
0   2020-11-16    802        20
1   2020-11-17    801        12
  • 1

    @imonferrari, beautiful solution! I’m not in the habit of using the method isin(), because I usually work with the whole base. It will surely make me think differently from now on.

  • Oops! Thank you, Paul! isin is a very fast method, once you start using becomes an addiction kkkk. Pranks part... Great Hug!

  • 1

    Perfect. I didn’t know this possibility of filter with isin(). Thank you very much!

2

I believe what you’re looking for is the aggregate

To get the result, just:

  1. Create a desired function structure for each field
f = {'contador': 'max', 'codigo': 'first'}
  1. Do the groupby with agg passing by f as a parameter
df2 = series = df.groupby('data', as_index=False)[['contador', 'codigo']].agg(f)

The result will be:

>>> df2
         data  contador codigo
0  2020-11-16        20    803
1  2020-11-17        12    801

Note that the code chosen by Aggregate was the first, which in this case is the 803. However, in case you want the same that is on the line of contador=20, that is to say, codigo=802 do exactly what you were doing and then conclude with a merge()

df3 = pd.merge(df2, df, on=['data', 'contador'])

The result will be:

>>> df3
         data  contador codigo
0  2020-11-16        20    802
1  2020-11-17        12    801

I hope I’ve helped

  • Paulo speaks, good night! All good? In a larger data frame it seems that creates columns code_x and code_y. I think you have to delete the code_x for the result to be the same as the one he asks for in the problem. Big Hug!

  • 1

    @imonferrari, thanks for the comment. Fields that exist in both dataframes automatically receive suffixes_x and _y. Particularly I prefer to pass the suffixes as a parameter in the merge. Something like suffixes=('_df1', '_df2') or suffixes=('_left', '_right'). Other very useful parameters are left_on and right_on which allow merging dataframes when fields have different names. See more details of merge here.

  • Yes, it was just an observation even to the answer "leave" the way he expects, the most everything in conforming kkkk. Hug!!

  • 1

    I tested here also with Aggregate and it worked too, thank you Paulo. It’s good to know that we have different ways to do this. Thank you very much for the two answers, even more personal.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.