Using Groupby in Pandas dataframe

Question

Asked 6 years, 4 months ago

Viewed 11,645 times

4

Good afternoon.

I don't have much experience with Python and I have a question. Thanks in advance to anyone who can help.

I opened my CSV file in Python as follows:

import pandas as pd

caminhoArquivo = r'\\Desktop\Base\dias.csv'

baseDados = pd.read_csv(caminhoArquivo,sep=';',decimal=',',encoding='latin-1')

File Example:

Index  |  Nome  |  Dia
  0    | Pedro  |   3
  1    | Pedro  |   3
  2    | Pedro  |   24
  3    | Antonio|   24
  4    | Antonio|   24
  5    | Antonio|   24
  6    | Carlos |   4
  7    | Carlos |   4
  8    | Carlos |   28
  9    |  Jose  |   1
  10   |  Jose  |   2
  11   |  Jose  |   2

I removed duplicate data using the command:

colunas = ['Nome','Dia']

diaDuplicado = baseDados.drop_duplicates(subset = colunas)

diaDuplicado = diaDuplicado.reset_index()

The result was:

 Index |  index  |  Nome  |  Dia
  0    |    0    | Pedro  |   3
  1    |    2    | Pedro  |   24
  2    |    3    | Antonio|   24
  3    |    6    | Carlos |   4
  4    |    8    | Carlos |   28
  5    |    9    |  Jose  |   1
  6    |    10   |  Jose  |   2

Now for my question. I need to group the days by name, so that the result looks like this:

Index |  Nome  |  Dia
  0   | Pedro  |   3, 24
  1   | Antonio|   24
  2   | Carlos |   4, 28
  3   |  Jose  |   1, 2

But the only solution I could find was:

diasgroup = diaDuplicado.groupby(by=['Nome'])['Dia'].apply(list)

But this turns the "Nome" column into the index, and the result has dtype "object":

Index  |  Dia
Pedro  |  3, 24
Antonio|  24
Carlos |  4, 28   
 Jose  |  1, 2

Could someone help me?

1

Wouldn't using diasgroup.reset_index() work?

– AlexCiuffa

2019/03/25 at 20:03
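The suggestion in the comment above can be sketched as follows. This is a minimal example, assuming a small hand-built frame that mirrors the deduplicated data in the question. The `sort=False` argument and the name `resultado` are additions not present in the original; `sort=False` is used so that the names keep their order of first appearance, as in the desired output.

```python
import pandas as pd

# Hand-built stand-in for the deduplicated data in the question (an assumption).
diaDuplicado = pd.DataFrame({
    "Nome": ["Pedro", "Pedro", "Antonio", "Carlos", "Carlos", "Jose", "Jose"],
    "Dia": [3, 24, 24, 4, 28, 1, 2],
})

# sort=False keeps the groups in order of first appearance instead of
# sorting them alphabetically.
diasgroup = diaDuplicado.groupby("Nome", sort=False)["Dia"].apply(list)

# reset_index() moves "Nome" out of the index and back into a regular column.
resultado = diasgroup.reset_index()
print(resultado)
```

The "Dia" column still has dtype object, since each cell holds a Python list; that is expected for this layout.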

3 answers

by afonso • **748** points · Answer 1 · 2020-05-16T11:26:50+00:00

Using the function groupby:

df[['Nome', 'Dia']].groupby('Nome').agg(lambda x: list(set(x))).reset_index()
Out[6]:
      Nome      Dia
0  Antonio     [24]
1   Carlos  [4, 28]
2     Jose   [1, 2]
3    Pedro  [24, 3]
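Because a `set` does not guarantee ordering, the days can come back in a different order than they appear in the data (as with `[24, 3]` above). One possible variation, sketched here on hand-built data that mirrors the question's file, uses the groupby `unique()` method, which keeps values in order of appearance. The `sort=False` argument and the name `resultado` are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Sample data mirroring the question's file (hand-built for illustration).
df = pd.DataFrame({
    "Nome": ["Pedro", "Pedro", "Pedro", "Antonio", "Antonio", "Antonio",
             "Carlos", "Carlos", "Carlos", "Jose", "Jose", "Jose"],
    "Dia": [3, 3, 24, 24, 24, 24, 4, 4, 28, 1, 2, 2],
})

# unique() drops repeated days within each group and keeps first-seen order.
# It returns NumPy arrays, so each one is converted to a plain list.
resultado = (
    df.groupby("Nome", sort=False)["Dia"]
      .unique()
      .apply(list)
      .reset_index()
)
print(resultado)
```

With this approach the separate `drop_duplicates` step is not needed, since the deduplication happens inside each group.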

by LuizAngioletti • **1,649** points · Answer 2 · 2020-04-07T18:34:37+00:00

The answer involves a few steps.

Creating the already deduplicated Dataframe:

import pandas as pd    
diaDuplicado = pd.DataFrame(columns=["Nome", "dia"], 
                 data=[["Pedro", 3], 
                       ["Pedro", 24], 
                       ["Antonio", 24], 
                       ["Carlos", 4], 
                       ["Carlos",28], 
                       ["Jose", 1], 
                       ["Jose", 2]])

The Dataframe:

print(diaDuplicado)
      Nome  dia
0    Pedro    3
1    Pedro   24
2  Antonio   24
3   Carlos    4
4   Carlos   28
5     Jose    1
6     Jose    2

Next, generate tuples in a Series. We use tuples rather than lists here because lists are not hashable, which in practice means that pandas cannot deduplicate them:

d = diaDuplicado.groupby(by=['Nome'])['dia'].apply(tuple)

The result is:

Nome
Antonio      (24,)
Carlos     (4, 28)
Jose        (1, 2)
Pedro      (3, 24)
Name: dia, dtype: object
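To illustrate the hashability point above, here is a small sketch on hand-built data (the frames `com_listas` and `com_tuplas` are hypothetical, not from the original answer). It compares `drop_duplicates` on a column of lists with the same call on a column of tuples: the list version should fail with a `TypeError`, while the tuple version deduplicates normally.

```python
import pandas as pd

# Two identical rows; the only difference is list cells versus tuple cells.
com_listas = pd.DataFrame({"Nome": ["Pedro", "Pedro"], "dias": [[3, 24], [3, 24]]})
com_tuplas = pd.DataFrame({"Nome": ["Pedro", "Pedro"], "dias": [(3, 24), (3, 24)]})

try:
    # Lists are unhashable, so pandas cannot compare the rows for duplicates.
    com_listas.drop_duplicates(subset=["Nome", "dias"])
    listas_ok = True
except TypeError as erro:
    listas_ok = False
    print("lists failed:", erro)

# Tuples are hashable, so the duplicate row is removed.
sem_duplicatas = com_tuplas.drop_duplicates(subset=["Nome", "dias"])
print(sem_duplicatas)
```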

Merge the original DataFrame with this Series on the matching keys:

p = pd.merge(diaDuplicado, d, left_on="Nome", right_index=True)
print(p)

      Nome  dia_x    dia_y
0    Pedro      3  (3, 24)
1    Pedro     24  (3, 24)
2  Antonio     24    (24,)
3   Carlos      4  (4, 28)
4   Carlos     28  (4, 28)
5     Jose      1   (1, 2)
6     Jose      2   (1, 2)

Now just deduplicate again, considering the columns of interest:

colunas = ['Nome','dia_y']
# .copy() avoids a SettingWithCopyWarning when a column is added later
diaDuplicado = p.drop_duplicates(subset=colunas).copy()

This results in the DataFrame:

      Nome  dia_x    dia_y
0    Pedro      3  (3, 24)
2  Antonio     24    (24,)
3   Carlos      4  (4, 28)
5     Jose      1   (1, 2)

Now convert the "dia_y" column to lists and drop the dia_x and dia_y columns:

diaDuplicado["DiaLista"] = diaDuplicado["dia_y"].apply(list)
diaLista=diaDuplicado.drop(["dia_x", "dia_y"], axis=1)

This results in the DataFrame "diaLista":

      Nome  DiaLista
0    Pedro   [3, 24]
2  Antonio      [24]
3   Carlos   [4, 28]
5     Jose    [1, 2]

by Sidon • **6,563** points · Answer 3 · 2019-03-25T22:35:27+00:00

TL;DR

Edited
Rereading the question, I saw that the author had already achieved what he wanted, but pointed out that the result is an object and that the names become the index, or something similar. I wonder: if he can iterate over this object and put the elements into a list, dictionary, or any other variable, wouldn't that be enough? I will keep my original answer in case the result has to be a DataFrame, but here is the code to iterate over the object you get when converting the DataFrame column to lists, a pandas.core.series.Series (I use the code fragment from the question to create the object):

# Creating the pandas.core.series.Series
diasgroup = diaDuplicado.groupby(by=['Nome'])['Dia'].apply(list)

# Iterating over diasgroup
for i in diasgroup.items():
    print(i)

Output:

('Antonio', [24])
('Carlos', [4, 28])
('Jose', [1, 2])
('Pedro', [3, 24])
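If a plain Python structure is enough, as suggested above, the Series can also be turned into a dictionary directly with `Series.to_dict()`. A minimal sketch, assuming a hand-built `diaDuplicado` equivalent to the one in the question; the name `dias_por_nome` is illustrative.

```python
import pandas as pd

# Hand-built stand-in for the deduplicated data in the question (an assumption).
diaDuplicado = pd.DataFrame({
    "Nome": ["Pedro", "Pedro", "Antonio", "Carlos", "Carlos", "Jose", "Jose"],
    "Dia": [3, 24, 24, 4, 28, 1, 2],
})

diasgroup = diaDuplicado.groupby(by=["Nome"])["Dia"].apply(list)

# to_dict() maps each index label (the name) to its value (the list of days).
dias_por_nome = diasgroup.to_dict()
print(dias_por_nome["Carlos"])
```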

The rest of this answer covers the case where the result needs to be a DataFrame:

I'm not sure whether pandas can present the data exactly the way you want with groupby, but you can convert it to an empty MultiIndex DataFrame, which looks something like this:

I am showing it as a screenshot because I only get that output in a Jupyter notebook, when I type the name of the df without the print function. With the print function, the result is this:

Empty DataFrame
Columns: []
Index: [(Antonio, 24), (Carlos, 4), (Carlos, 28), (Jose, 1), (Jose, 2), 
(Pedro, 3), (Pedro, 24)]

From there, you can navigate the index and extract the information you need.

Let’s go to the code:

import io
import pandas as pd

s = '''
Nome,Dia
Pedro,3
Pedro,3
Pedro,24
Antonio,24
Antonio,24
Antonio,24
Carlos,4
Carlos,4
Carlos,28
Jose,1
Jose,2
Jose,2
'''

df = pd.read_csv(io.StringIO(s), parse_dates=True)
df = df.drop_duplicates(subset = ['Nome','Dia'])

grouped = df.groupby(['Nome', 'Dia']).sum()

print(grouped)

Output:

Empty DataFrame
Columns: []
Index: [(Antonio, 24), (Carlos, 4), (Carlos, 28), (Jose, 1), (Jose, 2),
(Pedro, 3), (Pedro, 24)]

See it working on repl.it.
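One possible way to navigate that MultiIndex is sketched below, assuming `grouped` is built as in the code above (the data is rebuilt inline so the snippet runs on its own). The name `dias_por_nome` is an illustrative addition, not part of the original answer.

```python
import pandas as pd

# Same data as in the CSV string above, already deduplicated.
df = pd.DataFrame({
    "Nome": ["Pedro", "Pedro", "Pedro", "Antonio", "Antonio", "Antonio",
             "Carlos", "Carlos", "Carlos", "Jose", "Jose", "Jose"],
    "Dia": [3, 3, 24, 24, 24, 24, 4, 4, 28, 1, 2, 2],
}).drop_duplicates(subset=["Nome", "Dia"])

grouped = df.groupby(["Nome", "Dia"]).sum()

# Each entry of the MultiIndex is a (Nome, Dia) tuple, so the days can be
# collected per name while iterating over the index.
dias_por_nome = {}
for nome, dia in grouped.index:
    dias_por_nome.setdefault(nome, []).append(int(dia))

print(dias_por_nome)
```

A single level can also be read with `grouped.index.get_level_values("Nome")` when only the names or only the days are needed.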