Add values from a column that are duplicated using groupby as python criteria

Question

Add values from a column that are duplicated using groupby as python criteria

Asked 4 years ago

Viewed 61 times

0

I have a data frame that needs to be removed duplicates and later from the previous dataframe I need to add a specific column. Actually I have 5 DF, 4 have already worked, but one because it contains duplicate values after the groupby it gives index error. Can anyone help me? I’ve tried so many ways.

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].transform(np.sum)

this code works for 4 DF, but in one of them TOTAL KBYTES "duplicates" several times, and this is a reality, I need to add also duplicates.

I have tried it in many ways but without success.

examples of attempts:

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].sum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].cumsum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'], as_index=False)['TOTAL KBYTES VAN'].sum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'], as_index=False)['TOTAL KBYTES VAN'].transform(np.sum)

DFORIGINAL

VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1

VAN	CNPJ	CLIENT	PRODUCT	APPORTIONMENT VAN	TOTAL KBYTES VAN	CONVENTION	% MARGIN
COMPANY	0123456789777	COMPANY	OTHERS	100	2,63671875	220000000	1
COMPANY	0123456789777	COMPANY	PAYMENT	100	2,63671875	220000000	1
COMPANY	0123456789777	COMPANY	OTHERS	100	2,63671875	220000000	1
COMPANY	0123456789777	COMPANY	OTHERS	100	2,63671875	220000000	1
COMPANY	0123456789777	COMPANY	PAYMENT	100	2,63671875	220000000	1
COMPANY	0123456789777	COMPANY	OTHERS	100	2,63671875	220000000	1

I only got this way, see the expected result. Obs. Before making the sum, I make a "drop_duplicates" in DF1, using the same criteria as groupby, so DF1 brings only the 2 unique lines, and from them I hope to deliver the sum of the TOTAL KBYTES VAN

EXPECTED RESULT DF1

VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;10,546875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;5,2734375;220000000;1

VAN	CNPJ	CLIENT	PRODUCT	APPORTIONMENT VAN	TOTAL KBYTES VAN	CONVENTION	% MARGIN
COMPANY	0123456789777	COMPANY	OTHERS	100	10,546875	220000000	1
COMPANY	0123456789777	COMPANY	PAYMENT	100	5,2734375	220000000	1

Show, that’s exactly it. But how did you manage to do it this way here?

– Eduardo Garcia de Oliveira

2021/07/15 at 19:05
1

So https://answall.com/editing-help#Tables

– Augusto Vasques

2021/07/15 at 19:07

1 answer

Browser other questions tagged python pandas numpy

You are not signed in. Login or sign up in order to post.

by Paulo Marques • **3,739** points · Answer 1 · 2021-07-15T16:21:14+00:00

I believe you have to use the reset_index

Take the example:

Creating Dataframe

import pandas as pd

df = pd.DataFrame({"A": [1,2,1,3,4], "B": [1,2,1,3,4]})

df
Out[3]:
   A  B
0  1  1
1  2  2
2  1  1
3  3  3
4  4  4

Summing WITHOUT the reset_index

df1 = df.groupby(["A"])["B"].sum()

type(df1)
pandas.core.series.Series

Summing WITH reset_index

df2 = df.groupby(["A"])["B"].sum().reset_index()

type(df2)
pandas.core.frame.DataFrame

Note realize that the generated type is different between the forms.

EDITED 15/07/2021, based on the update of the post that now contains the dataframe and the expected

Creating Dataframe

import io
import pandas as pd

txt = """VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1"""

df = pd.read_csv(io.StringIO(txt), sep=";")

df
       VAN          CNPJ  CLIENTE    PRODUTO  RATEIO VAN TOTAL KBYTES VAN   CONVENIO  % MARGEM
0  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
1  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100       2,63671875  220000000         1
2  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
3  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
4  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100       2,63671875  220000000         1
5  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1

Converting `TOTAL KBYTES VAN` float

df["TOTAL KBYTES VAN"] = df["TOTAL KBYTES VAN"].str.replace(",", ".")

df["TOTAL KBYTES VAN"] = df["TOTAL KBYTES VAN"].astype(float)

df
       VAN          CNPJ  CLIENTE    PRODUTO  RATEIO VAN  TOTAL KBYTES VAN   CONVENIO  % MARGEM
0  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
1  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100          2.636719  220000000         1
2  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
3  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
4  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100          2.636719  220000000         1
5  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1

Grouping and summing

novo_df = df.groupby(["VAN", "CNPJ", "CLIENTE", "RATEIO VAN", "CONVENIO", "% MARGEM", "PRODUTO"])["TOTAL KBYTES VAN"].sum().reset_index()

novo_df
       VAN          CNPJ  CLIENTE  RATEIO VAN   CONVENIO  % MARGEM    PRODUTO  TOTAL KBYTES VAN
0  EMPRESA  123456789777  EMPRESA         100  220000000         1     OUTROS         10.546875
1  EMPRESA  123456789777  EMPRESA         100  220000000         1  PAGAMENTO          5.273438

Note in groupby I used the column PRODUTO finally, because it is the one who has different values (PAYMENT AND OTHER). This nay is mandatory, just a habit I have.

Add values from a column that are duplicated using groupby as python criteria

DFORIGINAL

EXPECTED RESULT DF1

1 answer

Creating Dataframe

Summing WITHOUT the reset_index

Summing WITH reset_index

EDITED 15/07/2021, based on the update of the post that now contains the dataframe and the expected

Creating Dataframe

Converting TOTAL KBYTES VAN float

Grouping and summing

Converting `TOTAL KBYTES VAN` float