Add values from a column that are duplicated using groupby as python criteria

Asked

Viewed 61 times

0

I have a data frame that needs to be removed duplicates and later from the previous dataframe I need to add a specific column. Actually I have 5 DF, 4 have already worked, but one because it contains duplicate values after the groupby it gives index error. Can anyone help me? I’ve tried so many ways.

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].transform(np.sum)

this code works for 4 DF, but in one of them TOTAL KBYTES "duplicates" several times, and this is a reality, I need to add also duplicates.

I have tried it in many ways but without success.

examples of attempts:

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].sum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'])['TOTAL KBYTES VAN'].cumsum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'], as_index=False)['TOTAL KBYTES VAN'].sum()

DF1['TOTAL KBYTES VAN'] = DFORIGINAL.groupby(['CONVENIO', 'CNPJ', 'PRODUTO', 'RATEIO VAN'], as_index=False)['TOTAL KBYTES VAN'].transform(np.sum)

DFORIGINAL

VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
VAN CNPJ CLIENT PRODUCT APPORTIONMENT VAN TOTAL KBYTES VAN CONVENTION % MARGIN
COMPANY 0123456789777 COMPANY OTHERS 100 2,63671875 220000000 1
COMPANY 0123456789777 COMPANY PAYMENT 100 2,63671875 220000000 1
COMPANY 0123456789777 COMPANY OTHERS 100 2,63671875 220000000 1
COMPANY 0123456789777 COMPANY OTHERS 100 2,63671875 220000000 1
COMPANY 0123456789777 COMPANY PAYMENT 100 2,63671875 220000000 1
COMPANY 0123456789777 COMPANY OTHERS 100 2,63671875 220000000 1

I only got this way, see the expected result. Obs. Before making the sum, I make a "drop_duplicates" in DF1, using the same criteria as groupby, so DF1 brings only the 2 unique lines, and from them I hope to deliver the sum of the TOTAL KBYTES VAN

EXPECTED RESULT DF1

VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;10,546875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;5,2734375;220000000;1
VAN CNPJ CLIENT PRODUCT APPORTIONMENT VAN TOTAL KBYTES VAN CONVENTION % MARGIN
COMPANY 0123456789777 COMPANY OTHERS 100 10,546875 220000000 1
COMPANY 0123456789777 COMPANY PAYMENT 100 5,2734375 220000000 1
  • Show, that’s exactly it. But how did you manage to do it this way here?

  • 1

    So https://answall.com/editing-help#Tables

1 answer

4


I believe you have to use the reset_index

Take the example:

Creating Dataframe

import pandas as pd

df = pd.DataFrame({"A": [1,2,1,3,4], "B": [1,2,1,3,4]})

df
Out[3]:
   A  B
0  1  1
1  2  2
2  1  1
3  3  3
4  4  4

Summing WITHOUT the reset_index

df1 = df.groupby(["A"])["B"].sum()

type(df1)
pandas.core.series.Series

Summing WITH reset_index

df2 = df.groupby(["A"])["B"].sum().reset_index()

type(df2)
pandas.core.frame.DataFrame

Note realize that the generated type is different between the forms.

EDITED 15/07/2021, based on the update of the post that now contains the dataframe and the expected

Creating Dataframe

import io
import pandas as pd

txt = """VAN;CNPJ;CLIENTE;PRODUTO;RATEIO VAN;TOTAL KBYTES VAN;CONVENIO;% MARGEM
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;PAGAMENTO;100;2,63671875;220000000;1
EMPRESA;0123456789777;EMPRESA;OUTROS;100;2,63671875;220000000;1"""

df = pd.read_csv(io.StringIO(txt), sep=";")

df
       VAN          CNPJ  CLIENTE    PRODUTO  RATEIO VAN TOTAL KBYTES VAN   CONVENIO  % MARGEM
0  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
1  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100       2,63671875  220000000         1
2  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
3  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1
4  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100       2,63671875  220000000         1
5  EMPRESA  123456789777  EMPRESA     OUTROS         100       2,63671875  220000000         1

Converting TOTAL KBYTES VAN float

df["TOTAL KBYTES VAN"] = df["TOTAL KBYTES VAN"].str.replace(",", ".")

df["TOTAL KBYTES VAN"] = df["TOTAL KBYTES VAN"].astype(float)

df
       VAN          CNPJ  CLIENTE    PRODUTO  RATEIO VAN  TOTAL KBYTES VAN   CONVENIO  % MARGEM
0  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
1  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100          2.636719  220000000         1
2  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
3  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1
4  EMPRESA  123456789777  EMPRESA  PAGAMENTO         100          2.636719  220000000         1
5  EMPRESA  123456789777  EMPRESA     OUTROS         100          2.636719  220000000         1

Grouping and summing

novo_df = df.groupby(["VAN", "CNPJ", "CLIENTE", "RATEIO VAN", "CONVENIO", "% MARGEM", "PRODUTO"])["TOTAL KBYTES VAN"].sum().reset_index()

novo_df
       VAN          CNPJ  CLIENTE  RATEIO VAN   CONVENIO  % MARGEM    PRODUTO  TOTAL KBYTES VAN
0  EMPRESA  123456789777  EMPRESA         100  220000000         1     OUTROS         10.546875
1  EMPRESA  123456789777  EMPRESA         100  220000000         1  PAGAMENTO          5.273438

Note in groupby I used the column PRODUTO finally, because it is the one who has different values (PAYMENT AND OTHER). This nay is mandatory, just a habit I have.

  • It was worth trying to help, thank you very much. But I had already tried the reset too and it doesn’t work. In the column I want the delivery of the sum, it brings me the data of the covenant column of the first row of each accessed row.

  • 2

    Update your post with a piece of the original dataframe (not an image), the result obtained and the expected result. The original "dataframe" may be false, just to reproduce the problem.

  • I just edited, demonstrating the 2 DF.

  • Man, you were sensational, thank you so much! I hope my doubt can help other people.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.