Adding a grouped total sum as a new column to a PySpark DataFrame


I have a dataframe with the following columns:

COL1    COL2    COL3    NEW_COL*
A       asd      1         8
B       adf      2         9
A       adg      8         1
B       adh      9         2
C       adj      7         7
D       adk      1         1

Where NEW_COL = (total sum of col1 by type - the value of col3) / (total quantity of col1 by type - 1)

This is the column I need help with. Does someone know how I can compute it on a DataFrame with PySpark?

Thanks!

2 answers


Adriana, I don’t understand the calculation for your new column. If NEW_COL = (total sum of col1 by type - the value of col3) / (total quantity of col1 by type - 1), then the first line of NEW_COL would be:

  • Total sum of col1 per type = 2, as there are two occurrences of A in col1
  • Value of col3: 1
  • Total quantity of col1 per type = 2, as there are two occurrences of A in col1

So the first line would be (2-1)/(2-1) = 1, hence I don’t understand why the result is 8. Could you explain it to me with a more detailed example of the calculation?

  • It would look like this: sum of type A = 9, minus the row’s value = 1 => 8; quantity of type A = 2, minus one unit = 1 => 1. So we have 8/1... :)



I know the post is old, but this is for future reference. Would something like this work?

import pandas as pd

df = pd.DataFrame(data={'COL1': ['A', 'B', 'A', 'B', 'C', 'D'],
                        'COL2': ['asd', 'adf', 'adg', 'adh', 'adj', 'adk'],
                        'COL3': [1, 2, 8, 9, 7, 1]})

print(df)

# new column with the number of occurrences of each COL1 value per type
df['somaCol1porTipo'] = df['COL1'].groupby(df['COL1']).transform('count')

# new column with the difference between the previous column and COL3
df['col_anterior_menos_Col3'] = df['somaCol1porTipo'] - df['COL3']

# new column with the total row count of COL1
df['totalCol1'] = df['COL1'].count()

print(df)

Then, just do the calculations you need using the columns that already have the base values.
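As a sketch of that final step in pandas, assuming the asker’s formula means the sum of COL3 per COL1 group (with COL3 itself kept for single-row groups, per the expected table; that fallback is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'COL1': ['A', 'B', 'A', 'B', 'C', 'D'],
                   'COL2': ['asd', 'adf', 'adg', 'adh', 'adj', 'adk'],
                   'COL3': [1, 2, 8, 9, 7, 1]})

grp = df.groupby('COL1')['COL3']
soma = grp.transform('sum')    # sum of COL3 per COL1 group
qtd = grp.transform('count')   # number of rows per COL1 group

# leave-one-out formula; fall back to COL3 itself for single-row groups
df['NEW_COL'] = ((soma - df['COL3']) / (qtd - 1)).where(qtd > 1, df['COL3'])
print(df['NEW_COL'].tolist())
```

This reproduces the NEW_COL column from the question (8, 9, 1, 2, 7, 1).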
