1
I would like to create a new variable which is the scalar product of two variables in a sub-group of my dataframe. I know that when I want to create a group-level variable, I can use the method transform
. For example, in the example dataset auto
(download) I can calculate the average price of brands as follows:
import pandas as pd
import numpy as np
df=pd.read_csv("auto.csv")
#cria a variável de marca
df['brand']=[k[0] for k in df.make.str.split(" ")]
#média dos preços
df['price_mean']=df.groupby("brand")['price'].transform(np.mean)
df.head()
Upshot:
| make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | brand | price_mean |
|:--------------|--------:|------:|--------:|-----------:|--------:|---------:|---------:|-------:|---------------:|-------------:|:----------|:--------|-------------:|
| AMC Concord | 4099 | 22 | 3 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic | AMC | 4215.67 |
| AMC Pacer | 4749 | 17 | 3 | 3 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic | AMC | 4215.67 |
| AMC Spirit | 3799 | 22 | nan | 3 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic | AMC | 4215.67 |
| Buick Century | 4816 | 20 | 3 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic | Buick | 6075.29 |
| Buick Electra | 7827 | 15 | 4 | 4 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic | Buick | 6075.29 |
I don’t know how to do a similar procedure using two variables. Concretely, the variable I want to create is called result
and is the result of the scalar product of a dummy variable called size
and the variable price
. For a brand would look like this:
df['size']=np.where(df.length>200, 1,0)
tab=df[df.brand=='Olds']
tab['result']=np.dot(tab['price'], tab['size'])
tab
Upshot:
| make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | brand | price_mean | size | result |
|:---------------|--------:|------:|--------:|-----------:|--------:|---------:|---------:|-------:|---------------:|-------------:|:----------|:--------|-------------:|-------:|---------:|
| Olds 98 | 8814 | 21 | 4 | 4 | 20 | 4060 | 220 | 43 | 350 | 2.41 | Domestic | Olds | 6050.86 | 1 | 24075 |
| Olds Cutl Supr | 5172 | 19 | 3 | 2 | 16 | 3310 | 198 | 42 | 231 | 2.93 | Domestic | Olds | 6050.86 | 0 | 24075 |
| Olds Cutlass | 4733 | 19 | 3 | 4.5 | 16 | 3300 | 198 | 42 | 231 | 2.93 | Domestic | Olds | 6050.86 | 0 | 24075 |
| Olds Delta 88 | 4890 | 18 | 4 | 4 | 20 | 3690 | 218 | 42 | 231 | 2.73 | Domestic | Olds | 6050.86 | 1 | 24075 |
| Olds Omega | 4181 | 19 | 3 | 4.5 | 14 | 3370 | 200 | 43 | 231 | 3.08 | Domestic | Olds | 6050.86 | 0 | 24075 |
| Olds Starfire | 4195 | 24 | 1 | 2 | 10 | 2730 | 180 | 40 | 151 | 2.73 | Domestic | Olds | 6050.86 | 0 | 24075 |
| Olds Toronado | 10371 | 16 | 3 | 3.5 | 17 | 4030 | 206 | 43 | 350 | 2.41 | Domestic | Olds | 6050.86 | 1 | 24075 |
What I tried to do was:
df.groupby('brand')[['price','size']].transform(lambda x,y: np.dot(x,y))
But it didn’t work. How can I do an operation on two variables using transform
?
I believe it depends on the function you are using, in this case you may have to use the apply:
df.groupby('brand')[['price','size']].apply(lambda x: np.dot(x['price'],x['size']))
. Hug!– lmonferrari