Incorrect value of Agg Mean

I need to reproduce the mean, mode, and median given for this table:

Televisores/dia   Freq. absoluta
 0 |----- 20           5
 20|----- 40           25
 40|----- 60           40
 60|----- 80           15
 80|----- 100          10
100|----- 120          5

mean (media) = 53, mode (moda) = 50, median (mediana) = 50

The idea is to compute the mean of each interval in the first column and then repeat that mean according to the interval's frequency. This is what I came up with:

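import pandas as pd
from statistics import mean, mode, median

televisores = [*range(0, 120)]
frequencia = [5, 25, 40, 15, 10, 5]

df = pd.DataFrame({'televisores': televisores})

# Group the values into the six classes and take the mean of each class
bins = pd.cut(df['televisores'], [0, 20, 40, 60, 80, 100, 120])
df = df.groupby(bins)['televisores'].agg(Media='mean')

df['Freq. absoluta'] = frequencia

# Repeat each class mean as many times as its absolute frequency
count = [x for x, y in zip(df['Media'], df['Freq. absoluta']) for i in range(y)]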

The problem is that each mean comes out 0.5 higher than expected:

    df
                 Media   Freq. absoluta
    televisores
    (0, 20]       10.5         5
    (20, 40]      30.5         25
    (40, 60]      50.5         40
    (60, 80]      70.5         15
    (80, 100]     90.5         10
    (100, 120]   110.0         5
    mean(count), mode(count), median(count)
      53.475       50.5         50.5

I would like to understand what causes this and to know whether there is an easier way to get the result.

  • Have you tried include_lowest=True in the cut? Either way, I believe your result is correct, given that you used cut.

1 answer

The script is doing exactly what it should. It is a mistake to assume the means are off by 0.5; they are correct for the intervals that pd.cut builds.

Some points:

  1. range(0, 120) creates a sequence that starts at 0 and ends at 119; 120 itself is not included.
  2. pd.cut creates 6 intervals which, by default, have the form (x, y]: open on the x side and closed on the y side. To flip this, pass right=False to cut.
  3. By default pd.cut does not include the lowest edge (here, 0) in the first interval. To include it, pass include_lowest=True.
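The three points above can be checked side by side. This is only a small sketch: the names values, edges, default_means, left_closed_means and lowest_means are made up for illustration, and observed=False is passed explicitly just to keep every bin in the result on recent pandas versions.

```python
import pandas as pd

values = pd.Series(range(0, 120))           # 0..119, as in the question
edges = [0, 20, 40, 60, 80, 100, 120]

# Default: intervals are open on the left and closed on the right, e.g. (0, 20].
# The value 0 falls outside every interval and is dropped from the groups.
default_means = values.groupby(pd.cut(values, edges), observed=False).mean()

# right=False: intervals are closed on the left and open on the right, e.g. [0, 20).
left_closed_means = values.groupby(
    pd.cut(values, edges, right=False), observed=False
).mean()

# include_lowest=True: only the first interval additionally takes in its left edge (0).
lowest_means = values.groupby(
    pd.cut(values, edges, include_lowest=True), observed=False
).mean()

print(default_means.tolist())      # [10.5, 30.5, 50.5, 70.5, 90.5, 110.0]
print(left_closed_means.tolist())  # [9.5, 29.5, 49.5, 69.5, 89.5, 109.5]
print(lowest_means.tolist())       # [10.0, 30.5, 50.5, 70.5, 90.5, 110.0]
```

None of the three options gives 10, 30, 50, ... for every class, because each integer can belong to only one interval.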

Splitting the 120 values into six equal chunks, which is how cut groups them with right=False, gives:

>>> import numpy as np
>>> for sdf in np.array_split(df, 6):
...     print(list(sdf.televisores))
...
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]

Note that

>>> mean([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
9.5
>>> mean([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
10
>>> mean([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21])
10.5
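To connect this to the question's output: the default bin (0, 20] holds the integers 1 to 20, and the mean of consecutive integers is the average of the first and last one. A small sketch, where consecutive_mean is a hypothetical helper written only for this illustration:

```python
from statistics import mean

def consecutive_mean(first, last):
    """Mean of the integers first..last, both ends included (illustrative helper)."""
    return mean(list(range(first, last + 1)))

print(consecutive_mean(1, 20))   # 10.5 -> values in the default bin (0, 20]
print(consecutive_mean(0, 19))   # 9.5  -> values in the bin [0, 20) with right=False
print((0 + 20) / 2)              # 10.0 -> class midpoint, which the expected answer uses
```

The 0.5 offset therefore comes from which end of the interval is excluded, not from a bug in agg.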

However, if you slice with .loc instead:

>>> import pandas as pd
>>> import numpy as np
>>> from statistics import mean, median, mode

>>> televisores = [*range(0, 120)]
>>> frequencia = [5, 25, 40, 15, 10, 5]

>>> df = pd.DataFrame({'televisores': televisores})

>>> df.loc[0:20].mean()
televisores    10.0
dtype: float64
>>> df.loc[20:40].mean()
televisores    30.0
dtype: float64
>>> df.loc[40:60].mean()
televisores    50.0
dtype: float64
>>> df.loc[60:80].mean()
televisores    70.0
dtype: float64
>>> df.loc[80:100].mean()
televisores    90.0
dtype: float64

Although this gives the means you were hoping for, it is wrong: df.loc[0:20].mean() and df.loc[20:40].mean() both include the value 20 in their calculations, so every interior boundary value is counted twice.

Note that .loc includes both the start and the end label in the result:

>>> df.loc[0:3]
   televisores
0            0
1            1
2            2
3            3
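
As for an easier way to get the result: one option that avoids both the 0.5 offset and the double counting is to work with the class midpoints directly. The sketch below assumes the expected figures (53, 50, 50) follow the common convention of representing each class by its midpoint, and it takes the mode as the midpoint of the most frequent class; both are assumptions, not something stated in the exercise. The names edges, midpoints and expanded are illustrative.

```python
import numpy as np

edges = np.array([0, 20, 40, 60, 80, 100, 120])
frequencia = np.array([5, 25, 40, 15, 10, 5])

# Midpoint of each class: 10, 30, 50, 70, 90, 110
midpoints = (edges[:-1] + edges[1:]) / 2

# Mean of grouped data: midpoints weighted by absolute frequency
media = np.average(midpoints, weights=frequencia)

# Mode (assumed convention): midpoint of the class with the highest frequency
moda = midpoints[frequencia.argmax()]

# Median: expand each midpoint by its frequency and take the middle value
expanded = np.repeat(midpoints, frequencia)
mediana = np.median(expanded)

print(media, moda, mediana)  # 53.0 50.0 50.0
```

No DataFrame or cut is needed here, since the table already gives the class limits and frequencies.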

  • Great explanation! Counting the same number in two means does not seem correct either, but it is what the exercise asked for. I expressed myself badly when I said the mean was wrong; I just did not know how to get around the interval problem. The closest I got was rounding down with .apply(np.floor).reset_index() in the groupby to force the result. Using .loc worked for this and all similar exercises.
