Grouping function miscalculating

Asked

Viewed 92 times

4

I’m trying to write a python function that takes two vectors, eliminates duplicates in the first and adds the corresponding data in the second.

Kind of trying to emulate the following Mysql query:

SELECT `Price`, sum(`Volume`) FROM `Dataset` GROUP BY `Price`

My code is like this:

def Group(X, Y):
  temp = dict.fromkeys(X, [0, 0])
  L = range(len(X))

  for i in L:
    print('BEFORE: i=%s, temp=%s, temp[X[i]]=%s, X=%s, X[i]=%s, Y=%s, Y[i]=%s' %(i, temp, temp[X[i]], X, X[i], Y, Y[i]))

    temp[X[i]][0] += Y[i]
    temp[X[i]][1] += 1

    print('AFTER: i=%s, temp=%s, temp[X[i]]=%s, X=%s, X[i]=%s, Y=%s, Y[i]=%s' %(i, temp, temp[X[i]], X, X[i], Y, Y[i]))

  return temp

When calling the function, it should respond like this:

>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>> Group(a, b)
BEFORE: i=0, temp={1: [0, 0], 2: [0, 0]}, temp[X[i]]=[0, 0], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
AFTER: i=0, temp={1: [3, 1], 2: [0, 0]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
BEFORE: i=1, temp={1: [3, 1], 2: [0, 0]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
AFTER: i=1, temp={1: [7, 2], 2: [0, 0]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
BEFORE: i=2, temp={1: [7, 2], 2: [5, 1]}, temp[X[i]]=[5, 1], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
AFTER: i=2, temp={1: [7, 2], 2: [5, 1]}, temp[X[i]]=[5, 1], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
{1: [7, 2], 2: [5, 1]}

But you’re answering like this:

>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>> Group(a, b)
BEFORE: i=0, temp={1: [0, 0], 2: [0, 0]}, temp[X[i]]=[0, 0], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
AFTER: i=0, temp={1: [3, 1], 2: [3, 1]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
BEFORE: i=1, temp={1: [3, 1], 2: [3, 1]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
AFTER: i=1, temp={1: [7, 2], 2: [7, 2]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
BEFORE: i=2, temp={1: [7, 2], 2: [7, 2]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
AFTER: i=2, temp={1: [12, 3], 2: [12, 3]}, temp[X[i]]=[12, 3], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
{1: [12, 3], 2: [12, 3]}

Already in the first execution of the loop the function misses, because I want it to add the value in Y[0] only in temp[1][0], and not in temp[2][0] also.

Someone can see what I’m doing wrong?

Thanks in advance!


[RESOLVED]

I want to sincerely thank you again for the tips and valuable answers! The solution I chose/found was:

def Group(X, Y, isSum):
    temp = dict.fromkeys(X, (0, 0))
    for i in range(len(X)):
        temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)
    if not isSum:
        for i in temp:
            temp[i] = temp[i][0] / temp[i][1]
    return temp

And the machine’s response was:

>>> a = [1, 1, 2, 3, 4]
>>> b = [5, 6, 7, 7, 8]
>>> 
>>> Group(a, b, True)
{1: (11, 2), 2: (7, 1), 3: (7, 1), 4: (8, 1)}
>>> Group(b, a, False)
{8: 4.0, 5: 1.0, 6: 1.0, 7: 2.5}
>>> 

Actually I wanted two functions in one, to emulate these two queries:

SELECT `Price`, SUM(`Volume`) FROM `Dataset` GROUP BY `Price` ;
SELECT `Volume`, AVG(`Price`) FROM `Dataset` GROUP BY `Volume` ;

I apologize for presenting only half the problem.

Hug to all!


[SOLVED 2]

There goes a complement, with real problem data:

Dataset = [
    # Tempo, Volume e Preço do par BTCUSD em [https://www.bitfinex.com/]
    (1426502898, 0.01, 292.76),
    (1426503160, 0.01, 292.77),
    (1426503169, 0.0224, 292.77),
    (1426503180, 0.304111, 292.77),
    (1426503188, 0.026, 292.77),
    (1426503194, 0.388, 292.77),
    (1426503287, 0.01, 292.76),
    (1426503287.14286, 0.202, 292.7),
    (1426503287.28571, 0.7082, 292.77),
    (1426503287.42857, 1.12, 292.75),
    (1426503287.57143, 2.57392822, 292.76),
    (1426503287.71429, 3.095, 292.77),
    (1426503287.85714, 7, 292.77),
    (1426503292, 0.22, 292.7),
    (1426503320, 0.5, 292.78),
    (1426503340, 0.5, 292.78),
    (1426503358, 0.5, 292.78),
    (1426503377, 0.01, 292.76),
    (1426503377.5, 0.49, 292.78),
    (1426503378, 0.01, 292.75),
    (1426503384, 0.01, 292.76),
    (1426503385, 0.1, 292.78),
    (1426503394, 0.01, 292.76),
    (1426503395, 0.5, 292.78),
    (1426503399, 0.01, 292.76),
    (1426503402, 0.01, 292.76),
    (1426503407, 0.01, 292.76),
    (1426503414, 0.181549, 292.78),
    (1426503414.5, 0.318451, 292.78),
    (1426503415, 0.01, 292.76)
]

partition = lambda data, pos: [i[pos] for i in data]

def Group(X, Y, isSum):
    temp = dict.fromkeys(X, (0, 0))
    for i in range(len(X)):
        temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)
    if not isSum:
        for i in temp:
            temp[i] = temp[i][0] / temp[i][1]
    return temp

And the machine’s response was:

>>> 
>>> Group(partition(Dataset, 2), partition(Dataset, 1), True)
{292.77: (11.553711, 8), 292.78: (3.09, 8), 292.76: (2.6639282199999985, 10), 292.75: (1.1300000000000001, 2), 292.7: (0.42200000000000004, 2)}
>>> Group(partition(Dataset, 1), partition(Dataset, 2), False)
{0.304111: 292.77, 1.12: 292.75, 2.57392822: 292.76, 3.095: 292.77, 0.181549: 292.78, 0.5: 292.78, 7: 292.77, 0.026: 292.77, 0.202: 292.7, 0.01: 292.76000000000005, 0.1: 292.78, 0.388: 292.77, 0.318451: 292.78, 0.49: 292.78, 0.0224: 292.77, 0.22: 292.7, 0.7082: 292.77}
>>> 

1 answer

2


The problem with your code is that you are using an object mutable as default value of your dictionary. When you do:

temp = dict.fromkeys(X, [0, 0])

You expect him to create a list [0, 0] different for each key in X, right? But that’s not what happens, he’s using the same list for all elements! Then, of course, by modifying one of them changes the others too...

One way to solve is by using a tuple, which is immutable:

temp = dict.fromkeys(X, (0, 0))

...

temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)

Anyway why not use the method instead groupby python?

>>> from itertools import groupby
>>>
>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>>
>>> temp = {}
>>> for g,e in groupby(sorted(zip(a,b)), lambda x: x[0]):
...   e = list(e)
...   temp[g] = [sum(x[1] for x in e), len(e)]
...
>>> temp
{1: [7, 2], 2: [5, 1]}

Explaining:

  • The zip(a,b) relates each element of a with its corresponding element of b, i and.. [(1,3), (1,4), (2,5)];

    • The sorted it is necessary, therefore the groupby only works correctly if the list is sorted according to the grouping key. Standard sorting is sufficient because it compares first by the first element (the key) and then by the second.
  • The groupby group that list by the first element; g then it is the element of the list a, and e is an iterator on the grouped pairs;

  • For each grouping, I calculate the sum of the list items b in this grouping and the number of these items, associating them with their element g.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.