Bucket Sort in Python: The array is only partially ordered!

Question

Bucket Sort in Python: The array is only partially ordered!

Asked 5 years, 3 months ago

Viewed 173 times

1

I’m trying to implement the algorithm Bucket Sort. The code below implemented only "1 round" (considering the numbers from right to left) of the algorithm, in which the array [97,3,100] turned into [100,3,97]. As the largest number of the array is 100 we should have more "2 rounds" for the array finish ordered but could not continue. Any idea?

 import numpy as np

    class BucketSort():
        def __init__(self,array):
            self.array = array
        def distribution_pass(self):
            bucket = np.empty((10,len(self.array)),dtype=int)
            bucket[:] = np.nan
            #rows =[]
           # cols = []
            for num in array:
                for j in range(bucket.shape[1]):
                    bucket[num % 10, j] = num
                    #rows.append(num%10)
                    #cols.append(j)
                    break
                    print()

            return bucket
        def gathering_pass(self):
            array = []
            self.bucket = BucketSort.distribution_pass(self)
            for i in range(self.bucket.shape[0]):
                for j in range(self.bucket.shape[1]):
                  if not np.isnan(self.bucket[i,j]):
                      #print(self.bucket[i,j])
                      array.append(self.bucket[i, j])

            return array


    def main(array):
        bucket = BucketSort(array)
        BucketSort.distribution_pass(bucket)
        array = BucketSort.gathering_pass(bucket)
        return array


    array = [97,3,100]
    print(main(array))
    #bucket = BucketSort(array)
    #print(bucket.distribution_pass())
    #print(bucket.gathering_pass())

Input: [97,3,100]

Output: [100,3,97]

1 answer

Browser other questions tagged python python-3.x classification

You are not signed in. Login or sign up in order to post.

by hkotsubo • **55,826** points · Answer 1 · 2020-04-10T16:55:01+00:00

Is there any reason to use the numpy? Unless there’s a specific reason, I’m going to show you a real list solution.
The same question applies to the class BucketSort, needs to be a class? Anyway, let’s go to the answer...

Basically, in the Bucket Sort you have the following steps:

define and create the Buckets (quantity and which elements go into each)
place array elements in the respective Buckets
sort each Bucket individually (with an algorithm of your choice)
join the Buckets ordered in a single array

All of these steps are part of the algorithm, and you only have the complete ordering after they all run. If you’re going to provide an algorithm for it to sort a list of elements, then it makes sense that the function or the class responsible already does all these steps for you.

So it doesn’t make much sense in function main, you have to call the steps of Distribuition and Gathering manually. The algorithm should do this for you (in this case, the class BucketSort who should know how to coordinate these steps, and who will use the class would only call a method sort, for example). Even if you didn’t use a class, what should be made available in the API is just a function bucket_sort, that internally can call other functions (such as the distribution_pass and gathering_pass), but this is an implementation detail that does not matter who is calling the function and just wants its ordered list.

Another detail is that you defined the class to receive an array and guard it as an instance variable. The problem is that so each instance of BucketSort will only be able to sort an array (unless you change the array before sorting again, of course). But in my view the class should only have the sort algorithm, which is able to sort any array. That is, the array should not be part of BucketSort. What you could pass as parameters in the constructor are information related to the algorithm itself, such as the amount of Buckets used and/or the size of each Bucket (or even that, because maybe it doesn’t even matter who’s going to use the class).

Another confusing point is that you call distribution_pass, and then call gathering_pass, which in turn calls distribution_pass again (which would not be necessary: either you call the 2 separately, or one calls the other internally).

Anyway, an alternative would be:

from math import floor

class BucketSort:
    def __init__(self, bucket_size=5):
        self.bucket_size = bucket_size

    def sort(self, array):
        if not array: # se a lista é vazia, retorna ela mesma
            return array

        # passos 1 e 2, cria os buckets e distribui os elementos neles
        buckets = self._distribution(array)

        # passo 3, ordena os buckets
        for bucket in buckets:
            self._insertion_sort(bucket)

        # passo 4, junta os buckets
        return self._gathering(buckets)

    def _distribution(self, array):
        # passo 1, cria os buckets
        min_value = min(array)
        max_value = max(array)
        qtd_buckets = floor((max_value - min_value) / self.bucket_size) + 1
        buckets = [ [] for _ in range(qtd_buckets) ]

        # passo 2, distribui os elementos nos buckets
        for num in array:
            buckets[floor((num - min_value) / self.bucket_size)].append(num)
        return buckets

    def _insertion_sort(self, bucket):
        # ordena um único bucket, usando insertion sort
        for i in range(1, len(bucket)):
            up = bucket[i]
            j = i - 1
            while j >= 0 and bucket[j] > up:
                bucket[j + 1] = bucket[j]
                j -= 1
            bucket[j + 1] = up

    def _gathering(self, buckets):
        # passo 4, junta os buckets
        array = []
        for bucket in buckets:
            array.extend(bucket)
        return array


array = [97, 3, 100]
bs = BucketSort()
print(bs.sort(array)) # [3, 97, 100]
# ordena outro array
print(bs.sort([100, 3, -10, 97, -9, 8, 54])) # [-10, -9, 3, 8, 54, 97, 100]

I left the size of Bucket as an optional parameter in the constructor (and a value default of 5, if none is past), but as I have already said, this is an implementation detail that perhaps should not be informed (I just left there to show that could be an option).

Notice that the only method I need to call is sort. This method takes an array and returns another ordered array, and how it does it internally does not matter who it is calling. Notice how it coordinates the calls of the other methods, but who uses the class does not need to know how it is done (this is something only the class needs to know, because it is the one that encapsulates the algorithm).

So I left the other methods with a _ at the beginning of the name. This is a Python convention to indicate that the methods are private. It’s not exactly like private methods in other languages (Python doesn’t stop you from calling them from outside the class), but it serves to indicate that they shouldn’t be used directly, as it’s internal implementation details that can change. The only one that is available and that I guarantee will not change is the method sort: the class ensures that it receives an array and returns another ordered array (this guarantee is also called "contract" of the method: if I receive X, I will return Y - but as this is done internally, it doesn’t matter; the contract only says what is done, but not how).

This way of doing it ensures that the internal details can be changed without affecting anyone who just wants to sort the array. For example, if I want to change the algorithm of step 3 (I used Insertion Sort, but could use any other), or else change the way of distributing the elements between Buckets (as long as it continues working, there would be no problem), or even create other methods (such as one to create the Buckets) or delete some (do it all in one method), or change the code to use numpy instead of lists, etc., anyway, if I want to make these internal changes, I could, because who will use the class would continue calling only the method sort and everything would keep working.

The way you did, this is not possible, because who will use the class needs to know which methods to use and in which order, and if you use it wrong, the array will not be ordered.

Anyway, in this case you don’t have much advantage in using a class, except to complicate the code for no reason. You could simply use functions:

from math import floor

def distribution(array, bucket_size):
    min_value = min(array)
    max_value = max(array)
    qtd_buckets = floor((max_value - min_value) / bucket_size) + 1
    buckets = [ [] for _ in range(qtd_buckets) ]
    for num in array:
        buckets[floor((num - min_value) / bucket_size)].append(num)
    return buckets

def gathering(buckets):
    array = []
    for bucket in buckets:
        array.extend(bucket)
    return array

def insertion_sort(bucket):
    for i in range(1, len(bucket)):
        up = bucket[i]
        j = i - 1
        while j >= 0 and bucket[j] > up:
            bucket[j + 1] = bucket[j]
            j -= 1
        bucket[j + 1] = up

def bucket_sort(array):
    if not array: # se a lista é vazia, retorna ela mesma
        return array
    buckets = distribution(array, 5)
    for bucket in buckets:
        insertion_sort(bucket)
    return gathering(buckets)

print(bucket_sort([97, 3, 100])) # [3, 97, 100]
print(bucket_sort([100, 3, -10, 97, -9, 8, 54])) # [-10, -9, 3, 8, 54, 97, 100]