How does the hash table implementation work?

Asked

Viewed 6,730 times

11

The concept of hashes (or tables hash) is already embedded in various programming languages such as Javascript, Python, Ruby, etc., and is known to be much faster than, for example, chained lists. It associates key-value pairs like this:

meuHash = []
meuHash["minha chave"] = "meu valor"

My question is: How are they implemented? That is, what happens behind when we include or seek a value in a hash?

Just know that its implementation is different from chained structures like stacks, queues, lists, binary trees, etc.

  • Remember that this is a dictionary. 'Hash' is a metonym. The implementation can use hash, but it can also use vector, or tree, or something else. Python dictionary implementation changes from hash to tree in version 3.7 because key values are kept sorted.

2 answers

16


Each thing is one thing

First of all, it is important to understand that each type of data structure has a more specific application. So, say that hashes are faster than chained lists is to compare oranges with bananas.

A "table" is quite efficient to retrieve an element contained in it from a key. That means it’s quick to find a needle in a haystack if you have a shape (hash) to identify the needle.

However, going through an entire table is generally less efficient than going through an entire list.

In addition, only a vector-based list can efficiently access random elements.

Hashes

These are numerical values calculated by a function from an original value of arbitrary size. There are several algorithms for hash, the best ones are the ones that emit the least possible number of hashes repeated.

For example, the class String Java has the following method to generate the hash code:

public int More ...hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

The documentation of the method states that the hash shall be calculated as follows::

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Where:

  • s[i]: i-nth character of the string.
  • m: string size, in characters

In short, generate a hash is like generating a numeric ID for the object.

In Java, any object can have its own implementation of hashCode. This implementation should consider all attributes that identify the object. It is important that different objects have hashes different as far as possible.

Tables of Hashes

The number generated by a function hash can be used to easily locate an object in a kind of "table".

Internally, in Java for example, this table is an array. Each vector position is assigned a range of numbers. It would be like slots single-shelf.

Storing values

Suppose we create a Hashtable and the internal vector has 10 positions (slots). Suppose we add objects and enter keys values of hash vary from 1 to 100. Thus, hashes that return values from 1 to 10 will be placed in the first slot, those that return values from 11 to 20 in the second slot and so on.

If we add an object (1) whose hash of the key be 20 it will be placed in the second slot. If we add another object (2) with hash 45 he will be placed fifth slot. If we add yet another object (3) with hash 15, he will be putting also in the second slot. We would have the following data structure:

Slot 1: 
Slot 2: Objeto 1, Objeto 3
Slot 3: 
Slot 4: 
Slot 5: Objeto 2 
Slot 6: 
Slot 7: 
Slot 8: 
Slot 9: 
Slot 10: 

Note that each table item is also a type of vector. In Java, in the class Hashtable, this is implemented through a linked list of objects of the type Entry (entree).

Recovering values

Then, to recover an object, just pass the key to the table, this will calculate the hash and will go directly to slot to retrieve the item.

Continuing the example above, if I wanted to retrieve Object 3, I would have to pass the key to it. The table would calculate the hash code key, which would return the value 15. Hashes always return the same value to the same key. So with a simple account I know that the item will be found in the second slot. However, as we still have a list of objects, the table will have to go through the list of entries from slot 2 and verify, one by one, which is really Object 3.

Implications

The explanations above have several implications:

  • If there’s too much collision hashes, many objects will end up falling in few slots. there being a bad distribution, each time it is necessary to recover an item the efficiency will be bad, because it will be necessary to go through the list of elements of the slot and compare one to one.
  • The internal size of the table needs to be well calculated, so there are not many items in each slot, not many slots empty. In the Java implementation, when the slots begin to get full the Hashtable increases the number of slots and makes a redistribution.
  • In Java, the method that generates the hash can be called several times, so it is important to take care of the performance and do cache of value when possible.
  • Depending on the implementation of hash table, include many items repeatedly can be quite inefficient, as there will be repeated breaks to add new slots and redistribute the elements.

Different types of implementations

A large part of what has been described above is just one of the possible types of use implementation of hashes to recover values.

For example, in Java, there is the Treemap, a "table" whose hashes entries are not stored in a vector, but in a binary tree.

There are several implications to this. Depending on the tree balance it can be more or less efficient to recover elements. Comparing a balanced tree with an array implementation with slots filled the tree would win.

However, I have already made some implementations that needed to carry very large trees in memory and the TreeMap was much more efficient.

  • Excellent answer! Approached all the details in a simple way.

  • Very enlightening the explanation.

0

A table hash is a structure that allows you to associate a key to a value and then have access to the value from its associated key. The key is transformed by a function into a position in the table; always using this transformation, the location of keys in the table is performed quickly.

In Java, this data structure is implemented by Hashtable class objects. To manipulate elements in the hash table, put() methods are used, which stores a pair of objects specified in the table, get(), which returns the value object associated with the specified key object, and remove(), which removes the pair of objects with the specified key. It is also possible to check whether a certain key exists in the table, with the containsKey() method, or whether a certain value is present in the table associated with any key, with the contains() method. The number of element pairs in the table can be obtained with the size method().

  • 1

    I didn’t really want to know how to use it, but how it’s implemented from behind, that is, what happens when we seek or include a value in a hash.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.