What is the most efficient way to calculate the hash code of an object in JavaScript?


I’m currently using

Object.prototype.GetHashCode = function () {
    // serialize objects to JSON; fall back to toString() for primitives
    var s = this instanceof Object ? JSON.stringify(this) : this.toString();

    var hash = 0;
    if (s.length === 0) return hash;
    for (var i = 0; i < s.length; ++i) {
        hash = ((hash << 5) - hash) + s.charCodeAt(i); // hash * 31 + charCode
        hash |= 0; // keep the hash in the 32-bit integer range
    }
    return hash;
};
Number.prototype.GetHashCode = function () { return this.valueOf(); };
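
For example (an illustrative usage sketch):

var n = (42).GetHashCode();                  // 42, via Number.prototype
var o = { a: 1, b: [2, 3] }.GetHashCode();   // 32-bit hash of '{"a":1,"b":[2,3]}'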

For numbers this is very fast, but for complex objects it is very inefficient, mainly due to the conversion to JSON.

Is there any way to calculate the hashcode more efficiently?

2 answers

The main requirements of a hashCode are:

  1. If A == B, then hashCode(A) == hashCode(B);

    (The converse is not true: two objects can have the same hashCode and yet be different. This is called a collision, and in practice it is usually unavoidable.)

  2. Hashes should be distributed homogeneously across the objects of the domain (to decrease the chance of collisions);

  3. It must be quick to compute, otherwise the performance gain from using a hash table is negated by the cost of computing the hashes.

To ensure efficiency, it is generally necessary to create a hashCode specific to each situation (your example code is a good hash for strings in general - as long as they are not too long). Also, when a specific equality criterion is used, it is important that the hashCode be consistent with that criterion (see requirement 1 above).
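
For example (a hypothetical sketch, not from the question): if two points are considered equal whenever their coordinates match, the hashCode must depend only on those coordinates, so that requirement 1 holds:

function pointEquals(a, b) {
    return a.x === b.x && a.y === b.y;
}

function pointHashCode(p) {
    // equal coordinates always produce equal hashes
    return (((p.x | 0) * 31) + (p.y | 0)) | 0;
}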

In the absence of anything more specific, some ways to calculate the hashCode would be (in increasing order of homogeneity, but decreasing order of performance):

// constant (fast, but useless)
function hashCode1(obj) { return 0; }

// counts the number of properties
function hashCode2(obj) {
    var contagem = 0; // count
    for (var p in obj)
        contagem++;
    return contagem;
}

// takes only the property keys into account
function hashCode3(obj) {
    var hash = 0;
    for (var p in obj)
        hash += hashString(p); // hashString: a string hash such as the one in the question
    return hash;
}

// takes both the keys and the values into account
function hashCode4(obj, profundidade) { // profundidade: recursion depth
    if (!profundidade) profundidade = 0;
    var hash = 0;
    for (var p in obj)
        hash += hashString(p) +
                (profundidade > 0 ? hashCode4(obj[p], profundidade - 1) : 0);
    return hash;
}
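
To illustrate the first two (hypothetical values, assuming no enumerable properties have been added to Object.prototype):

var obj = { a: 1, b: { c: 2 } };
hashCode1(obj); // 0 - the same for every object
hashCode2(obj); // 2 - two enumerable properties: "a" and "b"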

And so on and so forth... The "deeper" you go into the object, the lower the chance of collisions (i.e. of different objects having the same hash), but the longer the hash takes to compute (i.e. the lower the performance). It is up to you to determine, for your particular dataset, the best balance between accuracy and speed of computation - the one that makes the algorithm using the hash perform at its best.

(Note: hashCode4 is simplified - in practice, the recursive call would also need to check the type of the value in order to call the hash method most appropriate for that type.)
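
A minimal sketch of that dispatch (hashAny is a hypothetical name; hashString is assumed to be a string hash such as the one in the question):

function hashAny(valor, profundidade) {
    if (valor === null || valor === undefined) return 0;
    switch (typeof valor) {
        case "number":  return valor | 0;   // truncate to a 32-bit integer
        case "boolean": return valor ? 1 : 2;
        case "string":  return hashString(valor);
        case "object":  return hashCode4(valor, profundidade);
        default:        return 0;           // functions, symbols, etc.
    }
}

The recursive call inside hashCode4 would then become hashAny(obj[p], profundidade - 1).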


Update: in a comment on your answer to the related question, I stated that hashing the JSON of an object was "terribly inefficient". I would like to qualify that statement in light of the analysis above:

In practice, it is rarely possible to use a hash with 100% accuracy (i.e. one that takes the complete structure of the object into account) and still achieve acceptable performance when putting that hash into practice. If the objects are very large - either because they have many properties, nested or not, or because they contain a large volume of data - the overall performance will be bad regardless of the method chosen.

What makes "serialize the object to JSON and hash the string" inefficient is not the serialization, but the fact that you are aiming for 100% accuracy. If that is really necessary, stick with your implementation, because (as your tests showed) it actually is more efficient than my methods 3 and 4 above.

Let me leave just one more example, in case I have not made it clear yet. How would you put the following strings into a hash table:

"123blábláblábláblá...10Mb depois...bláblá"
"345blábláblábláblá...10Mb depois...bláblá"
"678blábláblábláblá...10Mb depois...bláblá"
  1. Using the complete contents of the string in the hash?
  2. Using only the first 256 bytes and ignoring the rest?

(And if your only options were to use the full content, or to use no hash at all and instead use a tree-based dictionary/map, which one would you choose?)
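
A sketch of option 2 (hashStringPrefix is a hypothetical name; the 256-byte limit comes from the example above):

function hashStringPrefix(s, maxLen) {
    var limite = Math.min(s.length, maxLen || 256); // hash at most maxLen characters
    var hash = 0;
    for (var i = 0; i < limite; ++i) {
        hash = ((hash << 5) - hash) + s.charCodeAt(i);
        hash |= 0; // keep the hash in the 32-bit integer range
    }
    return hash;
}

For the three 10 MB strings above this reads only the first 256 characters, so hashing stays fast - at the cost of extra collisions whenever strings share a long common prefix.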

  • What would the hashString function be?

  • @Brunolm That was just one example... You would need a different method to calculate the hash for each data type. On that other question you mentioned that you took the hash code from some other question on SO, right? That code [minus the first line] would be a good candidate for hashString - although it is unsuitable for complex objects, IMHO.

  • I get it, thank you :)

  • P.S. Could anyone suggest a better word than "homogeneity"? What I mean is that the objects (the domain) should be evenly distributed over the hashes (the codomain, or image). "Homogeneity" suggests that the hashes would all look alike for every object, which is just the opposite of what I’m trying to say... I don’t know if I was clear :P

  • Look, I don’t know if I did something wrong, but according to my tests my method over there seems to be faster: http://jsperf.com/js-hashcode-custom

  • @Brunolm No, there is nothing wrong; I never claimed that my methods were [all] more efficient than yours, simply that efficiency decreases as accuracy grows. If you want 100% accuracy, stick with your method! It will be better on average - though not always - than my methods 3 and 4, for sure. The question you have to ask yourself is: in my specific application, will using a hash with 100% accuracy be more efficient than not using a hash at all (and using a tree instead, for example)?
