The main requirements of a hashCode are:

1. If A == B, then hashCode(A) == hashCode(B).
(The converse is not true: two different objects can have the same hashCode. This is called a collision, and in practice it is usually inevitable.)
2. Hashes should be distributed homogeneously over the domain objects (to decrease the chance of collisions).
3. It must be quick to compute, otherwise the performance gain from using a hash table is negated by the cost of computing the hashes (see the sketch after this list).
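To make the role of these requirements concrete, here is a minimal sketch (with hypothetical names, not a real implementation) of a hash table that uses a user-supplied hashCode to choose a bucket. If requirement 1 is violated, lookups can search the wrong bucket; if the hash is slow, every put and get pays that cost:

// Hypothetical hash table that relies on the requirements above
function HashTable(numBuckets, hashCode) {
    this.buckets = new Array(numBuckets);
    this.hashCode = hashCode;
}

HashTable.prototype.put = function (key, value) {
    var i = Math.abs(this.hashCode(key)) % this.buckets.length;
    var bucket = this.buckets[i] || (this.buckets[i] = []);
    for (var j = 0; j < bucket.length; j++)   // collisions end up in the same bucket,
        if (bucket[j].key === key) {          // resolved by a linear scan
            bucket[j].value = value;
            return;
        }
    bucket.push({ key: key, value: value });
};

HashTable.prototype.get = function (key) {
    var bucket = this.buckets[Math.abs(this.hashCode(key)) % this.buckets.length] || [];
    for (var j = 0; j < bucket.length; j++)
        if (bucket[j].key === key)
            return bucket[j].value;
    return undefined;
};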
In order to ensure efficiency, it is generally necessary to create a hashCode specific to each situation (your example code corresponds to a good hash for strings in general, as long as they are not too long). Even when a specific equality criterion is used, it is important that the hashCode used be consistent with that criterion (see requirement 1 above).
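For example (a hypothetical sketch, with names of my own choosing): if two objects are considered equal whenever their name matches case-insensitively, then the hashCode must ignore case as well, or requirement 1 is broken:

// Hypothetical equality criterion: names compared case-insensitively
function equalsByName(a, b) {
    return a.name.toLowerCase() === b.name.toLowerCase();
}

// A consistent hashCode must normalize the case before hashing, otherwise
// two objects that are "equal" by the criterion above could hash differently
function hashCodeByName(obj) {
    var s = obj.name.toLowerCase();
    var hash = 0;
    for (var i = 0; i < s.length; i++)
        hash = ((hash << 5) - hash + s.charCodeAt(i)) | 0; // simple 32-bit string hash
    return hash;
}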
In the absence of anything more specific, some ways of computing the hashCode would be (in increasing order of homogeneity, but decreasing order of performance):
// constant (fast, but useless)
function hashCode1(obj) { return 0; }

// counts the number of properties
function hashCode2(obj) {
    var contagem = 0;
    for ( var p in obj )
        contagem++;
    return contagem;
}

// takes only the property keys into account
function hashCode3(obj) {
    var hash = 0;
    for ( var p in obj )
        hash += hashString(p);
    return hash;
}

// takes both the keys and the values into account
function hashCode4(obj, profundidade) {
    if ( !profundidade ) profundidade = 0;
    var hash = 0;
    for ( var p in obj )
        hash += hashString(p) +
                (profundidade > 0 ? hashCode4(obj[p], profundidade - 1) : 0);
    return hash;
}
And so on and so forth... The "deeper" the hash goes into the object, the lower the chance of collisions (i.e. different objects with the same hash), but the longer the hash takes to compute (i.e. the lower the performance). It is up to you to determine, for your particular data set, the balance between precision and speed of computation that lets the hash-using algorithm perform at its best.
(Note: hashCode4 is simplified - in practice, the recursive call would also need to check the type of the object in order to call the most appropriate hash method for that type.)
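As a hypothetical sketch of that type check (still assuming the same hashString helper used above, which is left undefined here), the recursive step could dispatch on the type of each value:

// Dispatches to an appropriate hash depending on the type of the value
function hashAny(value, profundidade) {
    if (value === null || value === undefined) return 0;
    switch (typeof value) {
        case "number":  return value | 0;   // truncated to a 32-bit integer
        case "boolean": return value ? 1 : 0;
        case "string":  return hashString(value);
        case "object":  return profundidade > 0 ? hashCode4b(value, profundidade - 1) : 0;
        default:        return 0;           // functions, symbols etc. are ignored
    }
}

// Same idea as hashCode4, but recursing through hashAny
function hashCode4b(obj, profundidade) {
    var hash = 0;
    for (var p in obj)
        hash += hashString(p) + hashAny(obj[p], profundidade || 0);
    return hash;
}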
Update: in a comment on your answer to the related question, I stated that hashing the JSON of an object was "terribly inefficient". I would like to clarify that statement based on the analysis above:
In practice, it is rarely possible to use a hash with 100% precision (i.e. one that takes the complete structure of the object into account) and still achieve acceptable performance when actually using that hash. If the objects are very large - either because they have many properties, nested or not, or because they contain a large volume of data - the overall performance will be bad, regardless of the method chosen.
What makes "serialize a JSON object and hash the string" inefficient is not serialization, but the fact that you are aiming for 100% accuracy. And if that’s really necessary, stick to your implementation, because it actually (as your tests showed) is more efficient than my methods 3 and 4 above.
I'll leave just one more example, in case I haven't made myself clear yet. How would you put the following strings into a hash table:
"123blábláblábláblá...10Mb depois...bláblá"
"345blábláblábláblá...10Mb depois...bláblá"
"678blábláblábláblá...10Mb depois...bláblá"
- Using the complete contents of the string in the hash?
- Using only the first 256 bytes and ignoring the rest? (a sketch of this option follows below)

(And if your options were either to use the full content, or to use no hash at all and instead use a tree-based dictionary/map, which one would you choose?)
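A hypothetical sketch of the second option - hashing only a fixed-size prefix - could look like this; note that it measures characters rather than bytes, and that strings sharing the same prefix will all collide:

// Hashes only the first maxLength characters (default 256), ignoring the rest
function hashStringPrefix(s, maxLength) {
    var prefix = s.substring(0, maxLength || 256);
    var hash = 0;
    for (var i = 0; i < prefix.length; i++)
        hash = ((hash << 5) - hash + prefix.charCodeAt(i)) | 0;
    return hash;
}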
What would be the function hashString?
– BrunoLM
@Brunolm Just an example... You would need a different method to compute the hash of each data type. In that other question you mentioned that you took the hash code from some other question on SO, right? That code [taking the first line] would be a good candidate for hashString - although it is unsuitable for complex objects, IMHO.
– mgibsonbr
I get it, thank you :)
– BrunoLM
P.S. Could anyone suggest a better word than "homogeneity"? What I mean is that the (domain) objects should be evenly distributed over the hashes (the codomain, or image). "Homogeneity" suggests that the hashes would all look alike for every object, which is just the opposite of what I'm trying to say... I don't know if I was clear :P
– mgibsonbr
Look, I don't know if I did something wrong, but according to my tests my method over there seems to be faster: http://jsperf.com/js-hashcode-custom
– BrunoLM
@Brunolm There is nothing wrong; I never claimed that my methods were [all] more efficient than yours, just that efficiency decreases as precision grows. If you want 100% precision, stick with your method! It will on average - though not always - be better than my methods 3 and 4, for sure. The question you have to ask yourself is: in my specific application, will using a hash with 100% precision be more efficient than not using a hash at all (and using a tree instead, for example)?
– mgibsonbr