Question: say you use a hash which returns a 32-bit integer. If you were actually implementing a hash table, would you need to declare a structure with 2^32 elements? `int buckets[2^32]`? This seems unwieldy!
Would an actual hash table only use, say, the first 10 bits or something of a hash function (int buckets[1024]) to make it less sparse (albeit increase collisions?)
If you decide you want more buckets later on, would you have to re-hash everything in your array and move it to a new, bigger one?
Yeah, you've pretty much got it. A common alternative to masking bits off of the hash is to take the hash modulo the size of the table as the index (although you have to be careful with a modulo strategy not to introduce a bias towards certain table slots).
If the hash function is any good, simply bitwise-AND enough bits of the hash and use that as the index. (That's a modulo too, just modulo 2^x.)
Using powers of two generally fits nicely with programming on binary computers. It also has the nice property that the number of slots is itself a power of two, so the table packs naturally into, for example, whole memory pages.
Does the modulo operation add a negligible cost to these (very cheap) hashing functions?
i.e. if you're hashing a 10-character (non-Unicode) word with FNV-1a, you're doing 10 multiplications and 10 XORs. Adding a modulo (by a prime†) operation in there could feasibly double the time taken.
The modulo operation is pretty much required unless you have 2^32 bytes of RAM available for each hash table. Modulo is also ridiculously cheap, especially compared to the hashing function.
Certainly, if you are hashing a large amount of data then a single modulo operation isn't going to cost a lot. But if you're just hashing an English word, possibly even a URL, I'm not so sure.
You could shift or mask the result (as described above) to use a table size of any power of 2 to avoid the modulo.
It won't be a compile-time constant unless your hash table is incapable of resizing. And modulo arithmetic on arbitrary values is slow (20-94 cycles of latency on Sandy Bridge).
Cross that bridge when you get there. If you're really finding that your hash function is the slowest piece of code in a tight inner loop, and the problem isn't collisions (from having a bad hash function), then look into alternatives. Before that, you're doing premature optimization.
Please read the whole thread before replying with the "premature optimization" thing. We are talking about hash table optimizations; this commenter -- http://news.ycombinator.com/item?id=4037320 -- is asking about the cost of modulo operation.
(1) The trick for resizing is that we only rehash all k elements to recalculate buckets every k inserts or so, doubling the table size each time. Suppose you just came from a resize: you had k elements and had performed N(k) hashes. When you hit 2k elements you resize again, and you will have performed N(2k) = N(k) + k + 2k total hashes (k for the new inserts, 2k for the rehash). This recurrence is solved by N(k) = 3k + C for an arbitrary constant C. Averaged over the k elements you have inserted, a large dictionary hashes each element only 2-3 times on average: three if the last insert just triggered a resize, two if you stop just before triggering the next one.
(2) Strictly speaking you don't need this overhead: you can use trees to keep it sparse as well, although as far as I know the per-operation overhead slows trees down too much in practice to be worth the space savings. That is, you could declare a binary tree structure containing only those buckets and pathways of the 2^32 which you actually need, but finding something within the structure requires checking and following ~32 pointers, and storing something requires creating up to ~32 nodes.
It depends on your performance use case. 2^32 bits is only about 540MB of RAM, so you can allocate that all at once and then all subsequent actions are reads/writes.
Yes, if you're only storing a single bit for each. Usually, you'll want to store a pointer to the head of a linked list. At 64 bits per pointer, you'll need to allocate 32 gigs of RAM - slightly less feasible.
Sure, you could waste all the space you want, but why in the world would you do that? Using a linked list is slow and inefficient with memory, especially when you can do the same thing with a byte array.
Bloom filters are not suitable as stand-in replacements for hash tables when you are representing the Set abstract data type. They are not deterministic, they treat objects with the same hash as identical, and they do not actually store values.
Bloom filters are only useful (in this context) to augment a proper Set implementation, in which case they increase the memory cost, in exchange for decreasing (on average) the I/O cost.
What about bloom filter + sparse hash? You'd be able to see whether an object isn't in the hash table efficiently and pay the price of a longer lookup time. Could be useful for some situations with tiny true positive rates; say, a web browser's list of malwared sites.