Question: say you use a hash which returns a 32-bit integer. If you were actually implementing a hash table, would you need to declare a structure with 2^32 elements? `int buckets[2^32]`? This seems unwieldy!
Would an actual hash table only use, say, the first 10 bits or something of a hash function (int buckets[1024]) to make it less sparse (albeit increase collisions?)
If you decide you want more buckets later on, would you have to re-hash everything in your array and move it to a new, bigger one?
Yeah, you've pretty much got it. A common alternative to masking bits off of the hash is to take the hash modulo the size of the table as the index (although you have to be careful with a modulo strategy not to introduce a bias towards certain table slots).
If the hash function is any good, simply bitwise-AND enough bits of the hash and use that as the index. (That's a modulo too, just modulo 2^x.)
Using powers of two generally fits nicely with programming on binary computers. It also has the nice property that the number of slots is itself a power of two, so the table packs naturally into, for example, whole memory pages.
Does the modulo operation add a negligible cost to these (very cheap) hashing functions?
i.e. if you're hashing a 10-character (non-Unicode) word with FNV-1a, you're doing 10 multiplications and 10 XORs. Adding a modulo (by a prime†) operation in there could feasibly double the time taken.
The modulo operation is pretty much required unless you have 2^32 bytes of RAM available for each hash table. Modulo is also ridiculously cheap, especially compared to the hashing function.
Certainly, if you are hashing a large amount of data then a single modulo operation isn't going to cost a lot. But if you're just hashing an English word, possibly even a URL, I'm not so sure.
You could shift or mask the result (as described above) to use a table size of any power of 2 to avoid the modulo.
It won't be a compile-time constant unless your hash table is incapable of resizing. And modulo arithmetic on arbitrary values is slow (20-94 cycles of latency on Sandy Bridge).
Cross that bridge when you get there. If you're really finding that your hash function is the slowest piece of code in a tight inner loop, and the problem isn't collisions (from having a bad hash function), then look into alternatives. Before that, you're doing premature optimization.
Please read the whole thread before replying with the "premature optimization" thing. We are talking about hash table optimizations; this commenter -- http://news.ycombinator.com/item?id=4037320 -- is asking about the cost of modulo operation.
(1) The trick for resizing is that we only rehash all k elements to recalculate buckets every k inserts or so, doubling the table size each time. Suppose you just came from a resize: you had k elements and had performed N(k) hashes. When you hit 2k elements you resize again, and you will have performed N(2k) = N(k) + k + 2k total hashes (k for the new inserts, 2k for the rehash). This recurrence is solved by N(k) = 3k + C for an arbitrary constant C. Averaged over the k elements you have inserted, a large dictionary hashes each element only 2-3 times on average: three if the last insert just triggered a resize, two if you stop just before triggering the next one.
(2) Strictly speaking you don't need this overhead: you can use trees to keep it sparse as well, although as far as I know the per-operation overhead slows trees down too much in practice to be worth the space savings. That is, you could declare a binary tree structure containing only those buckets and pathways of the 2^32 which you actually need, but finding something within the structure requires checking and following ~32 pointers, and storing something requires creating up to ~32 nodes.
It depends on your performance use case. 2^32 bits is only about 540MB of RAM, so you can allocate that all at once and then all subsequent actions are reads/writes.
Yes, if you're only storing a single bit for each. Usually, you'll want to store a pointer to the head of a linked list. At 64 bits per pointer, you'll need to allocate 32 gigs of RAM - slightly less feasible.
Sure, you could waste all the space you want, but why in the world would you do that? Using a linked list is slow and inefficient with memory, especially when you can do the same thing with a byte array.
Bloom filters are not suitable as stand-in replacements for hash tables when you are representing the Set abstract data type. They are not deterministic, they treat objects with the same hash as identical, and they do not actually store values.
Bloom filters are only useful (in this context) to augment a proper Set implementation, in which case they increase the memory cost, in exchange for decreasing (on average) the I/O cost.
What about bloom filter + sparse hash? You'd be able to see whether an object isn't in the hash table efficiently and pay the price of a longer lookup time. Could be useful for some situations with tiny true positive rates; say, a web browser's list of malwared sites.