
I understand that since allocation happens at run time, there must be some housekeeping operations involved. But other than that, what is the overhead? Also, would it be wise to use a hashtable vs. an array when you need to store the number of times an integer element appears in an infinite stream of numbers?

  • Which programming language? And computers are notably bad at dealing with "infinite streams" of anything. Commented Mar 4, 2010 at 18:25
  • If you care to read, there is a pretty long article: enfranchisedmind.com/blog/posts/problems-with-hash-tables Commented Mar 4, 2010 at 18:34
  • Okay, not infinite, but imagine that there is a stream of numbers coming in in real time. Is hashtable vs. array language-dependent? I was thinking more on the conceptual level: which one would be the better choice if you needed to calculate how many times a number occurs in a stream of numbers. Commented Mar 4, 2010 at 18:36

3 Answers


Theoretically, it depends on how many unique numbers there are in the stream. But in any real-life scenario I can imagine, an array would be ridiculously slower. The more unique numbers you process, the slower the array solution will become.

A hashtable generally maintains the same access speed no matter how large it becomes. For an 'infinite stream', I can't imagine how a hashtable wouldn't be the better solution. How do you intend to search the array?


5 Comments

+1 — and given an infinite stream of numbers, the program will be adding entries to the hashtable till the end of the world.
You're assuming that the 'infinite' stream has a large set of unique numbers. This is not necessarily true; consider frequencies of letters in Gutenberg texts, or frequencies of readings from a 10-bit A/D converter.
Okay, what if there is an equal distribution of repeated and unique values? So you could have something like {1,2,3,10000,5594594,10000,559459410000,5594594}; obviously it is impossible to allocate such a large array. Does it then make sense to just use a hashtable, or would some other data structure be better here?
Phoenix: A hashtable is just fine here. One of the things you can do is to preallocate it if you have an educated guess of the count of unique numbers. Say you know that you'll have at most 1000 unique items. By preallocating the hash to this size, you will avoid rehashing if you have fewer than 1001 unique numbers. On the other hand, if you don't know the upper bound, but you know that you'll have at least 1 000 000 unique numbers, why not preallocate to, say, 2 000 000? This way, you are good to go without rehashing for up to at least 2 000 000 unique numbers.
...and you most certainly avoid rehashing for the first 1 000 000 insertions. ;-) (That you know for sure, will happen.)

As Neil's comment implies, the overhead in hashtable implementations depends heavily on the particular implementation of a hash table. Typically, though, there will be storage overhead from unused hashes, and both storage and time overhead from dealing with hash collisions. There is of course also time overhead from computing the hash values.

In answer to your second question, this depends very much on the details of your stream of numbers, and the other aspects of your program. Some questions to consider:

  • Is the set of possible numbers large or small? (How big an array would you need to create?)

  • Out of the range of possible numbers, do you expect most of them to be used, or only a few? If you expect most of the possible numbers in the range to be used, then using a hash table will not save you much space.

  • Do you know the range of possible numbers before you start? Or is this unknown? Hash tables can deal with unknown ranges much more easily.

  • How important is saving storage space in this program? Can you easily afford to allocate an array of the necessary size? If you can easily afford to allocate the array, why bother with a hash table?

  • How important is runtime speed in this program? Arrays will usually be faster.

2 Comments

I do not know the range of numbers. Imagine that I have 10 web pages in my application (but that is only for the time being — I could add more tomorrow) and I have to maintain the number of times each page has been viewed. Would you say a hashtable is the best data structure in this situation, where I could keep a page counter against the URL of the page?
Yes, that's definitely a situation where a hashtable would be better than an array.

Hashtables are pretty fast. Just as an experiment, I get about a 50x slowdown between a raw array and a C++ hash_map (compile with the #if toggled both ways and try it yourself).

#include <ext/hash_map>
using namespace __gnu_cxx;

int main() {
#if 0
    hash_map<int, int> table;
    for (int i = 0; i < 256; i++)
        table[i] = 0;
#else
    int table[256];
#endif
    for (int i = 0; i < 100000000; i++) {
        table[i & 0xff]++;
    }
}

2 Comments

This is so micro. Out of curiosity, do you have a comparison on what happens when the table size exceeds your processor cache size?
Yes, it is super micro. I'm not claiming a definitive result here, just getting a feel for the magnitude. As for table size > cache size, try it yourself!
