The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster: a binary search tree costs O(log n) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookups and insertions will win, even allowing for the occasional resize.
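(The standard library has no AVL tree to measure against, but a sorted list probed with `bisect` does the same O(log n) search that a balanced tree would, so a rough, unscientific way to test the guess is something like the sketch below; the key values and function names are illustrative, not a real benchmark.)

    import timeit
    from bisect import bisect_left

    n = 1_000_000
    keys = list(range(n))      # sorted keys: a stand-in for a balanced tree
    d = dict.fromkeys(keys)    # the hash table

    def dict_lookup():
        return 123456 in d                 # expected O(1)

    def tree_like_lookup():
        i = bisect_left(keys, 123456)      # O(log n) probing
        return i < n and keys[i] == 123456

    print("dict:  ", timeit.timeit(dict_lookup, number=100_000))
    print("bisect:", timeit.timeit(tree_like_lookup, number=100_000))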
If you take a look at the Python dictionary implementation, you'll see that:
- a dictionary starts out with 8 entries (PyDict_MINSIZE);
- a dictionary with 50,000 or fewer entries quadruples in size when it grows;
- a dictionary with more than 50,000 entries doubles in size when it grows;
- key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.
(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)
So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of roughly 2,009,768 operations' worth of extra work during the resizes (that figure is the sum of the successive table sizes, since each resize walks the old table and reinserts its live entries). This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.
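As a quick sanity check on that arithmetic, here's a sketch that replays the growth policy described above, assuming a resize triggers when the table is about two-thirds full and costs work proportional to the old table's size:

    # Replay the growth policy: quadruple while holding <= 50,000 entries,
    # double once the dictionary is bigger than that.
    sizes = [8]
    while sizes[-1] < 2_000_000:
        entries = sizes[-1] * 2 // 3   # a table resizes at ~2/3 full
        sizes.append(sizes[-1] * (4 if entries <= 50000 else 2))

    print(" -> ".join(map(str, sizes)))           # 8 -> 32 -> ... -> 2097152
    print(len(sizes) - 1, "resizes")              # 11
    print(format(sum(sizes[:-1]), ","), "slots")  # 2,009,768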