0

I have a dictionary in some code which maps a key to a word, the key is the result of an md5 hash. I have code that essentially wants to get the key for a word, and when it doesn't already exist, add it to the dictionary

Here was my first implementation:

key = int(hashlib.md5(word).hexdigest(), 16) if key in self.id_to_word.keys(): assert word == self.id_to_word[key] else: self.id_to_word[key] = word return key 

After profiling my code I found this to be EXTREMELY slow. So then I tried this, which is functionally equivalent

key = int(hashlib.md5(word).hexdigest(), 16) try: assert word == self.id_to_word[key] return key except KeyError: self.id_to_word[key] = word 

This turned out to be incredibly faster. While I'm certainly happy about the performance improvement, I was wondering if someone could explain to me why. Is it bad practice to check for something in a keys() function from a dictionary like that? Is it generating copies of that every time (wasting a lot of computation)?

4
  • can just do key in some_dict, don't have to use .keys() Commented Apr 8, 2015 at 19:04
  • Could you add the actual time results? Commented Apr 8, 2015 at 19:05
  • .keys() creates a list of all the keys, and the in statement searches all those keys, an O(N) operation. key in dict is O(1) operation Commented Apr 8, 2015 at 19:05
  • This discussion doesn't work for you: stackoverflow.com/q/1835756/349420 ? Commented Apr 8, 2015 at 19:08

5 Answers 5

3

id_to_word.keys() creates a new list, which is linearly searched, which is much slower than a hash lookup. Remove the .keys().

The fastest way would be:

key = int(hashlib.md5(word).hexdigest(), 16) assert word == self.id_to_word.setdefault(key, word) 
Sign up to request clarification or add additional context in comments.

1 Comment

The only issue with this is that assert should not be used in production code. python -O (the optimize flag) will remove an assert from code. At that time, your code breaks.
2

key in some_dict is much faster than key in some_dict.keys()

Dict Lookup key in some_dict is O(1) complexity so its very fast

that said its still (very marginally)slower in the case where the key is in the dict than just try/except

the real answer is there is no real measurable difference between these 2 methods and do whatever feels right to you

Comments

2

This is to be expected (in python2). The keys() method returns a list of keys. So using the in operator on the list takes linear time.

Trying to access the item is constant time, which is much faster.

Note: you can simply use key in dictionary instead of the try: ...except:.


Note that dictionaries have a setdefault method that already does what you want. Moreover if you do that operation a lot of time you should consider using collections.defaultdict instead of a plain dictionary.

Comments

1

Elaborating on my comments above:

In [4]: d = {k:k for k in xrange(1000)} In [5]: %timeit 50 in d 10000000 loops, best of 3: 68.6 ns per loop In [6]: %timeit 50 in d.keys() 100000 loops, best of 3: 6.35 µs per loop 

as you can see, using d.keys() is about 100 times slower.

Comments

0

As mentioned setdefault() solves your problem without an if or try block.
But "Easier to Ask for Forgiveness than Permission" [EAFP] and duck-typing is a common idiom in Python, compared to the more defensive "Look Before You Leap" [LBYL] idiom common in other languages, e.g. Java, C++.

Jeff Knuth has an interesting blog post on it Write Clean Python: Use Exceptions

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.