5

Let's say we're solving a simple word count problem. There's a list and we're trying to find the frequency of each word occurring in the list, using a dictionary to keep the counts.

Which pattern is faster here?

book_title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin'] word_count_dict = {} 

Pattern 1

for word in book_title: if word in word_count_dict: word_count_dict[word] += 1 else: word_count_dict[word] = 1 

Pattern 2

for word in book_title: if word not in word_count_dict: word_count_dict[word] = 1 else: word_count_dict[word] += 1 
0

5 Answers 5

4

Six of one, half a dozen of the other. They should be roughly equivalent to each other - in computational terms, a not operation is nearly negligible (literally the cheapest possible operation), and the in operation in a hashtable like a dictionary runs in constant time (either the hash is there, or it's not). If we were dealing with a list, it would run in linear time, but still the between in and not in. See also computational complexity of python data structures.

So basically, use whichever makes your code more understandable.


That said, have you considered using a collections.Counter, a data structure specifically designed for this purpose?

import collections book_title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin'] word_counts = collections.Counter(book_title) print(word_counts) # Counter({'great': 2, 'the': 2, 'adventures': 2, 'of': 2, 'expectations': 1, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1}) 

You can typecast a collections.Counter to a dict if you need to, and in fact collections.Counter is a subclass of dict. It even has an .update() method specifically designed to work with other counters - if you add another book title, just feed it into a Counter and then .update() the original with that.

Sign up to request clarification or add additional context in comments.

Comments

4

Using dis, you can look at the bytecode generated for each of your methods.

The only difference I could see running your code through the disassembler was right at in and not in, where the difference in bytecode was:

COMPARE_OP 7 (not in) or COMPARE_OP 6 (in)

And afterwards POP_JUMP_IF_FALSE (i.e. continue onto the next instruction for this condition)

All in all, both of the methods seems to have the same amount of instructions, regardless of the comparison returning true or false, and therefore most probably executes equally fast.


There might be, however, some underlying optimizations closer to CPU instructions which would cause one or the other methods to be faster, but I would consider that territory out of scope for this question. If this would be the case, then I believe a simple measure of execution time over a larger list would prove which one is faster.

Execution speed of both instructions, beneath Python bytecode, might differ between Python versions, build, OS or architecture. You'd probably be able to make a small change in the Python source code to make one or the other instruction execute faster.

Comments

1

They have approximately the same cost. Think of not in operator as being in operator applied first and then a logical not applied to that result(which is almost negligible).

For confirmation, here's a little experiment you can perform to measure execution time

from time import time book_title = ['hi']*100000+['there']*10000 word_count_dict1 = {} word_count_dict2 = {} start = time() for word in book_title: if word in word_count_dict1: word_count_dict1[word] += 1 else: word_count_dict1[word] = 1 print(time()-start) start = time() for word in book_title: if word not in word_count_dict2: word_count_dict2[word] = 1 else: word_count_dict2[word] += 1 print(time()-start) 

Output (may vary for you)

0.021044015884399414 0.02713179588317871 

Comments

0

Basically, they have the same cost. From Python 3 reference

The operator not in is defined to have the inverse truth value of in.

Comments

0

You can check the timing of operation for both pattern and compare yourself.

import timeit def pattern1(): title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin'] counts = {} for word in title: if word in counts: counts[word] += 1 else: counts[word] = 1 def pattern2(): title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin'] counts = {} for word in title: if word not in counts: counts[word] = 1 else: counts[word] += 1 sample1 = [timeit.timeit(pattern1, number=10000) for _ in range(10)] sample2 = [timeit.timeit(pattern2, number=10000) for _ in range(10)] print(sum(sample1) / len(sample1)) # 0.01713230140048836 print(sum(sample2) / len(sample2)) # 0.017954919600015273 

As we can see, the difference is negligible.

2 Comments

Can you give an example pattern1 and pattern2 applicable to this question and sample output? Otherwise this answer is overly broad and not very useful for OP.
I could have done that @jayelm. But his pattern2 has variable name word_counter which was not defined in the question. So I was unsure if word_counter is an empty dictionary or not.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.