The efficiency of looping over nested dictionaries in Python

Question

How to make the following code more efficient?

""" Calculate the posterior distribution of p(pos|word) using Baye's rule: p(pos|word) \propto p(word|pos)p(pos). """ word_pos_dict = defaultdict(float) # Example, word|pos -> 0.0012 pos_dict = defaultdict(float) # pos -> 0.005 pos_word_dict = defaultdict(float) # pos|word -> 0.017 for word_pos, word_pos_prob in word_pos_dict.items(): word, pos = word_pos.split('|') marginal_prob = 0 for pos_prob in pos_dict.values(): marginal_prob += word_pos_prob * pos_prob # Marginal prob of p(word) pos_word_prob = word_pos_prob * pos_dict[pos] pos_word = pos + '|' + word pos_word_dict[pos_word] = pos_word_prob / marginal_prob

In practice, the length of word_pos_dict is 57,602, and pos_dict has 984 elements, which make this calculation much slower. Is there something wrong with the implementation, design or algorithm?

jonrsharpe · Accepted Answer · 2014-06-18 10:06:27Z

One obvious improvement:

for word_pos, word_pos_prob in word_pos_dict.items(): marginal_prob = 0 for pos_prob in pos_dict.values(): marginal_prob += word_pos_prob * pos_prob

Simplifies to:

multiplier = 0 for pos_prob in pos_dict.values(): multiplier += pos_prob # or just 'multiplier = sum(pos_dict.values())' for word_pos, word_pos_prob in word_pos_dict.items(): marginal_prob = word_pos_prob * multiplier

Now the inner loop only runs once.

vnp · Accepted Answer · 2014-06-18 09:22:38Z

which make this calculation much slower.

An immediate question is much slower than what?

Few optimisations are obvious, but they are just nitpicking. Asymptotics remains same no matter what.

First, I'd make word_pos_dict indices tuples instead of string. That would save you string splitting and concatenation.

Second, swapping inner/outer loops in your case would be beneficial: as coded, you initialize inner loop 57602 times; when swapped, the initialization would happen only 984 times.

Sorry for the vague expression. I thought that it would be executed in minutes but hours. — fishiwhj
– fishiwhj, Commented Jun 18, 2014 at 10:21
I have to say that using tuples instead of string is a beautiful solution but it will cost more time. — fishiwhj
– fishiwhj, Commented Jun 19, 2014 at 6:24

Stack Exchange Network

The efficiency of looping over nested dictionaries in Python

2 Answers 2

You must log in to answer this question.

Hot Network Questions

The efficiency of looping over nested dictionaries in Python

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions