2
\$\begingroup\$

How to make the following code more efficient?

""" Calculate the posterior distribution of p(pos|word) using Baye's rule: p(pos|word) \propto p(word|pos)p(pos). """ word_pos_dict = defaultdict(float) # Example, word|pos -> 0.0012 pos_dict = defaultdict(float) # pos -> 0.005 pos_word_dict = defaultdict(float) # pos|word -> 0.017 for word_pos, word_pos_prob in word_pos_dict.items(): word, pos = word_pos.split('|') marginal_prob = 0 for pos_prob in pos_dict.values(): marginal_prob += word_pos_prob * pos_prob # Marginal prob of p(word) pos_word_prob = word_pos_prob * pos_dict[pos] pos_word = pos + '|' + word pos_word_dict[pos_word] = pos_word_prob / marginal_prob 

In practice, the length of word_pos_dict is 57,602, and pos_dict has 984 elements, which make this calculation much slower. Is there something wrong with the implementation, design or algorithm?

\$\endgroup\$

2 Answers 2

5
\$\begingroup\$

One obvious improvement:

for word_pos, word_pos_prob in word_pos_dict.items(): marginal_prob = 0 for pos_prob in pos_dict.values(): marginal_prob += word_pos_prob * pos_prob 

Simplifies to:

multiplier = 0 for pos_prob in pos_dict.values(): multiplier += pos_prob # or just 'multiplier = sum(pos_dict.values())' for word_pos, word_pos_prob in word_pos_dict.items(): marginal_prob = word_pos_prob * multiplier 

Now the inner loop only runs once.

\$\endgroup\$
4
\$\begingroup\$

which make this calculation much slower.

An immediate question is much slower than what?

Few optimisations are obvious, but they are just nitpicking. Asymptotics remains same no matter what.

First, I'd make word_pos_dict indices tuples instead of string. That would save you string splitting and concatenation.

Second, swapping inner/outer loops in your case would be beneficial: as coded, you initialize inner loop 57602 times; when swapped, the initialization would happen only 984 times.

\$\endgroup\$
2
  • \$\begingroup\$ Sorry for the vague expression. I thought that it would be executed in minutes but hours. \$\endgroup\$ Commented Jun 18, 2014 at 10:21
  • \$\begingroup\$ I have to say that using tuples instead of string is a beautiful solution but it will cost more time. \$\endgroup\$ Commented Jun 19, 2014 at 6:24

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.