Very large execution time differences for virtually same C++ and Python code

Question

I was trying to write a solution for Problem 12 (Project Euler) in Python. The solution was just too slow, so I tried checking up other people's solution on the internet. I found this code written in C++ which does virtually the same exact thing as my python code, with just a few insignificant differences.

Python:

def find_number_of_divisiors(n): if n == 1: return 1 div = 2 # 1 and the number itself for i in range(2, n/2 + 1): if (n % i) == 0: div += 1 return div def tri_nums(): n = 1 t = 1 while 1: yield t n += 1 t += n t = tri_nums() m = 0 for n in t: d = find_number_of_divisiors(n) if m < d: print n, ' has ', d, ' divisors.' m = d if m == 320: exit(0)

C++:

#include <iostream> int main(int argc, char *argv[]) { unsigned int iteration = 1; unsigned int triangle_number = 0; unsigned int divisor_count = 0; unsigned int current_max_divisor_count = 0; while (true) { triangle_number += iteration; divisor_count = 0; for (int x = 2; x <= triangle_number / 2; x ++) { if (triangle_number % x == 0) { divisor_count++; } } if (divisor_count > current_max_divisor_count) { current_max_divisor_count = divisor_count; std::cout << triangle_number << " has " << divisor_count << " divisors." << std::endl; } if (divisor_count == 318) { exit(0); } iteration++; } return 0; }

The python code takes 1 minute and 25.83 seconds on my machine to execute. While the C++ code takes around 4.628 seconds. Its like 18x faster. I had expected the C++ code to be faster but not by this great margin and that too just for a simple solution which consists of just 2 loops and a bunch of increments and mods.

Although I would appreciate answers on how to solve this problem, the main question I want to ask is Why is C++ code so much faster? Am I using/doing something wrongly in python?

Replacing range with xrange:

After replacing range with xrange the python code takes around 1 minute 11.48 seconds to execute. (Around 1.2x faster)

Consider using xrange instead of range. Also just consider using C++ — phs
– phs, Commented Mar 16, 2013 at 4:25
It's really late so my mind may be a little fuzzy, but one slight improvement to find the number of divisors would be to go only to sqrt(n) in your for loop instead of n/2+1 ... But you'd have to add 2 to div each time then. One for the divisor less than sqrt(n) and one for its codivisor(is that a word??) — George Mitchell
– George Mitchell, Commented Mar 16, 2013 at 4:27
This basically has to do with a compiled language versus interpreted language. Since C++ is a compiled language, it runs much closer to the hardware. Interpreted languages incur additional overhead to make them run in the way they do. — Chris
– Chris, Commented Mar 16, 2013 at 4:32

Danica · Accepted Answer · 2013-03-16 06:53:48Z

This is exactly the kind of code where C++ is going to shine compared to Python: a single fairly tight loop doing arithmetic ops. (I'm going to ignore algorithmic speedups here, because your C++ code uses the same algorithm, and it seems you're explicitly not asking for that...)

C++ compiles this kind of code down to a relatively few number of instructions for the processor (and everything it does probably all fits in the super-fast levels of CPU cache), while Python has a lot of levels of indirection it's going through for each operation. For example, every time you increase a number it's checking that the number didn't just overflow and need to be moved into a bigger data type.

That said, all is not necessarily lost! This is also the kind of code that a just-in-time compiler system like PyPy will do well at, since once it's gone through the loop a few times it compiles the code to something similar to what the C++ code starts at. On my laptop:

$ time python2.7 euler.py >/dev/null python euler.py 72.23s user 0.10s system 97% cpu 1:13.86 total $ time pypy euler.py >/dev/null pypy euler.py > /dev/null 13.21s user 0.03s system 99% cpu 13.251 total $ clang++ -o euler euler.cpp && time ./euler >/dev/null ./euler > /dev/null 2.71s user 0.00s system 99% cpu 2.717 total

using the version of the Python code with xrange instead of range. Optimization levels don't make a difference for me with the C++ code, and neither does using GCC instead of Clang.

While we're at it, this is also a case where Cython can do very well, which compiles almost-Python code to C code that uses the Python APIs, but uses raw C when possible. If we change your code just a little bit by adding some type declarations, and removing the iterator since I don't know how to handle those efficiently in Cython, getting

cdef int find_number_of_divisiors(int n): cdef int i, div if n == 1: return 1 div = 2 # 1 and the number itself for i in xrange(2, n/2 + 1): if (n % i) == 0: div += 1 return div cdef int m, n, t, d m = 0 n = 1 t = 1 while True: n += 1 t += n d = find_number_of_divisiors(t) if m < d: print n, ' has ', d, ' divisors.' m = d if m == 320: exit(0)

then on my laptop I get

$ time python -c 'import euler_cy' >/dev/null python -c 'import euler_cy' > /dev/null 4.82s user 0.02s system 98% cpu 4.941 total

(within a factor of 2 of the C++ code).

Just for fun, since you tried everything else… you can numpy-vectorize find_number_of_divisors with 2 + np.count_zero(n % np.arange(2, n/2+1)), which ought to turn the tight loop full of arithmetic into C code. From a quick test, I get 7.34s—not as fast as the Cython version, but it's pretty nice and readable, and doesn't require compiling anything.
On further thought, if you're willing to trade a huge amount of space for a bit faster speed, you could vectorize the next loop up by just building a 2D array. Obviously not the whole loop, but N at a time for some suitable N.
OK, I got bored. pastebin.com/7QkpE56E does it in 4.95s, vs. 7.34s for the 1D numpy version… still not as good as the Cython version's 3.99s.
if you remove generator and put everything in functions, pypy should be faster than that.

nhahtdh · Accepted Answer · 2013-03-16 11:53:45Z

Rewriting the divisor counting algorithm to use divisor function makes the run time reduces to less than 1 second. It is still possible to make it faster, but not really necessary.

This is to show that: before you do any optimization trick with the language features and compiler, you should check whether your algorithm is the bottleneck or not. The trick with compiler/interpreter is indeed quite powerful, as shown in Dougal's answer where the gap between Python and C++ is closed for the equivalent code. However, as you can see, the change in algorithm immediately give a huge performance boost and lower the run time to around the level of algorithmically inefficient C++ code (I didn't test the C++ version, but on my 6-year-old computer, the code below finishes running in ~0.6s).

The code below is written and tested with Python 3.2.3.

import math def find_number_of_divisiors(n): if n == 1: return 1 num = 1 count = 1 div = 2 while (n % div == 0): n //= div count += 1 num *= count div = 3 while (div <= pow(n, 0.5)): count = 1 while n % div == 0: n //= div count += 1 num *= count div += 2 if n > 1: num *= 2 return num

+1 Nice: count the primes, find the product of the counts of the prime factors (plus 1).

hughdbrown · Accepted Answer · 2013-03-16 06:07:18Z

Here's my own variant built on nhahtdh's factor-counting optimization plus my own prime factorization code:

def prime_factors(x): def factor_this(x, factor): factors = [] while x % factor == 0: x /= factor factors.append(factor) return x, factors x, factors = factor_this(x, 2) x, f = factor_this(x, 3) factors += f i = 5 while i * i <= x: for j in (2, 4): x, f = factor_this(x, i) factors += f i += j if x > 1: factors.append(x) return factors def product(series): from operator import mul return reduce(mul, series, 1) def factor_count(n): from collections import Counter c = Counter(prime_factors(n)) return product([cc + 1 for cc in c.values()]) def tri_nums(): n, t = 1, 1 while 1: yield t n += 1 t += n if __name__ == '__main__': m = 0 for n in tri_nums(): d = factor_count(n) if m < d: print n, ' has ', d, ' divisors.' m = d if m == 320: break

@nhahtdh: That's part of it, but I already had the prime factorization code. I did not know the relation between count of prime factors and count of factors generally, so that's the idea I took from your code. I think my formulation is worth adding because of the reusable components: a good routine for prime factorization, a function that produces the product of a series, a function that composes these to get the count of factors in a number. If OP is going to do Project Euler puzzles, this code will be helpful.

Collectives™ on Stack Overflow

Very large execution time differences for virtually same C++ and Python code

3 Answers 3

4 Comments

2 Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

2 Comments

2 Comments

Linked

Related