    def main():
        i = 2
        sum = 1
        while i < 100000:
            j = 2
            while j < i:
                if i % j == 0:
                    sum += 1
                    break
                j += 1
            i += 1
        print(sum)

    if __name__ == "__main__":
        main()
    #include <iostream>
    using namespace std;

    int main() {
        int sum = 1;
        for (int i = 2; i < 100000; i++) {
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    sum++;
                    break;
                }
            }
        }
        cout << sum << endl;
        return 0;
    }

C++

Run with: g++ -std=c++11 x.cpp -o x && time ./x

Time: ./x 1.36s user 0.00s system 99% cpu 1.376 total

Python

Run with: python x.py

Time: python x.py 32.10s user 0.21s system 98% cpu 32.854 total

Can anyone explain the huge difference between the time taken by the 2 programs? And what can be done to speed up the python one?

  • One is compiled, the other is interpreted... Commented Jul 15, 2019 at 17:42
  • Why do you use a for loop in one and a while in the other? Commented Jul 15, 2019 at 17:43
  • The biggest difference is due to the fact that Python is an interpreted language, whereas C++ is a compiled language. You can get some of the benefits of a compiled language with Python by using something like Cython. Commented Jul 15, 2019 at 17:43
  • Possible duplicate of Why are Python Programs often slower than the Equivalent Program Written in C or C++? Commented Jul 15, 2019 at 18:20
  • Hopefully you're not including the output functions in your timings. Secondly, you should time optimized C++ builds, i.e. -O2 or -O3 on the compiler command line. Commented Jul 15, 2019 at 18:21

3 Answers


Here's a simple example of the difference:

i++ in C++ compiles down to (on x86-64 machines) a simple inc REGISTER instruction. Takes a fraction of a cycle to execute.

i += 1 in Python can be disassembled with the dis module via dis.dis('i += 1') which informs us that the opcodes involved are:

      1           0 LOAD_NAME                0 (i)
                  2 LOAD_CONST               0 (1)
                  4 INPLACE_ADD
                  6 STORE_NAME               0 (i)
                  8 LOAD_CONST               1 (None)
                 10 RETURN_VALUE

Try it online!

Technically, all the instructions that end in _NAME become _FAST in a function (we disassembled an isolated statement, so it behaved slightly differently), and the LOAD_CONST (None)/RETURN_VALUE pair won't exist for the expression in a real function (the function has to do it, but not for every expression), but close enough. In practice, the real opcodes within a function would be more like:

      1           0 LOAD_FAST                0 (i)
                  2 LOAD_CONST               0 (1)
                  4 INPLACE_ADD
                  6 STORE_FAST               0 (i)
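You can reproduce the function-scoped disassembly yourself with the dis module (this snippet is a demonstration of mine, not from the original answer; the exact opcodes vary across CPython versions, e.g. 3.11+ replaces INPLACE_ADD with BINARY_OP):

```python
import dis

def f(i):
    # same statement as before, but now 'i' is a local variable
    i += 1
    return i

# locals compile to LOAD_FAST/STORE_FAST (indexed array access) rather
# than LOAD_NAME/STORE_NAME (namespace dict lookups)
dis.dis(f)
```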

Each of those instructions requires either a run through a switch statement or a computed goto (depending on how CPython was compiled), loading the next instruction and updating code position information (it can also involve repeated checks to ensure no other thread is asking for the GIL, though in practice these instructions are usually excluded from GIL checks).

LOAD_FAST and LOAD_CONST instructions involve a C array lookup and a reference count adjustment (a single reference count adjustment alone is equivalent to the i++ from before, except it has to change memory, not a register, so it's slower). STORE_FAST similarly involves a C array lookup, a reference count adjustment (to decrement the existing value), and often, freeing memory (if the decref removed the last reference to the value).

INPLACE_ADD has to dynamically look up and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python int to do the work (and if the numbers are big enough, this involves array-based math, which gets ugly), (usually) create a brand new Python int object, and do yet more reference count adjustment.
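The reference-count traffic is observable from Python itself via sys.getrefcount (CPython-specific; the reported count includes the temporary reference held by the call itself). A small demonstration of mine:

```python
import sys

x = object()
before = sys.getrefcount(x)  # x itself, plus getrefcount's argument

y = x                        # binding another name is one more incref
after = sys.getrefcount(x)

print(after - before)        # 1: one new reference from the extra name
```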

Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.

I'll note that the relative cost here is highest for simple math operations; the more work a single opcode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.

As for speeding it up, the main rules are:

  1. Write Python code, not C code. You're manually maintaining your counters, when Python's range could do the job for you (and save a lot of individual opcodes). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. for loops over range rather than while loops with manual counter adjustment).
  2. For mass math operations, use extension modules that can do the work in bulk, e.g. numpy. All that overhead for a single addition is bad; paying it just once for 1000 additions is pretty trivial.
  3. Try alternate interpreters (e.g. PyPy)
  4. Use Cython to compile your Python code to C++ (requires adding appropriate cdef declarations)
  5. Use ctypes to call existing C libraries, and/or write raw Python C extensions (when Cython can't handle what you want)

Aside from that, you just have to accept that interpreted languages with dynamic typing are always going to have overhead that a compiled, statically typed language won't have.
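A rough timeit sketch of rule #1 (this snippet is mine; the absolute numbers are machine-dependent, only the ratio matters):

```python
import timeit

def whileloop(n):
    # manual counter: load/compare/add/store opcodes every iteration
    i = 0
    while i < n:
        i += 1

def forloop(n):
    # range performs the increment and termination test in C
    for i in range(n):
        pass

n = 100_000
t_while = timeit.timeit(lambda: whileloop(n), number=50)
t_for = timeit.timeit(lambda: forloop(n), number=50)
print(f"while: {t_while:.3f}s   for/range: {t_for:.3f}s")
```

On CPython the for/range version is typically several times faster, purely from loop bookkeeping.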


To address point #1, a Pythonic version of your code would look like:

    def main():
        sum = 1
        for i in range(2, 100000):
            for j in range(2, i):
                if i % j == 0:
                    sum += 1
                    break
        print(sum)

    if __name__ == "__main__":
        main()

You could even replace the inner loop with:

 sum += any(i % j == 0 for j in range(2, i)) 

though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using range, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.
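That one-liner works because bool is a subclass of int, so the True/False produced by any adds as 1/0. A toy example of mine on a small range:

```python
# count composites in [2, 10) by summing booleans
total = 0
for i in range(2, 10):
    total += any(i % j == 0 for j in range(2, i))
print(total)  # 4: the composites are 4, 6, 8, 9
```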

For demonstration of the difference in opcode complexity, consider a function that does nothing but run a loop with either while and a manual counter or for and range:

    def whileloop(n):
        i = 0
        while i < n:
            i += 1

    def forloop(n):
        for i in range(n):
            pass

Disassembling each function shows:

      3           0 LOAD_CONST               1 (0)
                  2 STORE_FAST               1 (i)

      4           4 SETUP_LOOP              20 (to 26)
            >>    6 LOAD_FAST                1 (i)
                  8 LOAD_FAST                0 (n)
                 10 COMPARE_OP               0 (<)
                 12 POP_JUMP_IF_FALSE       24

      5          14 LOAD_FAST                1 (i)
                 16 LOAD_CONST               2 (1)
                 18 INPLACE_ADD
                 20 STORE_FAST               1 (i)
                 22 JUMP_ABSOLUTE            6
            >>   24 POP_BLOCK
            >>   26 LOAD_CONST               0 (None)
                 28 RETURN_VALUE

for whileloop and:

      8           0 SETUP_LOOP              16 (to 18)
                  2 LOAD_GLOBAL              0 (range)
                  4 LOAD_FAST                0 (n)
                  6 CALL_FUNCTION            1
                  8 GET_ITER
            >>   10 FOR_ITER                 4 (to 16)
                 12 STORE_FAST               1 (i)

      9          14 JUMP_ABSOLUTE           10
            >>   16 POP_BLOCK
            >>   18 LOAD_CONST               0 (None)
                 20 RETURN_VALUE

Try it online!

for forloop. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the while runs from the LOAD_FAST following the SETUP_LOOP to the JUMP_ABSOLUTE, encompassing nine instructions per loop; for the for, it runs from the FOR_ITER to the JUMP_ABSOLUTE, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a while loop.



[SO]: Python vs CPP: Why is the difference in speed so huge? (@ShadowRanger's answer) explains very well the why (the rationale behind the scenes). Here are some attempts I've made, in incremental steps.

  1. Setup:

    OS, tools and other info.

    [cfati@cfati-5510-0:/cygdrive/e/Work/Dev/StackOverflow/q057044727]> ~/sopr.sh
    *** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***
    [prompt]> uname -a
    CYGWIN_NT-10.0 cfati-5510-0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
    [prompt]>
    [prompt]> python3 -c "import sys;print(\"Python {0:s} {1:d}bit on {2:s}\".format(\" \".join(item.strip() for item in sys.version.split(\"\n\")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))"
    Python 3.6.8 (default, Feb 14 2019, 22:09:48) [GCC 7.4.0] 64bit on cygwin
    [prompt]>
    [prompt]> g++ --version | grep g++
    g++ (GCC) 7.4.0
    [prompt]>
    [prompt]> ls
    dll00.cpp  dll01.cpp  main00.cpp  script00.py  script01.py  script02.py  script03.py  script04.py
  2. C++ (0):

    Split the code in 2 files (later you'll see why).

    dll00.cpp:

    #include <iostream>

    #if defined(_WIN32)
    #  define DLL_EXPORT_API __declspec(dllexport)
    #else
    #  define DLL_EXPORT_API
    #endif

    using std::cout;
    using std::endl;

    DLL_EXPORT_API int func00() {
        int non_primes = 1;
        for (int i = 2; i < 100000; i++) {
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    non_primes++;
                    break;
                }
            }
        }
        cout << non_primes << endl;
        return 0;
    }

    main00.cpp:

    #include "dll00.cpp"

    int main() {
        return func00();
    }

    Output:

    [prompt]> g++ -std=c++11 main00.cpp -o main000
    [prompt]>
    [prompt]> time ./main000
    90407

    real    0m1.384s
    user    0m1.359s
    sys     0m0.000s
  3. script00.py:

    Your original script (with small corrections).

    #!/usr/bin/env python3

    def main():
        non_primes = 1
        i = 2
        while i < 100000:
            j = 2
            while j < i:
                if i % j == 0:
                    non_primes += 1
                    break
                j += 1
            i += 1
        print(non_primes)

    if __name__ == "__main__":
        main()

    Output:

    [prompt]> time python3 script00.py
    90407

    real    0m53.738s
    user    0m53.703s
    sys     0m0.031s
  4. script01.py:

    Replaced the (inefficient) while loops by for (using range).

    #!/usr/bin/env python3

    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, i):
                if i % j == 0:
                    non_primes += 1
                    break
        print(non_primes)

    if __name__ == "__main__":
        main()

    Output:

    [prompt]> time python3 script01.py
    90407

    real    0m34.142s
    user    0m34.124s
    sys     0m0.000s
  5. script02.py:

    Use Python style 0 equality test.

    #!/usr/bin/env python3

    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, i):
                if not i % j:
                    non_primes += 1
                    break
        print(non_primes)

    if __name__ == "__main__":
        main()

    Output:

    [prompt]> time python3 script02.py
    90407

    real    0m28.440s
    user    0m28.406s
    sys     0m0.031s
  6. script03.py:

    Specific to this case: the search for divisors is highly inefficient. It iterates up to the number itself, when in fact it only needs to go up to its square root, generating lots of useless operations that deepen the performance gap between the 2 languages.

    #!/usr/bin/env python3

    from math import sqrt

    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, int(sqrt(i) + 1)):
                if not i % j:
                    non_primes += 1
                    break
        print(non_primes)

    if __name__ == "__main__":
        main()

    Output:

    [prompt]> time python3 script03.py
    90407

    real    0m0.291s
    user    0m0.265s
    sys     0m0.015s

    As seen, a whopping difference: almost 100 times faster than the previous version, and even faster than the (original) C++ code.

  7. C++ (1):

    The previous step operated on the algorithm itself. Change the C++ variant as well, otherwise the comparison would be unfair.

    dll01.cpp:

    #include <iostream>
    #include <math.h>

    #if defined(_WIN32)
    #  define DLL_EXPORT_API __declspec(dllexport)
    #else
    #  define DLL_EXPORT_API
    #endif

    using std::cout;
    using std::endl;

    #if defined(__cplusplus)
    extern "C" {
    #endif

    DLL_EXPORT_API int func00() {
        int non_primes = 1;
        for (int i = 2; i < 100000; i++) {
            for (int j = 2; j < static_cast<int>(sqrt(i) + 1); j++) {
                if (i % j == 0) {
                    non_primes++;
                    break;
                }
            }
        }
        cout << non_primes << endl;
        return 0;
    }

    #if defined(__cplusplus)
    }
    #endif

    main00.cpp must (obviously) be modified accordingly (#include "dll01.cpp").

    Output:

    [prompt]> g++ -std=c++11 main00.cpp -o main001
    [prompt]>
    [prompt]> time ./main001
    90407

    real    0m0.279s
    user    0m0.250s
    sys     0m0.030s
  8. Call C++ code (C interfaced) from Python via [Python 3.Docs]: ctypes - A foreign function library for Python:

    Uses the C++ code from previous step.

    script04.py:

    #!/usr/bin/env python3

    import ctypes

    def main():
        dll = ctypes.CDLL("./dll01.so")
        func = dll.func00
        func.argtypes = []
        func.restype = ctypes.c_int
        func()

    if __name__ == "__main__":
        main()

    Output:

    [prompt]> g++ -std=c++11 -fPIC -shared dll01.cpp -o dll01.so
    [prompt]>
    [prompt]> time python3 script04.py
    90407

    real    0m0.327s
    user    0m0.281s
    sys     0m0.031s

Conclusions (drawn from the above examples):

  • I've run each step 3 times, and placed here the middle result. However, a test with meaningful results should be run several thousand times and an average computed. Also, the fact that I'm using Cygwin might interfere with the results.

  • Writing Pythonic code improved performance almost 2 times (#4., #5.)

  • Writing an efficient algorithm reduced the difference between the 2 languages almost to 0 (#6. vs. #7.), and the (pure) Python code even seems to run faster than #8..
    However, don't let yourself be deceived by these facts. As shown, if the number of operations grows (and not necessarily due to inefficiency), C++ will work a lot faster.
    You can check this by applying step #8. to dll00.cpp
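As a quick sanity check on the square-root bound introduced in #6. (this snippet is mine, not part of the steps above): a divisor search bounded by sqrt(i) flags exactly the same numbers as a full search up to i:

```python
from math import sqrt

def composite_full(n):
    # trial division all the way up to n, as in the original code
    return any(n % j == 0 for j in range(2, n))

def composite_sqrt(n):
    # trial division only up to sqrt(n), as in script03.py
    return any(n % j == 0 for j in range(2, int(sqrt(n) + 1)))

# both classifications agree on every number in the range
assert all(composite_full(n) == composite_sqrt(n) for n in range(2, 2000))
print("OK")
```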

2 Comments

Just for posterity, could you include the version of Python you used (major.minor, 32 vs. 64 bit)? Every once in a (long) while they change the interpreter to improve unusually slow things (e.g. recent releases have dramatically reduced the overhead of method calls with solely positional arguments, so some advice about avoiding method calls when syntax solutions exist is less relevant than before), so it would be nice to be able to compare if later version, for example, make range cheaper, or simple arithmetic, or comparisons to 0, or whatever.
@ShadowRanger: version was already there (in #1.), added architecture.

You are calculating something like the count of non-prime numbers up to some n. Doing so with a sieve is much faster:

    def count_primes(n):
        count = 0
        w = [False] * n
        for m in range(2, n):
            if not w[m]:
                w[m*m::m] = [True] * ((n + m - m*m - 1) // m)
                count += 1
        return count

    print(99999 - count_primes(100000))

This runs in milliseconds, even in Python.
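A quick check of the sieve (repeated here so the snippet is self-contained) against a known value, 25 primes below 100:

```python
def count_primes(n):
    count = 0
    w = [False] * n          # w[m] becomes True once m is known composite
    for m in range(2, n):
        if not w[m]:
            # mark m*m, m*m+m, ... as composite; the slice-length formula
            # counts how many multiples of m lie in [m*m, n)
            w[m*m::m] = [True] * ((n + m - m*m - 1) // m)
            count += 1
    return count

print(count_primes(100))  # 25
```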

3 Comments

While true, not all problems can be solved with better algorithms; increase your sieve boundary, and optimize the C code equivalently (or even better, since C code can do all sorts of bit twiddling hacks to reduce memory usage at no real cost in performance, which Python can't match), and you'll find CPython still falls behind as you provide larger and larger ns.
While this is true, most of this algorithm executes in C code, which makes it run at native speed.
Yes, it's a lot faster, but even so, equivalently optimized C code wins. I've written similar code (it output the primes themselves, not just the count, and used some tricks to reduce the bytecode per loop further), and hyperoptimized Python lost to the hyperoptimized C by a factor of ~8x for an input of n = 10**8. Don't get me wrong, the Python code was much faster to write, and even with some absurdly ugly optimizations, was still easier to read/verify. That's the whole point of writing Python at all; it's faster and easier to write, at the expense of being slower to execute.
