
My Python program was too slow, so I profiled it and found that most of the time was being spent in a function that computes the distance between two points (a point is a list of 3 Python floats):

```python
def get_dist(pt0, pt1):
    val = 0
    for i in range(3):
        val += (pt0[i] - pt1[i]) ** 2
    val = math.sqrt(val)
    return val
```

To analyze why this function was so slow, I wrote two test programs, one in Python and one in C++, that perform a similar computation: each computes the distance between 1 million pairs of points. (The test code in Python and C++ is below.)

The Python computation takes 2 seconds, while C++ takes 0.02 seconds. A 100x difference!

Why is Python code so much slower than C++ code for such simple math computations? How do I speed it up to match the C++ performance?

The Python code used for testing:

```python
import math, random, time

num = 1000000

# Generate random points and numbers
pt_list = []
rand_list = []
for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)
    rand_list.append(random.randint(0, num - 1))

# Compute
beg_time = time.clock()
dist = 0
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)
    dist += val
end_time = time.clock()

elap_time = (end_time - beg_time)
print elap_time
print dist
```

The C++ code used for testing:

```cpp
#include <cstdlib>
#include <iostream>
#include <ctime>
#include <cmath>

struct Point
{
    double v[3];
};

int num = 1000000;

int main()
{
    // Allocate memory
    Point** pt_list = new Point*[num];
    int* rand_list = new int[num];

    // Generate random points and numbers
    for ( int i = 0; i < num; ++i )
    {
        Point* pt = new Point;
        for ( int j = 0; j < 3; ++j )
        {
            const double r = (double) rand() / (double) RAND_MAX;
            pt->v[j] = r;
        }
        pt_list[i] = pt;
        rand_list[i] = rand() % num;
    }

    // Compute
    clock_t beg_time = clock();
    double dist = 0;
    for ( int i = 0; i < num; ++i )
    {
        const Point* pt0 = pt_list[i];
        int r = rand_list[i];
        const Point* pt1 = pt_list[r];
        double val = 0;
        for ( int j = 0; j < 3; ++j )
        {
            const double d = pt0->v[j] - pt1->v[j];
            val += ( d * d );
        }
        val = sqrt(val);
        dist += val;
    }
    clock_t end_time = clock();
    double sec_time = (end_time - beg_time) / (double) CLOCKS_PER_SEC;

    std::cout << sec_time << std::endl;
    std::cout << dist << std::endl;

    return 0;
}
```
  • Because compiled code is always going to beat out a byte-code-interpreted dynamic language? Use numpy for computation across such large datasets. Commented Apr 25, 2013 at 9:02
  • Not an answer to your question, but have you considered using numpy? Commented Apr 25, 2013 at 9:02
  • @Ashwin: You are not exactly using Python's strengths here, nor is your code the most efficient. Using local scope vs. global scope makes a difference; unrolling the loop and avoiding attribute dereferencing can help too. Commented Apr 25, 2013 at 9:15
  • You could also try running this code with pypy. Edit: for me pypy was 6.5x faster than cpython. Commented Apr 25, 2013 at 9:19
  • I'd expect a 100x difference specifically for "simple math computations" when comparing C vs. CPython. If a module produces millions of 3D points without using numpy, write a wrapper to get numpy arrays. Cython won't help you achieve C performance if you keep the loops in pure Python (get_dist() implemented in Cython would be almost instantaneous compared with the for i in xrange(num) overhead: 14ms for num=1000000 on my machine). Cython interoperates with numpy arrays very well. Use Cython if you can't express your computations as vectorized numpy operations. Commented Apr 25, 2013 at 10:03

5 Answers


A sequence of optimizations:

The original code, with small changes

```python
import math, random, time

num = 1000000

# Generate random points and numbers
# Change #1: Sometimes it's good not to have too much randomness.
# This is one of those cases. Changing the code shouldn't change the
# results, and using a fixed seed ensures that the changes are valid:
# the final 'print dist' should yield the same result regardless of
# optimizations. (There's nothing magical about this seed; I randomly
# picked a hash tag from a git log.)
random.seed(0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

pt_list = []
rand_list = []
for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)

# Change #2: rand_list is computed in a separate loop.
# This ensures that upcoming optimizations will get the same results as
# this unoptimized version.
for i in range(num):
    rand_list.append(random.randint(0, num - 1))

# Compute
beg_time = time.clock()
dist = 0
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)
    dist += val
end_time = time.clock()

elap_time = (end_time - beg_time)
print elap_time
print dist
```


Optimization #1: Put the code in a function.

The first optimization (not shown) is to embed all of the code except the import in a function. This simple change offers a 36% performance boost on my computer.
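Roughly, that version looks like the sketch below (a sketch, not the answer's exact code: it uses print() and time.time() so it also runs on Python 3, whereas the surrounding answer uses Python 2 idioms such as print statements and time.clock()):

```python
import math, random, time

def main(num=1000000):
    # Same computation as the original, but every name is now a fast local.
    random.seed(0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)
    pt_list = [[random.random() for _ in range(3)] for _ in range(num)]
    rand_list = [random.randint(0, num - 1) for _ in range(num)]

    beg_time = time.time()
    dist = 0.0
    for i in range(num):
        pt0 = pt_list[i]
        pt1 = pt_list[rand_list[i]]
        val = 0.0
        for j in range(3):
            val += (pt0[j] - pt1[j]) ** 2
        dist += math.sqrt(val)
    print(time.time() - beg_time)
    return dist

if __name__ == "__main__":
    print(main())
```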


Optimization #2: Eschew the ** operator.

You don't use pow(d, 2) in your C++ code because everyone knows it's suboptimal in C++. It's also suboptimal in Python. Python's ** is smart; it evaluates x**2 as x*x. However, it takes time to be smart. You know you want d*d, so use it. Here's the computation loop with that optimization:

```python
for i in range(num):
    pt0 = pt_list[i]
    ri = rand_list[i]
    pt1 = pt_list[ri]
    val = 0
    for j in range(3):
        d = pt0[j] - pt1[j]
        val += d*d
    val = math.sqrt(val)
    dist += val
```


Optimization #3: Be pythonic.

Your Python code looks a whole lot like your C code. You aren't taking advantage of the language.

```python
import math, random, time, itertools

def main(num=1000000):
    # This small optimization speeds things up by a couple percent.
    sqrt = math.sqrt

    # Generate random points and numbers
    random.seed(0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point():
        return [random.random(), random.random(), random.random()]

    def random_index():
        return random.randint(0, num - 1)

    # Big optimization: build both sequences with list comprehensions
    # instead of explicit append loops.
    pt_list = [random_point() for i in xrange(num)]
    rand_pts = [pt_list[random_index()] for i in xrange(num)]

    # Compute
    beg_time = time.clock()
    dist = 0
    # Don't loop over a range. That's too C-like.
    # Instead loop directly over the data; izip pairs the two lists
    # lazily, without building an intermediate collection.
    # This is particularly important when the collections are large.
    for (pt0, pt1) in itertools.izip(pt_list, rand_pts):
        # Small optimization: inner loop inlined,
        # intermediate variable 'val' eliminated.
        d0 = pt0[0] - pt1[0]
        d1 = pt0[1] - pt1[1]
        d2 = pt0[2] - pt1[2]
        dist += sqrt(d0*d0 + d1*d1 + d2*d2)
    end_time = time.clock()

    elap_time = (end_time - beg_time)
    print elap_time
    print dist
```


Update

Optimization #4: Use numpy.

The following takes about 1/40th the time of the original version in the timed section of the code. Not quite as fast as C++, but close.

Note the commented-out "Mondo slow" computation. It takes about ten times as long as the original version; there is an overhead cost to using numpy. The setup also takes quite a bit longer in the code that follows than in my non-numpy optimization #3.

Bottom line: You need to take care when using numpy, and the setup costs might be significant.

```python
import numpy, random, time

def main(num=1000000):
    # Generate random points and numbers
    random.seed(0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point():
        return [random.random(), random.random(), random.random()]

    def random_index():
        return random.randint(0, num - 1)

    pt_list = numpy.array([random_point() for i in xrange(num)])
    rand_pts = pt_list[[random_index() for i in xrange(num)], :]

    # Compute
    beg_time = time.clock()

    # Mondo slow.
    # dist = numpy.sum(
    #     numpy.apply_along_axis(
    #         numpy.linalg.norm, 1, pt_list - rand_pts))

    # Mondo fast.
    dist = numpy.sum((numpy.sum((pt_list - rand_pts)**2, axis=1))**0.5)

    end_time = time.clock()
    elap_time = (end_time - beg_time)
    print elap_time
    print dist
```

3 Comments

Excellent answer. How about optimization #4: improve its performance using numpy? :-)
Awesome. How about a final #5: Cython :-)
Numpy is a 30MB dependency (the Python runtime is about 5MB last time I checked) that only works if your algorithm can be run in parallel.

Some general hints:

Move all your code into a main() function and use the normal

```python
if __name__ == "__main__":
    main()
```

construct. It greatly improves speed due to variable scoping. See Why does Python code run faster in a function? for an explanation of why.
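An illustrative sketch (not part of the original answer): inside a function, name lookups compile to fast array-indexed local accesses, while global names pay a dictionary lookup on every access:

```python
import time

N = 1000000
x = 1.0

def use_global():
    total = 0.0
    for _ in range(N):
        total += x      # LOAD_GLOBAL: a dict lookup on every iteration
    return total

def use_local(x=x):
    total = 0.0
    for _ in range(N):
        total += x      # LOAD_FAST: an array-indexed local lookup
    return total

for f in (use_global, use_local):
    t0 = time.time()
    f()
    print(f.__name__, time.time() - t0)
```

On CPython the local version is typically noticeably faster, which is the same effect you get by moving module-level code into main().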

Don't use range(), since it generates the complete list at once, which is slow for large numbers; instead use xrange(), which uses a generator.
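As an aside (assuming Python 3, where range() already behaves like the old xrange()), the laziness is easy to verify from object sizes:

```python
import sys

r = range(1000000)   # lazy sequence object: stores only start/stop/step
l = list(r)          # fully materialized list of one million ints

print(sys.getsizeof(r))   # a few dozen bytes, independent of the length
print(sys.getsizeof(l))   # several megabytes
```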

4 Comments

Yes, that helped. It reduced it to 1.34s. Still ~100x slower than C code.
@FredrikPihl A point is a list of 3 Python floats. Using xrange will slow it down because of the generator overhead in this case of small numbers
I was more thinking of the line for i in range(num), where num is 1000000. If it is Python 3, then yes, range is the same as xrange. Moving the code into main() still holds, though.
@FredrikPihl Thanks for this tip, which I was not aware of. But sadly, this doesn't help my problem since get_dist is part of a larger program. See my updated question.

Python is not a fast language: it doesn't compile to machine code, it runs in the Python virtual machine. "Everything" is an object, so you don't get static typing as in C; that alone slows it down a lot. Anyway, that's not my area, so I won't say too much about it.

You should consider PyPy, Cython, maybe even writing a python extension in C.

I ran the code in PyPy, and the time used was 250 ms. Is that what you're looking for? I also wrote a quick test for Cython and managed to get it down to 500 ms.

So the best bet would be to use PyPy, or Cython when speed is REALLY important.

4 Comments

I do not see why this was down-voted. Cython is a great way of optimising the code when speed is needed. It's easy to handle and simple to maintain.
I cannot use anything other than CPython right now. My actual program is built using lots of third party libraries.
@Ashwin: Cython is not a separate implementation. Cython works together with CPython; you use a subset of Python that is then compiled to optimized C, and that extension is then run within CPython.
@MartijnPieters I realize that now :-) I had assumed it was yet another Python implementation like Jython and IronPython.

You can't expect to match C++ performance in Python, however you can tweak the Python code a bit to make it faster:

```python
def get_dist(pt0, pt1):
    val = 0
    for i in range(3):
        val += (pt0[i] - pt1[i]) ** 2
    val = math.sqrt(val)
    return val
```

The for loop version of this code and your C++ for loop are completely different. The Python version creates a list and then iterates through it, while the C++ version simply increments a variable. If you want to speed up the Python version, the best way to do it is write it out explicitly to spare the overhead of the Python for loop.

```python
def get_dist(pt0, pt1, sqrt=math.sqrt):  # cache function at definition time
    return sqrt((pt0[0] - pt1[0]) ** 2 +
                (pt0[1] - pt1[1]) ** 2 +
                (pt0[2] - pt1[2]) ** 2)
```

And that's probably as fast as you can get (without using numpy) for that particular function; there are other things you can improve in your main code too.
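One further tweak worth knowing, not part of the original answer and assuming Python 3.8+ (where math.hypot accepts any number of coordinates): the whole expression can be handed to a single C-implemented call:

```python
import math

def get_dist(pt0, pt1, hypot=math.hypot):  # same caching trick as above
    # hypot(dx, dy, dz) computes sqrt(dx*dx + dy*dy + dz*dz) in C
    return hypot(pt0[0] - pt1[0], pt0[1] - pt1[1], pt0[2] - pt1[2])

print(get_dist([0.0, 0.0, 0.0], [1.0, 2.0, 2.0]))
```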

10 Comments

And that's probably as fast as you can get. Using numpy will make this a lot faster still without moving wholesale to C or C++.
@MartijnPieters Oh right, I should mention "without using numpy" thanks
Also, you could 'cache' the sqrt function with get_dist(pt0, pt1, sqrt=math.sqrt): return sqrt((pt0[0] - pt1[0]) ** 2 + (pt0[1] - pt1[1]) ** 2 + (pt0[2] - pt1[2]) ** 2)
This link says math.sqrt() is faster
Oh didn't know that, I'll change it

This page is getting really confusing and most of the answers are actually in comments, so here’s a quick overview of possible optimizations:

  • Jamlak's answer: optimize your Python code:

    ```python
    def get_dist(pt0, pt1, sqrt=math.sqrt):  # cache function at definition time
        return sqrt((pt0[0] - pt1[0]) ** 2 +
                    (pt0[1] - pt1[1]) ** 2 +
                    (pt0[2] - pt1[2]) ** 2)
    ```
  • Use the numpy module for the calculations

  • Run your code with pypy instead of CPython
  • Compile the time-critical code with Cython

2 Comments

You missed the 'move the code into a function'. Running the supplied code on my system: 4.8s, moving the otherwise unchanged code into main(): 3.5s, running it with pypy 0.65s. numpy probably helps more but would need code changes.
At some point it becomes quicker to just learn C or Fortran ;)
