
I have implemented a Python extension in C and found that executing a C function from Python is 2x faster than executing the same C code from a C main.

But why is this faster? I would expect the plain C code to perform exactly the same whether it is called from Python or from C.

Here is my experiment:

  • Plain C compute code (a simple triple-for-loop matrix-matrix multiplication)
  • Plain C main function that calls the mmult() function
  • Python extension wrapper to call the mmult() function
  • All timing is happening entirely within the C code

Here are my results:

Pure C - 85us

Python Extension - 36us


Here's my code:


--mmult.cpp----------

#include "mmult.h" void mmult(int32_t a[1024],int32_t b[1024],int32_t c[1024]) { struct timeval t1, t2; gettimeofday(&t1, NULL); for(int i=0; i<32; i=i+1) { for(int j=0; j<32; j=j+1) { int32_t result=0; for(int k=0; k<32; k=k+1) { result+=a[i*32+k]*b[k*32+j]; } c[i*32+j] = result; } } gettimeofday(&t2, NULL); double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec)*1000000; printf("elapsed time: %fus\n",elapsedTime); } 

--mmult.h-------

#include <stdint.h>

void mmult(int32_t a[1024], int32_t b[1024], int32_t c[1024]);

--main.cpp------

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mmult.h"

int main()
{
    int* a = (int*)malloc(sizeof(int)*1024);
    int* b = (int*)malloc(sizeof(int)*1024);
    int* c = (int*)malloc(sizeof(int)*1024);

    for (int i = 0; i < 1024; i++) {
        a[i] = i + 1;
        b[i] = i + 1;
        c[i] = 0;
    }

    /* time the call from the outside as well */
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);
    mmult(a, b, c);
    gettimeofday(&t2, NULL);

    double elapsedTime = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec) * 1000000;
    printf("elapsed time: %fus\n", elapsedTime);

    free(a);
    free(b);
    free(c);
    return 0;
}

Here's how I compile main:

gcc -o main main.cpp mmult.cpp -O3 

--wrapper.cpp-----

#include <Python.h>
#include <numpy/arrayobject.h>
#include "mmult.h"

static PyObject* mmult_wrapper(PyObject* self, PyObject* args)
{
    int32_t* a;
    PyArrayObject* a_obj = NULL;
    int32_t* b;
    PyArrayObject* b_obj = NULL;
    int32_t* c;
    PyArrayObject* c_obj = NULL;

    int res = PyArg_ParseTuple(args, "OOO", &a_obj, &b_obj, &c_obj);
    if (!res)
        return NULL;

    /* grab the raw data pointers of the numpy arrays */
    a = (int32_t*) PyArray_DATA(a_obj);
    b = (int32_t*) PyArray_DATA(b_obj);
    c = (int32_t*) PyArray_DATA(c_obj);

    /* call function */
    mmult(a, b, c);

    Py_RETURN_NONE;
}

/* define functions in module */
static PyMethodDef TheMethods[] = {
    {"mmult_wrapper", mmult_wrapper, METH_VARARGS, "your c function"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef cModPyDem = {
    PyModuleDef_HEAD_INIT,
    "c_module",   /* should match the PyInit_c_module suffix below */
    "Some documentation",
    -1,
    TheMethods
};

PyMODINIT_FUNC PyInit_c_module(void)
{
    PyObject* retval = PyModule_Create(&cModPyDem);
    import_array();  /* initialize the numpy C API */
    return retval;
}
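As an aside, the wrapper above accepts arbitrary objects and reads their data pointers unchecked. That is unrelated to the timing question, but if you want the extension to fail loudly on bad inputs, a type-checked variant of the parsing might look like the following sketch (the function name mmult_wrapper_checked and the exact checks are illustrative, not part of the original code):

/* Sketch: stricter argument parsing, meant to live in wrapper.cpp.
 * The "O!" converter rejects non-ndarray arguments outright, and the
 * explicit checks below reject wrong dtypes, layouts, and sizes. */
static PyObject* mmult_wrapper_checked(PyObject* self, PyObject* args)
{
    PyArrayObject *a_obj, *b_obj, *c_obj;

    if (!PyArg_ParseTuple(args, "O!O!O!",
                          &PyArray_Type, &a_obj,
                          &PyArray_Type, &b_obj,
                          &PyArray_Type, &c_obj))
        return NULL;

    /* all three arrays must be C-contiguous int32 buffers of 1024 elements */
    PyArrayObject* arrs[3] = {a_obj, b_obj, c_obj};
    for (int i = 0; i < 3; i++) {
        if (PyArray_TYPE(arrs[i]) != NPY_INT32 ||
            !PyArray_IS_C_CONTIGUOUS(arrs[i]) ||
            PyArray_SIZE(arrs[i]) != 1024) {
            PyErr_SetString(PyExc_TypeError,
                            "expected C-contiguous int32 arrays of 1024 elements");
            return NULL;
        }
    }

    mmult((int32_t*)PyArray_DATA(a_obj),
          (int32_t*)PyArray_DATA(b_obj),
          (int32_t*)PyArray_DATA(c_obj));
    Py_RETURN_NONE;
}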

--setup.py-----

import os
import numpy
from distutils.core import setup, Extension

cur = os.path.dirname(os.path.realpath(__file__))
c_module = Extension("c_module",
                     sources=["wrapper.cpp", "mmult.cpp"],
                     include_dirs=[cur, numpy.get_include()])
setup(ext_modules=[c_module])

--code.py-----

import c_module
import time
import numpy as np

if __name__ == "__main__":
    a = np.ndarray((32,32), dtype='int32', buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    b = np.ndarray((32,32), dtype='int32', buffer=np.linspace(1,1024,1024,dtype='int32').reshape(32,32))
    c = np.ndarray((32,32), dtype='int32', buffer=np.zeros((32,32),dtype='int32'))

    c_module.mmult_wrapper(a, b, c)

Here's how I compile the Python extension:

python3.6 setup.py build_ext --inplace

UPDATE

I've updated the mmult.cpp code to run the triple for loop for 1,000,000 iterations internally. This resulted in very similar times:

Pure C - 27us

Python Extension - 27us
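The question doesn't show the updated mmult.cpp, but a minimal sketch of what such an update might look like is below; averaging the total over the iteration count is an assumption on my part:

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include "mmult.h"

/* Assumed reconstruction: run the triple loop 1,000,000 times
 * back-to-back and report the mean time per multiplication. */
void mmult(int32_t a[1024], int32_t b[1024], int32_t c[1024])
{
    const int iters = 1000000;
    struct timeval t1, t2;

    gettimeofday(&t1, NULL);
    for (int n = 0; n < iters; n++) {
        for (int i = 0; i < 32; i++) {
            for (int j = 0; j < 32; j++) {
                int32_t result = 0;
                for (int k = 0; k < 32; k++)
                    result += a[i*32+k] * b[k*32+j];
                c[i*32+j] = result;
            }
        }
    }
    gettimeofday(&t2, NULL);

    double total = (t2.tv_usec - t1.tv_usec) + (t2.tv_sec - t1.tv_sec) * 1e6;
    printf("elapsed time: %fus\n", total / iters);
}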

  • You're using two different compilers, correct? So I'm guessing that one is better at creating efficient executables in this one instance. Commented Nov 29, 2017 at 13:56
  • "executing a C function inside of Python to be 2x faster than just executing the C code from a C main" <--- are you sure? How are you benchmarking the speed here? I'm curious to know what led you to conclude that executing a C function from Python is faster. Commented Nov 29, 2017 at 13:56
  • In most of these cases, the answer is incorrect benchmarking. Commented Nov 29, 2017 at 13:58
  • int* a = (int*)malloc(sizeof(int)*1024); -- this number of entries is (to me) a tiny amount for meaningful benchmark tests. Commented Nov 29, 2017 at 14:04

1 Answer


85 microseconds is too small a delay to be measured reliably and repeatably. CPU cache effects, context switches, or paging may dominate the computation time and render a single such timing meaningless.

(I guess you are on Linux/x86-64)

As a rule of thumb, try to have a run lasting at least half a second, and repeat the benchmark several times. You could also use time(1) for measurements.

See also time(7). There are several notions of time (elapsed "real" time, monotonic time, process CPU time, thread CPU time, etc.). You could consider using clock(3) or clock_gettime(2) to measure time.
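For instance, a minimal monotonic-clock harness around the multiply could look like the sketch below. This is not code from the question: it assumes an mmult() with the internal gettimeofday/printf removed, and the warm-up call and repeat count are illustrative choices:

/* Sketch: time mmult() with CLOCK_MONOTONIC, after a warm-up call,
 * repeating enough times that the run lasts a measurable fraction of
 * a second. Assumes mmult() no longer prints its own timing. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include "mmult.h"

int main(void)
{
    static int32_t a[1024], b[1024], c[1024];
    for (int i = 0; i < 1024; i++) {
        a[i] = i + 1;
        b[i] = i + 1;
    }

    /* warm-up: populate caches and fault in pages before timing */
    mmult(a, b, c);

    const int iters = 100000;  /* illustrative; aim for ~0.5s total or more */
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int n = 0; n < iters; n++)
        mmult(a, b, c);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    double ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (double)(t2.tv_nsec - t1.tv_nsec);
    printf("average per call: %.3fus\n", ns / iters / 1000.0);
    return 0;
}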

BTW, you might compile with a more recent version of GCC (as of November 2017, GCC 7, with GCC 8 coming in a few weeks), and you should compile with gcc -march=native -O3 for benchmarking purposes. Try other optimization options and tunings as well. You could also try another compiler, e.g. Clang/LLVM.

Look also at this answer (regarding parallelization) to a relevant question. The numpy package probably uses similar techniques internally (outside of the Python GIL), so it could be faster than your naive sequential matrix multiplication code in C.
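To give a flavor of what such parallelization can look like (a sketch only; this is not what numpy actually does internally, and the linked answer covers the topic in more depth), an OpenMP version of the loop nest might be:

/* Sketch: OpenMP-parallelized 32x32 multiply; compile with -fopenmp.
 * Illustrative only: for matrices this small, the threading overhead
 * will likely outweigh any gain. */
#include <stdint.h>

void mmult_omp(int32_t a[1024], int32_t b[1024], int32_t c[1024])
{
    #pragma omp parallel for
    for (int i = 0; i < 32; i++) {
        for (int j = 0; j < 32; j++) {
            int32_t result = 0;
            for (int k = 0; k < 32; k++)
                result += a[i*32+k] * b[k*32+j];
            c[i*32+j] = result;
        }
    }
}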
