All,
I'm writing some performance sensitive code, including a 3d vector class that will be doing lots of cross-products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be approximately the same speed as macros. However, in performance testing macro vs inline functions, I've come to an interesting discovery that I hope is the result of me making a stupid mistake somewhere: the macro version of my function appears to be over 8 times as fast as the inline version!
First, a ridiculously trimmed down version of a simple vector class:
class Vector3d { public: double m_tX, m_tY, m_tZ; Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {} Vector3d(const double &tX, const double &tY, const double &tZ): m_tX(tX), m_tY(tY), m_tZ(tZ) {} static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV ) { cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; } #define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \ cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \ cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \ cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; } }; Here's my sample benchmarking code:
Vector3d right; Vector3d forward(1.0, 2.2, 3.6); Vector3d up(3.2, 1.4, 23.6);
clock_t start = clock(); for (long l=0; l < 100000000; l++) { Vector3d::CrossAndAssign(forward, up, right); // static inline version } clock_t end = clock(); std::cout << end - start << endl; clock_t start2 = clock(); for (long l=0; l<100000000; l++) { FastVectorCrossAndAssign(forward, up, right); // macro version } clock_t end2 = clock(); std::cout << end2 - start2 << endl; The end result: With optimizations turned completely off, the inline version takes 3200 ticks, and the macro version 500 ticks... With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.
So I appeal to all of you: is this really true? Have I made a stupid mistake somewhere? Or are inline functions really this much slower -- and if so, why?