Some trivial observations:

- `#include <omp.h>` is unnecessary. (You use OpenMP, but don't call any OpenMP functions.)
- The [return type of `main()`](http://stackoverflow.com/q/204476/1157100) should be `int`, not `void`.
- The code also compiles with `clang` (LLVM), if you omit the `-masm=intel` option.
- `zeromat()` could simply be `memset(C, 0, n * sizeof(double))`.
- When compiling with `-Wall`, the code in `dgemm_2x8_sse()` to zero some registers causes spurious warnings:

  >     matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
  >         r8 = _mm_xor_pd(r8,r8); // ab
  >                         ^~
  >     matmul.c:52:5: note: variable 'r8' is declared here
  >         register __m128d xmm1, xmm4, //
  >         ^

  I recommend disabling the warnings with a pair of pragmas:

      #pragma GCC diagnostic ignored "-Wuninitialized"
      r8 = _mm_xor_pd(r8,r8); // ab
      r9 = _mm_xor_pd(r9,r9);
      r10 = _mm_xor_pd(r10,r10);
      r11 = _mm_xor_pd(r11,r11);
      r12 = _mm_xor_pd(r12,r12); // ab + 8
      r13 = _mm_xor_pd(r13,r13);
      r14 = _mm_xor_pd(r14,r14);
      r15 = _mm_xor_pd(r15,r15);
      #pragma GCC diagnostic warning "-Wuninitialized"

  You should also discard the confusing and useless comment that precedes that code:

      // 10 registers declared here
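To illustrate the `zeromat()` point, here is a minimal sketch of the `memset()` version. The signature is my assumption (a pointer to `n` contiguous `double`s); your actual function may take different parameters, so adapt the element count accordingly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature: C points to n contiguous doubles.
   The real zeromat() in the reviewed code may differ. */
static void zeromat(double *C, size_t n)
{
    /* All-bits-zero is +0.0 for IEEE 754 doubles, which is the
       representation used on x86, the target of the SSE code. */
    memset(C, 0, n * sizeof(double));
}
```

If `C` is an `n`-by-`n` matrix stored contiguously, the count passed in would be `n * n` rather than `n`.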