Without addressing performance concerns, some trivial observations:
- `#include <omp.h>` is unnecessary. (You use OpenMP, but don't call any OpenMP functions.)
- The return type of `main()` should be `int`, not `void`.
- The code also compiles with `clang` (LLVM), if you omit the `-masm=intel` option.
- `zeromat()` could simply be `memset(C, 0, n * sizeof(double))`.
- When compiling with `-Wall`, the code in `dgemm_2x8_sse()` that zeroes some registers causes spurious warnings:

      matmul.c:56:21: warning: variable 'r8' is uninitialized when used here [-Wuninitialized]
          r8 = _mm_xor_pd(r8,r8); // ab
                          ^~
      matmul.c:52:5: note: variable 'r8' is declared here
          register __m128d xmm1, xmm4, //
          ^

  I recommend disabling the warnings with a pair of pragmas:

      #pragma GCC diagnostic ignored "-Wuninitialized"
          r8 = _mm_xor_pd(r8,r8); // ab
          r9 = _mm_xor_pd(r9,r9);
          r10 = _mm_xor_pd(r10,r10);
          r11 = _mm_xor_pd(r11,r11);
          r12 = _mm_xor_pd(r12,r12); // ab + 8
          r13 = _mm_xor_pd(r13,r13);
          r14 = _mm_xor_pd(r14,r14);
          r15 = _mm_xor_pd(r15,r15);
      #pragma GCC diagnostic warning "-Wuninitialized"

  You should also discard the confusing and useless comment that precedes that code:

      // 10 registers declared here

- "Gflops/s" is redundant and incorrect terminology: "flops" already means floating-point operations per second, so the extra "/s" is wrong (unless you are talking about acceleration, not speed!)
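To illustrate the `zeromat()` suggestion, here is a minimal sketch. The signature is an assumption on my part (a pointer to the matrix storage plus `n`, taken to be the total number of `double` elements); adjust it to match your actual function. It relies on all-bits-zero being the representation of `0.0`, which holds for IEEE 754 doubles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature: C points to n contiguous doubles.
 * Setting every byte to zero yields 0.0 on IEEE 754 platforms. */
void zeromat(double *C, size_t n)
{
    assert(C != NULL);
    memset(C, 0, n * sizeof(double));
}
```

If `n` in your code is the matrix dimension rather than the element count, the byte count would need to be `n * n * sizeof(double)` instead.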