replaced http://stackoverflow.com/ with https://stackoverflow.com/

See related issues here, including info about my hardware.

deleted 160 characters in body
Jamal

Cross-post from Stack Overflow since this is very code intensive

Problem

gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-loops -fomit-frame-pointer -masm=intel

Hopefully this is understandable; please ask if not. I set up the macro structure (the for loops) as described in the 2nd paper above. I pack the matrices as discussed in either paper. My inner kernel computes 2x8 blocks, as this seems to be the optimal block size for the Nehalem architecture (see the kernels in the GotoBLAS source code). The inner kernel is based on the concept of calculating rank-1 updates, as described here.
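
To make the rank-1 update idea concrete, here is a minimal scalar sketch of a 2x8 kernel working on packed panels. It is an illustration only, not the SSE kernel from the question: the names kernel_2x8, a_pack and b_pack, the double element type, the packing order, and the row-major layout of C are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 2, NR = 8 };  /* 2x8 register block, as named in the text */

/*
 * Hypothetical micro-kernel: C(0..1, 0..7) += A_panel * B_panel.
 * a_pack: kc groups of MR values (group p holds column p of the 2 x kc panel)
 * b_pack: kc groups of NR values (group p holds row p of the kc x 8 panel)
 * c:      row-major block of C with leading dimension ldc (assumed layout)
 */
void kernel_2x8(size_t kc, const double *a_pack, const double *b_pack,
                double *c, size_t ldc)
{
    double acc[MR][NR] = {{0.0}};  /* accumulators, zero-initialized */

    assert(a_pack && b_pack && c && ldc >= NR);

    for (size_t p = 0; p < kc; ++p) {
        /* One rank-1 update: acc += a(:, p) * b(p, :) */
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a_pack[p * MR + i] * b_pack[p * NR + j];
    }

    /* Write the accumulated block back into C once, after the k loop. */
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

Because both panels are packed, each iteration of the p loop reads MR + NR contiguous values and performs MR * NR multiply-adds. A vectorized kernel would presumably keep acc in SSE registers, but the loop structure stays the same.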

edited tags; edited title
200_success

Optimizing matrix multiplication of square matrices for full CPU utilization

links
200_success
edited title
syb0rg