
I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows in C++ I am working on Linux in C (so I use posix_memalign instead of _aligned_malloc).

I also implemented one compute-intensive method without making use of the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, and the time of the SSE version is usually slightly higher.

Is that normal? Could it be that GCC already optimizes with SSE (even with the -O0 option)? I also tried the -mfpmath=387 option, but nothing changed.

  • What CPU are you using? Commented Aug 10, 2011 at 16:19
  • I use an Intel Core i7 M640 2.80GHz Commented Aug 10, 2011 at 16:21
  • OK - see my answer below, and you might also want to post your code and the command line you are using to build it. Commented Aug 10, 2011 at 16:23

2 Answers


For floating point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs, so double precision may run at about the same speed with SIMD as with scalar code, and single precision might give you 2x for SIMD over scalar on a good day. For integer operations though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits with SSE.


3 Comments

That may be the cause. I will try a single precision version.
OK - but add the code and the command line to your question too; there are so many simple things that you can get wrong when starting to work with SIMD.
You were right, Paul R. The version that uses 32-bit integers gets a speedup of approximately 2x. I suppose that with 16- and 8-bit operations the benefits would be even greater. By the way, I deleted that square root operation in the integer version. Thanks a lot.

GCC has a very good built-in code vectorizer (which, IIRC, kicks in at -O0 and above), so it will use SIMD anywhere it can to speed up scalar code (and it will also optimize SIMD code a bit, if possible).

It's pretty easy to confirm whether that's what's happening here: just disassemble the output (or have GCC emit commented asm files).
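A minimal way to try this, assuming a recent GCC (the -fopt-info-vec flag exists from GCC 4.9 on; older versions used -ftree-vectorizer-verbose instead):

```c
/* A loop GCC's auto-vectorizer handles easily.  Compile with, e.g.:
 *   gcc -O3 -S vec.c              # then look for addps/mulps in vec.s
 *   gcc -O3 -fopt-info-vec -c vec.c   # GCC >= 4.9 reports vectorized loops
 * At -O0 no vectorization report appears and the .s file stays scalar. */
void scale(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}
```

The restrict qualifiers matter: without them the compiler must assume dst and src may overlap and will either skip vectorization or emit a runtime aliasing check.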

3 Comments

I checked the assembly code, and I only see the addps instructions I expected from the piece of code with explicit SSE.
I doubt that automatic vectorization comes into play at O0 (no optimization), as it's a very heavy optimization that should only kick in at O2 or O3.
If you look at the gcc man page, it says that -ftree-vectorize is enabled by -O3. That's on Debian/Ubuntu; it might be different on other platforms. Careful: -O0 means zero optimization. Optimization starts at -O1.
