Why is memcmp(a, b, size) so much faster than:
```c
for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;
```

Is memcmp a CPU instruction or something? It must be pretty deep, because I got a massive speedup using memcmp over the loop.
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
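As a concrete (illustrative, not from the original answer) case where the builtin tends to fire: a memcmp call with a small, compile-time-constant length, as below. Whether GCC actually expands it inline depends on the version, target, and optimization flags, and -fno-builtin forces a real library call; the struct and function names here are made up for the example.

```c
#include <string.h>

struct key { unsigned char bytes[16]; };

/* With optimization enabled, GCC can expand this fixed-size memcmp into a
   few inline compare instructions instead of emitting a library call. */
int keys_equal(const struct key *a, const struct key *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}
```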
On x86, this leverages the cmpsb instruction, which compares a byte at one memory location with a byte at another. Coupled with the repe prefix, the bytes are compared repeatedly until they are no longer equal or a count is exhausted (which is exactly what memcmp does).
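For reference (a sketch, not from the original answer), the repe cmpsb sequence behaves roughly like the following C loop: decrement a count, compare one byte from each buffer, and stop as soon as they differ or the count reaches zero. Flags and registers are not modeled; the return value stands in for the final state of ZF.

```c
#include <stddef.h>

/* Rough C model of "repe cmpsb" followed by "setz": returns 1 if the
   first `count` bytes match, 0 otherwise. (Names are illustrative.) */
static int repe_cmpsb_model(const unsigned char *s1, const unsigned char *s2,
                            size_t count)
{
    while (count != 0) {
        count--;                 /* rep decrements ecx/rcx each iteration   */
        if (*s1++ != *s2++)      /* cmpsb compares and advances esi/edi     */
            return 0;            /* repe stops when the bytes differ (ZF=0) */
    }
    return 1;                    /* all bytes equal (or count was zero)     */
}
```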
Given the following code:
```c
int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}
```

gcc version 3.4.4 on Cygwin generates the following assembly:
```asm
; (prologue)

mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                            ; zero, don't compare any bytes.
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                            ; Repeat this while equal (ZF is set)
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).

; (epilogue)
```

Reference:
Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
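To make "lots of data in parallel" concrete, here is a portable sketch (not taken from any particular library) that compares eight bytes at a time through a uint64_t; optimized library versions apply the same idea with wider SIMD registers and return an ordering rather than a plain equality flag.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equality-only comparison, 8 bytes per iteration. memcpy into a local
   integer sidesteps alignment and strict-aliasing problems and compiles
   to a single load on common targets. */
static int equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    size_t i = 0;

    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa + i, sizeof wa);
        memcpy(&wb, pb + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    for (; i < n; i++)          /* leftover tail bytes */
        if (pa[i] != pb[i])
            return 0;
    return 1;
}
```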
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
- sysdeps/x86_64/memcmp.S
- sysdeps/x86_64/multiarch/memcmp-sse4.S
- sysdeps/x86_64/multiarch/memcmp-ssse3.S

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
```asm
ENTRY(memcmp)
        .type   memcmp, @gnu_indirect_function
        LOAD_RTLD_GLOBAL_RO_RDX
        HAS_CPU_FEATURE (SSSE3)
        jnz     2f
        leaq    __memcmp_sse2(%rip), %rax
        ret

2:      HAS_CPU_FEATURE (SSE4_1)
        jz      3f
        leaq    __memcmp_sse4_1(%rip), %rax
        ret

3:      leaq    __memcmp_ssse3(%rip), %rax
        ret

END(memcmp)
```

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) not only to decide at runtime which version to use, but to actually patch itself so that this decision is made only once, at boot-up.
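The same pick-an-implementation-at-load-time mechanism (an ELF "indirect function", or ifunc) is available to ordinary programs through GCC's ifunc attribute plus __builtin_cpu_supports(). The sketch below is illustrative only: my_memcmp and both implementations are placeholder names, and it assumes GCC on an x86 ELF/glibc target.

```c
#include <stddef.h>

typedef int memcmp_fn(const void *, const void *, size_t);

/* Placeholder implementations; a real library would write the second one
   with SIMD instructions. */
static int memcmp_portable(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    for (size_t i = 0; i < n; i++)
        if (pa[i] != pb[i])
            return pa[i] < pb[i] ? -1 : 1;
    return 0;
}

static int memcmp_ssse3(const void *a, const void *b, size_t n)
{
    /* pretend this version is written with SSSE3 intrinsics */
    return memcmp_portable(a, b, n);
}

/* Resolver: run once by the dynamic linker when the symbol is bound;
   every later call goes straight to the chosen implementation. */
static memcmp_fn *resolve_my_memcmp(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? memcmp_ssse3 : memcmp_portable;
}

int my_memcmp(const void *a, const void *b, size_t n)
    __attribute__((ifunc("resolve_my_memcmp")));
```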
A later note on the builtin expansion (rep cmpsb, that is): string instructions like rep cmpsb are actually quite slow, and at some point the builtin version was much slower than the library call (the builtin can be disabled with -fno-builtin). gcc now generates a call to the libc version of memcmp instead, which (in glibc) has an optimized asm implementation using SIMD, not rep cmpsb. It is plain rep (as in rep movsb / rep stosb) that is fast, especially with ERMSB on Ivybridge+, not repz/repnz. See agner.org/optimize for instruction tables: on Skylake, repz cmps has a runtime of >= 2n cycles, taking >= 8n uops (where n is the element count, rcx, if it runs to the end), i.e. about 1 byte per 2 cycles for cmpsb, whereas rep movs has a best case of copying 32 bytes per cycle.

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
memcmp is a GCC builtin, not an intrinsic; "intrinsic" typically refers to C-level access to specific CPU instructions.
On x86, as others have mentioned, there are instructions for this, but such comparisons can often also be vectorized.
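As an illustration of that vectorization (a minimal sketch, assuming an x86 target with SSE2, and again equality-only rather than a full ordered memcmp):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Compare 16 bytes per iteration; fall back to memcmp for the tail. */
static int blocks_equal_sse2(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        /* Each equal byte sets one bit in the mask; 0xFFFF means all 16 matched. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
    }
    return memcmp(pa + i, pb + i, n - i) == 0;
}
```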