Why is memcmp(a, b, size) so much faster than:
```c
for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;
```

Is memcmp a CPU instruction or something? It must be pretty deep, because I got a massive speedup using memcmp over the loop.
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
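As a concrete (illustrative, not from the original answer) case where the builtin tends to fire: a memcmp call with a small, compile-time-constant length, as below. Whether GCC actually expands it inline depends on the version, target, and optimization flags, and -fno-builtin forces a real library call; the struct and function names here are made up for the example.

```c
#include <string.h>

struct key { unsigned char bytes[16]; };

/* With optimization enabled, GCC can expand this fixed-size memcmp into a
   few inline compare instructions instead of emitting a library call. */
int keys_equal(const struct key *a, const struct key *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}
```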
On x86, this leverages the cmpsb instruction, which compares a byte at one memory location with a byte at another. Coupled with the repe prefix, the bytes are compared repeatedly until they are no longer equal or a count is exhausted (which is exactly what memcmp does).
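For reference (a sketch, not from the original answer), the repe cmpsb sequence behaves roughly like the following C loop: decrement a count, compare one byte from each buffer, and stop as soon as they differ or the count reaches zero. Flags and registers are not modeled; the return value stands in for the final state of ZF.

```c
#include <stddef.h>

/* Rough C model of "repe cmpsb" followed by "setz": returns 1 if the
   first `count` bytes match, 0 otherwise. (Names are illustrative.) */
static int repe_cmpsb_model(const unsigned char *s1, const unsigned char *s2,
                            size_t count)
{
    while (count != 0) {
        count--;                 /* rep decrements ecx/rcx each iteration   */
        if (*s1++ != *s2++)      /* cmpsb compares and advances esi/edi     */
            return 0;            /* repe stops when the bytes differ (ZF=0) */
    }
    return 1;                    /* all bytes equal (or count was zero)     */
}
```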
Given the following code:
```c
int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}
```

gcc version 3.4.4 on Cygwin generates the following assembly:
```asm
; (prologue)

mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                            ; zero, don't compare any bytes.
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                            ; Repeat this while equal (ZF is set)
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).

; (epilogue)
```

Reference:
Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
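To make "lots of data in parallel" concrete, here is a portable sketch (not taken from any particular library) that compares eight bytes at a time through a uint64_t; optimized library versions apply the same idea with wider SIMD registers and return an ordering rather than a plain equality flag.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equality-only comparison, 8 bytes per iteration. memcpy into a local
   integer sidesteps alignment and strict-aliasing problems and compiles
   to a single load on common targets. */
static int equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    size_t i = 0;

    for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa + i, sizeof wa);
        memcpy(&wb, pb + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    for (; i < n; i++)          /* leftover tail bytes */
        if (pa[i] != pb[i])
            return 0;
    return 1;
}
```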
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
- sysdeps/x86_64/memcmp.S
- sysdeps/x86_64/multiarch/memcmp-sse4.S
- sysdeps/x86_64/multiarch/memcmp-ssse3.S

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
```asm
ENTRY(memcmp)
        .type   memcmp, @gnu_indirect_function
        LOAD_RTLD_GLOBAL_RO_RDX
        HAS_CPU_FEATURE (SSSE3)
        jnz     2f
        leaq    __memcmp_sse2(%rip), %rax
        ret

2:      HAS_CPU_FEATURE (SSE4_1)
        jz      3f
        leaq    __memcmp_sse4_1(%rip), %rax
        ret

3:      leaq    __memcmp_ssse3(%rip), %rax
        ret

END(memcmp)
```

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) not only to decide at runtime which version to use, but to actually patch itself so that this decision is made only once, at boot-up.
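The same pick-an-implementation-at-load-time mechanism (an ELF "indirect function", or ifunc) is available to ordinary programs through GCC's ifunc attribute plus __builtin_cpu_supports(). The sketch below is illustrative only: my_memcmp and both implementations are placeholder names, and it assumes GCC on an x86 ELF/glibc target.

```c
#include <stddef.h>

typedef int memcmp_fn(const void *, const void *, size_t);

/* Placeholder implementations; a real library would write the second one
   with SIMD instructions. */
static int memcmp_portable(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    for (size_t i = 0; i < n; i++)
        if (pa[i] != pb[i])
            return pa[i] < pb[i] ? -1 : 1;
    return 0;
}

static int memcmp_ssse3(const void *a, const void *b, size_t n)
{
    /* pretend this version is written with SSSE3 intrinsics */
    return memcmp_portable(a, b, n);
}

/* Resolver: run once by the dynamic linker when the symbol is bound;
   every later call goes straight to the chosen implementation. */
static memcmp_fn *resolve_my_memcmp(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? memcmp_ssse3 : memcmp_portable;
}

int my_memcmp(const void *a, const void *b, size_t n)
    __attribute__((ifunc("resolve_my_memcmp")));
```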
A later note on the builtin expansion (rep cmpsb, that is): string instructions like rep cmpsb are actually quite slow, and at some point the builtin version was much slower than the library call (the builtin can be disabled with -fno-builtin). gcc now generates a call to the libc version of memcmp instead, which (in glibc) has an optimized asm implementation using SIMD, not rep cmpsb. It is plain rep (as in rep movsb / rep stosb) that is fast, especially with ERMSB on Ivybridge+, not repz/repnz. See agner.org/optimize for instruction tables: on Skylake, repz cmps has a runtime of >= 2n cycles, taking >= 8n uops (where n is the element count, rcx, if it runs to the end), i.e. about 1 byte per 2 cycles for cmpsb, whereas rep movs has a best case of copying 32 bytes per cycle.

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
memcmp is a GCC builtin, not an intrinsic; "intrinsic" typically refers to C-level access to specific CPU instructions.
On x86, as others have mentioned, there are instructions for this, but such comparisons can often also be vectorized.
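As an illustration of that vectorization (a minimal sketch, assuming an x86 target with SSE2, and again equality-only rather than a full ordered memcmp):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Compare 16 bytes per iteration; fall back to memcmp for the tail. */
static int blocks_equal_sse2(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        /* Each equal byte sets one bit in the mask; 0xFFFF means all 16 matched. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
    }
    return memcmp(pa + i, pb + i, n - i) == 0;
}
```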