[BOLT][AArch64] Enabling Inlining for Memcpy for AArch64 in BOLT #154929
Conversation
@llvm/pr-subscribers-bolt Author: (yafet-a)
Changes
Overview
The pass for inlining memcpy in BOLT was X86-specific: it relied on the rep movsb instruction, which has no equivalent in Armv8-A. This patch implements a static size analysis system for AArch64 memcpy inlining that extracts copy sizes from preceding instructions and uses them to generate optimal width-specific load/store sequences.
Testing Coverage (inline-memcpy.s)
Thanks for fixing that. I have created this example: https://godbolt.org/z/rnaYKarfe What it shows is that X9 is saved to the stack before the call to memcpy, and after the call it is reloaded because it is still used. Can you add this test case to your positive tests please? You can keep the other little examples that you have, but I think it would be good to have a bigger test case where you match the whole assembly sequence that includes this caller-saved register behaviour. It would then be good to add one more that similarly tests the whole sequence but using the FP temp register; that would cover everything, I think.
AArch64 Memcpy Inline Optimization Results
Beyond the lit test, I ran some brief smoke tests on a few real-world binaries to validate the pattern detection of the immediate movs. I have shared a few of them below:
Correctness Testing
paschalis-mpeis left a comment
Hi @yafet-a,
Thanks for working to support memcpy! Nice patch and report. :)
Please see some comments below.
These are some great numbers.
paschalis-mpeis left a comment
Thanks for addressing the comments, @yafet-a.
Looks good. Tentative accept, pending the verification ask from @sjoerdmeijer; let's also give it a few days for the rest of the reviewers to react.
I also added a couple of nits.
sjoerdmeijer left a comment
Thanks for the testing, LGTM.
Let's wait a day with merging this in case someone else has more comments.

Overview
The pass for inlining memcpy in BOLT was X86-specific: it relied on the rep movsb instruction, which has no equivalent in Armv8-A. This patch implements a static size analysis system for AArch64 memcpy inlining that extracts copy sizes from preceding instructions and uses them to generate optimal width-specific load/store sequences.
Testing Coverage (inline-memcpy.s)
Positive Tests:
Negative Tests: