[BOLT][AArch64] Enabling Inlining for Memcpy for AArch64 in BOLT #154929
Conversation
@llvm/pr-subscribers-bolt Author: (yafet-a)
Changes
Overview
The pass for inlining memcpy in BOLT was X86-specific: it relied on the rep movsb instruction, which has no equivalent in Armv8-A. This patch implements a static size analysis system for AArch64 memcpy inlining that extracts copy sizes from preceding instructions and uses them to generate optimal width-specific load/store sequences.
Testing Coverage (inline-memcpy.s)
Thanks for fixing that. I have created this example: https://godbolt.org/z/rnaYKarfe What it shows is that X9 is saved to the stack before the call to memcpy, and after the call it is reloaded because it is still used. Can you add this test case to your positive tests please? You can keep the other little examples that you have, but I think it would be good to have a bigger test case where you match the whole assembly sequence that includes this caller-saved register behaviour. It would then be good to add one more that similarly tests the whole sequence but using the FP temp register; that would cover everything, I think.
AArch64 Memcpy Inline Optimization Results
Beyond the lit test, I ran some brief smoke tests on a few real-world binaries to validate the pattern detection of the immediate movs. I have shared a few of them below:
Correctness Testing
paschalis-mpeis left a comment
Hi @yafet-a,
Thanks for working to support memcpy! Nice patch and report. :)
Please see some comments below.
These are some great numbers.
paschalis-mpeis left a comment
Thanks for addressing the comments, @yafet-a.
Looks good. Tentative accept, pending the verification ask from @sjoerdmeijer; let's also give it a few days for the rest of the reviewers to react.
I also added a couple of nits.
sjoerdmeijer left a comment
Thanks for the testing, LGTM.
Let's wait a day with merging this in case someone else has more comments.

Overview
The pass for inlining memcpy in BOLT was X86-specific: it relied on the rep movsb instruction, which has no equivalent in Armv8-A. This patch implements a static size analysis system for AArch64 memcpy inlining that extracts copy sizes from preceding instructions and uses them to generate optimal width-specific load/store sequences.
Testing Coverage (inline-memcpy.s)
Positive Tests:
Negative Tests: