
Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications and I would like to know how to make Delphi generate more efficient code.

Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.

LLVM is supposed to be capable of producing reasonably tight and sensible code, but it seems that Delphi is using its facilities in a weird way. Delphi wants to use the stack very heavily, and it generally only utilizes the processor's registers r0-r3 as temporary variables. Perhaps craziest of all, it seems to load normal 32-bit integers with four 1-byte load operations. How can I make Delphi produce better ARM code for Android, without the byte-by-byte loads?

At first, I thought the byte-by-byte loading was for swapping byte order from big-endian, but that was not the case. It really is just loading a 32-bit number with four single-byte loads. The reason might be to load the full 32 bits without doing an unaligned word-sized memory load. (Whether it should avoid that is another matter, which would hint at the whole thing being a compiler bug.)
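To make the byte-by-byte pattern concrete, here is a minimal C sketch of the two ways a 32-bit value can be read from a possibly unaligned address. The function names are hypothetical and chosen only for illustration. The first function spells out the four single-byte loads and shifts. The second uses `memcpy`, which lets the compiler pick the load instruction itself.

```c
#include <stdint.h>
#include <string.h>

/* Four single-byte loads combined with shifts and ORs, least significant
   byte first. The result is the little-endian interpretation of the bytes,
   whatever the alignment of p. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    uint32_t lo = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
    uint32_t hi = (uint32_t)p[2] | ((uint32_t)p[3] << 8);
    return lo | (hi << 16);
}

/* Alignment-agnostic load that leaves the instruction choice to the
   compiler. The result is in host byte order. */
uint32_t load_u32_memcpy(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian target both functions return the same value for the same bytes. Whether the second one becomes a single word load depends on what the compiler is allowed to assume about the target's support for unaligned access.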

Let's look at this simple function:

    function ReadInteger(APInteger : PInteger) : Integer;
    begin
      Result := APInteger^;
    end;

Even with optimizations switched on, Delphi XE7 with update pack 1, as well as XE6, produce the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi

    00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
       0:   b580        push    {r7, lr}
       2:   466f        mov     r7, sp
       4:   b083        sub     sp, #12
       6:   9002        str     r0, [sp, #8]
       8:   78c1        ldrb    r1, [r0, #3]
       a:   7882        ldrb    r2, [r0, #2]
       c:   ea42 2101   orr.w   r1, r2, r1, lsl #8
      10:   7842        ldrb    r2, [r0, #1]
      12:   7803        ldrb    r3, [r0, #0]
      14:   ea43 2202   orr.w   r2, r3, r2, lsl #8
      18:   ea42 4101   orr.w   r1, r2, r1, lsl #16
      1c:   9101        str     r1, [sp, #4]
      1e:   9000        str     r0, [sp, #0]
      20:   4608        mov     r0, r1
      22:   b003        add     sp, #12
      24:   bd80        pop     {r7, pc}

Just count the number of instructions and memory accesses Delphi needs for that. And constructing a 32-bit integer from four single-byte loads...

If I change the function a little bit and use a var parameter instead of a pointer, it is slightly less convoluted:

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi

    00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
       0:   b580        push    {r7, lr}
       2:   466f        mov     r7, sp
       4:   b083        sub     sp, #12
       6:   9002        str     r0, [sp, #8]
       8:   6801        ldr     r1, [r0, #0]
       a:   9101        str     r1, [sp, #4]
       c:   9000        str     r0, [sp, #0]
       e:   4608        mov     r0, r1
      10:   b003        add     sp, #12
      12:   bd80        pop     {r7, pc}

I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and they are almost, but not exactly, the same as the Android var parameter version.

To clarify, the byte-by-byte loading happens only on Android. And only on Android do the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.

For comparison, here's what Free Pascal (FPC) 2.7.1 (SVN trunk version from March 2014) thinks of the function with optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

    00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:
       0:   6800        ldr     r0, [r0, #0]
       2:   46f7        mov     pc, lr

I also tested an equivalent C function with the C compiler that comes with the Android NDK.

    int ReadInteger(int *APInteger)
    {
        return *APInteger;
    }

And this compiles into essentially the same thing FPC made:

Disassembly of section .text._Z11ReadIntegerPi:

    00000000 <_Z11ReadIntegerPi>:
       0:   6800        ldr     r0, [r0, #0]
       2:   4770        bx      lr

I no longer work for the company where this question originated, and do not have access to Delphi XEx. While I was there, the problem was solved by migrating to mixed FPC+GCC (Pascal+C), with NEON intrinsics for some routines where it made a difference. (FPC+GCC is highly recommended also because it enables using standard tools, particularly Valgrind.) If someone can demonstrate, with credible examples, how they are actually able to produce optimized ARM code from Delphi XEx, I'm happy to accept the answer.

  • 15
    Btw, in the Google+ discussion about this, Sam Shaw notes that C++ shows the long-form code in debug builds and the optimised code in release, whereas Delphi does it in both. So it could well be a simple bug in the flags they're sending to LLVM, and if so a bug report is well worth filing; it might get fixed quite soon. Commented Jan 14, 2015 at 17:06
  • 9
    Oh, ok, I misread. Then, as Notlikethat said, it sounds like it assumes the pointer load would be unaligned (or can't guarantee alignment), and older ARM platforms can't necessarily do unaligned loads. Make sure you have it build targeting armeabi-v7a instead of armeabi (not sure if there are such options in this compiler), since unaligned loads should be supported since ARMv6 (while armeabi assumes ARMv5). (The shown disassembly doesn't look like it reads a bigendian value, it just reads a little endian value one byte at a time.) Commented Jan 14, 2015 at 17:07
  • 6
    I found RSP-9922 which appears to be this same bug. Commented Jan 16, 2015 at 11:48
  • 6
    Someone had asked about optimization getting broken between XE4 and XE5 in the embarcadero.public.delphi.platformspecific.ios newsgroup, in the thread "ARM Compiler optimization broken?" devsuperpage.com/search/… Commented Jan 26, 2015 at 9:30
  • 6
    @Johan: what executable is it? I had the impression that it was somehow baked inside Delphi's compiler executable. Give it a try and let us know the results. Commented Aug 3, 2015 at 12:56

1 Answer


We are investigating the issue. In short, it depends on the potential mis-alignment (to 32 boundary) of the Integer referenced by a pointer. Need a little more time to have all of the answers... and a plan to address this.

Marco Cantù, moderator on Delphi Developers

See also "Why are the Delphi zlib and zip libraries so slow under 64 bit?", as the Win64 libraries are shipped built without optimizations.


In the QP report "RSP-9922 Bad ARM code produced by the compiler, $O directive ignored?", Marco added the following explanation:

There are multiple issues here:

  • As indicated, optimization settings apply only to entire unit files and not to individual functions. Simply put, turning optimization on and off in the same file will have no effect.
  • Furthermore, simply having "Debug information" enabled turns off optimization. Thus, when one is debugging, explicitly turning on optimizations will have no effect. Consequently, the CPU view in the IDE will not be able to display a disassembled view of optimized code.
  • Third, loading non-aligned 64bit data is not safe and does result in errors, hence the separate 4 one byte operations that are needed in given scenarios.

2 Comments

Marco Cantù posted that note "We are investigating the issue" in January 2015, and the related bug report RSP-9922 was marked resolved with resolution "Works As Expected" in January 2016, and there's a mention "internal issue closed on Mar 2, 2015". I do not understand their explanations.
I added a comment in the issue resolution.
