1 PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING DEMYSTIFYING A COMPILER Created by Marina (geek) Kolpakova for UNN / 2015-2016
2 COURSE TOPICS Ordering optimization approaches Demystifying a compiler Mastering compiler optimizations Modern computer architectures concepts
3 OUTLINE Compilation trajectory Intermediate language Dealing with local variables Link-time and whole-program optimization Optimization levels Compiler optimization taxonomies Classic Scope Code pattern How to get feedback from optimization? Compiler optimization challenges Summary
4 EXECUTABLE GENERATION PHASES Pre-processing. Pre-process, but don't compile. gcc -E test.cc cl /E test.cc Compilation. Compile, but don't assemble. gcc -S test.cc cl /FA test.cc Assembling. Assemble, but don't link. gcc -c test.cc cl /c test.cc Linking. Link object files to generate the executable. gcc test.cc cl test.cc
5 . 1 COMPILATION TRAJECTORY Lexical Analysis scans the source code as a stream of characters converting it into lexemes (tokens). Syntax Analysis takes the tokens, produced by lexical analysis, as input and generates a syntax tree. Source code grammar (syntactical correctness) is checked here.
5 . 2 COMPILATION TRAJECTORY Semantic Analysis checks whether the constructed syntax tree follows the language rules (including the type checking). Intermediate Code Generation builds a program representation for some abstract machine. It is in between the high-level language and the target machine language.
5 . 3 COMPILATION TRAJECTORY Code Optimization optimizes the intermediate code (e.g., redundancy elimination). Code Generation takes an optimized representation of the intermediate code and maps it to the target machine language.
6 FRONTEND AND BACKEND Only a backend is required for new machine support. Only a frontend is required for new language support. Most optimizations resemble each other across all targets and can be applied between the frontend and the backend.
7 . 1 INTERMEDIATE LANGUAGE Optimization techniques become much easier to conduct on the level of intermediate code. Modern compilers usually use 2 levels of intermediate representation (IR).
7 . 2 INTERMEDIATE LANGUAGE High Level IR is close to the source and can be easily generated from the source code. Some code optimizations are possible. It is not very suitable for target machine optimization. Low Level IR is close to the target machine and used for machine-dependent optimizations: register allocation, instruction selection, peephole optimization.
7 . 3 INTERMEDIATE LANGUAGE Language-specific, to be used for JIT compilation later: Java byte code, .NET CLI, NVIDIA PTX. Language-independent, like three-(four-)address code (similar to a classic RISC ISA). a = b + c * d + c * d; Three-Address Code (TAC) r1 = c * d; r2 = b + r1; r3 = r2 + r1; a = r3 Here each ri is an abstract register.
7 . 4 THREE-ADDRESS CODE Quadruples have four fields:
Op  arg1  arg2  result
*   c     d     r1
+   b     r1    r2
+   r2    r1    r3
=   r3          a
Triples or indirect triples have three fields:
Op  arg1  arg2
*   c     d
+   b     (0)
+   (1)   (0)
=   (2)
7 . 5 INTERMEDIATE LANGUAGE Provides a frontend-independent code representation. GENERIC and GIMPLE: GNU Compiler Collection (-fdump-tree-all -fdump-tree-optimized -fdump-tree-ssa -fdump-rtl-all). LLVM IL: clang and other LLVM-based compilers (-emit-llvm). CIL (C Intermediate Language): Visual Studio cl.exe.
7 . 6 INTERMEDIATE LANGUAGE uint32_t gray2rgba_v1(uint8_t c) { return c + (c<<8) + (c<<16) + (c<<24); } $ clang -Os -S -emit-llvm test.c -o test.ll $ cat test.ll define i32 @gray2rgba_v1(i8 zeroext %c) #0 { %1 = zext i8 %c to i32 %2 = mul i32 %1, 16843009 ret i32 %2 } gray2rgba_v1: movzbl %dil, %eax imull $16843009, %eax, %eax ret
8 . 1 DEALING WITH LOCAL VARIABLES The compiler doesn't care how many local variables are used in the code: register allocation is done after IR optimizations. for( ; j <= roi.width - 4; j += 4 ) { uchar t0 = tab[src[j]]; uchar t1 = tab[src[j+ 1]]; dst[j] = t0; dst[j+1] = t1; t0 = tab[src[j+2]]; t1 = tab[src[j+3]]; dst[j+2] = t0; dst[j+3] = t1; }
8 . 2 DEALING LOCAL VARIABLES .lr.ph4: ; preds = %0, %.lr.ph4 %indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [0, %0 ] %6 = getelementptr inbounds i8* %src, i64 %indvars.iv5 %7 = load i8* %6, align 1, !tbaa !1 %8 = zext i8 %7 to i64 %9 = getelementptr inbounds i8* %tab, i64 %8 %10 = load i8* %9, align 1, !tbaa !1 %11 = or i64 %indvars.iv5, 1 %12 = getelementptr inbounds i8* %src, i64 %11 %13 = load i8* %12, align 1, !tbaa !1 %14 = zext i8 %13 to i64 %15 = getelementptr inbounds i8* %tab, i64 %14 %16 = load i8* %15, align 1, !tbaa !1 %17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5 store i8 %10, i8* %17, align 1, !tbaa !1 %18 = getelementptr inbounds i8* %dst, i64 %11 store i8 %16, i8* %18, align 1, !tbaa !1 %19 = or i64 %indvars.iv5, 2 // ... %28 = zext i8 %27 to i64 %29 = getelementptr inbounds i8* %tab, i64 %28 %30 = load i8* %29, align 1, !tbaa !1 %31 = getelementptr inbounds i8* %dst, i64 %19 store i8 %24, i8* %31, align 1, !tbaa !1 %32 = getelementptr inbounds i8* %dst, i64 %25 store i8 %30, i8* %32, align 1, !tbaa !1 %indvars.iv.next6 = add nuw nsw i64 %indvars.iv5,4 %33 = trunc i64 %indvars.iv.next6 to i32 %34 = icmp sgt i32 %33, %1 br i1 %34, label %..preheader_crit_edge, label %.lr.ph4
9 . 1 LINK-TIME OPTIMIZATION (LTO) Perform inter-procedural optimizations during linking. Most compilers support this feature: clang (-flto), gcc (-flto) starting with 4.9, cl.exe (/GL, /LTCG), ...
9 . 2 WHOPR: WHOLE PROGRAM OPTIMIZATION 1. Compile each source file separately, add extra information to the object file 2. Analyze information collected from all object files 3. Perform the second optimization phase to generate object files 4. Link the final binary Eliminates even more redundant code Compilation is better optimized for multi-core systems
10 . 1 OPTIMIZATION LEVELS -O0 (the default) No optimization generates unoptimized code but has the fastest compilation time. -O1 Moderate optimization optimizes reasonably well but does not degrade compilation time significantly. -O2 Full optimization generates highly optimized code and has the slowest compilation time.
10 . 2 OPTIMIZATION LEVELS -O3 Aggressive optimization employs more aggressive automatic inlining of subprograms within a unit and attempts to vectorize. -Os Optimizes with focus on program size enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
10 . 3 ENABLED OPTIMIZATIONS: GCC -O0 GNU C version 4.9.2 (x86_64-linux-gnu) $ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fcommon -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 4 ENABLED OPTIMIZATIONS: GCC -O1 GNU C version 4.9.2 (x86_64-linux-gnu) $ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fdefer-pop -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fident -fif-conversion -fif-conversion2 -finline -finline-atomics -finline-functions-called-once -fipa-profile -fipa-pure-const -fipa-reference -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 5 ENABLED OPTIMIZATIONS: GCC -O2 GNU C version 4.9.2 (x86_64-linux-gnu) $ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
10 . 6 ENABLED OPTIMIZATIONS: GCC -O3 GNU C version 4.9.2 (x86_64-linux-gnu) $ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-loop-vectorize -ftree-parallelize-loops= -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funswitch-loops -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
11 CLASSIC COMPILER OPTIMIZATION TAXONOMY Machine independent Applicable across a broad range of machines 1. Eliminate redundant computations, dead code 2. Reduce running time and space 3. Decrease ratio of overhead to real work 4. Specialize code on a context 5. Enable other optimizations Machine dependent Capitalize on specific machine properties 1. Manage or hide latency 2. Manage resources (registers, stack) 3. Improve mapping from IR to the concrete machine 4. Use some exotic instructions (e.g., VLDM)
12 . 1 SCOPE COMPILER OPTIMIZATION TAXONOMY Interprocedural optimizations consider the whole translation unit, involve analysis of data flow and dependency graphs. Intraprocedural optimizations consider the whole procedure, involve analysis of data flow and dependency graphs.
12 . 2 SCOPE COMPILER OPTIMIZATION TAXONOMY Global optimizations consider the innermost code block together with its context. Loop optimizations belong to this class. Local optimizations consider a single basic block; the analysis is limited to it. Peephole optimizations map one or more consecutive operators from the IR to machine code.
12 . 3 INTERPROCEDURAL OPTIMIZATIONS (IPO) Look at all routines in a translation unit in order to make optimizations across routine boundaries, including but not limited to inlining and cloning. Also called Interprocedural Analysis (IPA). The compiler can move, optimize, restructure and delete code between procedures and even different source files, if LTO is enabled. Inlining: replacing a subroutine call with a copy of its body. Cloning: optimizing logic in a copied subroutine for a particular call site.
13 PATTERN COMPILER OPTIMIZATION TAXONOMY Dependency chains (linear code) Branches Loop bodies Single loop Loop and branch Multi-loop Function calls to subroutines
14 . 1 HOW TO GET OPTIMIZATION FEEDBACK? Check the wall-time of your application If the compiler has done its job well, you'll see performance improvements Dump the assembly of your code (and/or the IL) Check instruction and register scheduling Check for extra operations and register spills See the compiler optimization report All compilers have some support for it Some of them can generate very detailed reports about loop unrolling, auto-vectorization, VLIW slot scheduling, etc
14 . 2 COMMONLY CONSIDERED METRICS Wall(-clock) time is a human perception of the span of time from the start to the completion of a task. Power consumption is the electrical energy which is consumed to complete a task. Processor time (or runtime) is the total execution time during which a processor was dedicated to a task (i.e. executes instructions of that task).
14 . 3 DUMPING ASSEMBLY Reading assembly is a must-have skill for checking the compiler, but assembly is rarely used to write low-level code. $ gcc code.c -S -o asm.s Writing assembly is the least portable optimization Inline assembly limits compiler optimizations Assembly does not give an overwhelming speedup nowadays Sometimes it is needed to work around compiler bugs and optimization limitations
14 . 4 EXAMPLE: GCC FEEDBACK OPTIONS Enables optimization information printing -fopt-info -fopt-info-<optimized/missed/note/all> -fopt-info-all-<ipa/loop/inline/vec/optall> -fopt-info=filename Controls the amount of debugging output the scheduler prints on targets that use instruction scheduling -fsched-verbose=n Controls the amount of output from the auto-vectorizer -ftree-vectorizer-verbose=n
14 . 5 EXAMPLES: GCC FEEDBACK OPTIONS Outputs all optimization info to stderr. gcc -O3 -fopt-info Outputs missed optimization report from all the passes to missed.txt gcc -O3 -fopt-info-missed=missed.txt Outputs information about missed optimizations as well as optimized locations from all the inlining passes to inline.txt. gcc -O3 -fopt-info-inline-optimized-missed=inline.txt
14 . 6 GCC FEEDBACK EXAMPLE ./src/box.cc:193:9: note: loop vectorized ./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:96:9: note: loop vectorized ./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:51:9: note: loop vectorized ./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:193:9: note: loop with 7 iterations completely unrolled ./src/box.cc:32:13: note: loop with 7 iterations completely unrolled ./src/box.cc:96:9: note: loop with 15 iterations completely unrolled ./src/box.cc:51:9: note: loop with 15 iterations completely unrolled ./src/box.cc:584:9: note: loop vectorized ./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:482:9: note: loop vectorized ./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment ./src/box.cc:463:5: note: loop vectorized ./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing ./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment
15 . 1 POINTER ALIASING void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp; } ARE THEY ALWAYS EQUAL?
15 . 2 POINTER ALIASING What if.. int main(int argc, char** argv) { int i = 5, j = 5; twiddle1(&i, &i); twiddle2(&j, &j); printf("twiddle1 result is %d\n", i); printf("twiddle2 result is %d\n", j); } twiddle1 result is 20 while twiddle2 result is 15
15 . 3 POINTER ALIASING Aliasing refers to the situation where the same memory location can be accessed by using different names. void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; } void twiddle2(int *xp, int *yp) { *xp += 2* *yp; }
15 . 4 STRICT ALIASING ASSUMPTION Strict aliasing is an assumption, made by a C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location. This assumption enables more aggressive optimization (gcc makes it starting from -O2), but a programmer has to follow the strict aliasing rules to get code working correctly.
15 . 5 STRICT ALIASING ASSUMPTION void check(int32_t *h, int64_t *k) { *h = 5; *k = 6; printf("%d\n", *h); } void main() { int64_t k; check((int32_t *)&k, &k); } gcc -O1 test.c -o test ; ./test results in 6 gcc -O2 test.c -o test ; ./test results in 5
15 . 6 POINTER ALIASING: MISSED OPPORTUNITIES The compiler freely schedules arithmetic, but often has to preserve the order of memory dereferences The compiler is limited in redundancy elimination The compiler is limited in loop unrolling The compiler is limited in auto-vectorization
16 . 1 FUNCTION CALLS int callee(); int caller() { return callee() + callee(); } int callee(); int caller() { return 2*callee(); } ARE THEY EQUAL?
16 . 2 FUNCTION CALLS int callee(int i); int caller() { int s=0, i=0; for ( ; i < N ; i++) s += callee(i); return s; } int callee(int i); int caller() { int s0=0, s1=0, i=0; for ( ; i < N/2 ; i+=2) { s0+=callee(i); s1+=callee(i+1); } return s0 + s1; } ARE THEY EQUAL?
16 . 3 PURE FUNCTIONS A pure function is a function for which both of the following statements are true: 1. The function always evaluates to the same result given the same argument value(s). The result must not depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor on any external input from I/O devices. 2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.
16 . 4 PURE FUNCTIONS Pure functions are much easier to optimize. Expressing ideas in code as pure functions simplifies the compiler's life. Most functions from math.h are not pure (they set/clear floating point flags and conditions, throw floating point exceptions) Use the C++11 constexpr keyword to hint to the compiler that a function could be evaluated at compile time Use the static keyword to help the compiler see all the usages of the function (and perform aggressive inlining, or even deduce whether the function is pure or not) Neither constexpr nor static guarantees that a function is pure, but they give the compiler some hints.
16 . 5 FUNCTION CALLS: MISSED OPPORTUNITIES If the compiler fails to inline a function body: it is limited in redundancy elimination there is some overhead on function calls inlining is crucial for function calls from loops many other optimizations aren't performed for this fragment of code loop unrolling auto-vectorization etc potential bloating of code and stack
17 FLOATING POINT Floating point arithmetic is not associative, so A+(B+C) != (A+B)+C A compiler is very conservative about floating point! void associativityCheck (void) { double x = 3.1415926535897931; double a = 1.0e15; double b = -(1.0e15 - 1.0); printf("%f %f\n", x*(a + b), x*a + x*b ); } $ gcc check.c -o check ; ./check 3.141593 3.000000 This situation is known as catastrophic cancellation
18 MORE OPTIMIZATION CHALLENGES Branches inside a loop Exceptions Accesses to global variables Inline assembly The volatile keyword
19 . 1 SUMMARY Source code goes through lexical, syntax and semantic analysis, as well as IR generation and optimization, before target machine code is produced Backend and frontend separation simplifies compiler development An intermediate language makes compiler optimizations reusable across a broad range of languages and targets IL can be language-specific or language-independent Triples and quadruples are widely used as language-independent IR All compiler optimizations are done on IR Register allocation goes after IR optimization, so local-variable reuse is pointless nowadays!
19 . 2 SUMMARY LTO allows performing optimizations during linking WHOPR allows globally optimizing the whole binary A compiler usually supports multiple optimization levels Compiler optimizations are split into machine-dependent and machine-independent groups By scope, compiler optimizations are split into interprocedural, intraprocedural, global, local and peephole The most common targets are dependency chains, branches, loops Compiler optimization is a multi-phase iterative process Performing one optimization enables many others Most optimizations need a certain order of application
19 . 3 SUMMARY Checking wall-time, assembly or the optimizer's report are the most common ways to get optimization feedback Wall time is the most important metric to optimize Reading assembly is a must for checking the compiler, but assembly is rarely used to write low-level code Inspect the optimizer's report to demystify its "habits" Stick to the strict aliasing rules Clean code is not enough.. Write pure code! Compilers are usually very conservative when optimizing floating point math
20 THE END Marina Kolpakova / 2015-2016

Pragmatic Optimization in Modern Programming - Demystifying the Compiler

  • 1.
    1 PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING DEMYSTIFYINGA COMPILER Created by for / 2015-2016Marina (geek) Kolpakova UNN
  • 2.
    2 COURSE TOPICS Ordering optimizationapproaches Demystifying a compiler Mastering compiler optimizations Modern computer architectures concepts
  • 3.
    3 OUTLINE Compilation trajectory Intermediate language Dealingwith local variables link-time and whole program optimization Optimization levels Compiler optimization taxonomies Classic Scope Code pattern How to get the feedback from optimization? Compiler optimization challenges Summary
  • 4.
    4 EXECUTABLE GENERATION PHASES Pre-processing.Pre-process, but don't compile. gcc -E test.cc cl /E test.cc Compilation. Compile, but don't assemble. gcc -S test.cc cl /FA test.cc Assembling. Assemble, but don't link. gcc -c test.cc cl /c test.cc Linking. Link object les to generate the executable. gcc test.cc cl test.cc
  • 5.
    5 . 1 COMPILATIONTRAJECTORY Lexical Analysis scans the source code as a stream of characters converting it into lexemes (tokens). Syntax Analysis takes the tokens, produced by lexical analysis, as input and generates a syntax tree. Source code grammar (syntactical correctness) is checked here.
  • 6.
    5 . 2 COMPILATIONTRAJECTORY Semantic Analysis checks whether the constructed syntax tree follows the language rules (including the type checking). Intermediate Code Generation builds a program representation for some abstract machine. It is in between the high-level language and the target machine language.
  • 7.
    5 . 3 COMPILATIONTRAJECTORY Code Optimization does optimization of the intermediate code (eg, redundancy elimination). Code Generation takes an optimized representation of the intermediate code and maps it to the target machine language.
  • 8.
    6 FRONTEND AND BACKEND Onlya backend is required for new machine support Only a frontend is required for new language support Most of optimizations resemble each other for all targets and could be applied in between frontend and backend
  • 9.
    7 . 1 INTERMEDIATELANGUAGE Optimization techniques become much easier to conduct on the level of intermediate code. Modern compilers usually use 2 levels of intermediate representation (IR).
  • 10.
    7 . 2 INTERMEDIATELANGUAGE High Level IR is close to the source and can be easily generated from the source code. Some code optimizations are possible. It is not very suitable for target machine optimization. Low Level IR is close to the target machine and used for machine- dependent optimizations: register allocation, instruction selection, peephole optimization.
  • 11.
    7 . 3 INTERMEDIATELANGUAGE Language-speci c to be used for JIT compilation later: Java byte code; .NET CLI, NVIDIA PTX. Language independent, like three-(four-)address code (similar to a classic RISC ISA). a = b + c * d + c * d; Three-Address Code (TAC) r1 = c * d; r2 = b + r1; r3 = r2 + r1; a = r3 Here rth is an abstract register.
  • 12.
    7 . 4 THREE-ADDRESSCODE Quadruples has four elds Op arg1 arg2 result * c d r1 + b r1 r2 + r2 r1 r3 = r3 a Triples or Indirect triples have three elds Op arg1 arg2 * c d + b (0) + (1) (0) = (2)
  • 13.
    7 . 5 INTERMEDIATELANGUAGE Provides frontend independent code representation. and GNU Compiler Collection -fdump-tree-all -fdump-tree-optimized -fdump-tree-ssa -fdump-rtl-all clang and other LLWM based compilers -emit-llvm CIL (C Intermediate Language) Visual Studio cl.exe GENERIC GIMPLE LLVM IL
  • 14.
    7 . 6 INTERMEDIATELANGUAGE uint32_t gray2rgba_v1(uint8_t c) { return c + (c<<8) + (c<<16) + (c<<24); } $ clang -Os -S -emit-llvm test.c -o test.ll $ cat test.ll define i32 @gray2rgba_v1(i8 zeroext %c) # 0 { %1 = zext i8 %c to i32 %2 = mul i32 %1, 16843009 ret i32 %2 } gray2rgba_v1: movzbl %dil, %eax imull $16843009, %eax, %eax ret
  • 15.
    8 . 1 DEALINGLOCAL VARIABLES Compiler don't care how many variables are used in code, register allocation is done after IR rotations. for( ; j <= roi.width - 4; j += 4 ) { uchar t0 = tab[src[j]]; uchar t1 = tab[src[j+ 1]]; dst[j] = t0; dst[j+1] = t1; t0 = tab[src[j+2]]; t1 = tab[src[j+3]]; dst[j+2] = t0; dst[j+3] = t1; }
  • 16.
    8 . 2 DEALINGLOCAL VARIABLES .lr.ph4: ; preds = %0, %.lr.ph4 %indvars.iv5 = phi i64 [ %indvars.iv.next6, %.lr.ph4 ], [0, %0 ] %6 = getelementptr inbounds i8* %src, i64 %indvars.iv5 %7 = load i8* %6, align 1, !tbaa !1 %8 = zext i8 %7 to i64 %9 = getelementptr inbounds i8* %tab, i64 %8 %10 = load i8* %9, align 1, !tbaa !1 %11 = or i64 %indvars.iv5, 1 %12 = getelementptr inbounds i8* %src, i64 %11 %13 = load i8* %12, align 1, !tbaa !1 %14 = zext i8 %13 to i64 %15 = getelementptr inbounds i8* %tab, i64 %14 %16 = load i8* %15, align 1, !tbaa !1 %17 = getelementptr inbounds i8* %dst, i64 %indvars.iv5 store i8 %10, i8* %17, align 1, !tbaa !1 %18 = getelementptr inbounds i8* %dst, i64 %11 store i8 %16, i8* %18, align 1, !tbaa !1 %19 = or i64 %indvars.iv5, 2 // ... %28 = zext i8 %27 to i64 %29 = getelementptr inbounds i8* %tab, i64 %28 %30 = load i8* %29, align 1, !tbaa !1 %31 = getelementptr inbounds i8* %dst, i64 %19 store i8 %24, i8* %31, align 1, !tbaa !1 %32 = getelementptr inbounds i8* %dst, i64 %25 store i8 %30, i8* %32, align 1, !tbaa !1 %indvars.iv.next6 = add nuw nsw i64 %indvars.iv5,4 %33 = trunc i64 %indvars.iv.next6 to i32 %34 = icmp sgt i32 %33, %1 br i1 %34, label %..preheader_crit_edge, label %.lr.ph4
  • 17.
    9 . 1 Performinter-procedural optimizations during linking. Most compilers support this feature: LINK-TIME OPTIMIZATION (LTO) (- to) (- to) stating with 4.9 (/GL, /LTCG) ... clang gcc cl.exe
  • 18.
    9 . 2 WHOPR:WHOLE PROGRAM OPTIMIZATION 1. Compile each source le separately, add extra information to the object le 2. Analyze information collected from all object les 3. Perform second optimization phase to generate object le 4. Link the nal binary Eliminate even more redundant code Compilations is better optimized for multi-core systems
10 . 1 OPTIMIZATION LEVELS
-O0 (the default) No optimization: generates unoptimized code but has the fastest compilation time.
-O1 Moderate optimization: optimizes reasonably well but does not degrade compilation time significantly.
-O2 Full optimization: generates highly optimized code and has the slowest compilation time.
10 . 2 OPTIMIZATION LEVELS
-O3 Aggressive optimization: employs more aggressive automatic inlining of subprograms within a unit and attempts to vectorize.
-Os Optimizes with a focus on program size: enables all -O2 optimizations that do not typically increase code size, and performs further optimizations designed to reduce code size.
10 . 3 ENABLED OPTIMIZATIONS: GCC -O0
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O0 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fcommon -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fident -finline-atomics -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftrapping-math -ftree-coalesce-vars -ftree-cselim -ftree-forwprop -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-reassoc -ftree-scev-cprop -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 4 ENABLED OPTIMIZATIONS: GCC -O1
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O1 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fdefer-pop -fdelete-null-pointer-checks -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fforward-propagate -ffunction-cse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fident -fif-conversion -fif-conversion2 -finline -finline-atomics -finline-functions-called-once -fipa-profile -fipa-pure-const -fipa-reference -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -fpeephole -fprefetch-loop-arrays -freg-struct-return -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-volatile-bitfields -fsync-libcalls -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs
10 . 5 ENABLED OPTIMIZATIONS: GCC -O2
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O2 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
10 . 6 ENABLED OPTIMIZATIONS: GCC -O3
GNU C version 4.9.2 (x86_64-linux-gnu)
$ touch 1.c; gcc -O3 -S -fverbose-asm 1.c -o 1.s
options enabled: -faggressive-loop-optimizations -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg -fcaller-saves -fcombine-stack-adjustments -fcommon -fcompare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse -fgcse-after-reload -fgcse-lm -fgnu-runtime -fgnu-unique -fguess-branch-probability -fhoist-adjacent-loads -fident -fif-conversion -fif-conversion2 -findirect-inlining -finline -finline-atomics -finline-functions -finline-functions-called-once -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-profile -fipa-pure-const -fipa-reference -fipa-sra -fira-hoist-pressure -fira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole -fpeephole2 -fpredictive-commoning -fprefetch-loop-arrays -free -freg-struct-return -freorder-blocks -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fsched-critical-path-heuristic -fsched-dep-count-heuristic -fsched-group-heuristic -fsched-interblock -fsched-last-insn-heuristic -fsched-rank-heuristic -fsched-spec -fsched-spec-insn-heuristic -fsched-stalled-insns-dep -fschedule-insns2 -fshow-column -fshrink-wrap -fsigned-zeros -fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector -fstrict-aliasing -fstrict-overflow -fstrict-volatile-bitfields -fsync-libcalls -fthread-jumps -ftoplevel-reorder -ftrapping-math -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-loop-distribute-patterns -ftree-loop-if-convert -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-loop-vectorize -ftree-parallelize-loops= -ftree-partial-pre -ftree-phiprop -ftree-pre -ftree-pta -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-slp-vectorize -ftree-slsr -ftree-sra -ftree-switch-conversion -ftree-tail-merge -ftree-ter -ftree-vrp -funit-at-a-time -funswitch-loops -funwind-tables -fverbose-asm -fzero-initialized-in-bss -m128bit-long-double -m64 -m80387 -malign-stringops -mavx256-split-unaligned-load -mavx256-split-unaligned-store -mfancy-math-387 -mfp-ret-in-387 -mfxsr -mglibc -mieee-fp -mlong-double-80 -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2 -mtls-direct-seg-refs -mvzeroupper
11 CLASSIC COMPILER OPTIMIZATION TAXONOMY
Machine independent: applicable across a broad range of machines
1. Eliminate redundant computations, dead code
2. Reduce running time and space
3. Decrease the ratio of overhead to real work
4. Specialize code to a context
5. Enable other optimizations
Machine dependent: capitalize on specific machine properties
1. Manage or hide latency
2. Manage resources (registers, stack)
3. Improve the mapping from IR to the concrete machine
4. Use some exotic instructions (e.g. VLDM)
12 . 1 SCOPE COMPILER OPTIMIZATION TAXONOMY
Interprocedural optimizations consider the whole translation unit and involve analysis of data flow and dependency graphs.
Intraprocedural optimizations consider a whole procedure and involve analysis of data flow and dependency graphs.
12 . 2 SCOPE COMPILER OPTIMIZATION TAXONOMY
Global optimizations consider the innermost code block together with its context. Loop optimizations belong here.
Local optimizations consider a single block; the analysis is limited to it.
Peephole optimizations map one or more consecutive operations from the IR to machine code.
12 . 3 INTERPROCEDURAL OPTIMIZATIONS (IPO)
Look at all routines in a translation unit in order to make optimizations across routine boundaries, including but not limited to inlining and cloning. Also called Interprocedural Analysis (IPA).
The compiler can move, optimize, restructure and delete code between procedures, and even between different source files if LTO is enabled
Inlining: replacing a subroutine call with a replicated copy of its body
Cloning: specializing a copy of a subroutine for a particular call
13 CODE PATTERN COMPILER OPTIMIZATION TAXONOMY
Dependency chains (linear code)
Branches
Loop bodies
Single loop
Loop and branch
Multi-loop
Function calls to subroutines
14 . 1 HOW TO GET OPTIMIZATION FEEDBACK?
Check the wall-time of your application
If a compiler has done its job well, you'll see performance improvements
Dump the assembly of your code (and/or the IL)
Check instruction and register scheduling
Check for extra operations and register spills
See the compiler optimization report
All the compilers have some support for it
Some of them are able to generate very detailed reports about loop unrolling, auto-vectorization, VLIW slot scheduling, etc.
14 . 2 COMMONLY CONSIDERED METRICS
Wall(-clock) time is the human perception of the span of time from the start to the completion of a task.
Power consumption is the electrical energy consumed to complete a task.
Processor time (or runtime) is the total execution time during which a processor was dedicated to a task (i.e. executed instructions of that task).
14 . 3 DUMPING ASSEMBLY
Assembly is a must-have to check the compiler, but it is rarely used to write low-level code.
$ gcc code.c -S -o asm.s
Assembly writing is the least portable optimization
Inline assembly limits compiler optimizations
Assembly does not give an overwhelming speedup nowadays
Sometimes it is needed to overcome compiler bugs and optimization limitations
14 . 4 EXAMPLE: GCC FEEDBACK OPTIONS
Enables optimization information printing
-fopt-info
-fopt-info-<optimized/missed/note/all>
-fopt-info-all-<ipa/loop/inline/vec/optall>
-fopt-info=filename
Controls the amount of debugging output the scheduler prints on targets that use instruction scheduling
-fsched-verbose=n
Controls the amount of output from the auto-vectorizer
-ftree-vectorizer-verbose=n
14 . 5 EXAMPLES: GCC FEEDBACK OPTIONS
Outputs all optimization info to stderr.
gcc -O3 -fopt-info
Outputs the missed-optimization report from all the passes to missed.txt.
gcc -O3 -fopt-info-missed=missed.txt
Outputs information about missed optimizations as well as optimized locations from all the inlining passes to inline.txt.
gcc -O3 -fopt-info-inline-optimized-missed=inline.txt
14 . 6 GCC FEEDBACK EXAMPLE
./src/box.cc:193:9: note: loop vectorized
./src/box.cc:193:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:193:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:96:9: note: loop vectorized
./src/box.cc:96:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:51:9: note: loop vectorized
./src/box.cc:51:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:193:9: note: loop with 7 iterations completely unrolled
./src/box.cc:32:13: note: loop with 7 iterations completely unrolled
./src/box.cc:96:9: note: loop with 15 iterations completely unrolled
./src/box.cc:51:9: note: loop with 15 iterations completely unrolled
./src/box.cc:584:9: note: loop vectorized
./src/box.cc:584:9: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:584:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:482:9: note: loop vectorized
./src/box.cc:482:9: note: loop peeled for vectorization to enhance alignment
./src/box.cc:463:5: note: loop vectorized
./src/box.cc:463:5: note: loop versioned for vectorization because of possible aliasing
./src/box.cc:463:5: note: loop peeled for vectorization to enhance alignment
15 . 1 POINTER ALIASING
void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(int *xp, int *yp) { *xp += 2 * *yp; }
ARE THEY ALWAYS EQUAL?
15 . 2 POINTER ALIASING
What if..
int main(int argc, char** argv)
{
    int i = 5, j = 5;
    twiddle1(&i, &i);
    twiddle2(&j, &j);
    printf("twiddle1 result is %d\n", i);
    printf("twiddle2 result is %d\n", j);
}
twiddle1 result is 20, while twiddle2 result is 15
15 . 3 POINTER ALIASING
Aliasing refers to the situation where the same memory location can be accessed using different names.
void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(int *xp, int *yp) { *xp += 2 * *yp; }
15 . 4 STRICT ALIASING ASSUMPTION
Strict aliasing is an assumption, made by a C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location. This assumption enables more aggressive optimization (gcc makes it starting from -O2), but the programmer has to follow the strict aliasing rules for the code to work correctly.
15 . 5 STRICT ALIASING ASSUMPTION
void check(int32_t *h, int64_t *k)
{
    *h = 5;
    *k = 6;
    printf("%d\n", *h);
}
void main()
{
    int64_t k;
    check((int32_t *)&k, &k);
}
gcc -O1 test.c -o test ; ./test results in 6
gcc -O2 test.c -o test ; ./test results in 5
15 . 6 POINTER ALIASING: MISSED OPPORTUNITIES
The compiler freely schedules arithmetic, but often has to preserve the order of memory dereferences
The compiler is limited in redundancy elimination
The compiler is limited in loop unrolling
The compiler is limited in auto-vectorization
16 . 1 FUNCTION CALLS
int callee();
int caller() { return callee() + callee(); }
int callee();
int caller() { return 2*callee(); }
ARE THEY EQUAL?
16 . 2 FUNCTION CALLS
int callee(int i);
int caller()
{
    int s=0, i=0;
    for ( ; i < N; i++) s += callee(i);
    return s;
}
int callee(int i);
int caller()
{
    int s0=0, s1=0, i=0;
    for ( ; i < N; i+=2) { s0+=callee(i); s1+=callee(i+1); }
    return s0 + s1;
}
ARE THEY EQUAL?
16 . 3 PURE FUNCTIONS
A pure function is a function for which both of the following statements are true:
1. The function always evaluates to the same result given the same argument value(s). The result must not depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor on any external input from I/O devices.
2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices.
16 . 4 PURE FUNCTIONS
Pure functions are much easier to optimize. Expressing ideas in code as pure functions simplifies the compiler's life.
Most functions from math.h are not pure (they set/clear floating-point flags and conditions, and throw floating-point exceptions)
Use the constexpr keyword (C++11) to hint to the compiler that a function could be evaluated at compile time
Use the static keyword to help the compiler see all the usages of a function (and perform aggressive inlining, or even deduce whether the function is pure or not)
Neither constexpr nor static guarantees that a function is pure, but they give the compiler some hints.
16 . 5 FUNCTION CALLS: MISSED OPPORTUNITIES
If the compiler fails to inline a function body:
it is limited in redundancy elimination
there is some overhead on function calls
inlining is crucial for function calls from loops
many other optimizations aren't performed for this fragment of code
loop unrolling
auto-vectorization
etc.
On the other hand, inlining means potential bloating of code and stack
17 FLOATING POINT
Floating-point arithmetic is not associative, so A+(B+C) != (A+B)+C
A compiler is very conservative about floating point!
void associativityCheck(void)
{
    double x = 3.1415926535897931;
    double a = 1.0e15;
    double b = -(1.0e15 - 1.0);
    printf("%f %f\n", x*(a + b), x*a + x*b);
}
$ gcc check.c -o check ; ./check
3.141593 3.000000
Such a situation is known as catastrophic cancellation.
18 MORE OPTIMIZATION CHALLENGES
Branches inside a loop
Exceptions
Accesses to static-storage (global) variables
Inline assembly
The volatile keyword
19 . 1 SUMMARY
Source code goes through lexical, syntax and semantic analysis, as well as IR generation and optimization, before producing target machine code
The backend/frontend split simplifies compiler development
An intermediate language makes compiler optimizations reusable across a broad range of languages and targets
An IL can be language-specific or language-independent
Triples and quadruples are widely used as language-independent IRs
All the compiler optimizations are done on the IR
Register allocation goes after IR optimization, so local-variable reuse is pointless nowadays!
19 . 2 SUMMARY
LTO allows doing optimizations during linking
WHOPR allows globally optimizing the whole binary
A compiler usually supports multiple optimization levels
Compiler optimizations are split into machine-dependent and machine-independent groups
By scope, compiler optimizations are split into interprocedural, intraprocedural, global, local and peephole
The most common targets are dependency chains, branches and loops
Compiler optimization is a multi-phase iterative process
Performing one optimization enables many others
Most optimizations need a certain order of application
19 . 3 SUMMARY
Checking wall-time, assembly or the optimizer's report are the most common ways to get optimization feedback
Wall time is the most important metric to optimize
Assembly is a must-have to check the compiler, but it is rarely used to write low-level code
Inspect the optimizer's report to demystify its "habits"
Stick to the strict aliasing rules
Clean code is not enough.. Write pure code!
Compilers are usually very conservative when optimizing floating-point math