x86-64 Assembly - Sum of multiples of 3 or 5

Question

I'm trying to learn some basic x86 assembly and so I've begun solving Project Euler problems. I was hoping for some critique of my code that, hopefully, includes either the efficiency of the operations or the readability / style of the code itself. I will provide the Makefile for Linux 64 bit.

The purpose of the code is to sum all numbers from [0, 1000) that are divisible by 3 or 5.

The code can be run using make RUN=euler_1.

NB:

I am aware that most compilers replace modulos of known numbers with some combination of mov and shr to avoid the integer division. For example, see this thread.

Makefile

    .PHONY: clean

    all: $(RUN).elf
    	./$^

    %.elf: %.o
    	ld $^ -o $@ -lc -e main -dynamic-linker /lib64/ld-linux-x86-64.so.2

    %.o: %.asm
    	nasm -f elf64 $^

    clean:
    	rm -f *.o *.elf

euler_1.asm

    extern printf
    global main

    section .data
    fmt: db "%d", 0x0a, 0

    section .text
    ;; main - Calculate the sum of all numbers between [0, 1000) that are divisible
    ;; by 3 or 5.
    ;; sum : R8
    main:
        ; sum = 0
        mov r8, 0
        ; for i in [0, 1000) {
        mov rcx, 0
    for0:
        ; if i % 3 == 0 or i % 5 == 0 {
        ; i % 3 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 3
        div r9
        test rdx, rdx
        jne if01
        ; sum = sum + i
        add r8, rcx
        jmp if0
    if01:
        ; i % 5 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 5
        div r9
        test rdx, rdx
        jne if0
        ; sum = sum + i
        add r8, rcx
        jmp if0
        ; }
    if0:
        inc rcx
        cmp rcx, 1000
        jl for0
        ; }

        ; printf("%d", sum)
        lea rdi, [rel fmt]
        mov rsi, r8
        mov rax, 0
        call printf

        ; sys_exit(0)
        mov rdi, 0
        mov rax, 60
        syscall
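For cross-checking the assembly's output, a plain C model of the same loop can be handy. This is only a sketch: the function name `sum_multiples_bruteforce` is made up for illustration, and the `%` operations are left to the compiler rather than mirroring the `div` instructions one-to-one.

```c
#include <stdint.h>

/* Reference model of the assembly loop (hypothetical helper name):
 * sum every i in [0, limit) that is divisible by 3 or by 5. */
uint64_t sum_multiples_bruteforce(uint64_t limit)
{
    uint64_t sum = 0;
    for (uint64_t i = 0; i < limit; i++) {
        if (i % 3 == 0 || i % 5 == 0) {
            sum += i;   /* same as "add r8, rcx" in the assembly */
        }
    }
    return sum;
}
```

Calling `sum_multiples_bruteforce(1000)` gives the value the assembly program is expected to print.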

I also asked about Project Euler 1 in x86_64 (but AT&T rather than Intel syntax) recently, maybe some of those comments will help you: codereview.stackexchange.com/q/245990/195637
– Jonathan Lam, Commented Dec 21, 2020 at 21:17

Edward · Accepted Answer · 2020-12-20 17:07:30Z

Here are some things that may help you improve your code. The other review made some good points, but here are some not covered there.

Decide whether you're using stdlib or not

The Makefile and call to printf both indicate that you're using the standard C library, which is fine, but then the program terminates using a syscall which is not. The reason is that the standard C startup sets things up before main is called and then also tears them down again after main returns. This code is skipping the teardown by instead using the syscall to end the program, which is not good practice. There are two alternatives: either don't use the C library at all (that is, write your own printing routine) or let the teardown actually happen:

    xor eax, eax    ; set exit code to 0 to indicate success
    ret             ; return to __libc_start_main which called our main

For further reading on how the startup and teardown works in Linux read this.

Manage registers carefully

One of the things that expert assembly language programmers (and good compilers) do is manage register usage carefully. In this case, the ultimate use of the sum is to print it, and to print it we need the value in the rsi register. So why not use rsi instead of r8 as the running sum?

Know how to efficiently zero a register

Obviously, if we write mov r8, 0 it has the desired effect of loading the value 0 into the r8 register, and as the other review notes, there are better ways to do that, but let's look more deeply. The code currently does this:

        ; sum = 0
        mov r8, 0
        ; for i in [0, 1000) {
        mov rcx, 0

That works, but let's look at the listing file to see what NASM has turned that into:

    13                              ; sum = 0
    14 00000000 41B800000000        mov r8, 0
    15                              ; for i in [0, 1000) {
    16 00000006 B900000000          mov rcx, 0

The first column is just the line number of the listing file, the second is the address and the third is the encoded instruction. So we see that the two instructions use 11 bytes. We can do better! The other review correctly mentioned the xor instruction, so let's try it:

    19 00000000 4D31C0              xor r8, r8
    20 00000003 4831C9              xor rcx, rcx

Better, only six bytes. We can do better still. As one of the comments correctly noted, on a 64-bit x86 machine, if you xor the lower half of a rXX register, it also clears the upper half. So let's do that:

    19 00000000 4D31C0              xor r8, r8
    20 00000003 31C9                xor ecx, ecx

That saved one byte, but there is no e8 register. Can we do better by clearing ecx and then copying that value into r8?

    14 00000000 31C9                xor ecx, ecx
    20 00000002 4989C8              mov r8, rcx

No, we can't, unless we also follow the advice above and use rsi instead of r8:

    19 00000000 31C9                xor ecx, ecx
    20 00000002 31F6                xor esi, esi

Now we're down to four bytes, and we no longer need the mov rsi, r8 instruction which saves us another 3 bytes, for a net savings of 10 bytes with just those two things.

Avoid `div` if practical

The div instruction is one of the slowest instructions on the x86_64 architecture and can also cause an exception if we try to divide by zero. For both of those reasons, it's often better to avoid the instruction if we can. In this case, one way to avoid it is to note that it looks a lot like fizzbuzz and keep two counters: one that counts down from 5 and the other that counts down from 3.
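To make the two-counter idea concrete before writing it in assembly, here is a minimal C sketch. The function name and the counter names are invented for illustration; each counter is decremented once per iteration and reloaded when it reaches zero, which is the same shape as a `dec` / `jnz` pair, so no division is needed.

```c
#include <stdint.h>

/* Division-free sum of multiples of 3 or 5 in [0, limit), using two
 * down-counters (hypothetical names). Both start at 1 so that i = 0 is
 * treated as a multiple of 3 and of 5; it adds 0, so this is harmless. */
uint64_t sum_multiples_counters(uint64_t limit)
{
    uint64_t sum = 0;
    unsigned until3 = 1;   /* iterations left until the next multiple of 3 */
    unsigned until5 = 1;   /* iterations left until the next multiple of 5 */

    for (uint64_t i = 0; i < limit; i++) {
        int hit = 0;
        if (--until3 == 0) {   /* like "dec; jnz" in assembly */
            until3 = 3;        /* reload the counter */
            hit = 1;
        }
        if (--until5 == 0) {
            until5 = 5;
            hit = 1;
        }
        if (hit) {
            sum += i;          /* counted once even for multiples of 15 */
        }
    }
    return sum;
}
```

Unlike FizzBuzz, the two hits are merged into one flag so that a multiple of 15 is only added once.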

Use local labels where practical

It's clear that main needs to be a file global symbol, but for0 and if01 (both poor names, as has already been noted) do not need to be. In NASM, we can designate local labels by prefixing those labels with a single period so instead of for0 we could use .for0. The advantage to doing this is that we can reuse a label in another function without having to worry about collision.

Avoid unconditional jumps where practical

The x86 processor does its best to figure out which instruction will be executed next. It has all kinds of things to make that happen, including multi-level caching and branch prediction. It does that to try to make software run faster. You can help it by avoiding branching at all where practical, and especially by avoiding unconditional jumps. Thinking carefully about it, we can often do this by restructuring the code. Here's the original code:

        test rdx, rdx
        jne if01
        ; sum = sum + i
        add rsi, rcx
        jmp if0
    if01:
        ; i % 5 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 5
        div r9
        test rdx, rdx
        jne if0
        ; sum = sum + i
        add rsi, rcx
        jmp if0
        ; }
    if0:
        inc rcx
        cmp rcx, 1000
        jl for0

We can rewrite this like this:

        test rdx, rdx
        je .accumulate
        ; i % 5 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 5
        div r9
        test rdx, rdx
        jne .next
    .accumulate:
        ; sum = sum + i
        add rsi, rcx
        ; }
    .next:
        inc rcx
        cmp rcx, 1000
        jl .for0

Unconditional jumps aren't necessarily bad. In fact, they're more likely to be predicted properly than conditional jumps. But in this case, you are exactly right, the code should be restructured to simply fall through. In general, a good rule of thumb is that the expected case should be the fall-through case, and the unexpected case should be the one that takes a branch. When writing large swaths of assembly code, readability and performance concerns dovetail nicely: avoid as many branches as reasonably possible.
– Cody Gray, Commented Dec 20, 2020 at 17:17
I wonder if it might be worthwhile to use cmov instructions to avoid having a branch at all in that part of the code.
– Daniel Schepler, Commented Dec 21, 2020 at 19:14
Regarding "there is no e8 register": I thought that was what r8d would be. (However, in my test with GNU asm, xor %r8, %r8 and xor %r8d, %r8d ended up having the same length.)
– Daniel Schepler, Commented Dec 21, 2020 at 21:05
@DanielSchepler: You're correct, the 32-bit version of r8 is r8d. It doesn't make xor-zeroing smaller because it still needs a REX prefix for the register number. But you should still use xor r8d, r8d because Silvermont doesn't recognize xor r8,r8 as a "zeroing idiom" independent of the old value, but 32-bit operand-size xor (which all compilers always use) is optimal on every CPU that treats any kind of zeroing idiom specially. See "What is the best way to set a register to zero in x86 assembly: xor, mov or and?"
– Peter Cordes, Commented Dec 21, 2020 at 23:18

vnp · Accepted Answer · 2020-12-20 01:42:18Z


if01 and if0 are not the greatest names.
Instead of reloading r9, use two registers. Let r9 always contain 3, and r10 always contain 5.
Increment r8 in one place.
Running the loop downwards (1000 to 0), rather than upwards, spares an instruction (cmp).
mov rdx, 0 is encoded in 7 bytes. xor rdx, rdx is way shorter.

All that said, consider

    main:
        mov r8, 0
        mov r9, 3
        mov r10, 5
        ; for i in (1000, 0]
        mov rcx, 999

    for0:
        mov rax, rcx
        xor rdx, rdx
        div r9
        test rdx, rdx
        jeq accumulate

        mov rax, rcx
        xor rdx, rdx
        div r10
        test rdx, rdx
        jne next

    accumulate:
        add r8, rcx
    next:
        dec rcx
        jne for0

PS: I hope you know that this problem has a very straightforward arithmetical solution.
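For reference, one common form of that arithmetical solution is inclusion-exclusion over triangular numbers: add the multiples of 3 and of 5, then subtract the multiples of 15 that were counted twice. The C sketch below is an illustration under that assumption, with hypothetical function names; it is not necessarily the exact formula the answer had in mind.

```c
#include <stdint.h>

/* Sum of the multiples of k in [0, limit): k * (1 + 2 + ... + m),
 * where m = (limit - 1) / k is the largest multiplier below limit. */
static uint64_t sum_of_multiples_of(uint64_t k, uint64_t limit)
{
    if (limit == 0) {
        return 0;
    }
    uint64_t m = (limit - 1) / k;
    return k * (m * (m + 1) / 2);   /* m * (m + 1) is always even */
}

/* Inclusion-exclusion: multiples of 15 appear in both the 3-series and
 * the 5-series, so subtract them once. */
uint64_t sum_multiples_closed_form(uint64_t limit)
{
    return sum_of_multiples_of(3, limit)
         + sum_of_multiples_of(5, limit)
         - sum_of_multiples_of(15, limit);
}
```

For limit = 1000 the three terms are 166833, 99500 and 33165, so no loop over the numbers is needed at all.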

vnp, answered Dec 20, 2020 at 1:14, edited Dec 20, 2020 at 1:42

What assembler uses the mnemonic jeq? Or is that just a typo? Also, you do not need to use the full 64-bit registers when clearing or loading small immediates. The upper 32 bits are automatically cleared, so you can just do, e.g., xor edx, edx. And there's never a reason to do mov reg, 0.
– Cody Gray, Commented Dec 20, 2020 at 12:54
@CodyGray if you wanted to align the next instruction, I suppose mov reg32,0 could be (if it has a suitable size) a better choice than xor reg32,reg32 followed by nop dword [rax]. Am I wrong?
– Ruslan, Commented Dec 20, 2020 at 16:47
@CodyGray: Compilers automatically align functions. In hand-written assembly, you should use align 16 before labels if you want to match that behaviour. If I wanted a longer zeroing instruction, I might use some dummy segment-override prefixes and a REX.W=0 before a 32-bit xor-zeroing instruction. Segment prefixes (unlike REP) can apply to some forms of XOR, so it's highly unlikely that future CPUs will use that prefix+opcode combo as some other instruction.
– Peter Cordes, Commented Dec 20, 2020 at 19:22
If I didn't want to use extra prefixes on xor-zeroing, mov reg, 0 is really not bad unless you need to avoid partial-register penalties on P6-family (e.g. before setcc); I think I mentioned that in my canonical answer on "What is the best way to set a register to zero in x86 assembly: xor, mov or and?". Modern CPUs have enough back-end ports vs. their front-end width that having xor-zeroing eliminated is usually not significant for throughput. And only Intel actually eliminates it in the front-end; AMD Zen only eliminates mov reg,reg. (@Ruslan)
– Peter Cordes, Commented Dec 20, 2020 at 19:24
Besides xor-zeroing, we know the remainder of anything divided by 3 or 5 will fit in a 32-bit register, so we can similarly save a REX prefix with test edx, edx. (And with mov r9d, 3, although NASM will do that optimization for you because it's exactly equivalent. Other assemblers won't, and understanding x86-64's implicit zero-extending is important for understanding some compiler-generated code because they always optimize mov-immediate; unfortunately not always other insns.)
– Peter Cordes, Commented Dec 21, 2020 at 6:53

Peter Cordes · Accepted Answer · 2020-12-21 23:24:30Z

A few quick notes on your implementation choices, and how I'd approach it:

You don't need 64-bit operand-size for div when your numbers only go up to 1000. div r64 is significantly slower than div r32 on Intel before Ice Lake; I explained the details in another Code Review answer: Checking if a number is prime in NASM Win64 Assembly.

(And in general for other instructions, test edx, edx would save code size there. Even with 64-bit numbers and 64-bit div, i % 5 will always fit in 32 bits so it's safe to ignore the high 32. See The advantages of using 32bit registers/instructions in x86-64 - it's the default operand-size for x86-64, not needing any machine-code prefixes. For efficiency, use it unless you actually need 64-bit operand-size for that specific instruction, and implicit zero-extension to 64-bit won't do what you need. Don't spend extra instructions, though; 64-bit operand-size is often needed, e.g. for pointer increments.)

Of course, for division by compile-time constants, div is a slow option that compilers avoid entirely, instead using a fixed-point multiplicative inverse. Like in Why does GCC use multiplication by a strange number in implementing integer division? on SO, or this code review.
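As a rough illustration of what such a fixed-point multiplicative inverse looks like, here is a C sketch for unsigned 32-bit division by 3 and by 5. The constants 0xAAAAAAAB (with a shift of 33) and 0xCCCCCCCD (with a shift of 34) are the usual reciprocals for these two divisors; the function names are hypothetical, and a real compiler may pick a different but equivalent instruction sequence.

```c
#include <stdint.h>

/* n / 3 for any 32-bit unsigned n: multiply by ceil(2^33 / 3) = 0xAAAAAAAB
 * in 64-bit arithmetic, then shift the product right by 33. */
uint32_t div3_by_multiply(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xAAAAAAABu) >> 33);
}

/* n / 5: same idea with ceil(2^34 / 5) = 0xCCCCCCCD and a shift of 34. */
uint32_t div5_by_multiply(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
}

/* The remainder is recovered as n - divisor * quotient, so the test
 * "n % 3 == 0 or n % 5 == 0" needs no div instruction. */
int is_multiple_of_3_or_5(uint32_t n)
{
    uint32_t rem3 = n - 3u * div3_by_multiply(n);
    uint32_t rem5 = n - 5u * div5_by_multiply(n);
    return rem3 == 0 || rem5 == 0;
}
```

In assembly this corresponds to a widening multiply followed by a shift, which is why compiler output for `i % 3` contains a "strange number" instead of `div`.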

Also, you don't need to divide at all if you use down-counters that you reset to 3 or 5 when they hit 0 (and/or unrolling) to handle the 3, 5 pattern, like FizzBuzz - see this Stack Overflow answer where I wrote a large tutorial about such techniques, which I won't repeat here. Unlike FizzBuzz, you only want to count a number once even if it's a multiple of both 3 and 5.

You could just unroll by 15 (so the pattern fully repeats) and hard-code something like

    .unroll15_loop:
                                        ; lets say ECX=60 for example
        add   eax, ecx                  ; += 60
        lea   eax, [rax + rcx + 3]      ; += 63
        lea   eax, [rax + rcx + 5]      ; += 65
        lea   eax, [rax + rcx + 6]      ; += 66
        ...
        add   ecx, 15
        cmp   ecx, 1000-15
        jbe   .unroll15_loop
        ; handle the last not full group of 15 numbers

Or apply some math and instead of actually looking at every number, use a closed-form formula for the sum of the multiples of 3 and 5 in a 15-number range, offset by i * nmuls where i is the start of your range, and nmuls is the number of multiples.

e.g. in the [60, 75) range, we have 60, 63, 65, 66, 69, 70, 72. So that's 7 of the 15 numbers. So it's like [0, 15) but + 7*60. Either do the 0..14 part by hand, or with a loop and remember the result. (Project Euler is about math as much as programming; it's up to you how much math you want to do vs. how much brute force you want your program to do.)

7 is not one of the scale-factors that x86 addressing modes support, but 8 is, and 7*i = 8*i - i, so we can even do

    lea eax, [rax + rcx*8 + 0 + 3 + 5 + 6 + 9 + 10 + 12]
    sub eax, ecx

(3+5+6+... is a constant expression so the assembler can do it for you at assemble time, producing a [reg + reg*scale + disp8] addressing mode. Unfortunately that 3-component LEA has 3-cycle latency on Intel CPUs, and that loop-carried dependency will be the bottleneck for the loop. So it would actually be more efficient to use a separate add instruction.)
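A C model of the block-of-15 idea may help when checking the constants. Counting the values listed for [60, 75) gives seven multiples per block, and their offsets 0, 3, 5, 6, 9, 10, 12 sum to 45, so each full block starting at i contributes 7*i + 45. The function name is made up, and the cleanup loop is just the plain brute-force test; this is a sketch, not a tuned implementation.

```c
#include <stdint.h>

/* Sum of multiples of 3 or 5 in [0, limit), handling 15 numbers at a time. */
uint64_t sum_multiples_by_blocks(uint64_t limit)
{
    uint64_t sum = 0;
    uint64_t i = 0;

    /* Full blocks [i, i + 15): seven qualifying numbers at offsets
     * 0, 3, 5, 6, 9, 10, 12, which add up to 7 * i + 45. */
    for (; i + 15 <= limit; i += 15) {
        sum += 7 * i + 45;
    }

    /* Cleanup: the last, incomplete group of fewer than 15 numbers. */
    for (; i < limit; i++) {
        if (i % 3 == 0 || i % 5 == 0) {
            sum += i;
        }
    }
    return sum;
}
```

For limit = 1000 the first loop covers 66 blocks (0 through 989) and the cleanup loop handles 990 through 999.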

And of course we've reduced this to basically a sum of a linearly increasing series, and could apply Gauss's formula (n * (n+1) / 2) for a closed form over the whole-interval range, just having to handle the cleanup of n%15 for the numbers approaching n. BTW, clang knows how to turn a simple for loop doing sum += i; into the closed form, arranging it to avoid overflow of the temporary before dividing by 2 (a right shift). Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” uses that as an example. See also https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output

The remainder part could be implemented based on a simple table lookup. Especially if you're doing a count down to 0; but even if you're doing a count up to the limit, you could have the table with entries for the number of multiples of the base and the constant term to use.
– Daniel Schepler, Commented Dec 21, 2020 at 22:32
@DanielSchepler: Oh, the 0..14 elements of cleanup? Yeah, might as well just look up the answer instead of a jump table into a series of lea and NOP instructions or something. Although if you make every LEA the same length, you can do a computed jump (like end_of_block - remaining * 4) instead of loading from a table of pointers, so no static data is involved. That might be fun, but yeah the efficient option would be a table.
– Peter Cordes, Commented Dec 21, 2020 at 22:47
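One way to read the table-lookup suggestion, sketched in C under my own assumptions about the table layout: sum the full blocks of 15 with Gauss's formula, then look up how many multiples the leftover limit % 15 numbers contain and what their offsets add up to. The table and function names are hypothetical.

```c
#include <stdint.h>

/* For r = limit % 15: the number of qualifying offsets in [0, r) and the
 * sum of those offsets. The qualifying offsets are 0, 3, 5, 6, 9, 10, 12. */
static const uint8_t tail_count[15] = {0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7, 7};
static const uint8_t tail_sum[15]   = {0, 0, 0, 0, 3, 3, 8, 14, 14, 14, 23, 33, 33, 45, 45};

uint64_t sum_multiples_table(uint64_t limit)
{
    uint64_t q = limit / 15;   /* number of full blocks */
    uint64_t r = limit % 15;   /* leftover numbers after the last full block */

    /* Block k starts at 15*k and contributes 7 * 15 * k + 45, so the full
     * blocks add up to 105 * (0 + 1 + ... + (q - 1)) + 45 * q. */
    uint64_t full = 105 * (q * (q - 1) / 2) + 45 * q;

    /* The leftover range starts at 15*q: each hit adds that base plus its offset. */
    uint64_t tail = (uint64_t)tail_count[r] * (15 * q) + tail_sum[r];

    return full + tail;
}
```

With q = 0 the product q * (q - 1) is still 0 in unsigned arithmetic, so small limits need no special case.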

Daniel Schepler · Accepted Answer · 2020-12-21 23:31:09Z

Use conditional move instructions where appropriate

To extend the discussion in the answer by @Edward: if you can use conditional move instructions, that will further reduce the amount of branching and thus help the processor.

If you combine with the suggestion to maintain modulo 3 and modulo 5 counters instead of doing division, then an outline of the main loop body could look like this (untested, though):

    %define mod3_reg r8
    %define mod5_reg r9
    %define zero_reg r10
    %define count_reg rcx
    %define accum_reg rsi
    %define addend_reg rdi
    %define limit 1000

    ...

    mainloop:
        xor addend_reg, addend_reg
        inc mod3_reg
        cmp mod3_reg, 3
        cmove addend_reg, count_reg
        cmove mod3_reg, zero_reg
        inc mod5_reg
        cmp mod5_reg, 5
        cmove addend_reg, count_reg
        cmove mod5_reg, zero_reg
        add accum_reg, addend_reg

        inc count_reg
        cmp count_reg, limit
        jl mainloop

(Note that in order to match an initial value of 0 for the counter, you would need to initialize mod3_reg to 2 and mod5_reg to 4. If you adjust to start with 1, on the other hand, you could initialize both to 0 which would be a bit simpler.)

Do also note that according to some comments by @PeterCordes, there may be issues with cmov creating enough extra dependencies in the loop that it might not actually turn out to be worth it. This would be a case where, if you cared a lot about performance, running a benchmark on your target machine would be important.
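The same branch-free structure can be modelled in C to check the logic before benchmarking. This sketch keeps i % 3 and i % 5 in two up-counters that are updated after the test, so everything starts at zero, and it selects the addend with a mask instead of a branch. Whether a compiler turns the selections into cmov is not guaranteed, and the function name is invented for illustration.

```c
#include <stdint.h>

/* Branch-free loop body: the only branch left is the loop condition. */
uint64_t sum_multiples_branchless(uint64_t limit)
{
    uint64_t sum = 0;
    uint64_t mod3 = 0;   /* always equal to i % 3 */
    uint64_t mod5 = 0;   /* always equal to i % 5 */

    for (uint64_t i = 0; i < limit; i++) {
        /* hit is 1 when i is a multiple of 3 or 5, else 0. */
        uint64_t hit = (uint64_t)((mod3 == 0) | (mod5 == 0));

        /* 0 - hit is all ones when hit == 1 and zero otherwise, so the
         * mask keeps i or replaces it with 0. */
        sum += i & (0 - hit);

        /* Advance the counters, wrapping at 3 and 5. */
        mod3 = (mod3 == 2) ? 0 : mod3 + 1;
        mod5 = (mod5 == 4) ? 0 : mod5 + 1;
    }
    return sum;
}
```

Comparing its result with a plain brute-force loop is a cheap sanity check before timing anything.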

Instead of inc, cmp it would be simpler and shorter to preload with 3 and 5 and count down with dec. You could then directly use cmove to reload.
– Edward, Commented Dec 21, 2020 at 20:23
Sure, that's definitely a possibility. Then, you could take advantage of the fact that dec preloads flags for you and avoid the test or cmp. In the end, I ended up doing it this way for a slight improvement in readability - with the count down, it's a bit more complex to describe what the registers represent.
– Daniel Schepler, Commented Dec 21, 2020 at 20:41
If you're going for simplicity not performance, you can increment your mod3 and mod5 regs after their cmp/cmov (next to count_reg), so everything can start at zero. Or with down-counters, they'd just start at 3 and 5 for counter=0, same as their reset values.
– Peter Cordes, Commented Dec 21, 2020 at 22:58
However, this branchless strategy is probably only good for very low limits; most modern branch predictors will learn the pattern quickly. But you're always paying a high cost every time through the loop for front-end throughput (12 uops assuming single-uop CMOVE (Intel Broadwell and later, or AMD since forever) and macro-fusion of cmp/jl), so on a 4-wide pipeline like Intel before Ice Lake, that takes 3 cycles per iteration to issue. With down-counters, that could get down to 10 uops, which Zen can issue in 2 cycles.
– Peter Cordes, Commented Dec 21, 2020 at 23:04
@PeterCordes OK, I've added a section at the tail acknowledging that benchmarking would be important if you want to see whether the change to cmov actually helps or hurts overall performance.
– Daniel Schepler, Commented Dec 21, 2020 at 23:32
