Here are some things that may help you improve your code. The other review made some good points, but here are some not covered there.
Decide whether you're using stdlib or not
The Makefile and the call to printf both indicate that you're using the standard C library, which is fine, but then the program terminates using a raw syscall, which bypasses the library. This matters because the standard C startup code sets things up before main is called and tears them down again after main returns. This code skips the teardown by using the syscall to end the program, which is not good practice; for example, any output still sitting in a stdio buffer (as can happen when stdout is redirected to a file or a pipe) is never flushed. There are two alternatives: either don't use the C library at all (that is, write your own printing routine) or let the teardown actually happen:
        xor eax, eax    ; set exit code to 0 to indicate success
        ret             ; return to __libc_start_main which called our main
For further reading on how the startup and teardown work in Linux, read this.
Manage registers carefully
One of the things that expert assembly language programmers (and good compilers) do is manage register usage carefully. In this case, the ultimate use of the sum is to print it, and to print it we need the value in the rsi register (the second argument to printf). So why not use rsi instead of r8 as the running sum?
Know how to efficiently zero a register
Obviously, if we write mov r8, 0 it has the desired effect of loading the value 0 into the r8 register, and as the other review notes, there are better ways to do that, but let's look more deeply. The code currently does this:
        ; sum = 0
        mov r8, 0
        ; for i in [0, 1000) {
        mov rcx, 0
That works, but let's look at the listing file to see what NASM has turned that into:
    13                                  ; sum = 0
    14 00000000 41B800000000            mov r8, 0
    15                                  ; for i in [0, 1000) {
    16 00000006 B900000000              mov rcx, 0
The first column is just the line number of the listing file, the second is the address and the third is the encoded instruction. So we see that the two instructions use 11 bytes. We can do better! The other review correctly mentioned the xor instruction, so let's try it:
    19 00000000 4D31C0                  xor r8, r8
    20 00000003 4831C9                  xor rcx, rcx
Better, only six bytes. We can do better still. As one of the comments correctly noted, on a 64-bit x86 machine, writing to the 32-bit lower half of an rXX register also clears the upper half. So let's do that:
    19 00000000 4D31C0                  xor r8, r8
    20 00000003 31C9                    xor ecx, ecx
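That saved one byte. There is no e8 register; the 32-bit half of r8 is named r8d, but xor r8d, r8d still needs a REX prefix, so it is still three bytes long. Can we do better by clearing ecx and then copying that value into r8?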
    14 00000000 31C9                    xor ecx, ecx
    20 00000002 4989C8                  mov r8, rcx
No, we can't, unless we also follow the advice above and use rsi instead of r8:
    19 00000000 31C9                    xor ecx, ecx
    20 00000002 31F6                    xor esi, esi
Now we're down to four bytes, and we no longer need the mov rsi, r8 instruction, which saves us another 3 bytes, for a net savings of 10 bytes with just those two changes.
Avoid div if practical
The div instruction is one of the slowest instructions on the x86_64 architecture and can also cause an exception if we try to divide by zero. For both of those reasons, it's often better to avoid the instruction if we can. In this case, one way to avoid it is to note that it looks a lot like fizzbuzz and keep two counters: one that counts down from 5 and the other that counts down from 3.
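As a rough illustration of the two-counter idea, here is a minimal sketch of a complete program. It assumes the task is summing the multiples of 3 or 5 below 1000 (as the comments in the original code suggest), Linux x86-64 with the System V calling convention, and linking through gcc. The label names, the fmt string, and the choice of edx and edi as the two counters are my own placeholders, not taken from the original code.

```nasm
; Sketch only: sum the multiples of 3 or 5 below 1000 without div.
        global  main
        extern  printf

        section .rodata
fmt:    db      "%d", 10, 0     ; hypothetical format string

        section .text
main:
        sub     rsp, 8          ; keep the stack 16-byte aligned for the call
        xor     esi, esi        ; sum = 0
        mov     ecx, 1          ; i = 1 (i = 0 would add nothing)
        mov     edx, 3          ; steps left until i is a multiple of 3
        mov     edi, 5          ; steps left until i is a multiple of 5
.for:
        xor     eax, eax        ; amount to add this iteration: 0 so far
        dec     edx
        jnz     .check5
        mov     edx, 3          ; i is a multiple of 3: reload the counter
        mov     eax, ecx        ; ... and plan to add i
.check5:
        dec     edi
        jnz     .add
        mov     edi, 5          ; i is a multiple of 5: reload the counter
        mov     eax, ecx        ; ... and plan to add i (once, even for 15)
.add:
        add     esi, eax        ; sum += i, or sum += 0
        inc     ecx
        cmp     ecx, 1000
        jl      .for

        lea     rdi, [rel fmt]  ; 1st argument: format string
        xor     eax, eax        ; no vector registers used by this varargs call
        call    printf wrt ..plt ; sum is already in esi, the 2nd argument
        add     rsp, 8
        xor     eax, eax        ; return 0 and let libc do the teardown
        ret
```

Assuming the file is saved as sum.asm (a made-up name), it should build with something like `nasm -f elf64 sum.asm && gcc sum.o -o sum` and print 233168. Both counters have to be decremented on every iteration, which is why this version always falls through to .check5 instead of jumping straight to the addition.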
Use local labels where practical
It's clear that main needs to be a file global symbol, but for0 and if01 (both poor names, as has already been noted) do not need to be. In NASM, we can designate local labels by prefixing those labels with a single period so instead of for0 we could use .for0. The advantage to doing this is that we can reuse a label in another function without having to worry about collision.
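To show what that reuse looks like, here is a small sketch with two made-up functions, count_down_a and count_down_b, that each define their own .loop label. Neither function is from the original code; they exist only to show that the two labels do not collide.

```nasm
; Sketch only: two hypothetical functions that both use a local .loop label.
        global  count_down_a
        global  count_down_b

        section .text
count_down_a:
        mov     ecx, 10
.loop:                          ; full name is count_down_a.loop
        dec     ecx
        jnz     .loop           ; refers to count_down_a.loop
        ret

count_down_b:
        mov     ecx, 20
.loop:                          ; full name is count_down_b.loop
        dec     ecx
        jnz     .loop           ; refers to count_down_b.loop
        ret
```

NASM attaches each local label to the most recent non-local label above it, so each jnz resolves to the .loop inside its own function.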
Avoid unconditional jumps where practical
The x86 processor does its best to figure out which instruction will be executed next. It has all kinds of things to make that happen, including multi-level caching and branch prediction. It does that to try to make software run faster. You can help it by avoiding branching at all where practical, and especially by avoiding unconditional jumps. Thinking carefully about it, we can often do this by restructuring the code. Here's the original code:
        test rdx, rdx
        jne if01
        ; sum = sum + i
        add rsi, rcx
        jmp if0
    if01:
        ; i % 5 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 5
        div r9
        test rdx, rdx
        jne if0
        ; sum = sum + i
        add rsi, rcx
        jmp if0
        ; }
    if0:
        inc rcx
        cmp rcx, 1000
        jl for0
We can rewrite this like this:
        test rdx, rdx
        je .accumulate
        ; i % 5 == 0
        mov rax, rcx
        mov rdx, 0
        mov r9, 5
        div r9
        test rdx, rdx
        jne .next
    .accumulate:
        ; sum = sum + i
        add rsi, rcx
        ; }
    .next:
        inc rcx
        cmp rcx, 1000
        jl .for0