qwr

With the usual caveats about profiling if this really matters (it won't unless you're doing this A LOT):

As @phuclv says, on modern processors the quotient and remainder are calculated in one instruction. All of these assume unsigned numbers and don't worry about overflow. With x86-64 GCC -O3,

unsigned int f(unsigned int x, unsigned int y) { return x / y + (x % y != 0); } 

produces

mov eax, edi
xor edx, edx    # zero edx
div esi         # divides edx:eax (x) by esi (y)
                # eax = quotient, edx = remainder
cmp edx, 1      # set CF = (edx - 1 < 0), i.e. edx == 0
sbb eax, -1     # eax -= CF - 1, i.e. eax += 1 - CF, no branch
ret

https://godbolt.org/z/4Gsn3Kj5s

unsigned int f(unsigned int x, unsigned int y) { return (x + y - 1) / y; } 

is clever and uses lea to do the addition and subtraction

lea eax, [rsi-1+rdi]
xor edx, edx
div esi
ret

https://godbolt.org/z/9sfsc1Wa5

For 64-bit inputs, the results are similar but with 64-bit registers instead.

I would guess LEA is faster than CMP/SBB, as LEA is a fast instruction, but I didn't benchmark anything.

In a deleted answer, @Matt suggests the remainder-increment version is faster, but his g++ compile command didn't include an optimization flag, which is suspect.
