5

Hi I'm implementing some fixed point math stuff for embedded systems and I'm trying to do the multiplication of two 16.16 fixed point numbers without creating a 64bit temporary. So far here is the code I came up with that generates the least instructions.

int multiply(int x, int y){ int result; long long temp = x; temp *= y; temp >>= 16; result = temp; return result; } 

the problem with this code is that it uses a temporary 64 bit integer which seem to generate bad assembly code. I'm trying to make a system that uses two 32 bit integers instead of a 64 bit one. Anyone know how to do this?

7
  • 4
    Did you see a 64-bit temporary in the disassembly dump of the output, or just in your head? Just because the source code has a variable of type long long does not mean there's actually any "64-bit temporary". On any decent 32-bit arch, a 32x32 multiply automatically generates a 64-bit result, usually separated into two 32-bit registers for your (or the compiler's) convenience. Manipulating these is much more efficient than breaking it down into 4 16x16 multiplies to avoid the "64-bit temporary". Commented Feb 27, 2013 at 23:03
  • @R.. yeah I looked into the assembler dump. it uses two halves in the EAX and EDX register but it's implementation is not optimal Commented Feb 28, 2013 at 0:31
  • It's surely a lot less suboptimal than performing four MULs... Commented Feb 28, 2013 at 0:45
  • 1
    @R.. the idea is that now that I know how to do it in C I can implement it in hand optimized assembly Commented Feb 28, 2013 at 0:56
  • 1
    most modern x86 compilers will automatically use the upper 32 bits of the result if you specify the optimization level high enough Commented Mar 23, 2014 at 10:39

1 Answer 1

7

Think of your numbers as each composed of two large "digits."

 A.B x C.D 

The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.

So, the product is

D*B + D*A*65536 + C*B*65536 + C*A*65536*65536 

However, to get the product shifted right by 16, you need to divide all these terms by 65536, so

D*B/65536 + D*A + C*B + C*A*65536 

In C:

uint16_t a = x >> 16; uint16_t b = x & 0xffff; uint16_t c = y >> 16; uint16_t d = y & 0xffff; return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16); 

The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).

Sign up to request clarification or add additional context in comments.

3 Comments

Perfect exactly what I was looking for.
And this is going to be a lot slower than doing it the way you originally tried.
Well, the speed depends on the capabilities of the embedded processor (note that I answered the question before the EAX register was mentioned!). On a 16-bit processor, my approach will be slightly faster because a 32x32->64-bit multiply will also have 4 16-bit multiplies, but also must do 64-bit additions, whereas my approach only needs 32-bit additions (and no actual shifts since on a 16-bit processor the 16-bit words of the product will be individual accessible).

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.