Tweeted twitter.com/StackSignals/status/1604219899605286916
Became Hot Network Question
correct mistake
Zitrax

I am new to DSP and I am trying to understand how to work with fixed point operations and in particular the Q31 format.

In floating point (full range) I am doing some multiplications, for example 6.0 * 10.0 = 60.0. Converting this to Q31 would require first scaling the values down to [-1,1) and then converting them to int32_t (q31).

However, when doing the same multiplication in Q31 we get a new Q31 result (after bit shifting properly) that can be taken back to a float in the range [-1,1). But if I then scale that back up by reversing the original downscaling, I do not end up with 60.

In summary:

  • Downscale full range float to float in the range [-1,1) to be able to be represented in Q31
  • Convert float in the proper range to Q31
  • Perform fixed point multiplication (and possibly other unknown operations)
  • Convert Q31 back to float
  • Scale this float back up to the original range?
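The steps above can be sketched in C roughly as follows. This is only a minimal illustration: the function names are hypothetical (not taken from any particular DSP library), the downscale assumes min = 0 so it reduces to a division by an assumed full-scale value `limit`, and the float-to-Q31 conversion truncates rather than rounds.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Map a value in (-limit, limit) into (-1, 1); assumes min = 0. */
double downscale(double value, double limit) { return value / limit; }

/* Reverse of downscale(): multiply by the same full-scale value. */
double upscale(double value, double limit) { return value * limit; }

/* Convert a value in [-1, 1) to Q31, truncating toward zero.
 * Values outside the representable range are saturated. */
int32_t float_to_q31(double x)
{
    double scaled = x * 2147483648.0; /* 2^31 */
    if (scaled >= 2147483647.0) return INT32_MAX;
    if (scaled <= -2147483648.0) return INT32_MIN;
    return (int32_t)scaled;
}

/* Convert Q31 back to a value in [-1, 1). */
double q31_to_float(int32_t q) { return (double)q / 2147483648.0; }

/* Q31 * Q31 gives a Q62 value in 64 bits; shifting right by 31 returns
 * to Q31. Right-shifting a negative value is implementation-defined in
 * C, although common compilers use an arithmetic shift. */
int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t shifted = ((int64_t)a * (int64_t)b) >> 31;
    if (shifted > INT32_MAX) shifted = INT32_MAX; /* only for -1 * -1 */
    return (int32_t)shifted;
}
```

For instance, `q31_mul(float_to_q31(0.5), float_to_q31(0.5))` converted back with `q31_to_float` gives 0.25.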

Is it possible to scale back to the original range somehow as in my last point - or is it a one-way conversion?

An example:

  • Downscale: Input floats in an original range [0, 5000] are scaled down to [-1,1). I tried the scaling function sign(val) * (abs(val) - min)/(max-min), where min is 0 and max is 5000. This takes the values 6 and 10 to 0.0012 and 0.002.
  • FloatToQ31: 0.0012 -> 2576980, 0.002 -> 4294967
  • Multiply in Q31: 2576980*4294967 >> 31 = 5152
  • Q31ToFloat: 5152 -> 2.39909e-06
  • Upscale: Reversed downscale function val * (max - min) + min (again min=0, max=5000). This takes the value back to 0.0119954, which is not close to 60.
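The example above can be reproduced with a short C sketch. The struct and function names are hypothetical, min is assumed to be 0, the inputs are assumed to be smaller in magnitude than `scale`, and the multiply is a plain 64-bit product shifted right by 31. With that multiply the Q31 product comes out as 5153 rather than the 5152 quoted above; the one-LSB difference presumably comes from how the multiply actually used shifts or rounds, which is not stated here.

```c
#include <stdint.h>

/* Results of each step, collected so they can be inspected or printed. */
typedef struct {
    int32_t q_a;         /* first input in Q31 */
    int32_t q_b;         /* second input in Q31 */
    int32_t q_product;   /* Q31 product */
    double product;      /* Q31 product converted back to [-1, 1) */
    double scaled_once;  /* product multiplied by scale once */
    double scaled_twice; /* product multiplied by scale twice */
} q31_walkthrough;

/* Hypothetical helper: runs downscale, float->Q31, multiply, Q31->float
 * and upscale for two inputs, assuming |a| < scale and |b| < scale. */
q31_walkthrough q31_walkthrough_run(double a, double b, double scale)
{
    const double q31_one = 2147483648.0; /* 2^31 */
    q31_walkthrough r;

    /* Downscale (min = 0) and convert to Q31 by truncation. */
    r.q_a = (int32_t)((a / scale) * q31_one);
    r.q_b = (int32_t)((b / scale) * q31_one);

    /* Q31 * Q31 -> Q62, then shift back to Q31. */
    r.q_product = (int32_t)(((int64_t)r.q_a * (int64_t)r.q_b) >> 31);

    /* Back to float, then undo the scaling once and twice. */
    r.product = (double)r.q_product / q31_one;
    r.scaled_once = r.product * scale;
    r.scaled_twice = r.scaled_once * scale;
    return r;
}
```

Calling `q31_walkthrough_run(6.0, 10.0, 5000.0)` gives `scaled_once` of about 0.012, matching the result above, while `scaled_twice` is roughly 59.99. This is consistent with simple arithmetic: each input was divided by 5000, so the product (6/5000) * (10/5000) carries a factor of 1/5000², and a single multiplication by 5000 only removes one of the two factors.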

Fixed point scaling; float -> Q31 -> float
