1

Can anyone spot any way to improve the speed in the next Bilinear resizing Algorithm? I need to improve Speed as this is critical, keeping good image quality. Is expected to be used in mobile devices with low speed CPUs. The algorithm is used mainly for up-scale resizing. Any other faster Bilinear algorithm also would be appreciated. Thanks

void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight) { int a, b, c, d, x, y, index; float x_ratio = ((float)(sourceWidth - 1)) / targetWidth; float y_ratio = ((float)(sourceHeight - 1)) / targetHeight; float x_diff, y_diff, blue, red, green ; int offset = 0 ; for (int i = 0; i < targetHeight; i++) { for (int j = 0; j < targetWidth; j++) { x = (int)(x_ratio * j) ; y = (int)(y_ratio * i) ; x_diff = (x_ratio * j) - x ; y_diff = (y_ratio * i) - y ; index = (y * sourceWidth + x) ; a = input[index] ; b = input[index + 1] ; c = input[index + sourceWidth] ; d = input[index + sourceWidth + 1] ; // blue element blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) + (c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff); // green element green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) + ((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff); // red element red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) + ((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff); output [offset++] = 0x000000ff | // alpha ((((int)red) << 24)&0xff0000) | ((((int)green) << 16)&0xff00) | ((((int)blue) << 8)&0xff00); } } } 
2

5 Answers 5

3

Off the the top of my head:

  1. Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
  2. Make sure memory accesses are cache-optimized, i.e. clumped together.
  3. Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
  4. Investigate if signed/unsigned for integer operations have performance costs on your platform.
  5. Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).

And, of course, do lots of profiling and measurements.

Sign up to request clarification or add additional context in comments.

1 Comment

Actually, 3. is not generally applicable: the compiler will have to expand small values into the values the CPU can work with. So using int actually has its benefits.
3

In-Line Cache and Lookup Tables

Cache your computations in your algorithm.

  • Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))

    Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.

  • Use a lookup-table

    It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with less values to store for more lookup inputs.

    If you know the boundaries for you inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shiftings ((b>>16) and so forth...) that are most likely deterministic in your case).

Use the Right Data Types for Performance

If you can avoid double and float variables, use int. On most architectures, int would be test faster type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (ie use 1026 as int instead of 1.026 as double or float). It's quite likely that this trick would be enough for you.

Comments

0
 x = (int)(x_ratio * j) ; y = (int)(y_ratio * i) ; x_diff = (x_ratio * j) - x ; y_diff = (y_ratio * i) - y ; index = (y * sourceWidth + x) ; 

Could surely use some optimization: you were using x_ration * j-1 just a few cycles earlier, so all you really need here is x+=x_ratio

Comments

0

My random guess (use a profiler instead of letting people guess!):

The compiler has to generate that works when input and output overlap which means it has to do generate loads of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.

You could also try using a=b; and c=d; instead of loading them again.

Comments

0

here is my version, steal some ideas. My C-fu is quite weak, so some lines are pseudocodes, but you can fix them.

void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight ) { // Let's create some lookup tables! // you can move them into 2-dimensional arrays to // group together values used at the same time to help processor cache int sx[0..targetWidth ]; // target->source X lookup int sy[0..targetHeight]; // target->source Y lookup int mx[0..targetWidth ]; // left pixel's multiplier int my[0..targetHeight]; // bottom pixel's multiplier // we don't have to calc indexes every time, find out when bool reloadPixels[0..targetWidth ]; bool shiftPixels[0..targetWidth ]; int shiftReloadPixels[0..targetWidth ]; // can be combined if necessary int v; // temporary value for (int j = 0; j < targetWidth; j++){ // (8bit + targetBits + sourceBits) should be < max int v = 256 * j * (sourceWidth-1) / (targetWidth-1); sx[j] = v / 256; mx[j] = v % 256; reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0) : 1; // always load first pixel // if no reload -> then no shift too shiftPixels[j] = j ? ( sx[j-1]+1 = sx[j] ? 2 : 0) : 0; // nothing to shift at first pixel shiftReloadPixels[j] = reloadPixels[i] | shiftPixels[j]; } for (int i = 0; i < targetHeight; i++){ v = 256 * i * (sourceHeight-1) / (targetHeight-1); sy[i] = v / 256; my[i] = v % 256; } int shiftReload; int srcIndex; int srcRowIndex; int offset = 0; int lm, rm, tm, bm; // left / right / top / bottom multipliers int a, b, c, d; for (int i = 0; i < targetHeight; i++){ srcRowIndex = sy[ i ] * sourceWidth; tm = my[i]; bm = 255 - tm; for (int j = 0; j < targetWidth; j++){ // too much ifs can be too slow, measure. // always true for first pixel in a row if( shiftReload = shiftReloadPixels[ j ] ){ srcIndex = srcRowIndex + sx[j]; if( shiftReload & 2 ){ a = b; c = d; }else{ a = input[ srcIndex ]; c = input[ srcIndex + sourceWidth ]; } b = input[ srcIndex + 1 ]; d = input[ srcIndex + 1 + sourceWidth ]; } lm = mx[j]; rm = 255 - lm; // WTF? // Input AA RR GG BB // Output RR GG BB AA if( j ){ leftOutput = rightOutput ^ 0xFFFFFF00; }else{ leftOutput = // blue element ((( ( (a&0xFF)*tm + (c&0xFF)*bm )*lm ) & 0xFF0000 ) >> 8) // green element | ((( ( ((a>>8)&0xFF)*tm + ((c>>8)&0xFF)*bm )*lm ) & 0xFF0000 )) // no need to shift // red element | ((( ( ((a>>16)&0xFF)*tm + ((c>>16)&0xFF)*bm )*lm ) & 0xFF0000 ) << 8 ) ; } rightOutput = // blue element ((( ( (b&0xFF)*tm + (d&0xFF)*bm )*lm ) & 0xFF0000 ) >> 8) // green element | ((( ( ((b>>8)&0xFF)*tm + ((d>>8)&0xFF)*bm )*lm ) & 0xFF0000 )) // no need to shift // red element | ((( ( ((b>>16)&0xFF)*tm + ((d>>16)&0xFF)*bm )*lm ) & 0xFF0000 ) << 8 ) ; output[offset++] = // alpha 0x000000ff | leftOutput | rightOutput ; } } } 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.