Optimize Bilinear Resize Algorithm in C

Question

Can anyone spot any way to improve the speed in the next Bilinear resizing Algorithm? I need to improve Speed as this is critical, keeping good image quality. Is expected to be used in mobile devices with low speed CPUs. The algorithm is used mainly for up-scale resizing. Any other faster Bilinear algorithm also would be appreciated. Thanks

void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight) { int a, b, c, d, x, y, index; float x_ratio = ((float)(sourceWidth - 1)) / targetWidth; float y_ratio = ((float)(sourceHeight - 1)) / targetHeight; float x_diff, y_diff, blue, red, green ; int offset = 0 ; for (int i = 0; i < targetHeight; i++) { for (int j = 0; j < targetWidth; j++) { x = (int)(x_ratio * j) ; y = (int)(y_ratio * i) ; x_diff = (x_ratio * j) - x ; y_diff = (y_ratio * i) - y ; index = (y * sourceWidth + x) ; a = input[index] ; b = input[index + 1] ; c = input[index + sourceWidth] ; d = input[index + sourceWidth + 1] ; // blue element blue = (a&0xff)*(1-x_diff)*(1-y_diff) + (b&0xff)*(x_diff)*(1-y_diff) + (c&0xff)*(y_diff)*(1-x_diff) + (d&0xff)*(x_diff*y_diff); // green element green = ((a>>8)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>8)&0xff)*(x_diff)*(1-y_diff) + ((c>>8)&0xff)*(y_diff)*(1-x_diff) + ((d>>8)&0xff)*(x_diff*y_diff); // red element red = ((a>>16)&0xff)*(1-x_diff)*(1-y_diff) + ((b>>16)&0xff)*(x_diff)*(1-y_diff) + ((c>>16)&0xff)*(y_diff)*(1-x_diff) + ((d>>16)&0xff)*(x_diff*y_diff); output [offset++] = 0x000000ff | // alpha ((((int)red) << 24)&0xff0000) | ((((int)green) << 16)&0xff00) | ((((int)blue) << 8)&0xff00); } } }

While it is not directly applicable, but might be inspirational: en.wikipedia.org/wiki/Bresenham's_line_algorithm — biziclop
– biziclop, Commented Jul 6, 2012 at 14:11
OP doesn't mention the code is from here: tech-algorithm.com/articles/bilinear-image-scaling — oPless
– oPless, Commented Jun 16, 2014 at 9:11

unwind · Accepted Answer · 2012-07-06 13:45:20Z

Off the the top of my head:

Stop using floating-point, unless you're certain your target CPU has it in hardware with good performance.
Make sure memory accesses are cache-optimized, i.e. clumped together.
Use the fastest data types possible. Sometimes this means smallest, sometimes it means "most native, requiring least overhead".
Investigate if signed/unsigned for integer operations have performance costs on your platform.
Investigate if look-up tables rather than computations gain you anything (but these can blow the caches, so be careful).

And, of course, do lots of profiling and measurements.

Actually, 3. is not generally applicable: the compiler will have to expand small values into the values the CPU can work with. So using int actually has its benefits.

haylem · Accepted Answer · 2012-07-06 13:56:23Z

In-Line Cache and Lookup Tables

Cache your computations in your algorithm.

Avoid duplicate computations (like (1-y_diff) or (x_ratio * j))

Go through all the lines of your algorithm, and try to identify patterns of repetitions. Extract these to local variables. And possibly extract to functions, if they are short enough to be inlined, to make things more readable.
Use a lookup-table

It's quite likely that, if you can spare some memory, you can implement a "store" for your RGB values and simply "fetch" them based on the inputs that produced them. Maybe you don't need to store all of them, but you could experiment and see if some come back often. Alternatively, you could "fudge" your colors and thus end up with less values to store for more lookup inputs.

If you know the boundaries for you inputs, you can calculate the complete domain space and figure out what makes sense to cache. For instance, if you can't cache the whole R, G, B values, maybe you can at least pre-compute the shiftings ((b>>16) and so forth...) that are most likely deterministic in your case).

Use the Right Data Types for Performance

If you can avoid double and float variables, use int. On most architectures, int would be test faster type for computations because of the memory model. You can still achieve decent precision by simply shifting your units (ie use 1026 as int instead of 1.026 as double or float). It's quite likely that this trick would be enough for you.

PypeBros · Accepted Answer · 2012-07-06 13:45:25Z

 x = (int)(x_ratio * j) ; y = (int)(y_ratio * i) ; x_diff = (x_ratio * j) - x ; y_diff = (y_ratio * i) - y ; index = (y * sourceWidth + x) ;

Could surely use some optimization: you were using x_ration * j-1 just a few cycles earlier, so all you really need here is x+=x_ratio

Patrick · Accepted Answer · 2012-07-06 15:39:57Z

My random guess (use a profiler instead of letting people guess!):

The compiler has to generate that works when input and output overlap which means it has to do generate loads of redundant stores and loads. Add restrict to the input and output parameters to remove that safety feature.

You could also try using a=b; and c=d; instead of loading them again.

biziclop · Accepted Answer · 2012-07-06 16:11:05Z

here is my version, steal some ideas. My C-fu is quite weak, so some lines are pseudocodes, but you can fix them.

void resize(int* input, int* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight ) { // Let's create some lookup tables! // you can move them into 2-dimensional arrays to // group together values used at the same time to help processor cache int sx[0..targetWidth ]; // target->source X lookup int sy[0..targetHeight]; // target->source Y lookup int mx[0..targetWidth ]; // left pixel's multiplier int my[0..targetHeight]; // bottom pixel's multiplier // we don't have to calc indexes every time, find out when bool reloadPixels[0..targetWidth ]; bool shiftPixels[0..targetWidth ]; int shiftReloadPixels[0..targetWidth ]; // can be combined if necessary int v; // temporary value for (int j = 0; j < targetWidth; j++){ // (8bit + targetBits + sourceBits) should be < max int v = 256 * j * (sourceWidth-1) / (targetWidth-1); sx[j] = v / 256; mx[j] = v % 256; reloadPixels[j] = j ? ( sx[j-1] != sx[j] ? 1 : 0) : 1; // always load first pixel // if no reload -> then no shift too shiftPixels[j] = j ? ( sx[j-1]+1 = sx[j] ? 2 : 0) : 0; // nothing to shift at first pixel shiftReloadPixels[j] = reloadPixels[i] | shiftPixels[j]; } for (int i = 0; i < targetHeight; i++){ v = 256 * i * (sourceHeight-1) / (targetHeight-1); sy[i] = v / 256; my[i] = v % 256; } int shiftReload; int srcIndex; int srcRowIndex; int offset = 0; int lm, rm, tm, bm; // left / right / top / bottom multipliers int a, b, c, d; for (int i = 0; i < targetHeight; i++){ srcRowIndex = sy[ i ] * sourceWidth; tm = my[i]; bm = 255 - tm; for (int j = 0; j < targetWidth; j++){ // too much ifs can be too slow, measure. // always true for first pixel in a row if( shiftReload = shiftReloadPixels[ j ] ){ srcIndex = srcRowIndex + sx[j]; if( shiftReload & 2 ){ a = b; c = d; }else{ a = input[ srcIndex ]; c = input[ srcIndex + sourceWidth ]; } b = input[ srcIndex + 1 ]; d = input[ srcIndex + 1 + sourceWidth ]; } lm = mx[j]; rm = 255 - lm; // WTF? // Input AA RR GG BB // Output RR GG BB AA if( j ){ leftOutput = rightOutput ^ 0xFFFFFF00; }else{ leftOutput = // blue element ((( ( (a&0xFF)*tm + (c&0xFF)*bm )*lm ) & 0xFF0000 ) >> 8) // green element | ((( ( ((a>>8)&0xFF)*tm + ((c>>8)&0xFF)*bm )*lm ) & 0xFF0000 )) // no need to shift // red element | ((( ( ((a>>16)&0xFF)*tm + ((c>>16)&0xFF)*bm )*lm ) & 0xFF0000 ) << 8 ) ; } rightOutput = // blue element ((( ( (b&0xFF)*tm + (d&0xFF)*bm )*lm ) & 0xFF0000 ) >> 8) // green element | ((( ( ((b>>8)&0xFF)*tm + ((d>>8)&0xFF)*bm )*lm ) & 0xFF0000 )) // no need to shift // red element | ((( ( ((b>>16)&0xFF)*tm + ((d>>16)&0xFF)*bm )*lm ) & 0xFF0000 ) << 8 ) ; output[offset++] = // alpha 0x000000ff | leftOutput | rightOutput ; } } }

Collectives™ on Stack Overflow

Optimize Bilinear Resize Algorithm in C

5 Answers 5

1 Comment

In-Line Cache and Lookup Tables

Use the Right Data Types for Performance

Comments

Comments

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

1 Comment

In-Line Cache and Lookup Tables

Use the Right Data Types for Performance

Comments

Comments

Comments

Comments

Related