I have a problem converting a C++ program to assembly. I have to do it for

Here is my C++ code:

    for (int i = 0; i < rows - 4; i++, a += 4, b += 4, c += 4, d += 4, e += 4, f += 4, x += 4, o += 4) {
        for (int j = 0; j < cols - 4; j++, a++, b++, c++, d++, e++, f++, x++, o++) {
            *o = *a > *x;
            *o = *b > *x | (*o << 1);
            *o = *c > *x | (*o << 1);
            *o = *d > *x | (*o << 1);
            *o = *e > *x | (*o << 1);
            *o = *f > *x | (*o << 1);
        }
    }

o is a pointer to the output data, while a, b, c, d, e, f and x are pointers to the input data. All I want is to pack the comparison results from the input data into a single variable, but the code above is not efficient when the data being processed is large. The program needs more time to save data to memory than to keep temporary data in a register.

So what I want is to do this whole process in registers. What I've tried is to store the value pointed to by x in EBX, compare EBX to ECX, which holds the value pointed to by a (and then b, c, d, e, f in turn), save the comparison result in EAX, and shift EAX left so that all the comparisons end up in one variable. After all 6 comparisons have been processed, the value in EAX is copied to memory.

Here is what I did. My program runs two times faster, but all the values I get are just zero. Maybe I am doing it the wrong way?

    __asm__(
        "xorl %%eax,%%eax;"
        "xorl %%ebx,%%ebx;"
        "xorl %%ecx,%%ecx;"
        "movl %1, %%ebx;"
        // start here
        "movl %2,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .one;"
        "orl $0x1,%%eax;"
        ".one:;"
        "shll $1,%%eax;"
        "movl %3,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .two;"
        "orl $0x1,%%eax;"
        ".two:;"
        "shll $1,%%eax;"
        "movl %4,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .three;"
        "orl $0x1,%%eax;"
        ".three:;"
        "shll $1,%%eax;"
        "movl %5,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .four;"
        "orl $0x1,%%eax;"
        ".four:"
        "shll $1,%%eax;"
        "movl %6,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .five;"
        "orl $0x1,%%eax;"
        ".five:"
        "shll $1,%%eax;"
        "movl %7,%%ecx;"
        "cmp %%ebx,%%ecx;"
        "jnz .six;"
        "orl $0x1,%%eax;"
        ".six:"
        // output
        "movl %%eax,%0;"
        : "=r"(sett)
        : "r"((int)*x), "r"((int)*a), "r"((int)*b), "r"((int)*c), "r"((int)*d), "r"((int)*e), "r"((int)*f) /* input */
    );
    I believe there are tools that are pretty good at converting C and C++ code into fairly optimized assembly without introducing human errors. If only I could remember what they were called... Commented Aug 10, 2013 at 2:10

2 Answers


A few options:

1) Throw away your handcrafted assembly code. You said the C code is slow; tell us by how much. I can't see how you could have measured the difference in any meaningful way, as the asm version doesn't even produce the correct result. Put another way, try asm("nop;");, it's an even faster way to produce the incorrect result.

2) Rewrite your C code to read *x only once; keep the result in a temporary variable, and only write to *o at the end.

3) If appropriate for your semantics (and supported by your compiler) decorate your pointers with restrict/__restrict/__restrict__ (from C99, commonly available in C++ as an extension) so the compiler knows none of the input variables change when you write to *o.

4) Compilers are fairly good at unrolling loops automatically. It might require a combination of command-line options, #pragma directives, or extensions/attributes.
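
As a rough sketch of option 3, assuming GCC or Clang (both accept `__restrict__` in C++) and assuming the data are `uint32_t` values, the per-row work could be moved into a helper like the one below. The name `pack_row` and its signature are hypothetical, not something from the question.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original post): packs six comparisons
// per element into the low six bits of out[i].
// __restrict__ is a compiler extension; it promises that `out` does not
// overlap any input, so the compiler may keep x[i] in a register instead
// of reloading it after each store.
void pack_row(const std::uint32_t* __restrict__ a,
              const std::uint32_t* __restrict__ b,
              const std::uint32_t* __restrict__ c,
              const std::uint32_t* __restrict__ d,
              const std::uint32_t* __restrict__ e,
              const std::uint32_t* __restrict__ f,
              const std::uint32_t* __restrict__ x,
              std::uint32_t* __restrict__ out,
              std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        // Same bit layout as the original loop: a ends up in bit 5, f in bit 0.
        out[i] = (std::uint32_t(a[i] > x[i]) << 5)
               | (std::uint32_t(b[i] > x[i]) << 4)
               | (std::uint32_t(c[i] > x[i]) << 3)
               | (std::uint32_t(d[i] > x[i]) << 2)
               | (std::uint32_t(e[i] > x[i]) << 1)
               |  std::uint32_t(f[i] > x[i]);
    }
}
```

Calling this with an `out` buffer that does overlap one of the inputs would be undefined behaviour, so it only fits if the buffers are really distinct.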

EDIT

This is what I mean by rewriting it to use temporaries:

    for (int i = 0; i < rows - 4; i++, a += 4, b += 4, c += 4, d += 4, e += 4, f += 4, x += 4, o += 4) {
        for (int j = 0; j < cols - 4; j++, a++, b++, c++, d++, e++, f++, x++, o++) {
            uint32_t tmp_x = *x;  // read *x once
            *o = (*a > tmp_x ? 0x20 : 0)
               | (*b > tmp_x ? 0x10 : 0)
               | (*c > tmp_x ? 0x08 : 0)
               | (*d > tmp_x ? 0x04 : 0)
               | (*e > tmp_x ? 0x02 : 0)
               | (*f > tmp_x ? 0x01 : 0);  // write *o once
        }
    }

What difference does it make? In the original version, x is read in every single assignment. The compiler doesn't know that o and x point to different locations; in the worst case, it has to read from x again every single time, because writing to o could change the value in x.

Of course, this code has different semantics: if you are really letting o alias any of the other pointers, it will do something different from the original.
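
To make the aliasing point concrete, here is a small sketch that processes a single element both ways. The function names and the array-of-pointers parameter `in` (standing in for a..f) are hypothetical, chosen only to keep the example short.

```cpp
#include <cstdint>

// Original style: re-reads *x and read-modify-writes *o at every step.
// `in` holds the six input pointers a..f (an assumption for brevity).
std::uint32_t pack_original(const std::uint32_t* const in[6],
                            const std::uint32_t* x, std::uint32_t* o)
{
    *o = *in[0] > *x;
    for (int k = 1; k < 6; ++k)
        *o = (*in[k] > *x) | (*o << 1);
    return *o;
}

// Temporary style: reads *x once, accumulates in a local, writes *o once.
std::uint32_t pack_with_temp(const std::uint32_t* const in[6],
                             const std::uint32_t* x, std::uint32_t* o)
{
    const std::uint32_t tmp_x = *x;
    std::uint32_t bits = 0;
    for (int k = 0; k < 6; ++k)
        bits = (bits << 1) | (*in[k] > tmp_x);
    *o = bits;
    return bits;
}
```

With all six inputs equal to 5 and *x equal to 3, both functions return 63 when o and x are separate objects. If o and x point to the same object, the original style returns 56 because each store changes the value it compares against next, while the temporary style still returns 63.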

5 Comments

The program takes about 33ms for 1M data elements (my image was 1200x1100). The most time-consuming part is storing to *o: when it does not store to *o, the time is only about 5ms, and using that assembly code takes about 15ms. So the problem is not the reading part but the writing part. I need to modify my program so that it stores the result only once. Storing to a temporary variable also does not help.
If you don't write anything out to *o, the compiler is probably optimizing the loop away. As I said, if your ASM code produces the incorrect result, it doesn't make sense to compare. My ASM code beats yours by a factor of infinity.
Wow, thank you very much. It really gives my program a significant speedup: it now runs in about 9ms (before it was 33ms), and combined with the register keyword mentioned by tallen it takes about 7ms. Thank you very much :D
I never found the register keyword to change anything in the generated assembly. Some messages from the GCC mailing list suggest that unless you disabled all optimizations, it shouldn't matter (it was in fact deprecated in C++11 for being useless). Are you compiling with optimizations enabled? I imagine restrict will have a much larger impact.
For me the register keyword really makes a difference in speed, about 1ms for 1 million iterations. I think optimizations are not enabled. My compiler is strange: this morning the code ran very slowly, then I wrote new code, exactly the same, and the speed was normal again. Now it is about 8.5ms; maybe tomorrow it will be slow again. I don't know what happened. I use gcc 4.7.2.

I am going to assume you are using a recent Intel chip. What I think you really want to use are the vector capabilities (rather limited if one is used to, say, a Cray :-), which are called AVX. There are also libraries that will do this in C/C++; start by googling AVX and C.

Having said that, you could also tell the compiler to store some variables in registers by using the "register" keyword; see the question "Register keyword in C++".

1 Comment

Yes, you gave me a very good suggestion. I edited my program as suggested by DanielKO and I also use the register keyword; now my program runs in about 7ms (before it was 33ms!). I will also check AVX. Thank you very much :-)
