

EDIT:
I did some research and I have some nice results. How would you explain this behavior?

I will first show the case 1 and case 2 times. As you can see, case 2 runs faster, at least on my machine.

alin@ubuntu:~/Desktop$ time ./1
real 0m4.025s
user 0m4.008s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2
real 0m3.285s
user 0m3.272s
sys 0m0.016s

Here is the code, compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.

Case 1.

    #include <stdio.h>

    extern bool * b;
    extern int * x;
    extern int * a;
    extern unsigned long * loop;
    extern void A();
    extern void B();

    int main() {
        for (unsigned long i = 0; i < *loop; ++i) {
            *x += *a;
            if (*b) { A(); } else { B(); }
        }
        delete b; delete x; delete a; delete loop;
        return 0;
    }

    bool * b = new bool(true);
    int * x = new int(0);
    int * a = new int(0);
    unsigned long * loop = new unsigned long(0xfffffffe);

    void A() { --*x; *b = false; }
    void B() { ++*x; *b = true; }

Case 2.

    #include <stdio.h>

    extern bool * b;
    extern int * x;
    extern int * a;
    extern unsigned long * loop;
    extern void A();
    extern void B();

    int main() {
        for (unsigned long i = 0; i < *loop; ++i) {
            if (*b) { *x += *a; A(); }
            else    { *x += *a; B(); }
        }
        delete b; delete x; delete a; delete loop;
        return 0;
    }

    bool * b = new bool(true);
    int * x = new int(0);
    int * a = new int(0);
    unsigned long * loop = new unsigned long(0xfffffffe);

    void A() { --*x; *b = false; }
    void B() { ++*x; *b = true; }

So this pretty much says what I believed? If the compiler has no way to know at compile time, then it can't optimize that, and therefore you should do it yourself?

EDIT:
I did some research and I have some nice results. How would you explain this behavior? Sorry for my latest edit, but I had some caching problems; as far as I could see, these are more accurate results and code samples, I hope.

Case 1.

    #include <stdio.h>

    extern int * cache;
    extern bool * b;
    extern int * x;
    extern int * a;
    extern unsigned long * loop;
    extern void A();
    extern void B();

    int main() {
        for (unsigned long i = 0; i < *loop; ++i) {
            ++*cache;
            *x += *a;
            if (*b) { A(); } else { B(); }
        }
        delete b; delete x; delete a; delete loop; delete cache;
        return 0;
    }

    int * cache = new int(0);
    bool * b = new bool(true);
    int * x = new int(0);
    int * a = new int(0);
    unsigned long * loop = new unsigned long(0x0ffffffe);

    void A() { --*x; *b = false; }
    void B() { ++*x; *b = true; }

Case 2.

    #include <stdio.h>

    extern int * cache;
    extern bool * b;
    extern int * x;
    extern int * a;
    extern unsigned long * loop;
    extern void A();
    extern void B();

    int main() {
        for (unsigned long i = 0; i < *loop; ++i) {
            ++*cache;
            if (*b) { *x += *a; A(); }
            else    { *x += *a; B(); }
        }
        delete b; delete x; delete a; delete loop; delete cache;
        return 0;
    }

    int * cache = new int(0);
    bool * b = new bool(true);
    int * x = new int(0);
    int * a = new int(0);
    unsigned long * loop = new unsigned long(0x0ffffffe);

    void A() { --*x; *b = false; }
    void B() { ++*x; *b = true; }

There is a pretty much unnoticeable difference between the -O3 versions of the two approaches, but without -O3 the second case does run slightly faster, at least on my machine. I have tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1

real 0m20.231s
user 0m20.224s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2

real 0m19.932s
user 0m19.890s
sys 0m0.060s






I've just stumbled upon this, and I'm really curious whether modern CPUs (current ones, and maybe mobile/embedded ones as well) actually have no branching cost in the situation below.

1. Let's say we have this:

    x += a; // let's assume they are both declared earlier as simple ints
    if (flag)
        do A // let's assume A is not the same as B
    else
        do B // and of course B is different than A

2. Compared to this:

    if (flag)
    {
        x += a
        do A
    }
    else
    {
        x += a
        do B
    }

Assuming A and B are completely different in terms of pipeline instructions (fetch, decode, execute, etc.):

  1. Is the 2nd approach going to be faster?

  2. Are CPUs smart enough to tell that, no matter what the flag is, the next instruction is the same (so they won't have to discard pipeline stages for it because of a branch misprediction)?

Note:

In the first case the CPU has no option but to discard the first few pipeline stages of do A or do B when a branch misprediction happens, because they are different. I see the 2nd example as a kind of delayed branching: "I'm going to check that flag, but even if I don't know the flag yet, I can get on with the next instruction, because it's the same no matter what the flag is; I already have the next instruction and it's OK for me to use it."
