For a while, I have tried to find ways (known or new) to do stuff with Gray codes. As some of you probably know, I already tried to implement an algorithm to add two Gray codes without having to convert them to a regular binary representation, perform a regular addition and convert the result back to a Gray code.
The first time, I tried to implement a bitwise algorithm described by Harold Lucal and then to optimize it. Since even the most optimized code I could come up with was still slower than the naive double-conversion solution, I decided to start over from scratch: observe patterns in Gray codes and find a new way to perform an addition. Before going any further, let me describe the notation used in the rest of the question and how some mathematical symbols are (ab)used to represent common bitwise operations:
- \$\oplus\$ is used to represent a bitwise XOR operation.
- \$\ll\$ is used to represent a left shift.
- \$\gg\$ is used to represent a right shift.
- \$parity(n)\$ is used to compute the parity of a Gray code \$n\$, which happens to correspond to the parity of the number of set bits in its representation (you can find an implementation in the older question).
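For completeness, here is a minimal sketch of such a parity function, written for a raw unsigned integer (the actual implementation from the older question works on `gray_code` directly and may differ):

```
#include <cstddef>
#include <limits>

// Parity of the number of set bits: true when that number is odd.
// Each XOR-fold halves the width that still matters, so the lowest
// bit ends up holding the XOR (i.e. the parity) of all the bits.
template<typename Unsigned>
auto parity(Unsigned n)
    -> bool
{
    for (std::size_t i = std::numeric_limits<Unsigned>::digits / 2u ; i > 0u ; i >>= 1u) {
        n ^= n >> i;
    }
    return (n & 1u) != 0u;
}
```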
Here are the most interesting things I observed and used to devise the new addition algorithm:
A Gray power of \$2\$ corresponds to two consecutive set bits followed by clear bits, except for \$1\$. This one is easy to observe: in the usual integer representation, a power of \$2\$ corresponds to a single bit set, and the formula to convert an integer \$n\$ to its corresponding Gray code is \$(n \gg 1) \oplus n\$. For instance, \$8 = 1000_2\$ becomes \$(1000_2 \gg 1) \oplus 1000_2 = 1100_2\$.
For any Gray code \$n\$, \$2n = (n \ll 1) \oplus parity(n)\$.
For two Gray codes \$a\$ and \$b\$, if \$a\$ is a power of \$2\$ and \$a > b\$, then \$a \oplus b = a + b\$.
For two Gray codes \$a = 2^n\$ and \$b\$, if \$a \leq b < 2^{n+1}\$, then \$a \oplus b = b - a\$ (derived from the preceding observation).
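These observations are easy to check mechanically; here is a small self-contained sketch on plain unsigned integers (`to_gray` and `from_gray` are hypothetical helpers written only for this check, not part of the `gray_code` class):

```
#include <cassert>
#include <cstdint>

// Usual binary -> Gray conversion
std::uint32_t to_gray(std::uint32_t n) { return (n >> 1) ^ n; }

// Usual Gray -> binary conversion, by XOR-folding so that every
// bit receives the XOR of all the bits above it
std::uint32_t from_gray(std::uint32_t g)
{
    for (std::uint32_t i = 1u ; i < 32u ; i <<= 1u) {
        g ^= g >> i;
    }
    return g;
}

int main()
{
    std::uint32_t a = to_gray(8);   // a Gray power of 2: 0b1100
    std::uint32_t b = to_gray(5);   // 0b0111
    std::uint32_t c = to_gray(13);  // 0b1011

    // 2n = (n << 1) XOR parity(n): 0b0111 has three set bits, so parity is 1
    assert(from_gray((b << 1u) ^ 1u) == 10u);

    // a is a power of 2 and a > b: a XOR b = a + b
    assert(from_gray(a ^ b) == 8u + 5u);

    // a = 2^n and a <= c < 2^(n+1): a XOR c = c - a
    assert(from_gray(a ^ c) == 13u - 8u);
}
```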
Additionally, I will frequently use what I call the "base" or "base2" of a Gray code (I don't have a better name - if you have one, please let me know), which is, for a Gray code \$a\$, the Gray code \$2^n\$ such that \$2^n \leq a < 2^{n+1}\$ (e.g. if \$a = 7\$, then \$base(a) = 4\$). I use it with an exception for \$0\$, where \$base(0) = 0\$ instead of \$1\$.
So here is the pseudocode for the new addition algorithm. I use the gray_code implementation from the older question - you can find an up-to-date version on GitHub:
```
function add(a: gray, b: gray) -> gray
    if base(a) = base(b) then:
        if a = b then:
            return 2 * a
        else:
            return 2 * base(a) + (a - base(a)) + (b - base(b))
    else:
        if a < b then:
            swap(a, b)
        tmp := (a - base(a)) + b
        if base(a) > base(tmp) then:
            return base(a) + tmp
        else:
            # Here, base(a) = base(tmp)
            return 2 * base(a) + (tmp - base(a))
```

As you can see (can you?), the principle is to add things when it is simple (when \$a = b\$ or when \$a = 2^n\$) and to break numbers into powers of \$2\$ otherwise, then try to add them.
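To make the recursion concrete, here is a hand trace of \$7 + 5\$ (my own example, written on the represented values; every \$-\$ is the XOR-based removal of a base):

```
add(7, 5):  base(7) = base(5) = 4 and 7 ≠ 5
            → 2*4 + ((7 - 4) + (5 - 4)) = 8 + add(3, 1)
add(3, 1):  base(3) = 2, base(1) = 1, bases differ
            → tmp = (3 - 2) + 1 = add(1, 1)
add(1, 1):  equal operands → 2*1 = 2
add(3, 1):  base(tmp) = 2 = base(3) → 2*2 + (2 - 2) = 4 + 0 = 4
add(7, 5):  8 is a power of 2 greater than 4, so 8 + 4 is a single XOR
            of the codes: gray(8) XOR gray(4) = 0b1100 XOR 0b0110 = gray(12)
```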
Here is the recursive C++14 implementation of the algorithm, with the correct bitwise operations to implement the simple cases of addition, subtraction and multiplication by \$2\$ (when reading it, don't forget that I use `gray_code<Unsigned>` everywhere, and that every `+` is actually a recursive call, not a regular integer addition):

```
template<typename Unsigned>
auto operator+(gray_code<Unsigned> lhs, gray_code<Unsigned> rhs)
    -> gray_code<Unsigned>
{
    auto lhs_base = base2(lhs);
    auto rhs_base = base2(rhs);

    if (lhs_base == rhs_base) {
        if (lhs == rhs) {
            return (lhs << 1u) ^ is_odd(lhs);
        }
        return (lhs_base << 1u) ^ ((lhs_base ^ lhs) + (lhs_base ^ rhs));
    }

    if (lhs_base.value < rhs_base.value) {
        std::swap(lhs, rhs);
        std::swap(lhs_base, rhs_base);
    }

    if (lhs == lhs_base) {
        return lhs ^ rhs;
    }

    auto tmp = (lhs ^ lhs_base) + rhs;
    if (base2(tmp).value < lhs_base.value) {
        return lhs_base ^ tmp;
    }
    return (lhs_base << 1u) ^ (lhs_base ^ tmp);
}
```
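A quick sanity check of the operator (my own sketch; it assumes that the `gray_code` constructor encodes a plain unsigned value and that `.value` exposes the raw bit pattern, as the code above suggests):

```
#include <cassert>
#include <cstdint>

int main()
{
    gray_code<std::uint32_t> a{ 7u };  // bit pattern 0b0100
    gray_code<std::uint32_t> b{ 5u };  // bit pattern 0b0111
    auto c = a + b;                    // Gray addition, no double conversion
    assert(c.value == 0b1010u);        // the bit pattern of gray(12)
}
```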
And here is the implementation of `base2` and of the function `isomsb` (isolate most significant bit) used to implement it:

```
// Isolate the most significant bit
template<typename Unsigned>
auto isomsb(gray_code<Unsigned> x)
    -> gray_code<Unsigned>
{
    for (std::size_t i = 1u ; i <= std::numeric_limits<Unsigned>::digits / 2u ; i <<= 1u) {
        x |= x >> i;
    }
    return x & ~(x >> 1u);
}

// Return the greatest power of 2 not higher than x,
// where x and the power of 2 are encoded in Gray code
template<typename Unsigned>
auto base2(gray_code<Unsigned> x)
    -> gray_code<Unsigned>
{
    auto msb = isomsb(x);
    return msb | (msb >> 1u);
}
```

Unfortunately, this new addition algorithm is even slower than the one from my previous question (not by an order of magnitude, but almost by half of one). So, I have several questions:
- Correctness: do you see any flaw in the algorithm? Consider that it does not handle overflow and is not meant to. I tested it on additions of Gray codes whose results are in the range \$[0, 2^{32}[\$ and it passed all the tests. I think the algorithm is correct, but I didn't prove it.
- Speed: do you see any way to improve the algorithm so that it runs faster on average? Currently, the only optimization is the dropped `is_odd` when doubling a power of \$2\$ (a base), since every power of \$2\$ is known to be even by definition (except \$1\$, which is handled separately). The `isomsb` function already uses a bithack found on one of the well-known bit twiddling sites. It boils down to 5 iterations for a 32-bit integer and 6 iterations for a 64-bit integer; I like it since it's generic and pretty fast. The compiler seems to use the fact that the bound is known at compile time to unroll the loop, so it probably ends up as ~13 instructions for a 32-bit integer (see the unrolled sketch below).
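For reference, this is the unrolled form `isomsb` presumably boils down to for a 32-bit integer (a sketch on a raw integer, not the actual generated code):

```
#include <cstdint>

// Hypothetical unrolled equivalent of isomsb for a raw 32-bit
// integer: 5 smearing steps, then keep only the topmost set bit
std::uint32_t isomsb_u32(std::uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;            // all bits below the MSB are now set
    return x & ~(x >> 1);    // clear everything but the MSB
}
```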