Linked Questions
32 questions linked to/from What is the efficient way to count set bits at a position or lower?
1029 votes
66 answers
674k views
Count the number of set bits in a 32-bit integer
8 bits representing the number 7 look like this: 00000111 Three bits are set. What are the algorithms to determine the number of set bits in a 32-bit integer?
1656 votes
11 answers
202k views
Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs
I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC. ...
958 votes
11 answers
188k views
Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?
I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with: ...
336 votes
19 answers
154k views
Is multiplication and division using shift operators in C actually faster?
Multiplication and division can be achieved using bit operators, for example i*2 = i<<1 i*3 = (i<<1) + i; i*10 = (i<<3) + (i<<1) and so on. Is it actually faster to use say (...
186 votes
11 answers
326k views
Are the shift operators (<<, >>) arithmetic or logical in C?
In C, are the shift operators (<<, >>) arithmetic or logical?
134 votes
16 answers
165k views
Best practices for circular shift (rotate) operations in C++
Left and right shift operators (<< and >>) are already available in C++. However, I couldn't find out how I could perform circular shift or rotate operations. How can operations like "Rotate ...
54 votes
5 answers
5k views
What is the fastest way to update a variable on a condition?
I have a pointer, ptr, and a condition, cond. I need the fastest possible way to reset ptr if cond is true, or keep ptr unchanged if cond is false. The current implementation is, trivially: void ...
43 votes
8 answers
9k views
Branch-aware programming
I'm reading around that branch misprediction can be a hot bottleneck for the performance of an application. As I can see, people often show assembly code that unveil the problem and state that ...
67 votes
1 answer
8k views
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
I'm a newbie at instruction optimization. I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays. The C code is as follows: float dotp( ...
17 votes
2 answers
10k views
Using base pointer register in C++ inline asm
I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so: void Foo(int &x) { asm volatile ("pushq %%rbp;" // 'prologue' ...
7 votes
5 answers
2k views
How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results?
I want to know the logic behind this statement, the proof. The C expression -x, ~x+1, and ~(x-1) all yield the same results for any x. I can show this is true for specific examples. I think the way ...
9 votes
1 answer
4k views
Drawing a character in VGA memory with GNU C inline assembly
I´m learning to do some low level VGA programming in DOS with C and inline assembly. Right now I´m trying to create a function that prints out a character on screen. This is my code: //This is the ...
34 votes
3 answers
1k views
Why does g++ pull computations into a hot loop?
I have a very weird compiler behavior where G++ pulls computations into a hot loop, severly reducing the performance of the resulting code. What is going on here? Consider this function: #include &...
13 votes
1 answer
30k views
What are the return values of system calls in Assembly?
When I try to research about return values of system calls of the kernel, I find tables that describe them and what do I need to put in the different registers to let them work. However, I don't find ...
20 votes
1 answer
3k views
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular ...