Linked Questions
118 questions linked to/from How much of ‘What Every Programmer Should Know About Memory’ is still valid?
0 votes
0 answers
127 views
N00b question about main memory and cpu register memory [duplicate]
I'm new to learning about low-level constructs and have a simple question about how they work. My understanding is that if I have a piece of code int* arr = new int[100000000]; this will be a section ...
0 votes
0 answers
113 views
I have no idea why changing variable access/storage type in pthread subroutine sharply increases performance [duplicate]
I am new to multithreaded programming, and I knew coming into it that there are some weird side effects if you are not careful, but I didn't expect to be THIS puzzled about code I wrote. I am writing ...
2451 votes
11 answers
262k views
Why are elementwise additions much faster in separate loops than in a combined loop?
Suppose a1, b1, c1, and d1 point to heap memory, and my numerical code has the following core loop. const int n = 100000; for (int j = 0; j < n; j++) { a1[j] += b1[j]; c1[j] += d1[j]; } ...
892 votes
10 answers
225k views
What does it mean for code to be "cache-friendly"?
What is the difference between "cache unfriendly code" and "cache friendly" code? How can I make sure I write cache-efficient code?
238 votes
5 answers
147k views
How do cache lines work?
I understand that the processor brings data into the cache via cache lines, which - for instance, on my Atom processor - brings in about 64 bytes at a time, whatever the size of the actual data being ...
113 votes
6 answers
39k views
Enhanced REP MOVSB for memcpy
I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy. ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and ...
47 votes
7 answers
40k views
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?
64 votes
4 answers
10k views
Micro fusion and addressing modes
I have found something unexpected (to me) using the Intel® Architecture Code Analyzer (IACA). The following instruction using [base+index] addressing addps xmm1, xmmword ptr [rsi+rax*1] does not ...
35 votes
4 answers
24k views
What is locality of reference?
I am having a problem understanding locality of reference. Can anyone please help me understand what it means, and what spatial locality of reference and temporal locality of reference are?
50 votes
2 answers
9k views
Can x86's MOV really be "free"? Why can't I reproduce this at all?
I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single test case. Every test case I try debunks ...
44 votes
1 answer
8k views
Why are loops always compiled into "do...while" style (tail jump)?
When trying to understand assembly (with compiler optimization on), I see this behavior: a very basic loop like this: outside_loop; while (condition) { statements; } is often compiled into (...
33 votes
3 answers
48k views
How can I benchmark the performance of C++ code? [closed]
I am starting to study algorithms and data structures seriously, and interested in learning how to compare the performance of the different ways I can implement A&DTs. For simple tests, I can get ...
37 votes
1 answer
10k views
What happens after an L2 TLB miss?
I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses. I am unsure whether "page walking" occurs in special hardware circuitry, or ...
21 votes
1 answer
10k views
Which cache mapping technique is used in Intel Core i7 processor?
I have learned about different cache mapping techniques like direct mapping and fully associative or set associative mapping, and the trade-offs between those. (Wikipedia) But I am curious which one ...
22 votes
3 answers
16k views
How to solve the 32-byte-alignment issue for AVX load/store operations?
I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example: #include <iostream> #include <immintrin.h> ...