
I have a critical spot in my code: I need to perform around 1'000'000 lookups from a 64-byte array.

Minimal code:

    #include <iostream>
    #include <stdint.h>
    #include <random>
    #include <chrono>
    #include <ctime>

    #define TYPE uint8_t
    #define n_lookup 64

    int main() {
        const int n_indices = 1000000;
        TYPE lookup[n_lookup];
        TYPE indices[n_indices];
        TYPE result[n_indices];

        // preparations
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, n_lookup);
        for (int i = 0; i < n_indices; i++) indices[i] = distribution(generator);
        for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);

        std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
        // main loop:
        for (int i = 0; i < n_indices; i++) {
            result[i] = lookup[indices[i]];
        }
        std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
        // printing random numbers to avoid code elimination
        std::cout << result[12] << result[45];
        return 0;
    }

After compiling with g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops, I get a bit less than 1 ns per element on a modern CPU.

I need this operation to work 2-3 times faster (without threads). How can I do this?

P.S. I also investigated the AVX512 instruction set (512 bits is exactly the size of the lookup table!), but it lacks 8-bit gather operations!

2
  • Could you cache the result? Commented Sep 11, 2016 at 17:39
  • @Galik, no, it doesn't make any sense for the problem, unfortunately. Commented Sep 11, 2016 at 17:42

4 Answers 4

3

The indices and result vectors are in different places in memory but are accessed at the same time, which leads to cache misses. I suggest merging result and indices into one vector. Here is the code:

    #include <iostream>
    #include <stdint.h>
    #include <random>
    #include <chrono>
    #include <ctime>

    #define TYPE uint8_t
    #define n_lookup 64

    int main() {
        const int n_indices = 2000000;
        TYPE lookup[n_lookup];
        // Merge indices and result
        // If i is index, then i+1 is result
        TYPE ind_res[n_indices];

        // preparations
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, n_lookup);
        for (int i = 0; i < n_indices; i += 2) ind_res[i] = distribution(generator);
        for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);

        std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
        // main loop:
        for (int i = 0; i < n_indices; i += 2) {
            ind_res[i+1] = lookup[ind_res[i]]; // more dense access here, no cache-miss
        }
        std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
        // printing random numbers to avoid code elimination
        std::cout << ind_res[24] << ind_res[90];
        return 0;
    }

My tests show that this code runs much faster.


9 Comments

OMG. This reduced the time to approximately 0.25 ns per element!
I don't understand why there are cache misses when I'm using two arrays, not one. Could you explain please?
Quite surprising =) I ran into this behaviour when I was learning about CPU caches, and I was really confused. I think it is related to the time difference between accessing the L1 and L2 caches. Since the L1 cache can accommodate just 2 or 4 lines, there is a big chance that data gets evicted from it. So when you use two arrays, you need 2 lines of L1 cache, but when you use one array, you need just one: 2 times lower chance of a cache miss.
It is even more mysterious. I've tried loading/writing with a shift, result[i] = lookup[indices[i + shift]], with different shifts. With some shifts I would expect no conflicts in the L1 cache, but the speed doesn't depend on the shift. Magic?
I get 0.44 ns with the OP's code and 0.58 ns with this code. You should multiply the time by 2, because n_indices is no longer the number of indices.
2

With -march=native, this is what your loop compiles to:

        movq    %rax, %rbx
        xorl    %eax, %eax
    .L145:
        movzbl  128(%rsp,%rax), %edx
        movzbl  64(%rsp,%rdx), %edx
        movb    %dl, 1000128(%rsp,%rax)
        addq    $1, %rax
        cmpq    $1000000, %rax
        jne     .L145

I'm struggling to see how that gets any quicker without parallelisation.

By changing TYPE to int32_t, it gets vectorised:

        vpcmpeqd   %ymm2, %ymm2, %ymm2
        movq       %rax, %rbx
        xorl       %eax, %eax
    .L145:
        vmovdqa    -8000048(%rbp,%rax), %ymm1
        vmovdqa    %ymm2, %ymm3
        vpgatherdd %ymm3, -8000304(%rbp,%ymm1,4), %ymm0
        vmovdqa    %ymm0, -4000048(%rbp,%rax)
        addq       $32, %rax
        cmpq       $4000000, %rax
        jne        .L145
        vzeroupper

Might that help?

2 Comments

Hi, I've benchmarked the code with uint32_t and -march=native; it slowed down by a factor of 2, and my compiler didn't use vectorisation: g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops -march=native -mavx2, with gcc 5.4. Do you have an idea why?
OK, got it. I was using uint32_t and vectorisation didn't kick in. Now I am using int32_t and the code is vectorised (I see vpgatherdd), but the application doesn't print anything to output.
2

First of all, there is a bug: distribution(0, 64) produces numbers from 0 to 64 inclusive, and 64 does not fit into the array.

You can speed up the lookup 2x by looking up two values at a time:

    #include <iostream>
    #include <stdint.h>
    #include <random>
    #include <chrono>
    #include <ctime>

    #define TYPE uint8_t
    #define TYPE2 uint16_t
    #define n_lookup 64

    void tst() {
        const int n_indices = 1000000; // has to be a multiple of 2
        TYPE lookup[n_lookup];
        TYPE indices[n_indices];
        TYPE result[n_indices];
        TYPE2 lookup2[n_lookup * 256];

        // preparations
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, n_lookup-1);
        for (int i = 0; i < n_indices; i++) indices[i] = distribution(generator);
        for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);
        // pair table: entry (i,j) holds the results for indices i and j
        for (int i = 0; i < n_lookup; ++i) {
            for (int j = 0; j < n_lookup; ++j) {
                lookup2[(i << 8) | j] = (lookup[i] << 8) | lookup[j];
            }
        }

        std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
        TYPE2* indices2 = (TYPE2*)indices;
        TYPE2* result2 = (TYPE2*)result;
        // main loop: one 16-bit lookup handles two 8-bit indices
        for (int i = 0; i < n_indices / 2; ++i) {
            *result2++ = lookup2[*indices2++];
        }
        std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

        for (int i = 0; i < n_indices; i++) {
            if (result[i] != lookup[indices[i]]) {
                std::cout << "!!!!!!!!!!!!!ERROR!!!!!!!!!!!!!";
            }
        }
        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
        // printing random numbers to avoid code elimination
        std::cout << result[12] << result[45];
    }

    int main() {
        tst();
        std::cin.get();
        return 0;
    }

2 Comments

Thanks for pointing out the bug. Nice approach; in my tests it shows a ~25-30% speed-up.
My speedup was 2x; obviously the systems are different. You can try moving the main loop to a separate function or doing a manual loop unroll, but those are just blind shots.
0

Your code is already really fast. However (on my system) the execution is about 4.858 % faster when you change

const int n_indices = 1000000; 

to

const int n_indices = 1048576; // 2^20

This is not much, but it's something.

Comments
