I have the following critical place in the code: I need to look up from 64-byte array around 1'000'000 times.
Minimal code:
#include <iostream> #include <stdint.h> #include <random> #include <chrono> #include <ctime> #define TYPE uint8_t #define n_lookup 64 int main(){ const int n_indices = 1000000; TYPE lookup[n_lookup]; TYPE indices[n_indices]; TYPE result[n_indices]; // preparations std::default_random_engine generator; std::uniform_int_distribution<int> distribution(0, n_lookup); for (int i=0; i < n_indices; i++) indices[i] = distribution(generator); for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator); std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now(); // main loop: for (int i=0; i < n_indices; i++) { result[i] = lookup[indices[i]]; } std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now(); std::chrono::duration<double> elapsed_seconds = end - start; std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element"<< std::endl; // printing random numbers to avoid code elimination std::cout << result[12] << result[45]; return 0; } After compiling with g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops I get a bit less than 1ns per element on modern CPU.
I need this operation to work 2-3 times faster (without threads). How can I do this?
P.S. I also was investigating AVX512 (512 bits is exactly the size of lookup table!) instruction set, but it lacks 8-bit gather operations!