I have implemented the Sieve of Eratosthenes with wheel factorization in C++ again, this time using a bigger wheel and considerably more involved C++ syntax, to familiarize myself with C++ and to see how fast I can make the program run. It turns out my C++ program is less efficient than equivalent code written with Numba.
All primes except 2 are odd; this is common knowledge, but it means every prime greater than 2 must be coprime to 2, which halves the number of candidates that need to be checked.
Extending this, every prime greater than 7 must be coprime to 2, 3, 5 and 7, which means all primes starting at 11 must be congruent to one of the following 48 numbers modulo 210:
{ 1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209 }

This narrows the candidates down to 48/210 = 8/35 of the original, and skips the other 27/35 ≈ 0.7714 of all numbers, which never need to be checked.
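Purely as a cross-check (this is not part of the program below), a minimal C++ sketch that derives these 48 residues and the 8/35 density by brute force:

#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> residues;
    for (int r = 0; r < 210; r++) {
        if (std::gcd(r, 210) == 1) {  // keep residue classes coprime to 2 * 3 * 5 * 7 = 210
            residues.push_back(r);
        }
    }
    // Prints: 48 residues, density 0.228571 (= 8/35)
    std::cout << residues.size() << " residues, density "
              << residues.size() / 210.0 << "\n";
}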
I wrote two versions. One is the standard version using a boolean array with optimizations; the other is a mapping-based one which is much more clever and, at the same time, much less efficient. The mapping-based approach is adapted from this with modifications based on this. I implemented the two approaches mainly to see if I could make them work, that is, whether I could correctly and efficiently use C++ data types to port the code.
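For readers unfamiliar with the mapping-based idea, here is a minimal sketch of the underlying incremental sieve, without the wheel and independent of the linked sources; it only illustrates the composite-to-prime map that my second function builds on:

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Minimal incremental sieve: map each known composite to the prime that "owns" it.
static std::vector<uint32_t> incremental_sieve(uint32_t limit) {
    std::vector<uint32_t> primes;
    std::unordered_map<uint64_t, uint64_t> composites;  // composite -> owning prime
    for (uint64_t k = 2; k <= limit; k++) {
        auto it = composites.find(k);
        if (it == composites.end()) {
            primes.push_back(uint32_t(k));
            composites[k * k] = k;            // first composite this prime is responsible for
        } else {
            uint64_t p = it->second;
            composites.erase(it);
            uint64_t next = k + p;
            while (composites.count(next)) {  // find a free slot for this prime
                next += p;
            }
            composites[next] = p;
        }
    }
    return primes;
}

int main() {
    std::cout << incremental_sieve(1048576).size() << "\n";  // prints 82025
}

Each composite in the map stores the prime responsible for it, so memory stays proportional to the number of primes found so far rather than to the limit.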
My code has built-in performance benchmarking, and it can also output the primes to a file in binary format. In basic C++, the data type that can hold the largest value is uint64_t, and because I use k * k in my code, my program can only find all primes representable as uint32_t; it can find all 203280221 primes less than 4294967295.
Now, because the primes can be represented as uint32_t, my program outputs the bytes of the primes in binary format: each prime is encoded as four bytes in big-endian byte order, and the byte sequences for all primes are concatenated head to tail into a single byte sequence. This byte sequence is then written to the file all at once if file writing is enabled; otherwise the serialization step is not performed. Enabling file writing disables benchmarking and vice versa; this is by design.
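As an isolated illustration of the serialization described above (a sketch, not the exact code from the program below), each prime becomes four big-endian bytes appended to one growing buffer:

#include <cstdint>
#include <cstdio>
#include <vector>

// Append one uint32_t prime as four bytes, most significant byte first.
static void append_big_endian(std::vector<uint8_t>& out, uint32_t prime) {
    out.push_back(uint8_t((prime >> 24) & 0xFF));
    out.push_back(uint8_t((prime >> 16) & 0xFF));
    out.push_back(uint8_t((prime >> 8) & 0xFF));
    out.push_back(uint8_t(prime & 0xFF));
}

int main() {
    std::vector<uint8_t> data;
    for (uint32_t p : {2u, 3u, 5u, 7u, 11u}) {
        append_big_endian(data, p);
    }
    for (uint8_t b : data) {
        std::printf("%02X ", b);  // 00 00 00 02 00 00 00 03 ... 00 00 00 0B
    }
}

For example, appending 11 produces the bytes 0x00 0x00 0x00 0x0B, and the whole buffer can then be handed to a single write call.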
Code
#include <array>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <utility>
#include <string>
#include <unordered_map>
#include <vector>

using bools = std::vector<bool>;
using bytes = std::vector<uint8_t>;

constexpr std::array<uint8_t, 48> WHEEL = {
    2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2,
    6, 4, 6, 8, 4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6,
    2, 6, 6, 4, 2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10
};

constexpr std::array<std::pair<uint8_t, uint8_t>, 12> FOREWHEEL = { {
    {4  , 2  }, {9  , 6  }, {25 , 30 }, {35 , 30 },
    {49 , 210}, {77 , 210}, {91 , 210}, {119, 210},
    {133, 210}, {161, 210}, {203, 210}, {217, 210},
} };

static bools prime_wheel_sieve(uint64_t limit) {
    limit++;
    bools is_prime(limit, true);
    std::array<uint64_t, 48> wheel;
    is_prime[0] = is_prime[1] = false;
    for (auto& [multiple, step] : FOREWHEEL) {
        for (uint64_t i = multiple; i < limit; i += step) {
            is_prime[i] = false;
        }
    }
    uint32_t k, max_k;
    uint64_t multiple;
    k = 11;
    max_k = std::sqrt(limit);
    uint8_t i, j;
    i = j = 0;
    while (k <= max_k) {
        if (is_prime[k]) {
            for (j = 0; j < 48; j++) {
                wheel[j] = WHEEL[(i + j) % 48] * k;
            }
            multiple = k * k;
            j = 0;
            while (multiple <= limit) {
                is_prime[multiple] = false;
                multiple += wheel[j];
                j++;
                j -= (j == 48) * 48;
            }
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return is_prime;
}

static bytes prime_bytes(uint64_t limit) {
    bytes data;
    if (limit < 11) {
        data.reserve(16);
        for (uint32_t p : {2, 3, 5, 7}) {
            if (p <= limit) {
                bytes number = { 0, 0, 0, uint8_t(p) };
                data.insert(data.end(), number.begin(), number.end());
            }
            else {
                break;
            }
        }
        return data;
    }
    data = { 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 7 };
    data.reserve(uint32_t(limit / log(limit) * 5));
    bools is_prime = prime_wheel_sieve(limit);
    uint64_t k = 11;
    uint8_t i = 0;
    while (k <= limit) {
        if (is_prime[k]) {
            data.push_back(uint8_t(k >> 24 & 255));
            data.push_back(uint8_t(k >> 16 & 255));
            data.push_back(uint8_t(k >> 8 & 255));
            data.push_back(uint8_t(k & 255));
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return data;
}

static std::vector<uint32_t> prime_wheel_sieve_1(uint64_t limit) {
    std::vector<uint32_t> primes;
    if (limit < 11) {
        primes.reserve(4);
        for (uint32_t p : {2, 3, 5, 7}) {
            if (p <= limit) {
                primes.push_back(p);
            }
            else {
                break;
            }
        }
        return primes;
    }
    primes = { 2, 3, 5, 7 };
    primes.reserve(uint32_t(limit / log(limit) * 1.25));
    std::unordered_map<uint64_t, std::pair<std::array<uint64_t, 48>, uint64_t>> composites;
    uint64_t k, multiple;
    uint8_t i, j;
    k = 11;
    i = j = 0;
    while (k <= limit) {
        auto it = composites.find(k);
        if (it == composites.end()) {
            primes.push_back(k);
            std::array<uint64_t, 48> wheel;
            for (j = 0; j < 48; j++) {
                wheel[j] = WHEEL[(i + j) % 48] * k;
            }
            composites[k * k] = { wheel, 0 };
        }
        else {
            auto& [wheel, j] = it->second;
            composites.erase(it);
            multiple = k + wheel[j];
            j++;
            j -= (j == 48) * 48;
            while (composites.find(multiple) != composites.end()) {
                multiple += wheel[j];
                j++;
                j -= (j == 48) * 48;
            }
            composites[multiple] = { wheel, j };
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return primes;
}

static bytes prime_bytes_1(uint64_t limit) {
    std::vector<uint32_t> primes = prime_wheel_sieve_1(limit);
    bytes data(primes.size() * 4);
    uint32_t i = 0;
    for (uint32_t prime : primes) {
        data[i] = uint8_t((prime >> 24) & 0xFF);
        data[i + 1] = uint8_t((prime >> 16) & 0xFF);
        data[i + 2] = uint8_t((prime >> 8) & 0xFF);
        data[i + 3] = uint8_t(prime & 0xFF);
        i += 4;
    }
    return data;
}

inline static bool is_number(std::string chars) {
    if (chars.empty()) {
        return false;
    }
    for (auto& cha : chars) {
        if (!std::isdigit(cha)) {
            return false;
        }
    }
    return true;
}

int main(int argc, char* argv[]) {
    if (argc != 4) {
        std::cerr << "number of arguments must be 2";
        return -1;
    }
    std::string arg0 = argv[1];
    if (!is_number(arg0)) {
        std::cerr << "limit must be unsigned decimal integer";
        return -1;
    }
    uint64_t limit = std::stoull(arg0);
    if (limit > UINT32_MAX) {
        std::cerr << "limit is too large";
        return -1;
    }
    std::string arg1 = argv[2];
    bytes data;
    if (!is_number(arg1)) {
        if (std::string(argv[3]) == "1") {
            data = prime_bytes(limit);
        }
        else {
            data = prime_bytes_1(limit);
        }
        std::ofstream file(arg1, std::ios::binary);
        if (!file.good()) {
            std::cerr << "failed to open file";
            return -1;
        }
        file.write(reinterpret_cast<const char*>(data.data()), data.size());
        file.close();
    }
    else {
        uint64_t times = std::stoull(arg1);
        if (std::string(argv[3]) == "1") {
            bools is_prime;
            auto start = std::chrono::steady_clock::now();
            for (uint64_t time = 0; time < times; time++) {
                is_prime = prime_wheel_sieve(limit);
            }
            std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
            uint32_t count = 0;
            if (limit < 11) {
                for (uint8_t p : {2, 3, 5, 7}) {
                    if (p <= limit) {
                        count += 1;
                    }
                    else {
                        break;
                    }
                }
            }
            else {
                count = 4;
                uint64_t k = 11;
                uint8_t i = 0;
                while (k <= limit) {
                    if (is_prime[k]) {
                        count += 1;
                    }
                    k += WHEEL[i];
                    i++;
                    i -= (i == 48) * 48;
                }
                std::cout << "found primes: " << count << "\n";
                std::cout << "average execution time: " << (elapsed.count() / times);
            }
        }
        else {
            std::vector<uint32_t> primes;
            auto start = std::chrono::steady_clock::now();
            for (uint64_t time = 0; time < times; time++) {
                primes = prime_wheel_sieve_1(limit);
            }
            std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
            std::cout << "found primes: " << primes.size() << "\n";
            std::cout << "average execution time: " << (elapsed.count() / times);
        }
    }
}

Compiled using MSYS2 MinGW64, command line:
g++ -march=native -O3 -flto -funroll-loops -fomit-frame-pointer -fstrict-aliasing -o "D:\CliExec\wheel_sieve_gcc.exe" "C:\Users\xenig\source\repos\Prime Wheel Sieve\Prime Wheel Sieve.cpp"

Benchmarking and usage example:
PS D:\CliExec> .\wheel_sieve_gcc.exe 1048576 1024 1
found primes: 82025
average execution time: 0.00135145
PS D:\CliExec> .\wheel_sieve_gcc.exe 1048576 16 0
found primes: 82025
average execution time: 0.105634
PS D:\CliExec> .\wheel_sieve_gcc.exe 4294967295 D:\UInt32_primes.bin 1

In [528]: import numpy as np

In [529]: with open("D:/UInt32_primes.bin", "rb") as f:
     ...:     data = np.frombuffer(f.read(), dtype=">u4")

In [530]: len(data)
Out[530]: 203280221

My C++ program is somehow less efficient than my Numba code:
import numba as nb
import numpy as np


@nb.njit(cache=True)
def prime_wheel_sieve_opt4(n: int) -> np.ndarray:
    # fmt: off
    wheel = np.array([
        2 , 4 , 2 , 4 , 6 , 2 , 6 , 4 , 2 , 4 , 6 , 6 ,
        2 , 6 , 4 , 2 , 6 , 4 , 6 , 8 , 4 , 2 , 4 , 2 ,
        4 , 8 , 6 , 4 , 6 , 2 , 4 , 6 , 2 , 6 , 6 , 4 ,
        2 , 4 , 6 , 2 , 6 , 4 , 2 , 4 , 2 , 10, 2 , 10
    ], dtype=np.uint8)
    # fmt: on
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    for square, step in ((4, 2), (9, 6), (25, 10), (49, 14)):
        primes[square::step] = False
    k = np.uint64(11)
    lim = int(n ** 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            multiple = k * k
            j = i
            big_wheel = wheel * k
            while multiple <= n:
                primes[multiple] = False
                multiple += big_wheel[j]
                if (j := j + 1) == 48:
                    j = 0
        k += wheel[i]
        if (i := i + 1) == 48:
            i = 0
    return primes

In [531]: %timeit prime_wheel_sieve_opt4(1048576)
1.22 ms ± 20.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

How can I optimize my C++ program?
I use x / ln(x) to estimate the number of primes less than x; this has a tendency to underestimate, so I multiply that value by 1.25 to turn it into an overestimation, so the reserve call will allocate enough memory.
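A standalone sketch of that estimate (not part of the program), checked against the two prime counts that appear in this post, 82025 primes below 1048576 and 203280221 primes below 4294967295:

#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    // x / ln(x) underestimates pi(x); the 1.25 factor turns it into an overestimate
    // for the limits used here.
    for (uint64_t limit : {uint64_t(1048576), uint64_t(4294967295)}) {
        uint64_t estimate = uint64_t(limit / std::log(double(limit)) * 1.25);
        std::cout << "limit " << limit << ": reserve for about " << estimate << " primes\n";
    }
}

For both limits the estimate comes out above the true count (roughly 94.5 thousand versus 82025, and roughly 242 million versus 203280221), so the reserve call allocates enough.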
And about k * k: it doesn't do exactly what I wanted, however it isn't a bug in the function that uses a vector, and it isn't a bug in the second function that uses a mapping either.
Why? Well, the first function processes all values of k that are congruent to one of the 48 values modulo 210, up to the square root of the limit (+ 1). The maximum possible value of limit is 4294967295; incrementing that and taking the square root gives 65536, which is the number of possible values of uint16_t. Now we know 65536 isn't prime, of course, because it is 2^16; the minimum step is 2, and the number k won't be squared if it isn't prime, so k will never reach 65535 and get squared into an overflow. Thus using k * k to get a uint32_t is absolutely safe.
In fact this is guaranteed to work, because the closest prime less than 65536 is 65521, and 65521 % 210 is 1, so the next step is 10; 65521 + 10 = 65531, which isn't prime, and after a further +2 step and a +4 step k reaches 65537 and stops. Because all these numbers are already marked as composites, they will never get squared. So I don't need the uint64_t range.
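As a sanity check, here is a standalone sketch of this argument (it assumes, as argued above, that 65521 is the largest k the first function ever squares):

#include <cstdint>

int main() {
    // 65521 is the largest prime not exceeding sqrt(4294967296) = 65536,
    // so it is the largest k that prime_wheel_sieve ever squares.
    constexpr uint64_t largest_squared_k = 65521;
    static_assert(largest_squared_k * largest_squared_k <= UINT32_MAX,
                  "k * k fits in uint32_t");  // 65521 * 65521 = 4293001441
}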
The same is true for finding all uint64_t primes with this code: we only ever process k up to the square root of the limit, and k is only squared if it is prime. The closest prime to UINT32_MAX is 4294967291, and its square is of course in the range of uint64_t; numbers greater than that are already marked as composites, so they won't get squared, and after 3 more rolls of the wheel k exceeds the square root and the code stops. So the first function is guaranteed to never overflow, and I don't need to double the bit width. This means I could use the first function to find all uint64_t primes if I had the RAM and the time.
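The 64-bit version of the same check, again only a sketch of the claim:

#include <cstdint>

int main() {
    // 4294967291 = 2^32 - 5 is the largest prime below 2^32, so it is the largest k
    // that would ever be squared when sieving up to UINT64_MAX.
    constexpr uint64_t largest_squared_k = 4294967291ULL;
    static_assert(largest_squared_k <= UINT64_MAX / largest_squared_k,
                  "k * k fits in uint64_t");
}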
In the second function, numbers beyond the square root also get squared, so as soon as they exceed UINT16_MAX their squares overflow the uint32_t range; that is why I need to make k and multiple uint64_t, and why the function is limited to finding all uint32_t primes. But I already did that: I made k and multiple uint64_t from the very beginning, so my code has no bugs.
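To make that overflow concrete, a tiny standalone sketch (not from the program, and assuming the usual 32-bit unsigned int): 65537 is prime, so the mapping-based function will square it, and in 32-bit arithmetic the square wraps around:

#include <cstdint>
#include <iostream>

int main() {
    uint32_t k32 = 65537;  // smallest prime above UINT16_MAX
    uint64_t k64 = 65537;
    std::cout << k32 * k32 << "\n";  // wraps modulo 2^32: prints 131073
    std::cout << k64 * k64 << "\n";  // correct value: prints 4295098369
}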
I replaced all occurrences of uint64_t with uint32_t in the first function, compiled the code, and it still finds all 203280221 primes less than UINT32_MAX, proving my theory. However, this doesn't speed things up as I expected.
My theory is correct, however there are unseen complications: replacing all occurrences of uint64_t with uint32_t causes the code to overflow and silently exit at some point if I use it to find all 203280221 primes less than UINT32_MAX. The limit has to be uint64_t, the wheel has to be uint64_t because it contains the wheel steps scaled by the prime and this can overflow, and so multiple also has to be uint64_t. And limit has to be uint64_t because I use the code to find all primes up to UINT32_MAX, and I use limit++ to ensure the vector contains the primality information for all numbers up to and including limit, because of course C++ uses 0-based indexing; UINT32_MAX + 1 is 2^32, which has 33 bits and therefore must be represented as uint64_t.
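A tiny standalone sketch of that last point (not from the program): with a 32-bit limit, the limit++ at the top of the sieve wraps to 0 when limit is UINT32_MAX:

#include <cstdint>
#include <iostream>

int main() {
    uint32_t limit32 = UINT32_MAX;
    uint64_t limit64 = UINT32_MAX;
    limit32++;  // wraps to 0, so the is_prime vector would be empty
    limit64++;  // 4294967296, which needs 33 bits
    std::cout << limit32 << " " << limit64 << "\n";  // prints: 0 4294967296
}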
Actually, there is one bug in my code: I put the fake usage, outputting the count of primes found, and the printing of the performance measurement inside the else block, so they are only executed if limit is at least 11.
Technically speaking they should be outside the else block, and I meant to put them outside, but somehow I put them inside. For practical purposes it doesn't matter, since nobody will use a limit smaller than 11; however, if someone were specifically looking for edge cases, my program won't do what I intended if they input a limit less than 11 and make the code run thousands of times.
As there is already an answer, I can't edit the code, although the existing answer doesn't actually answer the question, because the supposed bug isn't actually a bug in my code.
Comments:

Is multiple = k * k OK here? Once k > 65,535, I'd expect the multiplication to overflow.

log() is base e - facepalm. Consider appending something like "I use x / ln(x) to estimate the number of primes less than x, this has a tendency to underestimate so I multiply that value by 1.25 to make that an overestimation" as a comment so the code is better understood.

... std::uintmax_t for that.