
I have implemented Sieve of Eratosthenes with Wheel Factorization in C++ again, this time using a bigger wheel and much more complicated C++ syntax, to familiarize myself with C++ and see how fast I can make the program run. It turns out my C++ program is less efficient than equivalent code in Numba.

Every prime except 2 is odd; this is common knowledge, but it means every prime greater than 2 is coprime to 2, which halves the number of candidates that need to be checked.

Extending the idea, every prime greater than 7 is coprime to 2, 3, 5, and 7, which means every prime starting at 11 must be congruent to one of the following 48 residues modulo 210:

{ 1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209 }

This narrows the candidates down to 8/35 (48/210) of all integers, skipping the roughly 77.14% of numbers that never need to be checked.

I wrote two versions. One is the standard version, using a boolean array with optimizations; the other is a mapping-based one, which is much more clever and, at the same time, much less efficient. The mapping-based approach is adapted from this, with modifications based on this. I implemented both approaches mainly to see whether I could make them work, and whether I could use C++ data types correctly and efficiently to port the code successfully.

My code has built-in performance benchmarking, and it can also output the primes to a file in binary format. Among the standard fixed-width types, the one that can hold the largest value is uint64_t, and because I use k*k in my code, my code can only find primes representable as uint32_t; it finds all 203280221 primes less than 4294967295.

Because the primes fit in uint32_t, my program outputs them in binary format: each prime is represented as four bytes in big-endian byte order, and the byte sequences for all primes are joined head to tail to form a single byte sequence, which is then written to the file all at once if file writing is enabled; otherwise the serialization step is skipped. Enabling file writing disables benchmarking and vice versa; this is by design.


Code

#include <array>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <utility>
#include <string>
#include <unordered_map>
#include <vector>

using bools = std::vector<bool>;
using bytes = std::vector<uint8_t>;

constexpr std::array<uint8_t, 48> WHEEL = {
    2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2,
    6, 4, 6, 8, 4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6,
    2, 6, 6, 4, 2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10
};

constexpr std::array<std::pair<uint8_t, uint8_t>, 12> FOREWHEEL = { {
    {4, 2}, {9, 6}, {25, 30}, {35, 30},
    {49, 210}, {77, 210}, {91, 210}, {119, 210},
    {133, 210}, {161, 210}, {203, 210}, {217, 210},
} };

static bools prime_wheel_sieve(uint64_t limit) {
    limit++;
    bools is_prime(limit, true);
    std::array<uint64_t, 48> wheel;
    is_prime[0] = is_prime[1] = false;
    for (auto& [multiple, step] : FOREWHEEL) {
        for (uint64_t i = multiple; i < limit; i += step) {
            is_prime[i] = false;
        }
    }
    uint32_t k, max_k;
    uint64_t multiple;
    k = 11;
    max_k = std::sqrt(limit);
    uint8_t i, j;
    i = j = 0;
    while (k <= max_k) {
        if (is_prime[k]) {
            for (j = 0; j < 48; j++) {
                wheel[j] = WHEEL[(i + j) % 48] * k;
            }
            multiple = k * k;
            j = 0;
            while (multiple <= limit) {
                is_prime[multiple] = false;
                multiple += wheel[j];
                j++;
                j -= (j == 48) * 48;
            }
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return is_prime;
}

static bytes prime_bytes(uint64_t limit) {
    bytes data;
    if (limit < 11) {
        data.reserve(16);
        for (uint32_t p : {2, 3, 5, 7}) {
            if (p <= limit) {
                bytes number = { 0, 0, 0, uint8_t(p) };
                data.insert(data.end(), number.begin(), number.end());
            } else {
                break;
            }
        }
        return data;
    }
    data = { 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 7 };
    data.reserve(uint32_t(limit / log(limit) * 5));
    bools is_prime = prime_wheel_sieve(limit);
    uint64_t k = 11;
    uint8_t i = 0;
    while (k <= limit) {
        if (is_prime[k]) {
            data.push_back(uint8_t(k >> 24 & 255));
            data.push_back(uint8_t(k >> 16 & 255));
            data.push_back(uint8_t(k >> 8 & 255));
            data.push_back(uint8_t(k & 255));
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return data;
}

static std::vector<uint32_t> prime_wheel_sieve_1(uint64_t limit) {
    std::vector<uint32_t> primes;
    if (limit < 11) {
        primes.reserve(4);
        for (uint32_t p : {2, 3, 5, 7}) {
            if (p <= limit) {
                primes.push_back(p);
            } else {
                break;
            }
        }
        return primes;
    }
    primes = { 2, 3, 5, 7 };
    primes.reserve(uint32_t(limit / log(limit) * 1.25));
    std::unordered_map<uint64_t, std::pair<std::array<uint64_t, 48>, uint64_t>> composites;
    uint64_t k, multiple;
    uint8_t i, j;
    k = 11;
    i = j = 0;
    while (k <= limit) {
        auto it = composites.find(k);
        if (it == composites.end()) {
            primes.push_back(k);
            std::array<uint64_t, 48> wheel;
            for (j = 0; j < 48; j++) {
                wheel[j] = WHEEL[(i + j) % 48] * k;
            }
            composites[k * k] = { wheel, 0 };
        } else {
            auto& [wheel, j] = it->second;
            composites.erase(it);
            multiple = k + wheel[j];
            j++;
            j -= (j == 48) * 48;
            while (composites.find(multiple) != composites.end()) {
                multiple += wheel[j];
                j++;
                j -= (j == 48) * 48;
            }
            composites[multiple] = { wheel, j };
        }
        k += WHEEL[i];
        i++;
        i -= (i == 48) * 48;
    }
    return primes;
}

static bytes prime_bytes_1(uint64_t limit) {
    std::vector<uint32_t> primes = prime_wheel_sieve_1(limit);
    bytes data(primes.size() * 4);
    uint32_t i = 0;
    for (uint32_t prime : primes) {
        data[i] = uint8_t((prime >> 24) & 0xFF);
        data[i + 1] = uint8_t((prime >> 16) & 0xFF);
        data[i + 2] = uint8_t((prime >> 8) & 0xFF);
        data[i + 3] = uint8_t(prime & 0xFF);
        i += 4;
    }
    return data;
}

inline static bool is_number(std::string chars) {
    if (chars.empty()) {
        return false;
    }
    for (auto& cha : chars) {
        if (!std::isdigit(cha)) {
            return false;
        }
    }
    return true;
}

int main(int argc, char* argv[]) {
    if (argc != 4) {
        std::cerr << "number of arguments must be 2";
        return -1;
    }
    std::string arg0 = argv[1];
    if (!is_number(arg0)) {
        std::cerr << "limit must be unsigned decimal integer";
        return -1;
    }
    uint64_t limit = std::stoull(arg0);
    if (limit > UINT32_MAX) {
        std::cerr << "limit is too large";
        return -1;
    }
    std::string arg1 = argv[2];
    bytes data;
    if (!is_number(arg1)) {
        if (std::string(argv[3]) == "1") {
            data = prime_bytes(limit);
        } else {
            data = prime_bytes_1(limit);
        }
        std::ofstream file(arg1, std::ios::binary);
        if (!file.good()) {
            std::cerr << "failed to open file";
            return -1;
        }
        file.write(reinterpret_cast<const char*>(data.data()), data.size());
        file.close();
    } else {
        uint64_t times = std::stoull(arg1);
        if (std::string(argv[3]) == "1") {
            bools is_prime;
            auto start = std::chrono::steady_clock::now();
            for (uint64_t time = 0; time < times; time++) {
                is_prime = prime_wheel_sieve(limit);
            }
            std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
            uint32_t count = 0;
            if (limit < 11) {
                for (uint8_t p : {2, 3, 5, 7}) {
                    if (p <= limit) {
                        count += 1;
                    } else {
                        break;
                    }
                }
            } else {
                count = 4;
                uint64_t k = 11;
                uint8_t i = 0;
                while (k <= limit) {
                    if (is_prime[k]) {
                        count += 1;
                    }
                    k += WHEEL[i];
                    i++;
                    i -= (i == 48) * 48;
                }
                std::cout << "found primes: " << count << "\n";
                std::cout << "average execution time: " << (elapsed.count() / times);
            }
        } else {
            std::vector<uint32_t> primes;
            auto start = std::chrono::steady_clock::now();
            for (uint64_t time = 0; time < times; time++) {
                primes = prime_wheel_sieve_1(limit);
            }
            std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
            std::cout << "found primes: " << primes.size() << "\n";
            std::cout << "average execution time: " << (elapsed.count() / times);
        }
    }
}

Compiled using MSYS2 MinGW64, command line:

g++ -march=native -O3 -flto -funroll-loops -fomit-frame-pointer -fstrict-aliasing -o "D:\CliExec\wheel_sieve_gcc.exe" "C:\Users\xenig\source\repos\Prime Wheel Sieve\Prime Wheel Sieve.cpp" 

Benchmarking and usage example:

PS D:\CliExec> .\wheel_sieve_gcc.exe 1048576 1024 1
found primes: 82025
average execution time: 0.00135145
PS D:\CliExec> .\wheel_sieve_gcc.exe 1048576 16 0
found primes: 82025
average execution time: 0.105634
PS D:\CliExec> .\wheel_sieve_gcc.exe 4294967295 D:\UInt32_primes.bin 1
In [528]: import numpy as np

In [529]: with open("D:/UInt32_primes.bin", "rb") as f:
     ...:     data = np.frombuffer(f.read(), dtype=">u4")

In [530]: len(data)
Out[530]: 203280221

My C++ program is somehow less efficient than my Numba code:

import numba as nb
import numpy as np


@nb.njit(cache=True)
def prime_wheel_sieve_opt4(n: int) -> np.ndarray:
    # fmt: off
    wheel = np.array([
        2, 4, 2, 4, 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2,
        6, 4, 6, 8, 4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6,
        2, 6, 6, 4, 2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10
    ], dtype=np.uint8)
    # fmt: on
    primes = np.ones(n + 1, dtype=np.uint8)
    primes[:2] = False
    for square, step in ((4, 2), (9, 6), (25, 10), (49, 14)):
        primes[square::step] = False
    k = np.uint64(11)
    lim = int(n ** 0.5)
    i = 0
    while k <= lim:
        if primes[k]:
            multiple = k * k
            j = i
            big_wheel = wheel * k
            while multiple <= n:
                primes[multiple] = False
                multiple += big_wheel[j]
                if (j := j + 1) == 48:
                    j = 0
        k += wheel[i]
        if (i := i + 1) == 48:
            i = 0
    return primes
In [531]: %timeit prime_wheel_sieve_opt4(1048576)
1.22 ms ± 20.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

How can I optimize my C++ program?


I use x / ln(x) to estimate the number of primes less than x. This has a tendency to underestimate, so I multiply the value by 1.25 to turn it into an overestimate, so the reserve call will allocate enough memory.

And about k * k: it doesn't do exactly what I wanted, but it isn't a bug in the function that uses a vector, and it isn't a bug in the second function that uses a mapping, either.

Why? Well, in the first function I process all values of k that are congruent to one of the 48 residues modulo 210, up to the square root of the limit (+ 1). The maximum possible value of limit is 4294967295; incrementing it and taking the square root gives 65536, which is the number of possible uint16_t values. Now 65536 isn't prime, of course, because it is 2^16; the minimum wheel step is 2, and k is only squared if it is prime, so k will never reach 65535 and get squared into an overflow. Thus using k * k to get a uint32_t is absolutely safe.

In fact this is guaranteed to work: the largest prime less than 65536 is 65521, and 65521 % 210 is 1. The next wheel step is 10, and 65521 + 10 = 65531, which isn't prime; after two more steps it reaches 65537 and stops. Because all these numbers are already marked as composite, they will never get squared. So I don't need the uint64_t range.

The same is true for finding all uint64_t primes with this code: we only ever process k up to the square root of the limit, and we only square k if it is prime. The largest prime not exceeding UINT32_MAX is 4294967291, and its square is of course within the range of uint64_t; the numbers above it are already marked as composite, so they never get squared, and after 3 more rolls of the wheel k exceeds the square root and the code stops. So the first function is guaranteed never to overflow, and I don't need to double the bit width. This also means I could use the first function to find all uint64_t primes, if I had the RAM and the time.

In the second function, numbers beyond the square root do get squared, so if the square root is greater than UINT16_MAX the squares would overflow the uint32_t range; I therefore need to make k and multiple uint64_t, and the function is limited to finding all uint32_t primes. But I already did that: I made k and multiple uint64_t from the very beginning, so my code has no bugs.


I have replaced all occurrences of uint64_t with uint32_t in the first function, compiled the code, and it still finds all 203280221 primes less than UINT32_MAX, proving my theory. However, contrary to my expectations, this didn't improve performance.

My theory is correct, but there are unseen complications: replacing all occurrences of uint64_t with uint32_t makes the code overflow and silently exit at some point when used to find all 203280221 primes less than UINT32_MAX. The limit has to be uint64_t, and wheel has to be uint64_t because it contains scaled primes, which can overflow, so multiple also has to be uint64_t. And limit has to be uint64_t because I use the code to find all primes up to UINT32_MAX, and I use limit++ to ensure the vector holds primality information for all numbers up to and including limit (C++ uses 0-based indexing, of course). UINT32_MAX + 1 is 2^32, which has 33 bits and therefore must be represented as uint64_t.


Actually there is one bug in my code: I put the fake usage (printing the count of primes found) and the performance-measurement output inside the else block, so they are only executed if limit is at least 11.

They should, technically speaking, be outside the else block, and I meant to put them there, but I somehow put them inside it. For practical purposes it doesn't matter, since nobody will use a limit smaller than 11; but someone specifically looking for edge cases would find that my program doesn't do what I intended if given a limit less than 11 and made to run thousands of times.

As there is already an answer, I can't edit the code, although the existing answer doesn't actually answer the question, because the supposed bug isn't actually a bug in my code.

  • Sure multiple = k * k is OK here? Once k > 65,535, I'd expect the multiplication to overflow. Commented Apr 29 at 19:34
  • log as defined in cmath is the natural logarithm; its base is e. I use x / ln(x) to estimate the number of primes less than x; this has a tendency to underestimate, so I multiply that value by 1.25 to make it an overestimate, so the reserve call will allocate enough memory. And k * k won't overflow, because k is limited to uint32_t, so the square of k is within the range of uint64_t, and uint64_t is exactly the data type of multiple. Commented Apr 29 at 20:02
  • 1
    \$\begingroup\$ yes , of course log() is base e - facepalm. Consider appending something like "I use x / ln(x) to estimate the number of primes less than x, this has a tendency to underestimate so I multiply that value by 1.25 to make that an overestimation" comment so code is better understood. \$\endgroup\$ Commented Apr 29 at 20:05
  • ("doesn't speed up performance as I expected." - just as I expected, or contrary to my expectations? (speed up execution, or improve performance)) Commented Apr 30 at 6:14
  • 2
    \$\begingroup\$ "the data type that can hold the largest value is uint64_t" - not necessarily true - we have std::uintmax_t for that. \$\endgroup\$ Commented May 5 at 7:43

3 Answers


Enable more diagnostics

The compilation command you show doesn't include any options to enable warnings. I like to get much more help from my compiler in highlighting areas that need particular care, so I routinely use g++ -Wall -Wextra -Wwrite-strings -Wno-parentheses -Wdangling-else -Wpedantic -Warray-bounds -Wconversion -Wuseless-cast -Weffc++. Consider enabling more warnings yourself; this lets you spend more of your focus on the hard stuff.

Make the code more portable

The code assumes that the target exercises the permission given by the standard, clearly summarised in this example from it:

[Example 1: The header <cstdlib> assuredly provides its declarations and definitions within the namespace std. It may also provide these names within the global namespace. The header <stdlib.h> assuredly provides the same declarations and definitions within the global namespace, much as in ISO/IEC 9899. It may also provide these names within the namespace std. — end example]

Portable code uses only what the standard guarantees - in this case, we should write std::uint8_t, std::log, etc. rather than assuming these identifiers are present in the global namespace. (Interestingly, the code already writes std::sqrt correctly, so it isn't even consistent about this.)

Portable code also avoids the optional fixed-width types such as std::uint8_t, preferring the guaranteed types instead. In our case, std::uint_fast8_t is probably more appropriate.

And we have problems here:

    for (auto& cha : chars) {
        if (!std::isdigit(cha)) {

Not only are we missing the necessary include of <cctype>, but we're also passing possibly-signed char where a positive int is expected. The usual fix is to cast to unsigned char before promotion.

Follow standard conventions

C and C++ programmers normally use all-caps identifiers for macro names only. This helps draw attention to them at point of use, so that readers understand that they are seeing textual substitution rather than C++ objects or functions. The utility of that convention is diluted by WHEEL and FOREWHEEL in this code, being C++ identifiers with normal rules of scope.

Avoid magic numbers

Are all the uses of 48 in the code related to WHEEL.size()? Or do some of them have a different meaning? Without that clarity, it's harder to take the code and adapt it to a larger wheel size.

Beware of std::vector<bool>

A vector of bool is not a standard container, so take care with it - especially when it's not confined to a single function. It may be worth comparing the speed/memory trade-off of using std::vector<char> instead, avoiding the bit-manipulation used in the densely-packed std::vector<bool>.

Improve error handling

We have:

    if (argc != 4) {
        std::cerr << "number of arguments must be 2";
        return -1;
    }

but no indication as to what the 3(!) arguments should be. We're missing a newline at the end of that message, as well as several other error messages.

Return values from main() should be small positive integers. Negative values are likely to be modified by the environment. In this case, <cstdlib> already provides the ideal value to return: EXIT_FAILURE.

Also,

    std::ofstream file(arg1, std::ios::binary);
    if (!file.good()) {
        std::cerr << "failed to open file";
        return -1;
    }
    file.write(reinterpret_cast<const char*>(data.data()), data.size());
    file.close();

If the write() or close() operations fail, I don't think it's right to ignore them and return a success status. We should alert the user with an error message if file is bad after closing.

Take care when timing execution of loops

Look at this:

    auto start = std::chrono::steady_clock::now();
    for (uint64_t time = 0; time < times; time++) {
        primes = prime_wheel_sieve_1(limit);
    }

A good compiler will determine that prime_wheel_sieve_1() has no side-effects, and that consequently the loop can be replaced with a single call to the function.


Bug

The code uses uint32_t k; ... k * k, hoping to form a 64-bit product. Unless unsigned is 64 bits wide, k * k forms a 32-bit product, which readily overflows when k > 65,535.

To form a 64-bit product use (uint64_t) k * k.

The code also has composites[k * k]. Using composites[(uint64_t) k * k] may be OK if size_t is 64-bit.

OP may not see this issue unless the code exercises k > 65,535.


Traversing your wheel currently requires copying arrays twice. You can improve that by switching to a movable type and using operations that will move it.

You can also avoid having to rehash composites.

There is also a potential for overflow, as you don't check multiple against limit.

struct BigWheel {
    std::vector<uint64_t> wheel;
    uint64_t index;
};

// ...
std::unordered_map<uint64_t, BigWheel> composites;
composites.reserve(primes.capacity());

// ...
auto node = composites.extract(it);
auto& multiple = node.key();
auto& [wheel, j] = node.mapped();
do {
    multiple += wheel[j];
    j++;
    j %= 48;
} while (composites.find(multiple) != composites.end() && multiple < limit);
if (multiple < limit) {
    composites.insert(std::move(node));
}
