
A lot of the sieve code floating around here and on the 'net is sloppy or broken; in fact, I've yet to see a single one that is correct, clean (in the sense of avoiding superfluous iterations etc.) and that can actually sieve up to the end of its operating range. Some state at least that they cannot sieve the full range, some state limits and crash anyway even for ranges below that, the rest simply crash or return wrong results. Even the 'canonical' examples - Achim Flammenkamp's code, Tomás Oliveira e Silva's and Kim Walisch's @ primesieve.org - range over the whole spectrum in that regard.

That's why I thought it would be a good idea to have a simple, robust example that is correct and flexible enough to serve as a basis for people to experiment with their own ideas. The code I'm proposing here (full code in zrbj_sx_sieve64_v1.cpp) demonstrates two different uses of segmentation with the Sieve of Eratosthenes.

The first use of segmentation is to avoid unnecessary work when the start of the range to be sieved is higher than the square root of the upper end of the range, like in SPOJ problem #2 PRIME1. Sieving all the way up to the upper end of the range (10^9 for PRIME1) can easily exceed the time limit of 5 seconds; by contrast, sieving only the potential factors up to the square root and then the actual target range is several thousand times as fast. This is pretty much the only way of sieving close to 2^64 without taking a long holiday... The second use of segmentation is for speeding up the sieving of bigger ranges but that is explained below, with the code for extending the factor sieve.
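To make the two phases concrete before diving into the real code: sieve the potential factors up to √high first, then cross off their multiples in the [low, high] window only. Here is a deliberately naive sketch (byte per number, no odds-only packing, no bit tricks; every name in it is made up for the illustration and it is only meant for moderate upper bounds):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Count primes in [low, high] without ever sieving the numbers below low
// (other than the factors up to sqrt(high)). Assumes low <= high and that
// high is small enough for (root+1)^2 not to overflow.
uint64_t count_primes_in_range(uint64_t low, uint64_t high)
{
    uint64_t root = 1;
    while ((root + 1) * (root + 1) <= high) ++root;   // integer sqrt(high)

    // Phase 1: ordinary sieve over [2, root] to get the potential factors.
    std::vector<char> is_prime(root + 1, 1);
    is_prime[0] = 0;
    is_prime[1] = 0;
    for (uint64_t p = 2; p * p <= root; ++p)
        if (is_prime[p])
            for (uint64_t m = p * p; m <= root; m += p) is_prime[m] = 0;

    // Phase 2: sieve only the target window [low, high] with those factors.
    std::vector<char> window(high - low + 1, 1);
    for (uint64_t p = 2; p <= root; ++p)
    {
        if (!is_prime[p]) continue;
        uint64_t start = p * p;                          // p itself stays prime
        if (start < low) start = ((low + p - 1) / p) * p; // first multiple >= low
        for (uint64_t m = start; m <= high; m += p) window[m - low] = 0;
    }

    uint64_t count = 0;
    for (uint64_t n = low; n <= high; ++n)
        if (n >= 2 && window[n - low]) ++count;
    return count;
}
```

The real code below is the same idea with a packed odds-only bitmap and careful 64-bit index math instead of the byte-per-number vectors.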

The code here uses a packed, odds-only bitmap where set bits signify 'composite', and it implements a 32-bit capable version that limits sieving ranges - and index variables - to at most 2^32 numbers in one go. This limits the sieved range to 2^31 bits per call but that is already more than enough since smaller segments work better even at the upper end of the range (L2/L3 cache).

```cpp
unsigned char odd_composites32[UINT32_MAX / (2 * CHAR_BIT) + 1];  // the small factor sieve
uint32_t sieved_bits = 0;                                         // how far it's been initialised

void extend_factor_sieve_to_cover (uint32_t max_factor_bit);      // bit, not number!

void sieve32 (unsigned char *target_segment, uint64_t bit_offset, uint32_t bit_count)
{
   assert( bit_count > 0 && bit_count <= UINT32_MAX / 2 + 1 );

   uint32_t max_bit = bit_count - 1;
   uint64_t max_num = 2 * (bit_offset + max_bit) + 1;
   uint32_t max_factor_bit = (max_factor32(max_num) - 1) / 2;

   if (target_segment != odd_composites32)
   {
      extend_factor_sieve_to_cover(max_factor_bit);
   }

   std::memset(target_segment, 0, std::size_t((max_bit + CHAR_BIT) / CHAR_BIT));

   for (uint32_t i = 3u >> 1; i <= max_factor_bit; ++i)
   {
      if (bit(odd_composites32, i))  continue;

      uint32_t n = (i << 1) + 1;   // the actual prime represented by bit i (< 2^32)
      uint32_t stride = n;         // == (n * 2) / 2
      uint64_t start = (uint64_t(n) * n) >> 1;
      uint32_t k;

      if (start >= bit_offset)
      {
         k = uint32_t(start - bit_offset);
      }
      else  // start < offset
      {
         k = stride - (bit_offset - start - 1) % stride - 1;
      }

      while (k <= max_bit)
      {
         set_bit(target_segment, k);

         if ((k += stride) < stride)  // k can wrap since strides go up to almost 2^32
         {
            break;
         }
      }
   }
}
```

The second use of segmentation is to achieve locality of access, because sieving in small segments that fit into the fast L1 cache of the CPU can speed things up significantly. At least until ranges get so high that the iteration over all potential odd factors - 203,280,220 of them, when working close to 2^64 - starts dominating. When initialising the small factor sieve this is definitely not the case, since it has only 6541 small factor primes to deal with. The speed can be more than doubled by presieving and remembering offsets, but that would make the code a hundred lines longer. Two versions of these optimisations are demoed in zrbj_sx_sieve32_v2.cpp and zrbj_sx_sieve32_v4.cpp.

This function is equivalent to the single call sieve32(odd_composites32, 0, max_factor_bit + 1) except for the speed. For sieving no higher than 2^32 (SPOJ) the single call is more than good enough, since the maximum factor for a 32-bit number cannot exceed 2^16 and the call takes less than a millisecond.

```cpp
void extend_factor_sieve_to_cover (uint32_t max_factor_bit)
{
   uint32_t const SIEVE_BITS = sizeof(odd_composites32) * CHAR_BIT;

   assert( max_factor_bit < SIEVE_BITS );

   if (max_factor_bit >= sieved_bits)
   {
      assert( sieved_bits % CHAR_BIT == 0 );

      uint32_t bits_to_sieve = max_factor_bit - sieved_bits + 1;
      uint32_t segment_size = 1u << 18;   // == 32K * CHAR_BIT == L1 cache size
      uint32_t partial_bits = bits_to_sieve % segment_size;
      uint32_t segments = bits_to_sieve / segment_size + (partial_bits != 0);

      if ((SIEVE_BITS - sieved_bits) % segment_size == 0)
      {
         partial_bits = 0;   // always sieve full segments as long as there is enough space
      }

      for ( ; segments--; )
      {
         uint32_t bits_this_round = segments == 0 && partial_bits ? partial_bits : segment_size;

         sieve32(odd_composites32 + sieved_bits / CHAR_BIT, sieved_bits, bits_this_round);

         sieved_bits += bits_this_round;
      }
   }
}
```

Idiosyncratic implementation choice #1: using byte arrays instead of std::vector or std::bitset<> makes the code simpler, cleaner, faster and more robust. Not to mention more flexible. A more sophisticated implementation might allocate bitmaps as uint32_t[] or uint64_t[] and let functions take void* pointers as parameters, so that functions can cast to uint8_t or whatever (up to the allocated type size) as is most suitable.

Idiosyncratic coding choice #1: as you can see, I calculate many values into separate variables with relatively long (but hopefully accurate) names. This gives the calculation of these values several times as much source code space as elsewhere, where the stuff might be crammed into big expressions and loop headers. But elsewhere these calculations are often not very accurate. Besides, I'm using this as some sort of 'live' (executable) documentation for myself of how the stuff works. Most, if not all, Eratosthenes coding errors actually occur with indexing-related math, so I thought this might be appropriate here as well.

Here are two helper function templates to make pretty much anything usable as a bitmap, the moral equivalent to bit-twiddling macros in C:

```cpp
///////////////////////////////////////////////////////////////////////////////////////////////////
// Bit manipulation functions are sometimes faster using the machine word size, sometimes using
// unsigned char. Using templated functions makes it possible to mix and match. It also makes
// it easy to experiment with intrinsics, by the simple expedient of defining specialisations
// for certain types and implementing them via intrinsics.

template<typename word_t>
void set_bit (word_t *p, uintxx_t index)
{
   enum {  BITS_PER_WORD = sizeof(word_t) * CHAR_BIT  };

   // we can trust the compiler to use masking and shifting instead of division; we cannot do that
   // ourselves without having the log2 which cannot easily be computed as a constexpr

   p[index / BITS_PER_WORD] |= word_t(1) << (index % BITS_PER_WORD);
}

//-------------------------------------------------------------------------------------------------
// Returning the result as unsigned char makes this compatible with the intrinsics for equivalent
// CPU instructions; forcing the intrinsics to bool instead would cause potential inefficiencies.

template<typename word_t>
unsigned char bit (word_t const *p, uintxx_t index)
{
   enum {  BITS_PER_WORD = sizeof(word_t) * CHAR_BIT  };

   return (unsigned char)((p[index / BITS_PER_WORD] >> (index % BITS_PER_WORD)) & 1u);
}
```

Here's part of the output for the test program, compiled with gcc 4.8.1 (MinGW64):

```text
sizeof(uintxx_t) == 4
sizeof(void*) == 8

initialising the small factor sieve... 4.255 s (962.6 M/s)
...
sieving some bigger ranges (10M)...
                   1 ..            100000000     115 ms  1658.6 M/s    5761455 OK
           793877173 ..           1000000000     473 ms   831.2 M/s   10000000 OK
          9769857367 ..          10000000000     634 ms   692.4 M/s   10000000 OK
         99746761237 ..         100000000000     822 ms   587.6 M/s   10000000 OK
        999723733787 ..        1000000000000    1037 ms   508.1 M/s   10000000 OK
       9999700629011 ..       10000000000000    1254 ms   455.3 M/s   10000000 OK
      99999677617663 ..      100000000000000    1504 ms   408.8 M/s   10000000 OK
     999999654617569 ..     1000000000000000    1790 ms   368.0 M/s   10000000 OK
    9999999631636541 ..    10000000000000000    2190 ms   320.8 M/s   10000000 OK
   99999999608465399 ..   100000000000000000    2884 ms   258.9 M/s   10000000 OK
  999999999585415333 ..  1000000000000000000    4260 ms   185.6 M/s   10000000 OK
 9999999999562573471 .. 10000000000000000000    7213 ms   115.7 M/s   10000000 OK
18446744030316425227 .. 18446744030759878665    8734 ms    96.8 M/s   10000000 OK
18446744073265777349 .. 18446744073709551615    8793 ms    96.3 M/s   10000000 OK

sieving with maximum window size...
                   0 ..           4294967294   14405 ms   568.7 M/s  203280221 OK
```

Observe the speed difference between sieving the full 32-bit range in one go (last line) and sieving the same range in small segments when initialising the factor sieve.

The digests for verifying the sieve output have been computed from reliable sources, as indicated in the topic Checksumming large swathes of prime numbers? (for verification). Sources:

  • up to 1,000,000,000,000 available as files from sites like primos.mat.br
  • up to 2^64-10*2^32 in super-fast bulk via the primesieve.org console program (pipe)
  • up to 2^64-1 - and beyond - via the gp/PARI program (pipe, about 1 million primes/minute)

3 Answers


Generally speaking, I have to admit that I don't understand everything the code is doing and that I could use a little more time in order to understand it. However, I hope I can still give you some useful pieces of advice:

  • You could use a little more C++ in your code, which could be considered a bit too C-ish. What comes to my mind is BITS_PER_WORD which can actually be obtained with std::numeric_limits:

    enum { BITS_PER_WORD = std::numeric_limits<word_t>::digits }; 

    The same goes for several other constants such as UINT32_MAX which could be std::numeric_limits<uint32_t>::max(). I have to admit that it makes the code longer too, but it may be easier to change the integer type with a basic find-and-replace if you ever need to.

  • You could also use the C++ algorithms instead of the C library. For example, replace std::memset by std::fill_n, which is likely to be optimized back into an std::memset anyway while being clearer:

    std::fill_n(target_segment, std::size_t((max_bit + CHAR_BIT) / CHAR_BIT), 0); 
  • It is likely that it has already been optimized by the compiler but since you compute both index / BITS_PER_WORD and index % BITS_PER_WORD, you may as well make sure that these two values are computed together by using std::div. The same applies to bits_to_sieve % segment_size and bits_to_sieve / segment_size elsewhere in the program.

  • If you have access to a C++11 compiler, don't hesitate to declare some variables constexpr, like SIEVE_BITS which can totally be computed at compile time.


Although I'm late, I'd like to publish a double-segmentation solution that uses wheel factorization, useful for sieving large numbers with little memory. I propose a traditional sieve that uses two vectors in order to gain the advantage of the wheel without increasing the complexity. The main advantage: this sieve uses 1/3 of the memory of the traditional sieve, at the same complexity.

Edit: using vector<char> you get a quick sieve.

```cpp
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;

void SieveWheel6(long long int dim1, vector<char> &Primes5mod6, vector<char> &Primes1mod6)
{
    // Get all the primes >3 and <= 6*dim1+1 in vectors Primes5mod6 and Primes1mod6.
    // The sieve only considers numbers +-1 mod 6: a kind of wheel factorization with the basis 2,3.
    long long int m;
    long long int mmin = 6;
    long long int imax = (long long int)floor(sqrt(dim1/6)) + 1;
    for (long long int i = 1; i <= imax; i++)
    {
        if (Primes5mod6[i])
        {
            // Cancelling out all the multiples of 6*i-1
            for (m = mmin; m <= dim1; m += 6*i-1)
            {
                Primes5mod6[m] = false;
                Primes1mod6[m-2*i] = false;
            }
            for (; m-2*i <= dim1; m += 6*i-1)
            {
                Primes1mod6[m-2*i] = false;
            }
        }
        if (Primes1mod6[i])
        {
            // Cancelling out all the multiples of 6*i+1
            for (m = mmin; m+2*i <= dim1; m += 6*i+1)
            {
                Primes5mod6[m] = false;
                Primes1mod6[m+2*i] = false;
            }
            for (; m <= dim1; m += 6*i+1)
            {
                Primes5mod6[m] = false;
            }
        }
        mmin += 12*i+6;
    }
}

void StepSegmentedSieve(long long int dim1, long long int j,
                        vector<char> &Primes5mod6, vector<char> &Primes1mod6,
                        vector<char> &Primes5mod6_t, vector<char> &Primes1mod6_t)
{
    long long int imax = (long long int)((floor(sqrt((j+2)*dim1/6)) + 1)) + 1;
    long long int m, m1, mmin, mmin1;
    long long int dim0 = (j+1)*dim1;
    long long int mmin0 = 6 - dim0;
    for (int i = 1; i < imax; i++)
    {
        if (Primes5mod6[i])
        {
            // Cancelling out all the multiples of 6*i-1
            mmin  = ((-dim0+i) % (6*i-1) + 6*i-1) % (6*i-1);
            mmin1 = ((mmin-2*i) % (6*i-1) + 6*i-1) % (6*i-1);
            if (mmin0 >= 0)        { mmin  = mmin0; }
            if ((mmin0-2*i) >= 0)  { mmin1 = mmin0-2*i; }
            for (m = mmin, m1 = mmin1; (m <= dim1) && (m1 <= dim1); m += 6*i-1, m1 += 6*i-1)
            {
                Primes5mod6_t[m]  = false;
                Primes1mod6_t[m1] = false;
            }
            for (; m <= dim1; m += 6*i-1)   { Primes5mod6_t[m]  = false; }
            for (; m1 <= dim1; m1 += 6*i-1) { Primes1mod6_t[m1] = false; }
        }
        if (Primes1mod6[i])
        {
            // Cancelling out all the multiples of 6*i+1
            mmin  = ((-dim0-i) % (6*i+1) + 6*i+1) % (6*i+1);
            mmin1 = (mmin+2*i) % (6*i+1);
            if (mmin0 >= 0)        { mmin  = mmin0; }
            if ((mmin0+2*i) >= 0)  { mmin1 = mmin0+2*i; }
            for (m = mmin, m1 = mmin1; (m <= dim1) && (m1 <= dim1); m += 6*i+1, m1 += 6*i+1)
            {
                Primes5mod6_t[m]  = false;
                Primes1mod6_t[m1] = false;
            }
            for (; m <= dim1; m += 6*i+1)   { Primes5mod6_t[m]  = false; }
            for (; m1 <= dim1; m1 += 6*i+1) { Primes1mod6_t[m1] = false; }
        }
        mmin0 += 12*i+6;
    }
}

void DoubleSegmentedSieve(long long int num_seg_1, long long int dim_seg_2, long long int k_start,
                          vector<unsigned long long int> &ans)
{
    // The sieve only considers numbers +-1 mod 6: a kind of wheel factorization with the basis 2,3.
    long long int sqrt_nmax = (long long int)((floor(sqrt(k_start/6 + dim_seg_2/36)) + 1)) + 1;
    long long int num_step = num_seg_1;
    long long int dim1 = (long long int)(sqrt_nmax/num_seg_1 + num_seg_1/6);
    // for convenience the minimum value of dim1 allowed is 2^12
    if (dim1 < (long long int)exp2(12))
    {
        num_step = 1 + sqrt_nmax/(long long int)exp2(12);
        dim1 = sqrt_nmax/num_step + num_step/6;
    }
    long long int imax = dim1 + 1;
    long long int dim2 = dim_seg_2/6 + 1;
    long long int kmin = (long long int)k_start - 1;
    long long int i, j, m, m1, mmin, mmin1;
    long long int dim_j, mmin_i, p_i;

    // Get all the primes >3 and <= 6*dim1+1 with (6*dim1+1) > sqrt(sqrt(6*k_start+dim_seg_2))
    vector<char> Primes5mod6(dim1 + 1, true);
    vector<char> Primes1mod6(dim1 + 1, true);
    SieveWheel6(dim1, Primes5mod6, Primes1mod6);

    // vectors of temp primes in step j
    vector<char> Primes5mod6_t(Primes5mod6);
    vector<char> Primes1mod6_t(Primes1mod6);

    // vectors of final primes in range [6*k_start, 6*k_start+dim_seg_2]
    vector<char> Primes5mod6_f(dim2 + 1, true);
    vector<char> Primes1mod6_f(dim2 + 1, true);

    for (j = 0; j < num_step; j++)
    {
        dim_j = j*dim1;
        mmin_i = -kmin + 6 + 12*dim_j + 6*dim_j*dim_j;
        for (i = 1; i < imax; i++)
        {
            if (Primes5mod6_t[i])
            {
                // Cancelling out all the multiples of p_i=6*(i+dim_j)-1
                p_i = 6*(i+dim_j) - 1;
                mmin  = ((-kmin+i+dim_j) % p_i + p_i) % p_i;
                mmin1 = ((mmin-2*(i+dim_j)) % p_i + p_i) % p_i;
                if (mmin_i >= 0)                 { mmin  = mmin_i; }
                if (mmin_i-2*(i+dim_j) >= 0)     { mmin1 = mmin_i-2*(i+dim_j); }
                for (m = mmin, m1 = mmin1; (m <= dim2) && (m1 <= dim2); m += p_i, m1 += p_i)
                {
                    Primes5mod6_f[m]  = false;
                    Primes1mod6_f[m1] = false;
                }
                for (; m <= dim2; m += p_i)   { Primes5mod6_f[m]  = false; }
                for (; m1 <= dim2; m1 += p_i) { Primes1mod6_f[m1] = false; }
            }
            if (Primes1mod6_t[i])
            {
                // Cancelling out all the multiples of p_i=6*(i+dim_j)+1
                p_i = 6*(i+dim_j) + 1;
                mmin  = ((-kmin-i-dim_j) % p_i + p_i) % p_i;
                mmin1 = (mmin+2*(i+dim_j)) % p_i;
                if (mmin_i >= 0)                 { mmin  = mmin_i; }
                if (mmin_i+2*(i+dim_j) >= 0)     { mmin1 = mmin_i+2*(i+dim_j); }
                for (m = mmin, m1 = mmin1; (m <= dim2) && (m1 <= dim2); m += p_i, m1 += p_i)
                {
                    Primes5mod6_f[m]  = false;
                    Primes1mod6_f[m1] = false;
                }
                for (; m <= dim2; m += p_i)   { Primes5mod6_f[m]  = false; }
                for (; m1 <= dim2; m1 += p_i) { Primes1mod6_f[m1] = false; }
            }
            mmin_i += 12*i + 6 + 12*dim_j;
        }
        fill(Primes5mod6_t.begin(), Primes5mod6_t.end(), true);
        fill(Primes1mod6_t.begin(), Primes1mod6_t.end(), true);
        StepSegmentedSieve(dim1, j, Primes5mod6, Primes1mod6, Primes5mod6_t, Primes1mod6_t);
    }

    // Append primes in range [6*k_start, 6*k_start+dim_seg_2] to ans
    if (Primes1mod6_f[1])
    {
        ans.push_back(6*(1 + (unsigned long long int)kmin) + 1);
    }
    for (i = 2; i <= dim2; i++)
    {
        if (Primes5mod6_f[i]) { ans.push_back(6*(i + (unsigned long long int)kmin) - 1); }
        if (Primes1mod6_f[i]) { ans.push_back(6*(i + (unsigned long long int)kmin) + 1); }
    }
}

int main()
{
    // find vector ans of primes in range [6*k_start, 6*k_start+dim_seg_2]
    vector<unsigned long long int> ans;
    long long int k_start = (long long int)exp2(60)/6;
    long long int dim_seg_2 = (long long int)exp2(14);
    // for the search divide the range up to sqrt(6*k_start+dim_seg_2) into num_seg_1 segments
    long long int num_seg_1 = 16;
    DoubleSegmentedSieve(num_seg_1, dim_seg_2, k_start, ans);
    for (unsigned long long int p : ans)
    {
        cout << p << " ";
    }
    cout << endl;
}
```

Sorry that I'm four years late, but I'm very interested in this topic. I hope you are still interested in it as well.

WHEEL FACTORIZATION

That would greatly speed up your code. The canonical examples you reference use wheel factorization, so I'm quite sure you are aware of it. Perhaps you feel it's too large or complex to add here and might confuse people unfamiliar with it.

  • Greg, I suggest you try it and come back with results. Good caching will massively improve speed, and obviously he is using a wheel { 2 } which gives most of the potential savings already. — Commented Dec 17, 2019 at 13:48
  • Gnasher has hit the nail squarely on the head. A packed odds-only bitmap (a.k.a. mod-2 wheel) realises most of the potential of wheeled representation already, and most of the remaining potential can be realised via presieving and/or wheelish striding when reading out the numbers from the packed bitmap, all at virtually no cost. By contrast, the famous mod-30 wheel offers a further space reduction to a bit more than half but it makes the code hugely more complicated. If you were to drive it with lookup tables then you would lose speed, which means you have to use massively unrolled code at — Commented Dec 24, 2019 at 13:33
  • some chosen phase for blasting through the in-phase whole-phase part of a segment, with setup and finisher code to get the operation into phase before the main loop and then to finish off odds and ends after the main loop. Huuugely complicated. — Commented Dec 24, 2019 at 13:35
