Expanding on comments:


I would reverse the splitting.
I would split the writable memory across the threads so each thread wrote to a unique section of memory (it would be good if this were divided by page size as it may help performance). We will call the section of memory that a thread writes to a page.

Then each thread will loop over **all** the primes and remove them from the page associated with the thread. Using the original code as a starting point it would look like this:



 std::size_t sectionSizePerThread =
 static_cast<std::size_t>(std::ceil(
 static_cast<float>(n) / // Size of write memory.
 static_cast<float>(std::thread::hardware_concurrency())

 for (std::size_t first = 0u;
 first < n;
 first += sectionSizePerThread)
 {
 auto last = std::min(first+sectionSizePerThread, n);
 // Spawn a thread to strike out the multiples
 // of the prime numbers corresponding to the
 // elements of primes between first and last
 threads.emplace_back(
 [&primes, &is_prime](Integer begin, Integer end)
 {
 for (auto prime: primes)
 {
 Integer first = begin/prime;
 first += (first < begin) ? prime : 0;
 for (Integer n = first ; n < end ; n += prime)
 {
 is_prime[n].store(false, std::memory_order_relaxed);
 }
 }
 },
 first, last);
 }


Note: It does not matter if read memory `is_prime` is shared across all the threads as each thread can have its own copy in the local cache and not be affected by other threads (as it is read only).