Expanding on comments:
I would reverse the splitting.
I would split the writable memory across the threads so each thread wrote to a unique section of memory (it would be good if this were divided by page size as it may help performance). We will call the section of memory that a thread writes to a page.
Then each thread will loop over **all** the primes and remove them from the page associated with the thread. Using the original code as a starting point it would look like this:
std::size_t sectionSizePerThread =
static_cast<std::size_t>(std::ceil(
static_cast<float>(n) / // Size of write memory.
static_cast<float>(std::thread::hardware_concurrency())
for (std::size_t first = 0u;
first < n;
first += sectionSizePerThread)
{
auto last = std::min(first+sectionSizePerThread, n);
// Spawn a thread to strike out the multiples
// of the prime numbers corresponding to the
// elements of primes between first and last
threads.emplace_back(
[&primes, &is_prime](Integer begin, Integer end)
{
for (auto prime: primes)
{
Integer first = begin/prime;
first += (first < begin) ? prime : 0;
for (Integer n = first ; n < end ; n += prime)
{
is_prime[n].store(false, std::memory_order_relaxed);
}
}
},
first, last);
}
Note: It does not matter if read memory `is_prime` is shared across all the threads as each thread can have its own copy in the local cache and not be affected by other threads (as it is read only).