Revisions to Optimized ulong prime test using 6k+/-1 in parallel threads with C#

used local variables

edited Jun 4, 2015 at 23:06

6.8k
1
20
32

I've credited @Hurricane with providing the correct answer. Correct is insufficient. It was excellent, original, and thought-provoking. For the largest 63 bit, it was 2 seconds faster. For the largest 64 bit prime, it was 3 seconds, or in each case 30% faster.

I did a small reworking of his answer for the critical method. Rather than using the cryptically-named iscdfor counting down, I choose a more meaningful name of stopCheckTrigger for counting up. The up versus down doesn't matter but here at CodeReview we frown upon cryptic names.

Expanding upon his notion that performance is better accessing variables local to the lamba, I added a localNumber. REMEMBER This tweaking is only for when you expect to have well in excess of half a billion loops.

For the largest 63 bit, it went from 6.5 seconds to 6.2. For the largest 64 bit prime, it went from 9.5 seconds to 5.95.

private static bool IsPrimeParallel(ulong number, ulong root) { int composite = 0;   // I arbitrarily choose the number of chunks to be the same as the number of processors. // Each chunk will processed in its own thread, but this in no way is equivalent // of saying each thread goes to its own core, or vice versa. int chunks = Environment.ProcessorCount; // perform cast once rather than a billion times const int CheckToStopTriggerPoint = 4096; Parallel.For(0, chunks, (chunk, loopState) => { // perform cast once rather than 90-715 million times // access of variables outside of the lambda function may be slow, depending on how the compiler does it. // so these next 4 lines can be prior to the Parallel.For or inside the lamba. // safest bet is to keep them inside the lambda. ulong chunks64 = (ulong)chunks; ulong kEnd = (root / 6) + chunks64; ulong fEnd = (6 * kEnd) - 1; ulong fStep = (6 * chunks64); // Make a copy of localNumber inside the lamba! ulong localNumber = number; // This declaration MUST BE local to the lamba. int stopCheckTrigger = 0; // perform cast once rather than 90-715 million times for (ulong factor = 6 * ((ulong)chunk + 1) - 1; factor <= fEnd; factor += fStep) { if (numberlocalNumber % factor == 0) // (6k-1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } else if (numberlocalNumber % (factor + 2) == 0) // (6k+1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } const int CheckToStopTriggerPoint = 4096;  if (++stopCheckTrigger == CheckToStopTriggerPoint) { if (loopState.IsStopped) { break; } stopCheckTrigger = 0; // reset to 0 } } });   return (composite == 0); }

I've credited @Hurricane with providing the correct answer. Correct is insufficient. It was excellent, original, and thought-provoking. For the largest 63 bit, it was 2 seconds faster. For the largest 64 bit prime, it was 3 seconds, or in each case 30% faster.

I did a small reworking of his answer for the critical method. Rather than using the cryptically-named iscdfor counting down, I choose a more meaningful name of stopCheckTrigger for counting up. The up versus down doesn't matter but here at CodeReview we frown upon cryptic names.

private static bool IsPrimeParallel(ulong number, ulong root) { int composite = 0; // I arbitrarily choose the number of chunks to be the same as the number of processors. // Each chunk will processed in its own thread, but this in no way is equivalent // of saying each thread goes to its own core, or vice versa. int chunks = Environment.ProcessorCount; // perform cast once rather than a billion times const int CheckToStopTriggerPoint = 4096; Parallel.For(0, chunks, (chunk, loopState) => { // perform cast once rather than 90-715 million times // access of variables outside of the lambda function may be slow, depending on how the compiler does it. // so these next 4 lines can be prior to the Parallel.For or inside the lamba. // safest bet is to keep them inside the lambda. ulong chunks64 = (ulong)chunks; ulong kEnd = (root / 6) + chunks64; ulong fEnd = (6 * kEnd) - 1; ulong fStep = (6 * chunks64); // This declaration MUST BE local to the lamba. int stopCheckTrigger = 0; // perform cast once rather than 90-715 million times for (ulong factor = 6 * ((ulong)chunk + 1) - 1; factor <= fEnd; factor += fStep) { if (number % factor == 0) // (6k-1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } else if (number % (factor + 2) == 0) // (6k+1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } if (++stopCheckTrigger == CheckToStopTriggerPoint) { if (loopState.IsStopped) { break; } stopCheckTrigger = 0; // reset to 0 } } }); return (composite == 0); }

I've credited @Hurricane with providing the correct answer. Correct is insufficient. It was excellent, original, and thought-provoking.

I did a small reworking of his answer for the critical method. Rather than using the cryptically-named iscdfor counting down, I choose a more meaningful name of stopCheckTrigger for counting up. The up versus down doesn't matter but here at CodeReview we frown upon cryptic names.

Expanding upon his notion that performance is better accessing variables local to the lamba, I added a localNumber. REMEMBER This tweaking is only for when you expect to have well in excess of half a billion loops.

For the largest 63 bit, it went from 6.5 seconds to 6.2. For the largest 64 bit prime, it went from 9.5 seconds to 5.95.

private static bool IsPrimeParallel(ulong number, ulong root) { int composite = 0;   // I arbitrarily choose the number of chunks to be the same as the number of processors. // Each chunk will processed in its own thread, but this in no way is equivalent // of saying each thread goes to its own core, or vice versa. int chunks = Environment.ProcessorCount; Parallel.For(0, chunks, (chunk, loopState) => { // perform cast once rather than 90-715 million times // access of variables outside of the lambda function may be slow, depending on how the compiler does it. ulong chunks64 = (ulong)chunks; ulong kEnd = (root / 6) + chunks64; ulong fEnd = (6 * kEnd) - 1; ulong fStep = (6 * chunks64); // Make a copy of localNumber inside the lamba! ulong localNumber = number; // This declaration MUST BE local to the lamba. int stopCheckTrigger = 0; // perform cast once rather than 90-715 million times for (ulong factor = 6 * ((ulong)chunk + 1) - 1; factor <= fEnd; factor += fStep) { if (localNumber % factor == 0) // (6k-1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } else if (localNumber % (factor + 2) == 0) // (6k+1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } const int CheckToStopTriggerPoint = 4096;  if (++stopCheckTrigger == CheckToStopTriggerPoint) { if (loopState.IsStopped) { break; } stopCheckTrigger = 0; // reset to 0 } } });   return (composite == 0); }

add summary

Source Link

edited Jun 4, 2015 at 18:07

Rick Davin

6.8k
1
20
32

I've credited @Hurricane@Hurricane with providing the correct answer. Correct is insufficient. It was excellent, original, and thought-provoking. For the largest 63 bit, it was 2 seconds faster. For the largest 64 bit prime, it was 3 seconds, or in each case 30% faster.

Other than that bit of reworking, its the same as @Hurricane's@Hurricane's answer:

If you are a fan of primes, I strongly encourage you to vote for @Hurricane's answer.

Source Link

answered Jun 4, 2015 at 17:50

Rick Davin

6.8k
1
20
32

I've credited @Hurricane with providing the correct answer. Correct is insufficient. It was excellent, original, and thought-provoking. For the largest 63 bit, it was 2 seconds faster. For the largest 64 bit prime, it was 3 seconds, or in each case 30% faster.

I did a small reworking of his answer for the critical method. Rather than using the cryptically-named iscdfor counting down, I choose a more meaningful name of stopCheckTrigger for counting up. The up versus down doesn't matter but here at CodeReview we frown upon cryptic names.

Other than that bit of reworking, its the same as @Hurricane's answer:

private static bool IsPrimeParallel(ulong number, ulong root) { int composite = 0; // I arbitrarily choose the number of chunks to be the same as the number of processors. // Each chunk will processed in its own thread, but this in no way is equivalent // of saying each thread goes to its own core, or vice versa. int chunks = Environment.ProcessorCount; // perform cast once rather than a billion times const int CheckToStopTriggerPoint = 4096; Parallel.For(0, chunks, (chunk, loopState) => { // perform cast once rather than 90-715 million times // access of variables outside of the lambda function may be slow, depending on how the compiler does it. // so these next 4 lines can be prior to the Parallel.For or inside the lamba. // safest bet is to keep them inside the lambda. ulong chunks64 = (ulong)chunks; ulong kEnd = (root / 6) + chunks64; ulong fEnd = (6 * kEnd) - 1; ulong fStep = (6 * chunks64); // This declaration MUST BE local to the lamba. int stopCheckTrigger = 0; // perform cast once rather than 90-715 million times for (ulong factor = 6 * ((ulong)chunk + 1) - 1; factor <= fEnd; factor += fStep) { if (number % factor == 0) // (6k-1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } else if (number % (factor + 2) == 0) // (6k+1) { Interlocked.Exchange(ref composite, 1); loopState.Stop(); break; } if (++stopCheckTrigger == CheckToStopTriggerPoint) { if (loopState.IsStopped) { break; } stopCheckTrigger = 0; // reset to 0 } } }); return (composite == 0); }

Stack Exchange Network

Return to Answer