Extremely fast method for modular exponentiation with modulus and exponent of several million digits

Question

As a hobby project I'm taking a crack at finding really large prime numbers. The primality tests for this contain modular exponentiation calculations, i.e. a^e mod n. Let's call this the modpow operation to keep the explanation simple. I am wanting to speed up this particular calculation.

Currently I am using GMP's mpz_pown function, but, it is kind of slow. The reason I think it's too slow, is because a function call to GMP's modpow is slower than a full-blown primality test of the software called PFGW for the same large number. (So to be clear, this is just the GMP's modpow part, not my whole custom primality testing routine I am comparing). PFGW is considered the fastest in it's field and for my use case it uses a Brillhart-Lehmer-Selfridge primality test - which also uses the modpow procedure - so it's not because of mathematical cleverness that PFGW is faster in that aspect (please correct me if I'm wrong here). It looks like the bottleneck in GMP is the modpow operation. An example runtime for numbers which have a little over 20,000 digits: GMP's modpow operation takes about 45 seconds and PFGW finishes the whole primality test (involving a modpow) in 9 seconds flat. The difference gets even more impressive with even bigger numbers. GMP uses FFT multiplication and Montgomery reduction for this test comparison, see comments on this post below.

I did some research. So far I understand that the modpow algorithm uses exponentiation by squaring, integer multiplication and modulo reduction - these all sound very familiar to me. Several helper methods could improve the running time of integer multiplication:

To improve the running time of the exponentiation by squaring part, one may use a signed digit representation to reduce the number of multiplications (i.e. bits are represented as 0, 1 or -1, and the bit string is represented in such a way so that it contains many more zeros than in it's original base-2 representation - this reduces the running time of exponentiation by squaring).

For optimizing the modulo part of the operation, I know of these methods:

Montgomery reduction

So here is the 150,000 dollar question: is there a software library available to do a modpow operation efficiently given a very large base, exponent and modulus? (I'm aiming for several millions of digits). If you would like to suggest an option, please try to explain the inner workings of the algorithm for the case with millions of digits for the base, modulus and exponents, as some libraries use different algorithms based on the number of digits. Basically I am looking for a library which supports the techniques mentioned above (or possibly more clever techniques) and it should perform well while running the algorithm (well, better than GMP at least). So far I've searched, found and tried GMP and PFGW, but didn't find these satisfying (PFGW is fast, but I'm just interested in the modpow operation and there is no direct programming interface to that). I'm hoping that maybe an expert in the field can suggest a library with these capabilities, as there seem to be very few that are able to handle these requirements.

Edit: made the question more concise, as it was marked too broad.

The general mpn_mul, mpn_sqr are used by mpn_powm, which will use FFT if the size of the multiplicands are over the Toom-Cook thresholds. Unless GMP has been built with --disable-fft. mpn_powm uses Montgomery reduction, including the case of an even modulus, as mpz_powm decomposes the modulus into an odd factor, and an even (2^i) factor, and applies the Chinese remainder theorem to the results. The Montgomery reduction routine mpn_redc_n will also use FFT if the operands are large. — Brett Hale
– Brett Hale, Commented Jul 11, 2014 at 12:19
Thanks for your input Brett. Looks like mpn_powm isn't documented (as mentioned here). That page says mpn_powm is faster than mpz_powm. Do you happen to know why that is and what the differences between the two are? The GMP function index also doesn't mention mpn_redc_n which sounds like something very useful! Where can I find the documentation of these functions you mentioned? Thanks! — webdevelopersdiary
– webdevelopersdiary, Commented Jul 11, 2014 at 12:30
I looked it up and am putting it here for reference: the thresholds are mentioned here. The maximum threshold for Toom-Cook seems to be a little less than 600 limbs in the 'worst' case. A limb can hold 32 or 64 bits, so that means it'll use FFT for integers larger than ~ 600*64 = 39000 bits on my 64 bit computer. I hadn't reached that number of bits yet. I'll do another test run to see what happens above that limit. — webdevelopersdiary
– webdevelopersdiary, Commented Jul 11, 2014 at 12:48
I couldn't edit my last comment anymore, so putting this in a new one. The ~20,000 digit test I did (mentioned in OP) was with a ~70,000 bit number, so I'm in the FFT spectrum already. — webdevelopersdiary
– webdevelopersdiary, Commented Jul 11, 2014 at 12:56
the mpz_ functions are the recommended interface. the mpn_ functions do the real work, implementing the lower-level routines. That said, the mpz_ functions also setup working memory, massage the data, select the best routines for the operands, etc. For such a large calculation, you would save nothing in bypassing the MPZ interface, and would have to essentially replicate the setup code. — Brett Hale
– Brett Hale, Commented Jul 11, 2014 at 13:06

ewmayer · Accepted Answer · 2015-09-25 03:40:41Z

First off, re. the Answer 1 writer's comment "I do not use GMP but I suspect when they wrote they use FFT they really mean the NTT" -- no, when GMP says "FFT' it means a floating-point FFT. IIRC they also have some NTT-based routines, but for bignum mul those are uncompetitive with FFT.

The reason a well-tuned FFT-mul beats any NTT is that the slight loss of per-word precision due to roundoff error accumulation is more than made up for by the vastly superior floating-point capabilities of modern CPU offerings, especially when one considers high-performance implementations which make use of the vector-math capabilities of CPUs such as the x86_64 family, the current iterations of which - Intel Haswell, Broadwell and Skylake - have massive vector floating-point capability. (I don't cite AMD in this regard because their AVX offerings have lagged far behind Intel's; their high-water mark was circa 2002 and since then Intel has been beating the pants off them in progressively-worse fashion each year.) The reason GMP disappoints in this area is that GMP's FFT is, relatively speaking, crap. I have great respect for the GMP coders overall, but FFT timings are FFT timings, you don't get points for effort or e.g. having a really good bignum add. Here is a paper detailing a raft of GMP FFT-mul improvements:

Pierrick Gaudry, Alex Kruppa, Paul Zimmerman: "A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm" [http://www.loria.fr/~gaudry/publis/issac07.pdf]

This is from 2007, but AFAIK the performance gap noted in the snippet below has not been narrowed; if anything it has widened. The paper is excellent for detailing various mathematical and algorithmic improvements which can be deployed, but let's cut to the money quote:

"A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^2wn − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our im- plementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron."

The next few paragraphs are a raft of face-saving spin. (And again, I am personally acquainted with 2 of the 3 authors, and they are all top guys in the field of computational number theory.)

Note that the aforementioned George Woltman, whose Prime95 code has discovered all of the world-record primes since shortly after its debut 20 years ago, has made his core bignum routines available in a general API-ized form called the GWNUM library. You mentioned how much faster PFGW is than GMP for FFT-mul - that's because PFGW uses GWNUM for the core 'heavy lifting' arithmetic, that's where the 'GW' in PFGW comes from.

My own FFT implementation, which has generic-C build support but like George's uses reams of x86 vector-math assembler for high performance on that CPU family, is roughly 60-70% slower than George's on current Intel processor families. I believe that makes it the world's 2nd-fastest bignum-mul code on x86. By way of example, my code is currently running a primality test on a number with roughly 2^29 bits using a 30-Mdouble-length FFT (30*2^20 doubles); thus a little more than 17 bits per input word. Using all four of my 3.3 GHz Haswell 4670 quad's cores it takes ~90 ms per modmul.

BTW, many (if not most) of the world's top bignum-math coders hang out at mersenneforum.org, I encourage you to check it out and ask your questions to the broader (at least in this particular area) expert audience there. I appear under the same handle there as here; George Woltman appears as "Prime95', PFGW's Mark Rodenkirch goes as "rogue".

Great information, thank you! I will drop by at Mersenne Forum some time, appreciate it!
+1 As the developer of y-cruncher, I've been following GMP for years. And while I also respect what they do, they do seem overly stubborn in their ways. Anything that contains the word, "floating-point" or "parallelism" is immediately thrown out the door. Even provably-correct FFTs will beat the crap out of GMP. So they are denying themselves the superior algorithms and ability to utilize modern hardware.
The only valid reason I can think of for sticking with the SSA algorithm is that it uses very little memory and doesn't require caching twiddle factors. From a library perspective, that does simplify things, but personally I don't think it's worth throwing out 10x in performance for that. (And the memory problem can be trivially fixed by throwing a layer of Karatsuba/Toom-Cook on top of the FFT.)

Community · Accepted Answer · 2017-05-23 12:23:53Z

I do not use GMP at all so handle this with that in mind.

I rather use NTT instead of FFT for multiplication

it removes the rounding errors and in comparison to mine FFT implementations optimized to the same point is faster
- C++ NTT implementation
- C++ NTT sqr(mul) implementation
as I mentioned I do not use GMP but I suspect when they wrote they use FFT they really mean the NTT (finite field Fourier transform)
The speed difference of your test and the GMP primality test can be caused by modpow call.

if there are too much calls to it then it causes heap/stack trashing which slows things down considerably. Especially for bignums. Try to avoid heap trashing so eliminate as much data from operands and return calls as possible for frequently called functions. Also sometimes helps to eliminate the bottleneck call by directly copy the function source code into your code instead of the call (or use of macros instead) with use of local variables only.

I think GMP published their source code so finding their implementation of modpow should not be too hard. You just have to use it correctly
just to be clear

you are using big numbers like 20000+ decadic digits which means ~8.4 KBytes per number. Any return or non pointer operand means to copy that amount of data to/from heap stack. This not only takes time but also usually invalidates the CPU's CACHE which also kills performance.

Now multiply this by the algorithm iterations number and you get the Idea. Had similar problems while tweaking many of mine bignum functions and the speedup is often more than 10000% (100 times) even if there is no change in the algorithm used. Just limiting/eliminating heap/stack trashing.

So I do not think you need better modpow implementation just better usage of it but of coarse I may be wrong but without the code you are using is hard to deduce more.

Thanks for your suggestion Spektre. Please grand me some time to look into NTT and run some benchmarks to see if this solves my problem. If it does, I'll be accepting this answer :-) Trying to look up some info on NTT. This is the Number Theoretic Transform, right? I'm reading this is a special case of the Discrete Fourier Transform (DFT). Does that mean that all NTT operations have the same time complexity as DFT? (e.g. multiplication has time complexity O(n log n) if I'm not mistaken)
@webdevelopersdiary yes NTT is nuber theoretic transform ... the same as FFT but on integers modulo prime instead of complex domain that means 2x less data 2x less operations but you need n-th root of unity in that link of mine is also mine implementation class in C++ (optimized more then I could dream of in the beginning) I chosed biggest 32bit nth rooth of unity possible which eliminates all modulo divisions and grants biggest datasets usage ... Also I strongly suggest to look at heap/stack trashing there you can gain even more speed if your code is wasting it...
That sounds good, thanks! How large is the biggest number operation you ran through your program yourself (in terms of number of bits or decimals)? Also, what do you think is the largest number size it could handle? Or perhaps it's better to ask: how long was the execution time for your personal biggest number? Do you also have data available on what's the maximum number size if I want to program to finish within 1, 4, 12, 24 or 7*24 hours? (Not looking for exact measurements, an estimate is okay.)
@webdevelopersdiary NTT input/otput vector size is limited because of overflow issues !!! For 32bit modular arithmetics is N limited to (2^32)/(max(input[])^2) so bigint must be divided to smaller chunks (i use BYTES so max size of bigint processed is (2^32)/((2^8)^2)=2^16 Bytes=2^14 DWORDs=16384 DWORDs) -> max result size is 523136 bits but you should lower it a bit because used prime is not 2^32 so I think up to 400000 bits should be safe. I do not use so huge numbers that will take hours to multiply see benchmarks for smaller inside those linked questions
I'd be careful about claiming that NTTs are faster than FFTs. Because they really aren't. FFTs are around 3 - 10x faster than even the best of the NTTs - at least until you reach sizes where memory becomes an issue.

Collectives™ on Stack Overflow

Extremely fast method for modular exponentiation with modulus and exponent of several million digits

2 Answers 2

3 Comments

16 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

16 Comments

Linked

Related