3

Good afternoon!

I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.

Consider the following code (coefficients' length, let it be m, is an exact power of two):

/// <summary> /// Calculates the result of the recursive Number Theoretic Transform. /// </summary> /// <param name="coefficients"></param> /// <returns></returns> private static BigInteger[] Recursive_NTT_Skeleton( IList<BigInteger> coefficients, IList<BigInteger> rootsOfUnity, int step, int offset) { // Calculate the length of vectors at the current step of recursion. // - int n = coefficients.Count / step - offset / step; if (n == 1) { return new BigInteger[] { coefficients[offset] }; } BigInteger[] results = new BigInteger[n]; IList<BigInteger> resultEvens = Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset); IList<BigInteger> resultOdds = Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step); for (int k = 0; k < n / 2; k++) { BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS; results[k] = (resultEvens[k] + bfly) % NTT_MODULUS; results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS; } return results; } 

It worked for complex FFT (replace BigInteger with a complex numeric type (I had my own)). It doesn't work here even though I changed the procedure of finding the primitive roots of unity appropriately.

Supposedly, the problem is this: rootsOfUnity parameter passed originally contained only the first half of m-th complex roots of unity in this order:

omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)

It was enough, because on these three lines of code:

BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS; results[k] = (resultEvens[k] + bfly) % NTT_MODULUS; results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS; 

I originally made use of the fact, that at any level of recursion (for any n and i), the complex root of unity -omega^(i) = omega^(i + n/2).

However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?

Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?

Maybe there are other problems with this code?..

Thank you very much in advance!

3
  • look here jjj.de/fxt/fxtbook.pdf and here physicsforums.com/showthread.php?t=140367 Commented Aug 30, 2013 at 10:53
  • just have edited my answer ,... my working fast NTT/INTT is included now, in case you still need it Commented Aug 31, 2013 at 20:12
  • Does anyone have a source that posits a recursive solution? Also -omega^(i) = omega^(i + n/2) very much DOES hold in finite fields. Commented May 5, 2020 at 18:17

3 Answers 3

3

I recently wanted to implement NTT for fast multiplication instead of DFFT too. Read a lot of confusing things, different letters everywhere and no simple solution, and also my finite fields knowledge is rusty , but today i finally got it right (after 2 days of trying and analog-ing with DFT coefficients) so here are my insights for NTT:

  1. Computation

    X(i) = sum(j=0..n-1) of ( Wn^(i*j)*x(i) ); 

    where X[] is NTT transformed x[] of size n where Wn is the NTT basis. All computations are on integer modulo arithmetics mod p no complex numbers anywhere.

  2. Important values


    Wn = r ^ L mod p is basis for NTT
    Wn = r ^ (p-1-L) mod p is basis for INTT
    Rn = n ^ (p-2) mod p is scaling multiplicative constant for INTT ~(1/n)
    p is prime that p mod n == 1 and p>max'
    max is max value of x[i] for NTT or X[i] for INTT
    r = <1,p)
    L = <1,p) and also divides p-1
    r,L must be combined so r^(L*i) mod p == 1 if i=0 or i=n
    r,L must be combined so r^(L*i) mod p != 1 if 0 < i < n
    max' is the sub-result max value and depends on n and type of computation. For single (I)NTT it is max' = n*max but for convolution of two n sized vectors it is max' = n*max*max etc. See Implementing FFT over finite fields for more info about it.

  3. functional combination of r,L,p is different for different n

    this is important, you have to recompute or select parameters from table before each NTT layer (n is always half of the previous recursion).

Here is my C++ code that finds the r,L,p parameters (needs modular arithmetics which is not included, you can replace it with (a+b)%c,(a-b)%c,(a*b)%c,... but in that case beware of overflows especial for modpow and modmul) The code is not optimized yet there are ways to speed it up considerably. Also prime table is fairly limited so either use SoE or any other algo to obtain primes up to max' in order to work safely.

DWORD _arithmetics_primes[]= { 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, 179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409, 419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659, 661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941, 947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151, 0}; // end of table is 0, the more primes are there the bigger numbers and n can be used // compute NTT consts W=r^L%p for n int i,j,k,n=16; long w,W,iW,p,r,L,l,e; long max=81*n; // edit1: max num for NTT for my multiplication purposses for (e=1,j=0;e;j++) // find prime p that p%n=1 AND p>max ... 9*9=81 { p=_arithmetics_primes[j]; if (!p) break; if ((p>max)&&(p%n==1)) for (r=2;r<p;r++) // check all r { for (l=1;l<p;l++)// all l that divide p-1 { L=(p-1); if (L%l!=0) continue; L/=l; W=modpow(r,L,p); e=0; for (w=1,i=0;i<=n;i++,w=modmul(w,W,p)) { if ((i==0) &&(w!=1)) { e=1; break; } if ((i==n) &&(w!=1)) { e=1; break; } if ((i>0)&&(i<n)&&(w==1)) { e=1; break; } } if (!e) break; } if (!e) break; } } if (e) { error; } // error no combination r,l,p for n found W=modpow(r, L,p); // Wn for NTT iW=modpow(r,p-1-L,p); // Wn for INTT 

and here is my slow NTT and INTT implementations (i havent got to fast NTT,INTT yet) they are both tested with Schönhage–Strassen multiplication successfully.

//--------------------------------------------------------------------------- void NTT(long *dst,long *src,long n,long m,long w) { long i,j,wj,wi,a,n2=n>>1; for (wj=1,j=0;j<n;j++) { a=0; for (wi=1,i=0;i<n;i++) { a=modadd(a,modmul(wi,src[i],m),m); wi=modmul(wi,wj,m); } dst[j]=a; wj=modmul(wj,w,m); } } //--------------------------------------------------------------------------- void INTT(long *dst,long *src,long n,long m,long w) { long i,j,wi=1,wj=1,rN,a,n2=n>>1; rN=modpow(n,m-2,m); for (wj=1,j=0;j<n;j++) { a=0; for (wi=1,i=0;i<n;i++) { a=modadd(a,modmul(wi,src[i],m),m); wi=modmul(wi,wj,m); } dst[j]=modmul(a,rN,m); wj=modmul(wj,w,m); } } //--------------------------------------------------------------------------- 


dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)

hope this helps to someone. If i forgot something please write ...

[edit1: fast NTT/INTT]

Finally I manage to get fast NTT/INTT to work. Was little bit more tricky than normal FFT:

//--------------------------------------------------------------------------- void _NFTT(long *dst,long *src,long n,long m,long w) { if (n<=1) { if (n==1) dst[0]=src[0]; return; } long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m); // reorder even,odd for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j]; for ( j=1;i<n ;i++,j+=2) dst[i]=src[j]; // recursion _NFTT(src ,dst ,n2,m,w2); // even _NFTT(src+n2,dst+n2,n2,m,w2); // odd // restore results for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m)) { a0=src[i]; a1=modmul(src[j],w2,m); dst[i]=modadd(a0,a1,m); dst[j]=modsub(a0,a1,m); } } //--------------------------------------------------------------------------- void _INFTT(long *dst,long *src,long n,long m,long w) { long i,rN; rN=modpow(n,m-2,m); _NFTT(dst,src,n,m,w); for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m); } //--------------------------------------------------------------------------- 

[edit3]

I have optimized my code (3x times faster than code above),but still i am not satisfied with it so i started new question with it. There I have optimized my code even further (about 40x times faster than code above) so its almost the same speed as FFT on floating point of the same bit size. Link to it is here:

Sign up to request clarification or add additional context in comments.

9 Comments

I notice that the OP wanted a recursive analog to the FFT and not the iterative variant that you presented here. Kudos on the iterative version, but I'm genuinely interested if one can perform an NTT with the recursive structure contained in the simple FFT implementation you might see in a textbook? Would you mind looking into that? I have tried and it seems that the NTT only succeeds for indexes 1 and 1 + n/2 and fails for all others.
@z.karl 1. in my answer are both slow iterative and fast recursive versions and the link in the bottom has highly optimized NTT that outperforms FFT. 2. what do you have in mind with NTT only succeeds for indexes 1 and 1 + n/2 ? if you mean sizes than recursive FFT and NTT needs a power of 2 size s of data to work properly.
Sorry about that you do show a recursive version, I just don't seem to understand how it works I guess. You mention it's a little more tricky than the normal FFT and I can see that, could you explain why it needs to be different in the first place? As in, why can we just replace complex numbers with ints mod some prime of the form k*2^n + 1? I ask because when I did replace complex numbers with their equivalent mod p, for n = 8, my recursive NTT correctly outputs indexes [0] and [0 + 4] but fails for [1, 2, 3, 5, 6, 7], and I was wondering if you might know why?
@z.karl oow now I get it what you mean (may be its the merlot in mine weins :) ) Youre right you can simply exchange complex numbers with integers but the problem is the prime can not be any prime it must be nth-root of unity and there are not many such primes... with prime and size of data processed the other parameters change too so if you change just size it will not work anymore ...however if you get a big enough such prime you can scale the other parameters to any lower power of 2 size .... see the last link in my answer it uses single prime for any n ...
hmm I guess it could be my prime. I was using 673 = 84*2^3 + 1 as the prime, which means 5 is the principal (672)-th root of unity, which means 609 is the 2^3 = 8-th root of unity, because 8 is the length of my transform. Is there something wrong with that prime? I think it satisfies the conditions no?
|
1

To turn Cooley-Tukey (complex) FFT into modular arithmetic approach, i.e. NTT, you must replace complex definition for omega. For the approach to be purely recursive, you also need to recalculate omega for each level based on current signal size. This is possible because min. suitable modulus decreases as we move down in the call tree, so modulus used for root is suitable for lower layers. Additionally, as we are using same modulus, the same generator may be used as we move down the call tree. Also, for inverse transform, you should take additional step to take recalculated omega a and instead use as omega: b = a ^ -1 (via using inverse modulo operation). Specifically, b = invMod(a, N) s.t. b * a == 1 (mod N), where N is the chosen prime modulus.

Rewriting an expression involving omega by exploiting periodicity still works in modular arithmetic realm. You also need to find a way to determine the modulus (a prime) for the problem and a valid generator.

We note that your code works, though it is not a MWE. We extended it using common sense, and got correct result for a polynomial multiplication application. You just have to provide correct values of omega raised to certain powers.

While your code works, though, like from many other sources, you double spacing for each level. This does not lead to recursion that is as clean, though; this turns out to be identical to recalculating omega based on current signal size because the power for omega definition is inversely proportional to signal size. To reiterate: halving signal size is like squaring omega, which is like giving doubled powers for omega (which is what one would do for doubling of spacing). The nice thing about the approach that deals with recalculating of omega is that each subproblem is more cleanly complete in its own right.

There is a paper that shows some of the math for modular approach; it is a paper by Baktir and Sunar from 2006. See the paper at the end of this post.

You do not need to extend the cycle from n / 2 to n.

So, yes, some sources which say to just drop in a different omega definition for modular arithmetic approach are sweeping under the rug many details.

Another issue is that it is important to acknowledge that the signal size must be large enough if we are to not have overflow for result time-domain signal if we are performing convolution. Additionally, it may be useful to find certain implementations for exponentiation subject to modulus exist that are fast, even if the power is quite large.

References

  • Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)

Comments

0

You must make sure that roots of unity actually exist. In R there are only 2 roots of unity: 1 and -1, since only for them x^n=1 can be true.

In C you have infinitely many roots of unity: w=exp(2*pi*i/N) is a primitive N-th roots of unity and all w^k for 0<=k

Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.

Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity. 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity and so on. Furthermore, these roots of unity have the advantage that they are powers of two and can be implemented as binary shifts (with a modulo operation afterwards, which comes down to a add/subtract).

I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.