A direct DFT is a brute-force N^2 matrix multiply.
FFTs use clever tricks, exploiting the structure of that matrix (specializing the general matrix multiply) to reduce the computational cost from O(N^2) to O(N log N).
Let us first look at a small DFT:
W = fft(eye(4));                   % the 4x4 DFT matrix: FFT of the identity
x = rand(4,1) + 1j*rand(4,1);
X_ref = fft(x);
X = W*x;                           % the DFT as a plain matrix-vector multiply
assert(max(abs(X - X_ref)) < 1e-7)
Great, so we are able to substitute MATLAB's call to the FFTW library with a small 4x4 (complex) matrix multiplication. So what does this matrix look like?
W =
1 1 1 1
1 -i -1 i
1 -1 1 -1
1 i -1 -i
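For reference, the same matrix follows directly from the DFT definition W(j,k) = exp(-2*pi*1j*j*k/N) with zero-based indices. A quick sketch (W2 is my own name, not from the code above):
N = 4;
[j, k] = ndgrid(0:N-1);             % zero-based row and column indices
W2 = exp(-2j*pi*j.*k/N);            % element-wise DFT definition
W = fft(eye(N));
assert(max(abs(W2(:) - W(:))) < 1e-12)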
Each element is either +1, -1, +1j or -1j. Obviously, this means that we can avoid full complex multiplications: multiplying by ±1 is a sign flip, and multiplying by ±1j merely swaps the real and imaginary parts (with a sign change). Further, the first column is all ones, meaning that the first element of x is simply added into every output.
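To make that concrete, here is a hand-unrolled 4-point DFT (a minimal sketch of my own, not part of the original code) that needs only additions, subtractions and multiplications by ±1j:
x = rand(4,1) + 1j*rand(4,1);
a = x(1) + x(3);  b = x(1) - x(3);       % butterflies on x(1), x(3)
c = x(2) + x(4);  d = x(2) - x(4);       % butterflies on x(2), x(4)
X = [a + c; b - 1j*d; a - c; b + 1j*d];  % no general complex multiplies
assert(max(abs(X - fft(x))) < 1e-12)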
It turns out that Kronecker tensor products, "twiddle factors" and a permutation matrix that reorders each index according to its flipped (bit-reversed) binary representation are both compact and give an alternative perspective on how FFTs are computed as a set of sparse matrix operations.
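To illustrate just the bit-reversal part (assuming bitrevorder from the Signal Processing Toolbox; the second line builds the same indices by hand):
bitrevorder(0:7)                     % returns 0 4 2 6 1 5 3 7
bin2dec(fliplr(dec2bin(0:7, 3))).'   % same result: flip each 3-bit representation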
The lines below implement a simple Decimation in Frequency (DIF) radix-2 forward FFT. While the steps may seem cumbersome, the structure is convenient to reuse for forward/inverse FFTs, radix-4/split-radix, or decimation-in-time, and I believe it is a fair representation of how in-place FFTs tend to be implemented in the real world.
N = 4;
x = randn(N, 1) + 1j*randn(N, 1);
T1 = exp(-1j*2*pi*([zeros(1, N/2), 0:(N/2-1)]).'/N);  % twiddle factors for stage 1
M0 = kron(eye(2), fft(eye(2)));                       % stage 2: two 2-point DFTs
M1 = kron(fft(eye(2)), eye(2));                       % stage 1: butterflies across the halves
X = bitrevorder(x.' * M1 * diag(T1) * M0);            % bitrevorder undoes the DIF output ordering
X_ref = fft(x);
assert(max(abs(X(:) - X_ref(:))) < 1e-6)
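The same factorization extends to any power-of-two N by applying one sparse block-diagonal stage per radix-2 split. The loop below is my own generalization of the N = 4 case above, not code from any library:
N = 8;
x = randn(N, 1) + 1j*randn(N, 1);
y = x;
n = N;
while n > 1
    T = exp(-1j*2*pi*([zeros(1, n/2), 0:(n/2-1)]).'/n);             % twiddles for blocks of size n
    y = kron(eye(N/n), diag(T) * kron(fft(eye(2)), eye(n/2))) * y;  % one block-diagonal stage
    n = n/2;
end
X = bitrevorder(y);   % undo the DIF output ordering
assert(max(abs(X - fft(x))) < 1e-6)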
C. F. Van Loan's "Computational Frameworks for the Fast Fourier Transform" (SIAM, 1992) is a great book on this subject.