First off: are you sure what you're doing here is a constant-Q transform? I'm missing the CQ bin-dependent bandwidth, or equivalently, the CQ bin-dependent window length. And while you're effectively using the rectangular window (by not windowing at all), which the original paper explicitly discourages, you also never truncate your sum.
But: if your unit tests say your CQT results are close enough to a reference implementation (Matlab? Librosa? In any case, something where you can also write down the expected results by hand), I'd say move on. (Big subtext here: you're writing signal processing algorithms. Make sure you unit-test them. It's no fun finding a fundamental problem later on, especially not when you have to hand a report to someone. When performance-optimizing software, this makes life easy: first test your unoptimized version against known-correct results, then use that unoptimized version as the reference for your progressively optimized versions, with any input, including randomized input!)
I don't see the FFT as a direct drop-in speedup for what your code does, because your freq values aren't restricted to multiples of samplingRate / signal.size(), so FFT bins won't necessarily hit the right frequencies. That's actually just paraphrasing the opening sentence of the paper that introduced the constant-Q transform, so I'd strongly recommend you go and read the original: Judith C. Brown, "Calculation of a constant Q spectral transform", 1990 (available online).
To remedy that, you could sufficiently zero-pad the signal (at the cost of a memory copy and an oversized FFT), or interpolate between FFT bins (which also comes at a computational cost). In both cases, you'd be introducing approximation errors.
You can, of course, view your correlation with the frequency sample vector (phases 2.0 * PI * freq / samplingRate * n for n = 0, …, size-1) as a (time-reversed, conjugated) convolution, and then use fast convolution based on precomputed FFTs of that vector. BUT you'd need to precompute those transforms, keep them around in RAM, zero-pad your signal by their length, and use the overlap-save method to compute the correlation. This works out if your signal is long enough (rule of thumb: on PCs, fast convolution beats direct convolution once the transforms are 128 bins or longer).
However: yes, there seem to be FFT-based, sped-up versions of the constant-Q transform. I only know about these from Wikipedia's "Constant-Q transform" article, so I'd recommend reading that article's section on fast calculation; it links to papers. I suspect what the first paper (Judith Brown, Miller Puckette: "An efficient algorithm…") does is pretty similar to the fast-convolution filter bank I propose above. You'll have to read it in detail.
Generally, for small sizes, the question is whether you'd gain enough right now to be worth the effort. You don't seem to be on a deeply embedded system (you're happily allocating memory at runtime, which isn't typical for software running on microcontrollers or DSPs). So, before optimizing software, make sure you're actually optimizing the right things:
If we're already talking performance: it's significantly more efficient to preallocate the space for cqtResult to begin with. That gives you a zero-filled vector, so you don't need the auxiliary sum variable either, and, more importantly, it avoids the memory allocations and the copying of existing vector contents to a new location whenever the vector outgrows its previous allocation. Memory allocation often trumps the cost of the actual math in algorithms!
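A minimal sketch of what I mean (function and parameter names are my guesses at your interface, not your actual code): one allocation up front, and the value-initialized elements double as the accumulators.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical interface: one output bin per entry in freqs.
std::vector<std::complex<double>> cqt(const std::vector<double>& signal,
                                      const std::vector<double>& freqs,
                                      double samplingRate) {
    // Single allocation; elements start as 0+0j, so no separate `sum`
    // variable and no reallocate-and-copy from repeated push_back.
    std::vector<std::complex<double>> cqtResult(freqs.size());
    for (std::size_t k = 0; k < freqs.size(); ++k) {
        const double w = 2.0 * M_PI * freqs[k] / samplingRate;
        for (std::size_t n = 0; n < signal.size(); ++n)
            cqtResult[k] += signal[n] * std::exp(std::complex<double>(0.0, -w * n));
    }
    return cqtResult;
}
```

If the number of bins is known at compile time, a std::array would even keep the result off the heap entirely.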
Your $\exp(0+j\omega n)$ is, of course, a bit inefficient, because std::exp(std::complex<>) has to compute exp(Re(z)) and multiply by it for every sample, even though that factor is always 1. I suspect it's still reasonably fast on any modern standard library with compiler optimizations enabled. As Hilmar just noted in the comments, you could try precomputing these values for each value of freq; it's not as if you'll be using different frequency bins on every computation!
Another observation: the CQ transform's lower bins really don't need to run at the same sampling rate as the higher bins. Once you actually have a sensible windowing of your signal (not rectangular, since the sinc has sidelobes all over the spectrum), it might make sense to look (and test!) which CQ bins you can compute on, say, a 1/8-band downsampled signal: downsample the signal by a factor of 8 and increase the frequency you test against by the same factor. Same for 1/4 and 1/2. (This works because, by construction, low CQ bins operate on low-pass signals; the high frequencies don't matter to them at all, so filtering those frequencies away beforehand doesn't change the result. And a signal where only 1/N of the Nyquist bandwidth is used can be losslessly downsampled by a factor of N.)
The trick then becomes to reduce the computational effort stage by stage: first compute the half-rate downsampled signal and use it for the bins that need at most half the bandwidth, then downsample that by 1/2 again and use it for the bins that only need a quarter of the original bandwidth, and so on. As long as you're saving significant computation that way, "downsample by half" is a relatively cheap operation.
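The cascade could be sketched like this. Caveat: the two-point average below is only a placeholder for a proper half-band low-pass filter; a real implementation needs a decent FIR before decimating, or aliasing will leak into the low bins.

```cpp
#include <vector>

// One cascade stage: low-pass (crudely, here) and keep every second sample.
// Replace the two-point average with a real half-band FIR in production.
std::vector<double> downsampleByTwo(const std::vector<double>& x) {
    std::vector<double> y;
    y.reserve(x.size() / 2);
    for (std::size_t n = 0; n + 1 < x.size(); n += 2)
        y.push_back(0.5 * (x[n] + x[n + 1]));
    return y;
}

// Usage idea: compute the top octave of CQ bins on the full-rate signal,
// then stage = downsampleByTwo(stage) and reuse it (with samplingRate
// halved accordingly) for each successively lower octave.
```

Each stage halves both the data length and the work for all the bins below it, which is where the savings come from.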