First off: are you sure what you're doing here is a constant-Q transform? I'm missing the CQ bin-dependent bandwidth, or equivalently, the CQ bin-dependent window length. And while you're effectively using the rectangular window (by not windowing at all), which the original paper explicitly discourages, you also never truncate your sum.
But: if your unit tests say your CQT results are close enough to a reference implementation (Matlab? Librosa? In any case, something where you can also write down the expected results by hand), I'd say move on. (Big subtext here: you're writing signal processing algorithms. Make sure you unit-test them. It's no fun finding a fundamental problem later on, especially not when you have to hand a report to someone. When performance-optimizing software, this makes life easy: first test your unoptimized version against known-correct results, then use that unoptimized version as the reference for your progressively optimized versions, with any input, including randomized input!)
I don't see the FFT as a direct drop-in speedup for what your code does, because your freq values aren't restricted to multiples of samplingRate / signal.size(), so FFT bins won't necessarily hit the right frequencies. That's actually just paraphrasing the opening sentence of the paper that introduced the constant-Q transform, so I'd strongly recommend you go and read the original: Judith C. Brown, "Calculation of a constant Q spectral transform", 1990 (available online).
To remedy that, you could sufficiently zero-pad the signal (at the cost of a memory copy and an oversized FFT), or interpolate between FFT bins (which also comes at a computational cost). In both cases, you'd be introducing approximation errors.
You can, of course, view your correlation with the frequency sample vector (phases 2.0 * PI * freq / samplingRate * n for n = 0, …, size-1) as a (time-reversed, conjugated) convolution, and then use fast convolution based on precomputed FFTs of that vector. BUT you'd need to precompute those transforms, keep them around in RAM, zero-pad your signal by their length, and use the overlap-save method to compute the correlation. This works out if your signal is long enough (rule of thumb: on PCs, fast convolution beats direct convolution once the transforms are 128 bins or longer).
However: yes, there seem to be FFT-based, sped-up versions of the constant-Q transform. I only know about these from Wikipedia's "Constant-Q transform" article, so I'd recommend reading that article's section on fast calculation; it links to papers. I suspect what the first paper (Judith Brown, Miller Puckette: "An efficient algorithm…") does is pretty similar to the fast-convolution filter bank I propose above. You'll have to read it in detail.
Generally, for small sizes, the question is whether you'd gain enough right now to be worth the effort. You don't seem to be on a deeply embedded system (you're happily allocating memory at runtime, which isn't typical for software running on microcontrollers or DSPs). So, before optimizing software, make sure you're actually optimizing the right things:
If we're already talking performance: it's significantly more efficient to preallocate the space for cqtResult to begin with. That gives you a zero-filled vector, so you don't need the auxiliary sum variable either, and, more importantly, it avoids the memory allocations and the copying of existing vector contents to a new location whenever the vector outgrows its previous allocation. Memory allocation often trumps the cost of the actual math in algorithms!
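A minimal sketch of what I mean (function and parameter names are my guesses at your interface, not your actual code): one allocation up front, and the value-initialized elements double as the accumulators.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical interface: one output bin per entry in freqs.
std::vector<std::complex<double>> cqt(const std::vector<double>& signal,
                                      const std::vector<double>& freqs,
                                      double samplingRate) {
    // Single allocation; elements start as 0+0j, so no separate `sum`
    // variable and no reallocate-and-copy from repeated push_back.
    std::vector<std::complex<double>> cqtResult(freqs.size());
    for (std::size_t k = 0; k < freqs.size(); ++k) {
        const double w = 2.0 * M_PI * freqs[k] / samplingRate;
        for (std::size_t n = 0; n < signal.size(); ++n)
            cqtResult[k] += signal[n] * std::exp(std::complex<double>(0.0, -w * n));
    }
    return cqtResult;
}
```

If the number of bins is known at compile time, a std::array would even keep the result off the heap entirely.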
Your $\exp(0+j\omega n)$ is, of course, a bit inefficient, because std::exp(std::complex<>) has to compute exp(Re(z)) and multiply by it for every sample, even though that factor is always 1. I suspect it's still reasonably fast on any modern standard library with compiler optimizations enabled. As Hilmar just noted in the comments, you could try precomputing these values for each value of freq; it's not as if you'll be using different frequency bins on every computation!
Another observation: the CQ transform's lower bins really don't need to run at the same sampling rate as the higher bins. Once you actually have a sensible windowing of your signal (not rectangular, since the sinc has sidelobes all over the spectrum), it might make sense to look (and test!) which CQ bins you can compute on, say, a 1/8-band downsampled signal: downsample the signal by a factor of 8 and increase the frequency you test against by the same factor. Same for 1/4 and 1/2. (This works because, by construction, low CQ bins operate on low-pass signals; the high frequencies don't matter to them at all, so filtering those frequencies away beforehand doesn't change the result. And a signal where only 1/N of the Nyquist bandwidth is used can be losslessly downsampled by a factor of N.)
The trick then becomes to reduce the computational effort stage by stage: first compute the half-rate downsampled signal and use it for the bins that need at most half the bandwidth, then downsample that by 1/2 again and use it for the bins that only need a quarter of the original bandwidth, and so on. As long as you're saving significant computation that way, "downsample by half" is a relatively cheap operation.
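The cascade could be sketched like this. Caveat: the two-point average below is only a placeholder for a proper half-band low-pass filter; a real implementation needs a decent FIR before decimating, or aliasing will leak into the low bins.

```cpp
#include <vector>

// One cascade stage: low-pass (crudely, here) and keep every second sample.
// Replace the two-point average with a real half-band FIR in production.
std::vector<double> downsampleByTwo(const std::vector<double>& x) {
    std::vector<double> y;
    y.reserve(x.size() / 2);
    for (std::size_t n = 0; n + 1 < x.size(); n += 2)
        y.push_back(0.5 * (x[n] + x[n + 1]));
    return y;
}

// Usage idea: compute the top octave of CQ bins on the full-rate signal,
// then stage = downsampleByTwo(stage) and reuse it (with samplingRate
// halved accordingly) for each successively lower octave.
```

Each stage halves both the data length and the work for all the bins below it, which is where the savings come from.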