I need to implement an encoder which compresses a 5-dimensional structure of 10 bits values. Each dimension has between 4 and 12 elements. If a dimension ever has more than 12 elements, it is partitioned in half. So far, I have been using a separable DCT to do that, but my implementation is very slow: I have implemented a naive DCT transform, which has complexity of $O(N^2)$. I have tried some fast algorithms, but they seem to be slower than the naive approach: a naive DCT for 6 elements is faster the Fast version, but for $N = 4$ and $8$, the fast approach is definitely faster. Also, I cannot force the friendly power-of-2 sizes.
I remember the Cooley–Tukey FFT algorithm decomposes a sequence of size $N$ into their factors $N_1$ and $N_2$ such that $N = N_1N_2$. I also know that there are specialized algorithms for 2D DCT due to their use in video and image codecs.
My question is: Is there any reference about a fast multidimensional DCT or even a "reversed" Cooley-Tukey for DCT of which I could transform my 5D array into 1D and use a fast DCT?
Exrta bits: We are researching not only the DCT-II but the whole family of trigonometric transforms., insights about any of the DST and DCT are welcome!
Edited 9/23
Just as an extra comment, this solution is being implemented in C++.