I will share thoughts and hints, as I don't have a clear solution to offer. My experience is in seismic data compression, for which I implemented different types of DCT, Haar, Walsh-Hadamard, wavelets, lapped orthogonal transforms and their extensions (LOT, MLT, GenLOT, GULLOT), on 2D seismic data of typical size $2000$ to $4000$ samples in one dimension and $50$ to $200$ in the other. The latter was an issue.
First, let me suppose that you apply a whole classical DCT/DST separably along each dimension of length 4 to 12. As Hilmar wrote, I doubt you can get much improvement from a prime-factor decomposition. One reason is that the performance gain is somewhat asymptotic: globally, you can hope for $O(N\log N)$ behavior, but it often comes with lower-order terms or constants that are no longer negligible when $N$ is small. There are also 2D and 3D separable or non-separable versions of the DCT (for images, volumes or video). Non-separable ones are generally better at data sparsification, but they are often not used because of the price paid in implementation complexity. Then again, the gain in number of operations is not overwhelming, and a bit ad hoc: one can save a couple more operations beyond the classical prime factoring, yet it seems very length-dependent. As Hilmar said, you could have manually optimized primitives for each case, but it would be very tedious. See the sketch below for a plain separable baseline to compare such primitives against.
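A minimal sketch of that baseline, assuming a hypothetical 5D chunk with axis lengths between 4 and 12 and SciPy available; it is not your code, just the reference any hand-tuned primitive would have to beat:

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
block = rng.standard_normal((4, 6, 8, 10, 12))   # hypothetical 5D chunk sizes

# Separable type-II DCT over all five axes; 'ortho' keeps it orthonormal
coeffs = dctn(block, type=2, norm="ortho")

# For lengths this small, the O(N log N) machinery brings little benefit:
# a direct N x N matrix multiply per axis (N <= 12) is often just as fast.
```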
The second thing that comes to mind is the issue of multidimensional array indexing (and I really am not an expert on that). Still, the (physical) contiguity of data samples seems to me to play a very important role. I have learned that for huge 3D seismic data volumes, it is common to store the data in three shuffled copies, each organized to have the best byte-wise contiguity along one of the three directions. In other words, people prefer to triplicate the data volume to benefit from a speed-up in each direction. Seismic data can be so huge that I have heard of it taking weeks on large super-computing (HPC) facilities to perform a mere byte reordering on petabytes of data. This might be worth looking at for 5D data. It could in turn be combined with the first point above: byte reordering combined with two optimized 2D DCTs, plus a final 1D one. I honestly do not know whether it is worth it for your data; a small sketch of the idea follows.
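A minimal sketch of the contiguity idea, assuming NumPy arrays; the seismic practice keeps the reordered copies permanently, whereas here they are rebuilt on the fly for each axis:

```python
import numpy as np
from scipy.fft import dct

data = np.random.default_rng(1).standard_normal((8, 8, 8, 8, 8))

def dct_along_axis(x, axis):
    # Move the target axis last and copy, so the 1D transforms stream
    # over contiguous memory; then move the axis back.
    moved = np.ascontiguousarray(np.moveaxis(x, axis, -1))
    out = dct(moved, type=2, norm="ortho", axis=-1)
    return np.moveaxis(out, -1, axis)

result = data
for ax in range(data.ndim):
    result = dct_along_axis(result, ax)
```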
A third improvement may come from the bit-wise implementation. There are a lot of tricks to play. For instance, if your 5D data is 10-bit, it is conceivable to pack several samples into a 64-bit word, saving a couple of buffer bits for carries. People did that in the past when computing was expensive: if you wanted to add three-bit words, you could pack them in pairs into a single eight-bit word, leaving one carry bit for each, and add two pairs at the same time. I'll come back to this later, on the transform side.
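A minimal SWAR-style sketch of that packing trick, with assumed 10-bit samples placed in 16-bit lanes of a 64-bit word so that one integer addition adds four samples at once (the spare 6 bits per lane absorb the carries):

```python
def pack4(a, b, c, d):
    # each input assumed to fit in 10 bits; 16-bit lanes leave carry room
    return a | (b << 16) | (c << 32) | (d << 48)

def unpack4(w):
    mask = 0xFFFF
    return (w & mask, (w >> 16) & mask, (w >> 32) & mask, (w >> 48) & mask)

x = pack4(1000, 512, 3, 700)
y = pack4(23, 511, 900, 100)
s = (x + y) & 0xFFFFFFFFFFFFFFFF    # four additions in one 64-bit add
print(unpack4(s))                   # (1023, 1023, 903, 800)
```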
Fourth, even if you think that adding non-existent data by zero-padding is a shame, I would suggest reconsidering it with a more elaborate signal extension. I have observed gains, both in quality and in speed, from extending a length of, say, 87 to 96 with a proper symmetric or extrapolated extension. So, in your case, using extension could allow you to keep primitives for lengths 4, 6, 8 and 12 only, and fill in the data for the intermediate lengths, as sketched below.
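A minimal sketch, assuming the goal is to extend an awkward length (10 here) up to the nearest supported primitive length (12) by mirroring rather than zero-padding:

```python
import numpy as np
from scipy.fft import dct, idct

x = np.arange(10, dtype=float)               # hypothetical length-10 slice
target = 12
xe = np.pad(x, (0, target - x.size), mode="symmetric")   # mirror the tail

c = dct(xe, type=2, norm="ortho")            # use the length-12 primitive
xr = idct(c, type=2, norm="ortho")[:x.size]  # drop the extension afterwards
```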
Fifth: there are alternatives to the floating-point DCT that can be power-savvy: integer DCTs, approximations of them (like the RCT), the Walsh-Hadamard transform (no multiplications), which can work well at small sizes (I used it when the data is poorly correlated in one dimension), etc. Transforms implemented with lifting steps save further computations.
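A minimal sketch of a fast Walsh-Hadamard transform, using only additions and subtractions, which is what makes it attractive on short, poorly correlated axes (lengths must be powers of two, e.g. 4 or 8):

```python
import numpy as np

def fwht(x):
    # In-place butterfly Walsh-Hadamard transform (unnormalized)
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < x.size:
        for i in range(0, x.size, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

print(fwht([1.0, 2.0, 3.0, 4.0]))   # [10., -2., -4., 0.]
```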
Sixth: let us now discuss the aim of the DCT and the "chunks of less than 12". Applying the DCT on small blocks, you are likely to get blocking artifacts and to lose compression efficiency, because you cannot exploit long-range correlations in some dimensions. Since you split dimensions longer than 12 samples in half, there are many alternatives that take the data as a whole and perform successive refinement. Wavelets can do that at quite a cheap price, as in the sketch below. Apparently you are after lossy compression, and many schemes other than the block DCT can do a great job.
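A minimal sketch of that whole-axis alternative, assuming PyWavelets (`pywt`) is acceptable: a short wavelet (Haar here) applied along a full dimension avoids the fixed block boundaries of a block DCT.

```python
import numpy as np
import pywt

data = np.random.default_rng(2).standard_normal((12, 12, 12))

# Two-level Haar decomposition along axis 0, whatever its length;
# 'symmetric' extension handles lengths that are not powers of two.
coeffs = pywt.wavedec(data, "haar", mode="symmetric", level=2, axis=0)
# coeffs = [approximation, detail_level2, detail_level1]; quantize or
# discard the fine details for lossy compression.
```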
Yet, all of the above depends on the correlation of your data, and on the compression ratio you can afford while meeting your subsequent processing needs. I truly believe, however, that you will need some combination of the above tracks.