I will share thoughts and hints, as I don't have a clear solution to offer. My experience is in seismic data compression, for which I implemented different types of DCT, Haar, Walsh-Hadamard, wavelets, lapped orthogonal transforms and their extensions (LOT, MLT, GenLOT, GULLOT), on 2D seismic data of typical size $2000$ to $4000$ samples in one dimension and $50$ to $200$ in the other. The latter was an issue.
First, let me suppose that you apply a whole classical DCT/DST separably along each dimension of length 4 to 12. As Hilmar wrote, I doubt you can get much improvement from a prime-factor decomposition. One reason is that the performance gain is somewhat asymptotic: globally, you can hope for $O(N\log N)$ behavior, but it often comes with lower-order terms or constants that are no longer negligible when $N$ is small. There are also 2D and 3D separable or non-separable versions of the DCT (for images, volumes or video). Non-separable ones are generally better at data sparsification, but they are often not used because of the price paid in implementation complexity. Then again, the gain in number of operations is not overwhelming, and a bit ad hoc: one can save a couple more operations beyond the classical prime factoring, yet it seems very length-dependent. As Hilmar said, you could have manually optimized primitives for each case, but it would be very tedious. See the sketch below for a plain separable baseline to compare such primitives against.
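A minimal sketch of that baseline, assuming a hypothetical 5D chunk with axis lengths between 4 and 12 and SciPy available; it is not your code, just the reference any hand-tuned primitive would have to beat:

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
block = rng.standard_normal((4, 6, 8, 10, 12))   # hypothetical 5D chunk sizes

# Separable type-II DCT over all five axes; 'ortho' keeps it orthonormal
coeffs = dctn(block, type=2, norm="ortho")

# For lengths this small, the O(N log N) machinery brings little benefit:
# a direct N x N matrix multiply per axis (N <= 12) is often just as fast.
```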
The second thing that comes to mind is the issue of multidimensional array indexing (and I really am not an expert on that). Still, the (physical) contiguity of data samples seems to me to play a very important role. I have learned that for huge 3D seismic data volumes, it is common to store the data in three shuffled copies, each organized to have the best byte-wise contiguity along one of the three directions. In other words, people prefer to triplicate the data volume to benefit from a speed-up in each direction. Seismic data can be so huge that I have heard of it taking weeks on large super-computing (HPC) facilities to perform a mere byte reordering on petabytes of data. This might be worth looking at for 5D data. It could in turn be combined with the first point above: byte reordering combined with two optimized 2D DCTs, plus a final 1D one. I honestly do not know whether it is worth it for your data; a small sketch of the idea follows.
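A minimal sketch of the contiguity idea, assuming NumPy arrays; the seismic practice keeps the reordered copies permanently, whereas here they are rebuilt on the fly for each axis:

```python
import numpy as np
from scipy.fft import dct

data = np.random.default_rng(1).standard_normal((8, 8, 8, 8, 8))

def dct_along_axis(x, axis):
    # Move the target axis last and copy, so the 1D transforms stream
    # over contiguous memory; then move the axis back.
    moved = np.ascontiguousarray(np.moveaxis(x, axis, -1))
    out = dct(moved, type=2, norm="ortho", axis=-1)
    return np.moveaxis(out, -1, axis)

result = data
for ax in range(data.ndim):
    result = dct_along_axis(result, ax)
```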
A third improvement may come from the bit-wise implementation. There are a lot of tricks to play. For instance, if your 5D data is 10-bit, it is conceivable to pack several samples into a 64-bit word, saving a couple of buffer bits for carries. People did that in the past when computing was expensive: if you wanted to add three-bit words, you could pack them in pairs into a single eight-bit word, leaving one carry bit for each, and add two pairs at the same time. I'll come back to this later, on the transform side.
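A minimal SWAR-style sketch of that packing trick, with assumed 10-bit samples placed in 16-bit lanes of a 64-bit word so that one integer addition adds four samples at once (the spare 6 bits per lane absorb the carries):

```python
def pack4(a, b, c, d):
    # each input assumed to fit in 10 bits; 16-bit lanes leave carry room
    return a | (b << 16) | (c << 32) | (d << 48)

def unpack4(w):
    mask = 0xFFFF
    return (w & mask, (w >> 16) & mask, (w >> 32) & mask, (w >> 48) & mask)

x = pack4(1000, 512, 3, 700)
y = pack4(23, 511, 900, 100)
s = (x + y) & 0xFFFFFFFFFFFFFFFF    # four additions in one 64-bit add
print(unpack4(s))                   # (1023, 1023, 903, 800)
```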
Fourth, even if you think that adding non-existent data by zero-padding is a shame, I would suggest reconsidering it with a more elaborate signal extension. I have observed gains, both in quality and in speed, from extending a length of, say, 87 to 96 with a proper symmetric or extrapolated extension. So, in your case, using extension could allow you to keep primitives for lengths 4, 6, 8 and 12 only, and fill in the data for the intermediate lengths, as sketched below.
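A minimal sketch, assuming the goal is to extend an awkward length (10 here) up to the nearest supported primitive length (12) by mirroring rather than zero-padding:

```python
import numpy as np
from scipy.fft import dct, idct

x = np.arange(10, dtype=float)               # hypothetical length-10 slice
target = 12
xe = np.pad(x, (0, target - x.size), mode="symmetric")   # mirror the tail

c = dct(xe, type=2, norm="ortho")            # use the length-12 primitive
xr = idct(c, type=2, norm="ortho")[:x.size]  # drop the extension afterwards
```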
Fifth: there are alternatives to the floating-point DCT that can be power-savvy: integer DCTs, approximations of them (like the RCT), the Walsh-Hadamard transform (no multiplications), which can work well at small sizes (I used it when the data is poorly correlated in one dimension), etc. Transforms implemented with lifting steps save further computations.
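A minimal sketch of a fast Walsh-Hadamard transform, using only additions and subtractions, which is what makes it attractive on short, poorly correlated axes (lengths must be powers of two, e.g. 4 or 8):

```python
import numpy as np

def fwht(x):
    # In-place butterfly Walsh-Hadamard transform (unnormalized)
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < x.size:
        for i in range(0, x.size, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

print(fwht([1.0, 2.0, 3.0, 4.0]))   # [10., -2., -4., 0.]
```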
Sixth: let us now discuss the aim of the DCT and the "chunks of less than 12". Applying the DCT on small blocks, you are likely to get blocking artifacts and to lose compression efficiency, because you cannot exploit long-range correlations in some dimensions. Since you split dimensions longer than 12 samples in half, there are many alternatives that take the data as a whole and perform successive refinement. Wavelets can do that at quite a cheap price, as in the sketch below. Apparently you are after lossy compression, and many schemes other than the block DCT can do a great job.
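A minimal sketch of that whole-axis alternative, assuming PyWavelets (`pywt`) is acceptable: a short wavelet (Haar here) applied along a full dimension avoids the fixed block boundaries of a block DCT.

```python
import numpy as np
import pywt

data = np.random.default_rng(2).standard_normal((12, 12, 12))

# Two-level Haar decomposition along axis 0, whatever its length;
# 'symmetric' extension handles lengths that are not powers of two.
coeffs = pywt.wavedec(data, "haar", mode="symmetric", level=2, axis=0)
# coeffs = [approximation, detail_level2, detail_level1]; quantize or
# discard the fine details for lossy compression.
```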
Yet, all of the above depends on the correlation of your data, and on the compression ratio you can afford while meeting your subsequent processing needs. I truly believe, however, that you will need some combination of the above tracks.