I've asked a question about Scaling Matrix Multiplication by CPU Cores on StackOverflow and it seems that merely adding more CPU cores to this problem won't reduce the time to do Matrix Multiplications dramatically.
Now I'm wondering if scalable architectures are the answer for large-scale matrix multiplications OR a strong server with lots of cores and memory?
The latency of scalable architectures like Hadoop is a negative aspect but I'm also wondering if throwing more powerful CPUs (like Intel Core i9 7980XE) at the problem would be able increase performance considerably.
What I'm aiming for is a High-Throughput and Low-Latency architecture and for the sake of argument, let's pretend Price is not a constraint (But please don't advice SuperComputer architectures! The Price of those things are actually a constraint!)