
I have a benchmark that does an [MxN] * [Nx1] matrix multiplication. It uses NumPy (with MKL) on Windows:

    import timeit
    import numpy as np
    from numpy.random import random_sample

    NUMBER_OF_SAMPLES = 1000000
    NUMBER_OF_DIMENSIONS = 128

    dataset = random_sample((NUMBER_OF_SAMPLES, NUMBER_OF_DIMENSIONS)).astype(np.float32)
    feature = random_sample((NUMBER_OF_DIMENSIONS, 1)).astype(np.float32)
    print("Finished Generating the Data...")

    numbers = 1000
    total_time = timeit.timeit('np.dot(dataset, feature)', globals=globals(), number=numbers)
    print("Average Time %.3f" % float(total_time / numbers))

I benchmarked it on a Core i7 7700 (4 Cores / 8 Threads) and again on a Core i7 7820X (8 Cores / 16 Threads), with Hyper-Threading both enabled and disabled (disabling Hyper-Threading didn't change the results much):

    Core i7 7700  (4 Cores) || Data Count 1.000.000 || 128 Dimensional || Time 0.021 s
    Core i7 7820X (8 Cores) || Data Count 1.000.000 || 128 Dimensional || Time 0.019 s

I expected doubling the core count to roughly halve the time, but it barely changed anything.

Is there any way to improve this speed? Thanks.
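For reference, here is a back-of-envelope sketch of the data volume each call has to touch (the sizes and the 0.021 s figure come from the benchmark above; the variable names are just for illustration):

```python
# Rough estimate of memory traffic per np.dot call: the whole float32
# matrix must be streamed from RAM once per call.
samples, dims = 1_000_000, 128
bytes_per_call = samples * dims * 4      # float32 = 4 bytes per element
gb_per_call = bytes_per_call / 1e9       # ~0.512 GB per call
measured_time = 0.021                    # seconds, i7 7700 result above
print("%.3f GB per call -> %.1f GB/s effective bandwidth"
      % (gb_per_call, gb_per_call / measured_time))
```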

  • Well, you should really be using timeit, because you're calling time() and also appending to a list in the loop — calls that aren't vectorized and could take up a significant portion of the loop's execution time. But I wouldn't expect the kind of correlation with speed that you're expecting. Commented Jul 4, 2018 at 16:28
  • Matrix-vector products are not great candidates for parallelisation because of the low FLOP count. A matrix-matrix dot product will show far better parallel scalability. Commented Jul 5, 2018 at 5:59
  • It could be that hyperthreading doesn't help with matrix multiplications at all, I don't know. To check that hypothesis, repeat your test for 1...8 cores with HT enabled and plot the data (with elapsed time * num_cores on the y-axis, num_cores on the x-axis). For perfect parallelization the line would be totally flat; likely it will rise slightly. If HT doesn't help at all, the graph will start flat and rise sharply after core 4. If that's the case, adding real CPU cores would help (not just threads). I hear AMD offers some amazing chips for that, at a less insane price than Xeons… Commented Jul 5, 2018 at 14:03
  • Your Core i7-7700 has two 256-bit wide FMA3 units. That means that in this case one core in theory needs (512/8*3.9*10^9)/10^9=62.4 GB/s, which is more than your RAM is capable of (intermediate results and the vector stay in registers and L1 cache). Practically you won't see any speedup with more than two cores. Nevertheless, this doesn't explain why your i7 7820X isn't significantly faster. That processor has a quad-channel memory interface and should therefore be twice as fast. (Maybe it runs only in dual-channel mode?) Commented Jul 5, 2018 at 14:05
  • I benchmarked much the same thing on a 4th-gen Core i7. I also wrote a Numba version of the code (writing out the matrix-vector multiplication in for loops), which shows the same results, only a tiny bit faster. It would be interesting if someone could benchmark this on a workstation or server that definitely has more than two memory channels. I have also seen this behaviour many times in other code that runs into a memory bottleneck. You can also check the speedup with a bigger matrix-matrix multiplication. Commented Jul 5, 2018 at 15:43
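Following the last comment's suggestion, here is a small sketch of that check (the sizes are shrunk so it runs quickly, and the names are just illustrative): if the single-column product is memory bound, widening the right-hand side from 1 to K columns should cost far less than K times as much, because the matrix is streamed from RAM only once either way.

```python
import timeit
import numpy as np

# Smaller M than the original benchmark so this sketch finishes quickly.
M, N, K = 50_000, 128, 64
rng = np.random.default_rng(0)
A = rng.random((M, N), dtype=np.float32)
x = rng.random((N, 1), dtype=np.float32)   # single column: matrix-vector
X = rng.random((N, K), dtype=np.float32)   # K columns: matrix-matrix

reps = 20
t_vec = timeit.timeit('A @ x', globals=globals(), number=reps) / reps
t_mat = timeit.timeit('A @ X', globals=globals(), number=reps) / reps

# A memory-bound matvec makes this ratio come out well below K.
print("per call: 1 column %.5f s, %d columns %.5f s (ratio %.1fx)"
      % (t_vec, K, t_mat, t_mat / t_vec))
```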
