
I have a matrix M of size (37, N) and an additional reference vector of size (37, 1).

I am looking for a way to compute the Spearman correlation between each sample of M and my reference, to obtain an (N, 1) matrix as a result. Using a for loop I can do:

from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

expected_result = []
for element in M.T:
    expected_result.append(spearmanr(element, reference).statistic)
print(np.array(expected_result).shape)

It gives what I want (shape (100,)), but I would like it to be vectorized to speed up the computation. I tried the following code:

from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

res = spearmanr(M, reference)

but res.statistic ends up with shape (101, 101), which is even more confusing to me.

Is there a way to vectorize this?

Thank you

  • Side question (I think it is irrelevant to your question, but it bothers me, so I ask): what is that M = np.random line you put in both your code blocks? Is it supposed to achieve something? Commented Jul 5 at 13:53
  • You may be interested in a recent RFC discussing potential changes to this interface: github.com/scipy/scipy/issues/23238 Commented Jul 5 at 20:47
  • @chrslg My bad, this line is just a typo, I removed it Commented Jul 6 at 11:17

2 Answers


The thing is that spearmanr does not behave the same in 1D and 2D.

In 1D (spearmanr(M[:,0], reference), spearmanr(M[:,1], reference), ..., as you are doing in your loop), it does what you expect from it.

But in 2D, each column is considered a different variable to compare. So if you pass a 37×100 matrix to spearmanr (spearmanr(M) for example), what it does is compute the correlation for every combination of columns of M, resulting in a 100×100 array of results spearmanr(M[:,i], M[:,j]) for all 100×100 possible pairs (i, j).

Note that I've used only one argument here. There is a reason for that: usually, people either pass two 1D arrays to compare, or they pass one 2D array to compare each of its columns against each other. It is not usual (but not impossible) to pass two 2D arrays. But when you do, it is just as if you were concatenating the columns of the two arrays before calling spearmanr on the resulting 2D array.
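This concatenation behaviour is easy to check directly; here is a minimal sketch (using the same shapes as in the question) showing that passing two 2D arrays gives the same statistic as column-stacking them first:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

# two 2D inputs are treated as the column-stack of both
res_two = spearmanr(M, reference).statistic
res_cat = spearmanr(np.hstack([M, reference])).statistic

print(res_two.shape)  # (101, 101)
assert np.allclose(res_two, res_cat)
```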

So, in your case, what you are doing is calling spearmanr with 100 columns (M) and 1 column (reference), all of 37 rows. It is exactly as if you had called it with 101 columns, and the result is the 101×101 possible combinations of correlations.

Hence, since you are interested only in the correlation between the 100 columns of M (the first 100 of the resulting 101 columns) and the single column of reference (the 101st of the resulting 101 columns), the result you are looking for is the last column of the result of spearmanr(M, reference):

spearmanr(M, reference).statistic[:-1, -1] 

Which means that you compute way too many things.

Still, in your case, it is faster (because of vectorization) to compute all those 101×101 correlations and then drop 1+100×101 of the results, keeping only 100:

from scipy.stats import spearmanr
import numpy as np
import timeit

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

def sp0(C, reference):
    return spearmanr(C, reference).statistic

def sp1(M, reference):
    expected_result = []
    for element in M.T:
        expected_result.append(sp0(element, reference))
    return np.array(expected_result)

def sp2(M, reference):
    n = M.shape[1]
    return spearmanr(M, reference).statistic[:n, n]

print(timeit.timeit(lambda: sp1(M, reference), number=100))
print(timeit.timeit(lambda: sp2(M, reference), number=100))

On my (slow) computer, that is 60 ms per run (your for-loop method) vs 17 ms (computing the 101×101 result with the vectorized version and keeping only 100).

Of course, in the long run (if that 100 becomes bigger), the vectorized version would lose.

You may be tempted to vectorize with np.vectorize. But that is roughly the same as a for loop:

sp3 = np.vectorize(sp0, signature='(m),(m,1)->()')
print(timeit.timeit(lambda: sp3(M.T, reference), number=100))

gives 54 ms (so basically the same as 60, or almost).

Lastly, you can go back to what a Spearman correlation is: just a Pearson correlation between the ranks of the values.

Assuming there are no ties (a very reasonable assumption if the data are continuous random variables, as yours are):

def sp4(M, reference):
    n = len(reference)  # 37
    mean = (n - 1) / 2
    var = (n - 1) * (n + 1) / 12
    rgm = M.argsort(axis=0).argsort(axis=0)
    rgr = reference.argsort(axis=0).argsort(axis=0)
    return ((rgm * rgr).mean(axis=0) - mean**2) / var

This time, you have it both ways: it is vectorized, and you don't compute useless results. Timing is, as expected, much lower: 0.021 ms (so a factor of 300 over your code).

Note that, since I assume no ties, the ranks are 0 to 36 (or 1 to 37, it doesn't matter), so I can compute the mean and variance easily (the mean is 18, whatever the data; and the variance, based on the sum of the first k squares, k(k+1)(2k+1)/6, is as shown).
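Those closed-form values are easy to verify numerically; a quick check of the no-tie rank mean and variance used in sp4:

```python
import numpy as np

n = 37
ranks = np.arange(n)  # ranks 0..36, assuming no ties

print(ranks.mean())  # 18.0  == (n-1)/2
print(ranks.var())   # 114.0 == (n-1)*(n+1)/12
```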

Edit

I see that I was too slow, and in the meantime basically the same answer was given (I should have checked while I was playing with this). Since I played a bit more with the timings, I leave it anyway. One small implementation difference (besides the fact that I compute the mean and variance of the ranks directly, assuming no ties) is the use of the .argsort().argsort() trick, which is faster for small enough data. But rankdata becomes faster in the long run (around n=20000 on my computer, so way over 37). Plus, indeed, rankdata does not assume no ties as I do. If ties are possible, then not only should the .argsort().argsort() trick be replaced by rankdata, but also using just (n-1)/2 and (n-1)(n+1)/12 as mean and variance is no longer possible, and we need to compute all 101 means (not really a problem though, since this is not where CPU time is really spent anyway).

Here is how to do it with rankdata, which, again, is faster in the long run (for n > 20k approximately on my machine; clearly, the 2M samples you announce in a comment are way more than 20k, but unless I am mistaken, 2M is what replaces the 100 in your real data, not the 37, right? So if n remains 37, rankdata is really slower: 50 times slower), and which works even with ties. Another consequence of possible ties is that I cannot assume the mean rank is (n-1)/2 (if ranks go from 0 to n-1; it would be (n+1)/2 with rankdata, which gives ranks from 1 to n when there are no ties; it doesn't matter whether we start at 0 or 1, as long as the mean is consistent, since it is centered), so I have to compute the mean rank for both reference and each column of M. Likewise, I can't assume either that the variance is (n-1)(n+1)/12.

from scipy.stats import rankdata

def sp5(M, reference):
    rgm = rankdata(M, axis=0)
    rgr = rankdata(reference, axis=0)
    return ((rgm * rgr).mean(axis=0) - rgm.mean(axis=0) * rgr.mean()) / rgm.std(axis=0) / rgr.std()

So, if it is that 37 that becomes big, then, no question, rankdata (and that sp5 version) is better. It is faster and handles ties.

If that 37 remains 37, then rankdata is (I frankly don't know why) 50 times slower, so you have to decide whether handling ties is really worth the ×50 slowdown (or almost ×50: the computation of the ranks is not everything, but CPU-wise, most of the time is spent in this rank computation).
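If you want to see where the crossover lies on your own machine, here is a minimal timing sketch for just the ranking step (the absolute numbers will of course vary with hardware and SciPy version):

```python
import timeit

import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
M = rng.standard_normal((37, 100))  # same shape as in the question

# column-wise ranks: double-argsort trick vs rankdata
t_argsort = timeit.timeit(lambda: M.argsort(axis=0).argsort(axis=0), number=1000)
t_rankdata = timeit.timeit(lambda: rankdata(M, axis=0), number=1000)

print(f"argsort trick: {t_argsort:.4f} s, rankdata: {t_rankdata:.4f} s")
```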

I don't know what your data are. But if they are float (non-discrete, I mean) measurements, ties are unlikely, and even when they occur, they reflect uncertainty more than a real tie. I mean, if two friends A and B both tell you that their height is 1.79 m, A&lt;B has a 50% chance of being true, and A&gt;B has a 50% chance of being true. What is surely not true is A=B (one is taller than the other; there has to be some mm, micron, nanometer, ... of difference, but they can't have the exact same height; the only thing is that you don't know who is taller). So if those 37 data points were the heights of 37 people, handling ties would just be trying to mask uncertainty.


3 Comments

Thank you a lot for this information and for looking for such an efficient means to compute it. My example was 100 samples long, but I would like to apply it to 2M samples and even more after that, although it stays at 37 obs. Just in case, I would like to understand: in the case of duplicates in the data, the computation would be "(np.mean(a * b, axis=0) - (np.mean(a, axis=0) * np.mean(b, axis=0))) / (np.std(a, axis=0) * np.std(b, axis=0))", is that right?
Yes, that is right. I edited my answer to provide code for ties (I did it without knowing whether it was the 100 that becomes 2M, or the 37, so I commented on both cases. But I see now that your comment makes it clear that 37 remains 37. Which means that you are in the "hard decision" case, where you have to decide whether to use the slower rankdata to handle potential ties (ties among 37 obs; whether there are ties between 100 or 2M samples is irrelevant), or prefer to risk ignoring some ties to have a faster run. (Maybe there are other alternatives than the argsort().argsort() trick and rankdata.)
I mean, I can't see any valid reason why rankdata has to be so slow (with n=37) compared to the "double argsort". So it is probably just that it was implemented under the assumption that n is way bigger than 37. So maybe some other sort-and-count based method would be faster, while still handling ties.

There are a couple of relatively simple approaches available.

Looking through the docs, we find that the result of scipy.stats.spearmanr contains the correlations between every possible pair of input columns. Your 100-element array is therefore

res.statistic[:100, -1] 

or

res.statistic[-1, :100] 

The slightly more complex, but possibly more efficient, approach for larger datasets is to hand-roll the computation of the coefficient.

If you assume uniqueness, you could use numpy.argsort. For correctness, let's use scipy.stats.rankdata instead.

from scipy.stats import rankdata
import numpy as np

a = rankdata(M, axis=0)
b = rankdata(reference, axis=0)
result = (np.mean(a * b, axis=0) - np.mean(a, axis=0) * np.mean(b, axis=0)) / (np.std(a, axis=0) * np.std(b, axis=0))
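With no ties, the numpy.argsort route and rankdata produce the same ranking, up to a 0-based vs 1-based offset, which gives a quick sanity check for either choice:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
x = rng.standard_normal(37)

# no ties: double argsort gives 0-based ranks, rankdata gives 1-based ranks
r0 = x.argsort().argsort()
r1 = rankdata(x)
assert np.array_equal(r0 + 1, r1)
```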

2 Comments

While I get the spirit of the last line (I guess it is the same as what I did: compute Pearson's correlation of the ranks), I don't really get the variance of a*b. And the results do not match. I feel var and a*b are somehow redundant (the mean of a*b is a covariance; the var of a*b, I am not sure what that is). And probably the ranks need to be centered (so the classical 𝔼((a-𝔼(a))(b-𝔼(b))), or 𝔼(ab)-𝔼(a)𝔼(b)). Or am I missing some trick?
@chrslg. I think you're right. I wrote this entirely on mobile with no testing. The first result is likely correct. The variance in the second one is likely not. I'll fix it when I get to a computer.
