
I am implementing a PCA algorithm in MATLAB. I see two different approaches to calculating the covariance matrix:

C = sampleMat.' * sampleMat ./ nSamples; 

and

C = cov(data); 

What is the difference between these two methods?

PS 1: When I use cov(data), is the following mean-centering step unnecessary?

meanSample = mean(data,1);
data = data - repmat(meanSample, nSamples, 1);

PS 2:

In the first approach, should I divide by nSamples or nSamples - 1?

1 Answer


In short: cov mainly just adds convenience to the bare formula.

If you type

edit cov 

You'll see a lot of stuff, with these lines all the way at the bottom:

xc = bsxfun(@minus,x,sum(x,1)/m);  % Remove mean
if flag
    xy = (xc' * xc) / m;
else
    xy = (xc' * xc) / (m-1);  % DEFAULT
end

which is essentially the same as your first line, save for the subtraction of the column-means.

Read the wiki on sample covariances to see why there is a minus-one in the default path.
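To check the equivalence yourself, here is a minimal sketch. The random matrix and the variable names (data, centered, Cunbiased, Cbiased) are made up for illustration; the only assumption about cov is that cov(data) normalizes by N-1 and cov(data, 1) normalizes by N.

```matlab
% Hypothetical example data: 200 samples (rows), 3 variables (columns).
rng(0);
data = randn(200, 3);
nSamples = size(data, 1);

% Center each column, mirroring the bsxfun line shown from cov.
centered = bsxfun(@minus, data, mean(data, 1));

% Bare formula with both normalizations.
Cunbiased = (centered' * centered) / (nSamples - 1);  % compare with cov(data)
Cbiased   = (centered' * centered) / nSamples;        % compare with cov(data, 1)

% Largest absolute deviation from cov; should be at round-off level.
errUnbiased = max(max(abs(Cunbiased - cov(data))));
errBiased   = max(max(abs(Cbiased - cov(data, 1))));
fprintf('max deviation (N-1): %g, (N): %g\n', errUnbiased, errBiased);
```

So if you call cov, the manual mean subtraction from PS 1 is not needed, and the choice from PS 2 corresponds to the optional second argument of cov.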

Note however that your first line uses normal transpose (.'), whereas the cov-version uses conjugate-transpose ('). This will make the output of cov different in the context of complex-valued data.
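A small sketch of that difference, using made-up complex data (z, zc, Cplain and Cconj are illustrative names, not anything from cov itself). With the conjugate transpose the result is Hermitian with a real diagonal; with the plain transpose it generally is not.

```matlab
% Hypothetical complex-valued data: 50 samples, 2 variables.
rng(1);
z = randn(50, 2) + 1i * randn(50, 2);
m = size(z, 1);

% Center each column.
zc = bsxfun(@minus, z, mean(z, 1));

Cplain = (zc.' * zc) / (m - 1);   % plain transpose, as in the question
Cconj  = (zc'  * zc) / (m - 1);   % conjugate transpose, as in the cov snippet

% The conjugate-transpose version equals its own conjugate transpose.
isHermitian = max(max(abs(Cconj - Cconj'))) < 1e-12;

% The two results differ for complex input.
diffNorm = norm(Cplain - Cconj);
fprintf('Hermitian: %d, norm of difference: %g\n', isHermitian, diffNorm);
```

For real-valued data, .' and ' do the same thing, so both lines give the same matrix.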

Also note that cov is a function call to a non-built-in function. That means that there will be a (possibly severe) performance penalty when using cov in a loop; MATLAB's JIT compiler cannot accelerate non-built-in functions.
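If you want to measure this on your own machine rather than take it on faith, a rough timing sketch could look like the following. The data size and repetition count (nReps) are arbitrary assumptions, and the result depends heavily on the MATLAB version and the size of the data, so treat the numbers as indicative only.

```matlab
% Hypothetical workload: the same 1000-by-10 data set, processed nReps times.
rng(2);
data = randn(1000, 10);
nSamples = size(data, 1);
nReps = 200;

% Time repeated calls to cov.
tic;
for k = 1:nReps
    C1 = cov(data);
end
tCov = toc;

% Time the bare formula (centering plus matrix product).
tic;
for k = 1:nReps
    xc = bsxfun(@minus, data, mean(data, 1));
    C2 = (xc' * xc) / (nSamples - 1);
end
tBare = toc;

fprintf('cov: %.4f s, bare formula: %.4f s\n', tCov, tBare);
```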


15 Comments

With the caveat that complex numbers are handled differently from the code in the question.
Regarding your second edit: is it better to use the first line? And which one is correct, or are transpose and conjugate transpose equivalent for calculating the covariance?
@kamaci: it depends. If you need to calculate only 1 covariance matrix per run, it's just easier to use cov. If you need to do it hundreds of times in a loop, with different data sets, etc., using the bare formula will be much faster and is overall the better trade-off. As mentioned above: the output of cov will only be different from your first attempt, if your data is complex-valued. If it only contains real values, the outputs will be identical.
I will run it only once, but my data is too big. Is it still OK to use cov?
@kamaci: what do you mean, "too big"? How big is that?