Linear discriminant analysis in R

Question

I have two matrices, both 46175 * 741 (rows are variables, columns are individuals/observations). Matrix A contains a categorical (perhaps dependent) variable (0/1/2 or NA), and Matrix B is continuous and independent (ranging from 0 to a couple of hundred).

I want to see if there is a relationship between these data. Firstly, am I correct in thinking that LDA in R is a valid way to test this? If so, how exactly do I run this?

z <- lda(data= MatrixA , x= MatrixB, grouping=MatrixB)

is the closest I've gotten, but it doesn't work. I get:

 > z <- lda(data= MatrixA , x= MatrixB, grouping=MatrixB)
 Error in lda.default(x, grouping, ...) : nrow(x) and length(grouping) are different

Data Snippet:

 Matrix A
 -------------------------
 SampleA  SampleB  SampleC
 NA       0        1
 NA       NA       NA
 1        2        0
 0        0        0

 Matrix B
 -------------------------
 SampleA  SampleB  SampleC
 0        0        0
 83       124      56
 39       45       5
 12714    12477    8751

The matrices contain data on the same individuals, with rows and columns in the same order. Matrix A contains genotypes (genetic information), each of which is either 0/1/2 or could not be obtained (NA). Matrix B is the number of reads aligned to that region. A zero in Matrix B is therefore not the same as a zero in Matrix A; it is closer to an NA in Matrix A.

It "doesn't work"? What happens? What error messages do you get? What is the str of MatrixA and MatrixB? What are those matrices?
– Peter Flom, Commented Jul 10, 2013 at 23:28
I added the error message I get. Is that because of the NAs? Removing the NAs in MatrixA and the 0s in MatrixB (which are equivalent) does not yield matrices of the same dimensions. The str of both is "46175 obs. of 741 variables", all integers.
– cianius, Commented Jul 10, 2013 at 23:37
Why not just plot the distributions of B for A = 0, 1, 2, NA side by side?
– Nick Cox, Commented Jul 11, 2013 at 0:57
The data are too big for plot(MatrixA, MatrixB): I get "Error in plot.new() : figure margins too large". How would you suggest I plot the distributions? Just a histogram?
– cianius, Commented Jul 11, 2013 at 9:49
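
One way to act on the suggestion of plotting B for each value of A, sketched on simulated data because the real matrices are not shown in full. The names `A`, `B`, `genotype` and `reads` are hypothetical stand-ins, and pairing the two matrices cell by cell assumes, as stated in the question, that rows and columns are in the same order.

```r
set.seed(3)

# Simulated stand-ins for MatrixA (genotypes) and MatrixB (read counts)
A <- matrix(sample(c(0L, 1L, 2L, NA), 2000, replace = TRUE), nrow = 40)
B <- matrix(rpois(2000, lambda = 50), nrow = 40)

# Treat every cell as one observation and give NA its own explicit label
g <- as.character(as.vector(A))
g[is.na(g)] <- "NA"
genotype <- factor(g, levels = c("0", "1", "2", "NA"))
reads    <- as.vector(B)

# One box per genotype class; log1p() copes with zero read counts
boxplot(log1p(reads) ~ genotype,
        xlab = "Genotype in A", ylab = "log(1 + reads in B)")

# Numerical version of the same comparison
tapply(reads, genotype, median)
```

Flattening both matrices with `as.vector()` avoids the "figure margins too large" error, which comes from `plot()` trying to draw a panel for every pair of columns.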

cbeleites · Accepted Answer · 2013-07-11 11:42:20Z

There are several problems here.

Each row should correspond to one case/individual; each column to one variable.
If I understand your description correctly, that means you need to transpose your data.
This also means that you have more variates than individuals, thus the variance-covariance matrix is not of full rank which leads to problems during its inversion inside lda.
You need to drastically reduce the number of variates or increase the number of individuals before performing LDA (if I correctly understood your description of the data).
MASS::lda expects grouping to be a factor with one value per case (= row), not a matrix.
That's why it is complaining that length (grouping) is not the same as nrow (x): the length of a matrix is its total number of cells, not its number of rows.
It does not make any sense to give the same data for x and grouping: x should be the matrix with the independent variates, grouping is the dependent.
It is very unusual to give x, grouping and data.
Either give data and formula: with that you call the formula interface (lda.formula).
Or give x and grouping: that calls lda.default (a bit faster than the first option).
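
A minimal sketch of the points above, on simulated data since the real matrices are not available. The names `A`, `B`, `region`, `keep_cols` and `keep_rows` are hypothetical, and taking a single genotype row as the grouping with a handful of read-count rows as predictors is only an assumption about what the analysis could look like, not something the question specifies.

```r
library(MASS)

set.seed(1)
n_regions <- 50   # rows = variables (regions), as in the question's layout
n_samples <- 741  # columns = individuals

# Simulated stand-ins for MatrixA (genotypes 0/1/2/NA) and MatrixB (read counts)
A <- matrix(sample(c(0L, 1L, 2L, NA), n_regions * n_samples, replace = TRUE),
            nrow = n_regions)
B <- matrix(rpois(n_regions * n_samples, lambda = 50), nrow = n_regions)

# length() of a matrix counts all cells, which is what triggers the error
length(B) == nrow(B) * ncol(B)

# Transpose so that rows are individuals and columns are variables
x_all <- t(B)

# grouping: one genotype value per individual, taken from a single region
region   <- 1
grouping <- factor(A[region, ])

# Use far fewer variates than individuals and drop individuals with NA genotype
keep_cols <- 1:5
keep_rows <- !is.na(grouping)

z <- lda(x = x_all[keep_rows, keep_cols], grouping = grouping[keep_rows])
z$counts  # number of individuals per genotype class
```

With the full 46175 variates and only 741 individuals, the same call would run into the rank problem described above, which is why the sketch keeps only a few columns.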

edit:

The formula version lda (grouping ~ x) is equivalent to lda (x = x, grouping = grouping). If you have a data.frame data with columns x and grouping, then you'd use lda (grouping ~ x, data = data). Note that a column of a data.frame can hold a whole matrix.
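
The equivalence of the two interfaces can be checked on a small example. This is a sketch on simulated data; `dat`, `fit_formula` and `fit_default` are hypothetical names, and the matrix is stored as a single data.frame column to illustrate the note above.

```r
library(MASS)

set.seed(2)
n <- 100

# Hypothetical data: 3 continuous variates per individual, 3 groups
x        <- matrix(rnorm(n * 3), nrow = n)
grouping <- factor(sample(c("0", "1", "2"), n, replace = TRUE))

# A data.frame column can hold a whole matrix
dat   <- data.frame(grouping = grouping)
dat$x <- x

fit_formula <- lda(grouping ~ x, data = dat)     # dispatches to lda.formula
fit_default <- lda(x = x, grouping = grouping)   # dispatches to lda.default

# Compare the discriminant coefficients, ignoring row and column names
all.equal(unname(fit_formula$scaling), unname(fit_default$scaling))
```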

I know it's more normal to give the formula, but how do you construct that? Do you know what I should write given the data I have?
– cianius, Commented Jul 11, 2013 at 11:30
@pepsimax: I suspect that I still have not understood what your data is like, so I'm not completely sure a) whether LDA is appropriate, or b) how you should construct your call. Maybe you could show us a small part of your matrices?
– cbeleites, Commented Jul 11, 2013 at 11:43
Sorry, I thought my description was clear. I've added some sample data. Surely having fewer variates than individuals will mean I lose a lot of information?
– cianius, Commented Jul 11, 2013 at 11:57
@pepsimax: Having more variates than individuals means that you cannot distinguish between random differences between individuals and "real" effects. What is the meaning of the rows in your matrices?
– cbeleites, Commented Jul 11, 2013 at 12:27
