For measuring the generalization error, you need to do the latter: a separate PCA for every training set (which would mean doing a separate PCA for every classifier and for every CV fold).

You then apply the same transformation to the test set, i.e. you do not do a separate PCA on the test set! You subtract the training set's mean (and, if needed, divide by its standard deviation), as explained here: Zero-centering the testing set after PCA on the training set. Then you project the test data onto the PCs of the training set.
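As a minimal sketch of that transformation (Python/NumPy assumed; the function names here are mine, purely for illustration):

```python
# Minimal sketch: apply the TRAINING set's centering and projection to the
# test set -- there is no separate PCA fit on the test set.
import numpy as np

def fit_pca(X_train, n_pcs):
    """Fit PCA on the training set only; return its mean and loadings."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training data; the rows of Vt are the PCs
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_pcs]

def project(X, mean, loadings):
    """Center with the training set's mean, then project onto its PCs."""
    # (if you also standardize, divide by the training set's std here)
    return (X - mean) @ loadings.T

# Usage inside one CV fold (X_train, X_test are hypothetical arrays):
# mean, loadings = fit_pca(X_train, n_pcs=10)
# train_scores = project(X_train, mean, loadings)
# test_scores  = project(X_test,  mean, loadings)  # same transformation!
```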


  • You'll need to define an automatic criterion for the number of PCs to use (a sketch of one such criterion follows after this list).
    As it is just a first data-reduction step before the "actual" classification, using a few too many PCs will likely not hurt the performance. If you know from experience roughly how many PCs would be good, you can maybe just use that.

  • You can also test afterwards whether redoing the PCA for every surrogate model was necessary, by repeating the analysis with only one PCA model (see the comparison sketch at the end of this answer). I think the result of this test is worth reporting.

  • I once measured the bias of not repeating the PCA, and found that with my spectroscopic classification data, I detected only half of the generalization error rate when not redoing the PCA for every surrogate model.

  • Also relevant: https://stats.stackexchange.com/a/240063/4598
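To illustrate the first bullet, here is a sketch of one common automatic criterion: keep the smallest number of PCs that covers a fixed fraction of the training fold's variance (the 95 % threshold is my arbitrary choice, not a recommendation):

```python
# Sketch: smallest number of PCs that covers a fixed fraction of the
# TRAINING fold's variance (the threshold is an arbitrary assumption).
import numpy as np

def n_pcs_by_variance(X_train, threshold=0.95):
    Xc = X_train - X_train.mean(axis=0)
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # variance share per PC
    return int(np.searchsorted(np.cumsum(explained), threshold)) + 1
```

In scikit-learn the same criterion is built in: PCA(n_components=0.95) selects the number of components this way on whatever data it is fitted to, so inside a Pipeline it is applied per training fold automatically.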

That being said, you can build an additional PCA model of the whole data set for descriptive (e.g. visualization) purposes. Just make sure you keep the two approaches separate from each other.


"I am still finding it difficult to get a feeling of how an initial PCA on the whole dataset would bias the results without seeing the class labels."

But it does see the data. And if the between-class variance is large compared to the within-class variance, that between-class variance will influence the PCA projection. Moreover, the PCA step is usually done precisely because the classification needs to be stabilized, i.e. in a situation where additional cases do influence the model.

If the between-class variance is small, this bias won't amount to much, but in that case PCA won't help the classification either: the projection then cannot emphasize the separation between the classes.
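Finally, a sketch of the comparison mentioned in the bullet list above (Python/scikit-learn assumed, with hypothetical toy data): run the same classifier once with a single whole-data PCA and once with the PCA redone inside every CV fold, and compare the two estimates on your own data. The size of the difference depends on the data; on the random toy data below, both estimates should hover around chance.

```python
# Sketch: compare a single whole-data PCA against a PCA redone in every
# CV fold, to check how optimistic the single PCA makes the estimate.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 300))     # hypothetical: few cases, many variables
y = rng.integers(0, 2, size=60)    # hypothetical binary class labels

clf = LogisticRegression(max_iter=1000)

# Protocol A (biased): PCA fitted once on ALL data; CV sees only the classifier.
scores_a = cross_val_score(clf, PCA(n_components=10).fit_transform(X), y, cv=5)

# Protocol B (correct): the Pipeline refits the PCA on each training fold.
scores_b = cross_val_score(make_pipeline(PCA(n_components=10), clf), X, y, cv=5)

print(f"whole-data PCA: {scores_a.mean():.2f}, per-fold PCA: {scores_b.mean():.2f}")
```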
