
I have a PySpark ML pipeline that uses PCA reduction and an ANN. My understanding is that PCA performs best when given standardized values, while NNs perform best when given normalized values. Does it make sense to standardize the values before PCA and then normalize the PCA outputs before the ANN?

  • I don't think "standardized" and "normalized" are universally defined; here you mean standardized as centered and scaled to unit variance, and normalized as scaled into the interval [0, 1]? Can you give references for "PCA prefers standardized" and "NN prefers normalized"? Commented Feb 24 at 19:10

1 Answer


Since PCA is driven by variance (it finds the directions of maximal variance), the data should indeed be standardized beforehand; otherwise, features with large scales dominate the principal components.

But as preprocessing for a deep neural network (DNN), either standardization or normalization can work, depending on the data. With images we typically divide by 255 to bring pixel values into the [0, 1] range, but for tabular data that roughly follows a Gaussian distribution, or that contains outliers, standardization usually makes more sense than normalization (you can also try both to see whether it makes a noticeable difference in the final result).

To answer your question, yes it makes sense to do standardization + PCA + scaling + DNN.
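
For reference, here is a minimal sketch of such a pipeline in PySpark ML. The column names (`features`, `label`), the number of components `k`, the layer sizes, and `train_df`/`test_df` are placeholders to adapt to your data; the `features` column is assumed to already be a vector (e.g. built with `VectorAssembler`), and `withMean=True` is what makes `StandardScaler` actually center the data.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, PCA, MinMaxScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier

# 1) Standardize (center + unit variance) so no feature dominates PCA by scale alone.
scaler = StandardScaler(inputCol="features", outputCol="std_features",
                        withMean=True, withStd=True)

# 2) Dimensionality reduction on the standardized features.
pca = PCA(k=10, inputCol="std_features", outputCol="pca_features")

# 3) Rescale the principal components into [0, 1] before feeding the network.
minmax = MinMaxScaler(inputCol="pca_features", outputCol="scaled_features")

# 4) Feed-forward network; the input layer size must equal the PCA output dimension (k).
mlp = MultilayerPerceptronClassifier(featuresCol="scaled_features", labelCol="label",
                                     layers=[10, 16, 8, 2], maxIter=200, seed=42)

pipeline = Pipeline(stages=[scaler, pca, minmax, mlp])
model = pipeline.fit(train_df)          # train_df: DataFrame with "features" (vector) and "label"
predictions = model.transform(test_df)
```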

Note that PCA does not always give a better model (though it will make training faster), and if you use it only to reduce the number of features, the DNN may be able to learn that reduction by itself. So I would also try a pipeline without the standardization + PCA steps, as in the comparison sketched below.
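
If you want to compare the two options, assuming the same placeholder DataFrames and column names as in the sketch above, a pipeline without the standardization + PCA stages can be evaluated side by side (`n_features` below is just the dimensionality of your input vector):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Baseline: skip standardization + PCA and feed the (min-max scaled) raw features directly.
n_features = len(train_df.first()["features"])   # dimensionality of the input feature vector
minmax_raw = MinMaxScaler(inputCol="features", outputCol="scaled_features")
mlp_raw = MultilayerPerceptronClassifier(featuresCol="scaled_features", labelCol="label",
                                         layers=[n_features, 16, 8, 2], maxIter=200, seed=42)
baseline = Pipeline(stages=[minmax_raw, mlp_raw]).fit(train_df)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
print("with PCA   :", evaluator.evaluate(predictions))
print("without PCA:", evaluator.evaluate(baseline.transform(test_df)))
```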

  • This was the type of answer I was looking for. Thank you. I will only add that for my use case, using ChiSquared feature selection ended up being a much better solution than PCA. Commented Mar 3 at 14:24
