
I am using Matlab. I have a $600 \times 9$ matrix, where each row is an observation with 9 features, which I am trying to model using logistic regression.

  1. I understand that I need to perform feature scaling, but do I need to apply it to both the training and the testing sets?

  2. I have 9 features. When applying regularization, up to which higher-order (polynomial) degree of those 9 features do I need to consider?

  3. How do I check which features contribute more and which contribute less to the model?

  4. How do I divide my data into training and testing sets? Which split ratio is ideal?

  • Posts that ask, in effect, "How do I perform regression?" or "How do I cross-validate findings?" are so broad that they are unfortunately not a good fit for this site. It sounds as if you would benefit from extensive reading and/or from time spent with a mentor or consultant. Commented Jan 22, 2013 at 15:35
  • In light of @rolando2's comment, it may be more helpful to try what you can, and then ask a series of well-focused questions about the places where you're stuck (assuming those issues haven't already been covered elsewhere on CV). For each question, state what you've done so far, where you're stuck, and what you understand about that issue based on your own study of statistics. Commented Jan 22, 2013 at 17:20

1 Answer

(I think it would be better to split these four questions into four separate posts...)

But:

  1. Forget the division into preprocessing and model fitting. Treat preprocessing as part of your model. That makes clear that whatever you do with your training data, you need to do with your test data as well. It will also make clear that many preprocessing steps need to be fit separately for each "surrogate model" during cross validation.

  2. This cannot be answered in general, only for a given problem.

  3. Search this site for "contribution"; as it happens, there was a question about exactly that today...

  4. Read about cross validation. There are some rules that you need to keep in mind for splitting (which you can find here), but I can already tell you that the ratio of training to test data is rather uncritical for cross validation. In case you want to do a one-time split (hold-out set), I may "advertise" my recent paper, "Sample Size Considerations for Classification Models", which is also available on arXiv.
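To illustrate point 1: with z-score standardization, the mean and standard deviation are learned from the training data only and then reused unchanged on the test data. A minimal sketch in plain Python (data and function names are made up for illustration; in Matlab the same idea applies, computing `mean`/`std` on the training rows only):

```python
from statistics import mean, stdev

def fit_scaler(train_column):
    # Learn standardization parameters from the TRAINING data only.
    return mean(train_column), stdev(train_column)

def apply_scaler(column, mu, sigma):
    # Reuse the training-set parameters -- never refit on test data.
    return [(x - mu) / sigma for x in column]

train = [1.0, 2.0, 3.0, 4.0]   # one feature column, training rows
test = [2.5, 5.0]              # same feature, test rows

mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)   # same mu, sigma as training
```

During cross validation this fit/apply split happens inside each fold: the scaler is refit on each surrogate training set, never on the held-out part.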
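For point 3, one rough first screen, assuming the inputs were standardized, is to rank features by the absolute magnitude of their fitted logistic-regression coefficients (the coefficient values below are invented for illustration; note this ignores correlations between features, so it is only a starting point):

```python
# Hypothetical standardized-input coefficients for 3 of the 9 features.
coefs = {"feat1": 0.8, "feat2": -1.5, "feat3": 0.05}

# On standardized inputs, a larger |coefficient| means a larger
# marginal effect on the log-odds per standard deviation of the feature.
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
print(ranked)   # -> ['feat2', 'feat1', 'feat3']
```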
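And for point 4, a minimal sketch of a k-fold index split in plain Python (the function name and fixed seed are my own choices; any stratification or grouping constraints from the splitting rules mentioned above still need to be respected on top of this):

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1, then deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(600, 5)   # 600 rows -> 5 folds of 120 each
for held_out in range(5):
    test_rows = folds[held_out]
    train_rows = [i for f in range(5) if f != held_out for i in folds[f]]
    # fit preprocessing AND the model on train_rows; evaluate on test_rows
```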

