
I have a dataset for a binary classification problem. In total, I have around 60 features.

When I used XGBoost feature importance, I saw that the top 5 features account for about 42% of the total importance, the rest of the ~50 features account for 40-49% combined (roughly 1% each), and the remaining 8-10 features have zero importance or less than 1%.

This is my best parameter list for XGBoost after grid search:

    op_params = {'alpha': [10], 'as_pandas': [True], 'colsample_bytree': [0.5],
                 'early_stopping_rounds': [100], 'learning_rate': [0.04],
                 'max_depth': [6], 'metrics': ['auc'], 'num_boost_round': [10000],
                 'objective': ['reg:logistic'], 'scale_pos_weight': [3.08],
                 'seed': [123], 'subsample': [0.75]}
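In case it matters, this is roughly how I unwrap those single-element grid-search lists and feed them to `xgb.cv` (`X_train` and `y_train` here stand for my training split):

    import xgboost as xgb

    # the grid-search dict stores single-element lists, so unwrap them first
    params = {k: v[0] for k, v in op_params.items()}

    # pull out the arguments that belong to xgb.cv() rather than the booster
    num_boost_round = params.pop('num_boost_round')
    early_stopping_rounds = params.pop('early_stopping_rounds')
    metrics = params.pop('metrics')
    as_pandas = params.pop('as_pandas')
    seed = params.pop('seed')

    dtrain = xgb.DMatrix(X_train, label=y_train)
    cv_results = xgb.cv(params, dtrain, num_boost_round=num_boost_round,
                        nfold=5, metrics=metrics, as_pandas=as_pandas,
                        seed=seed, early_stopping_rounds=early_stopping_rounds)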

Since I have many low-importance features, should I use them all in my model to improve the model metrics?

When I built the model with only the top 5 features, I got 80% accuracy.

I am trying to understand whether it is even useful to use these low-importance features for prediction.

Shown below is my feature importance in descending order:

[feature importance plot]

Do they even really help?
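To make the question concrete, this is the kind of comparison I have in mind: cross-validated AUC with only the top-k features versus all of them. A rough sketch (`X` and `y` are my feature matrix, as a NumPy array, and labels):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    clf = XGBClassifier(learning_rate=0.04, max_depth=6, subsample=0.75,
                        colsample_bytree=0.5, scale_pos_weight=3.08,
                        reg_alpha=10, random_state=123)

    # rank features by the default importance of a model fit on all of them
    ranked = np.argsort(clf.fit(X, y).feature_importances_)[::-1]

    # compare cross-validated AUC for growing feature subsets
    for k in (5, 15, 30, X.shape[1]):
        auc = cross_val_score(clf, X[:, ranked[:k]], y,
                              scoring='roc_auc', cv=5).mean()
        print(f'top {k:2d} features: mean CV AUC = {auc:.3f}')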

Any insights would be really helpful.


2 Answers


It's all about a trade-off.

The more unimportant features you add, the more marginal the benefits become, while you risk injecting extra complexity and potentially overfitting.

Occam's razor.

Also be careful with the default feature importance approach. Read this.
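For instance, permutation importance is one alternative; a minimal sketch with scikit-learn's inspection module (`clf` is assumed to be your fitted classifier, `X_val`/`y_val` a held-out set):

    from sklearn.inspection import permutation_importance

    # shuffle each feature on held-out data and measure the drop in AUC
    result = permutation_importance(clf, X_val, y_val, scoring='roc_auc',
                                    n_repeats=10, random_state=123)
    for i in result.importances_mean.argsort()[::-1]:
        print(f'feature {i}: {result.importances_mean[i]:.4f} '
              f'+/- {result.importances_std[i]:.4f}')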

  • Appreciate your input. Upvoted. (Commented Dec 17, 2019 at 12:37)
  • Hi, I have a question regarding permutation importance. (Commented Dec 17, 2019 at 12:47)
  • Can you please put it in another question and I will gladly answer it. Also, if you are satisfied with answers (generally, not necessarily mine), don't forget to accept them. (Commented Dec 17, 2019 at 12:49)
  • Hi, in addition, does it make sense to add a random-number column, ignore any feature whose importance falls below this newly generated column, and keep the rest for model prediction? Am I correct in thinking this way? (Commented Dec 17, 2019 at 13:12)
  • I would not do it. It's overkill; you could just use scikit-learn.org/stable/modules/generated/… with PermutationImportance packed together. (Commented Dec 17, 2019 at 13:31)

Adding low-value features might not help you surpass your current accuracy. Getting better-quality data, adding more data to the dataset, or training for more boosting rounds if the model has not yet converged might help you gain more accuracy.
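For the convergence part, you can watch a validation metric across boosting rounds with early stopping; a sketch, assuming `params` holds your booster parameters (with `eval_metric` set, e.g. to `'auc'`) and `dtrain`/`dvalid` are DMatrix objects for your train/validation split:

    import xgboost as xgb

    evals_result = {}
    booster = xgb.train(params, dtrain, num_boost_round=10000,
                        evals=[(dtrain, 'train'), (dvalid, 'valid')],
                        early_stopping_rounds=100,
                        evals_result=evals_result, verbose_eval=200)

    # if best_iteration is far below num_boost_round, more rounds won't help
    print('best round:', booster.best_iteration)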

  • Since there are a lot of low-importance features, do you think it could be a data quality issue? Upvoted for the help. (Commented Dec 17, 2019 at 12:14)
