
I am trying to use GridSearchCV to tune the parameters of a LightGBM model, but I don't know how to save the predictions from each iteration of GridSearchCV.
So far I only know how to save the result for one specific set of parameters.
Here is the code:

import copy
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average': 'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,
    'metric': 'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}

features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']
folds = StratifiedKFold(n_splits=10, shuffle=False, random_state=44000)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])

    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=1000,
                    early_stopping_rounds=3000)
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)
    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))

print('Saving the Result File')
res = pd.DataFrame({"ID_code": test_df.ID_code.values})  # was `test`, which is undefined
res["target"] = predictions
res.to_csv('result_10fold{}.csv'.format(num_sub), index=False)

Here is the data:

train_df.head(3)

       ID_code  target    var_0    var_1  ...  var_199
    0  train_0       0   8.9255  -6.7863  ...  -9.2834
    1  train_1       1  11.5006  -4.1473  ...   7.0433
    2  train_2       0   8.6093  -2.7457  ...  -9.0837

test_df.head(3)

       ID_code    var_0    var_1  ...  var_199
    0   test_0   9.4292  11.4327  ...  -2.3805
    1   test_1   5.0930  11.4607  ...  -9.2834
    2   test_2   7.8928  10.5825  ...  -9.0837

I want to save the predictions from every iteration of GridSearchCV. I have searched several similar questions and other material on using GridSearchCV with LightGBM,
but I still can't get the code right.
If you don't mind, could anyone help me or point me to a tutorial on this?
Thanks sincerely.

  • There is a small issue: you refer to GridSearchCV, which requires a model that complies with the sklearn model-training API, but you use the native lightgbm training API. The two do not work together. If you want to use GridSearchCV, you'll have to use the sklearn API of lightgbm (lgb.LGBMClassifier). However, I do not think that you want GridSearchCV at all. Instead, you should wrap your main loop in an outer loop over parameters. You can generate parameter combinations analogous to grid search using sklearn.model_selection.ParameterGrid. Commented Mar 30, 2019 at 10:56
  • @MykhailoLisovyi Thanks for your help. I have tried many approaches, but most of them didn't meet my requirement. In the end I wrapped the loop in a function and saved all the different parameter sets in a param_list dictionary. I still think this is an ugly way to do it; if you don't mind, could you give me some advice on the code? Thanks in advance. Commented Apr 1, 2019 at 9:02

1 Answer


You can use ParameterGrid or ParameterSampler from sklearn to do the parameter sampling; they correspond to GridSearchCV and RandomizedSearchCV, respectively. For example,

def train_lgb(num_folds=11, param=param_original):
    ...
    return predictions, sub

params = {
    # your base parameters
}

# define the grid for parameter sampling
from sklearn.model_selection import ParameterGrid
par_grid = ParameterGrid([{'bagging_freq': [6, 7]},
                          {'num_leaves': [13, 15]}])

prediction_list = {}
sub_list = {}

import copy
for i, ps in enumerate(par_grid):
    print('This is param{}'.format(i))
    # copy the base params dictionary and update with sampled values
    val = copy.deepcopy(params)
    val.update(ps)
    # main training loop
    prediction, sub = train_lgb(param=val)
    # store results keyed by the iteration index
    prediction_list[i] = prediction
    sub_list[i] = sub
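For random rather than exhaustive sampling, ParameterSampler can be dropped in for ParameterGrid in the loop above. A minimal runnable sketch (the parameter names mirror the question; the training call is commented out and the sampled parameter sets are stored in its place, since the real train_lgb needs the full dataset):

```python
import copy
from sklearn.model_selection import ParameterSampler

base_params = {'bagging_freq': 5, 'num_leaves': 13, 'learning_rate': 0.01}

# draw 3 random combinations instead of enumerating the full 3x3 grid
sampler = ParameterSampler({'bagging_freq': [5, 6, 7],
                            'num_leaves': [13, 15, 31]},
                           n_iter=3, random_state=42)

prediction_list = {}
for i, ps in enumerate(sampler):
    params = copy.deepcopy(base_params)
    params.update(ps)  # overwrite only the sampled keys
    # prediction, sub = train_lgb(param=params)   # the real training call
    prediction_list[i] = params  # keyed by iteration index

print(prediction_list)
```

Because the grid here is finite, ParameterSampler draws combinations without replacement, so the three sampled dictionaries are all distinct.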

Edit: By the way, I realized that I was investigating the same issue recently and was learning how to address it with some ML tools. I've created a page summarising how to use MLflow for this task: https://mlisovyi.github.io/KaggleSantander2019/ (and the associated GitHub repo with the actual code). Note that, by coincidence, it is based on the same data you are working on :). I hope it will be useful.
