The scikit-learn documentation page on Grid Search says:
Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid.
When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics.
Does this mean that GridSearchCV.best_score_ shouldn't be used to evaluate model performance? Why is that the case?
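If I understand the recommendation correctly, the intended workflow looks roughly like this (toy data and a hypothetical SVC grid, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data standing in for the real problem
X, y = make_classification(n_samples=500, random_state=0)

# Development set fed to the grid search; evaluation set kept completely unseen
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.25, random_state=0)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_dev, y_dev)

# Final performance is measured on the held-out evaluation set,
# not taken from grid.best_score_
print(grid.score(X_eval, y_eval))
```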
I've been using my GridSearchCV scores as my performance estimates because I wanted a reliable score over several runs (with its standard deviation). Running a separate cross-validation after the grid search gives me overestimated scores, since some of the data in those CV validation sets was already seen by the grid search. Is this an incorrect approach?
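For reference, here is roughly what I have been doing (again with toy data and a hypothetical SVC grid): reporting the mean and standard deviation of the fold scores for the best candidate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the real problem
X, y = make_classification(n_samples=500, random_state=0)

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

# Mean and standard deviation of the CV fold scores for the best parameter
# setting; the mean here is exactly grid.best_score_
best = grid.best_index_
mean_score = grid.cv_results_["mean_test_score"][best]
std_score = grid.cv_results_["std_test_score"][best]
print(f"{mean_score:.3f} +/- {std_score:.3f}")
```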