Please refer to the notebook at the following address:

LogisticRegression

This portion of code,

    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
    print scores
    print scores.mean()

generates the following error on a Windows 7 64-bit machine:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-37-4a10affe67c7> in <module>()
          1 # evaluate the model using 10-fold cross-validation
    ----> 2 scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
          3 print scores
          4 print scores.mean()

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, score_func, pre_dispatch)
       1140                             allow_nans=True, allow_nd=True)
       1141
    -> 1142     cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
       1143     scorer = check_scoring(estimator, score_func=score_func, scoring=scoring)
       1144     # We clone the estimator to make sure that all the folds are

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in _check_cv(cv, X, y, classifier, warn_mask)
       1366     if classifier:
       1367         if type_of_target(y) in ['binary', 'multiclass']:
    -> 1368             cv = StratifiedKFold(y, cv, indices=needs_indices)
       1369         else:
       1370             cv = KFold(_num_samples(y), cv, indices=needs_indices)

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
        428         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
        429             for label, (_, test_split) in zip(unique_labels, per_label_splits):
    --> 430                 label_test_folds = test_folds[y == label]
        431                 # the test split can be too big because we used
        432                 # KFold(max(c, self.n_folds), self.n_folds) instead of

    IndexError: too many indices for array

I am using scikit-learn 0.15.2; it is suggested here that this may be a problem specific to Windows 7 64-bit machines.

==============update==============

I found that the following code actually works:

    from sklearn.cross_validation import KFold

    cv = KFold(X.shape[0], 10, shuffle=True, random_state=33)
    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=cv)
    print scores

==============update 2=============

It seems that, due to some package update, I can no longer reproduce this error on my machine. If you are facing the same issue on a Windows 7 64-bit machine, please let me know.
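
Going by the answers and comments below, a quick way to tell whether y is the culprit is to check its shape: StratifiedKFold (which cross_val_score builds internally when cv is a number and the estimator is a classifier, as the traceback shows) expects a flat array of shape (n_samples,). A minimal sketch with made-up labels:

    import numpy as np

    # Made-up labels, just to show the shape difference; a column vector of
    # shape (n, 1) trips up StratifiedKFold, while a flat (n,) array does not.
    y_col = np.array([[1], [0], [1], [0]])
    print(y_col.shape)          # (4, 1) -- column vector
    print(y_col.ravel().shape)  # (4,)   -- the shape cross_val_score expects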

  • What is the shape of y? Commented Oct 22, 2014 at 12:01
  • The only difference between what works and what doesn't is cv? And X.shape[0] == 6366 as well? Commented Oct 22, 2014 at 14:37
  • @eickenberg cv=10 will try to do stratified 10-fold CV, KFold will not. Commented Oct 22, 2014 at 14:52
  • putting cv=StratifiedKFold(y, 10) explicitly would have been my next diagnosis step, if all else was equal. Commented Oct 22, 2014 at 16:56
  • Is that the only change you have made? Because if that works, then cv=number should, too (see @larsmans comment). Commented Oct 23, 2014 at 8:16

3 Answers

I had the same error you got and was looking for answers when I found this question.

I used the same sklearn.cross_validation.cross_val_score (with a different algorithm) on the same kind of machine, Windows 7 64-bit.

I tried your solution from above and it "worked", but it gave me the following warning:

    C:\Users\E245713\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\cross_validation.py:1531: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
      estimator.fit(X_train, y_train, **fit_params)

After reading the warning, I figured that the problem has something to do with the shape of 'y' (my label column). The keyword to try from the warning is "ravel()". So, I tried the following:

    y_arr = pd.DataFrame.as_matrix(label)
    print(y_arr)
    print(y_arr.shape)

which gave me

    [[1]
     [0]
     [1]
     ...,
     [0]
     [0]
     [1]]
    (87939, 1)

When I added 'ravel()':

    y_arr = pd.DataFrame.as_matrix(label).ravel()
    print(y_arr)
    print(y_arr.shape)

it gave me:

    [1 0 1 ..., 0 0 1]
    (87939,)

The dimension of y_arr has to be (87939,), not (87939, 1). After that, my original cross_val_score call worked without adding the KFold code.
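
Putting it all together, here is a minimal sketch of the fix. The data (X and label) is synthetic stand-in data, since the original notebook's variables aren't shown here, and the import uses the 0.15.x-era sklearn.cross_validation layout (in current scikit-learn the same function lives in sklearn.model_selection):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

    # Synthetic stand-in data mimicking a single-column label DataFrame
    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    label = pd.DataFrame(rng.randint(0, 2, size=(100, 1)), columns=['target'])

    y_arr = label.values.ravel()  # (100, 1) -> (100,), the shape StratifiedKFold expects
    scores = cross_val_score(LogisticRegression(), X, y_arr, scoring='accuracy', cv=10)
    print(scores.mean())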

Hope this helps.

I know the answer is late, but it might help other people struggling with the same error. I had the same issue with Python 3.6; upon changing from 3.6 to 3.5, I am able to use the function.
Below is the sample which I ran:

    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=-1)

First, create a conda environment with Python 3.5 and activate it:

    conda create -n py35 python=3.5
    source activate py35

Hope this helps you move ahead.

Import this module and it should work:

    from sklearn.model_selection import cross_val_score
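
For what it's worth, a minimal sketch of usage with the current API; the data is a synthetic stand-in, since the original notebook's X and y aren't shown here:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data; note that y is already 1-D, shape (200,)
    rng = np.random.RandomState(0)
    X = rng.rand(200, 4)
    y = rng.randint(0, 2, size=200)

    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
    print(scores)
    print(scores.mean())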

1 Comment

The error message shows that this is not the cause: the function itself is available, but it cannot handle the array provided to it.
