Please refer to the notebook at the following address:

LogisticRegression

This portion of code,

    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
    print scores
    print scores.mean()

generates the following error on a Windows 7 64-bit machine:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-37-4a10affe67c7> in <module>()
          1 # evaluate the model using 10-fold cross-validation
    ----> 2 scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
          3 print scores
          4 print scores.mean()

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, score_func, pre_dispatch)
       1140                             allow_nans=True, allow_nd=True)
       1141
    -> 1142     cv = _check_cv(cv, X, y, classifier=is_classifier(estimator))
       1143     scorer = check_scoring(estimator, score_func=score_func, scoring=scoring)
       1144     # We clone the estimator to make sure that all the folds are

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in _check_cv(cv, X, y, classifier, warn_mask)
       1366     if classifier:
       1367         if type_of_target(y) in ['binary', 'multiclass']:
    -> 1368             cv = StratifiedKFold(y, cv, indices=needs_indices)
       1369         else:
       1370             cv = KFold(_num_samples(y), cv, indices=needs_indices)

    C:\Python27\lib\site-packages\sklearn\cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
        428         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
        429             for label, (_, test_split) in zip(unique_labels, per_label_splits):
    --> 430                 label_test_folds = test_folds[y == label]
        431                 # the test split can be too big because we used
        432                 # KFold(max(c, self.n_folds), self.n_folds) instead of

    IndexError: too many indices for array

I am using scikit-learn 0.15.2; it is suggested here that this may be a problem specific to Windows 7 64-bit machines.

==============update==============

I found that the following code actually works:

    from sklearn.cross_validation import KFold

    cv = KFold(X.shape[0], 10, shuffle=True, random_state=33)
    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=cv)
    print scores

==============update 2=============

It seems that, due to some package update, I can no longer reproduce this error on my machine. If you are facing the same issue on a Windows 7 64-bit machine, please let me know.
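
Going by the answers and comments below, a quick way to tell whether y is the culprit is to check its shape: StratifiedKFold (which cross_val_score builds internally when cv is a number and the estimator is a classifier, as the traceback shows) expects a flat array of shape (n_samples,). A minimal sketch with made-up labels:

    import numpy as np

    # Made-up labels, just to show the shape difference; a column vector of
    # shape (n, 1) trips up StratifiedKFold, while a flat (n,) array does not.
    y_col = np.array([[1], [0], [1], [0]])
    print(y_col.shape)          # (4, 1) -- column vector
    print(y_col.ravel().shape)  # (4,)   -- the shape cross_val_score expects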

  • What is the shape of y? Commented Oct 22, 2014 at 12:01
  • The only difference between what works and what doesn't is cv? And X.shape[0] == 6366 as well? Commented Oct 22, 2014 at 14:37
  • @eickenberg cv=10 will try to do stratified 10-fold CV, KFold will not. Commented Oct 22, 2014 at 14:52
  • putting cv=StratifiedKFold(y, 10) explicitly would have been my next diagnosis step, if all else was equal. Commented Oct 22, 2014 at 16:56
  • Is that the only change you have made? Because if that works, then cv=number should, too (see @larsmans comment). Commented Oct 23, 2014 at 8:16

3 Answers

I had the same error you got and was looking for answers when I found this question.

I used the same sklearn.cross_validation.cross_val_score (with a different algorithm) on the same kind of machine, Windows 7 64-bit.

I tried your solution from above and it "worked", but it gave me the following warning:

    C:\Users\E245713\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\cross_validation.py:1531: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
      estimator.fit(X_train, y_train, **fit_params)

After reading the warning, I figured that the problem has something to do with the shape of 'y' (my label column). The keyword to try from the warning is "ravel()". So, I tried the following:

    y_arr = pd.DataFrame.as_matrix(label)
    print(y_arr)
    print(y_arr.shape)

which gave me

    [[1]
     [0]
     [1]
     ...,
     [0]
     [0]
     [1]]
    (87939, 1)

When I added 'ravel()':

    y_arr = pd.DataFrame.as_matrix(label).ravel()
    print(y_arr)
    print(y_arr.shape)

it gave me:

    [1 0 1 ..., 0 0 1]
    (87939,)

The dimension of y_arr has to be (87939,), not (87939, 1). After that, my original cross_val_score call worked without adding the KFold code.
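
Putting it all together, here is a minimal sketch of the fix. The data (X and label) is synthetic stand-in data, since the original notebook's variables aren't shown here, and the import uses the 0.15.x-era sklearn.cross_validation layout (in current scikit-learn the same function lives in sklearn.model_selection):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer releases

    # Synthetic stand-in data mimicking a single-column label DataFrame
    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    label = pd.DataFrame(rng.randint(0, 2, size=(100, 1)), columns=['target'])

    y_arr = label.values.ravel()  # (100, 1) -> (100,), the shape StratifiedKFold expects
    scores = cross_val_score(LogisticRegression(), X, y_arr, scoring='accuracy', cv=10)
    print(scores.mean())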

Hope this helps.

I know the answer is late, but it might help other people struggling with the same error. I had the same issue with Python 3.6; upon changing from 3.6 to 3.5, I am able to use the function.
Below is the sample which I ran:

    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10, n_jobs=-1)

First, create a conda environment with Python 3.5 and activate it:

    conda create -n py35 python=3.5
    source activate py35

Hope this helps you move ahead.

Import this module and it should work:

    from sklearn.model_selection import cross_val_score
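
For what it's worth, a minimal sketch of usage with the current API; the data is a synthetic stand-in, since the original notebook's X and y aren't shown here:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data; note that y is already 1-D, shape (200,)
    rng = np.random.RandomState(0)
    X = rng.rand(200, 4)
    y = rng.randint(0, 2, size=200)

    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
    print(scores)
    print(scores.mean())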

1 Comment

The error message shows that this is not the cause: the function itself is available, but it cannot handle the array provided to it.
