
I have been reading about the technique of k-fold cross-validation and I came across this example:

    >>> clf = svm.SVC(kernel='linear', C=1)
    >>> scores = cross_validation.cross_val_score(
    ...     clf, iris.data, iris.target, cv=5)
    >>> scores
    array([ 0.96...,  1. ...,  0.96...,  0.96...,  1. ])

The mean score and the standard deviation of the score estimate are given by:

    >>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    Accuracy: 0.98 (+/- 0.03)
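(As an aside, the sklearn.cross_validation module used above has since been replaced by sklearn.model_selection; a self-contained sketch of roughly the same experiment against the newer layout would look like this:)

    from sklearn import datasets, svm
    from sklearn.model_selection import cross_val_score

    # Load the iris dataset and set up the same linear SVM as above.
    iris = datasets.load_iris()
    clf = svm.SVC(kernel='linear', C=1)

    # cross_val_score returns one accuracy score per fold (cv=5 -> five scores).
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))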

According to this source:

When you perform k-fold CV, you get k different estimates of your model's error, say e_1, e_2, e_3, ..., e_k. Since each e_i is an error estimate, it should ideally be zero.

To check your model's bias, find out the mean of all the e_i's. If this value is low, it basically means that your model gives low error on average, indirectly ensuring that your model's notions about the data are accurate enough.
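In terms of the snippet above, each e_i would just be one minus the corresponding fold's accuracy. A small sketch of that conversion, using approximate fold scores copied from the output above:

    import numpy as np

    # Approximate per-fold accuracies from the cross_val_score output above.
    scores = np.array([0.9667, 1.0, 0.9667, 0.9667, 1.0])

    # The e_i's from the quote are error estimates: 1 minus each fold's accuracy.
    errors = 1.0 - scores
    print("mean error:", errors.mean())   # about 0.02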

According to the SVM example with the iris dataset, the mean is 0.98, so does this mean that our model is not flexible enough?

  • 0.98 means 98% accuracy, which is 2% error, doesn't sound bad at all. Commented Jul 9, 2018 at 15:03

2 Answers

  1. The Wordpress site you link to refers to "error" whereas the code you are using is calculating accuracy, so higher values are better for you.
  2. The mean accuracy is 0.98. Is it good? I can't say because it can only be judged relative to a benchmark.
  3. When doing cross-validation, you are mainly interested in the stability of your classifier, not the mean accuracy. Cross-validation essentially asks: "how well does my classifier perform across different parts of my dataset?" and you use the results to answer: "how well will my classifier perform on data it has not seen before?" Therefore, you really need to look at the standard deviation of your accuracy scores.

Accuracy: 0.98 (+/- 0.03)

Your results show that you have 95% confidence that the mean accuracy will be between 0.95 and 1.
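That interval is the mean of the per-fold scores plus or minus two of their standard deviations (the scores.std() * 2 term in the print statement). A minimal sketch of the arithmetic, using approximate fold scores from the question:

    import numpy as np

    # Approximate per-fold accuracies from the question's cross-validation run.
    scores = np.array([0.9667, 1.0, 0.9667, 0.9667, 1.0])

    mean = scores.mean()        # about 0.98
    spread = 2 * scores.std()   # about 0.03

    # The rough "mean +/- two standard deviations" interval referred to above,
    # capped at 1 because accuracy cannot exceed 1.
    print("between %.2f and %.2f" % (mean - spread, min(mean + spread, 1.0)))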


I think your question comes from a misunderstanding of what k-fold cross-validation is for, so I thought I would explain a couple of things about it.

It's used in machine learning when you have a smaller sample size and you still need to test how accurate your model is. K-fold splits your data into k different train/test splits. So if k were 5, each split uses 20% of the data for testing and 80% for training, and which 20% is held out for testing changes from one split to the next (as does which 80% is trained on). This is useful when you are worried about bias caused by having only a small amount of data.
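A toy sketch with scikit-learn's KFold (made-up data of ten samples, k=5) shows how the held-out 20% rotates across the folds:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(10)   # ten toy samples, just to show the split pattern

    # With k=5, each fold holds out a different 20% for testing and trains on
    # the remaining 80%; every sample is used for testing exactly once.
    for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X)):
        print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))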

The accuracy you get back is how accurately, on average across the k tests, the model identified what you were looking for; in this case, whether each iris was correctly classified.

0.98 (that is, 98% accuracy, not 0.98%) is quite a decent number, so your model is fine. That's an error rate of 0.02, which is close to the goal of 0, and it is unlikely ever to hit exactly 0.
