There has been much debate, confusion and contradiction on this topic, both on stats.stackexchange and in the scientific literature.

A useful reference is the [2004 paper by Bengio & Grandvalet][1], which shows that the variance of the cross-validation estimator is a linear combination of three moments:

$$\operatorname{Var} = \frac{1}{n^2} \sum_{i,j} \operatorname{Cov}(e_i, e_j) = \frac{1}{n}\sigma^2 + \frac{m-1}{n}\omega + \frac{n-m}{n}\gamma$$

where $n$ is the total number of observations, $m = n/K$ is the number of observations per test block, and each term is a particular component of the $n \times n$ covariance matrix $\Sigma$ of cross-validation errors $\mathbf{e} = (e_1, \dots, e_n)^T$: $\sigma^2$ is the variance of an individual error (the diagonal of $\Sigma$), $\omega$ is the covariance between errors in the same test block, and $\gamma$ is the covariance between errors in different test blocks.

[![Structure of the covariance matrix of cross-validation errors][2]][2]
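
To make the decomposition concrete, here is a minimal Monte Carlo sketch (not from the paper; the toy linear-regression setup and all names are illustrative assumptions) that estimates $\sigma^2$, $\omega$ and $\gamma$ by repeating the whole CV experiment on freshly drawn datasets and reading the three moments off the empirical covariance matrix of the $n$ errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, n_reps = 60, 6, 2000
m = n // K                        # test-block (fold) size

def cv_errors(rng):
    """Run one K-fold CV on a fresh dataset; return the n squared errors."""
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
    e = np.empty(n)
    idx = np.arange(n)            # contiguous folds: fold k = idx[k*m:(k+1)*m]
    for k in range(K):
        test = idx[k * m:(k + 1) * m]
        train = np.delete(idx, test)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        e[test] = (y[test] - X[test] @ beta) ** 2
    return e

E = np.stack([cv_errors(rng) for _ in range(n_reps)])   # (n_reps, n)
C = np.cov(E, rowvar=False)                             # empirical n x n Sigma

same_block = np.equal.outer(np.arange(n) // m, np.arange(n) // m)
diag = np.eye(n, dtype=bool)
sigma2 = C[diag].mean()                     # Var(e_i): the diagonal
omega  = C[same_block & ~diag].mean()       # same test block, i != j
gamma  = C[~same_block].mean()              # different test blocks

var_cv = sigma2 / n + (m - 1) / n * omega + (n - m) / n * gamma
print(sigma2, omega, gamma, var_cv)
```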

As @Amoeba points out in a comment above, this variance is not a straightforward function of $K$. Each data point $x_i$ contributes an error $e_i$, and these errors are averaged to form the MSE estimate; varying $K$ has no direct, algebraically simple effect on the variance of the CV estimator.

$K$-fold CV with any value of $K$ produces one error for each of the $n$ observations, so the MSE estimate always has denominator $n$; the denominator does not change between LOOCV and, e.g., 10-fold CV. This is the main source of your confusion.
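
To see this concretely, here is a short sketch (assuming scikit-learn is available; the toy regression data are illustrative) showing that the number of held-out errors is exactly $n$ for every choice of $K$, including LOOCV:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

for K in (2, 5, 10, n):             # K = n is LOOCV
    # one held-out prediction per observation, regardless of K
    pred = cross_val_predict(LinearRegression(), X, y, cv=KFold(n_splits=K))
    errors = (y - pred) ** 2
    print(K, errors.size, errors.mean())   # errors.size == n for every K
```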

Now, there is more subtlety in this variance equation than meets the eye. In particular, the terms $\omega$ and $\gamma$ are influenced by the correlations between datasets, training sets and test sets, and by the instability of the model. Both effects depend on the value of $K$, which explains why different datasets and models lead to different behaviours.
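
The following sketch (same illustrative toy setup as above; not from the paper) sweeps $K$ and measures the Monte Carlo variance of the CV estimate directly. The shape of the resulting curve depends on the data-generating process and the model, and is generally not monotone in $K$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)
n, n_reps = 60, 500

def cv_mse(K, rng):
    """One CV estimate of the MSE on a freshly drawn dataset."""
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
    pred = cross_val_predict(LinearRegression(), X, y, cv=KFold(n_splits=K))
    return np.mean((y - pred) ** 2)

for K in (2, 3, 5, 10, 20, n):      # K = n is LOOCV
    estimates = [cv_mse(K, rng) for _ in range(n_reps)]
    print(K, np.var(estimates))     # Monte Carlo variance of the CV estimator
```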

You will need to read through the extensive (and technical) literature to really grasp the subtlety and the special cases.

[1]: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf
[2]: https://i.sstatic.net/HRtCF.png
[3]: https://stats.stackexchange.com/a/358138/192854
