$\begingroup$

I have a question about bandwidth selection for the kernel density estimate in scipy.stats. With Scott's rule, the bandwidth equals n**(-1./(d+4)), so it depends only on the number of samples n and the number of dimensions d. However, samples with the same n and d can have very different variances. Does data measured in large units get the same bandwidth as data measured in small units? That doesn't make sense: if the data scale is large (large covariance) but the bandwidth is small, each kernel covers little more than a single data point. So, when using n**(-1./(d+4)), should the data be standardized (z-scored) first?

In my opinion, the bandwidth should depend on the covariance of the data in addition to n and d, so why is it equal to just n**(-1./(d+4))?

See the SciPy documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html

I'd really appreciate your help, guys.

$\endgroup$

1 Answer

$\begingroup$

The source says

    covariance_factor = scotts_factor
    covariance_factor.__doc__ = """Computes the coefficient (`kde.factor`) that
        multiplies the data covariance matrix to obtain the kernel covariance
        matrix. The default is `scotts_factor`.  A subclass can overwrite this
        method to provide a different method, or set it through a call to
        `kde.set_bandwidth`."""

That is, the bandwidth does depend on the data covariance matrix, and scotts_factor is just the coefficient used to scale that matrix.
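
To see the dependence on the data scale concretely, here is a minimal sketch that compares two fits of the same samples in different units. It assumes the default `bw_method` (Scott's rule) and unweighted data; the variable names (`kde_small`, `kde_large`, `scott`) are made up for illustration. It relies on the `factor` and `covariance` attributes of `gaussian_kde`.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n, d = 200, 2
data = rng.normal(size=(d, n))           # gaussian_kde expects shape (d, n)

kde_small = gaussian_kde(data)           # default bandwidth method: Scott's rule
kde_large = gaussian_kde(1000.0 * data)  # the same samples in "large units"

scott = n ** (-1.0 / (d + 4))

# The factor depends only on n and d, so both fits share it.
factor_matches = bool(
    np.isclose(kde_small.factor, scott) and np.isclose(kde_large.factor, scott)
)

# The kernel covariance is the data covariance scaled by factor**2.
cov_matches = np.allclose(kde_small.covariance, scott**2 * np.cov(data))

# Rescaling the data by 1000 rescales the kernel covariance by 1000**2.
scales_with_data = np.allclose(kde_large.covariance, 1e6 * kde_small.covariance)

print(factor_matches, cov_matches, scales_with_data)
```

So the coefficient is unit-free, while the kernel covariance follows the units of the data, which suggests that z-scoring beforehand is not required for this reason.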

$\endgroup$
  • $\begingroup$ Thanks, buddy. Here's another question. I remember papers saying that the bandwidth in the multi-dimensional KDE formula is a scalar. According to the source, however, the bandwidth should be a d×d matrix equal to scotts_factor * data_covariance_matrix. If that's the case, how do I get this d×d matrix? Are there any attributes or functions in this API that return the bandwidth itself, not just the bandwidth coefficient? $\endgroup$ Commented Jun 15, 2020 at 8:23
  • $\begingroup$ Can I use n**(-1./(d+4)) * data_covariance_matrix to get bandwidth? $\endgroup$ Commented Jun 15, 2020 at 8:39
  • $\begingroup$ That would be Scott's rule, yes. I don't have any views on whether it's a good approach or not. $\endgroup$ Commented Jun 15, 2020 at 23:30
  • $\begingroup$ Thanks, Thomas. I found that the bandwidth is a covariance matrix, available as kernel.covariance in scipy.stats.gaussian_kde. It is related to the covariance of the data as follows: kernel.covariance = kernel.factor**2 * data_covariance, where kernel.factor = scotts_factor = n**(-1./(d+4)). Now I have another question: what is the specific formula for Gaussian kernel density estimation with Scott's rule in scipy.stats.gaussian_kde, i.e. what is the formula of the Gaussian kernel function? $\endgroup$ Commented Jun 17, 2020 at 4:03
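
Regarding the last comment, one way to check a candidate formula is to evaluate it by hand and compare with `gaussian_kde`. The sketch below assumes unweighted data and Scott's rule, and uses the estimator

$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\sqrt{(2\pi)^d \,|H|}} \exp\!\left(-\tfrac{1}{2}(x - x_i)^\top H^{-1} (x - x_i)\right), \qquad H = n^{-2/(d+4)}\,\Sigma,$$

where $\Sigma$ is the sample covariance of the data. The helper `manual_kde` and the symbol `H` are illustrative names, not part of SciPy.

```python
import numpy as np
from scipy.stats import gaussian_kde

def manual_kde(points, data):
    """Evaluate a Gaussian KDE by hand; points is (d, m), data is (d, n)."""
    d, n = data.shape
    # Kernel covariance under Scott's rule: factor**2 times the data covariance.
    H = n ** (-2.0 / (d + 4)) * np.atleast_2d(np.cov(data))
    H_inv = np.linalg.inv(H)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H))
    # diff[:, j, i] = points[:, j] - data[:, i], shape (d, m, n)
    diff = points[:, :, None] - data[:, None, :]
    # Squared Mahalanobis distance for every (point, sample) pair, shape (m, n)
    maha = np.einsum("imn,ij,jmn->mn", diff, H_inv, diff)
    return np.exp(-0.5 * maha).sum(axis=1) / (n * norm)

rng = np.random.default_rng(1)
data = rng.normal(size=(2, 100))
points = rng.normal(size=(2, 5))

manual = manual_kde(points, data)
reference = gaussian_kde(data)(points)
agree = np.allclose(manual, reference)
print(agree)
```

If the two results agree, the hand-written formula matches what `gaussian_kde` computes for this setup; weighted data would need the weights folded into both the sum and the covariance.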
