
I am trying to estimate the bandwidth parameter of a multivariate KDE in R and then use the estimate to evaluate the KDE in Python.

The reason for this somewhat convoluted approach is that my project is in Python, but, as far as I know, there is no multivariate plug-in bandwidth selector implemented in Python. So I resorted to estimating the diagonal bandwidth matrix with R's ks package. Unfortunately, something does not work as I expected. I boiled it down to the following example:

```python
import pandas as pd
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import statsmodels.api as sm

# ###############################################################################
# # call this part only once to install "ks"
# utils = importr('utils')
# base = importr('base')
# # select a mirror, otherwise the user is prompted
# utils.chooseCRANmirror(ind=1)  # select the first mirror in the list
# # install required package "ks"
# utils.install_packages('ks')
# ###############################################################################


def get_bandwidth(data):
    ks = importr("ks")
    with localconverter(ro.default_converter + pandas2ri.converter):
        # convert pandas.DataFrame to R data.frame
        r_from_pd_df = ro.conversion.py2rpy(data)
        # bandwidth selection with the normal-scale rule of thumb
        ks_hns = ks.Hns(r_from_pd_df)
        ks_hns = ro.conversion.rpy2py(ks_hns)
        # plug-in bandwidth (diagonal matrix)
        ks_hpi_diag = ks.Hpi_diag(r_from_pd_df)
        ks_hpi_diag = ro.conversion.rpy2py(ks_hpi_diag)
    return ks_hns, ks_hpi_diag


# create test data
data_x, data_y = make_blobs(
    n_samples=1000, n_features=2, centers=3, cluster_std=0.5, random_state=0
)
# re-scale one dimension
data_x[:, 0] = data_x[:, 0] * 20

plt.hexbin(data_x[:, 0], data_x[:, 1])
plt.show()

ks_hns, ks_hpi_diag = get_bandwidth(pd.DataFrame(data_x))

dens_n = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="normal_reference"
)
dens_cw = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="cv_ml"
)

print("Python 'statsmodels' normal reference bandwidth estimate:")
print(np.diag(dens_n.bw))
print("R 'ks' normal scale bandwidth:")
print(ks_hns)
print("Python 'statsmodels' cross validation bandwidth estimate:")
print(np.diag(dens_cw.bw))
print("R 'ks' PI bandwidth diagonal:")
print(ks_hpi_diag)
```

This returns:

```
Python 'statsmodels' normal reference bandwidth:
[[10.66300197  0.        ]
 [ 0.          0.4962383 ]]
R 'ks' normal reference bandwidth:
[[101.29354253  -1.81323816]
 [ -1.81323816    0.21938319]]
Python 'statsmodels' cross validation bandwidth:
[[4.40348543 0.        ]
 [0.         0.19790704]]
R 'ks' PI bandwidth diagonal:
[[20.52921948  0.        ]
 [ 0.           0.04962396]]
```

I would expect the two implementations' normal-reference (rule-of-thumb) estimates not to be exactly the same, but at least of the same order of magnitude. The same goes for ks's plug-in method compared with statsmodels' cross-validation method (although here I am less sure).

As suggested by @Josef, I plotted the different KDEs (code not shown; figure with the four density plots omitted here). The plots in the top row were produced with statsmodels and its bandwidth estimates, the ones in the bottom row with ks.

It seems to me from the plots that the different bandwidths produce comparable results within their corresponding implementations. For example, the two plots on the left are similar, even though the bandwidths differ substantially.

Why are the two estimates so different?

The only idea I have is that they use different parameterizations or scales, but I couldn't find much on that. If that is the case, I would appreciate a hint on where to read up on it.

4 Comments

  • Plot the results to see if they have approximately the same smoothness. That would indicate whether some scaling factor differs. (Commented Sep 27, 2022 at 19:30)
  • Thanks for the hint, that is a good idea. I edited the post. (Commented Sep 28, 2022 at 12:07)
  • Because the R version returns a "bandwidth matrix," almost surely that will be the analog of a covariance matrix. Because Python returns only a diagonal matrix, it is plausible that it has taken the square roots so that these numbers can be interpreted as a characteristic distance. That is why it is striking that the matrices returned by R in your case are approximately the squares of the matrices returned by Python. (Commented Sep 28, 2022 at 13:47)
  • statsmodels KDEMultivariate uses a product kernel based mostly on Racine and Li. I guess R ks uses a multivariate gaussian kernel similar to scipy gaussian_kde. (Commented Sep 28, 2022 at 16:34)

1 Answer


In case somebody is looking for an answer to this question: @whuber was correct in the comments. I am just turning it into an answer so it can be found more easily.

From my observations I can confirm that R's ks returns the bandwidth matrix, which is the analog of a covariance matrix. Python's statsmodels, on the other hand, returns one bandwidth per variable on the standard-deviation scale, i.e. roughly the square roots of that matrix's diagonal.
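For the normal-reference case the two conventions can be compared directly. As far as I can tell from the ks documentation and the statsmodels source (so treat the exact constants as my assumption), the rules are, for $d$ variables and $n$ observations,

$$H_\mathrm{NS} = \left(\frac{4}{d+2}\right)^{2/(d+4)} n^{-2/(d+4)}\, \hat\Sigma, \qquad h_i = 1.06\, \hat\sigma_i\, n^{-1/(d+4)},$$

so ks scales with the sample covariance $\hat\Sigma$ (squared units), while statsmodels scales with the standard deviations $\hat\sigma_i$. With $d = 2$ and $n = 1000$ this gives $H_{\mathrm{NS},11} = 0.1\,\hat\sigma_1^2$ and $h_1^2 = 1.06^2 \cdot 0.1\,\hat\sigma_1^2$, which is consistent with the numbers in the question ($10.66^2 \approx 113.7$ vs. $101.3$, a ratio of about $1.06^2$).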

So, if you want to use a bandwidth estimated with R's ks in Python's statsmodels, you just have to take the element-wise square root of the diagonal of the bandwidth matrix.
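A minimal sketch of that conversion, assuming the `ks_hpi_diag` matrix and `data_x` array from the question (statsmodels' KDEMultivariate accepts a user-specified array for bw):

```python
import numpy as np
import statsmodels.api as sm

# ks returns a bandwidth *matrix* H on the variance scale; statsmodels'
# KDEMultivariate expects one bandwidth per variable on the standard-deviation
# scale, so take the square roots of the diagonal entries of H.
bw_from_ks = np.sqrt(np.diag(ks_hpi_diag))

dens = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw=bw_from_ks
)

# evaluate the density at the sample points (or on any (m, 2) grid)
pdf_values = dens.pdf(data_x)
```

Note that this only works for diagonal selectors such as Hpi.diag: as mentioned in the comments, KDEMultivariate uses a product kernel, so a full (non-diagonal) bandwidth matrix such as the one returned by Hns cannot be passed to it directly.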

