I am trying to estimate the bandwidth parameter of a multivariate KDE in R and then use the estimate to evaluate the KDE in Python.
The reason for this somewhat convoluted approach is that my project is in Python, but, as far as I know, there is no multivariate implementation of a plug-in bandwidth selector in Python. So I resorted to estimating the diagonal bandwidth matrix with R's ks package. Unfortunately, something does not work as I expected. I boiled it down to the following example:
```python
import pandas as pd
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import statsmodels.api as sm

# ###############################################################################
# # call this part only once to install "ks"
# utils = importr('utils')
# base = importr('base')
# # select a mirror, otherwise the user is prompted
# utils.chooseCRANmirror(ind=1)  # select the first mirror in the list
# # install required package "ks"
# utils.install_packages('ks')
# ###############################################################################


def get_bandwidth(data):
    ks = importr("ks")
    with localconverter(ro.default_converter + pandas2ri.converter):
        # convert pandas.DataFrame to R data.frame
        r_from_pd_df = ro.conversion.py2rpy(data)
        # bandwidth selection with the normal scale rule of thumb
        ks_hns = ks.Hns(r_from_pd_df)
        ks_hns = ro.conversion.rpy2py(ks_hns)
        # plug-in bandwidth, constrained to a diagonal matrix
        ks_hpi_diag = ks.Hpi_diag(r_from_pd_df)
        ks_hpi_diag = ro.conversion.rpy2py(ks_hpi_diag)
    return ks_hns, ks_hpi_diag


# create test data
data_x, data_y = make_blobs(
    n_samples=1000, n_features=2, centers=3, cluster_std=0.5, random_state=0
)
# re-scale one dimension
data_x[:, 0] = data_x[:, 0] * 20

plt.hexbin(data_x[:, 0], data_x[:, 1])
plt.show()

ks_hns, ks_hpi_diag = get_bandwidth(pd.DataFrame(data_x))

dens_n = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="normal_reference"
)
dens_cw = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="cv_ml"
)

print("Python 'statsmodels' normal reference bandwidth estimate:")
print(np.diag(dens_n.bw))
print("R 'ks' normal scale bandwidth:")
print(ks_hns)
print("Python 'statsmodels' cross validation bandwidth estimate:")
print(np.diag(dens_cw.bw))
print("R 'ks' PI bandwidth diagonal:")
print(ks_hpi_diag)
```

This returns:
```
Python 'statsmodels' normal reference bandwidth estimate:
[[10.66300197  0.        ]
 [ 0.          0.4962383 ]]
R 'ks' normal scale bandwidth:
[[101.29354253  -1.81323816]
 [ -1.81323816    0.21938319]]
Python 'statsmodels' cross validation bandwidth estimate:
[[4.40348543 0.        ]
 [0.         0.19790704]]
R 'ks' PI bandwidth diagonal:
[[20.52921948  0.        ]
 [ 0.           0.04962396]]
```

I would expect the normal reference (rule of thumb) selectors of the two implementations to give not exactly the same results, but at least something of the same order of magnitude. The same goes for ks's plug-in method and statsmodels' cross-validation method (OK, here I am not that sure).
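To see where the numbers might come from, here is my attempt to recompute both rule-of-thumb results from the raw data. The formulas are my reading of the statsmodels source and the ks documentation, so treat the exact constants as assumptions:

```python
# Sketch: recompute both normal-scale rules by hand. The formulas are my
# reading of the statsmodels source and the ks docs, i.e. assumptions.
import numpy as np

n, d = data_x.shape  # 1000 samples, 2 dimensions

# statsmodels-style rule of thumb: per dimension, on the standard-deviation scale
sigma = data_x.std(axis=0)
h_sm = 1.06 * sigma * n ** (-1.0 / (d + 4))
print(h_sm)   # should be close to dens_n.bw

# ks::Hns-style rule: a full matrix, on the variance/covariance scale
Sigma = np.cov(data_x, rowvar=False)
H_ns = (4.0 / (d + 2)) ** (2.0 / (d + 4)) * n ** (-2.0 / (d + 4)) * Sigma
print(H_ns)   # should be close to ks_hns
```

If I got the formulas right, one rule lives on the standard-deviation scale and the other on the variance scale, which is exactly the kind of parameterization difference I suspect below.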
As suggested by @Josef, I plotted the different KDEs (the plotting code is not shown; a sketch of the evaluation follows below).

The plots in the top row were produced with statsmodels and its bandwidth estimates, the ones in the bottom row with ks.

It seems to me from the plots that the different bandwidths produce comparable results within their corresponding implementations. For example, the two plots on the left are similar, even though the bandwidths differ considerably.
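For reference, a minimal sketch of how one can evaluate a Gaussian KDE in Python with the full matrix from ks, treating the matrix as the kernel covariance. This is a simplified reconstruction, not my exact plotting code, and `kde_eval` is my own hypothetical helper, not a library function:

```python
# Sketch (not the exact plotting code): evaluate a Gaussian KDE in Python
# with a bandwidth matrix H from ks, treating H as the kernel covariance.
# kde_eval is a hypothetical helper, not a library function.
import numpy as np
from scipy.stats import multivariate_normal


def kde_eval(points, data, H):
    """f(x) = (1/n) * sum_i N(x; data_i, H), evaluated at each row of points."""
    dens = np.zeros(points.shape[0])
    for xi in data:
        dens += multivariate_normal.pdf(points, mean=xi, cov=H)
    return dens / data.shape[0]


xx, yy = np.meshgrid(
    np.linspace(data_x[:, 0].min(), data_x[:, 0].max(), 50),
    np.linspace(data_x[:, 1].min(), data_x[:, 1].max(), 50),
)
grid = np.column_stack([xx.ravel(), yy.ravel()])
dens_ks = kde_eval(grid, data_x, ks_hpi_diag)  # bottom-row style density
dens_sm = dens_cw.pdf(grid)                    # top-row style density
```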
Why are the estimates of the two so different?

The only idea I have is that they use different parameterizations or scales, but I couldn't find much on that. If that is the case, I would appreciate it if somebody could give me a hint on where I could read up on it.
Rversion returns a "bandwidth matrix," almost surely that will be the analog of a covariance matrix. Because Python returns only a diagonal matrix, it is plausible that it has taken the square roots so that these numbers can be interpreted as a characteristic distance. That is why it is striking that the matrices returned byRin your case are approximately the squares of the matrices returned by Python. $\endgroup$ksuses a multivariate gaussian kernel similar to scipy gaussian_kde. $\endgroup$