I was looking for an answer to this bandwidth matrix optimisation problem and I found this excellent other thread, so I thought I'd drop it here. :
In short, it says that the sklearn KernelDensity() implementation uses bandwidth as a multiplier of the diagonal matrix (so second case of Tim's answer), while statsmodel's KDEMultivariate() estimates different multipliers (so third picture, I believe). I am not sure how this compares to scipy which multiplies the covariance matrix by the single scalar. It looks to fall in the same case as KDEMultivariate(), but with a little less control over the dimension-specific toggling. From what I understand (again from that other stackoverflow answer), they both use rule of thumb for coming up with the covariance matrix.