0
$\begingroup$

To draw random samples from a custom distribution, I recall reading that KDE's are better than histograms. (See hadley's comment here.)

When I experimented in R, I am finding that the KDE method yields results significantly different from the histogram.

I use the annual return rate of S&P500 (from 1926 to 2012) and draw from it.

Reproducible Code below:

#SP500 annual returns x<- c(11.62, 37.49, 43.61, -8.42, -24.9, -43.34, -8.19, 53.99, -1.44, 47.67, 33.92, -35.03, 31.12, -0.41, -9.78, -11.59, 20.34, 25.9, 19.75, 36.44, -8.07, 5.71, 5.5, 18.79, 31.71, 24.02, 18.37, -0.99, 52.62, 31.56, 6.56, -10.78, 43.36, 11.96, 0.47, 26.89, -8.73, 22.8, 16.48, 12.45, -10.06, 23.98, 11.06, -8.5, 4.01, 14.31, 18.98, -14.66, -26.47, 37.2, 23.84, -7.18, 6.56, 18.44, 32.5, -4.92, 21.55, 22.56, 6.27, 31.73, 18.67, 5.25, 16.61, 31.69, -3.11, 30.47, 7.62, 10.08, 1.32, 37.58, 22.96, 33.36, 28.58, 21.04, -9.11, -11.89, -22.1, 28.68, 10.88, 4.91, 15.79, 5.49, -37, 26.46, 15.06, 2.11, 16) #Drawing from the histogram of X hist(x) h <- hist(x, probability=TRUE, breaks=40) scale <- sum(h$density) sum(h$density/scale) # check if 1 #F-inverse function (cdf) for the histogram, scaled to total 1 cumprob <- cumsum(h$density/scale) #To Generate one random sample from the histogram ret.bucket<- sum(ifelse(runif(1)>cumprob,1,0)) h$mids[ret.bucket+1] #This would be the obtained sample return #Draw from the Kernel Density Estimate kde <- density(x, bw=10) #I experimented with various buckets kde plot(kde) kdf <- as.data.frame(kde$x) #512 rows to choose from names(kdf) <- "sp500Return" 

Update: The following is wrong. Cannot simply draw uniformly from the KDE.

#To generate one sample using the KDE kdf[runif(1,1,512),] 

Now let's generate a few samples and compare the two methods:

#Compare the 2 methods ret.kde = NULL ret.hist = NULL rndi <- NULL for (i in 1:5000) { rnd <- runif(1) rndi[i] <- rnd rndrow <- as.integer(512 * rnd) ret.kde [i] <- kdf[rndrow+1,] bucket<- sum(ifelse(rnd > cumprob, 1, 0)) ret.hist[i] <- h$mids[bucket+1] } mean(ret.kde) sd(ret.kde) mean(ret.hist) sd(ret.hist) cbind(rndi, ret.kde, ret.hist) 

Results I got in one run:

> mean(ret.kde); mean(ret.hist) [1] 5.691015 #wrong implementation [1] 11.8404 > sd(ret.kde); sd(ret.hist) [1] 45.87001 #wrong implementation. See below for update [1] 20.15519 

Question: Why are these two methods yielding results that are so different? Is my implementation flawed, or is it a matter of needing to clip the KDE at its ends?

UPDATE:

I found the answer in this response to this CV question. My original method of drawing from the KDE is flawed.

Here's an updated implementation of drawing from the KDE:

#Two Step process To Draw from the Kernel Density Estimate #Step 1 First get one of the original points, randomly. sampleFromKDE <- function(x, bw) { rnd <- sample(length(x),1) xi <- x[rnd] #Step 2: Get a N(xi, bw) around the selected Kernel #bw=10 return(rnorm(1, xi, bw) ) } k<-NULL for (i in 1:5000) { k[i] <- sampleFromKDE(x, 3) } 

leading to

#> mean(k) #[1] 12.20025 #> sd(k) #[1] 19.88572 
$\endgroup$

1 Answer 1

3
$\begingroup$

The mean and standard deviation of the histogram method are quite close to the sample mean and sample standard deviation of the raw data. The KDE is way off.

I tried this in Matlab (because I do not know R) with the same kernel width and generated 5000 random points. My mean and standard deviation look good:

 x = [11.62, 37.49, 43.61, -8.42, -24.9, -43.34, -8.19, 53.99, -1.44, 47.67, 33.92, -35.03, 31.12, -0.41, -9.78,-11.59,20.34, 25.9, 19.75, 36.44, -8.07, 5.71, 5.5, 18.79, 31.71, 24.02, 18.37, -0.99, 52.62, 31.56, 6.56, -10.78, 43.36,11.96, 0.47, 26.89, -8.73, 22.8, 16.48, 12.45, -10.06, 23.98, 11.06, -8.5, 4.01, 14.31, 18.98, -14.66, -26.47, 37.2,23.84, -7.18, 6.56, 18.44, 32.5, -4.92, 21.55, 22.56, 6.27, 31.73, 18.67, 5.25, 16.61, 31.69, -3.11, 30.47, 7.62,10.08,1.32, 37.58, 22.96, 33.36, 28.58, 21.04, -9.11, -11.89, -22.1, 28.68, 10.88, 4.91, 15.79, 5.49, -37, 26.46, 15.06,2.11,16]'; kd = fitdist(x, 'kernel', 'width', 10); randdata = kd.random(1000,1); mean(randdata) std(randdata)  

which gives me

 ans = 10.9047 ans = 23.0171 

I guess there a bug in your code.

$\endgroup$
2
  • $\begingroup$ +1 The code does not generate data from the KDE correctly: it chooses values uniformly from the density rather than the distribution. $\endgroup$ Commented Mar 19, 2013 at 23:31
  • $\begingroup$ Thanks, Atul, @whuber. I will look around for how to choose values from the KDE distribution. $\endgroup$ Commented Mar 20, 2013 at 1:10

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.