To draw random samples from a custom distribution, I recall reading that KDE's are better than histograms. (See hadley's comment here.)
When I experimented in R, I am finding that the KDE method yields results significantly different from the histogram.
I use the annual return rate of S&P500 (from 1926 to 2012) and draw from it.
Reproducible Code below:
#SP500 annual returns x<- c(11.62, 37.49, 43.61, -8.42, -24.9, -43.34, -8.19, 53.99, -1.44, 47.67, 33.92, -35.03, 31.12, -0.41, -9.78, -11.59, 20.34, 25.9, 19.75, 36.44, -8.07, 5.71, 5.5, 18.79, 31.71, 24.02, 18.37, -0.99, 52.62, 31.56, 6.56, -10.78, 43.36, 11.96, 0.47, 26.89, -8.73, 22.8, 16.48, 12.45, -10.06, 23.98, 11.06, -8.5, 4.01, 14.31, 18.98, -14.66, -26.47, 37.2, 23.84, -7.18, 6.56, 18.44, 32.5, -4.92, 21.55, 22.56, 6.27, 31.73, 18.67, 5.25, 16.61, 31.69, -3.11, 30.47, 7.62, 10.08, 1.32, 37.58, 22.96, 33.36, 28.58, 21.04, -9.11, -11.89, -22.1, 28.68, 10.88, 4.91, 15.79, 5.49, -37, 26.46, 15.06, 2.11, 16) #Drawing from the histogram of X hist(x) h <- hist(x, probability=TRUE, breaks=40) scale <- sum(h$density) sum(h$density/scale) # check if 1 #F-inverse function (cdf) for the histogram, scaled to total 1 cumprob <- cumsum(h$density/scale) #To Generate one random sample from the histogram ret.bucket<- sum(ifelse(runif(1)>cumprob,1,0)) h$mids[ret.bucket+1] #This would be the obtained sample return #Draw from the Kernel Density Estimate kde <- density(x, bw=10) #I experimented with various buckets kde plot(kde) kdf <- as.data.frame(kde$x) #512 rows to choose from names(kdf) <- "sp500Return" Update: The following is wrong. Cannot simply draw uniformly from the KDE.
#To generate one sample using the KDE kdf[runif(1,1,512),] Now let's generate a few samples and compare the two methods:
#Compare the 2 methods ret.kde = NULL ret.hist = NULL rndi <- NULL for (i in 1:5000) { rnd <- runif(1) rndi[i] <- rnd rndrow <- as.integer(512 * rnd) ret.kde [i] <- kdf[rndrow+1,] bucket<- sum(ifelse(rnd > cumprob, 1, 0)) ret.hist[i] <- h$mids[bucket+1] } mean(ret.kde) sd(ret.kde) mean(ret.hist) sd(ret.hist) cbind(rndi, ret.kde, ret.hist) Results I got in one run:
> mean(ret.kde); mean(ret.hist) [1] 5.691015 #wrong implementation [1] 11.8404 > sd(ret.kde); sd(ret.hist) [1] 45.87001 #wrong implementation. See below for update [1] 20.15519 Question: Why are these two methods yielding results that are so different? Is my implementation flawed, or is it a matter of needing to clip the KDE at its ends?
UPDATE:
I found the answer in this response to this CV question. My original method of drawing from the KDE is flawed.
Here's an updated implementation of drawing from the KDE:
#Two Step process To Draw from the Kernel Density Estimate #Step 1 First get one of the original points, randomly. sampleFromKDE <- function(x, bw) { rnd <- sample(length(x),1) xi <- x[rnd] #Step 2: Get a N(xi, bw) around the selected Kernel #bw=10 return(rnorm(1, xi, bw) ) } k<-NULL for (i in 1:5000) { k[i] <- sampleFromKDE(x, 3) } leading to
#> mean(k) #[1] 12.20025 #> sd(k) #[1] 19.88572