I would like to create an interactive demonstration using `Manipulate[]` that illustrates a fundamental concept in pattern classification. This problem is related to http://mathematica.stackexchange.com/questions/5950/speed-up-densityhistogram-rendering-display, but differs in a crucial way, such that `ListContourPlot` will not suffice (see below).
Suppose we have two two-dimensional data sets, `data1` and `data2`, representing "training points" from two categories. One form of (trained) classifier uses all those points to partition the full two-dimensional plane into regions `myRegion1` and `myRegion2`, such that if a subsequent "test" point `x` lies in `myRegion1` it is called a member of category 1 and if it lies in `myRegion2` it is called a member of category 2. In general, a complex classifier will create `myRegion1` such that all the points of `data1` lie in `myRegion1`, and likewise for `data2` and `myRegion2`. (This is called "overfitting the data.") Simpler classifiers produce smoother, less-complex regions, and may not yield decision regions in which all of the training points lie in their corresponding regions. (This is called "underfitting the data.") For deep reasons, it turns out that the optimum classifier should have intermediate complexity.
My desired interactive demonstration will let the user adjust a single slider controlling the classifier's complexity, with the regions `myRegion1` and `myRegion2` responding accordingly. (I'd also like to show the data points.)
Here's some basic code:
    data1 = RandomVariate[
       MultinormalDistribution[{-1.5, 0}, {{2, 0}, {0, 1}}], {10}];
    data2 = RandomVariate[
       MultinormalDistribution[{1.5, 0}, {{2, 0}, {0, 1}}], {10}];
I can create the regions as follows:
    myRegion1 =
     ImplicitRegion[
      PDF[SmoothKernelDistribution[data1, 1], {x, y}] >
       PDF[SmoothKernelDistribution[data2, 1], {x, y}], {x, y}];
    myRegion2 =
     ImplicitRegion[
      PDF[SmoothKernelDistribution[data1, 1], {x, y}] <
       PDF[SmoothKernelDistribution[data2, 1], {x, y}], {x, y}];
Here I've arbitrarily set the estimation kernel size to be r = 1.0 as a measure of the complexity of the classifier. (If r is small, then the classifier and its decision regions are "more complex," and if r is large, then the classifier and its decision regions are "less complex.") The regions for r = 1 look like this:
    RegionPlot[{myRegion1, myRegion2},
     PlotRange -> {{-4, 4}, {-4, 4}},
     PlotStyle -> {{Opacity[0.2], Red}, {Opacity[0.2], Green}},
     Epilog -> {Red, PointSize[0.03], Point[data1],
       Green, PointSize[0.03], Point[data2]}]
![Classifier with r = 1][1]
My goal is to have a `Manipulate[]` environment where the user can adjust the kernel size r and watch how the decision regions change in shape, and how the number of points in their "proper" regions also changes. I'd also like to use `RegionPlot` and region objects so the code can automatically count the number of points in their proper regions, using `RegionMember[]`. (This will show how the classifier error depends upon the complexity, i.e., upon r.)
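For the counting step, this sketch shows what I have in mind, assuming `myRegion1` and `myRegion2` as defined above (`RegionMember[region]` with one argument returns a reusable membership function):

```mathematica
(* count training points that land in their "proper" regions *)
in1 = RegionMember[myRegion1];
in2 = RegionMember[myRegion2];
nCorrect = Count[data1, p_ /; in1[p]] + Count[data2, p_ /; in2[p]];
```

Since `myRegion1` is defined by a PDF inequality anyway, comparing the two PDF values directly at each training point should give the same counts and might well be faster than going through `RegionMember[]`.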
I can create such an interactive demonstration, but it is FAR too slow. I'm seeking tricks that will allow the user to see the effects of a change in classifier complexity rapidly... ideally within 0.5 seconds or so.
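For concreteness, here is a bare-bones sketch of the kind of `Manipulate[]` I mean. The speed-oriented options (`PlotPoints`, `MaxRecursion`, `PerformanceGoal -> "Speed"`, `ContinuousAction -> False`) are my guesses at knobs worth tuning, not a working solution:

```mathematica
Manipulate[
 Module[{d1 = SmoothKernelDistribution[data1, r],
   d2 = SmoothKernelDistribution[data2, r], g},
  g = PDF[d1, {x, y}] - PDF[d2, {x, y}]; (* decision function *)
  RegionPlot[{g > 0, g < 0}, {x, -4, 4}, {y, -4, 4},
   PlotPoints -> 25, MaxRecursion -> 1, PerformanceGoal -> "Speed",
   PlotStyle -> {{Opacity[0.2], Red}, {Opacity[0.2], Green}},
   Epilog -> {Red, PointSize[0.03], Point[data1],
     Green, PointSize[0.03], Point[data2]}]],
 {{r, 1}, 0.3, 2}, ContinuousAction -> False]
```

Even with coarse `PlotPoints` and no recursion, rebuilding the two kernel distributions and re-rendering on every slider release is still far slower than the ~0.5 s I'm after.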
I've tried many obvious tricks, but they seem to give Mathematica problems. One obvious trick is to define `myRegion2` as `RegionDifference[myBackgroundRegion, myRegion1]`, where `myBackgroundRegion` is a simple square covering the two-dimensional area of interest. However, Mathematica gives error messages, so I define `myRegion2` explicitly, as in the code above. Another trick is to set discrete values for r, e.g., `{{r, 1}, .5, 2, .5}`, so that there is just a finite list of values. Likewise, standard tricks such as `ContinuousAction -> False` do not sufficiently speed up the region calculation and rendering.
I've found that `SmoothHistogram[]` is faster than `SmoothKernelDistribution`, but alas there seems to be no direct way to "subtract histograms" so as to find the boundary between `myRegion1` and `myRegion2`.
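One idea I've been toying with along these lines (a sketch only; I haven't checked how it compares to a proper kernel density estimate): bin both data sets on a common grid with `BinCounts`, smooth the binned counts with `GaussianFilter` (whose radius would play the role of r), and contour the difference. That is pure array arithmetic, so it should be fast:

```mathematica
(* "subtract histograms": smooth binned counts, then contour the difference *)
h1 = BinCounts[data1, {-4, 4, 0.1}, {-4, 4, 0.1}];
h2 = BinCounts[data2, {-4, 4, 0.1}, {-4, 4, 0.1}];
diff = GaussianFilter[h1 - h2, 10]; (* filter radius ~ kernel size r *)
ListContourPlot[Transpose[diff], Contours -> {0},
 DataRange -> {{-4, 4}, {-4, 4}}]
```

(The `Transpose` is because `BinCounts` indexes the array as [xbin, ybin] while `ListContourPlot` expects rows to run along y.) The zero contour of `diff` would be the decision boundary, but it's a discretized stand-in for the smooth boundary I actually want.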
I'm not wedded to using `SmoothKernelDistribution` if some faster method can be devised. The only strict requirements are that the classifier boundaries be smooth (so nearest-neighbor or other discrete methods are ruled out) and--most importantly--the `Manipulate[]` interactivity is fairly fast.
I suspect I'll have to pre-compute figures for different values of r and have the user merely display the figure having a particular r--like flipping through a flip book or scrolling to a frame in a movie. But before I give up hope and make such a database of pre-computed images, I thought I'd see if anyone can solve this programming problem.
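If it comes to that, the precomputation I have in mind would look something like this sketch (a discrete list of r values, with frames cached in an association so the `Manipulate[]` body is a pure lookup):

```mathematica
rValues = Range[0.3, 2.0, 0.1];
frames = Association[
   Table[r -> RegionPlot[
      PDF[SmoothKernelDistribution[data1, r], {x, y}] >
       PDF[SmoothKernelDistribution[data2, r], {x, y}],
      {x, -4, 4}, {y, -4, 4},
      PlotStyle -> {Opacity[0.2], Red},
      Epilog -> {Red, PointSize[0.03], Point[data1],
        Green, PointSize[0.03], Point[data2]}],
     {r, rValues}]];
Manipulate[frames[r], {r, rValues}] (* lookup only, so display is fast *)
```

But that abandons a continuous slider, so I'd much rather have a genuinely fast computation.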
[1]: https://i.sstatic.net/2vPJv.png