Clustering procedure where each cluster has an equal number of points?

Question

I have some points $X=\{x_1,...,x_n\}$ in $R^p$, and I want to cluster the points so that:

Each cluster contains an equal number of elements of $X$. (Assume that the number of clusters divides $n$.)
Each cluster is "spatially cohesive" in some sense, like the clusters from $k$-means.

It's easy to think of a lot of clustering procedures that satisfy one or the other of these, but does anyone know of a way to get both at once?

Is the cluster size also specified? Then, as stated, the problem seems to be unsolvable to me. Consider the following case with $n = 4, p = 1$: $X = \{-1, 0.99, 1, 1.01\}$. If you want 2 clusters, you get either different sizes or not "spatially cohesive". Or do you want something like, "as spatially cohesive as possible" - minimizing the maximal intra-cluster distance or so? The other solution would be to allow any divisor of $n$ as cluster size - but then there's always the trivial solution of $n$ clusters of size $1$. — Erik P.
– Erik P., Commented Mar 24, 2011 at 23:46
Good point. Ideally I would like something that is "as spatially cohesive as possible" while satisfying the equal cardinality constraint. But I'd be interested to hear about any procedures that make other tradeoffs here as well, since maybe they can be adapted. — Not Durrett
– Not Durrett, Commented Mar 25, 2011 at 2:16
Would splitting the data by quantiles suffice? If the values are not monotonic relative to each other, I fail to see how else they could be 'spatially cohesive'. — celenius
– celenius, Commented Mar 25, 2011 at 2:24
There has been some recent research on constrained clustering. Google google.com/search?q=constrained+k-means . — whuber
– whuber ♦, Commented Mar 25, 2011 at 2:42
Just one untested idea. In clustering, the so-called Silhouette statistic is often used. It shows how well an object is clustered and which neighbouring cluster would be the next-best home for it. So you could start with K-MEANS or another classification that yields clusters with different n's. Then move the objects that are not very well classified (according to the statistic) to their best neighbour clusters with smaller n, until you obtain equal n. I expect this to take iterations: move some objects, recalculate the statistic, move some more objects, etc. That will be a trade-off process. — ttnphns
– ttnphns, Commented Mar 25, 2011 at 7:31

Jonas · Accepted Answer · 2011-03-27 14:28:15Z

I suggest a two-step approach:

1. Get good initial estimates of the cluster centers, e.g. using hard or fuzzy K-means.
2. Use Global Nearest Neighbor assignment to associate points with cluster centers: calculate a distance matrix between each point and each cluster center (you can make the problem a bit smaller by only calculating reasonable distances), replicate each cluster center X times (X being the desired cluster size), and solve the linear assignment problem. You'll get, for each cluster center, exactly X matches to data points, so that, globally, the total distance between data points and their cluster centers is minimized.

Note that you can update cluster centers after step 2 and repeat step 2 to basically run K-means with fixed number of points per cluster. Still, it will be a good idea to get a good initial guess first.
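The two steps above can be sketched as follows. This is only a minimal illustration, not the answerer's code: the function names `balanced_assign` and `equal_size_kmeans` are hypothetical, the initial centers are simply random data points (a real run would start from a proper k-means fit, as suggested), and squared Euclidean distance is an assumption chosen to match the k-means objective. The assignment problem is solved with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def balanced_assign(X, centers):
    """Assign each row of X to a center so that every center gets n // k points."""
    n, k = len(X), len(centers)
    if n % k != 0:
        raise ValueError("number of points must be divisible by number of centers")
    size = n // k
    # Replicate each center `size` times: column c of the cost matrix is a
    # "slot" that belongs to center c // size.
    cost = np.repeat(cdist(X, centers, "sqeuclidean"), size, axis=1)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(n, dtype=int)
    labels[rows] = cols // size
    return labels


def equal_size_kmeans(X, k, n_iter=20, seed=0):
    """Alternate balanced assignment and center updates (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # Simplification: random data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = balanced_assign(X, centers)
    for _ in range(n_iter):
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        new_labels = balanced_assign(X, centers)
        if np.array_equal(new_labels, labels):
            break  # assignment is stable
        labels = new_labels
    return labels
```

Note that the replicated cost matrix is n x n, so memory and running time grow quickly with the number of points; this is where restricting the problem to "reasonable" distances would help.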

Has QUIT--Anony-Mousse · Accepted Answer · 2012-01-12 18:55:26Z

Try this k-means variation:

Initialization:

Choose k centers from the dataset at random, or even better, using the k-means++ strategy.
For each point, compute the distance to its nearest cluster center, and build a heap of the points keyed on this distance.
Draw points from the heap, and assign each to its nearest cluster, unless that cluster is already full. If it is, compute the next nearest cluster center and reinsert the point into the heap.

In the end, you should have a partitioning that satisfies your requirement of the same number of objects per cluster, ±1. Make sure the last few clusters also have the right number: the first m clusters should have ceil(n/k) objects, the remainder exactly floor(n/k) objects.
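As a rough sketch of this initialization (not a tested reference implementation): the function name `balanced_init` is hypothetical, the centers are assumed to be given already (for example from k-means++), and Euclidean distance is assumed. Each heap entry remembers which of the point's candidate centers it currently refers to, so a rejected point can be reinserted with its next-nearest center.

```python
import heapq

import numpy as np


def balanced_init(X, centers):
    """Greedy heap-based assignment; cluster sizes differ by at most one."""
    n, k = len(X), len(centers)
    # Distance from every point to every center, and per-point center ranking.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    # The first n % k clusters may hold ceil(n / k) points, the rest floor(n / k).
    capacity = [n // k + (1 if j < n % k else 0) for j in range(k)]
    labels = np.full(n, -1)
    # Heap entries: (distance to candidate center, point index, candidate rank).
    heap = [(float(dist[i, order[i, 0]]), i, 0) for i in range(n)]
    heapq.heapify(heap)
    while heap:
        _, i, rank = heapq.heappop(heap)
        j = int(order[i, rank])
        if capacity[j] > 0:
            labels[i] = j
            capacity[j] -= 1
        else:
            # Cluster j is full: retry this point with its next-nearest center.
            nxt = int(order[i, rank + 1])
            heapq.heappush(heap, (float(dist[i, nxt]), i, rank + 1))
    return labels
```

Because the capacities sum to n, an unassigned point always has at least one non-full cluster left, so `rank + 1` never runs past the last center.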

Iteration step:

Requisites: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).

E step: compute the updated cluster centers as in regular k-means

M step: Iterating through all points (either just one, or all in one batch)

Compute nearest cluster center to object / all cluster centers that are closer than the current clusters. If it is a different cluster:

If the other cluster is smaller than the current cluster, just move it to the new cluster
If there is a swap proposal from the other cluster (or any cluster with a lower distance), swap the two element cluster assignments (if there is more than one offer, choose the one with the largest improvement)
otherwise, indicate a swap proposal for the other cluster

The cluster sizes remain invariant (± the ceil/floor difference), and objects are only moved from one cluster to another as long as this results in an improvement of the estimation. It should therefore converge at some point, like k-means. It might be a bit slower (i.e. need more iterations), though.
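One pass of the iteration described above might look roughly like the sketch below. It is a simplified reading of the description, not a tested implementation: `refine_step` is a hypothetical name, squared Euclidean distance is assumed, each point only considers its single nearest center, and a swap is accepted only when it lowers the summed distance of both points to the (fixed) centers. `labels` must be an integer NumPy array and is modified in place.

```python
import numpy as np


def refine_step(X, labels, k):
    """Update centers, then do one pass of moves and swaps; return the number of changes."""
    n = len(X)
    lo, hi = n // k, -(-n // k)  # allowed cluster sizes: floor and ceil of n / k
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    sizes = np.bincount(labels, minlength=k)
    proposals = [[] for _ in range(k)]  # proposals[j]: points in cluster j that want to leave
    changes = 0
    for i in range(n):
        cur = int(labels[i])
        best = int(np.argmin(dist[i]))
        if best == cur:
            continue
        if sizes[best] < hi and sizes[cur] > lo:
            # Plain move: both clusters stay within the allowed size range.
            labels[i] = best
            sizes[cur] -= 1
            sizes[best] += 1
            changes += 1
            continue
        # Otherwise look for the swap partner in `best` with the largest gain.
        gains = [(dist[i, cur] + dist[p, best] - dist[i, best] - dist[p, cur], p)
                 for p in proposals[best] if labels[p] == best]
        gain, p = max(gains, default=(0.0, -1))
        if gain > 0:
            labels[i], labels[p] = best, cur
            proposals[best].remove(p)
            changes += 1
        else:
            proposals[cur].append(i)  # remember that i would like to leave `cur`
    return changes
```

Calling `refine_step` repeatedly until it returns 0 gives the full loop. Since every accepted move or swap lowers the cost with respect to the current centers, and the center update lowers it again, the within-cluster sum of squares should not increase from one call to the next.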

I do not know if this has been published or implemented before. It's just what I would try (if I were to try k-means; there are much better clustering algorithms).

A good place to start might be the k-means implementation in ELKI, which already seems to support three different initializations (including k-means++), and the authors said they also want to have different iteration strategies, to cover all the various common variants in a modular fashion (e.g. Lloyd, MacQueen, ...).

A similar approach is included in ELKI as a tutorial and in the tutorial "extension" module: elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans — SO is dead
– SO is dead, Commented Jun 6, 2016 at 9:49

Alexander Kain · Accepted Answer · 2013-11-27 18:56:37Z

Recently I needed this myself for a not very large dataset. My answer, although it has a relatively long running time, is guaranteed to converge to a local optimum.

    import numpy
    from scipy.spatial.distance import pdist, squareform
    from matplotlib import pyplot as plt
    from matplotlib import animation as ani
    from matplotlib.patches import Polygon
    from matplotlib.collections import PatchCollection


    def eqsc(X, K=None, G=None):
        "equal-size clustering based on data exchanges between pairs of clusters"

        def error(K, m, D):
            """return average distances between data in one cluster, averaged over all clusters"""
            E = 0
            for k in range(K):
                i = numpy.where(m == k)[0]  # indices of datapoints belonging to class k
                # numpy.ix_ selects the within-cluster block of the distance matrix
                E += numpy.mean(D[numpy.ix_(i, i)])
            return E / K

        numpy.random.seed(0)  # repeatability
        N, n = X.shape
        if G is None and K is not None:
            G = N // K  # group size
        elif K is None and G is not None:
            K = N // G  # number of clusters
        else:
            raise Exception('must specify either K or G')
        D = squareform(pdist(X))  # distance matrix
        m = numpy.random.permutation(N) % K  # initial membership
        E = error(K, m, D)
        # visualization
        #FFMpegWriter = ani.writers['ffmpeg']
        #writer = FFMpegWriter(fps=15)
        #fig = plt.figure()
        #with writer.saving(fig, "ec.mp4", 100):
        t = 1
        while True:
            E_p = E
            for a in range(N):  # systematically
                for b in range(a):
                    m[a], m[b] = m[b], m[a]  # exchange membership
                    E_t = error(K, m, D)
                    if E_t < E:
                        E = E_t
                        print("{}: {}<->{} E={}".format(t, a, b, E))
                        #plt.clf()
                        #for i in range(N):
                            #plt.text(X[i,0], X[i,1], m[i])
                        #writer.grab_frame()
                    else:
                        m[a], m[b] = m[b], m[a]  # put them back
            if E_p == E:  # no improving exchange in a full pass
                break
            t += 1

        fig, ax = plt.subplots()
        patches = []
        for k in range(K):
            i = numpy.where(m == k)[0]  # indices of datapoints belonging to class k
            x = X[i]
            patches.append(Polygon(x[:, :2], closed=True))  # how to draw this clock-wise?
            u = numpy.mean(x, 0)
            plt.text(u[0], u[1], k)
        p = PatchCollection(patches, alpha=0.5)
        ax.add_collection(p)
        plt.show()


    if __name__ == "__main__":
        N, n = 100, 2
        X = numpy.random.rand(N, n)
        eqsc(X, G=3)

Thanks for this contribution, @user2341646. Would you mind adding some exposition that explains what this solution is, how it works, & why it is a solution? — gung - Reinstate Monica
– gung - Reinstate Monica, Commented Nov 23, 2013 at 16:51
OK. Basically, the algorithm starts with membership assignments that are random, but there are close to G members in a cluster, and there are K clusters overall. We define the error function that measures the average distances between data in one cluster, averaged over all clusters. Going through all pairs of data systematically, we see if exchanging the membership of those two data results in a lower error. If it does, we update the lowest possible error, otherwise we undo the membership switch. We do this until there are no more moves left for one entire pass. — Alexander Kain
– Alexander Kain, Commented Nov 27, 2013 at 19:00
Hey Alexander, sorry for resurrecting your answer, but do you have any paper for referencing purposes? — tcokyasar
– tcokyasar, Commented Mar 2, 2021 at 17:50

Open Door Logistics · Accepted Answer · 2018-05-07 02:46:47Z

This is an optimisation problem. We have an open source Java library which solves this problem (clustering where the quantity per cluster must be between set ranges). Your total number of points should be at most a few thousand though: no more than 5000, or maybe 10000.

The library is here:

https://github.com/PGWelch/territorium/tree/master/territorium.core

The library itself is set up for geographic / GIS type problems, so you will see references to X and Ys, latitudes and longitudes, customers, distance and time, etc. You can just ignore the 'geographic' elements though and use it as a pure clusterer.

You provide a set of initially empty input clusters each with a min and max target quantity. The clusterer will assign points to your input clusters, using a heuristic-based optimisation algorithm (swaps, moves etc). In the optimisation it firstly prioritises keeping each cluster within its min and max quantity range and then secondly minimises the distances between all points in the cluster and the cluster's central point, so a cluster is spatially cohesive.

You give the solver a metric function (i.e. distance function) between points using this interface:

https://github.com/PGWelch/territorium/blob/master/territorium.core/src/main/java/com/opendoorlogistics/territorium/problem/TravelMatrix.java

The metric is actually structured to return both a distance and a 'time', because it's designed for travel-based geographic problems, but for arbitrary clustering problems just set 'time' to zero and distance to the actual metric you're using between points.

You'd set up your problem in this class:

https://github.com/PGWelch/territorium/blob/master/territorium.core/src/main/java/com/opendoorlogistics/territorium/problem/Problem.java

Your points would be the 'Customers' and their quantity would be 1. In the customer class ensure you set costPerUnitTime = 0 and costPerUnitDistance=1 assuming you're returning your metric distance in the 'distance' field returned by the TravelMatrix.

https://github.com/PGWelch/territorium/blob/master/territorium.core/src/main/java/com/opendoorlogistics/territorium/problem/Customer.java

See here for an example of running the solver:

https://github.com/PGWelch/territorium/blob/master/territorium.core/src/test/java/com/opendoorlogistics/territorium/TestSolver.java

osdf · Accepted Answer · 2011-03-26 09:01:37Z

I suggest the recent paper Discriminative Clustering by Regularized Information Maximization (and references therein). Specifically, Section 2 talks about class balance and cluster assumption.
