2
$\begingroup$

I'm applying k-means clustering to processes running on machines.

Dataset sample :

    machine name, process
    m1,java
    m2,tomcat
    m1,word
    m3,excel

Build a matrix of associated counts :

       java,tomcat,word,excel
    m1,1,0,1,0
    m2,0,1,0,0
    m3,0,0,0,1
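For concreteness, here is a minimal sketch of how such a count matrix could be built from the (machine, process) pairs. The function name `build_count_matrix` and the column ordering (first-seen order) are assumptions made for illustration, not part of the original setup.

```python
from collections import Counter

# Sample (machine, process) pairs from the dataset shown above.
pairs = [("m1", "java"), ("m2", "tomcat"), ("m1", "word"), ("m3", "excel")]

def build_count_matrix(pairs):
    """Return (machines, processes, rows); rows[i][j] counts process j on machine i."""
    machines = sorted({m for m, _ in pairs})
    # Keep processes in first-seen order so the columns match the sample matrix.
    processes = list(dict.fromkeys(p for _, p in pairs))
    counts = Counter(pairs)  # a missing (machine, process) key counts as 0
    rows = [[counts[(m, p)] for p in processes] for m in machines]
    return machines, processes, rows

machines, processes, rows = build_count_matrix(pairs)
for name, row in zip(machines, rows):
    print(name, row)
```

Each row is one machine's vector, so most entries are zero as soon as the number of distinct processes grows.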

I then run k-means against this dataset (I have tried both Euclidean and Manhattan distance functions). The dataset is extremely sparse, which I think is why the generated clusters do not make much sense: many machines get grouped into the same cluster (as they are very similar).
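To make the setup reproducible, the following is a rough sketch of plain Lloyd's k-means on the sample matrix, used only to inspect the resulting cluster sizes. It is not the exact implementation used here: it covers Euclidean distance only, and seeding the centroids with the first `k` rows is an assumption chosen to keep the run deterministic.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

rows = [[1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]  # m1, m2, m3
labels, centroids = kmeans(rows, k=2)
sizes = [labels.count(c) for c in range(2)]
print(labels, sizes)
```

Printing the sizes after each run is a quick way to see how unbalanced the clusters are for a given `k`.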

How can I obtain clusters that each contain an approximately equal number of points? Or is this perhaps not possible because of the sparseness of the data, so that I should instead cluster on different attributes of the dataset?

$\endgroup$
1
  • $\begingroup$ How many attributes are you considering in your dataset? And how many examples? $\endgroup$ Commented Apr 10, 2015 at 13:14

1 Answer

4
$\begingroup$

Cluster analysis is not supposed to produce partitions of equal size. It is meant to discover structure in the data.

If the majority of objects are highly similar, then this majority is supposed to end up in one large cluster.

Consider the extreme case where all your data is identical. Any clustering algorithm that produces more than one cluster has failed, in my opinion...
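As a small illustration of that extreme case, the sketch below assigns five identical points to two centroids by nearest distance. The helper `nearest` and the tie-breaking rule (lowest centroid index wins) are assumptions for the example; a particular library may break ties differently.

```python
from collections import Counter

def nearest(point, centroids):
    # Squared Euclidean distance; ties go to the lowest centroid index (an assumption).
    dists = [sum((x - y) ** 2 for x, y in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

points = [[1, 0, 1, 0]] * 5           # five identical machines
centroids = [points[0], points[1]]    # any choice of seeds is the same vector
labels = [nearest(p, centroids) for p in points]
cluster_sizes = Counter(labels)
print(cluster_sizes)
```

Every point is at distance zero from every centroid, so all of them land in a single cluster and the second one stays empty.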

So you may be using the wrong class of algorithms for your problem.

$\endgroup$
