I'm applying k-means clustering to processes running on machines.
Dataset sample:

machine name, process
m1,java
m2,tomcat
m1,word
m3,excel

I build a matrix of associated counts:
java,tomcat,word,excel
m1,1,0,1,0
m2,0,1,0,0
m3,0,0,0,1

I then run k-means against this dataset (I have tried both Euclidean and Manhattan distance functions). The dataset is extremely sparse, which I think is why the generated clusters don't make much sense: many machines get grouped into the same cluster because they are very similar.
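For reference, here is a minimal sketch of how the count matrix above can be built from the sample and how the two distance functions behave on it. It is plain Python for illustration only, not necessarily the code used to produce the clusters; the names `rows`, `matrix`, `euclidean` and `manhattan` are assumptions made up for this example.

```python
from collections import Counter

# Sample (machine, process) pairs from the dataset above.
rows = [("m1", "java"), ("m2", "tomcat"), ("m1", "word"), ("m3", "excel")]

# Keep columns and rows in order of first appearance.
processes = list(dict.fromkeys(p for _, p in rows))
machines = list(dict.fromkeys(m for m, _ in rows))

# Count how often each process appears on each machine.
counts = Counter(rows)
matrix = {m: [counts[(m, p)] for p in processes] for m in machines}


def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute per-column differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


for m in machines:
    print(m, matrix[m])
print(manhattan(matrix["m1"], matrix["m2"]), euclidean(matrix["m2"], matrix["m3"]))
```

With vectors this sparse, any two machines differ only in the few columns where one of them has a process, so pairwise distances take just a handful of small values.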
How can I get clusters that each contain approximately the same number of points? Or is this perhaps not possible because of the sparseness of the data, in which case should I instead try clustering on different attributes of the dataset?