K Means giving poor results

Question

I have several user names and their salaries, and I need to cluster the users based on their salaries. I am using KMeans clustering, and the following is my code:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import LabelEncoder
    import pandas as pd

    le = LabelEncoder()
    data = pd.read_csv('kmeans.data', header=None, names=['user', 'salary'])

    # Numerical conversion
    data['user'] = le.fit_transform(data['user'])

    km = KMeans(n_clusters=4, random_state=10, n_init=10, max_iter=500)
    km.fit(data)

    data['labels'] = le.inverse_transform(data['user'])
    data['cluster'] = km.labels_
    print(data)

But my results are bad, and there are a lot of overlapping salaries between clusters.

Is there anything wrong with the code? How can I improve the results?

Or is clustering not the right approach here? If so, how can I cluster users based only on salary?

 km.fit(data['salary'])

EDIT:

I figured out a way to solve my problem using numpy.reshape

km.fit(data['salary'].reshape(-1,1))
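A minimal sketch of fitting on salary alone, assuming a recent pandas and scikit-learn. The user names and salary values below are made up for illustration and stand in for 'kmeans.data'. Selecting with double brackets, `data[['salary']]`, returns a one-column DataFrame that already has the 2-D shape `fit` expects; `data['salary'].values.reshape(-1, 1)` should be the NumPy equivalent of the reshape above, since newer pandas versions no longer offer `reshape` on a Series.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data standing in for 'kmeans.data'
data = pd.DataFrame({
    "user": ["ann", "bob", "cat", "dan", "eve", "fay"],
    "salary": [1000, 1200, 5000, 5200, 20000, 21000],
})

# Double brackets keep the 2-D shape (n_samples, 1) that KMeans expects
X = data[["salary"]]

km = KMeans(n_clusters=3, random_state=10, n_init=10, max_iter=500)
data["cluster"] = km.fit_predict(X)

print(data.sort_values("salary"))
```

The user column is left out of `X` entirely, so it only serves as a label for reading the result.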

Has QUIT--Anony-Mousse · Accepted Answer · 2016-03-16 19:56:45Z

5

K-means is based on the assumption that the data is "translation invariant" (more precisely: variance makes this assumption, and k-means is variance minimization).

In other words, it assumes that a difference of d=(x-y)^2 is of the same importance everywhere. Because of this, k-means does not work on skewed data. Furthermore, because of the square, it is sensitive to outliers and other extreme values.

For salaries and other monetary values, this usually does not hold. The difference between \$0 and \$1000 is massive, and not the same as a salary difference of \$100000 to \$101000. Salaries are usually rather skewed, and you often have some extreme values.
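The salary example above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the log scale is one common way to express relative differences and is an assumption here, not something the answer proposes, and the outlier list is made-up data.

```python
import math
import statistics

def squared_diff(x, y):
    """Squared difference, the quantity k-means sums up in one dimension."""
    return (x - y) ** 2

low = (0, 1000)
high = (100000, 101000)

# k-means sees both gaps as identical
print(squared_diff(*low), squared_diff(*high))

def log_gap(x, y):
    """Gap on a log scale; log1p is used so that a salary of 0 is allowed."""
    return abs(math.log1p(y) - math.log1p(x))

# On a log scale the first gap is far larger than the second
print(log_gap(*low), log_gap(*high))

# One extreme value drags the mean (and so a k-means center) far from the bulk
salaries = [30000, 32000, 35000, 31000, 1000000]  # hypothetical values
print(statistics.mean(salaries), statistics.median(salaries))
```

Both squared differences come out equal, while the log-scale gaps differ by more than two orders of magnitude, which is the mismatch the answer describes.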

Converting the "user" attribute to a numerical value is outright statistical nonsense. What's variance worth in this attribute? K-means is for continuous numerical data only, and converting the data does not change its nature, only its encoding. It's still inappropriate.


Do you mind explaining what "translation invariant" means? Do you have a proof or any reference in the literature for the claim that k-means does not work on skewed data?
– Tu N., Mar 17, 2016 at 3:31

Yes, I got your point, so this is not a typical clustering problem, right? Then the best way is to sort salaries based on score and divide them into 'K' categories, right? Also, is there a way to calculate the best number for 'K'? Suggestions please.
– Sreejithc321, Mar 17, 2016 at 6:36

@TuN. I already gave an example... Beware, there are two kinds of skewness. Some are fine, e.g. if your data is 1000 objects from N(1, 1) and 100 from N(10, 1), then this will be considered a skewed variable, but it is probably fine for k-means because the clusters are well separated and not skewed themselves.
– Has QUIT--Anony-Mousse, Mar 17, 2016 at 6:51

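The mixture described in this comment can be tried out with a small sketch. It uses a plain one-dimensional Lloyd's algorithm written for illustration (not scikit-learn's implementation), and the seed and the choice of starting centers at the minimum and maximum are assumptions.

```python
import random

def kmeans_1d(values, centers, iterations=50):
    """Plain Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: (v - centers[i]) ** 2)
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

rng = random.Random(0)
# 1000 objects from N(1, 1) and 100 from N(10, 1), as in the comment
values = [rng.gauss(1, 1) for _ in range(1000)] + [rng.gauss(10, 1) for _ in range(100)]

centers = sorted(kmeans_1d(values, centers=[min(values), max(values)]))
print(centers)
```

The two centers should land close to 1 and 10, even though the combined variable is skewed.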
@Sreejithc321 I don't know if you have a "typical clustering problem" because you have not stated the problem. What are valid answers to your problem, and what makes one answer better than another? Until you specify this, random assignment is a valid solution to your problem.
– Has QUIT--Anony-Mousse, Mar 17, 2016 at 6:54

For 1-dimensional and 2-dimensional data, I suggest you also visualize the data (and any result) and discuss what is interesting, bad, difficult, or desired on the plot. Data science is about telling a story, and images help a lot there.
– Has QUIT--Anony-Mousse, Mar 17, 2016 at 6:56


dmb · Accepted Answer · 2016-03-17 01:54:02Z

This is not a 'clustering' problem so much as it is an 'interval' problem, since you only have 1 dimension.

You can use an iterative process like Jenks natural breaks optimization to figure out how large to make your intervals.

As other posters have said, do not use names as a clustering dimension unless you really think that variations in the letters of a name are meaningful in some way (do you really think all the Dans are paid similarly?).
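The interval idea can be sketched in plain Python. Note the swap: this is an exact dynamic-programming search over contiguous splits of the sorted values, minimizing the within-group sum of squared deviations (the criterion Jenks natural breaks optimizes), rather than the iterative procedure mentioned above. The function name `natural_breaks` and the salary values are hypothetical.

```python
def natural_breaks(values, k):
    """Split sorted values into k contiguous groups minimizing the
    within-group sum of squared deviations."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations from the mean for xs[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[g][j]: best cost of splitting xs[:j] into g groups
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = cost[g - 1][i] + ssd(i, j)
                if c < cost[g][j]:
                    cost[g][j], cut[g][j] = c, i

    # Walk the cut table backwards to recover the groups
    groups, j = [], n
    for g in range(k, 0, -1):
        i = cut[g][j]
        groups.append(xs[i:j])
        j = i
    return groups[::-1]

salaries = [1000, 1200, 1100, 5000, 5200, 20000, 21000, 5100]  # made-up values
for group in natural_breaks(salaries, 3):
    print(group[0], "to", group[-1], group)
```

The triple loop is cubic in the worst case, which is fine for a few thousand salaries; for larger data a dedicated library implementation would be the more practical choice.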

pgalilea · Accepted Answer · 2016-03-16 13:44:02Z

0

I think the problem here is using the name as a dimension. You can, but you have to use a more robust distance metric between names (strings). As far as I know, LabelEncoder just assigns an integer based on each element's position in the sorted list of unique values. You could try a different hashing (string to int) or the Levenshtein distance as a distance metric.
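For reference, the Levenshtein distance mentioned here can be sketched in a few lines of plain Python. This is a textbook dynamic-programming version written for illustration, and the example names are made up. Keep in mind that k-means needs vectors it can average, so a string distance like this would have to go into a method that accepts pairwise distances instead.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("dan", "dana"))
print(levenshtein("kitten", "sitting"))
```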


How can I fit only salary? km.fit(data['salary'])?
– Sreejithc321, Mar 16, 2016 at 13:56

You don't have any other attribute? stats.stackexchange.com/questions/40454/…
– pgalilea, Mar 16, 2016 at 14:00

No other attributes.
– Sreejithc321, Mar 16, 2016 at 14:01

You should try another technique instead of clustering; read the link above.
– pgalilea, Mar 16, 2016 at 14:14
