I have several user names and their salaries, and I need to cluster the users based on their salaries. I am using KMeans clustering, and this is my code:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import LabelEncoder
    import pandas as pd

    le = LabelEncoder()
    data = pd.read_csv('kmeans.data', header=None, names=['user', 'salary'])

    # Numerical conversion
    data['user'] = le.fit_transform(data['user'])

    km = KMeans(n_clusters=4, random_state=10, n_init=10, max_iter=500)
    km.fit(data)

    data['labels'] = le.inverse_transform(data['user'])
    data['cluster'] = km.labels_
    print(data)

But my results are bad: there are a lot of overlapping salaries across the clusters.
Is there anything wrong with the code? How can I improve the results?
Or is clustering not the right approach here? If it is, how can I cluster users based only on salary?

    km.fit(data['salary'])

EDIT:
I figured out a way to solve my problem using numpy.reshape:

    km.fit(data['salary'].reshape(-1,1))
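
For anyone landing here later, below is a minimal sketch of the same idea: fit KMeans on the salary column alone, passed as a 2-D input. The overlap in my first attempt most likely came from fitting on the whole frame, so the label-encoded `user` column was also treated as a feature. As far as I know, newer pandas versions no longer provide `Series.reshape`, so this sketch selects the column with double brackets instead. The sample values are made up and only stand in for `kmeans.data`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample standing in for 'kmeans.data' (values are made up).
data = pd.DataFrame({
    'user': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'salary': [1000, 1100, 5000, 5200, 9000, 9100, 20000, 20500],
})

# Double brackets return a one-column DataFrame of shape (n_samples, 1),
# so salary is the only feature and no reshape is needed.
X = data[['salary']]

km = KMeans(n_clusters=4, random_state=10, n_init=10, max_iter=500)
data['cluster'] = km.fit_predict(X)

print(data)
```

The `user` column stays as plain text here, so the `LabelEncoder` step is not needed when the names are not used as a feature.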