
I plot the elbow method to find the appropriate number of KMeans clusters when I am using Python and sklearn, and I want to do the same when I'm working in PySpark. I am aware that PySpark has limited functionality due to Spark's distributed nature, but is there a way to get this number?

I am using the following code to plot the elbow:

```python
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```


3 Answers


I did it another way: calculate the cost for each k with Spark ML, store the results in a Python list, and then plot it.

```python
import numpy as np
import pandas as pd
import pylab as pl
from pyspark.ml.clustering import KMeans

# Calculate the cost for each k
cost = np.zeros(10)
for k in range(2, 10):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol('features')
    model = kmeans.fit(df)
    cost[k] = model.summary.trainingCost

# Plot the cost
df_cost = pd.DataFrame(cost[2:])
df_cost.columns = ["cost"]
df_cost.insert(0, 'cluster', range(2, 10))

pl.plot(df_cost.cluster, df_cost.cost)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
```
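Rather than eyeballing the plot, the elbow can also be located programmatically. A minimal pure-Python sketch (the helper name `find_elbow` is mine, not part of any library) picks the k whose point lies farthest from the straight line joining the first and last points of the cost curve:

```python
def find_elbow(ks, costs):
    """Return the k whose (k, cost) point is farthest from the straight
    line joining the first and last points of the curve."""
    x1, y1 = ks[0], costs[0]
    x2, y2 = ks[-1], costs[-1]
    # Line through the endpoints in implicit form: a*x + b*y + c = 0
    a, b = y2 - y1, x1 - x2
    c = x2 * y1 - x1 * y2
    norm = (a * a + b * b) ** 0.5
    # Perpendicular distance of each point to that line
    distances = [abs(a * x + b * y + c) / norm for x, y in zip(ks, costs)]
    return ks[distances.index(max(distances))]
```

This "max distance to chord" heuristic is one common way to automate the elbow choice; it works on the same `cost` list produced above.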



PySpark is not the right tool to plot an elbow curve directly: to plot a chart, the data must be collected into a Pandas DataFrame, which is not possible in my case because of the massive amount of data. An alternative is silhouette analysis, as below:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Keep changing the number of clusters and re-calculate
kmeans = KMeans().setK(6).setSeed(1)
model = kmeans.fit(dataset.select('features'))
predictions = model.transform(dataset)

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
```

Or evaluate the clustering by computing the Within Set Sum of Squared Errors (WSSSE), which is explained here.



I think the last answer is not completely correct; the first answer, however, is. Looking at the documentation and source code of pyspark.ml.clustering, `model.summary.trainingCost` is the PySpark equivalent of sklearn's inertia. In the link you can find the text:

This is equivalent to sklearn's inertia.

The silhouette score is given by the ClusteringEvaluator class of pyspark.ml.evaluation: see this link

The Davies-Bouldin index and Calinski-Harabasz index from sklearn are not yet implemented in PySpark. However, there are some community-suggested implementations of them, for example for the Davies-Bouldin index.
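For illustration, here is a minimal pure-Python Davies-Bouldin computation (my own sketch, not one of the suggested implementations; intended for predictions collected or sampled to the driver, since the index itself is not distributed). Lower values indicate better-separated clusters:

```python
import math
from collections import defaultdict

def davies_bouldin(points, labels):
    """Davies-Bouldin index for 2+ clusters; lower means better separation."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    # Per-cluster centroid and mean distance to centroid (scatter)
    centroids = {l: [sum(dim) / len(pts) for dim in zip(*pts)]
                 for l, pts in clusters.items()}
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scatter = {l: sum(dist(p, centroids[l]) for p in pts) / len(pts)
               for l, pts in clusters.items()}
    ids = list(clusters)
    # For each cluster, take the worst (largest) similarity ratio
    # to any other cluster, then average over clusters
    return sum(
        max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in ids if j != i)
        for i in ids) / len(ids)
```

On a massive dataset you would apply this to a sample of `model.transform(dataset)` rather than the full data, for the same collect-to-driver reason given in the second answer.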

