
I plot the elbow method to find the appropriate number of KMeans clusters when I am using Python and sklearn, and I want to do the same when I'm working in PySpark. I am aware that PySpark has limited functionality due to Spark's distributed nature, but is there a way to get this number?

I am using the following code to plot the elbow:

```python
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```


3 Answers


I did it another way: calculate the cost for each k with Spark ML, store the results in a Python list, and then plot it.

```python
import numpy as np
import pandas as pd
import pylab as pl
from pyspark.ml.clustering import KMeans

# Calculate the cost for each k
cost = np.zeros(10)
for k in range(2, 10):
    kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol('features')
    model = kmeans.fit(df)
    cost[k] = model.summary.trainingCost

# Plot the cost
df_cost = pd.DataFrame(cost[2:])
df_cost.columns = ["cost"]
df_cost.insert(0, 'cluster', range(2, 10))

pl.plot(df_cost.cluster, df_cost.cost)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
```
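Rather than eyeballing the plot, the elbow can also be located programmatically. A minimal pure-Python sketch (the helper name `find_elbow` is mine, not part of any library) picks the k whose point lies farthest from the straight line joining the first and last points of the cost curve:

```python
def find_elbow(ks, costs):
    """Return the k whose (k, cost) point is farthest from the straight
    line joining the first and last points of the curve."""
    x1, y1 = ks[0], costs[0]
    x2, y2 = ks[-1], costs[-1]
    # Line through the endpoints in implicit form: a*x + b*y + c = 0
    a, b = y2 - y1, x1 - x2
    c = x2 * y1 - x1 * y2
    norm = (a * a + b * b) ** 0.5
    # Perpendicular distance of each point to that line
    distances = [abs(a * x + b * y + c) / norm for x, y in zip(ks, costs)]
    return ks[distances.index(max(distances))]
```

This "max distance to chord" heuristic is one common way to automate the elbow choice; it works on the same `cost` list produced above.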



PySpark is not the right tool to plot an elbow curve directly: to plot a chart, the data must be collected into a Pandas DataFrame, which is not possible in my case because of the massive amount of data. An alternative is silhouette analysis, as below:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Keep changing the number of clusters and re-calculate
kmeans = KMeans().setK(6).setSeed(1)
model = kmeans.fit(dataset.select('features'))
predictions = model.transform(dataset)

evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
```

Or evaluate the clustering by computing the Within Set Sum of Squared Errors (WSSSE), which is explained here.



I think the last answer is not completely correct; the first answer, however, is. Looking at the documentation and source code of pyspark.ml.clustering, `model.summary.trainingCost` is the PySpark equivalent of sklearn's inertia. In the link you can find the text:

This is equivalent to sklearn's inertia.

The silhouette score is given by the ClusteringEvaluator class of pyspark.ml.evaluation: see this link

The Davies-Bouldin index and Calinski-Harabasz index from sklearn are not yet implemented in PySpark. However, there are some community-suggested implementations of them, for example for the Davies-Bouldin index.
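For illustration, here is a minimal pure-Python Davies-Bouldin computation (my own sketch, not one of the suggested implementations; intended for predictions collected or sampled to the driver, since the index itself is not distributed). Lower values indicate better-separated clusters:

```python
import math
from collections import defaultdict

def davies_bouldin(points, labels):
    """Davies-Bouldin index for 2+ clusters; lower means better separation."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    # Per-cluster centroid and mean distance to centroid (scatter)
    centroids = {l: [sum(dim) / len(pts) for dim in zip(*pts)]
                 for l, pts in clusters.items()}
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scatter = {l: sum(dist(p, centroids[l]) for p in pts) / len(pts)
               for l, pts in clusters.items()}
    ids = list(clusters)
    # For each cluster, take the worst (largest) similarity ratio
    # to any other cluster, then average over clusters
    return sum(
        max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in ids if j != i)
        for i in ids) / len(ids)
```

On a massive dataset you would apply this to a sample of `model.transform(dataset)` rather than the full data, for the same collect-to-driver reason given in the second answer.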

