How to use Cosine Distance matrix for Clustering algorithms like mean-shift, DBSCAN, and optics?

Question

I am trying to compare different clustering algorithms on my text data. I first computed the tf-idf matrix and derived a cosine distance matrix from it (1 minus the cosine similarity). I then used this distance matrix for K-means and hierarchical clustering (Ward linkage with a dendrogram). I would also like to use it for mean-shift, DBSCAN, and OPTICS.

Below is the part of the code that computes the distance matrix.

    from sklearn.feature_extraction.text import TfidfVectorizer

    #define vectorizer parameters
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                       min_df=0.2, stop_words='english',
                                       use_idf=True, tokenizer=tokenize_and_stem,
                                       ngram_range=(1,3))

    %time tfidf_matrix = tfidf_vectorizer.fit_transform(Strategies) #fit the vectorizer to synopses

    terms = tfidf_vectorizer.get_feature_names()

    from sklearn.metrics.pairwise import cosine_similarity
    dist = 1 - cosine_similarity(tfidf_matrix)
    print(dist)

I am new to both Python and clustering. I found code for K-means and hierarchical clustering and tried to understand it, but I cannot adapt it to other clustering algorithms. It would be very helpful to get a simple explanation of each clustering algorithm and of how this distance matrix can be used (if at all) with each of them.

Thanks in advance!

Brian Spiering · Accepted Answer · 2022-01-19 22:56:59Z

3

Several scikit-learn clustering algorithms can be fit using cosine distances:

    from sklearn.datasets import load_iris
    from sklearn.cluster import DBSCAN, OPTICS

    # Define sample data
    iris = load_iris()
    X = iris.data

    # List clustering algorithms (MeanShift does not use a metric)
    algorithms = [DBSCAN, OPTICS]

    # Fit each clustering algorithm and store the fitted estimators
    results = {}
    for algorithm in algorithms:
        results[algorithm] = algorithm(metric='cosine').fit(X)
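If you would rather reuse a distance matrix you have already computed, DBSCAN and OPTICS also accept metric='precomputed'. The following is a minimal sketch under stated assumptions, not a tuned pipeline: the toy corpus `docs` is a hypothetical stand-in for `Strategies`, and the `eps`, `min_samples`, and `bandwidth` values are illustrative only. It uses `cosine_distances` rather than `1 - cosine_similarity` because the former clips rounding error that could otherwise leave tiny negative entries. MeanShift takes no metric, so the sketch runs it on the L2-normalized tf-idf rows instead, where Euclidean distance is a monotonic function of cosine distance.

```python
from sklearn.cluster import DBSCAN, OPTICS, MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical toy corpus standing in for the asker's `Strategies`.
docs = [
    "the cat sat on the mat",
    "a cat lay on the mat",
    "the cat and the kitten sat on the mat",
    "stock markets fell sharply today",
    "stock markets rose sharply today",
    "markets and stock prices fell today",
]

# TfidfVectorizer L2-normalizes each row by default (norm='l2').
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# cosine_distances computes 1 - cosine_similarity and clips the result
# to [0, 2], so the matrix contains no negative entries.
dist = cosine_distances(tfidf_matrix)

# DBSCAN and OPTICS accept the square distance matrix directly.
# eps and min_samples are illustrative values, not tuned ones.
dbscan_labels = DBSCAN(eps=0.7, min_samples=2, metric="precomputed").fit_predict(dist)
optics_labels = OPTICS(min_samples=2, metric="precomputed").fit_predict(dist)

# MeanShift has no metric parameter and needs a dense feature array.
# On unit-length rows, squared Euclidean distance equals 2 * cosine
# distance, so Euclidean neighborhoods follow the cosine ordering.
dense = tfidf_matrix.toarray()
meanshift_labels = MeanShift(bandwidth=1.0).fit_predict(dense)

print(dbscan_labels, optics_labels, meanshift_labels)
```

Each label array has one entry per document, and DBSCAN and OPTICS mark noise points with -1. Note that a dense distance matrix grows quadratically with the number of documents, and `toarray()` can be memory-hungry for a large vocabulary, so this approach suits small to medium corpora.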

edited Jan 19, 2022 at 22:56

answered Mar 5, 2020 at 2:33


Thanks for the fast reply, but I am getting an error: NameError: name 'clustering_algorithms' is not defined. Also, what would X be? Where would I use the 'dist' that I calculated in my code? Could you please elaborate a little more?
– Piyush Ghasiya, commented Mar 5, 2020 at 3:23

I had a typo; I fixed it. X is the standard name for a data array in scikit-learn. You don't need dist; use cosine_distances instead.
– Brian Spiering, commented Mar 5, 2020 at 6:08

Does that mean that I should replace X with tfidf_matrix (as shown in my code above)? When I did that, I got another error: TypeError: __init__() got an unexpected keyword argument 'metric'.
– Piyush Ghasiya, commented Mar 5, 2020 at 7:21

Sorry for my naive questions.
– Piyush Ghasiya, commented Mar 5, 2020 at 7:26

I got ValueError: Expected 2D array, got 1D array instead while working with DBSCAN. Changing metric=cosine_distances to metric='cosine' worked.
– hafiz031, commented May 27, 2021 at 5:04
