2
$\begingroup$

I am trying to compare different clustering algorithms on my text data. I first computed the tf-idf matrix and used it to build a cosine distance matrix (1 minus the cosine similarity). I then used this distance matrix for K-means and hierarchical clustering (Ward linkage, with a dendrogram). I would now like to use the same distance matrix for mean-shift, DBSCAN, and OPTICS.

Below is the part of the code showing the distance matrix.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # define vectorizer parameters
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                       min_df=0.2, stop_words='english',
                                       use_idf=True, tokenizer=tokenize_and_stem,
                                       ngram_range=(1,3))

    %time tfidf_matrix = tfidf_vectorizer.fit_transform(Strategies)  # fit the vectorizer to synopses

    terms = tfidf_vectorizer.get_feature_names()

    from sklearn.metrics.pairwise import cosine_similarity
    dist = 1 - cosine_similarity(tfidf_matrix)
    print(dist)

I am new to both Python and clustering. I found code for K-means and hierarchical clustering and tried to understand it, but I cannot adapt it to the other clustering algorithms. A simple explanation of each algorithm, and of how this distance matrix can be used with it (if that is possible), would be very helpful.

Thanks in advance!

$\endgroup$

1 Answer

3
$\begingroup$

Several scikit-learn clustering algorithms can be fit using cosine distances:

    from sklearn.datasets import load_iris
    from sklearn.cluster import DBSCAN, OPTICS

    # Define sample data
    iris = load_iris()
    X = iris.data

    # List clustering algorithms
    algorithms = [DBSCAN, OPTICS]  # MeanShift does not use a metric

    # Fit each clustering algorithm and store the fitted estimators,
    # keyed by the estimator class
    results = {}
    for algorithm in algorithms:
        results[algorithm] = algorithm(metric='cosine').fit(X)
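If you would rather reuse a distance matrix you have already computed (such as dist in the question), DBSCAN and OPTICS both accept metric='precomputed'. The sketch below is only an illustration under assumptions: the toy docs list is a hypothetical stand-in for the question's Strategies, and the eps and min_samples values are untuned placeholders. The np.clip call guards against tiny negative values from floating-point rounding, which scikit-learn may reject in a precomputed matrix. MeanShift has no metric parameter, so it has to be fit on the feature vectors themselves rather than on dist.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the question's `Strategies`
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "cats sit on mats",
    "stock markets fell sharply today",
    "the stock market fell today",
    "markets and stocks fell",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Cosine distance matrix as in the question, clipped at 0 so that
# floating-point rounding cannot leave small negative entries
dist = np.clip(1 - cosine_similarity(tfidf_matrix), 0, None)

# metric='precomputed' tells the estimator that the input is already a
# square distance matrix; eps and min_samples are illustrative only
dbscan = DBSCAN(eps=0.9, min_samples=2, metric="precomputed").fit(dist)
optics = OPTICS(min_samples=2, metric="precomputed").fit(dist)

# One label per document; -1 marks points treated as noise
print(dbscan.labels_)
print(optics.labels_)
```

The cluster assignments depend heavily on eps and min_samples, so these values would need tuning on real data.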
$\endgroup$
  • $\begingroup$ Thanks for the fast reply, but I am getting an error: NameError: name 'clustering_algorithms' is not defined. Also, what would X be? Where would I use the 'dist' that I calculated in my code? Could you please elaborate a little more? $\endgroup$ Commented Mar 5, 2020 at 3:23
  • $\begingroup$ I had a typo; I fixed it. X is the standard name for a data array in scikit-learn. You don't need dist, use cosine_distances instead. $\endgroup$ Commented Mar 5, 2020 at 6:08
  • $\begingroup$ Does that mean that I should replace X with tfidf_matrix (as visible from my code above)? When I did that I again got an error: TypeError: __init__() got an unexpected keyword argument 'metric'. $\endgroup$ Commented Mar 5, 2020 at 7:21
  • $\begingroup$ Sorry for my naive questions. $\endgroup$ Commented Mar 5, 2020 at 7:26
  • $\begingroup$ I got ValueError: Expected 2D array, got 1D array instead while working with DBSCAN; changing metric=cosine_distances to metric='cosine' fixed it. $\endgroup$ Commented May 27, 2021 at 5:04
