How to properly use approximate_predict() with HDBSCAN clusterer for text clustering (NLP)?

Question

I have approached text clustering using HDBSCAN based on this article which describes how to do this in R. I've re-written this in Python using this library. The approach is to first calculate TF-IDF vectors for the documents, then calculate a distance matrix for all vector pairs and fit the HDBSCAN clusterer based on the distance matrix.

I have fitter the clusterer with a subset of my documents since the algorithm is slow and my whole set is a bit big. I've limited it to 5000 samples. The clusters that HDBSCAN has found are acceptable. I will fine-tune them later.

Now, I would like to create a Python method that would take a new document, not being a part of the original training set, and return the cluster label which the new document seems to belong to.

I have approached this task by trying to use approximate_predict(). This is where I have question.

I suspect the process for calculating the cluster label for the new document look like this:

Add the new document to the set of 5000 samples that I've used for the clusterer training
Calculate the disctance matrix for the 5001 samples (the matrix will be bigger than the matrix used for clusterer fitting)
Take the last row of the resulting matrix (should correspond to my new text) and remove the last element from the resulting vector (it should contain the distance of the new document to itself, which we can ignore). Removing the last element is to make the dimension of the last vector match the dimension of the matrix used to fit the clusterer. Otherwise the clusterer would complain.
Use the approximate_predict() method and pass the vector obtained in step 3 to get the cluster label.

My questions are:

Is my approach correct? (it seems overcomplicated but I don't know what it should look like)
Will it perform well in production when I start passing lots of documents to this method? (the processing needed before I actually call approximate_predict() seems huge)
How can it be done differently?
Would it be better to not use the approximate_predict() method, but instead take the cluster labels that HDBSCAN calculated for my my 5000 samples and use it for supervised learning to train a classifier to then classify new documents?

I am looking forward to your answers and I will appreciate them a lot. I am completely new to this and I don't have anybody around me to discuss this.

Has QUIT--Anony-Mousse · Accepted Answer · 2019-08-08 23:12:51Z

1

Why do you recompute the distance matrix?

Just compute the 1x5000 vector directly for all new points. You can even do this in batches, then feed one row at a time to the predictor.

answered Aug 8, 2019 at 23:12

Has QUIT--Anony-Mousse

8,1541 gold badge16 silver badges31 bronze badges

$\begingroup$ Thank you for your remark. It's very reasonable since now I am throwing away most of the calculations and take just the last vector from the matrix. I wasn't aware there was an alternative to pairwise_distances() available. I'm doing this: tfidf_vectorizer = TfidfVectorizer(use_idf=True, stop_words='english') tfidf_vectors = tfidf_vectorizer.fit_transform(aux_docs) aux_distance_matrix = pairwise_distances(tfidf_vectors, metric='cosine') What Python method should I use to compute the 1x5000 distance vector directly? Are you talking about scipy.spatial.distance.cosine(u, v)? $\endgroup$

Radek
– Radek

2019-08-09 14:29:50 +00:00
Commented Aug 9, 2019 at 14:29
1

$\begingroup$ Do not fit a new vectorizer! that is not going to work. If you check the documentation of pairwise_distances then you'll realize you can use this too. It's all in the documentation... $\endgroup$

Has QUIT--Anony-Mousse
– Has QUIT--Anony-Mousse

2019-08-09 18:00:37 +00:00
Commented Aug 9, 2019 at 18:00
$\begingroup$ Oh, so are you saying I should keep my original fitted vectorizer, just like I keep my fitted clusterer, and then just transform my new text with it? That makes sense. Please confirm if I understood you correctly. As for the documentation for pairwise_distances, I read it but didn't quite understand it yet. I will give it another shot if you say that's where the answer is. Thank you so much for your comments! I can't tell you how I appreciate being able to talk to somebody who actually understands this. PS. what will happen if the new text contains words that the fitted vectorizer did not see? $\endgroup$

Radek
– Radek

2019-08-09 20:58:31 +00:00
Commented Aug 9, 2019 at 20:58

Add a comment |

Stack Exchange Network

How to properly use approximate_predict() with HDBSCAN clusterer for text clustering (NLP)?

1 Answer 1

Hot Network Questions

How to properly use approximate_predict() with HDBSCAN clusterer for text clustering (NLP)?

1 Answer 1

Related

Hot Network Questions