I have approached text clustering using HDBSCAN based on this article which describes how to do this in R. I've re-written this in Python using this library. The approach is to first calculate TF-IDF vectors for the documents, then calculate a distance matrix for all vector pairs and fit the HDBSCAN clusterer based on the distance matrix.
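A minimal sketch of this pipeline, for context. It assumes scikit-learn for the TF-IDF vectors and cosine distances, and the `hdbscan` package for clustering. The cosine metric, the `min_cluster_size` value, the helper names and the toy documents are illustrative assumptions, not my exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy documents standing in for the 5000-sample subset (illustrative only).
documents = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock prices fell sharply today",
    "the market rallied after the report",
]


def build_distance_matrix(docs):
    """Return the fitted vectorizer and the pairwise cosine distance matrix."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document
    distances = pairwise_distances(tfidf, metric="cosine")  # dense, shape (n_docs, n_docs)
    return vectorizer, distances


def fit_clusterer(distances, min_cluster_size=5):
    """Fit HDBSCAN on a precomputed distance matrix (min_cluster_size is an assumed value)."""
    import hdbscan  # imported lazily so the helper above runs without the package

    clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=min_cluster_size)
    clusterer.fit(distances)
    return clusterer


vectorizer, distance_matrix = build_distance_matrix(documents)
# On the real subset one would then call:
#   clusterer = fit_clusterer(distance_matrix)
#   labels = clusterer.labels_   # -1 marks points treated as noise
```

The dense matrix grows quadratically with the number of documents, which is one reason for limiting the fit to a subset.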
I have fitted the clusterer on a subset of my documents, since the algorithm is slow and my whole set is fairly large. I've limited it to 5000 samples. The clusters that HDBSCAN has found are acceptable. I will fine-tune them later.
Now, I would like to create a Python method that takes a new document, one that was not part of the original training set, and returns the label of the cluster the new document most likely belongs to.
I have approached this task by trying to use approximate_predict(). This is where I have a question.
I suspect the process for calculating the cluster label for the new document looks like this:
- Add the new document to the set of 5000 samples that I've used for the clusterer training
- Calculate the distance matrix for the 5001 samples (the matrix will be bigger than the matrix used for clusterer fitting)
- Take the last row of the resulting matrix (it should correspond to my new text) and remove its last element (the distance of the new document to itself, which we can ignore). Removing the last element makes the length of this vector match the dimension of the matrix used to fit the clusterer. Otherwise the clusterer would complain.
- Use the approximate_predict() method and pass the vector obtained in step 3 to get the cluster label.
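The first three steps could perhaps be sketched as below, without rebuilding the whole 5001 x 5001 matrix. This is an assumption on my part: it reuses the TF-IDF vectorizer fitted on the training subset (so the IDF weights are not updated with the new document, unlike a full recomputation) and computes only the distances from the new document to the training documents. The names `train_docs` and `distance_row` are hypothetical, and the toy documents stand in for my 5000 samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy training documents standing in for the 5000-sample subset (illustrative only).
train_docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock prices fell sharply today",
    "the market rallied after the report",
]

# Vectorizer and vectors as they would exist after fitting the clusterer.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_docs)


def distance_row(new_doc):
    """Cosine distances from one new document to every training document."""
    new_vector = vectorizer.transform([new_doc])  # reuse the fitted vocabulary and IDF weights
    # Shape (1, n_train): the new document's row, without its distance to itself.
    return pairwise_distances(new_vector, train_vectors, metric="cosine")


row = distance_row("the dog chased the cat")
# Step 4 would pass `row` to approximate_predict(); whether that is valid for a
# clusterer fitted on a precomputed distance matrix is what I am unsure about.
```

Computing one row costs a single pass over the training vectors, so the per-document work stays linear in the number of training samples instead of quadratic.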
My questions are:
- Is my approach correct? (it seems overcomplicated but I don't know what it should look like)
- Will it perform well in production when I start passing lots of documents to this method? (the processing needed before I actually call approximate_predict() seems huge)
- How can it be done differently?
- Would it be better not to use the approximate_predict() method, but instead to take the cluster labels that HDBSCAN calculated for my 5000 samples and use them as targets for supervised learning, training a classifier to then classify new documents?
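To make the last option concrete, here is a rough sketch of what I have in mind. It assumes scikit-learn; the choice of a 1-nearest-neighbour classifier with cosine distance is only an illustration, and `cluster_labels` is a hypothetical array standing in for the labels HDBSCAN assigned to the training documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy training documents (illustrative only).
train_docs = [
    "the cat sat on the mat",
    "a cat and a dog played outside",
    "stock prices fell sharply today",
    "the market rallied after the report",
    "completely unrelated gibberish text",
]
# Hypothetical labels standing in for clusterer.labels_; -1 marks noise.
cluster_labels = np.array([0, 0, 1, 1, -1])

# Drop noise points so the classifier only learns real clusters (a design choice).
keep = cluster_labels != -1
docs_kept = [doc for doc, flag in zip(train_docs, keep) if flag]
labels_kept = cluster_labels[keep]

# TF-IDF followed by a nearest-neighbour classifier using cosine distance.
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=1, metric="cosine", algorithm="brute"),
)
model.fit(docs_kept, labels_kept)

predicted = model.predict(["the dog chased the cat"])
```

With the noise points dropped, such a classifier can never report a new document as noise, which may or may not be acceptable.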
I am looking forward to your answers and I will appreciate them a lot. I am completely new to this and I don't have anybody around me to discuss this.