
I have a dataset which has two columns:

title        price
sentence1    12
sentence2    13

I have used doc2vec to convert the sentences into vectors of size 100 as below:

import multiprocessing

import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn import utils
from tqdm import tqdm

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

# Tag every title with a running integer id
# (TaggedDocument expects a list of tokens as its first argument,
# so split first if title_clean holds raw strings).
all_content = []
j = 0
for title in query_result['title_clean'].values:
    all_content.append(LabeledSentence1(title, [j]))
    j += 1
print("Number of texts processed: ", j)

cores = multiprocessing.cpu_count()
d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2,
                    sample=0, workers=cores, alpha=0.025, min_alpha=0.001)
d2v_model.build_vocab([x for x in tqdm(all_content)])
all_content = utils.shuffle(all_content)
d2v_model.train(all_content, total_examples=len(all_content), epochs=30)

So d2v_model.docvecs.doctag_syn0 gives me the vectors of all the sentences (one 100-dimensional vector per row).
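For reference, a minimal sketch of pulling those vectors out as a single NumPy matrix; doctag_syn0 is the older gensim attribute name, and on gensim 4.x the same array should be exposed as d2v_model.dv.vectors, so adjust to whichever your version provides:

import numpy as np

# All document vectors as one (n_docs, 100) matrix.
# On gensim 4.x use d2v_model.dv.vectors instead of docvecs.doctag_syn0.
doc_vectors = np.asarray(d2v_model.docvecs.doctag_syn0)
print(doc_vectors.shape)  # expected: (number of titles, 100)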

I now want to perform clustering with DBSCAN, but since I also have the numeric price column I am having trouble fitting the final data to the model. I have a similar problem to one described on Stack Overflow: one of my columns holds an array of size 100 in each row, while the other column is just a number, so when I run DBSCAN on the data I get the same error.

Is there a smart way to handle such cases, i.e. to combine the doc2vec output with the other numeric column and prepare it for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:

from sklearn.cluster import DBSCAN

clf = DBSCAN(eps=0.5, min_samples=10)
X = clf.fit(both_numeric_categical_columns)
labels = clf.labels_.tolist()

cluster1 = query_result_mini.copy()
cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)
  • DBSCAN is a distance-based clustering technique. The problem with concatenating both variables is how you normalise the price value. It may be that you just stack the features together and it works. A safer way, in my opinion, would be to do some feature extraction on the sentence embeddings (e.g. PCA) and use its output together with the price in a hierarchical clustering. (Commented Dec 9, 2022 at 14:20)
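A minimal sketch of that suggestion, assuming doc_vectors is the (n_docs, 100) embedding matrix from above and that the prices live in the same query_result frame; the component count, cluster count and linkage below are arbitrary placeholders, not values the commenter specified:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduce the 100-dimensional embeddings to a handful of components.
reduced = PCA(n_components=10).fit_transform(doc_vectors)

# Put the price on a comparable scale before concatenating.
price_scaled = StandardScaler().fit_transform(query_result[['price']].values)

features = np.hstack([reduced, price_scaled])

# Hierarchical (agglomerative) clustering on the combined features.
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster_labels = hc.fit_predict(features)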

1 Answer


You did not mention which package you are using. If you are using scikit-learn, sklearn.pipeline.FeatureUnion concatenates the results of multiple transformer objects.

Something like this:

from sklearn.cluster import DBSCAN
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('doc2vec', d2v_model, ['sentence1', 'sentence2']),
        ('numeric', StandardScaler(), ['price'])
    ])),
    ('cluster', DBSCAN())
])
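Note that the snippet above is only an outline: FeatureUnion expects (name, transformer) pairs, and a trained gensim Doc2Vec model is not a scikit-learn transformer, so it cannot be dropped in directly. A more direct route, sketched here under the assumption that the inferred vectors and a price column in query_result are available (those names are placeholders for your actual data), is to build the combined matrix by hand and pass it to DBSCAN:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Doc2Vec embeddings, shape (n_docs, 100).
# On gensim 4.x the same array is d2v_model.dv.vectors.
doc_vectors = np.asarray(d2v_model.docvecs.doctag_syn0)

# Numeric price column, scaled so one large-valued feature does not
# dominate the Euclidean distances DBSCAN relies on.
prices = query_result['price'].to_numpy().reshape(-1, 1)
price_scaled = StandardScaler().fit_transform(prices)

# 100 embedding dimensions + 1 scaled price column -> (n_docs, 101).
both_numeric_categical_columns = np.hstack([doc_vectors, price_scaled])

clf = DBSCAN(eps=0.5, min_samples=10)
cluster_labels = clf.fit_predict(both_numeric_categical_columns)

Whether eps=0.5 is a sensible radius for 101-dimensional data is a separate question; it usually needs tuning once all the features are on a common scale.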
