I have a dataset which has two columns:
    title      price
    sentence1  12
    sentence2  13

I have used doc2vec to convert the sentences into vectors of size 100, as below:
    # Imports assumed from the names used below
    import multiprocessing

    import gensim
    from gensim.models import Doc2Vec
    from sklearn import utils
    from tqdm import tqdm

    LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

    # Tag each title with its row index
    all_content = []
    j = 0
    for title in query_result['title_clean'].values:
        all_content.append(LabeledSentence1(title, [j]))
        j += 1
    print("Number of texts processed: ", j)

    cores = multiprocessing.cpu_count()
    d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2, sample=0, workers=cores, alpha=0.025, min_alpha=0.001)
    d2v_model.build_vocab([x for x in tqdm(all_content)])
    all_content = utils.shuffle(all_content)
    d2v_model.train(all_content, total_examples=len(all_content), epochs=30)

So d2v_model.docvecs.doctag_syn0 gives me the vectors for all the sentences.
I now want to cluster the data with DBSCAN, but because I also have the numeric price column, I am having trouble fitting the final data to the model. My problem is similar to one described on Stack Overflow: one of my columns holds a 100-element array in each row, and the other column is a plain number. So when I run DBSCAN on the data, I get the same error.
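To illustrate the layout I mean, here is a small sketch with made-up data. The names df, vec and raw are hypothetical and only stand in for my real frame; the point is that converting such a frame to an array gives an object array with two columns rather than a numeric matrix with 101 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows, each with a 100-dimensional vector and a price.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "vec": [rng.normal(size=100) for _ in range(3)],
    "price": [12.0, 13.0, 11.5],
})

# Each cell of "vec" is a whole array, so the result is not a plain numeric matrix.
raw = df.to_numpy()
print(raw.shape)  # (3, 2)
print(raw.dtype)  # object
```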
Is there a smart way to handle such cases, that is, to combine doc2vec output with other numeric columns to prepare the data for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:
    clf = DBSCAN(eps=0.5, min_samples=10)
    X = clf.fit(both_numeric_categical_columns)
    labels = clf.labels_.tolist()
    cluster1 = query_result_mini.copy()
    cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)
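To make the desired input concrete, here is a rough sketch of what I imagine the combined matrix could look like. It is only an assumption on my part, not something I know to be the right approach: the document vectors and the price are stacked side by side into one 2-D numeric array, with the price standardized first so that its scale does not dominate the distance. The names doc_vectors, prices and combined are hypothetical, and random data stands in for the real doc2vec output.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(50, 100))  # stand-in for the doc2vec vectors
prices = rng.uniform(5, 50, size=50)      # stand-in for the price column

# Standardize the price (an assumption; other scalings are possible).
price_scaled = StandardScaler().fit_transform(prices.reshape(-1, 1))

# One row per document: 100 vector components followed by the scaled price.
combined = np.hstack([doc_vectors, price_scaled])

clf = DBSCAN(eps=0.5, min_samples=10)
labels = clf.fit_predict(combined)  # one cluster label per row, -1 means noise
```

With an array like combined, DBSCAN at least accepts the input. The eps value would still need tuning for 101-dimensional data, and whether the price should be weighted differently from the text vectors is an open question.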