I have a dataset which has two columns:
    title      price
    sentence1  12
    sentence2  13

I have used doc2vec to convert the sentences into vectors of size 100, as below:
    import multiprocessing

    import gensim
    from gensim.models.doc2vec import Doc2Vec
    from sklearn import utils  # assumed source of utils.shuffle
    from tqdm import tqdm

    LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

    # Tag each title with its row index
    all_content = []
    j = 0
    for title in query_result['title_clean'].values:
        all_content.append(LabeledSentence1(title, [j]))
        j += 1
    print("Number of texts processed: ", j)

    cores = multiprocessing.cpu_count()
    d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2,
                        sample=0, workers=cores, alpha=0.025, min_alpha=0.001)
    d2v_model.build_vocab([x for x in tqdm(all_content)])

    # Shuffle the documents, then train
    all_content = utils.shuffle(all_content)
    d2v_model.train(all_content, total_examples=len(all_content), epochs=30)

So d2v_model.docvecs.doctag_syn0 returns the vectors of all the sentences.
I now want to perform clustering using DBSCAN, but because I also have the numeric price column, I am having trouble fitting the final data to the model. My problem is similar to one described on Stack Overflow: one of my columns holds a 100-element array in each row, and the other column is a plain number. Hence, when I run DBSCAN on the data, I get the same error.
Is there any smart way to handle such cases, that is, to combine doc2vec output with other numerical columns to prepare it for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:
    clf = DBSCAN(eps=0.5, min_samples=10)
    X = clf.fit(both_numeric_categical_columns)
    labels = clf.labels_.tolist()

    cluster1 = query_result_mini.copy()
    cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)
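
To make the desired input concrete, here is a minimal sketch of one way such a combined matrix might be built. It is only an illustration under stated assumptions: the random doc_vectors array stands in for the real doc2vec output, combine_features and price_weight are hypothetical names, and the choice to L2-normalise the vectors and standardise the price is an assumption rather than an established recipe. The idea is to end up with a single 2-D numeric array (one row per title, 100 vector columns plus one price column) instead of a column that holds an array in each row.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in data (hypothetical): 6 titles, a 100-dim vector and a price for each.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(6, 100))  # placeholder for the doc2vec vectors
prices = np.array([12.0, 13.0, 12.5, 40.0, 41.0, 39.5])


def combine_features(doc_vectors, prices, price_weight=1.0):
    """Return one 2-D matrix: unit-length doc vectors plus a standardised price column.

    price_weight is a hypothetical knob for how much the price should
    influence the distances compared with the text vectors.
    """
    vecs = np.asarray(doc_vectors, dtype=float)
    # L2-normalise each row so every title contributes on a comparable scale.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    vecs = vecs / np.where(norms == 0, 1.0, norms)
    # Standardise the price to zero mean and unit variance.
    price_col = StandardScaler().fit_transform(
        np.asarray(prices, dtype=float).reshape(-1, 1)
    )
    # Stack side by side: shape (n_samples, vector_size + 1).
    return np.hstack([vecs, price_weight * price_col])


features = combine_features(doc_vectors, prices)

clf = DBSCAN(eps=0.5, min_samples=10)
labels = clf.fit_predict(features)  # one cluster label per row; -1 marks noise
```

With the real data, doc_vectors would be replaced by the trained model's document vectors and prices by the price column, and eps, min_samples and price_weight would all need tuning, since the distances now mix text similarity with price differences.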