Jazz

DBSCAN on textual and numerical columns

I have a dataset which has two columns:

    title       price
    sentence1   12
    sentence2   13

I have used doc2vec to convert the sentences into vectors of size 100 as below:

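    import multiprocessing

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn import utils
    from tqdm import tqdm

    # Tag each cleaned title with its row position so vectors can be matched back to rows.
    # This assumes each title is already a list of tokens; raw strings would need splitting first.
    all_content = [
        TaggedDocument(title, [j])
        for j, title in enumerate(query_result['title_clean'].values)
    ]
    print("Number of texts processed: ", len(all_content))

    cores = multiprocessing.cpu_count()
    d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2,
                        sample=0, workers=cores, alpha=0.025, min_alpha=0.001)

    # Build the vocabulary once, then train on the shuffled corpus.
    d2v_model.build_vocab([x for x in tqdm(all_content)])
    all_content = utils.shuffle(all_content)
    d2v_model.train(all_content, total_examples=len(all_content), epochs=30)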

So d2v_model.docvecs.doctag_syn0 gives me the vectors for all the sentences. (In gensim 4.x the same array is exposed as d2v_model.dv.vectors.)

I now want to cluster the rows with DBSCAN, but because I also have the numeric price column, I am having trouble fitting the combined data to the model. My problem is similar to one described on Stack Overflow: one column holds a 100-element array in each row, while the other column is a plain number. When I run DBSCAN on this data, I get the same error.
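To illustrate the shape mismatch described above, here is a minimal sketch with made-up data. The random vectors are stand-ins for the doc2vec output, and the column names `vec` and `price` are hypothetical. A column that stores one array per row has object dtype, so the frame presents itself as 2 columns rather than 101 numeric ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 100))   # stand-in for doc2vec output (assumption)

df = pd.DataFrame({
    "vec": list(vectors),             # one 100-element array per row
    "price": [12.0, 13.0, 11.5],
})

print(df.dtypes["vec"])               # the array column is stored as object
print(df.to_numpy().shape)            # the arrays are not expanded into columns

# Stacking the per-row arrays yields a plain 2-D numeric matrix instead.
expanded = np.vstack(df["vec"].tolist())
print(expanded.shape)
```

An estimator that expects a 2-D numeric array cannot work with the object-typed frame directly, whereas the stacked matrix has one numeric column per embedding dimension.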

Is there a smart way to handle such cases, that is, to combine the doc2vec output with other numerical columns to prepare them for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:

    from sklearn.cluster import DBSCAN

    clf = DBSCAN(eps=0.5, min_samples=10)
    # fit_predict fits the model and returns one cluster label per row (-1 marks noise)
    labels = clf.fit_predict(both_numeric_categical_columns)

    cluster1 = query_result_mini.copy()
    cluster1['clusters'] = labels
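One possible way to build such an input is sketched below. This is an illustration under stated assumptions, not a confirmed solution. The random arrays stand in for the doc2vec vectors and the price column. Standardising price with `StandardScaler` is my own assumption, made so that a single column on a different scale does not dominate the Euclidean distances DBSCAN uses by default. The rows of the two arrays are assumed to be in the same order.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(50, 100))   # stand-in for the doc2vec vectors (assumption)
price = rng.uniform(10, 15, size=50)       # stand-in for the price column (assumption)

# Standardise price to zero mean and unit variance; this step is an assumption,
# and other scalings or weightings may suit the data better.
price_scaled = StandardScaler().fit_transform(price.reshape(-1, 1))

# One row per document: 100 embedding dimensions followed by 1 price column.
both_numeric_categical_columns = np.hstack([doc_vectors, price_scaled])

# eps and min_samples are copied from the question and would likely need tuning.
clf = DBSCAN(eps=0.5, min_samples=10)
labels = clf.fit_predict(both_numeric_categical_columns)

print(both_numeric_categical_columns.shape, labels.shape)
```

With real data, the doc2vec vectors would replace `doc_vectors`, taken in tag order so that row j of the matrix lines up with row j of the price column.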