added 75 characters in body


edited Nov 6, 2020 at 9:19

Jazz


I have a dataset which has two columns:

    title      price
    sentence1  12
    sentence2  13

I have used doc2vec to convert the sentences into vectors of size 100 as below:

    LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
    all_content = []
    j = 0
    for title in query_result['title_clean'].values:
        all_content.append(LabeledSentence1(title, [j]))
        j += 1
    print("Number of texts processed: ", j)

    cores = multiprocessing.cpu_count()
    d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2,
                        sample=0, workers=cores, alpha=0.025, min_alpha=0.001)
    d2v_model.build_vocab([x for x in tqdm(all_content)])
    all_content = utils.shuffle(all_content)
    d2v_model.train(all_content, total_examples=len(all_content), epochs=30)

So d2v_model.docvecs.doctag_syn0 gives me the vectors for all the sentences.

I now want to cluster the data with DBSCAN, but because I also have the numeric price column, I am having trouble fitting the combined data to the model. My problem is similar to one described on Stack Overflow: one of my columns holds an array of 100 values in each row, and the other column is a plain number. When I run DBSCAN on this data, I get the same error.
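To make the shape problem concrete, here is a minimal sketch using NumPy only. The random vectors are hypothetical stand-ins for the doc2vec output, and the names as_column and as_matrix are illustrative, not part of my actual code. It shows that a column holding one array per row is a 1-D object array, whereas stacking those arrays gives the 2-D numeric layout that scikit-learn estimators expect.

```python
import numpy as np

# Hypothetical stand-in for the doc2vec output: 3 documents, 100 dimensions each.
rng = np.random.default_rng(0)
doc_vectors = [rng.normal(size=100) for _ in range(3)]

# Storing one array per row (as in a DataFrame column) yields an object array
# of shape (3,), which is not a numeric matrix.
as_column = np.empty(len(doc_vectors), dtype=object)
for i, vec in enumerate(doc_vectors):
    as_column[i] = vec

# Stacking the per-row arrays produces a plain float matrix of shape (3, 100).
as_matrix = np.vstack(doc_vectors)

print(as_column.dtype, as_column.shape)
print(as_matrix.dtype, as_matrix.shape)
```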

Is there a smart way to handle such cases, that is, to combine the doc2vec output with other numerical columns to prepare the data for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:

    clf = DBSCAN(eps=0.5, min_samples=10)
    X = clf.fit(both_numeric_categical_columns)
    labels = clf.labels_.tolist()
    cluster1 = query_result_mini.copy()
    cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)
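For illustration, one possible way to build such an input is sketched below. This is only an assumption about what a workable layout could look like, not a confirmed solution: the helper combine_features and its price_weight parameter are hypothetical names, the random vectors stand in for the doc2vec output, and standardizing every column is a modelling choice rather than a requirement.

```python
import numpy as np


def standardize(a):
    """Scale each column to zero mean and unit variance (constant columns stay at zero)."""
    std = a.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (a - a.mean(axis=0)) / std


def combine_features(doc_vectors, prices, price_weight=1.0):
    """Hypothetical helper: join (n, d) document vectors and n prices into an (n, d + 1) matrix."""
    vectors = np.asarray(doc_vectors, dtype=float)               # shape (n, d)
    price_col = np.asarray(prices, dtype=float).reshape(-1, 1)   # shape (n, 1)
    # price_weight is an assumed knob for how much the price should
    # influence distances relative to the d embedding dimensions.
    return np.hstack([standardize(vectors), price_weight * standardize(price_col)])


# Hypothetical data: 5 documents with 100-dimensional vectors and a price each.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 100))
prices = [12, 13, 12, 40, 41]

both_numeric_categical_columns = combine_features(doc_vectors, prices)
print(both_numeric_categical_columns.shape)  # (5, 101)
```

A matrix of this shape is a plain 2-D float array, so it could be passed to clf.fit_predict as in the snippet above. The eps value would likely need retuning, because distances in 101 standardized dimensions are on a different scale than in the raw data.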


asked Nov 5, 2020 at 20:33

Jazz


DBSCAN on textual and numerical columns

I have a dataset which has two columns:

    title      price
    sentence1  12
    sentence2  13

I have used doc2vec to convert the sentences into vectors of size 100 as below:

    LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
    all_content = []
    j = 0
    for title in query_result['title_clean'].values:
        all_content.append(LabeledSentence1(title, [j]))
        j += 1
    print("Number of texts processed: ", j)

    d2v_model = Doc2Vec(all_content, size=100, window=10, min_count=500,
                        workers=7, dm=1, alpha=0.025, min_alpha=0.001)
    d2v_model.train(all_content, total_examples=d2v_model.corpus_count,
                    epochs=10, start_alpha=0.002, end_alpha=-0.016)

So d2v_model.docvecs.doctag_syn0 gives me the vectors for all the sentences.

I now want to cluster the data with DBSCAN, but because I also have the numeric price column, I am having trouble fitting the combined data to the model. My problem is similar to one described on Stack Overflow: one of my columns holds an array of 100 values in each row, and the other column is a plain number. When I run DBSCAN on this data, I get the same error.

Is there a smart way to handle such cases, that is, to combine the doc2vec output with other numerical columns to prepare the data for clustering? Something like this, where both_numeric_categical_columns is the desired input to the model:

    clf = DBSCAN(eps=0.5, min_samples=10)
    X = clf.fit(both_numeric_categical_columns)
    labels = clf.labels_.tolist()
    cluster1 = query_result_mini.copy()
    cluster1['clusters'] = clf.fit_predict(both_numeric_categical_columns)
