I have trained word2vec on a dataset. Is it a good idea to cluster the output vectors?
- Don't cluster with the Euclidean distance if you're operating in very high dimensions (typical of word2vec). Use cosine similarity instead. The reason is a bit technical; cf. this thread. – Emre, Mar 11, 2016 at 2:35
1 Answer
It's totally fine to cluster word2vec output to find semantically similar words. KMeans is an option; you might also want to check out an approximate nearest-neighbor scheme such as Locality Sensitive Hashing.
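As a rough illustration, here is a minimal sketch of k-means-style clustering driven by cosine similarity (sometimes called spherical k-means). The four 3-dimensional vectors are made up for the example and stand in for real word2vec output, and `spherical_kmeans` is a hypothetical helper name, not part of any library. It assumes only NumPy.

```python
import numpy as np


def spherical_kmeans(vectors, k, n_iter=20, seed=0):
    """Group the rows of `vectors` into k clusters using cosine similarity."""
    rng = np.random.default_rng(seed)
    # Scale every vector to unit length, so a dot product is a cosine similarity.
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to the centroid it is most similar to.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Move each centroid to the renormalized mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return labels


# Toy stand-ins for word vectors (assumed values, not real word2vec output).
words = ["cat", "dog", "car", "truck"]
toy_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
])

labels = spherical_kmeans(toy_vectors, k=2)
for word, label in zip(words, labels):
    print(word, label)
```

With real embeddings you would pass the model's vector matrix instead of `toy_vectors`. Because the vectors are normalized first, an off-the-shelf Euclidean KMeans run on the normalized matrix should behave similarly, since for unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity.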
- I was also looking at examples where people had taken an average output of the prediction. – Krishna Kalyan, Mar 12, 2016 at 19:03