
I am new to NLP and I'm trying to compute embeddings for a clustering problem. I have created a word2vec model using Python's gensim library, but I am wondering about the following:

The word2vec model embeds each word into a vector of size vector_size. However, in later steps of the clustering pipeline, I realised I was clustering single words instead of the sentences my dataset originally contained.

Let's say my vocabulary is composed of the two words foo and bar, mapped as follows:

foo: [0.0045, -0.0593, 0.0045]
bar: [-0.943, 0.05311, 0.5839]

If I have the sentence bar foo, how can I embed it? That is, how can I get a vector for the entire sentence as a whole?

Thanks in advance.


1 Answer


The usual approach is to average the vectors of all words in the sentence.
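A minimal sketch of this averaging, using the toy vectors from the question (with a trained gensim model you would look up each word in `model.wv` instead of a hand-built dict):

```python
import numpy as np

# Toy word vectors from the question (vector_size = 3).
word_vectors = {
    "foo": np.array([0.0045, -0.0593, 0.0045]),
    "bar": np.array([-0.943, 0.05311, 0.5839]),
}

def sentence_vector(sentence, vectors):
    """Embed a sentence as the element-wise mean of its word vectors.

    Words missing from the vocabulary are skipped; if no word is
    known, fall back to a zero vector of the same dimensionality."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

vec = sentence_vector("bar foo", word_vectors)
# vec is the mean of the "bar" and "foo" vectors, e.g. vec[0] == (0.0045 - 0.943) / 2
```

The resulting sentence vector has the same dimensionality as the word vectors, so it can be fed directly into any clustering algorithm that expects fixed-size inputs.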

  • This was my first thought too. However, I have just realised that there is a Doc2Vec model that basically appends each word's vector. In the end, I would have to average all of the vectors appended by Doc2Vec, I guess? – Commented Feb 14, 2022 at 8:30
  • Sorry, I have never used doc2vec, so I can't answer the question in your comment. You may want to post it as a separate question to get answers to it. – Commented Feb 14, 2022 at 9:47
