$\begingroup$

Please bear with me as I am new to NLP. I am using TensorFlow's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder-large/3

I am clustering texts based on the cosine similarity of the embeddings produced by the model, and I want to determine which cluster a new text would most likely fall into. My plan is to compare the new text's embedding to the mean (or median) of all the embeddings within each cluster and pick the closest one. Would the mean/median of a cluster's vectors "represent" the general idea of the cluster, or would that vector not capture what I am looking for?
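The comparison described above can be sketched as follows. This is only a minimal illustration, not a tested recipe for the encoder: the 3-dimensional toy vectors stand in for real sentence embeddings, and the helper names `normalize`, `cluster_centroids`, and `assign_cluster` are hypothetical. Each vector is scaled to unit length before averaging so that the centroid reflects direction (which is what cosine similarity measures) rather than magnitude.

```python
import numpy as np

def normalize(v):
    # Scale vectors to unit length along the last axis.
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cluster_centroids(embeddings, labels):
    # Mean of the unit-length embeddings in each cluster, re-normalized.
    unit = normalize(embeddings)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.stack([unit[labels == c].mean(axis=0) for c in ids])
    return ids, normalize(centroids)

def assign_cluster(new_embedding, ids, centroids):
    # For unit vectors, cosine similarity reduces to a dot product.
    sims = centroids @ normalize(new_embedding)
    return ids[int(np.argmax(sims))], sims

# Toy data (assumption): two clusters of made-up 3-d "embeddings".
embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])
labels = [0, 0, 1, 1]

ids, centroids = cluster_centroids(embeddings, labels)
cluster, sims = assign_cluster(np.array([0.8, 0.3, 0.1]), ids, centroids)
print(cluster, sims)
```

Whether the centroid is a meaningful summary depends on how tight each cluster is. Checking the similarity of the cluster's own members to their centroid is one way to judge that before relying on it.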

$\endgroup$

1 Answer

$\begingroup$

Well, the mean of word vectors is exactly that: an average over all the words.

Such averages tend to all be quite similar to one another, cluster in the center of the data, and have bland, generic words as their nearest neighbors.

The average word vector is not a good representation of what a text is about.

$\endgroup$
