In NLP, people tend to use cosine similarity to measure document/text distances. I would like to hear what people think of the following two scenarios: which should I pick, cosine similarity or Euclidean distance?

Overview of the task setting. The task is to compute context similarities of multi-word expressions (MWEs). For example, given the MWE `put up`, its context consists of the words to the left of `put up` as well as the words to its right in a text. Mathematically, similarity in this task amounts to calculating
```
sim(context_of_using_"put_up", context_of_using_"in_short")
```
Note that the context is a feature built on top of word embeddings; let's assume each word has an embedding of dimension `200`.

Two scenarios for representing `context_of_an_expression`:

1. Concatenate the left and right context words, producing an embedding vector of dimension `200*4=800` if we pick two words on each side. In other words, a feature vector `[lc1, lc2, rc1, rc2]` is built for the context, where `lc=left_context` and `rc=right_context`.

2. Average the left and right context word embeddings, producing a vector of `200` dimensions. In other words, a feature vector `mean(lc1 + lc2 + rc1 + rc2)` is built for the context (a sketch of both representations follows after this list).

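For concreteness, here is a minimal NumPy sketch of the two representations. The `embedding` dictionary, the toy `vocab`, and the helper names `context_concat`/`context_mean` are hypothetical placeholders I made up for illustration; in practice the 200-dimensional vectors would come from a pretrained model (word2vec, GloVe, etc.).

```
import numpy as np

DIM = 200
rng = np.random.default_rng(0)

# Hypothetical 200-d embedding lookup; replace with real pretrained vectors.
vocab = ["they", "quickly", "put", "up", "the", "tent", "in", "short"]
embedding = {w: rng.normal(size=DIM) for w in vocab}

def context_concat(left, right):
    """Scenario 1: concatenate [lc1, lc2, rc1, rc2] -> 800-d vector."""
    return np.concatenate([embedding[w] for w in left + right])

def context_mean(left, right):
    """Scenario 2: average the four context embeddings -> 200-d vector."""
    return np.mean([embedding[w] for w in left + right], axis=0)

# Context of "put up" in "they quickly put up the tent", two words per side.
v_concat = context_concat(["they", "quickly"], ["the", "tent"])  # shape (800,)
v_mean = context_mean(["they", "quickly"], ["the", "tent"])      # shape (200,)
```
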
[Edited] For both scenarios, I think Euclidean distance is a better fit. Cosine similarity is known for handling scale/length effects through normalization, but I don't think there is much here that needs to be normalized.
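To make the normalization point concrete, here is a small sketch (with made-up stand-in vectors, not real context features) showing that cosine similarity ignores rescaling of a vector while Euclidean distance does not:

```
import numpy as np

rng = np.random.default_rng(1)
c1 = rng.normal(size=200)  # stand-in context vector for "put up"
c2 = rng.normal(size=200)  # stand-in context vector for "in short"

def cosine_similarity(a, b):
    """Dot product of length-normalized vectors: insensitive to scale."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance: sensitive to vector magnitude."""
    return float(np.linalg.norm(a - b))

print(cosine_similarity(c1, c2), euclidean_distance(c1, c2))
# Rescaling one vector leaves cosine similarity unchanged but changes Euclidean distance:
print(cosine_similarity(2 * c1, c2), euclidean_distance(2 * c1, c2))
```

So the choice only matters if the context vectors differ meaningfully in magnitude; if they don't, the two metrics induce essentially the same ranking.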