In NLP, people tend to use cosine similarity to measure document/text distance. I just want to hear what you think for the following two cases: cosine similarity or Euclidean?
The task is to compute context (left and right words of an expression) similarities of multi-word expressions (e.g., put up, rain cats and dogs). Mathematically, to calculate sim(context_1_mwe, context_2_mwe). The context_n_mwe feature vector is built from word embeddings; assume the embedding dimension is 200.
Two ways to represent the context:

1. Concatenate the left and right 2 context words, giving a new embedding vector of 200 * 4 = 800 dimensions. In other words, a feature vector of [lc1, lc2, rc1, rc2], where lc1, lc2 are the embeddings of the two left context words and rc1, rc2 those of the two right context words.
2. Take the mean of the sum of the left and right 2 context words, giving a vector of 200 dimensions. In other words, a feature vector of [mean(lc1 + lc2 + rc1 + rc2)].
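For concreteness, here's a minimal sketch of the two representations and the two measures in NumPy (the embeddings are random placeholders; lc1, lc2, rc1, rc2 stand for the context-word embeddings described above):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200  # assumed embedding dimension

# Placeholder embeddings for the two left and two right context words.
lc1, lc2, rc1, rc2 = (rng.normal(size=dim) for _ in range(4))

# Case 1: concatenation -> an 800-dim feature vector [lc1, lc2, rc1, rc2].
context_concat = np.concatenate([lc1, lc2, rc1, rc2])

# Case 2: mean of the four context embeddings -> a 200-dim feature vector.
context_mean = np.mean([lc1, lc2, rc1, rc2], axis=0)

def cosine_similarity(a, b):
    # Compares direction only: invariant to the length/scale of a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Sensitive to both direction and magnitude.
    return np.linalg.norm(a - b)
```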
Personally speaking, I think Euclidean distance is a better fit for both cases. Cosine similarity is specialized for handling scale/length effects. For case 1 the context length is fixed -- 4 words -- so there are no scale effects. For case 2, term frequency matters: a word that appears once is different from a word that appears twice, so we cannot apply cosine.
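To illustrate the case-2 point in its simplest form (again a sketch with a placeholder vector): cosine is invariant to scaling, so it cannot tell a context vector from the same vector doubled, whereas Euclidean distance can:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=200)  # placeholder 200-dim context vector

cos = np.dot(v, 2 * v) / (np.linalg.norm(v) * np.linalg.norm(2 * v))
euc = np.linalg.norm(v - 2 * v)

print(cos)  # 1.0: cosine sees v and 2*v as identical
print(euc)  # equals ||v|| > 0: Euclidean distinguishes them
```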