When to use cosine similarity over Euclidean similarity



In NLP, people tend to use cosine similarity to measure document/text distance. I would like to hear which you think fits the following two cases better: cosine similarity or Euclidean distance?

The task is to compute the similarity between the contexts (the words to the left and right of an expression) of multi-word expressions (e.g., put up, rain cats and dogs). Mathematically, the goal is to calculate sim(context_1_mwe, context_2_mwe). The context_n_mwe feature vector is built from word embeddings; assume the embedding dimension is 200.

Two ways to represent context_n_mwe:

  1. Concatenate the embeddings of the 2 left and 2 right context words, which gives a new embedding vector of 200*4=800 dimensions. In other words, a feature vector of [lc1, lc2, rc1, rc2] where lc=left_context and rc=right_context.

  2. Take the mean of the embeddings of the 2 left and 2 right context words, which gives a vector of 200 dimensions. In other words, a feature vector of [mean(lc1+lc2+rc1+rc2)].
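
The two representations can be sketched roughly as follows. This is only an illustration: `embed` is a hypothetical lookup that returns a fixed random 200-dimensional vector per word in place of real trained embeddings, and the function names are made up for the example.

```python
import numpy as np

EMBED_DIM = 200
_rng = np.random.default_rng(0)
_table = {}  # hypothetical stand-in for a trained embedding table


def embed(word):
    """Return a fixed random vector per word (placeholder for real embeddings)."""
    if word not in _table:
        _table[word] = _rng.standard_normal(EMBED_DIM)
    return _table[word]


def context_concat(lc1, lc2, rc1, rc2):
    """Representation 1: [lc1, lc2, rc1, rc2], i.e. 4 * 200 = 800 dimensions."""
    return np.concatenate([embed(w) for w in (lc1, lc2, rc1, rc2)])


def context_mean(lc1, lc2, rc1, rc2):
    """Representation 2: element-wise mean of the four vectors, 200 dimensions."""
    return np.mean([embed(w) for w in (lc1, lc2, rc1, rc2)], axis=0)


# Example context around "put up": "they will [put up] a fence"
v_concat = context_concat("they", "will", "a", "fence")
v_mean = context_mean("they", "will", "a", "fence")
print(v_concat.shape, v_mean.shape)
```

Note that the concatenated vector depends on word order and position, while the mean does not.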

Personally speaking, I think Euclidean distance is a better fit for both cases. Cosine similarity is specialized in handling scale/length effects. In case 1, the context length is fixed at 4 words, so there are no scale effects. In case 2, term frequency matters: a word that appears once is different from a word that appears twice, so we cannot apply cosine.


Posted 2018-02-12T13:31:46.740

Reputation: 273



When to use cosine similarity over Euclidean similarity

Cosine similarity looks at the angle between two vectors, Euclidean similarity at the distance between two points.

Let's say you are in an e-commerce setting and you want to compare users for product recommendations:

  • User 1 bought 1x eggs, 1x flour and 1x sugar.
  • User 2 bought 100x eggs, 100x flour and 100x sugar
  • User 3 bought 1x eggs, 1x Vodka and 1x Red Bull

By cosine similarity, user 2 is the most similar to user 1. By Euclidean similarity, user 3 is more similar to user 1.
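
To make the example concrete, here is a small sketch that encodes the three users as purchase-count vectors and computes both measures. The column order (eggs, flour, sugar, vodka, Red Bull) and the helper names are assumptions made for the illustration.

```python
import numpy as np

# Columns: eggs, flour, sugar, vodka, Red Bull
user1 = np.array([1, 1, 1, 0, 0], dtype=float)
user2 = np.array([100, 100, 100, 0, 0], dtype=float)
user3 = np.array([1, 0, 0, 1, 1], dtype=float)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u, v):
    """Straight-line distance between the points u and v."""
    return float(np.linalg.norm(u - v))


print(cosine_similarity(user1, user2))   # about 1.0: same direction
print(cosine_similarity(user1, user3))   # about 0.333
print(euclidean_distance(user1, user2))  # about 171.5
print(euclidean_distance(user1, user3))  # 2.0
```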

Questions in the text

I don't understand the first part.

Cosine similarity is specialized in handling scale/length effects. In case 1, the context length is fixed at 4 words, so there are no scale effects. In case 2, term frequency matters: a word that appears once is different from a word that appears twice, so we cannot apply cosine.

This goes in the right direction, but is not completely true. For example:

$$ \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}2\\1\end{pmatrix} \right) = \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}4\\2\end{pmatrix} \right) \neq \cos \left (\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}5\\2\end{pmatrix} \right) $$

With cosine similarity, the following is true:

$$ \cos \left (\begin{pmatrix}a\\b\end{pmatrix}, \begin{pmatrix}c\\d\end{pmatrix} \right) = \cos \left (\begin{pmatrix}a\\b\end{pmatrix}, n \cdot \begin{pmatrix}c\\d\end{pmatrix} \right) \text{ with } n \in \mathbb{N} $$

So frequencies are ignored only if all features are multiplied by the same constant.
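
The reason is that scaling one vector by n multiplies both the dot product and that vector's norm by n, so the factor cancels. A quick numeric check of the vectors used above (the helper name is just for this sketch):

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


ref = np.array([1.0, 0.0])
a = cosine_similarity(ref, np.array([2.0, 1.0]))
b = cosine_similarity(ref, np.array([4.0, 2.0]))  # (2, 1) scaled by 2
c = cosine_similarity(ref, np.array([5.0, 2.0]))  # only one feature changed

print(np.isclose(a, b))  # uniform scaling leaves the cosine unchanged
print(np.isclose(a, c))  # changing a single count does not
```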

Curse of Dimensionality

When you look at the table of my blog post, you can see:

  • The more dimensions I have, the closer the average distance and the maximum distance between randomly placed points become.
  • Similarly, the average angle between uniformly randomly placed points becomes 90°.
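
A rough simulation of both effects, as a sketch rather than a reproduction of the blog post's table. It assumes points drawn uniformly from the centered cube [-1, 1]^d and measures angles between the points viewed as vectors from the origin; the function name, sample size and seed are arbitrary choices.

```python
import numpy as np


def pairwise_stats(dim, n_points=200, seed=0):
    """Return (mean/max pairwise distance, mean angle, std of angle in degrees)."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n_points, dim))
    iu = np.triu_indices(n_points, k=1)  # each unordered pair once

    # Pairwise Euclidean distances via the Gram matrix.
    sq = np.sum(pts ** 2, axis=1)
    gram = pts @ pts.T
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * gram, 0.0))[iu]

    # Pairwise angles between the points seen as vectors from the origin.
    norms = np.sqrt(sq)
    cos = (gram / (norms[:, None] * norms[None, :]))[iu]
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    return dist.mean() / dist.max(), angles.mean(), angles.std()


for dim in (2, 10, 100, 1000):
    ratio, mean_angle, std_angle = pairwise_stats(dim)
    print(f"d={dim:5d}  mean/max distance={ratio:.2f}  "
          f"angle={mean_angle:.1f} +/- {std_angle:.1f} deg")
```

Under these assumptions the mean-to-max distance ratio moves toward 1 and the spread of the angles around 90° shrinks as d grows.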

So both measures suffer from high dimensionality. More about this: "Curse of dimensionality - does cosine similarity work better and if so, why?" A key point:

  • Cosine is essentially the same as Euclidean on normalized data.
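
For unit-length vectors the link is exact: the squared Euclidean distance equals 2 - 2·cos, so ranking neighbours by one measure gives the same order as ranking by the other. A small check on random vectors (the data here is made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal((2, 200))  # two arbitrary 200-dimensional vectors

# Normalize both vectors to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cos_sim = float(u_hat @ v_hat)
sq_dist = float(np.sum((u_hat - v_hat) ** 2))

# ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2 cos for unit vectors
print(np.isclose(sq_dist, 2 - 2 * cos_sim))
```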


You might be interested in metric learning. The principle is described/used in FaceNet: A Unified Embedding for Face Recognition and Clustering (my summary). Instead of taking one of the well-defined and simple metrics, you can learn a metric for the problem domain.
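
As a rough idea of what "learning a metric" optimizes, here is a minimal sketch of a triplet loss of the kind FaceNet uses: an anchor should end up closer to a positive (same class) than to a negative (different class) by some margin. The fixed arrays stand in for the outputs of a trained network, and the margin value and function names are illustrative assumptions, not taken from the answer.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean over the batch of max(||a - p||^2 - ||a - n||^2 + margin, 0)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy 2-D "embeddings": the positive is near the anchor, the negative is not.
a = l2_normalize(np.array([[1.0, 0.0]]))
p = l2_normalize(np.array([[0.9, 0.1]]))
n = l2_normalize(np.array([[0.0, 1.0]]))

print(triplet_loss(a, p, n))  # already well separated, so no penalty
print(triplet_loss(a, n, p))  # roles swapped, so the loss is positive
```

In an actual setup the loss would be minimized with respect to the parameters of the embedding network, which is what shapes the learned metric.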

Martin Thoma

Posted 2018-02-12T13:31:46.740

Reputation: 15 590

But don't we usually normalize the vectors before we use a similarity measure? In that case, your user 2 in the scenario is more similar to user 1 no matter what, isn't it? – IgNite – 2019-12-01T14:38:18.727


Ok, so, your intuition here is wrong. Not necessarily about the examples you gave, but about the idea that Euclidean distance could be useful in 200-dimensional space. 200D space is so, so empty. Everything is far from everything else. That's why we use cosine similarity: because everything is far from everything else, if two vectors are pointing in the same direction, that's already pretty good. [This one area of NLP is where my traditional math background came in most useful.]

The example that made this clear for me is thinking about the ratio of the volume of the unit sphere to the volume of the unit cube in n dimensions, as n goes to infinity (here the sphere is the ball of diameter 1 inscribed in the cube of side 1). You can read the answers on Stack Exchange Math or just think about the first few cases. In one dimension, a line, the unit sphere takes up 100% of the unit cube. In 2D, that's π/4: roughly 78%. In 3D, this is π/6, around 52%. By 10D, this is π^5/122880, or ~0.25%.

In 200D, the unit sphere is roughly 3.5 × 10^-169 of the unit cube. It's just a point. Euclidean distance just... becomes useless for most things.
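
These fractions can be recomputed from the standard n-ball volume formula, V_n(r) = π^(n/2) / Γ(n/2 + 1) · r^n, with r = 1/2 for the ball inscribed in the unit cube. A small sketch (the function name is made up; it works in log space to stay numerically safe for large n):

```python
import math


def inscribed_ball_fraction(n):
    """Fraction of the unit cube [0, 1]^n filled by the inscribed ball (radius 1/2)."""
    log_fraction = (
        (n / 2) * math.log(math.pi)   # pi^(n/2)
        - math.lgamma(n / 2 + 1)      # divided by Gamma(n/2 + 1)
        - n * math.log(2)             # times (1/2)^n
    )
    return math.exp(log_fraction)


for n in (1, 2, 3, 10, 200):
    print(n, inscribed_ball_fraction(n))
```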

Another closely related example is the Curse of Dimensionality in Sampling, quoted below for good measure:

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 10^2 = 100 evenly spaced sample points suffice to sample a unit interval (a "1-dimensional cube") with no more than 10^-2 = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10^-2 = 0.01 between adjacent points would require 10^20 [= (10^2)^10] sample points. In general, with a spacing distance of 10^-n the 10-dimensional hypercube appears to be a factor of 10^(n(10-1)) [= (10^n)^10 / (10^n)] "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example n = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 10^18 "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.
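
The arithmetic in the quote is easy to replay with exact integers. This sketch follows the quote's convention of 10^n sample points per axis for a spacing of 10^-n; the function name is hypothetical.

```python
def lattice_points(spacing_exponent, dims):
    """Sample points for spacing 10**-spacing_exponent in a dims-dimensional unit cube."""
    per_axis = 10 ** spacing_exponent  # quote's convention: 10^n points per axis
    return per_axis ** dims


interval = lattice_points(2, 1)     # unit interval at spacing 0.01
hypercube = lattice_points(2, 10)   # 10-dimensional unit hypercube, same spacing
print(interval, hypercube, hypercube // interval)
```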

Sam H.

Posted 2018-02-12T13:31:46.740

Reputation: 181

Not necessarily, it depends on the geometric properties of the embedding. I just recently came across a paper showing that skip-gram trained embeddings are narrowly clustered in a single orthant (https://www.aclweb.org/anthology/D17-1308). In Facebook's MUSE project, they also investigate the Euclidean method for alignment.

– Logan – 2019-01-10T01:10:56.173

The curse of dimensionality, and its relevance to real world data, was covered in Jeremy Howard & Rachel Thomas' fast.ai course. I found his view thought provoking. Quote below from his Deep Learning for Coders video: "'the more columns you have it basically creates a space that's more and more empty' That turns out just not to be the case. It's not the case for a number of reasons… in practice building models on lots and lots of columns works really really well" – Julian H – 2019-07-18T07:29:47.587

@JulianH I am not arguing against using high dimensional data. It rocks. However, Euclidean distance is often not as useful, because of the curse. – Sam H. – 2020-07-01T22:47:12.787