## Cosine similarity versus dot product as distance metrics

59

19

It looks like the cosine similarity of two features is just their dot product scaled by the product of their magnitudes. When does cosine similarity make a better distance metric than the dot product? I.e. do the dot product and cosine similarity have different strengths or weaknesses in different situations?

Note that neither of these are proper distance metrics, even if you transform them to be a value that is small when points are "similar". It may or may not matter for your use case. – Sean Owen – 2014-07-18T11:34:09.417

54

Think geometrically. Cosine similarity only cares about angle difference, while dot product cares about angle and magnitude. If you normalize your data to have the same magnitude, the two are indistinguishable. Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice, but if magnitude plays a role, dot product would be better as a similarity measure. Note that neither of them is a "distance metric".

1@ffriend You mean 'dissimilarity'. Metric has a precise definition. – Memming – 2016-03-29T12:42:06.673

4"distance metric" is commonly used as an opposite of "similarity" in literature: the larger distance, the smaller similarity, but basically they represent same idea. – ffriend – 2014-07-17T20:52:20.197

@Memming what I don't understand is that most clustering algorithms in sklearn/scipy use 1-cosine similarity as the distance 'metric', as opposed to angular distance (which is a true metric). Wouldn't the violation of triangle inequality cause weird results? And why not use angular distance? It doesn't make sense. – moefasa – 2020-03-07T03:10:39.583

10

You are right, cosine similarity has a lot of common with dot product of vectors. Indeed, it is a dot product, scaled by magnitude. And because of scaling it is normalized between 0 and 1. CS is preferable because it takes into account variability of data and features' relative frequencies. On the other hand, plain dot product is a little bit "cheaper" (in terms of complexity and implementation).

Why the dot product alone (equivalent to not normalizing) not account for features' data and frequency? I don't know that this is the difference. – Sean Owen – 2014-07-18T11:27:57.540

2Perhaps, I wasn't clear. I was talking about data diversity. E.g., we have two pairs of documents. Within each pair docs are identical, but pair-1 documents are shorter, than pair-2 ones. And we computing similarity within each pair. Dot product would produce different numbers, though in both cases maximum similarity estimate is expected. – sobach – 2014-07-18T12:36:29.073

7

I would like to add one more dimension to the answers given above. Usually we use cosine similarity with large text, because using distance matrix on paragraphs of data is not recommended. And also if you intend your cluster to be broad you tend to go with cosine similarity as it captures similarity overall.

For example if you have texts which are two or three words long at max I feel using cosine similarity does not achieve the precision as achieved by distance metric.

6

There is an excellent comparison of the common inner-product-based similarity metrics here.

In particular, Cosine Similarity is normalized to lie within $$[-1,1]$$, unlike the dot product which can be any real number. But, as everyone else is saying, that will require ignoring the magnitude of the vectors. Personally, I think that's a good thing. I think of magnitude as an internal (within-vector) structure, and angle between vectors as external (between vector) structure. They are different things and (in my opinion) are often best analyzed separately. I can't imagine a situation where I would rather compute inner products than compute cosine similarities and just compare the magnitudes afterward.

2"Cosine Similarity is normalized to lie within [0,1]" It still has a dot product in the numerator, I think the range should instead be [-1, 1]? – Kari – 2018-03-19T09:10:52.050

2

From a geometric point of view, if all your data are unitary, $\forall x, ||x||^2 = \langle x,x \rangle = 1$, then the scalar product of two vectors defines an angle $\phi$, $\langle x,y \rangle = \cos \phi$, and you have a distance $\phi = \arccos \langle x,y \rangle$.

Visually, all your data live on a unit sphere. Using a dot product as a distance will give you a chordal distance, but if you use this cosine distance, it corresponds to the length of the path between the two points on the sphere. That means, if you want an average of the two points, you should take the point in-between on this path (geodesic) rather than the mid-point obtained from the 'arithmetic average/dot product/euclidean geometry' since this point does not live on the sphere (hence essentially not the same object)!

1

As others have pointed out, these are not distance "metrics", because they do not satisfy the metric criteria. Say instead "distance measure".

Anyway, what are you measuring and why? That information will help us give a more useful answer for your situation.

I've always wondered about the difference between measures and metrics. According the government (NIST): "...We use measure for more concrete or objective attributes and metric for more abstract, higher-level, or somewhat subjective attributes. ... Robustness, quality (as in "high quality"), and effectiveness are important attributes that we have some consistent feel for, but are hard to define objectively. Thus these are metrics." But the context is software engineering, not mathematics. What's your take? – ahoffer – 2014-08-05T20:55:59.590

1Wikipedia was more helpful. distance(x,y) is must be non-negative; d(x,y)=0 only if x=y; d(x,y) = d(y,x); and satisfy triangle inequality- d(x, z) ≤ d(x, y) + d(y, z) – ahoffer – 2014-08-05T21:03:19.577

1That's pretty much it: a metric has to meet certain axioms and a measure is less strictly defined. – sintax – 2014-08-06T03:34:29.120

0

Cosine Similarity = what percentage of the effort is in the same direction. Negative value is a percentage of effort in the opposite direction. Zero is working at cross-purposes. Nothing in common.

Dot product = a measure describing the total quantity of effort in the same direction.

1Thank you for providing an answer. Is there a chance you can edit your answer so it is clearer for people to understand. (i.e. what do you mean by "effort"?) – shepan6 – 2020-07-16T07:42:36.967