I have a DataFrame with the IDF of certain words already computed. For example:

```
(10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007, 1.4004620708967006,0.37876590175292424,0.48374466516332])
.... and so on
```

Now, given a query Q, I can calculate the TF-IDF of this query. How do I calculate the cosine similarity of the query with all documents in the DataFrame (there are close to a million documents)?

I could do it manually in a map-reduce job using vector multiplication:

Cosine similarity(Q, document) = dot product(Q, document) / (||Q|| * ||document||)

but surely Spark ML must natively support calculating cosine similarity of a text?

In other words, given a search query, how do I find the documents in the DataFrame whose TF-IDF vectors have the highest cosine similarity to it?
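For reference, here is a minimal plain-Python sketch of the formula above, applied to sparse vectors. It is not Spark code: the `{index: value}` dict representation and the helper names `sparse_dot` and `cosine_similarity` are assumptions made for illustration, and the query vector is made up. The document vector reuses the values from the example row.

```python
import math


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller vector; missing indices count as zero.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0.0) for index, value in a.items())


def cosine_similarity(q, d):
    """Dot product divided by the product of the two L2 norms."""
    norm_q = math.sqrt(sparse_dot(q, q))
    norm_d = math.sqrt(sparse_dot(d, d))
    if norm_q == 0.0 or norm_d == 0.0:
        # Convention chosen here: an all-zero vector is similar to nothing.
        return 0.0
    return sparse_dot(q, d) / (norm_q * norm_d)


# Document vector taken from the example row above (size 10, indices 0..5).
document = {
    0: 0.413734499590671,
    1: 0.4244680552337798,
    2: 0.4761400657781007,
    3: 1.4004620708967006,
    4: 0.37876590175292424,
    5: 0.48374466516332,
}

# Hypothetical query TF-IDF vector, invented for this sketch.
query = {0: 1.0, 3: 2.0}

print(cosine_similarity(query, document))
```

Note the parentheses around the two norms: dividing by `norm_q` and then multiplying by `norm_d` would give a different (wrong) result.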

You can make use of Spark's Normalizer and, if you are interested in "all-pairs similarity", DIMSUM.

– Emre – 2016-08-10T06:44:26.990
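The idea behind the Normalizer suggestion can be sketched in plain Python: once every vector is scaled to unit L2 length, cosine similarity reduces to a plain dot product, so the document norms only have to be computed once. In Spark the scaling step would correspond to `pyspark.ml.feature.Normalizer` with `p=2.0`; everything below (the dict representation, the helper names `l2_normalize`, `dot` and `top_k`, and the toy documents) is an assumption for illustration, not Spark API.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a sparse {index: value} vector to unit L2 length."""
    norm = math.sqrt(sum(value * value for value in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {index: value / norm for index, value in vec.items()}


def dot(a, b):
    """Dot product of two sparse vectors; missing indices count as zero."""
    return sum(value * b.get(index, 0.0) for index, value in a.items())


def top_k(query, normalized_docs, k):
    """Return the k (score, doc_id) pairs with the highest cosine similarity.

    normalized_docs maps doc_id -> unit-length vector, so the dot product
    with the normalized query is already the cosine similarity.
    """
    q = l2_normalize(query)
    scored = ((dot(q, vec), doc_id) for doc_id, vec in normalized_docs.items())
    return heapq.nlargest(k, scored)


# Toy corpus, invented for this sketch; normalize each document once up front.
docs = {"a": {0: 1.0}, "b": {1: 1.0}, "c": {0: 1.0, 1: 1.0}}
normalized_docs = {doc_id: l2_normalize(vec) for doc_id, vec in docs.items()}

for score, doc_id in top_k({0: 2.0}, normalized_docs, k=2):
    print(doc_id, score)
```

With a million documents the same pattern would apply per row: normalize the TF-IDF column once, then score each row with a dot product against the normalized query and keep the top results.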