Calculate cosine similarity in Apache Spark



I have a DataFrame with the IDF of certain words computed. For example:

(10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007, 1.4004620708967006,0.37876590175292424,0.48374466516332])

 .... and so on

Now, given a query Q, I can calculate the TF-IDF of this query. How do I calculate the cosine similarity of the query with all documents in the DataFrame (there are close to a million documents)?

I could do it manually in a map-reduce job by using the vector multiplication

Cosine Similarity (Q, document) = Dot product(Q, document) / (||Q|| * ||document||)

but surely Spark ML must natively support calculating cosine similarity of a text?

In other words, given a search query, how do I find the documents in the DataFrame whose TF-IDF vectors have the highest cosine similarity to the query?

Ganesh Krishnan

Posted 2016-08-10T05:43:41.613

Reputation: 223


You can make use of Spark's Normalizer and, if you are interested in "all-pairs similarity", DIMSUM.

– Emre – 2016-08-10T06:44:26.990



There's a related example to your problem in the Spark repo here. The strategy is to represent the documents as a RowMatrix and then use its columnSimilarities() method. Note that this method compares the columns of the matrix, so the documents (including the query) have to be laid out as columns rather than rows. That will get you a matrix of all the cosine similarities. Extract the row which corresponds to your query document and sort. That will give the indices of the most similar documents.

Depending on your application, all of this work can be done pre-query.
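If you only need one query against all documents rather than all pairs, the pre-query work can be as small as normalizing every document vector to unit length once (which is what the Normalizer mentioned in the comment does). The cosine similarity then reduces to a plain dot product with the normalized query. Here is a rough, Spark-free sketch of that idea in plain Python, just to show the arithmetic. The sparse vectors are dicts of index to weight, and the names normalize, dot, rank_by_cosine and the toy documents are all made up for illustration, not part of any Spark API.

```python
import math


def normalize(vec):
    """Scale a sparse vector (dict: term index -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return dict(vec)
    return {i: w / norm for i, w in vec.items()}


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


def rank_by_cosine(query, unit_docs, top_k=3):
    """Return the top_k (doc_id, cosine) pairs for a query.

    unit_docs must already be normalized, so the dot product of two
    unit vectors is exactly their cosine similarity.
    """
    q = normalize(query)
    scores = [(doc_id, dot(q, d)) for doc_id, d in unit_docs.items()]
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores[:top_k]


# Hypothetical toy TF-IDF vectors (assumption: not taken from the question).
docs = {
    "d1": {0: 1.0, 1: 1.0},
    "d2": {1: 2.0},
    "d3": {2: 3.0},
}

# Pre-query step: normalize every document once and keep the result.
unit_docs = {doc_id: normalize(v) for doc_id, v in docs.items()}

# Query time: normalize the query and take dot products.
query = {1: 0.5}
ranked = rank_by_cosine(query, unit_docs)
print(ranked)
```

In Spark the same two steps would be a one-off normalization of the TF-IDF column followed by a per-row dot product with the broadcast query vector, which avoids building the full all-pairs similarity matrix for a million documents.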


Posted 2016-08-10T05:43:41.613

Reputation: 754