I have a DataFrame with the IDF of certain words computed. For example:

    (10,[0,1,2,3,4,5],[0.413734499590671,0.4244680552337798,0.4761400657781007,1.4004620708967006,0.37876590175292424,0.48374466516332])

...and so on.
Now, given a query Q, I can calculate the TF-IDF of that query. How do I calculate the cosine similarity of the query with every document in the DataFrame (there are close to a million documents)?
I could do it manually in a map-reduce job using vector multiplication:

    CosineSimilarity(Q, document) = DotProduct(Q, document) / (||Q|| * ||document||)
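To make the manual approach concrete, here is a minimal plain-Python sketch (not Spark code) that computes this formula for two sparse vectors stored as parallel `indices`/`values` arrays, the same layout as the SparseVector example above; the helper names are my own, not from any library:

```python
import math

def sparse_dot(idx_a, val_a, idx_b, val_b):
    # Dot product of two sparse vectors given as (indices, values) pairs:
    # only indices present in both vectors contribute.
    lookup = dict(zip(idx_a, val_a))
    return sum(v * lookup.get(i, 0.0) for i, v in zip(idx_b, val_b))

def norm(vals):
    # Euclidean (L2) norm of the non-zero values.
    return math.sqrt(sum(v * v for v in vals))

def cosine_similarity(idx_a, val_a, idx_b, val_b):
    denom = norm(val_a) * norm(val_b)
    if denom == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return sparse_dot(idx_a, val_a, idx_b, val_b) / denom
```

Identical vectors give 1.0 and vectors with disjoint indices give 0.0, as expected.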
but surely Spark ML must natively support calculating the cosine similarity of text?

In other words, given a search query, how do I find the documents in the DataFrame whose TF-IDF vectors are closest to it by cosine similarity?
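One design note on whatever approach is used: if the document vectors are L2-normalized once up front (Spark ML's `Normalizer` can do this), cosine similarity against any query reduces to a plain dot product, and top-k retrieval is just a sort. A small NumPy sketch of that idea, with made-up dense example data rather than real TF-IDF vectors:

```python
import numpy as np

# Toy document matrix: one row per document. Pre-normalize each row to
# unit length once, so cosine similarity later is only a dot product.
docs = np.array([[0.4, 1.4, 0.0],
                 [0.0, 0.5, 0.5],
                 [1.0, 0.0, 0.2]])
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# Normalize the query the same way.
query = np.array([0.4, 1.4, 0.0])
q_unit = query / np.linalg.norm(query)

sims = docs_unit @ q_unit        # cosine similarity with every document
top_k = np.argsort(-sims)[:2]    # indices of the 2 most similar documents
```

The same normalize-once trick applies at DataFrame scale: store normalized vectors and the per-row map step becomes a single sparse dot product.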