Creating similarity metric with Doc2Vec and additional features


I have a dataset in which each record is a company with many features.

For example...

Company A:

  • Keywords - data, big data, tableau, dashboards, etc.

  • Industry - Information Technology

  • Sub-Industry - Data Visualization

  • Total Funding - $150,000,000

I want to create a similarity metric between companies that incorporates both doc2vec embeddings trained on the keyword lists and the additional features listed above. I've had a hard time finding papers that do something like this. Any ideas?

Nate Raw

Posted 2018-09-27T20:52:25.110

Reputation: 133



You could think of your similarity measure as a search problem if you consider one record a query, and the "near" records as search results.

I've had some good results following this paper:

As I understand it, the document vectors used in the paper were only good for improving the relevance of results that were already decent (the top N results), i.e. for re-ranking rather than for retrieval from scratch.

That suggests to me that you might first develop a similarity score that works with your other attributes alone, and then combine it with the doc2vec similarity in something like a weighted average, where the weight given to the doc2vec score decays quickly as the attribute-based score gets worse.
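Here is a rough sketch of what I mean, in plain Python. Everything in it is an assumption made for illustration, not something taken from the paper: the field names (`industry`, `sub_industry`, `funding`, `vec`), the 0.4/0.3/0.3 attribute weights, the exponential decay, and the `max_text_weight` and `decay` parameters are all hypothetical and would need tuning on your data. The `vec` entries stand in for the document vectors you would get from your trained doc2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def attribute_similarity(a, b):
    """Hypothetical structured score in [0, 1] built from the non-text features."""
    industry = 1.0 if a["industry"] == b["industry"] else 0.0
    sub_industry = 1.0 if a["sub_industry"] == b["sub_industry"] else 0.0
    # Compare funding on a log scale so $1M vs $10M counts like $10M vs $100M.
    log_a = math.log10(a["funding"] + 1)
    log_b = math.log10(b["funding"] + 1)
    funding = 1.0 / (1.0 + abs(log_a - log_b))
    # Illustrative weights only; they sum to 1.
    return 0.4 * industry + 0.3 * sub_industry + 0.3 * funding

def combined_similarity(a, b, max_text_weight=0.5, decay=5.0):
    """Weighted average whose doc2vec weight shrinks as the attribute score drops."""
    attr = attribute_similarity(a, b)
    text = (cosine(a["vec"], b["vec"]) + 1.0) / 2.0  # rescale [-1, 1] to [0, 1]
    # Full text weight when attr == 1, decaying exponentially as attr falls.
    w = max_text_weight * math.exp(-decay * (1.0 - attr))
    return (1.0 - w) * attr + w * text

# Toy records; the vectors are made up, not real doc2vec output.
company_a = {"industry": "Information Technology", "sub_industry": "Data Visualization",
             "funding": 150_000_000, "vec": [0.9, 0.1, 0.3]}
company_b = {"industry": "Information Technology", "sub_industry": "Data Visualization",
             "funding": 90_000_000, "vec": [0.8, 0.2, 0.4]}
company_c = {"industry": "Healthcare", "sub_industry": "Medical Devices",
             "funding": 1_000_000, "vec": [0.9, 0.1, 0.3]}

score_ab = combined_similarity(company_a, company_b)
score_ac = combined_similarity(company_a, company_c)
```

With these toy records, `company_c` has exactly the same keyword vector as `company_a` but still scores far below `company_b`, because its attribute score is low and the doc2vec weight has decayed to almost nothing. That is the behaviour I was after: the text similarity only reorders candidates that already look similar on the structured features.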


Posted 2018-09-27T20:52:25.110

Reputation: 144

Forgot to accept previously! Sorry about that. Accepted. Thank you. – Nate Raw – 2018-12-31T04:14:56.430