What are some standard ways of computing the distance between individual search queries?

8

2

I asked a similar question about the distance between "documents" (Wikipedia articles, news stories, etc.). I'm posting this as a separate question because search queries are considerably shorter than documents and considerably noisier, so I don't know (and doubt) whether the same distance metrics would be used here.

Both vanilla lexical distance metrics and state-of-the-art semantic distance metrics are welcome, with a stronger preference for the latter.

Matt

Posted 2014-07-05T16:20:17.963

Reputation: 783

2

Search queries are not noisier (very few words in a query are unrelated to the search), but they may contain misspellings, ambiguity, slang, and other issues that you have to deal with separately. Beyond those, queries and documents can be processed in pretty much the same way. – ffriend – 2014-07-06T00:13:02.033

Maybe you can extract keyword vectors from the queries and then compute the distance between those vectors; how the similarity should be defined is, I think, still an open question. :) – crazyminer – 2014-07-06T06:15:21.697

1

Both of your questions are broad, subjective, and will require significant maintenance to avoid becoming obsolete. To the extent that the community appreciates that sort of question, keeping one of them might be reasonable - but certainly not both, when this question is a proper subset of the other. Please review What types of questions should I avoid asking?

– Air – 2014-07-08T15:13:07.897

Thanks, AirThomas! ffriend's comment certainly seems to indicate that this is a duplicate. I'll see what I can do about this. – Matt – 2014-07-16T00:14:25.850

Answers

4

In my experience, only some classes of queries can be classified using lexical features alone, due to the ambiguity of natural language. Instead, you can try using Boolean search results (sites or segments of sites rather than documents, without ranking) as features for classification, instead of words. This approach works well for classes of queries that have a lot of lexical ambiguity but also many good sites relevant to the query (e.g. movies, music, commercial queries, and so on).

Also, for offline classification you can run LSI on the query-site matrix. See the book "Introduction to Information Retrieval" for details.
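For illustration, here is a minimal sketch of that idea in Python, assuming you have already built a query-site matrix from Boolean search results; the toy queries, sites, counts, and the use of scikit-learn's TruncatedSVD as the LSI step are my own assumptions, not part of the answer.

    # Minimal LSI sketch over a (hypothetical) query-site matrix.
    # Rows are queries, columns are sites; a cell counts how often the
    # site appears in the Boolean search results for that query.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    queries = ["jaguar speed", "jaguar price", "python tutorial"]
    sites = ["wikipedia.org", "caranddriver.com", "python.org", "realpython.com"]

    # Hypothetical query-site count matrix (one row per query).
    X = np.array([
        [1, 0, 0, 0],  # "jaguar speed"    -> encyclopedia article
        [0, 2, 0, 0],  # "jaguar price"    -> car sites
        [0, 0, 2, 1],  # "python tutorial" -> programming sites
    ])

    # LSI here is a truncated SVD of the query-site matrix.
    lsi = TruncatedSVD(n_components=2)
    X_lsi = lsi.fit_transform(X)

    # Query-to-query similarity in the reduced latent space.
    print(cosine_similarity(X_lsi))

Note how the similarity comes from the sites the queries retrieve, not from the words they share, which is exactly what helps with lexically ambiguous queries.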

Alx49

Posted 2014-07-05T16:20:17.963

Reputation: 56

On a related note, I found this relevant paper.

– Matt – 2014-08-19T19:11:38.010

4

The cosine similarity metric does a good (if not perfect) job of controlling for document length, so comparing two documents or two queries using the cosine metric with tf-idf weights on the words should work well in either case. I would also recommend running LSA on the tf-idf weights first, and then computing the cosine distances/similarities.
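As a concrete sketch of that pipeline in Python with scikit-learn; the example queries and the parameter choices (character n-grams, two LSA components) are my own assumptions, not the answerer's.

    # tf-idf + LSA + cosine similarity on short queries (toy example).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.pipeline import make_pipeline

    queries = [
        "cheap flights to london",
        "low cost airline tickets london",
        "machine learning tutorial",
    ]

    # Character n-grams tend to be more robust to misspellings in short
    # queries than whole-word tokens; this choice is an assumption.
    tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    lsa = TruncatedSVD(n_components=2)
    X = make_pipeline(tfidf, lsa).fit_transform(queries)

    # Pairwise cosine similarities between the queries.
    print(cosine_similarity(X))

The first two queries should come out far more similar to each other than to the third, despite sharing almost no exact words.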

If you are trying to build a search engine, I would recommend using a free open-source search engine like Solr or Elasticsearch, or just the raw Lucene libraries, as they do most of the work for you and have good built-in methods for handling the query-to-document similarity problem.
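For completeness, a minimal sketch of handing the similarity problem to such an engine over HTTP; the local host/port, the index name "docs", and the field name "body" are all hypothetical assumptions.

    # Querying a (hypothetical) local Elasticsearch index over HTTP.
    # Assumes Elasticsearch is running on localhost:9200 and an index
    # named "docs" with a text field "body" already exists.
    import requests

    resp = requests.post(
        "http://localhost:9200/docs/_search",
        json={"query": {"match": {"body": "distance between search queries"}}},
    )

    # Elasticsearch scores each document against the query (BM25 by
    # default in recent versions) and returns the best matches first.
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["body"])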

Simon

Posted 2014-07-05T16:20:17.963

Reputation: 916