## Best way to search for a similar document given the ngram

7

I have a database of about 200 documents whose ngrams I have extracted. I want to find the document in my database that is most similar to a query document. In other words, I want to find the document in the database that shares the largest number of ngrams with the query document. Right now, I can go through each one and compare them one by one, but this takes O(N) time and is expensive if N is very large. I was wondering if there are any efficient data structures or methods for doing efficient similarity search. Thanks

How are you going to find the most ngrams if you don't calculate the ngrams? Compare one by one? That is pretty simple single query. – paparazzo – 2015-11-17T05:49:02.387

The ngrams for the documents are all calculated as well as the query. The issue is that now I want to see which of the document in the database is most similar to the query. Or rather, get the top 10 documents most similar to the query, based on their ngrams. – okebz – 2015-11-17T17:04:49.543

The stated question is the ngram have been extracted. Now are you saying they are calculated. Query document to me is document. The answer is only as good as the question. What database? The syntax varies. – paparazzo – 2015-11-17T17:10:43.203

I meant extracted. Either way, the ngrams are obtained from each document so I am just working with the ngrams itself. You have a database of ngrams that represent each document in the database. You have a query document where you want to find lets say the top 10 document that appears most similar to your query. By database, lets just say that there is a huge list of the ngram model that represents the document – okebz – 2015-11-17T17:15:53.260

And you have an answer from me. A database with not stated table design does not really narrow it down. – paparazzo – 2015-11-17T17:18:06.163

3

You could use a hashing vectorizer on your documents. The result will be a list of vectors. Then vectorize your ngrams in the same way and calculate the projection of this new vector on the old ones. This is equivalent to the database join on an index, but may have less overhead.
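A minimal sketch of that idea, using a pure-Python stand-in for a hashing vectorizer (the documents, tokenization, and dimensionality below are all illustrative; in practice something like scikit-learn's `HashingVectorizer` does this at scale):

```python
import math

def hash_vectorize(tokens, dim=2**16):
    # The "hashing trick": map each token (or ngram) to a bucket by hashing.
    vec = [0.0] * dim
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec

def cosine(a, b):
    # Projection-style similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the cat sat on the mat", "dogs chase cats", "the mat was red"]
query = "the cat on the mat"

doc_vecs = [hash_vectorize(d.split()) for d in docs]
query_vec = hash_vectorize(query.split())
scores = [cosine(query_vec, v) for v in doc_vecs]
best = max(range(len(docs)), key=scores.__getitem__)  # index of the closest doc
```

The fixed-width vectors mean you never need a shared vocabulary up front, which is the main appeal over a plain index join.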

1

From your clarification -

> By database, lets just say that there is a huge list of the ngram model that represents the document

You would do well to do something a bit more structured and put the data into a relational database. This would allow you to do much more detailed analysis more easily and quickly.

I guess when you say "ngram" you mean "1gram". You could extend the analysis to include 2grams, 3grams etc, if you wanted.

I would have a table structure that looks something like this -

1Grams
ID
Value

Docs
ID
DocTitle
DocAuthor
etc.

Docs1Grams
1GramID
DocID
1GramCount

So, in the Docs1Grams record where 1GramID points to the 1gram "the" and DocID points to the document "War and Peace", 1GramCount will hold the number of times the 1gram "the" appears in War and Peace.

If the DocID for "War and Peace" is 1 and the DocID for "Lord of the Rings" is 2, then to calculate the 1gram similarity score for these two documents you would use this query -

Select count(*)
from Docs1Grams D1, Docs1Grams D2
where D1.DocID = 1
  and D2.DocID = 2
  and D1."1GramID" = D2."1GramID"
  and D1."1GramCount" > 0
  and D2."1GramCount" > 0


By generalizing and expanding the query this could be easily changed to automatically pick the highest such score / count comparing your chosen document with all the others.

By modifying / expanding the D1.1GramCount > 0 and D2.1GramCount > 0 part of the query you could easily make the comparison more sophisticated by, for instance, adding 2Grams, 3Grams, etc. or modifying the simple match to score according to the percentage match per ngram.

So if 0.0009% of your subject document's 1grams are "the", while document 1 has 0.001% and document 2 has 0.0015%, then document 1 would score higher on "the" because the absolute difference (or whatever other measure you choose to use) is smaller.
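If it helps to see the join's logic outside SQL, the same shared-1gram count can be sketched in Python with `Counter` objects (the document names and counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical 1gram counts per document (numbers are made up).
war_and_peace = Counter({"the": 34500, "and": 22000, "pierre": 1900})
lord_of_the_rings = Counter({"the": 15000, "and": 9000, "frodo": 1200})

# Equivalent of the SQL count(*): 1grams with a positive count in both docs.
shared = sum(1 for g in war_and_peace if lord_of_the_rings[g] > 0)
# "the" and "and" are shared, so shared == 2
```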

1

If you want to check for the presence of your ngrams in the documents, you will need to convert the query document into ngrams as well. To accomplish this you can use the TfidfVectorizer.

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2), binary=True, max_features=10000)
TFIDF = vect.fit_transform(df_lvl0['processed_cv_data'])


To explain the above code:

- word_tokenize: converts the text into string tokens, but you will have to clean the data accordingly.
- ngram_range: sets the size of the ngrams you need. In this case it will take both 1-word and 2-word grams.
- max_features: limits the total number of features. I recommend using it if you want to tokenize more than a couple of documents, as the number of features is otherwise very high (on the order of 10^6).

Now after you fit the model the features are stored in "vect". You can view them using:

keywords = set(vect.get_feature_names())


Now that you have the grams of your query document stored in a set, you can perform set operations, which are much faster than loops. Even if each set has on the order of 10^5 elements, you will get results in seconds. I highly recommend trying sets in Python.

matching_keys = keywords.intersection(all_grams)   # all_grams is the set of your collected grams


Finally you will get all the matching keywords into "matching_keys".
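To go from matching keywords to a top-10 list, the same set machinery can rank every document by overlap. A sketch, where `doc_gram_sets` and all its contents are illustrative placeholders for your per-document gram sets:

```python
# Illustrative per-document gram sets; in practice these would come from
# vectorizing each stored document the same way as the query.
doc_gram_sets = {
    "doc_a": {"machine learning", "learning is", "is fun"},
    "doc_b": {"machine learning", "deep learning"},
    "doc_c": {"is fun", "fun stuff"},
}
keywords = {"machine learning", "learning is", "deep nets"}

# Sort documents by how many grams they share with the query set.
top10 = sorted(doc_gram_sets,
               key=lambda d: len(keywords & doc_gram_sets[d]),
               reverse=True)[:10]
# doc_a shares 2 grams, doc_b shares 1, doc_c shares 0
```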

1

The data structure typically used here is an inverted index (e.g., in databases and search engines).

Please note that matching all ngrams is a good heuristic, but you might want to improve it.

Taking into account the probability of each term, and stemming, are directions you might benefit from.
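A toy inverted index in Python, assuming the ngrams have already been extracted (all data below is illustrative):

```python
from collections import defaultdict, Counter

# ngrams already extracted per document (illustrative).
doc_ngrams = {
    1: {"the cat", "cat sat", "sat on"},
    2: {"the dog", "dog ran"},
    3: {"the cat", "cat ran"},
}

# Inverted index: ngram -> set of docIDs containing it.
index = defaultdict(set)
for doc_id, grams in doc_ngrams.items():
    for g in grams:
        index[g].add(doc_id)

# At query time only documents sharing at least one ngram are touched,
# instead of scanning all N documents.
query = {"the cat", "cat ran", "ran far"}
scores = Counter()
for g in query:
    for doc_id in index.get(g, ()):
        scores[doc_id] += 1

top = scores.most_common(10)  # [(3, 2), (1, 1)]
```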

1

table

ngram
docID

PK (primary key): (ngram, docID)

Depending on the database the syntax may change a bit, but this is TSQL.
x is the docID of the document you are matching.

select top(1) with ties *
from
(  select tm.docID, count(*) as cnt
   from table td
   join table tm
     on tm.docID <> td.docID
    and tm.ngram = td.ngram
    and td.docID = x
   group by tm.docID
) tt
order by cnt desc


The join is on an index (PK) so this is very fast. I do this on a million documents in just a few seconds (with more advanced conditions).

This is going to favor larger documents but that is what you asked for.

Question seems to be changing

declare @query table (ngram varchar(100));
insert into @query values ('ng1'), ('ng2'), ('ng3');
select top(10) with ties *
from
(  select td.docID, count(*) as cnt
   from table td
   join @query q
     on td.ngram = q.ngram
   group by td.docID
) tt
order by cnt desc