From your clarification:

"By database, let's just say that there is a huge list of the ngram model that represents the document."
You would do well to do something a bit more structured and put the data into a relational database. That would let you do much more detailed analysis, more easily and quickly.
I guess that when you say "ngram" you mean "1gram". You could extend the analysis to include 2grams, 3grams and so on, if you wanted.
I would have a table structure that looks something like this:
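The table layout itself didn't survive here, but from the column names used below, a minimal sketch of what it might look like, using SQLite for illustration (the `Title` and `Text` columns are assumptions; SQLite requires identifiers that start with a digit, such as `1GramID`, to be quoted):

```python
import sqlite3

# A sketch of the suggested schema: a Docs table, a 1Grams table, and a
# Docs1Grams junction table holding per-document counts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Docs (
    DocID   INTEGER PRIMARY KEY,
    Title   TEXT                        -- e.g. 'War and Peace' (assumed column)
);
CREATE TABLE "1Grams" (
    "1GramID" INTEGER PRIMARY KEY,
    Text      TEXT                      -- e.g. 'the' (assumed column)
);
CREATE TABLE Docs1Grams (
    DocID        INTEGER REFERENCES Docs(DocID),
    "1GramID"    INTEGER REFERENCES "1Grams"("1GramID"),
    "1GramCount" INTEGER,               -- times this 1gram appears in this doc
    PRIMARY KEY (DocID, "1GramID")
);
""")
```

A row per (document, 1gram) pair keeps the counts sparse: 1grams that never occur in a document simply have no row.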
So, in a Docs1Grams record where 1GramID points to the 1gram "the" and DocID points to the document "War and Peace", 1GramCount holds the number of times the 1gram "the" appears in War and Peace.
If the DocID for "War and Peace" is 1 and the DocID for "Lord of the Rings" is 2, then to calculate the 1gram similarity score for these two documents you would use this query:
SELECT COUNT(*)
FROM Docs1Grams D1, Docs1Grams D2
WHERE D1.DocID = 1
  AND D2.DocID = 2
  AND D1.1GramID = D2.1GramID
  AND D1.1GramCount > 0
  AND D2.1GramCount > 0
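As a runnable illustration, here is that query against a toy dataset using Python's sqlite3 (the counts are invented, and the digit-leading identifiers are quoted because SQLite requires that):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Docs1Grams (DocID INTEGER, "1GramID" INTEGER, "1GramCount" INTEGER);
-- Doc 1 and doc 2 share the 1grams with IDs 10 and 11; IDs 12 and 13 appear
-- in only one document each.
INSERT INTO Docs1Grams VALUES (1, 10, 34700), (1, 11, 22000), (1, 12, 5);
INSERT INTO Docs1Grams VALUES (2, 10, 15000), (2, 11, 9000),  (2, 13, 7);
""")

# Count the 1grams that appear in both documents.
(shared,) = conn.execute("""
    SELECT COUNT(*)
    FROM Docs1Grams D1, Docs1Grams D2
    WHERE D1.DocID = 1
      AND D2.DocID = 2
      AND D1."1GramID" = D2."1GramID"
      AND D1."1GramCount" > 0
      AND D2."1GramCount" > 0
""").fetchone()
print(shared)  # → 2
```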
By generalizing and expanding the query, this could easily be changed to automatically pick the highest such score/count when comparing your chosen document with all the others.
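One way to sketch that generalization, again with invented data in SQLite, is to group the join by the other document's ID and order by the count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Docs1Grams (DocID INTEGER, "1GramID" INTEGER, "1GramCount" INTEGER);
INSERT INTO Docs1Grams VALUES
    (1, 10, 5), (1, 11, 3), (1, 12, 1),   -- subject document
    (2, 10, 7), (2, 11, 2),               -- shares two 1grams with doc 1
    (3, 10, 4), (3, 11, 6), (3, 12, 9);   -- shares three 1grams with doc 1
""")

# Score every other document by the number of 1grams it shares with the
# subject document (DocID = 1), then take the best match.
best = conn.execute("""
    SELECT D2.DocID, COUNT(*) AS score
    FROM Docs1Grams D1, Docs1Grams D2
    WHERE D1.DocID = 1
      AND D2.DocID <> 1
      AND D1."1GramID" = D2."1GramID"
      AND D1."1GramCount" > 0
      AND D2."1GramCount" > 0
    GROUP BY D2.DocID
    ORDER BY score DESC
""").fetchone()
print(best)  # → (3, 3): document 3 shares the most 1grams
```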
By modifying or expanding the D1.1GramCount > 0 and D2.1GramCount > 0 part of the query, you could easily make the comparison more sophisticated, for instance by adding 2grams, 3grams, etc., or by replacing the simple match with a score based on the percentage match per ngram.
So if 0.0009% of your subject document's 1grams are "the", while the figure is 0.001% for document 1 and 0.0015% for document 2, then document 1 would score higher on "the" because the absolute value of the difference (or whatever other measure you chose to use) is smaller.
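The arithmetic in that example can be checked directly (the frequencies are the hypothetical percentages above, written as fractions):

```python
# Hypothetical per-document frequencies of the 1gram "the".
subject = 0.000009   # 0.0009%
doc1    = 0.00001    # 0.001%
doc2    = 0.000015   # 0.0015%

# Score by the absolute difference in frequency: smaller means closer.
score1 = abs(subject - doc1)
score2 = abs(subject - doc2)
print(score1 < score2)  # → True: document 1 is the closer match on "the"
```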