I'm working on an application which requires creating a very large database of n-grams that exist in a large text corpus.
I need three operations to be efficient: lookup and insertion keyed by the n-gram itself, and querying for all n-grams that contain a given sub-n-gram.
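To make the three operations concrete, here is a minimal in-memory sketch in Python. It assumes that an n-gram is a tuple of tokens, that a "sub-n-gram" is a contiguous run of tokens, and that each n-gram stores an occurrence count. The class and method names are hypothetical. It only illustrates the access patterns I need, not how they would be implemented at scale.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Assumption: an n-gram is a tuple of tokens, e.g. ("the", "quick", "brown").
NGram = Tuple[str, ...]


class NGramStore:
    """Hypothetical in-memory model of the three required operations."""

    def __init__(self) -> None:
        # Primary index: n-gram -> occurrence count (the count is an assumption).
        self.counts: Dict[NGram, int] = {}
        # Secondary index: contiguous sub-n-gram -> n-grams that contain it.
        self.containing: Dict[NGram, Set[NGram]] = defaultdict(set)

    def insert(self, ngram: NGram, count: int = 1) -> None:
        """Insert an n-gram, or add to its count if it is already stored."""
        if ngram not in self.counts:
            n = len(ngram)
            # Register every contiguous slice (including the n-gram itself).
            for start in range(n):
                for end in range(start + 1, n + 1):
                    self.containing[ngram[start:end]].add(ngram)
        self.counts[ngram] = self.counts.get(ngram, 0) + count

    def lookup(self, ngram: NGram) -> int:
        """Return the stored count, or 0 if the n-gram is unknown."""
        return self.counts.get(ngram, 0)

    def containing_ngrams(self, sub: NGram) -> Set[NGram]:
        """Return all stored n-grams that contain `sub` as a contiguous run."""
        return set(self.containing.get(sub, ()))


store = NGramStore()
store.insert(("the", "quick", "brown"))
store.insert(("quick", "brown", "fox"))
store.insert(("the", "quick", "brown"))

exact = store.lookup(("the", "quick", "brown"))
matches = store.containing_ngrams(("quick", "brown"))
```

The secondary index is what makes the containment query cheap here, at the cost of roughly n(n+1)/2 extra entries per inserted n-gram. That trade-off is the part I am unsure how to reproduce in a real database.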
This sounds to me like the database should be a gigantic document tree, and that a document database such as MongoDB should be able to do the job well, but I've never used one at scale.
Knowing the Stack Exchange question format, I'd like to clarify that I'm not asking for suggestions on specific technologies, but rather for the type of database I should be looking at to implement something like this at scale.