Does learning content from additional encyclopedias consume much less storage?


Complex AI systems that learn lexical-semantic content and its meaning (such as collections of words, their structure, and their dependencies), such as Watson, take terabytes of disk space.

Let's assume a DeepQA-like AI consumed the whole of Wikipedia (about 10 GB) and produced roughly the same amount of structured and unstructured stored content.

Will learning another 10 GB from a different encyclopedia (different topics in the same language) take the same amount of storage? Or will the AI reuse the existing structure and take less than half (perhaps 1/10) of the additional space?


Posted 2016-08-04T18:31:33.580




It seems easy for this to be either sublinear or superlinear growth, depending on context.

If we imagine the space of the complex AI as split into two parts: the context model and the content model (that is, information and structure expected to be shared across entries vs. information and structure local to particular entries), then expanding the source material means we don't have much additional work to do on the context model. Whether the additional piece of the content model is larger or smaller depends on how connected the new material is to the old material.

That is, one of the reasons Watson takes many times the space of its source material is that it stores links between objects, which one would expect to grow roughly on the order of n squared. If there are many links between the old and new material, we should expect the model to roughly quadruple in size instead of double; if the old and new material are mostly unconnected and similar in topology, we expect the model to roughly double; and if the new material is mostly unconnected both to the old material and to itself, we expect the model not to grow by much.
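The quadruple/double/no-growth cases above can be sketched with a toy model. This is only an illustration under the assumption that link storage dominates; `link_count` and `cross_density` are hypothetical names, not anything from Watson itself.

```python
def link_count(n_old, n_new, cross_density):
    """Toy model: pairwise links within each corpus, plus a fraction
    `cross_density` (0.0 to 1.0) of all possible old-new pairs."""
    intra_old = n_old * (n_old - 1) // 2
    intra_new = n_new * (n_new - 1) // 2
    cross = int(cross_density * n_old * n_new)
    return intra_old + intra_new + cross

n = 1000
before = link_count(n, 0, 0.0)            # original corpus only
dense = link_count(n, n, 1.0)             # heavily cross-linked: ~4x before
sparse = link_count(n, n, 0.0)            # unconnected corpora: ~2x before
```

With `n = 1000`, `dense / before` is about 4 and `sparse / before` is about 2, matching the quadruple-vs-double argument.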

Matthew Graves



Would be great to add some extra references. – kenorb – 2016-08-17T01:29:27.233


I know it seems like a cop-out answer to every question on AI, but "it depends". For example, if the bulk of the storage space is storing learned concepts, and attributes of example entities, then it stands to reason that concepts and entities could be reused. In that scenario, learning from an additional 10G of text would use less storage than the original.
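Concept reuse in that scenario can be sketched as simple set difference: only concepts not already learned cost new storage. This is a hypothetical illustration; `new_concepts` is an invented helper, not a real system's API.

```python
def new_concepts(known, corpus):
    """Return only the concepts in `corpus` that are not already known,
    i.e. the ones that would cost additional storage."""
    return set(corpus) - set(known)

known = {"dog", "cat", "mammal", "animal"}
second_corpus = ["cat", "animal", "fish", "gill"]  # overlapping topics
added = new_concepts(known, second_corpus)         # only the genuinely new ones
```

The more the second encyclopedia overlaps the first in vocabulary and concepts, the smaller `added` is relative to the corpus size, which is the sublinear case.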

OTOH, as others have said, it could be that the storage is mostly storing the links between things, in which case the number of links will likely grow quadratically with the number of entities. In that case, the second batch of "knowledge" would add more storage requirements than the first.

So it would come down to "what exactly is the system learning, and how does it represent what it learned?" And that answer will vary from system to system.

