Generating a text training dataset from a grammar


I want to generate documents from a grammar in order to build a custom training dataset. What tools and techniques are available for generating random texts from a given grammar?

More specifically, I would like to build a collection of texts made of random 'sentences', each one being a randomly chosen derivation from the grammar. I would also like to draw the terminal symbols (the final words) at random. I have looked into NLTK (Python) and managed to generate some documents, but the derivations are not chosen randomly.
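
To make the goal concrete, here is a minimal sketch of random derivation, written in plain Python rather than with NLTK so that it stays self-contained. The toy grammar and the names `GRAMMAR` and `random_sentence` are made up for illustration. Note that picking a production uniformly at each step is not the same as sampling uniformly over whole derivations.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a tuple of symbols. Any symbol that is not a key
# of the dict is treated as a terminal (a final word).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP"), ("V",)],
    "Det": [("the",), ("a",)],
    "N": [("dog",), ("cat",)],
    "V": [("sees",), ("sleeps",)],
}


def random_sentence(grammar, symbol="S", rng=random):
    """Expand `symbol` by choosing one production at random at every step."""
    if symbol not in grammar:
        return [symbol]  # terminal: emit the word itself
    production = rng.choice(grammar[symbol])  # uniform over local alternatives
    words = []
    for sym in production:
        words.extend(random_sentence(grammar, sym, rng))
    return words


if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(random_sentence(GRAMMAR)))
```

This terminates only because the toy grammar has no recursive rules; a recursive grammar would need a depth limit or similar safeguard.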


Posted 2016-06-12T14:34:28.933

Can you define grammar in this context? – Jan van der Vegt – 2016-06-13T11:58:08.173

Sure. I meant a formal grammar, and more specifically a context-free grammar. For example, with this you can generate sentences from a grammar, but they are produced in a graph-traversal order. I would like to generate them randomly (with a uniform distribution over all possible derivations, if possible).

– mic – 2016-06-13T12:03:13.090

Uniform distribution is a problem if the grammar defines a potentially infinite output. Often generators will limit repeated segments for this reason. – Neil Slater – 2016-06-13T18:54:28.937

Indeed. Let's say that my grammar generates a finite number of possible derivations (including the terminals, which form a finite set in my case). – mic – 2016-06-14T08:03:51.977
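
Under the finiteness assumption stated in the comments, one possible way to get a uniform distribution over derivations is to count how many derivations each symbol has and then choose each production with probability proportional to that count. The sketch below is an illustration only, not an NLTK feature: the toy grammar and the names `make_counter` and `sample_uniform` are hypothetical, and it assumes the grammar has no recursive rules (otherwise the counting would not terminate).

```python
import random
from functools import lru_cache
from math import prod

# Hypothetical toy grammar, non-recursive, so the set of derivations is finite.
# Symbols that are not keys of the dict are terminals.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP"), ("V",)],
    "Det": [("the",), ("a",)],
    "N": [("dog",), ("cat",)],
    "V": [("sees",), ("sleeps",)],
}


def make_counter(grammar):
    """Return a memoized function giving the number of derivations per symbol."""

    @lru_cache(maxsize=None)
    def count(symbol):
        if symbol not in grammar:
            return 1  # a terminal has exactly one derivation
        # Sum over alternatives of the product of the counts of their symbols.
        return sum(prod(count(s) for s in rhs) for rhs in grammar[symbol])

    return count


def sample_uniform(grammar, count, symbol="S", rng=random):
    """Draw one derivation so that every derivation of `symbol` is equally likely."""
    if symbol not in grammar:
        return [symbol]
    alternatives = grammar[symbol]
    # Weight each alternative by the number of derivations it leads to.
    weights = [prod(count(s) for s in rhs) for rhs in alternatives]
    rhs = rng.choices(alternatives, weights=weights, k=1)[0]
    words = []
    for s in rhs:
        words.extend(sample_uniform(grammar, count, s, rng))
    return words


if __name__ == "__main__":
    count = make_counter(GRAMMAR)
    print(count("S"), "derivations in total")
    print(" ".join(sample_uniform(GRAMMAR, count)))
```

The distribution is uniform over derivations (parse trees); if the grammar is ambiguous, a string with several derivations will be produced proportionally more often.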

No answers