How can I evaluate the performance of a system that generates text?


I am preparing to perform research comparing the performance of two different systems that probabilistically generate the next word of an input sentence.

For example, given the word 'the', a system might output 'car', or any other word. Given the input 'the round yellow', a system might output 'sun', or it might output something that doesn't make sense.

My question is: how can I quantitatively evaluate the performance of the two systems on this task? I could test each system manually, judge how often its responses make sense, and compare the two, but I'd really like a meaningful quantitative method of evaluation, preferably one I can automate.

Precision and recall don't seem like they would work here, seeing as for each given input there are many potentially acceptable outputs. Any suggestions?

Christian Westbrook

Posted 2018-12-06T06:44:42.633

Reputation: 312



This is a tricky issue. I assume you are using transition probabilities to pick the next suitable word, so you could take some other corpus data, derive probabilities from it, and compare those to the probabilities your system uses. That is not very satisfactory, though: you might end up evaluating the system in a circular way, because the test data would be derived in the same way as the output you generate.
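As a minimal sketch of that corpus-based comparison, assuming the system can expose its next-word distribution for a given preceding word: the snippet below estimates bigram transition probabilities from a reference corpus and compares them to the system's distribution. The total variation distance is my own choice of comparison, and `reference_tokens` and `system_next_word` are made-up illustrative data, not anything from the question.

```python
from collections import Counter, defaultdict


def bigram_probabilities(tokens):
    """Estimate P(next word | previous word) from a list of tokens."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
        for prev, nexts in counts.items()
    }


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts.

    0.0 means identical distributions, 1.0 means no shared probability mass.
    """
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


# Hypothetical reference corpus, already tokenized.
reference_tokens = "the car stopped and the sun rose and the car left".split()
reference = bigram_probabilities(reference_tokens)

# Hypothetical distribution that the system assigns to words following "the".
system_next_word = {"car": 0.5, "sun": 0.25, "dog": 0.25}

distance = total_variation(reference["the"], system_next_word)
print(f"distance after 'the': {distance:.3f}")
```

A lower distance means the system is closer to the reference corpus, which is only meaningful if that corpus is independent of the data the system was built from.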

A better, if somewhat more laborious, evaluation would be to present the trigger sequence to human beings (on a fairly large scale) and then measure the overlap between your system's output and the human choices. Of course, with 'the' you will get a very large set of possible replies, but 'the round yellow' would be a lot more restricted (and it would make more sense to include that kind of sequence in an evaluation anyway).

The figure you get might be quite small (depending on the range of possible options), but that is to be expected, so a system could still be doing well even with only a small overlap. That's how I would approach this problem.
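The overlap measurement could be sketched roughly as follows. This is only one possible definition of "overlap": for each prompt, the share of human responses that appear among the system's top `k` predictions, averaged over prompts. The function name `human_overlap`, the parameter `k`, and all of the data are hypothetical.

```python
from collections import Counter


def human_overlap(system_predictions, human_responses, k=3):
    """Average share of human responses found in the system's top-k words.

    system_predictions: dict mapping prompt -> ranked list of predicted words
    human_responses:    dict mapping prompt -> list of words given by people
    """
    scores = []
    for prompt, responses in human_responses.items():
        ranked = system_predictions.get(prompt, [])
        top_k = {word.lower() for word in ranked[:k]}
        counts = Counter(word.lower() for word in responses)
        total = sum(counts.values())
        if total == 0:
            continue  # nobody answered this prompt, so skip it
        matched = sum(c for word, c in counts.items() if word in top_k)
        scores.append(matched / total)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical human answers collected for two trigger sequences.
human = {
    "the round yellow": ["sun", "sun", "ball", "moon", "sun"],
    "the": ["car", "dog", "end", "cat"],
}

# Hypothetical ranked predictions from the two systems being compared.
system_a = {"the round yellow": ["sun", "car", "idea"], "the": ["car", "of", "a"]}
system_b = {"the round yellow": ["table", "ball", "was"], "the": ["first", "same", "most"]}

score_a = human_overlap(system_a, human)
score_b = human_overlap(system_b, human)
print(f"system A: {score_a:.3f}, system B: {score_b:.3f}")
```

Because both systems are scored against the same human responses, the absolute values matter less than the difference between them, which fits the point that a small overlap is to be expected.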

Oliver Mason

Posted 2018-12-06T06:44:42.633

Reputation: 3 755