## Has the Lovelace Test 2.0 been successfully used in an academic setting?

In October 2014, Dr. Mark Riedl published an approach to testing AI intelligence called the "Lovelace Test 2.0", inspired by the original Lovelace Test (published in 2001). Riedl believed that the original Lovelace Test would be impossible to pass, and therefore proposed a weaker, more practical version.

The Lovelace Test 2.0 assumes that for an AI to be intelligent, it must exhibit creativity. From the paper itself:

The Lovelace 2.0 Test is as follows: artificial agent $a$ is challenged as follows:

• $a$ must create an artifact $o$ of type $t$;

• $o$ must conform to a set of constraints $C$ where $c_i \in C$ is any criterion expressible in natural language;

• a human evaluator $h$, having chosen $t$ and $C$, is satisfied that $o$ is a valid instance of $t$ and meets $C$; and

• a human referee $r$ determines the combination of $t$ and $C$ to not be unrealistic for an average human.

Since a human evaluator could choose constraints that are easy for an AI to satisfy, the evaluator is expected to keep proposing more and more complex constraints until the AI fails. The point of the Lovelace Test 2.0 is to compare the creativity of different AIs, not to draw a definite dividing line between 'intelligence' and 'non-intelligence' the way the Turing Test does.
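The escalating-challenge procedure described above can be sketched as a simple scoring loop. This is only an illustrative model, not anything from Riedl's paper: the names (`Challenge`, `lovelace_2_score`) and the callback interfaces standing in for the agent, evaluator, and referee are all assumptions made here for clarity.

```python
# Hypothetical sketch of the Lovelace Test 2.0 loop described above.
# All names and interfaces are illustrative, not from Riedl's paper.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Challenge:
    artifact_type: str        # t: the kind of artifact requested
    constraints: List[str]    # C: criteria expressible in natural language

def lovelace_2_score(
    create: Callable[[Challenge], object],                   # agent a produces artifact o
    evaluator_accepts: Callable[[Challenge, object], bool],  # human h judges o against t and C
    referee_realistic: Callable[[Challenge], bool],          # human r vets the (t, C) combination
    challenges: List[Challenge],
) -> int:
    """Return how many increasingly hard challenges the agent passes.

    Challenges should be ordered from easy to hard; the loop stops at the
    first failure, so the score can be compared across different agents.
    """
    score = 0
    for challenge in challenges:
        if not referee_realistic(challenge):
            continue  # skip combinations an average human could not meet
        artifact = create(challenge)
        if not evaluator_accepts(challenge, artifact):
            break     # agent failed; report the score accumulated so far
        score += 1
    return score
```

Note that the score is a relative measure, matching the point above: it ranks agents by how far they get, rather than declaring any single agent "intelligent".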

However, I am curious about whether this test has actually been used in an academic setting, or whether it is only a thought experiment at the moment. The Lovelace Test seems easy to apply in academic settings (you only need to develop some measurable constraints to test the artificial agent against), but it also may be too subjective (humans can disagree on the merits of certain constraints, and on whether a creative artifact produced by an AI actually satisfies them).

No.

TL;DR: The Lovelace Test 2.0 is very vague, making it ill-suited for evaluation of intelligence. It is also generally ignored by researchers of Computational Creativity, who already have their own tests to evaluate creativity.

Longer Answer: According to Google Scholar, there are 10 references to the "Lovelace Test 2.0" paper. All of those references exist merely to point out that the Lovelace Test 2.0 exists. In fact, at least two of the articles I consulted (A novel approach for identifying a human-like self-conscious behavior and FraMoTEC: A Framework for Modular Task-Environment Construction for Evaluating Adaptive Control Systems) proposed their own tests instead.

One of the authors of the FraMoTEC paper also wrote his thesis on FraMoTEC, and indirectly critiqued the Lovelace Test 2.0 and other similar tests:

The Piaget-MacGyver Room problem [Bringsjord and Licato, 2012], Lovelace Test 2.0 [Riedl, 2014] and Toy Box problem [Johnston, 2010] all come with the caveat of being defined very vaguely — these evaluation methods may be likely to come up with a reasonable evaluation for intelligence, but it is very difficult to compare two different agents (or controllers) that partake in their own domain-specific evaluations, which is what frequently happens when agents are tailored to pass specific evaluations.

Another major issue with the Lovelace Test 2.0 is the proliferation of other tests for "measuring" the creativity of AI. Evaluating Evaluation: Assessing Progress in Computational Creativity Research, published by Anna Jordanous in 2011 (3 years before the invention of the Lovelace Test 2.0), analyzed research papers about AI creativity and noted:

Of the 18 papers that applied creativity evaluation methodologies to evaluate their system’s creativity, no one methodology emerged as standard across the community. Colton’s creative tripod framework (Colton 2008) was used most often (6 uses), with 4 papers using Ritchie’s empirical criteria (Ritchie 2007).

That leaves 10 papers with miscellaneous creativity evaluation methods.

The goal of "Evaluating Evaluation" was to standardize the process of evaluating creativity, to avoid the possibility of the field stagnating due to the proliferation of so many creativity tests. Anna Jordanous remained interested in evaluating creativity tests, publishing articles such as "Stepping Back to Progress Forwards: Setting Standards for Meta-Evaluation of Computational Creativity" and "Four PPPPerspectives on Computational Creativity".

"Evaluating Evaluation" does provide some commentary to explain the proliferation of systems to evaluate creativity:

Evaluation standards are not easy to define. It is difficult to evaluate creativity and even more difficult to describe how we evaluate creativity, in human creativity as well as in computational creativity. In fact, even the very definition of creativity is problematic (Plucker, Beghetto, and Dow 2004). It is hard to identify what ’being creative’ entails, so there are no benchmarks or ground truths to measure against.

The fact that so many tests of creativity already exist (to the extent that Jordanous can make an academic career out of studying them) means that it's very difficult for any new test (such as the Lovelace Test 2.0) to even be noticed, much less cited. Why would you want to use something like the Lovelace Test 2.0 when there are so many other tests you could use instead?