clembench: Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents. In Proceedings of EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.689

There are currently two main paradigms for evaluating LLMs: reference-based evaluation looks at performance on well-defined single-shot tasks such as question answering or summarisation, while preference-based evaluation asks users to interact with two such models in parallel (each interfaced as a potentially multi-turn chatbot) and to judge which one “performs better”.

We propose a complementary way of evaluating LLMs which combines the control (and reproducibility that comes from automation) of reference-based evaluation with the interactivity that chatbot-style preference-based evaluation probes. This is achieved through game play in well-defined conversational / dialogue games. We have implemented a set of games (such as Wordle or Taboo, or games where one player must formulate descriptions of what to do to another player) which current models can play in self-play. These games come with metrics that measure the quality of the game play. Across the set of games, we can then calculate an overall score per model (what we call the clemscore) which serves as an indicator of how well the model can follow fine-grained instructions and how well it can simulate goal-directed conversational behaviour.
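To make the aggregation concrete, here is a minimal sketch of how such an overall score could be computed from per-game results. The function name, input format and example numbers are illustrative assumptions made for this sketch; see the paper for the official definition of the clemscore and the code repo for the actual implementation.

```python
# Illustrative sketch of a clemscore-style aggregate.
# Assumption: each game reports (a) the percentage of episodes played to
# completion (i.e. without rule violations / aborts) and (b) a 0-100 quality
# score averaged over the played episodes. All names and numbers are made up.

def clemscore(results: dict[str, tuple[float, float]]) -> float:
    """results maps game name -> (percent_played in [0, 100], quality in [0, 100])."""
    per_game = [
        (percent_played / 100.0) * quality
        for percent_played, quality in results.values()
    ]
    return sum(per_game) / len(per_game)  # macro-average over games

example = {
    "taboo": (90.0, 55.0),
    "wordle": (100.0, 30.0),
    "referencegame": (80.0, 70.0),
}
print(f"clemscore: {clemscore(example):.2f}")
```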

The framework for implementing such games and for running the overall benchmark is described in the code repo. The games that make up the current version of the clembench are described in the paper. Here you can find the results of running this collection of games against the list of models shown below.
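As an illustration only, a two-player dialogue game in such a framework might be structured roughly as follows. The class and method names here are hypothetical and do not reproduce the clembench API; the real interfaces live in the repository.

```python
# Hypothetical sketch of a self-play dialogue game driven by a game master.
# A Player is anything that maps the dialogue so far to the next utterance
# (in practice, a call to an LLM backend).

from typing import Protocol


class Player(Protocol):
    def __call__(self, messages: list[dict]) -> str:
        """Given the dialogue so far, return the next utterance."""


class TabooLikeGame:
    def __init__(self, describer: Player, guesser: Player, target: str, max_turns: int = 3):
        self.describer, self.guesser = describer, guesser
        self.target, self.max_turns = target, max_turns

    def play(self) -> dict:
        history: list[dict] = [
            {"role": "user", "content": f"Describe '{self.target}' without saying it."}
        ]
        for turn in range(self.max_turns):
            clue = self.describer(history)
            if self.target.lower() in clue.lower():      # rule check: taboo violation
                return {"success": False, "aborted": True, "turns": turn + 1}
            guess = self.guesser([{"role": "user", "content": f"CLUE: {clue}"}])
            if guess.strip().lower() == self.target.lower():
                return {"success": True, "aborted": False, "turns": turn + 1}
            history.append({"role": "user", "content": f"The guess '{guess}' was wrong."})
        return {"success": False, "aborted": False, "turns": self.max_turns}
```

Episode records of this kind (success, aborted, number of turns) are the sort of raw material from which the per-game quality metrics and the overall score can then be computed.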

We plan to continuously update the leaderboard with new models; and the benchmark with additional games. (Last update: 2023-11-20; models added.)

Interaction Settings in the Current Version

| interaction setting | description | anchoring process | representational domain |
|---|---|---|---|
| taboo | A game where one player tries to get the other player to guess a word without using certain 'taboo' words in their clues. | incremental learning/processing | language model/world model |
| wordle | A game where the guesser has to identify a hidden target word, receiving feedback after each guess on which letters are correct and correctly placed. | incremental learning/processing | language model/world model |
| wordle_withclue | A variant of Wordle where the guesser is given a clue to help them guess the target word. | incremental learning/processing | language model/world model |
| wordle_withcritic | A variant of Wordle where the guesser's suggestions are evaluated by a third player, who provides feedback before the guess is played. | incremental learning/processing | language model/world model |
| imagegame | A game where one player gives instructions for drawing an image and the other player tries to reproduce it by following them. | multimodal grounding | situation model |
| referencegame | A game where one player describes an object and the other player tries to identify it based on the description. | multimodal grounding | situation model |
| privateshared | A game where one player holds private information and answers another player's questions, while probes test whether it keeps track of which information has already been shared. | conversational grounding | discourse model |