clembench: Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents
Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents. In Proceedings of EMNLP 2023. DOI: 10.18653/v1/2023.emnlp-main.689
Hakimov, S., Abdullayeva, Y., Koshti, K., Schmidt, A., Weiser, Y., Beyer, A., & Schlangen, D. (2025). Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models. In Proceedings of COLING 2025. arXiv preprint arXiv:2406.14035
Jordan, J., Hakimov, S., & Schlangen, D. (2025). Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment. In: Text, Speech, and Dialogue. TSD 2025. Lecture Notes in Computer Science, vol 16030. Springer, Cham. DOI: 10.1007/978-3-032-02551-7_6
Hakimov, S., Pfennigschmidt, L., & Schlangen, D. (2025). Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 728–740, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Beyer, A., Chalamalasetti, K., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2024). clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents. arXiv preprint arXiv:2405.20859
Bhavsar, N., Jordan, J., Hakimov, S., & Schlangen, D. (2024). How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics. arXiv preprint arXiv:2406.14051
There are currently two main paradigms for evaluating LLMs: reference-based evaluation looks at performance on well-defined single-shot tasks like question answering or summarisation, while preference-based evaluation asks users to interact with two such models (each interfaced as a potentially multi-turn chatbot) in parallel and to judge which one “performs better”.
We propose a complementary way of evaluating LLMs which combines the control (and the reproducibility that comes from automation) that reference-based evaluation offers with the interactivity that is probed in chatbot-type preference-based evaluation. This is achieved through game play in well-defined conversational / dialogue games. We have implemented a set of games (such as Wordle or Taboo, or games where one player must describe to another player what to do) which current models can play in self-play. These games come with metrics that measure the quality of the game play. Across the set of games, we can then calculate an overall score per model (what we call the clemscore), which serves as an indicator of how well the model can follow fine-grained instructions and how well it can simulate goal-directed conversational behaviour.
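To make the aggregation concrete, below is a minimal sketch of how a clemscore-like aggregate could be computed from per-game results. The function name, the input format, and the exact way of combining a game's "percentage played" with its "quality score" are illustrative assumptions; the authoritative definition is the one given in the paper and implemented in the framework.

```python
# Illustrative sketch only (not the official implementation): computing a
# clemscore-like aggregate from per-game results. We assume each game reports
# the percentage of episodes played to completion without rule violations
# ("% played") and an average quality score (0-100) over completed episodes.

def clemscore_sketch(game_results: dict[str, tuple[float, float]]) -> float:
    """game_results maps game name -> (percent_played, quality_score)."""
    percent_played = [p for p, _ in game_results.values()]
    quality = [q for _, q in game_results.values()]
    # Macro-average both measures across games, then combine them, so that a
    # model only gets credit for quality on games it actually manages to play.
    avg_played = sum(percent_played) / len(percent_played)
    avg_quality = sum(quality) / len(quality)
    return (avg_played / 100.0) * avg_quality

# Example: a model that reliably follows the game rules but rarely wins.
print(clemscore_sketch({
    "taboo": (95.0, 40.0),
    "wordle": (100.0, 20.0),
    "referencegame": (90.0, 75.0),
}))
```

The point of combining the two measures is that, under this reading, a model cannot score well just by producing well-formatted but low-quality play, nor by playing well only on the few episodes it does not abort.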
The framework for implementing such games and for running the overall benchmark is described in the code repo. The implemented games that form the current version of the clembench are described in the paper. Here, you can find the results of running this collection of games against the list of models shown below.
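As a rough illustration of what the framework automates, the sketch below shows a schematic self-play loop: a programmatic game master relays messages between two model "players", aborts an episode when a response violates the required move format, and records the transcript for later scoring. All names here (play_episode, move_prefix, the chat-message format) are illustrative assumptions, not the framework's actual API; in the real framework the game master also parses moves and may relay only parts of them. See the code repo for the actual implementation.

```python
# Schematic sketch (illustrative names, not the framework's API) of one
# self-play episode: a game master relays messages between two model players,
# aborts on format violations, and collects a transcript for scoring.
from typing import Callable, List, Tuple

Player = Callable[[List[dict]], str]  # takes a chat history, returns a reply


def play_episode(player_a: Player, player_b: Player,
                 prompt_a: str, prompt_b: str,
                 move_prefix: str = "CLUE:", max_turns: int = 10
                 ) -> Tuple[bool, List[str]]:
    """Returns (aborted, transcript) for one self-play episode."""
    history_a = [{"role": "user", "content": prompt_a}]
    history_b = [{"role": "user", "content": prompt_b}]
    transcript: List[str] = []
    for _ in range(max_turns):
        for own_history, other_history, player in (
                (history_a, history_b, player_a),
                (history_b, history_a, player_b)):
            reply = player(own_history)
            if not reply.startswith(move_prefix):  # fine-grained format rule
                return True, transcript            # abort the episode
            transcript.append(reply)
            own_history.append({"role": "assistant", "content": reply})
            # Relay the move to the other player (the real game master may
            # parse it and pass on only parts of it).
            other_history.append({"role": "user", "content": reply})
    return False, transcript


# Toy usage with dummy players that always produce well-formed moves:
echo: Player = lambda hist: "CLUE: " + hist[-1]["content"][:20]
aborted, log = play_episode(echo, echo, "Describe the target word.", "Guess the word.")
```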
We are continuously updating the leaderboard with new models and, less often, the benchmark with additional games. Check back often to see the latest version.
Interaction Settings in the Current Version
Text Based

Interaction Setting | Description | Anchoring Process | Representational Domain |
---|---|---|---|
taboo | A game where one player tries to get the other player to guess a word without using certain 'taboo' words in their clues. | incremental learning/processing | language model/world model |
wordle | A game where the player tries to guess a hidden five-letter target word and receives feedback on the letters after each guess. | incremental learning/processing | language model/world model |
wordle_withclue | A variant of Wordle where the guesser is given a clue to help them guess the target word. | incremental learning/processing | language model/world model |
wordle_withcritic | A variant of Wordle where the guesser's suggestions are evaluated by a third player, who provides feedback on their accuracy. | incremental learning/processing | language model/world model |
private-shared | A two-player game where one player acts as an agent for booking some service and the other player gives their preferences. | incremental learning/processing | language model/world model |
imagegame | A game where one player gives instructions for drawing a target image and the other player tries to draw it based on those instructions. | multimodal grounding | situation model |
referencegame | A game where one player describes an object and the other player tries to identify it based on the description. | multimodal grounding | situation model |
text mapworld | A single-player exploration game where the player navigates a map based on given directions and room descriptions. | multimodal/conversational grounding | situation model |
text mapworld-graph-reasoning | A single-player exploration game where the player navigates a map based on given directions and at every turn is asked to produce a graph of the map explored up to that point. | multimodal/conversational grounding | situation model |
text mapworld-specific-room | A single-player exploration game where the player navigates a map and is expected to find the specific room that fits a given description. | multimodal/conversational grounding | situation model |
codenames | A two-player game where one player gives a clue targeting several words and the other player tries to guess those words from the clue. | incremental learning/processing | language model/world model |
adventuregame | A single-player exploration game where the player is given a task in a typical home environment and is expected to perform it by navigating the rooms and interacting with objects. | incremental learning/processing | language model/world model |
matchit-ascii | A game where players identify whether two images (in ASCII format) are the same or different through dialogue. | multimodal/conversational grounding | situation/agent model |
guess-what | A two-player word-guessing game where one player picks a word out of eight options and the other player tries to identify it by asking Yes/No questions. | incremental learning/processing | language model/world model |

Multimodal

Interaction Setting | Description | Anchoring Process | Representational Domain |
---|---|---|---|
matchit | A game where players identify whether two images are the same or different through dialogue. | multimodal/conversational grounding | situation/agent model |
multimodal reference game | A game where one player describes an object and the other player tries to identify it based on the description using multiple modalities. | multimodal/conversational grounding | situation model |
multimodal mapworld | A single-player exploration game in which the player navigates a map using multimodal inputs. | multimodal/conversational grounding | situation model |
multimodal mapworld-specific-room | A single-player exploration game where the player navigates a map and is expected to find the specific room that fits a given description. | multimodal/conversational grounding | situation model |
multimodal mapworld-graph-reasoning | A single-player exploration game where the player navigates a map based on given directions and at every turn is asked to produce a graph of the map explored up to that point. | multimodal/conversational grounding | situation model |