AI Models Are Being Tested on Dungeons & Dragons to Measure Long-Term Decision-Making

Large Language Models, including systems similar to ChatGPT, are now being tested in an unexpected but surprisingly effective environment: Dungeons & Dragons. Researchers from the University of California, San Diego (UC San Diego) have been exploring how this classic tabletop role-playing game can be used to evaluate how well AI models handle long-term planning, rule adherence, teamwork, and sustained decision-making.

The idea may sound playful at first, but the motivation behind it is deeply practical. As AI systems are increasingly deployed as autonomous or semi-autonomous agents, they are expected to operate independently for extended periods of time. However, most existing benchmarks for evaluating Large Language Models (LLMs) still focus on short-term tasks, such as answering isolated questions or completing brief interactions. This creates a gap between how models are tested and how they are actually used in real-world applications.

Dungeons & Dragons, often shortened to D&D, offers a compelling solution to this problem. The game combines complex rules, multi-step decision-making, evolving environments, and social interaction over long sessions. According to the UC San Diego research team, this makes D&D an ideal testing ground for understanding how AI agents perform when they must reason consistently over time rather than respond to single prompts.

Why Dungeons & Dragons Works as an AI Benchmark

D&D is not just a game of imagination. It relies on a detailed rule system, structured turns, resource management, and clear consequences for actions. Players must keep track of health points, spells, abilities, positioning, and team coordination, often across many hours of gameplay. For AI models, this means they must demonstrate multi-step planning, maintain awareness of the game state, and comply with strict rules throughout an extended interaction.
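
The article does not reproduce the study's internal representations, but a minimal sketch of the kind of combat state an agent has to keep consistent, with every field name assumed purely for illustration, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Combatant:
    """One creature in an encounter (hypothetical schema, not the paper's)."""
    name: str
    hit_points: int
    position: tuple[int, int]                   # square on the battle grid
    spell_slots: dict[int, int] = field(default_factory=dict)  # level -> remaining
    conditions: set[str] = field(default_factory=set)          # e.g. "prone"

@dataclass
class CombatState:
    """Everything an agent must track correctly across many turns."""
    round_number: int
    turn_order: list[str]                       # initiative order, by name
    combatants: dict[str, Combatant]

    def apply_damage(self, target: str, amount: int) -> None:
        # A rule the model must respect on every turn: hit points never
        # drop below zero, and a creature at zero falls unconscious.
        c = self.combatants[target]
        c.hit_points = max(0, c.hit_points - amount)
        if c.hit_points == 0:
            c.conditions.add("unconscious")
```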

Another important aspect is communication. D&D unfolds largely through dialogue, allowing researchers to study how AI agents interact with both humans and other AI systems. This creates opportunities to evaluate not just raw decision-making, but also how well models collaborate, negotiate, and stay aligned with group objectives.

The study was led by Raj Ammanabrolu, a faculty member in the Department of Computer Science and Engineering at UC San Diego. His team emphasized that D&D naturally brings together several challenges that are otherwise difficult to test in combination, including long-term reasoning, teamwork, and rule compliance.

How the Experiments Were Set Up

To run these experiments, the researchers required AI models to simulate and actively play Dungeons & Dragons. To ensure accuracy and prevent the models from inventing illegal moves or ignoring rules, each LLM was paired with a game engine grounded in official D&D mechanics. This engine handled maps, combat rules, resources, and other constraints, acting as a guardrail to minimize hallucinations and enforce consistency.
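
The article does not describe the interface between the models and the engine, but the guardrail pattern it outlines can be sketched as a simple propose-and-validate loop. All names here (propose_action, legal_actions, apply, default_action) are assumptions for illustration, not the study's actual API:

```python
def run_turn(agent, engine, state, max_retries: int = 3):
    """One combat turn: the LLM proposes a move, the rules engine vets it.

    A minimal sketch of the guardrail idea; the agent and engine objects
    are hypothetical, not taken from the UC San Diego implementation.
    """
    legal = engine.legal_actions(state)            # engine enumerates valid moves
    for _ in range(max_retries):
        action = agent.propose_action(state, legal)
        if action in legal:                        # reject hallucinated or illegal moves
            return engine.apply(state, action)     # engine, not the LLM, updates state
    # If the model keeps proposing illegal moves, fall back to a safe default
    # so the game state never becomes inconsistent.
    return engine.apply(state, engine.default_action(state))
```

The key design choice in a setup like this is that state updates live inside the engine rather than in the model's own bookkeeping, which is what allows invented moves to be caught instead of silently accepted.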

While AI-driven Dungeon Masters already exist in various experimental and commercial settings, this study went further. The AI agents were assigned multiple roles, acting not only as Dungeon Masters but also as players and even monsters that fought against player characters. The focus of the simulations was specifically on combat scenarios, which provided structured, high-stakes situations requiring careful tactical decisions.

The researchers selected 27 different combat scenarios, all based on well-known D&D battle setups. These included encounters such as Goblin Ambush, Kennel in Cragmaw Hideout, and Klarg's Cave, which are familiar to many experienced D&D players. These scenarios were chosen because they offer varied tactical challenges while remaining grounded in established game design.

Human Players and AI-Versus-AI Matches

The AI models did not operate in isolation. They played against each other and also against more than 2,000 experienced Dungeons & Dragons players recruited by the research team. This mix of AI-versus-AI and AI-versus-human gameplay allowed researchers to observe how the models adapted to different opponents and levels of unpredictability.

By comparing performance across these matches, the researchers were able to evaluate not only strategic competence but also consistency and reliability. Long-term decision-making is not just about choosing the best move once; it is about making solid decisions repeatedly without losing track of goals, resources, or rules.

Which AI Models Performed Best

The study evaluated three Large Language Models using the same framework. Among them, Claude 3.5 Haiku emerged as the strongest overall performer. It was found to be the most reliable in following rules, tracking game state, and making appropriate decisions across extended gameplay.

GPT-4 followed closely behind, showing strong performance but slightly less consistency compared to Claude 3.5 Haiku. The third model, DeepSeek-V3, ranked lowest among the three, struggling more with coordination and long-term reasoning under the same conditions.

The researchers emphasized that these results are not meant to declare a definitive “winner,” but rather to demonstrate how different models behave when subjected to long-horizon evaluation tasks.

What the Researchers Looked For

The models were assessed across several criteria. One key factor was how well they could determine correct actions within the rules of the game. Another was their ability to track resources, such as health, abilities, and available actions, over time.
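
The article does not give the paper's exact scoring formulas. Purely as an illustration of how such criteria could be aggregated, a per-episode scorer might look like the following, with all record fields hypothetical:

```python
def score_episode(turns: list[dict]) -> dict[str, float]:
    """Turn per-turn checks into episode-level metrics (illustrative only).

    Each turn record is assumed to note whether the chosen action was legal
    and whether the agent's stated resources (HP, spell slots, remaining
    actions) matched the engine's ground truth -- hypothetical fields.
    """
    n = len(turns)
    legal = sum(t["action_was_legal"] for t in turns)
    tracked = sum(t["resources_matched_engine"] for t in turns)
    return {
        "rule_adherence": legal / n,     # fraction of turns with a legal action
        "state_tracking": tracked / n,   # fraction with correct bookkeeping
    }
```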

An additional and more unusual evaluation metric was whether the AI could stay in character. Since D&D involves role-playing, the researchers examined whether the models maintained consistent behavior and personality aligned with their assigned roles when interacting with other players.

This led to some unexpected and entertaining observations. Goblin characters, for example, began developing distinct personalities during combat, taunting opponents with colorful and sometimes nonsensical remarks. Paladins occasionally delivered heroic speeches at odd moments, even when stepping directly into danger. Warlocks, meanwhile, tended to become overly dramatic, reacting emotionally even in relatively mundane situations.

Rather than viewing this as a flaw, the researchers interpreted these behaviors as signs that the models were attempting to add texture and personality to gameplay, reflecting their training on narrative and conversational data.

Why Long-Term Evaluation Matters

The broader goal of this research is to address a major challenge in AI development: the lack of benchmarks that measure sustained autonomous performance. As LLMs are increasingly used in applications like digital assistants, game agents, negotiation tools, and business planning systems, their ability to function reliably over long periods becomes critical.

Short-term benchmarks cannot fully capture issues like gradual loss of coherence, compounding errors, or strategic drift. Dungeons & Dragons provides a structured yet flexible environment where these problems can surface naturally.

What Comes Next

The UC San Diego team plans to expand their work beyond combat encounters. Future experiments will involve full D&D campaigns, which include exploration, storytelling, and long-term character development. These elements introduce additional layers of complexity, such as memory, narrative consistency, and evolving goals.

Beyond gaming, the researchers believe the same framework could be adapted to test AI agents in other domains, including multi-party negotiations and strategic business planning environments. In these contexts, long-term reasoning, collaboration, and rule compliance are just as important as they are in a fantasy role-playing game.

By turning Dungeons & Dragons into a serious evaluation tool, this research highlights how creative approaches can help bridge the gap between AI capabilities in the lab and real-world demands.

Research paper: https://openreview.net/pdf?id=3Op7kJOvaD
