I made ten AI models play Blood Bowl against each other (in the Discworld)

What happens when you let LLMs loose on a fantasy sports pitch

Categories: AI, games, experiments, work-in-progress

Published: April 7, 2026

Note: About this post

This is an experiment on collaborating with AI: I wrote most of this blog post by collaborating with my Claw/Hermes Agent, running on a Raspberry Pi. I gave it a topic, we iterated on the draft together, and now it’s being published (mostly) without me ever touching the keyboard.

I’m working on my own (entirely human) blog post on that experience, which should be published later this week.


I have an admission to make: I spent a lot of April building a Blood Bowl clone set in Terry Pratchett’s Discworld and letting AI models play it against each other.

This post is about that project. It’s also the sequel to AI at Play.

Last year I built AIs at Risk — four LLM agents competing at the board game Risk, each assigned a character, tracked on a leaderboard. I wrote about what I learned here, talked about it at WHY2025, and came away with a list of things I wanted to try differently. Ankh-Morpork Scramble is what happens when you take those lessons and apply them to a game you built yourself rather than one you inherited.

Ankh-Morpork Scramble is a turn-based sports game where the City Watch takes on the Wizards of Unseen University in what the official rules describe as “the city’s most prestigious, least-regulated street-sport.” It runs continuously on Railway, AI vs AI, all day. You can watch a match in progress right now.

Note: It’s live

You can watch a game at ankh-morpork-scramble-production.up.railway.app. Code is on GitHub. Fair warning: it’s a side project. Things occasionally break. That is, in the spirit of Blood Bowl, entirely appropriate.

Why Blood Bowl? Why Discworld?

Blood Bowl is one of those games that has been quietly beloved for forty years precisely because it’s about failure. The dice are there to punish you. You build an elegant plan, you execute the first three moves perfectly, and then your star ball carrier trips over his own feet and knocks himself out cold. The crowd cheers. Your opponent looks sympathetic. The commentator makes a remark about insurance.

That’s not frustrating — that’s the game. The skill is in building plans that survive contact with the dice, not in pretending the dice aren’t there.

Terry Pratchett’s Ankh-Morpork is the obvious setting because it’s a city that runs on exactly that energy. Everything is held together by improvisation, tradition, and the vague suspicion that nobody actually knows the rules. The City Watch are perpetually underfunded and overextended. The wizards are technically brilliant and practically catastrophic. Both teams feel true to their source material when they fail in interesting ways.

The real reason, honestly: I wanted to see what happened when you gave LLMs a genuinely hard sequential decision problem with real consequences, and watched what they did.

Why Blood Bowl is hard for bots

Blood Bowl has an AI problem. It’s not that the game hides information — it’s fully observable, grid-based, turn-based. At first glance, it should be approachable by existing game-playing algorithms. But as the Bot Bowl project (the annual AI competition for Blood Bowl) has documented, the turn-wise branching factor is overwhelming. Every piece on the board can move multiple times per turn, the game state explodes exponentially, and scoring is so rare that the reward signal is sparse. Traditional minimax search and reinforcement learning both struggle with the combination.

Most Bot Bowl entrants use tree search or learned heuristics. What I’m doing here is different — I’m not competing with the bots, I’m using LLMs as the decision-making engine and seeing what emerges. No search, no value function, just a system prompt, a game state, and a hope that the model can read the pitch. That’s the interesting part: if Blood Bowl is hard for AlphaZero-style agents, what’s it like when you hand the decisions to a language model that mostly learned from text?

What it actually is

The game server is a FastAPI application that implements a simplified Blood Bowl ruleset: a 26x15 pitch, two teams of up to eleven players, two halves, touchdowns scored by carrying the ball to the opponent’s end zone. Players can move, block, pass, and perform Ankh-Morpork-specific actions like Scuffle and Charge.

Two LLM agents play against each other. Each turn, the active agent receives a structured game state summary — where every player is, who has the ball, how many movement points each player has left, which squares are safe to move to — and returns a JSON action. One action at a time, back and forth, until the half ends or someone scores.
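
In sketch form, one side of that exchange — an illustrative action schema and the validation the server would run on it. The field names and action types here are assumptions for illustration; the real schema lives in the repo:

```python
import json

# Hypothetical action vocabulary; the real game exposes more detail.
VALID_ACTIONS = {"move", "block", "pass", "scuffle", "charge", "end_turn"}

def parse_agent_reply(raw: str) -> dict:
    """Parse and validate one turn's JSON reply from the agent."""
    reply = json.loads(raw)
    action = reply.get("action", {})
    if action.get("type") not in VALID_ACTIONS:
        raise ValueError(f"unknown action type: {action.get('type')!r}")
    return reply

# One turn's reply: the agent's reasoning plus a single action.
raw = json.dumps({
    "thought": "Throat is in the open; get the ball carrier behind him.",
    "action": {"type": "move", "player": "Constable Throat", "to": [8, 7]},
})
reply = parse_agent_reply(raw)
```

The "thought" field is what makes the failure modes readable later: every bad call comes with the reasoning that produced it.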

Team 1: City Watch Constables. The agent plays as Commander Sam Vimes on the sideline. The system prompt is: “You are a cynical, streetwise copper who has learned the hard way that the direct route is usually the one with the trap. Coach like a Watchman on a double shift: read the play, call the safe move first, protect your ball carrier like he’s the last witness.”

Team 2: Unseen University Adepts. The agent plays as Ponder Stibbons. “You have run the numbers through Hex; the numbers say this will probably not work, but probably is a range. Coach like an overworked academic: identify the highest-probability play, hedge against turnovers, explain your reasoning as if presenting to the faculty, and sigh internally when Ridcully ignores the plan.”

Commentator: C.M.O.T. Dibbler. Every turn, a third agent generates commentary in the voice of Ankh-Morpork’s most optimistic entrepreneur. “Lovely action there, buy a meat pie, only three dollars, genuine meat.” It runs as a scrolling ticker at the top of the dashboard. Dibbler turned out to be the best part of the whole project.

Versus mode

The continuous tournament isn’t the only way to play. You can also go to /versus on the live server, pick two models yourself, and watch them face off in real time. It’s useful for comparing specific models head-to-head or just satisfying your curiosity about whether your favourite can actually score.

The tournament

Each match, two models are picked from a pool of ten free LLMs on OpenRouter — Qwen, Gemma, Llama, Mistral, Phi, DeepSeek — weighted so models with fewer recorded games get more chances. The leaderboard tracks wins, losses, and draws per model, plus six behavioural dimensions: Aggression, Recklessness, Ball Craft, Lethality, Verbosity, and Efficiency.

The idea is to build up a genuine performance profile across hundreds of games. Which models play tactically? Which ones yolo into combat and get punished? Which ones produce elegant multi-step plays and which ones seem confused by the pitch geometry?

We’ve only run a handful of games so far — the server’s been running for a few days and there’s only one completed result in the data — but the infrastructure is there and it accumulates. The leaderboard is at /leaderboard/ui on the live server.

What we learned building it (and what changed from Risk)

In AI at Play I wrote about the two hardest problems: context and scaffolding (how do you convey a complex game state to an LLM?), and the immaturity of the agent tooling (MCP and multi-agent frameworks were new and rough). Both of those shaped how I built Scramble differently.

No MCP, no LangChain. Risk used MCP tools and LangGraph for orchestration. It worked, but it was fragile and added a lot of complexity that didn’t pay for itself. Scramble uses direct LLM API calls with a simple JSON response format. Each turn: here’s the game state, return one action. The simplicity meant the game loop was easier to debug and the agents spent less cognitive budget on tool-calling protocol.
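
A minimal sketch of what “direct LLM API calls” means here, against OpenRouter’s OpenAI-compatible chat completions endpoint. The function names and prompt wording are mine, not the repo’s:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_turn_request(model: str, persona: str, state_summary: str) -> dict:
    """One chat-completion payload: persona as the system prompt,
    this turn's game state as the user message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user",
             "content": state_summary + "\nReply with a single JSON action."},
        ],
    }

def call_llm(payload: dict, api_key: str) -> str:
    """Send the payload and return the model's raw text reply."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

No tool schemas, no orchestration graph — just a request per turn and a JSON reply to parse.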

Build the game yourself. Risk is an inherited system with all of Risk’s complexity. Scramble is a game I designed, which means I could shape the rules to be agent-friendly from the start. Simpler movement, clearer win conditions, and critically: a server that could pre-compute and expose exactly what each agent was allowed to do.

Agents need constrained action spaces

Early versions would confidently instruct a player to move to a square they couldn’t reach, or try to block someone three squares away. This was exactly the scaffolding problem from Risk, just wearing a different hat. The agent wasn’t hallucinating exactly — it was making reasonable-sounding decisions based on a game state description it didn’t fully understand.

The fix: expose valid actions explicitly. The state summary now includes a list of reachable squares per player (computed via BFS flood-fill on the server), rendered as natural language: “Constable Throat can safely reach: (8,7), (8,8), (9,6)…” Once the agent only ever saw legal destinations, the “invalid action” rate dropped to near zero.
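
The flood-fill itself is a few lines of BFS. A sketch under simplified movement rules — eight-way steps, uniform cost, no dodge rolls; the real server applies the full ruleset:

```python
from collections import deque

def reachable_squares(start, movement, blocked, width=26, height=15):
    """All squares reachable within `movement` steps, never entering
    a blocked (occupied) square. Breadth-first flood-fill."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        cost = seen[(x, y)]
        if cost == movement:
            continue  # out of movement points along this path
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nx, ny = x + dx, y + dy
                if (dx, dy) == (0, 0):
                    continue
                if not (0 <= nx < width and 0 <= ny < height):
                    continue
                if (nx, ny) in blocked or (nx, ny) in seen:
                    continue
                seen[(nx, ny)] = cost + 1
                queue.append((nx, ny))
    return set(seen) - {start}
```

The output set is what gets rendered into the “can safely reach” sentence the agent sees.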

The general lesson: don’t ask an LLM to search over a large space of possibilities when you can pre-compute the valid subset and just ask it to choose. The LLM’s job is strategy, not geometry. In Risk this was painful to retrofit onto an inherited ruleset. In Scramble I could design for it from the start.

The failure modes are revealing

When an agent makes a bad call — charges into a block it’s likely to lose, moves the ball carrier into a surrounded position — it’s often possible to read why from the thought field in its JSON response. Vimes-mode agents tend to get overconfident with lead changes. Stibbons-mode agents sometimes hedge so aggressively they forget to actually advance the ball.

These aren’t random failures. They’re coherent with the personas. Vimes gets cocky when he’s ahead; Ponder overthinks it. I didn’t engineer this — it emerged from the system prompts.

Ghost players are a real problem

For a while, knocked-out and injured players were still occupying squares on the pitch after they left the field. Another player couldn’t move there. The server was maintaining the position data even after the player was removed from active play. It was subtle — the pitch looked fine to the agents, the positions just happened to be unavailable — and it caused movement plans to silently fail.

Software bugs in games tend to be funnier than bugs in production systems. This one meant that the ghost of a knocked-out wizard would haunt a pitch square for the rest of the half. Very Discworld, actually.
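
The fix follows a general pattern: derive square occupancy from the set of active players rather than storing it separately, so a removed player can never leave a stale entry behind. A sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    position: tuple[int, int]
    status: str = "active"   # "active", "knocked_out", "injured"

class Pitch:
    def __init__(self, players):
        self.players = players

    def occupied(self):
        """Recomputed from active players every time, so a knocked-out
        player can't haunt their old square."""
        return {p.position for p in self.players if p.status == "active"}

    def is_free(self, square):
        return square not in self.occupied()
```

Once occupancy is derived rather than cached, the ghost class of bug becomes impossible instead of merely fixed.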

LLMs make surprisingly reasonable coaches

This was the part I was most uncertain about. Would the agents produce coherent tactical play, or just noise?

Mostly: coherent. Not brilliant, but recognisable as strategy. They pick up the ball when it’s free, they protect the carrier, they run away from situations they’re likely to lose. The multi-move sequences are a bit linear — agents don’t seem to think more than one or two turns ahead — but within a turn they make reasonable prioritisation calls.

The Discworld personas help more than I expected. Framing the decision as “what would Vimes do” gives the model a consistent decision heuristic to operate from. It’s not just aesthetic flavour — it affects the actual choices.

What’s it actually for?

Honestly, most of it was just fun. Discworld, Blood Bowl, AI agents, a reason to build a live dashboard — I didn’t need much more justification than that.

But it’s turned out to be a reasonable testbed for a few things I’m actually interested in:

Can you benchmark LLM tactical reasoning at scale? This is the direct continuation of what I was doing with Risk. In AI at Play I found that Horizon Alpha was aggressively dominant and Qwen-3 wouldn’t stop sending diplomatic messages. Scramble runs a similar tournament but with a game I designed to surface tactical reasoning more cleanly. If you run enough games, does a pattern emerge in which models are better at sequential multi-step planning? We don’t have enough data yet, but the infrastructure is there.

How much do system prompt personas affect decision quality? The Vimes/Ponder framing is a controlled variable — same model, same game state, different character. I’d like to run ablation matches (neutral system prompt vs. character prompt) to see if the persona actually changes behaviour beyond surface-level tone.

What makes a good action space for LLM agents? The valid-actions fix was the biggest single improvement to game quality. I think there’s a general pattern here about how to design interfaces for agents that’s worth writing about separately.

If any of that sounds interesting to you, the code is open and PRs are welcome. There’s a skills file in the repo (skills/README.md) if you want to build your own agent and play it against Vimes.

What’s next

A few things on the list:

  • More games. The tournament needs to run for a while before the leaderboard means anything. Leaving it running is the main thing.
  • Tackle zones. The current game doesn’t visually indicate which squares require a dodge roll to leave. Agents know this from the state summary, but it’d make the dashboard much more readable.
  • Seasons. Right now it’s just continuous matches. A proper tournament bracket with a season structure would give the leaderboard more shape.
  • Your agent. The player agent is a clean interface. If you want to plug in a different model, a fine-tuned persona, or a non-LLM agent, there’s a path for that. Instructions in the repo.

The Patrician has been informed of this project. He neither approved nor disapproved. He simply noted that if it became a problem he would know who to talk to.