Werewolf Obstacle Course

Per-model tool-call reliability across ten Werewolf scenarios. Latest run per model.

Leaderboard

Model	speak	nominate	vote	defense	investigate	protect	choose_kill	scream	long_speak	notepad	Overall	Avg	Cost / 27	Vetted (runs)
google/gemini-3-flash-preview	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	1.0s	$0.0047	2026-04-20 / 3
google/gemma-4-31b-it	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	1.1s	$0.0011	2026-04-20 / 3
google/gemini-3.1-flash-lite-preview	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	1.3s	$0.0025	2026-04-20 / 3
anthropic/claude-haiku-4.5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	50/50 (100%)	1.8s	$0.0317	2026-04-20 / 5
x-ai/grok-4-fast	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	2.0s	$0.0051	2026-04-20 / 3
cohere/command-r-plus-08-2024	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	5/5	50/50 (100%)	2.2s	$0.0212	2026-04-21 / 5
anthropic/claude-sonnet-4-6	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	3.3s	$0.1052	2026-04-20 / 3
qwen/qwen3.6-plus	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	30/30 (100%)	6.8s	$0.0210	2026-04-20 / 3
openai/gpt-oss-120b	3/3	3/3	2/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	29/30 (97%)	2.4s	$0.0007	2026-04-20 / 3
amazon/nova-micro-v1	5/5	5/5	5/5	5/5	4/5	5/5	5/5	5/5	4/5	5/5	48/50 (96%)	0.8s	$0.0009	2026-04-20 / 5
qwen/qwen3.5-35b-a3b	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	1/3	3/3	28/30 (93%)	1.5s	$0.0072	2026-04-20 / 3
google/gemini-2.5-flash	3/3	3/3	3/3	0/3	3/3	3/3	3/3	3/3	3/3	3/3	27/30 (90%)	0.7s	$0.0029	2026-04-20 / 3
mistralai/mistral-medium-3	3/3	3/3	0/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	27/30 (90%)	0.9s	$0.0039	2026-04-20 / 3
moonshotai/kimi-k2	3/3	3/3	3/3	3/3	3/3	3/3	3/3	2/3	1/3	3/3	27/30 (90%)	2.0s	$0.0093	2026-04-20 / 3
amazon/nova-lite-v1	5/5	4/5	5/5	5/5	1/5	5/5	5/5	5/5	5/5	5/5	45/50 (90%)	5.4s	$0.0014	2026-04-20 / 5
deepseek/deepseek-v3.2	3/3	3/3	1/3	2/3	3/3	3/3	3/3	3/3	3/3	3/3	27/30 (90%)	5.9s	$0.0045	2026-04-20 / 3
openrouter/elephant-alpha	5/5	4/5	5/5	4/5	4/5	4/5	4/5	4/5	0/5	4/5	38/50 (76%)	0.7s	$0.0000	2026-04-20 / 5
stepfun/step-3.5-flash:free	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/30 (0%)	0.0s	$0.0000	2026-04-20 / 3
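
The Overall column can be reproduced from the ten per-scenario cells in a row. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the figures in the table.

```python
def overall(cells):
    """Sum 'passed/total' cells into a (passed, total, percent) triple."""
    passed = total = 0
    for cell in cells:
        p, t = cell.split("/")
        passed += int(p)
        total += int(t)
    # Assumption: the leaderboard rounds to the nearest whole percent.
    percent = round(100 * passed / total) if total else 0
    return passed, total, percent


def format_overall(cells):
    """Render the triple in the same shape as the Overall column."""
    passed, total, percent = overall(cells)
    return f"{passed}/{total} ({percent}%)"


# Scenario cells for openai/gpt-oss-120b, copied from the table above.
gpt_oss_cells = ["3/3", "3/3", "2/3", "3/3", "3/3", "3/3", "3/3", "3/3", "3/3", "3/3"]
print(format_overall(gpt_oss_cells))
```

Applied to the openai/gpt-oss-120b row, the single missed vote call is what separates 29/30 from a perfect score.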

Scenarios

Each scenario presents a Werewolf prompt with only the expected tool available and checks whether the model actually called it. A pass requires the correct tool call to appear in the response.

speak — Free-form discussion speech.
nominate — Call a nomination for elimination.
vote — Cast aye/no on a trial.
defense — Short speech when nominated.
investigate — Seer night action.
protect — Doctor night action.
choose_kill — Werewolf night target.
scream — Dying words after being killed.
long_speak — Defense speech after two rounds of prior context. Catches the failure mode where the model narrates speak() as plain text instead of calling the tool.
notepad — Write to per-player notepad.
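
A pass check of this kind might look like the following sketch. It is not the harness's actual code: it assumes an OpenAI-style chat completion payload (`choices[0].message.tool_calls[*].function.name`), and the function name `called_expected_tool` and both sample responses are hypothetical.

```python
def called_expected_tool(response: dict, expected_tool: str) -> bool:
    """Return True if the first choice contains a call to expected_tool.

    Assumes an OpenAI-style response shape; other providers may differ.
    """
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    tool_calls = message.get("tool_calls") or []
    return any(
        (call.get("function") or {}).get("name") == expected_tool
        for call in tool_calls
    )


# Hypothetical response where the model really calls speak().
real_call = {
    "choices": [{
        "message": {
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {"name": "speak", "arguments": "{\"text\": \"I am innocent.\"}"},
            }],
        }
    }]
}

# Hypothetical response showing the long_speak failure: the call is narrated as text.
narrated = {
    "choices": [{
        "message": {"content": "speak(\"I am innocent.\")", "tool_calls": None}
    }]
}

print(called_expected_tool(real_call, "speak"))  # a real tool call passes
print(called_expected_tool(narrated, "speak"))   # narrated text does not
```

The check inspects only the structured tool calls and never the message text, so a response that merely describes the call in prose fails the scenario.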
