Werewolf obstacle course

Per-model tool-call reliability across ten Werewolf scenarios. Latest run per model.

Leaderboard

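| Model | speak | nominate | vote | defense | investigate | protect | choose_kill | scream | long_speak | notepad | Overall | Avg | Cost / 27 | Vetted (runs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.0s | $0.0047 | 2026-04-20 / 3 |
| google/gemma-4-31b-it | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.1s | $0.0011 | 2026-04-20 / 3 |
| google/gemini-3.1-flash-lite-preview | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.3s | $0.0025 | 2026-04-20 / 3 |
| anthropic/claude-haiku-4.5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 50/50 (100%) | 1.8s | $0.0317 | 2026-04-20 / 5 |
| x-ai/grok-4-fast | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 2.0s | $0.0051 | 2026-04-20 / 3 |
| anthropic/claude-sonnet-4-6 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 3.3s | $0.1052 | 2026-04-20 / 3 |
| qwen/qwen3.6-plus | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 6.8s | $0.0210 | 2026-04-20 / 3 |
| openai/gpt-oss-120b | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 29/30 (97%) | 2.4s | $0.0007 | 2026-04-20 / 3 |
| amazon/nova-micro-v1 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 48/50 (96%) | 0.8s | $0.0009 | 2026-04-20 / 5 |
| qwen/qwen3.5-35b-a3b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 28/30 (93%) | 1.5s | $0.0072 | 2026-04-20 / 3 |
| google/gemini-2.5-flash | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 0.7s | $0.0029 | 2026-04-20 / 3 |
| mistralai/mistral-medium-3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 0.9s | $0.0039 | 2026-04-20 / 3 |
| moonshotai/kimi-k2 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 1/3 | 3/3 | 27/30 (90%) | 2.0s | $0.0093 | 2026-04-20 / 3 |
| cohere/command-r-plus-08-2024 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 2.1s | $0.0216 | 2026-04-20 / 3 |
| amazon/nova-lite-v1 | 5/5 | 4/5 | 5/5 | 5/5 | 1/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 45/50 (90%) | 5.4s | $0.0014 | 2026-04-20 / 5 |
| deepseek/deepseek-v3.2 | 3/3 | 3/3 | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 5.9s | $0.0045 | 2026-04-20 / 3 |
| openrouter/elephant-alpha | 5/5 | 4/5 | 5/5 | 4/5 | 4/5 | 4/5 | 4/5 | 4/5 | 0/5 | 4/5 | 38/50 (76%) | 0.7s | $0.0000 | 2026-04-20 / 5 |
| stepfun/step-3.5-flash:free | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/30 (0%) | 0.0s | $0.0000 | 2026-04-20 / 3 |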
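
The Overall figures are consistent with summing the per-scenario passes and dividing by scenarios times runs, with the percentage rounded to the nearest whole number. A minimal Python sketch under that assumption follows; the function name `overall` is hypothetical and not taken from the benchmark's code.

```python
from typing import Sequence


def overall(passes_per_scenario: Sequence[int], runs: int) -> str:
    """Format an Overall cell from per-scenario pass counts.

    Assumption: every scenario is attempted `runs` times, so the
    denominator is runs * number_of_scenarios.
    """
    passed = sum(passes_per_scenario)
    total = runs * len(passes_per_scenario)
    pct = round(100 * passed / total) if total else 0
    return f"{passed}/{total} ({pct}%)"


# Per-scenario pass counts copied from the openai/gpt-oss-120b row (3 runs).
gpt_oss = [3, 3, 2, 3, 3, 3, 3, 3, 3, 3]
print(overall(gpt_oss, runs=3))
```

Applied to the openai/gpt-oss-120b row this gives `29/30 (97%)`, matching the leaderboard.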

Scenarios

Each scenario presents a Werewolf prompt with only the expected tool available and checks whether the model actually called it. A run passes only if the response contains a call to that tool.
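
The pass check described above can be sketched roughly as follows. This is an illustration, not the benchmark's actual harness: it assumes the response is an OpenAI-style chat completion parsed into a dict (`choices[].message.tool_calls[].function.name`), and the helper name `called_expected_tool` is hypothetical.

```python
def called_expected_tool(response: dict, expected_tool: str) -> bool:
    """Return True if any tool call in the response names expected_tool.

    Assumes an OpenAI-style chat completion payload; a plain-text reply
    with no tool_calls counts as a failure.
    """
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            function = call.get("function") or {}
            if function.get("name") == expected_tool:
                return True
    return False


# A response that calls the tool passes.
passing = {
    "choices": [{
        "message": {
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {"name": "vote", "arguments": '{"target": "Alice"}'},
            }],
        }
    }]
}

# A response that only describes the action in prose fails.
failing = {"choices": [{"message": {"content": "I vote for Alice."}}]}

print(called_expected_tool(passing, "vote"))
print(called_expected_tool(failing, "vote"))
```

Under this sketch, a model that answers in prose instead of emitting a structured tool call is scored as a miss, even if the prose states the intended action.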