Per-model tool-call reliability across ten Werewolf scenarios. Latest run per model.
| Model | speak | nominate | vote | defense | investigate | protect | choose_kill | scream | long_speak | notepad | Overall | Avg | Cost / 27 | Vetted (runs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.0s | $0.0047 | 2026-04-20 / 3 |
| google/gemma-4-31b-it | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.1s | $0.0011 | 2026-04-20 / 3 |
| google/gemini-3.1-flash-lite-preview | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 1.3s | $0.0025 | 2026-04-20 / 3 |
| anthropic/claude-haiku-4.5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 50/50 (100%) | 1.8s | $0.0317 | 2026-04-20 / 5 |
| x-ai/grok-4-fast | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 2.0s | $0.0051 | 2026-04-20 / 3 |
| anthropic/claude-sonnet-4-6 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 3.3s | $0.1052 | 2026-04-20 / 3 |
| qwen/qwen3.6-plus | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 30/30 (100%) | 6.8s | $0.0210 | 2026-04-20 / 3 |
| openai/gpt-oss-120b | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 29/30 (97%) | 2.4s | $0.0007 | 2026-04-20 / 3 |
| amazon/nova-micro-v1 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 5/5 | 5/5 | 4/5 | 5/5 | 48/50 (96%) | 0.8s | $0.0009 | 2026-04-20 / 5 |
| qwen/qwen3.5-35b-a3b | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 28/30 (93%) | 1.5s | $0.0072 | 2026-04-20 / 3 |
| google/gemini-2.5-flash | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 0.7s | $0.0029 | 2026-04-20 / 3 |
| mistralai/mistral-medium-3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 0.9s | $0.0039 | 2026-04-20 / 3 |
| moonshotai/kimi-k2 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 1/3 | 3/3 | 27/30 (90%) | 2.0s | $0.0093 | 2026-04-20 / 3 |
| cohere/command-r-plus-08-2024 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 2.1s | $0.0216 | 2026-04-20 / 3 |
| amazon/nova-lite-v1 | 5/5 | 4/5 | 5/5 | 5/5 | 1/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 45/50 (90%) | 5.4s | $0.0014 | 2026-04-20 / 5 |
| deepseek/deepseek-v3.2 | 3/3 | 3/3 | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 27/30 (90%) | 5.9s | $0.0045 | 2026-04-20 / 3 |
| openrouter/elephant-alpha | 5/5 | 4/5 | 5/5 | 4/5 | 4/5 | 4/5 | 4/5 | 4/5 | 0/5 | 4/5 | 38/50 (76%) | 0.7s | $0.0000 | 2026-04-20 / 5 |
| stepfun/step-3.5-flash:free | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/30 (0%) | 0.0s | $0.0000 | 2026-04-20 / 3 |
Each scenario presents a Werewolf prompt with only the expected tool available and checks whether the model actually called it. A pass requires the correct tool-call to appear in the response.
speak — Free-form discussion speech.nominate — Call a nomination for elimination.vote — Cast aye/no on a trial.defense — Short speech when nominated.investigate — Seer night action.protect — Doctor night action.choose_kill — Werewolf night target.scream — Dying words after being killed.long_speak — Defense speech after two rounds of prior context -- catches the 'narrates speak() as text' failure.notepad — Write to per-player notepad.