TeamBench is a benchmark for evaluating multi-agent LLM coordination on 851 software-engineering, data-engineering, and incident-response tasks. Roles run in isolated containers under five ablation conditions, so the contribution of each Planner, Executor, and Verifier is measured independently, and a Teamwork Necessity Index quantifies when a team is actually pulling its weight.
| # | Model | Org | Solo | Restricted | No Plan | No Eval | Full | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 35.6 | 33.3 | 35.6 | 33.3 | 37.8 | 2026-05-06 |
| 2 | GPT-5.4 Mini | OpenAI | 33.3 | 23.3 | 25.6 | 24.4 | 28.9 | 2026-05-06 |
| 3 | Claude Haiku 4.5 | Anthropic | 12.2 | 31.1 | 18.9 | 1.1 | 28.9 | 2026-05-06 |
| 4 | Gemini-3.1 Pro | Google | 27.8 | 22.2 | 16.7 | 25.6 | 28.9 | 2026-05-06 |
| 5 | Claude Sonnet 4.6 | Anthropic | 7.8 | 27.8 | 10.0 | 6.7 | 27.8 | 2026-05-06 |
| 6 | GPT-5.4 | OpenAI | 12.2 | 35.6 | 23.3 | 34.4 | 27.8 | 2026-05-06 |
| 7 | Gemma 4 31B | Google | 27.8 | 25.6 | 24.4 | 20.0 | 22.2 | 2026-05-06 |
| 8 | Gemini-3 Flash | Google | 13.3 | 18.9 | 14.4 | 27.8 | 25.6 | 2026-05-06 |
| 9 | Gemini-3.1 Flash Lite | Google | 5.6 | 21.1 | 8.9 | 17.8 | 17.8 | 2026-05-06 |
| 10 | gpt-oss-20b | OpenAI | 17.8 | 17.8 | 12.2 | 7.8 | 2.2 | 2026-05-06 |
| 11 | Qwen 3 14B | Alibaba | 5.6 | 2.2 | 2.2 | 1.1 | 2.2 | 2026-05-06 |
| 12 | Qwen 3 32B | Alibaba | 5.6 | 3.3 | 0.0 | 5.6 | 1.1 | 2026-05-06 |
| 13 | Qwen 3 8B | Alibaba | 2.2 | 5.6 | 1.1 | 3.3 | 0.0 | 2026-05-06 |
Existing benchmarks assign roles through system prompts that agents can ignore. TeamBench enforces boundaries at the OS level.
| Type | Benchmark | Structural Enf. | Role Ablation | Collab. Metric | Contam. Resist. | Cross-Model |
|---|---|---|---|---|---|---|
| SA | SWE-Bench | — | × | × | × | ✓ |
| SA | TerminalBench | — | × | × | × | ✓ |
| SA | LiveCodeBench | — | × | × | ✓ | ✓ |
| SA | GAIA | — | × | × | × | ✓ |
| SA | MLE-Bench | — | × | × | × | ✓ |
| MA | MultiAgentBench | × | × | ✓ | × | ✓ |
| MA | AgentCoder | × | ~ | × | × | ~ |
| MA | DevBench | × | × | × | × | ~ |
| MA | CAMEL | × | × | × | × | × |
| MA | GPTSwarm | × | × | × | × | × |
| MA | TeamBench (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
931 seeded evaluation instances drawn from 851 templates: 161 originally-authored coordination-required templates, 650 GitHub-derived bug fixes, 30 UCI data-science tasks, and 10 public post-mortems.
Each role runs in an isolated Docker container. No single container can access everything.
**Planner.** Reads the full specification (spec.md). Decomposes requirements, identifies hidden constraints and edge cases, and creates an execution plan. Cannot execute code or modify the workspace.
**Executor.** Generates the implementation from the Planner's instructions. Runs commands and modifies workspace code. Cannot access the full specification.
**Verifier.** Evaluates compliance against requirements and writes an attestation. Reads spec.md and the workspace (read-only). Cannot modify the workspace.
Container bind mounts physically enforce role boundaries.
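Concretely, one way such mounts could be wired, sketched here with the Docker Python SDK. The image name, task paths, and role commands are illustrative assumptions, not TeamBench's actual harness:

```python
# Sketch of OS-level role isolation via bind mounts. Image name and host
# paths are illustrative, not TeamBench's actual harness.
import docker

client = docker.from_env()
task = "/tasks/t42"  # hypothetical task directory

# Planner: sees the full spec read-only; the workspace is never mounted.
planner = client.containers.run(
    "teambench/agent:latest",
    command="run-role planner",
    volumes={f"{task}/spec.md": {"bind": "/spec.md", "mode": "ro"}},
    detach=True,
)

# Executor: read-write workspace; the spec file is simply not mounted,
# so ignoring the role prompt cannot reveal it.
executor = client.containers.run(
    "teambench/agent:latest",
    command="run-role executor",
    volumes={f"{task}/workspace": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)

# Verifier: spec and workspace both mounted read-only; it can attest
# but cannot patch code.
verifier = client.containers.run(
    "teambench/agent:latest",
    command="run-role verifier",
    volumes={
        f"{task}/spec.md": {"bind": "/spec.md", "mode": "ro"},
        f"{task}/workspace": {"bind": "/workspace", "mode": "ro"},
    },
    detach=True,
)
```

Because the boundary is the mount table rather than the prompt, a role that "decides" to step outside its contract simply finds nothing there to read or write.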
Step through a real task across four conditions. See why the oracle fails where the full team succeeds.
Fix all security vulnerabilities in the authentication module. The specification lists five issues of varying severity, some subtle enough to be missed without careful spec analysis. Fixes must not introduce regressions.
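For flavor, a hypothetical example of the kind of subtle issue such a specification might list (not an actual TeamBench task file): a token comparison with `==` leaks timing information, and the fix is a constant-time compare.

```python
# Hypothetical auth-module snippet, illustrative only.
import hmac

def check_token(supplied: str, expected: str) -> bool:
    # Vulnerable version: `return supplied == expected` short-circuits on the
    # first differing byte, leaking timing information to an attacker.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```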
Team benefit concentrates on the hardest tasks and weakest models.
**49% false-accept.** On grader-failing runs in the 27-configuration role-mixing pool, LLM Verifiers approve 49% of submissions that fail the deterministic grader. Removing the Verifier improves mean partial score in the reference ablation: the Verifier, not the Planner, is the binding constraint.
**3.6x role collapse.** Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6x more cases where the Verifier rewrites the Executor's code. Pass rate alone hides role collapse.
**+15.7 pp on Q1.** Team uplift concentrates where Solo struggles: the lowest Solo-score quintile gains +15.7 points, while the top quintile loses points to coordination overhead. Mean team-vs-Solo uplift across the full 155-task reference pool is +0.5 points (p = 0.20).
**+2.4 pp planning.** Adding the Planner to the No-Plan team raises mean partial score by +2.4 points, concentrated on tasks whose specifications contain decision rules the Executor cannot infer from the brief or workspace alone.
**PGEAVA: 26.7%.** Mixed-provider teams improve cost-performance: PGEAVA (Google Planner, Anthropic Executor, Anthropic Verifier) reaches 26.7% at $20.52, beating the all-Anthropic team (22.7%, $39.58) by 4 points at roughly half the cost.
**40 sessions.** Human baseline: 40 sessions, 21 tasks, and 18 distinct participants under matched role separation. Solo participants work through tasks directly, hybrid sessions often collapse into quick approval, and human teams spend more effort coordinating missing information across roles. MIT COUHES exempt determination.
The role-mixing pool covers every assignment of three providers to three roles: Anthropic, Google, and OpenAI, with one compact frontier model per family. That yields 27 configurations, run on 25 stratified tasks across 3 seeds for 2,025 deduplicated runs; the naming scheme is sketched below.
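The configuration codes follow a P?E?V? scheme (A = Anthropic, G = Google, O = OpenAI), so PGEAVA means Google Planner, Anthropic Executor, Anthropic Verifier. A minimal sketch of the enumeration and run count:

```python
# Enumerate the 27 role-mixing configurations and their P?E?V? codes.
from itertools import product

PROVIDERS = {"A": "Anthropic", "G": "Google", "O": "OpenAI"}

configs = [
    (f"P{p}E{e}V{v}", PROVIDERS[p], PROVIDERS[e], PROVIDERS[v])
    for p, e, v in product(PROVIDERS, repeat=3)
]
assert len(configs) == 27  # 3 providers ** 3 roles

# 27 configs x 25 stratified tasks x 3 seeds = 2,025 runs
print(len(configs) * 25 * 3)  # -> 2025
```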
| # | Code | Planner | Executor | Verifier | Pass | Partial | Cost | Pass/$ | Turns |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PGEAVA | Google | Anthropic | Anthropic | 26.7% | 0.695 | $20.52 | 0.013 | 54.1 |
| 2 | POEOVA | OpenAI | OpenAI | Anthropic | 22.7% | 0.679 | $10.98 | 0.021 | 30.9 |
| 3 | PGEAVO | Google | Anthropic | OpenAI | 22.7% | 0.672 | $11.77 | 0.019 | 49.8 |
| 4 | PAEAVA | Anthropic | Anthropic | Anthropic | 22.7% | 0.639 | $39.58 | 0.006 | 54.3 |
| 5 | POEGVG | OpenAI | Google | Google | 21.3% | 0.612 | $3.73 | 0.057 | 36.2 |
| 6 | PAEAVO | Anthropic | Anthropic | OpenAI | 21.3% | 0.650 | $29.88 | 0.007 | 45.0 |
| 7 | POEGVO | OpenAI | Google | OpenAI | 20.0% | 0.600 | $2.99 | 0.067 | 31.3 |
| 8 | PAEOVO | Anthropic | OpenAI | OpenAI | 20.0% | 0.639 | $9.53 | 0.021 | 37.2 |
| 9 | POEOVO | OpenAI | OpenAI | OpenAI | 18.7% | 0.592 | $2.09 | 0.089 | 25.2 |
| 10 | PGEOVO | Google | OpenAI | OpenAI | 18.7% | 0.627 | $2.36 | 0.079 | 38.3 |
| 11 | PAEGVO | Anthropic | Google | OpenAI | 18.7% | 0.583 | $12.45 | 0.015 | 46.7 |
| 12 | PAEGVA | Anthropic | Google | Anthropic | 18.7% | 0.618 | $21.62 | 0.009 | 51.7 |
| 13 | POEAVO | OpenAI | Anthropic | OpenAI | 17.3% | 0.448 | $6.15 | 0.028 | 19.9 |
| 14 | POEAVG | OpenAI | Anthropic | Google | 17.3% | 0.466 | $6.99 | 0.025 | 23.5 |
| 15 | PAEGVG | Anthropic | Google | Google | 17.3% | 0.605 | $13.34 | 0.013 | 50.9 |
| 16 | PAEAVG | Anthropic | Anthropic | Google | 17.3% | 0.646 | $29.48 | 0.006 | 51.6 |
| 17 | PGEGVO | Google | Google | OpenAI | 16.0% | 0.557 | $3.67 | 0.044 | 45.3 |
| 18 | PAEOVA | Anthropic | OpenAI | Anthropic | 16.0% | 0.602 | $18.31 | 0.009 | 44.9 |
| 19 | PGEOVG | Google | OpenAI | Google | 14.7% | 0.559 | $3.33 | 0.044 | 45.7 |
| 20 | PGEAVG | Google | Anthropic | Google | 14.7% | 0.496 | $8.13 | 0.018 | 35.6 |
| 21 | PAEOVG | Anthropic | OpenAI | Google | 14.7% | 0.617 | $10.87 | 0.013 | 43.9 |
| 22 | POEAVA | OpenAI | Anthropic | Anthropic | 14.7% | 0.437 | $12.41 | 0.012 | 24.2 |
| 23 | PGEGVG | Google | Google | Google | 13.3% | 0.570 | $4.98 | 0.027 | 54.6 |
| 24 | POEGVA | OpenAI | Google | Anthropic | 13.3% | 0.399 | $7.98 | 0.017 | 24.5 |
| 25 | POEOVG | OpenAI | OpenAI | Google | 12.0% | 0.602 | $2.66 | 0.045 | 29.5 |
| 26 | PGEOVA | Google | OpenAI | Anthropic | 10.7% | 0.317 | $7.43 | 0.014 | 32.0 |
| 27 | PGEGVA | Google | Google | Anthropic | 10.7% | 0.371 | $8.11 | 0.013 | 37.0 |
No-Verify and Full Team tie on mean partial score (0.592), and No-Verify outperforms Full Team on most individual tasks, indicating that the Verifier role adds overhead rather than correctness.
*Figure: Full Team minus Oracle pass rate (%), averaged per category, on the TeamBench-Mini subset.*
*Figure: Oracle vs. Full Team per-task scatter; each dot is one task, and points above the diagonal are tasks where the team wins.*
*Figure: Per-task component contribution; four quadrants show where each role helps or hurts.*
*Figure: Average partial score across 147 tasks (Gemini 3 Flash); the hardest tasks benefit most from team coordination.*
Note: All conditions use a single LLM call per agent role. The team pipeline uses 2–3× more LLM calls than the oracle. A compute-matched oracle comparison (oracle with 3 sequential self-calls) is ongoing — results pending.
To add your model to the leaderboard, submit a Pull Request to the TeamBench repository with your results JSON file in shared/ablation_results/.
For full leaderboard comparability, evaluate on the 90-task stratified subset (TeamBench-Verified covers 57 of 90 tasks that pass the four-pillar audit) with all 5 ablation conditions.
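The authoritative results schema lives in the repository; the sketch below only illustrates the general shape of such a file, and every field name and value is hypothetical:

```python
# Hypothetical results-file sketch. The real schema is defined in the
# TeamBench repository; all field names and values here are illustrative.
import json

results = {
    "model": "example-model-v1",      # hypothetical entry
    "org": "ExampleOrg",
    "subset": "stratified-90",        # the 90-task stratified subset
    "conditions": {                   # the five ablation conditions
        "solo": 0.0,
        "restricted": 0.0,
        "no_plan": 0.0,
        "no_eval": 0.0,
        "full": 0.0,
    },
    "date": "2026-05-06",
}

with open("shared/ablation_results/example-model-v1.json", "w") as f:
    json.dump(results, f, indent=2)
```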
TeamBench is evaluation infrastructure, not a deployment prescription. The three-role decomposition represents one natural factorization of understanding, implementation, and validation. We additionally evaluate four alternative topologies (Verify-First, Iterative, Dual-Exec, Self-Check) in a topology ablation. Future versions will add multi-round dialogue and router-worker patterns as first-class evaluation modes.
The oracle is an unrestricted single agent with full access to specification, workspace, and all tools. We tested two enhanced variants (Oracle-CoT and Oracle-2Pass) on the 28-task Mini subset; neither improved over the standard oracle. This establishes that team benefit is structural and cannot be recovered by prompt engineering alone. The oracle label refers to "unrestricted access," not a theoretical upper bound.
Near-zero performance for the seven open-source models we tested reflects tool-use infrastructure failures (malformed function calls, context overflow at 8K tokens), not model capability limitations. All were served via vLLM with max_model_len=8192. Models with native function-calling support and longer context windows should perform better. We are actively evaluating newer open-source models with improved tool-use capabilities.
Every template ships with a deterministic seeded generator that emits byte-identical workspaces from a fixed integer seed. Generators randomize surface parameters (variable names, config values, API field names, bug locations) while preserving structural complexity. Seeds 0–4 are used for public evaluation. Seeds 5 and above are reserved for the hidden leaderboard refresh; the specific values, names, and locations under those seeds have never been publicly evaluated, so a model that memorizes seed-0 solutions gains no advantage. A handful of GitHub-derived templates without seed parameterization serve as real-world anchors.
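A minimal sketch of the generator contract (illustrative, not an actual TeamBench template): all randomness flows from the seed, so a fixed seed yields a byte-identical workspace while different seeds vary only the surface parameters.

```python
# Minimal sketch of a deterministic seeded workspace generator. Parameter
# names and file content are illustrative, not a real TeamBench template.
import random

def generate_config(seed: int) -> bytes:
    rng = random.Random(seed)  # every random choice derives from the seed
    field = rng.choice(["api_key", "auth_token", "secret"])  # surface param
    port = rng.randint(1024, 65535)
    retries = rng.randint(1, 9)
    text = f'{field} = "REDACTED"\nport = {port}\nmax_retries = {retries}\n'
    return text.encode("utf-8")

assert generate_config(0) == generate_config(0)  # byte-identical per seed
assert generate_config(0) != generate_config(5)  # hidden seeds differ (w.h.p.)
```

Memorizing the seed-0 workspace therefore says nothing about the hidden-seed instances, which is the contamination-resistance property the leaderboard refresh relies on.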
The Mini subset is balanced across the 21 refined categories and uses parameterized generators. Optimizing for ~28 specific task structures is unlikely to transfer to the full 851-template / 931-instance benchmark. For high-stakes comparisons, we recommend the full evaluation or the hidden-seed track (seeds 5+), where the specific task instances have never been publicly evaluated.