TeamBench

TeamBench is a benchmark for evaluating multi-agent LLM coordination on 851 software-engineering, data-engineering, and incident-response task templates. Roles run in isolated containers under five ablation conditions, so the marginal contribution of the Planner, Executor, and Verifier is measured independently, and a Teamwork Necessity Index (TNI) quantifies when a team is actually pulling its weight.

851 Templates · 931 Instances · 19 Categories · 5 Conditions

Leaderboard

| # | Model | Org | Solo | Restricted | No Plan | No Eval | Full | Date |
|---|-------|-----|------|------------|---------|---------|------|------|
| 1 | Claude Opus 4.7 | Anthropic | 35.6 | 33.3 | 35.6 | 33.3 | 37.8 | 2026-05-06 |
| 2 | GPT-5.4 Mini | OpenAI | 33.3 | 23.3 | 25.6 | 24.4 | 28.9 | 2026-05-06 |
| 3 | Claude Haiku 4.5 | Anthropic | 12.2 | 31.1 | 18.9 | 1.1 | 28.9 | 2026-05-06 |
| 4 | Gemini-3.1 Pro | Google | 27.8 | 22.2 | 16.7 | 25.6 | 28.9 | 2026-05-06 |
| 5 | Claude Sonnet 4.6 | Anthropic | 7.8 | 27.8 | 10.0 | 6.7 | 27.8 | 2026-05-06 |
| 6 | GPT-5.4 | OpenAI | 12.2 | 35.6 | 23.3 | 34.4 | 27.8 | 2026-05-06 |
| 7 | Gemma 4 31B | Google | 27.8 | 25.6 | 24.4 | 20.0 | 22.2 | 2026-05-06 |
| 8 | Gemini-3 Flash | Google | 13.3 | 18.9 | 14.4 | 27.8 | 25.6 | 2026-05-06 |
| 9 | Gemini-3.1 Flash Lite | Google | 5.6 | 21.1 | 8.9 | 17.8 | 17.8 | 2026-05-06 |
| 10 | gpt-oss-20b | OpenAI | 17.8 | 17.8 | 12.2 | 7.8 | 2.2 | 2026-05-06 |
| 11 | Qwen 3 14B | Alibaba | 5.6 | 2.2 | 2.2 | 1.1 | 2.2 | 2026-05-06 |
| 12 | Qwen 3 32B | Alibaba | 5.6 | 3.3 | 0.0 | 5.6 | 1.1 | 2026-05-06 |
| 13 | Qwen 3 8B | Alibaba | 2.2 | 5.6 | 1.1 | 3.3 | 0.0 | 2026-05-06 |
13 models on TeamBench-90, pass rate (%), sorted by max(Solo, Full); in the paper's corresponding leaderboard table, the highest condition per row is bolded. Pass rate uses the attestation-promotion rule.

The First Structurally Enforced Multi-Agent Benchmark

Existing benchmarks assign roles through system prompts that agents can ignore. TeamBench enforces boundaries at the OS level.

The paper compares TeamBench against single-agent benchmarks (SWE-Bench, TerminalBench, LiveCodeBench, GAIA, MLE-Bench) and multi-agent frameworks (MultiAgentBench, AgentCoder, DevBench, CAMEL, GPTSwarm) along five axes: structural enforcement, role ablation, a collaboration metric, contamination resistance, and cross-model evaluation. TeamBench is the only benchmark in the comparison that provides all five; the full comparison table is in the paper.

851 Templates Across 19 Base Categories

The 851 templates expand into 931 seeded evaluation instances: 161 originally authored coordination-required templates, 650 GitHub-derived bug fixes, 30 UCI data-science tasks, and 10 public post-mortems.

Template distribution charts break the pool down by category, domain, and difficulty.

Three-Role Pipeline with Container Isolation

Each role runs in an isolated Docker container. No single container can access everything.

1. Planner: Reads the full specification (spec.md). Decomposes requirements, identifies hidden constraints and edge cases, and creates an execution plan. Cannot execute code or modify the workspace.

2. Executor: Generates the implementation from the Planner's instructions. Runs commands and modifies workspace code. Cannot access the full specification.

3. Verifier: Evaluates compliance against the requirements and writes an attestation. Reads spec.md and the workspace (read-only). Cannot modify the workspace.

Structural Enforcement, Not Prompt Compliance

Container bind mounts physically enforce role boundaries.
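As an illustration of what bind-mount enforcement can look like, the sketch below launches one container per role with docker run. The paths, image name, and mount layout are illustrative assumptions, not TeamBench's actual configuration.

```python
import subprocess

# Illustrative per-role bind mounts (paths and image are assumptions, not the
# harness's real layout): the Planner mounts only the spec, read-only; the
# Executor gets a writable workspace plus the plan but never the spec; the
# Verifier mounts spec and workspace, both read-only.
ROLE_MOUNTS = {
    "planner":  ["-v", "/host/task/spec:/task/spec:ro"],
    "executor": ["-v", "/host/task/workspace:/task/workspace:rw",
                 "-v", "/host/task/plan:/task/plan:ro"],
    "verifier": ["-v", "/host/task/spec:/task/spec:ro",
                 "-v", "/host/task/workspace:/task/workspace:ro"],
}

def launch(role: str, image: str = "teambench/agent:latest") -> None:
    # Each role runs in its own container; what it can read or write is decided
    # by the mount flags, not by anything in the prompt.
    cmd = ["docker", "run", "--rm", *ROLE_MOUNTS[role], image, "--role", role]
    subprocess.run(cmd, check=True)
```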

  • Information asymmetry by design — Specification contains critical details absent from the brief
  • Five ablation conditions — Systematically remove roles to isolate marginal contribution
  • Parameterized generators — every template has a deterministic seeded generator that emits byte-identical workspaces from a fixed integer seed
  • Partial scoring — Deterministic graders produce scores in [0, 1]
TNI = (S_team − S_restricted) / max(ε, S_oracle − S_restricted)
TNI = 1.0 means the team fully recovers the oracle score; TNI > 1.0 means the team exceeds the oracle ceiling.
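For a concrete reading of the formula, here is a minimal sketch; the ε default and the example scores are illustrative, and harness.compute_tni in the repository is the authoritative implementation.

```python
def tni(s_team: float, s_restricted: float, s_oracle: float, eps: float = 1e-9) -> float:
    """Teamwork Necessity Index: the fraction of the oracle-vs-restricted gap
    that the full team recovers."""
    return (s_team - s_restricted) / max(eps, s_oracle - s_restricted)

# Illustrative scores: restricted 0.40, oracle 0.80, team 0.70.
# The team recovers 0.30 of a 0.40 gap, so TNI is roughly 0.75.
print(tni(0.70, 0.40, 0.80))
```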

Demo

Step through a real task across four conditions. See why the oracle fails where the full team succeeds.

ITER1 Security · Iterative

Fix all security vulnerabilities in the authentication module. The specification lists 5 issues of varying severity, some subtle enough to be missed without careful spec analysis. Fixes must not introduce new regressions.

Planner: read_file(spec.md)
Executor: edit_file · run_command · read_file(plan.md)
Verifier: read_file · grep (read-only)
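One way to picture the per-role tool access above is as a whitelist keyed by role. The dictionary below is a hypothetical sketch using the tool names listed in the demo, not the harness's actual configuration format.

```python
# Hypothetical per-role tool whitelist mirroring the demo above.
ROLE_TOOLS = {
    "planner":  ["read_file"],                              # may read spec.md only
    "executor": ["edit_file", "run_command", "read_file"],  # reads plan.md, never spec.md
    "verifier": ["read_file", "grep"],                      # strictly read-only
}

def allowed(role: str, tool: str) -> bool:
    return tool in ROLE_TOOLS.get(role, [])

assert allowed("executor", "run_command") and not allowed("verifier", "edit_file")
```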

When Does Multi-Agent Collaboration Help?

Team benefit concentrates on the hardest tasks and weakest models.

Verifier False-Accept

In the 27-configuration role-mixing pool, LLM Verifiers approve 49% of submissions that the deterministic grader fails. Removing the Verifier improves mean partial score in the reference ablation; the Verifier, not the Planner, is the binding constraint.

49% false-accept

Prompt-Only vs Enforced

Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6x more cases where the Verifier rewrites the Executor's code. Pass rate alone hides role collapse.

3.6x role collapse

Conditional Team Value

Team uplift is concentrated where Solo struggles. Lowest Solo-score quintile gains +15.7 points; the top quintile loses points to coordination overhead. Mean team-vs-Solo uplift across the full 155-task reference pool is +0.5 points (p = 0.20).

+15.7 pp on Q1

Planner Helps on Hard Tasks

Adding the Planner to the No-Plan team raises mean partial score by +2.4 points, concentrated on tasks whose specifications contain decision rules the Executor cannot infer from the brief or workspace alone.

+2.4 pp planning

27-Config Role-Mixing

Mixed-provider teams improve cost-performance. PGEAVA (Google Planner / Anthropic Executor / Anthropic Verifier) reaches 26.7% at $20.52, beating the all-Anthropic team (22.7%, $39.58) by +4 points at roughly half the cost.

PGEAVA: 26.7%

Human Pilot Mirrors Findings

40 sessions, 21 tasks, and 18 distinct participants under matched role separation. Solo participants work through tasks directly, hybrid sessions often collapse into quick approval, and human teams spend more effort coordinating missing information across roles. The study ran under an MIT COUHES exempt determination.

40 sessions

Heterogeneous Role-Mixing

Every assignment of three providers (Anthropic, Google, OpenAI) to the three roles, with one compact frontier model per family: 27 configurations on 25 stratified tasks across 3 seeds, for 2,025 deduplicated runs.
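The 27 configurations are simply the Cartesian product of the three providers over the three roles. The snippet below reproduces the configuration codes used in the table; it is a sketch of the enumeration only, not the harness's run logic.

```python
from itertools import product

PROVIDERS = {"A": "Anthropic", "G": "Google", "O": "OpenAI"}

# Codes follow the pattern P<planner>E<executor>V<verifier>,
# e.g. PGEAVA = Google Planner, Anthropic Executor, Anthropic Verifier.
configs = [
    (f"P{p}E{e}V{v}", PROVIDERS[p], PROVIDERS[e], PROVIDERS[v])
    for p, e, v in product("AGO", repeat=3)
]

assert len(configs) == 27
# 27 configurations x 25 stratified tasks x 3 seeds = 2,025 runs.
```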

| # | Code | Planner | Executor | Verifier | Pass | Partial | Cost | Pass/$ | Turns |
|---|------|---------|----------|----------|------|---------|------|--------|-------|
| 1 | PGEAVA | Google | Anthropic | Anthropic | 26.7% | 0.695 | $20.52 | 0.013 | 54.1 |
| 2 | POEOVA | OpenAI | OpenAI | Anthropic | 22.7% | 0.679 | $10.98 | 0.021 | 30.9 |
| 3 | PGEAVO | Google | Anthropic | OpenAI | 22.7% | 0.672 | $11.77 | 0.019 | 49.8 |
| 4 | PAEAVA | Anthropic | Anthropic | Anthropic | 22.7% | 0.639 | $39.58 | 0.006 | 54.3 |
| 5 | POEGVG | OpenAI | Google | Google | 21.3% | 0.612 | $3.73 | 0.057 | 36.2 |
| 6 | PAEAVO | Anthropic | Anthropic | OpenAI | 21.3% | 0.650 | $29.88 | 0.007 | 45.0 |
| 7 | POEGVO | OpenAI | Google | OpenAI | 20.0% | 0.600 | $2.99 | 0.067 | 31.3 |
| 8 | PAEOVO | Anthropic | OpenAI | OpenAI | 20.0% | 0.639 | $9.53 | 0.021 | 37.2 |
| 9 | POEOVO | OpenAI | OpenAI | OpenAI | 18.7% | 0.592 | $2.09 | 0.089 | 25.2 |
| 10 | PGEOVO | Google | OpenAI | OpenAI | 18.7% | 0.627 | $2.36 | 0.079 | 38.3 |
| 11 | PAEGVO | Anthropic | Google | OpenAI | 18.7% | 0.583 | $12.45 | 0.015 | 46.7 |
| 12 | PAEGVA | Anthropic | Google | Anthropic | 18.7% | 0.618 | $21.62 | 0.009 | 51.7 |
| 13 | POEAVO | OpenAI | Anthropic | OpenAI | 17.3% | 0.448 | $6.15 | 0.028 | 19.9 |
| 14 | POEAVG | OpenAI | Anthropic | Google | 17.3% | 0.466 | $6.99 | 0.025 | 23.5 |
| 15 | PAEGVG | Anthropic | Google | Google | 17.3% | 0.605 | $13.34 | 0.013 | 50.9 |
| 16 | PAEAVG | Anthropic | Anthropic | Google | 17.3% | 0.646 | $29.48 | 0.006 | 51.6 |
| 17 | PGEGVO | Google | Google | OpenAI | 16.0% | 0.557 | $3.67 | 0.044 | 45.3 |
| 18 | PAEOVA | Anthropic | OpenAI | Anthropic | 16.0% | 0.602 | $18.31 | 0.009 | 44.9 |
| 19 | PGEOVG | Google | OpenAI | Google | 14.7% | 0.559 | $3.33 | 0.044 | 45.7 |
| 20 | PGEAVG | Google | Anthropic | Google | 14.7% | 0.496 | $8.13 | 0.018 | 35.6 |
| 21 | PAEOVG | Anthropic | OpenAI | Google | 14.7% | 0.617 | $10.87 | 0.013 | 43.9 |
| 22 | POEAVA | OpenAI | Anthropic | Anthropic | 14.7% | 0.437 | $12.41 | 0.012 | 24.2 |
| 23 | PGEGVG | Google | Google | Google | 13.3% | 0.570 | $4.98 | 0.027 | 54.6 |
| 24 | POEGVA | OpenAI | Google | Anthropic | 13.3% | 0.399 | $7.98 | 0.017 | 24.5 |
| 25 | POEOVG | OpenAI | OpenAI | Google | 12.0% | 0.602 | $2.66 | 0.045 | 29.5 |
| 26 | PGEOVA | Google | OpenAI | Anthropic | 10.7% | 0.317 | $7.43 | 0.014 | 32.0 |
| 27 | PGEGVA | Google | Google | Anthropic | 10.7% | 0.371 | $8.11 | 0.013 | 37.0 |

Five Conditions to Decompose Team Value

No-Verify and Full Team tie on average (0.592), with No-Verify outperforming Full Team on most individual tasks — indicating the Verifier role adds overhead rather than correctness.

Results dashboard panels:

  • Team Uplift by Category: Full Team minus Oracle, averaged per category
  • Full Team Score by Model: pass rate (%), Oracle vs Full Team on the TeamBench-Mini subset
  • Oracle vs Team Score (per task): each dot is one task; points above the diagonal are tasks where the team wins
  • Planning vs Evaluation Value: per-task component contribution; four quadrants show where each role helps or hurts
  • Team Uplift by Oracle Quantile: the hardest tasks benefit most from team coordination
  • Ablation Condition Scores: average partial score across 147 tasks (Gemini 3 Flash)

Note: All conditions use a single LLM call per agent role. The team pipeline uses 2–3× more LLM calls than the oracle. A compute-matched oracle comparison (oracle with 3 sequential self-calls) is ongoing — results pending.


Get Started

Quick Start

# Clone the repository
git clone https://github.com/ybkim95/TeamBench.git && cd TeamBench
pip install -e .

# Set provider keys (see .env.example)
export ANTHROPIC_API_KEY=... OPENAI_API_KEY=... GEMINI_API_KEY=...

# Run a single task under the 5-condition ablation
python -m harness.ablation --task DIST1_queue_race --model gemini-3-flash-preview --seed 0

# Run the LB100 stratified leaderboard subset
bash scripts/run_all_opensource_100_ablation.sh

# Compute TNI and paper tables
python -m harness.compute_tni --runs-dir shared/runs/example
python -m harness.paper_tables --out shared/paper/

Leaderboard Submission

To add your model to the leaderboard, submit a Pull Request to the TeamBench repository with your results JSON file in shared/ablation_results/. For full leaderboard comparability, evaluate on the 90-task stratified subset (TeamBench-Verified covers 57 of 90 tasks that pass the four-pillar audit) with all 5 ablation conditions.


FAQ

Why only one team decomposition (Planner-Executor-Verifier)?

TeamBench is evaluation infrastructure, not a deployment prescription. The three-role decomposition represents one natural factorization of understanding, implementation, and validation. We additionally evaluate four alternative topologies (Verify-First, Iterative, Dual-Exec, Self-Check) in a topology ablation. Future versions will add multi-round dialogue and router-worker patterns as first-class evaluation modes.

Is the "oracle" baseline fairly optimized?

The oracle is an unrestricted single agent with full access to specification, workspace, and all tools. We tested two enhanced variants (Oracle-CoT and Oracle-2Pass) on the 28-task Mini subset; neither improved over the standard oracle. This establishes that team benefit is structural and cannot be recovered by prompt engineering alone. The oracle label refers to "unrestricted access," not a theoretical upper bound.

Why do open-source models score near zero?

Near-zero performance for the seven open-source models we tested reflects tool-use infrastructure failures (malformed function calls, context overflow at 8K tokens), not model capability limitations. All were served via vLLM with max_model_len=8192. Models with native function-calling support and longer context windows should perform better. We are actively evaluating newer open-source models with improved tool-use capabilities.

How contamination-resistant are the parameterized generators?

Every template ships with a deterministic seeded generator that emits byte-identical workspaces from a fixed integer seed. Generators randomize surface parameters (variable names, config values, API field names, bug locations) while preserving structural complexity. Seeds 0–4 are used for public evaluation. Seeds 5 and above are reserved for the hidden leaderboard refresh; the specific values, names, and locations under those seeds have never been publicly evaluated, so a model that memorizes seed-0 solutions gains no advantage. A handful of GitHub-derived templates without seed parameterization serve as real-world anchors.
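As a sketch of how a seeded generator of this kind can stay byte-deterministic, the toy example below derives every surface parameter from a single random.Random stream keyed on the template ID and seed. The parameter names and value ranges are illustrative, not TeamBench's.

```python
import random

def generate_instance(template_id: str, seed: int) -> dict:
    # One RNG stream per (template, seed): the same pair always yields the same
    # surface parameters, so a workspace can be regenerated byte-for-byte.
    rng = random.Random(f"{template_id}:{seed}")
    return {
        "config_key": rng.choice(["max_retries", "retry_limit", "attempt_cap"]),
        "timeout_s": rng.randint(5, 60),
        "buggy_function": rng.choice(["parse_token", "refresh_session", "hash_password"]),
    }

# Re-running with the same seed reproduces the instance exactly.
assert generate_instance("DIST1_queue_race", 0) == generate_instance("DIST1_queue_race", 0)
```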

Can labs optimize on the Mini subset to inflate leaderboard scores?

The Mini subset is balanced across the 21 refined categories and uses parameterized generators. Optimizing for ~28 specific task structures is unlikely to transfer to the full 851-template / 931-instance benchmark. For high-stakes comparisons, we recommend the full evaluation or the hidden-seed track (seeds 5+), where the specific task instances have never been publicly evaluated.


Authors

Yubin Kim1,2  ·  Chanwoo Park1  ·  Taehan Kim4  ·  Eugene Park1  ·  Samuel Schmidgall3  ·  Salman Rahman2  ·  Chunjong Park3  ·  Cynthia Breazeal1  ·  Xin Liu2  ·  Hamid Palangi2  ·  Hae Won Park1  ·  Daniel McDuff2
1MIT  ·  2Google Research  ·  3Google DeepMind  ·  4Independent Researcher