TeamBench is a benchmark for evaluating multi-agent LLM coordination on 851 software-engineering, data-engineering, and incident-response tasks. Roles run in isolated containers under five ablation conditions, so the contribution of each Planner, Executor, and Verifier is measured independently, and a Teamwork Necessity Index quantifies when a team is actually pulling its weight.
| # | Model | Org | Solo | Restricted | No Plan | No Eval | Full | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 35.6 | 33.3 | 35.6 | 33.3 | 37.8 | 2026-05-06 |
| 2 | GPT-5.4 Mini | OpenAI | 33.3 | 23.3 | 25.6 | 24.4 | 28.9 | 2026-05-06 |
| 3 | Claude Haiku 4.5 | Anthropic | 12.2 | 31.1 | 18.9 | 1.1 | 28.9 | 2026-05-06 |
| 4 | Gemini-3.1 Pro | Google | 27.8 | 22.2 | 16.7 | 25.6 | 28.9 | 2026-05-06 |
| 5 | Claude Sonnet 4.6 | Anthropic | 7.8 | 27.8 | 10.0 | 6.7 | 27.8 | 2026-05-06 |
| 6 | GPT-5.4 | OpenAI | 12.2 | 35.6 | 23.3 | 34.4 | 27.8 | 2026-05-06 |
| 7 | Gemma 4 31B | Google | 27.8 | 25.6 | 24.4 | 20.0 | 22.2 | 2026-05-06 |
| 8 | Gemini-3 Flash | Google | 13.3 | 18.9 | 14.4 | 27.8 | 25.6 | 2026-05-06 |
| 9 | Gemini-3.1 Flash Lite | Google | 5.6 | 21.1 | 8.9 | 17.8 | 17.8 | 2026-05-06 |
| 10 | gpt-oss-20b | OpenAI | 17.8 | 17.8 | 12.2 | 7.8 | 2.2 | 2026-05-06 |
| 11 | Qwen 3 14B | Alibaba | 5.6 | 2.2 | 2.2 | 1.1 | 2.2 | 2026-05-06 |
| 12 | Qwen 3 32B | Alibaba | 5.6 | 3.3 | 0.0 | 5.6 | 1.1 | 2026-05-06 |
| 13 | Qwen 3 8B | Alibaba | 2.2 | 5.6 | 1.1 | 3.3 | 0.0 | 2026-05-06 |
Existing benchmarks assign roles through system prompts that agents can ignore. TeamBench enforces boundaries at the OS level.
| Type | Benchmark | Structural Enf. | Role Ablation | Collab. Metric | Contam. Resist. | Cross-Model |
|---|---|---|---|---|---|---|
| SA | SWE-Bench | — | × | × | × | ✓ |
| SA | TerminalBench | — | × | × | × | ✓ |
| SA | LiveCodeBench | — | × | × | ✓ | ✓ |
| SA | GAIA | — | × | × | × | ✓ |
| SA | MLE-Bench | — | × | × | × | ✓ |
| MA | MultiAgentBench | × | × | ✓ | × | ✓ |
| MA | AgentCoder | × | ~ | × | × | ~ |
| MA | DevBench | × | × | × | × | ~ |
| MA | CAMEL | × | × | × | × | × |
| MA | GPTSwarm | × | × | × | × | × |
| MA | TeamBench (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
931 seeded evaluation instances drawn from 851 templates: 161 originally-authored coordination-required templates, 650 GitHub-derived bug fixes, 30 UCI data-science tasks, and 10 public post-mortems.
Each role runs in an isolated Docker container. No single container can access everything.
**Planner.** Reads the full specification (spec.md). Decomposes requirements, identifies hidden constraints and edge cases, and creates an execution plan. Cannot execute code or modify the workspace.
**Executor.** Generates the implementation from the Planner's instructions. Runs commands and modifies workspace code. Cannot access the full specification.
**Verifier.** Evaluates compliance against requirements and writes an attestation. Reads spec.md and the workspace (read-only). Cannot modify the workspace.
Container bind mounts physically enforce role boundaries.
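Concretely, one way such mounts could be wired, sketched here with the Docker Python SDK. The image name, task paths, and role commands are illustrative assumptions, not TeamBench's actual harness:

```python
# Sketch of OS-level role isolation via bind mounts. Image name and host
# paths are illustrative, not TeamBench's actual harness.
import docker

client = docker.from_env()
task = "/tasks/t42"  # hypothetical task directory

# Planner: sees the full spec read-only; the workspace is never mounted.
planner = client.containers.run(
    "teambench/agent:latest",
    command="run-role planner",
    volumes={f"{task}/spec.md": {"bind": "/spec.md", "mode": "ro"}},
    detach=True,
)

# Executor: read-write workspace; the spec file is simply not mounted,
# so ignoring the role prompt cannot reveal it.
executor = client.containers.run(
    "teambench/agent:latest",
    command="run-role executor",
    volumes={f"{task}/workspace": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)

# Verifier: spec and workspace both mounted read-only; it can attest
# but cannot patch code.
verifier = client.containers.run(
    "teambench/agent:latest",
    command="run-role verifier",
    volumes={
        f"{task}/spec.md": {"bind": "/spec.md", "mode": "ro"},
        f"{task}/workspace": {"bind": "/workspace", "mode": "ro"},
    },
    detach=True,
)
```

Because the boundary is the mount table rather than the prompt, a role that "decides" to step outside its contract simply finds nothing there to read or write.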
Step through a real task across four conditions. See why the oracle fails where the full team succeeds.
Fix all security vulnerabilities in the authentication module. The specification lists five issues of varying severity, some subtle enough to be missed without careful spec analysis. Fixes must not introduce regressions.
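For flavor, a hypothetical example of the kind of subtle issue such a specification might list (not an actual TeamBench task file): a token comparison with `==` leaks timing information, and the fix is a constant-time compare.

```python
# Hypothetical auth-module snippet, illustrative only.
import hmac

def check_token(supplied: str, expected: str) -> bool:
    # Vulnerable version: `return supplied == expected` short-circuits on the
    # first differing byte, leaking timing information to an attacker.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```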
Team benefit concentrates on the hardest tasks and weakest models.
**49% false-accept.** On grader-failing runs in the 27-configuration role-mixing pool, LLM Verifiers approve 49% of submissions that fail the deterministic grader. Removing the Verifier improves mean partial score in the reference ablation: the Verifier, not the Planner, is the binding constraint.
**3.6x role collapse.** Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6x more cases where the Verifier rewrites the Executor's code. Pass rate alone hides role collapse.
**+15.7 pp on Q1.** Team uplift concentrates where Solo struggles: the lowest Solo-score quintile gains +15.7 points, while the top quintile loses points to coordination overhead. Mean team-vs-Solo uplift across the full 155-task reference pool is +0.5 points (p = 0.20).
**+2.4 pp planning.** Adding the Planner to the No-Plan team raises mean partial score by +2.4 points, concentrated on tasks whose specifications contain decision rules the Executor cannot infer from the brief or workspace alone.
**PGEAVA: 26.7%.** Mixed-provider teams improve cost-performance: PGEAVA (Google Planner, Anthropic Executor, Anthropic Verifier) reaches 26.7% at $20.52, beating the all-Anthropic team (22.7%, $39.58) by 4 points at roughly half the cost.
**40 sessions.** Human baseline: 40 sessions, 21 tasks, and 18 distinct participants under matched role separation. Solo participants work through tasks directly, hybrid sessions often collapse into quick approval, and human teams spend more effort coordinating missing information across roles. MIT COUHES exempt determination.
The role-mixing pool covers every assignment of three providers to three roles: Anthropic, Google, and OpenAI, with one compact frontier model per family. That yields 27 configurations, run on 25 stratified tasks across 3 seeds for 2,025 deduplicated runs; the naming scheme is sketched below.
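The configuration codes follow a P?E?V? scheme (A = Anthropic, G = Google, O = OpenAI), so PGEAVA means Google Planner, Anthropic Executor, Anthropic Verifier. A minimal sketch of the enumeration and run count:

```python
# Enumerate the 27 role-mixing configurations and their P?E?V? codes.
from itertools import product

PROVIDERS = {"A": "Anthropic", "G": "Google", "O": "OpenAI"}

configs = [
    (f"P{p}E{e}V{v}", PROVIDERS[p], PROVIDERS[e], PROVIDERS[v])
    for p, e, v in product(PROVIDERS, repeat=3)
]
assert len(configs) == 27  # 3 providers ** 3 roles

# 27 configs x 25 stratified tasks x 3 seeds = 2,025 runs
print(len(configs) * 25 * 3)  # -> 2025
```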
| # | Code | Planner | Executor | Verifier | Pass | Partial | Cost | Pass/$ | Turns |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PGEAVA | Google | Anthropic | Anthropic | 26.7% | 0.695 | $20.52 | 0.013 | 54.1 |
| 2 | POEOVA | OpenAI | OpenAI | Anthropic | 22.7% | 0.679 | $10.98 | 0.021 | 30.9 |
| 3 | PGEAVO | Google | Anthropic | OpenAI | 22.7% | 0.672 | $11.77 | 0.019 | 49.8 |
| 4 | PAEAVA | Anthropic | Anthropic | Anthropic | 22.7% | 0.639 | $39.58 | 0.006 | 54.3 |
| 5 | POEGVG | OpenAI | Google | Google | 21.3% | 0.612 | $3.73 | 0.057 | 36.2 |
| 6 | PAEAVO | Anthropic | Anthropic | OpenAI | 21.3% | 0.650 | $29.88 | 0.007 | 45.0 |
| 7 | POEGVO | OpenAI | Google | OpenAI | 20.0% | 0.600 | $2.99 | 0.067 | 31.3 |
| 8 | PAEOVO | Anthropic | OpenAI | OpenAI | 20.0% | 0.639 | $9.53 | 0.021 | 37.2 |
| 9 | POEOVO | OpenAI | OpenAI | OpenAI | 18.7% | 0.592 | $2.09 | 0.089 | 25.2 |
| 10 | PGEOVO | Google | OpenAI | OpenAI | 18.7% | 0.627 | $2.36 | 0.079 | 38.3 |
| 11 | PAEGVO | Anthropic | Google | OpenAI | 18.7% | 0.583 | $12.45 | 0.015 | 46.7 |
| 12 | PAEGVA | Anthropic | Google | Anthropic | 18.7% | 0.618 | $21.62 | 0.009 | 51.7 |
| 13 | POEAVO | OpenAI | Anthropic | OpenAI | 17.3% | 0.448 | $6.15 | 0.028 | 19.9 |
| 14 | POEAVG | OpenAI | Anthropic | Google | 17.3% | 0.466 | $6.99 | 0.025 | 23.5 |
| 15 | PAEGVG | Anthropic | Google | Google | 17.3% | 0.605 | $13.34 | 0.013 | 50.9 |
| 16 | PAEAVG | Anthropic | Anthropic | Google | 17.3% | 0.646 | $29.48 | 0.006 | 51.6 |
| 17 | PGEGVO | Google | Google | OpenAI | 16.0% | 0.557 | $3.67 | 0.044 | 45.3 |
| 18 | PAEOVA | Anthropic | OpenAI | Anthropic | 16.0% | 0.602 | $18.31 | 0.009 | 44.9 |
| 19 | PGEOVG | Google | OpenAI | Google | 14.7% | 0.559 | $3.33 | 0.044 | 45.7 |
| 20 | PGEAVG | Google | Anthropic | Google | 14.7% | 0.496 | $8.13 | 0.018 | 35.6 |
| 21 | PAEOVG | Anthropic | OpenAI | Google | 14.7% | 0.617 | $10.87 | 0.013 | 43.9 |
| 22 | POEAVA | OpenAI | Anthropic | Anthropic | 14.7% | 0.437 | $12.41 | 0.012 | 24.2 |
| 23 | PGEGVG | Google | Google | Google | 13.3% | 0.570 | $4.98 | 0.027 | 54.6 |
| 24 | POEGVA | OpenAI | Google | Anthropic | 13.3% | 0.399 | $7.98 | 0.017 | 24.5 |
| 25 | POEOVG | OpenAI | OpenAI | Google | 12.0% | 0.602 | $2.66 | 0.045 | 29.5 |
| 26 | PGEOVA | Google | OpenAI | Anthropic | 10.7% | 0.317 | $7.43 | 0.014 | 32.0 |
| 27 | PGEGVA | Google | Google | Anthropic | 10.7% | 0.371 | $8.11 | 0.013 | 37.0 |
No-Verify and Full Team tie on mean partial score (0.592), and No-Verify outperforms Full Team on most individual tasks, indicating that the Verifier role adds overhead rather than correctness.
*Figure: Full Team minus Oracle pass rate (%), averaged per category, on the TeamBench-Mini subset.*
*Figure: Oracle vs. Full Team per-task scatter; each dot is one task, and points above the diagonal are tasks where the team wins.*
*Figure: Per-task component contribution; four quadrants show where each role helps or hurts.*
*Figure: Average partial score across 147 tasks (Gemini 3 Flash); the hardest tasks benefit most from team coordination.*
Note: All conditions use a single LLM call per agent role. The team pipeline uses 2–3× more LLM calls than the oracle. A compute-matched oracle comparison (oracle with 3 sequential self-calls) is ongoing — results pending.
To add your model to the leaderboard, submit a Pull Request to the TeamBench repository with your results JSON file in shared/ablation_results/.
For full leaderboard comparability, evaluate on the 90-task stratified subset (TeamBench-Verified covers 57 of 90 tasks that pass the four-pillar audit) with all 5 ablation conditions.
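The authoritative results schema lives in the repository; the sketch below only illustrates the general shape of such a file, and every field name and value is hypothetical:

```python
# Hypothetical results-file sketch. The real schema is defined in the
# TeamBench repository; all field names and values here are illustrative.
import json

results = {
    "model": "example-model-v1",      # hypothetical entry
    "org": "ExampleOrg",
    "subset": "stratified-90",        # the 90-task stratified subset
    "conditions": {                   # the five ablation conditions
        "solo": 0.0,
        "restricted": 0.0,
        "no_plan": 0.0,
        "no_eval": 0.0,
        "full": 0.0,
    },
    "date": "2026-05-06",
}

with open("shared/ablation_results/example-model-v1.json", "w") as f:
    json.dump(results, f, indent=2)
```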
TeamBench is evaluation infrastructure, not a deployment prescription. The three-role decomposition represents one natural factorization of understanding, implementation, and validation. We additionally evaluate four alternative topologies (Verify-First, Iterative, Dual-Exec, Self-Check) in a topology ablation. Future versions will add multi-round dialogue and router-worker patterns as first-class evaluation modes.
The oracle is an unrestricted single agent with full access to specification, workspace, and all tools. We tested two enhanced variants (Oracle-CoT and Oracle-2Pass) on the 28-task Mini subset; neither improved over the standard oracle. This establishes that team benefit is structural and cannot be recovered by prompt engineering alone. The oracle label refers to "unrestricted access," not a theoretical upper bound.
Near-zero performance for the seven open-source models we tested reflects tool-use infrastructure failures (malformed function calls, context overflow at 8K tokens), not model capability limitations. All were served via vLLM with max_model_len=8192. Models with native function-calling support and longer context windows should perform better. We are actively evaluating newer open-source models with improved tool-use capabilities.
Every template ships with a deterministic seeded generator that emits byte-identical workspaces from a fixed integer seed. Generators randomize surface parameters (variable names, config values, API field names, bug locations) while preserving structural complexity. Seeds 0–4 are used for public evaluation. Seeds 5 and above are reserved for the hidden leaderboard refresh; the specific values, names, and locations under those seeds have never been publicly evaluated, so a model that memorizes seed-0 solutions gains no advantage. A handful of GitHub-derived templates without seed parameterization serve as real-world anchors.
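A minimal sketch of the generator contract (illustrative, not an actual TeamBench template): all randomness flows from the seed, so a fixed seed yields a byte-identical workspace while different seeds vary only the surface parameters.

```python
# Minimal sketch of a deterministic seeded workspace generator. Parameter
# names and file content are illustrative, not a real TeamBench template.
import random

def generate_config(seed: int) -> bytes:
    rng = random.Random(seed)  # every random choice derives from the seed
    field = rng.choice(["api_key", "auth_token", "secret"])  # surface param
    port = rng.randint(1024, 65535)
    retries = rng.randint(1, 9)
    text = f'{field} = "REDACTED"\nport = {port}\nmax_retries = {retries}\n'
    return text.encode("utf-8")

assert generate_config(0) == generate_config(0)  # byte-identical per seed
assert generate_config(0) != generate_config(5)  # hidden seeds differ (w.h.p.)
```

Memorizing the seed-0 workspace therefore says nothing about the hidden-seed instances, which is the contamination-resistance property the leaderboard refresh relies on.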
The Mini subset is balanced across the 21 refined categories and uses parameterized generators. Optimizing for ~28 specific task structures is unlikely to transfer to the full 851-template / 931-instance benchmark. For high-stakes comparisons, we recommend the full evaluation or the hidden-seed track (seeds 5+), where the specific task instances have never been publicly evaluated.