1. Introduction & Background
Current AI agent benchmarks (e.g., SWE-bench, GAIA) focus on task execution accuracy or tool selection efficiency but lack metrics for autonomous tool evolution. While frameworks like OpenManus enable dynamic tool generation, no benchmark exists to quantify agents' abilities to iteratively self-improve tools through multi-cycle refinement. MABITE addresses this gap by introducing a metamorphic testing environment where creator agents must:
- Create tools for unseen tasks,
- Refine tools based on static execution feedback,
- Compose tools into higher-order workflows,
- Generalize across shifting domains, all while being evaluated against a fixed, non-adaptive executor (a minimal protocol sketch follows).
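The sketch below illustrates one possible shape of this protocol in Python. The names (`Task`, `CreatorAgent`, `FixedExecutor`, `evaluate`) and the cycle count are illustrative assumptions, not MABITE's actual API; the only properties it encodes are that the executor returns a score plus static execution feedback and never adapts between cycles.

```python
# Hypothetical creator/executor protocol sketch. Interfaces are placeholders,
# not part of MABITE's specification or any existing framework.
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    domain: str
    test_inputs: list = field(default_factory=list)


class CreatorAgent:
    """Generates and refines tool code; the only adaptive component."""

    def create_tools(self, task: Task) -> dict[str, str]:
        raise NotImplementedError  # e.g., an LLM producing {tool_name: source_code}

    def refine_tools(self, tools: dict[str, str], feedback: list[str]) -> dict[str, str]:
        raise NotImplementedError  # refinement driven only by executor feedback


class FixedExecutor:
    """Non-adaptive: runs submitted tool code verbatim and reports static feedback."""

    def run(self, tools: dict[str, str], task: Task) -> tuple[float, list[str]]:
        raise NotImplementedError  # returns (score, error/trace messages)


def evaluate(creator: CreatorAgent, executor: FixedExecutor, task: Task, cycles: int = 3) -> float:
    tools = creator.create_tools(task)            # cycle 0: tools for an unseen task
    score, feedback = executor.run(tools, task)
    for _ in range(cycles - 1):                   # later cycles: refine from feedback only
        tools = creator.refine_tools(tools, feedback)
        score, feedback = executor.run(tools, task)
    return score                                  # executor never changes between cycles
```

In a real run the creator would typically be LLM-backed and the executor a sandboxed runner; neither detail is assumed here.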
MABITE organizes tasks into complexity levels according to the tool composition they require:

| Level | Requirement | Example (Data Analysis Task) |
|-------|-------------|------------------------------|
| L1 | Single tool | CSV parser (no dependencies) |
| L2 | Linear chain | Parser → Stats calculator → Visualizer |
| L3 | Conditional branching | Parser → (Error detector → Corrector) → Visualizer |
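The levels above can be made concrete with a small composition sketch; it also anticipates the I/O-compatibility requirement described in Section 2.2. The `TypedTool` wrapper, `chain_compatible`, and the toy parser/stats/visualizer tools are hypothetical illustrations, not part of the benchmark specification.

```python
# Illustrative L2/L3 tool compositions for the data-analysis example above.
from typing import Any, Callable


class TypedTool:
    """Wraps a tool function with declared input/output types."""

    def __init__(self, fn: Callable[[Any], Any], in_type: type, out_type: type):
        self.fn, self.in_type, self.out_type = fn, in_type, out_type

    def __call__(self, x: Any) -> Any:
        assert isinstance(x, self.in_type), f"expected {self.in_type}, got {type(x)}"
        return self.fn(x)


def chain_compatible(*tools: TypedTool) -> bool:
    """L2 check: each tool's output type must feed the next tool's input type."""
    return all(a.out_type is b.in_type or issubclass(a.out_type, b.in_type)
               for a, b in zip(tools, tools[1:]))


# L2: Parser -> Stats calculator -> Visualizer (linear chain)
parser = TypedTool(lambda text: [row.split(",") for row in text.splitlines()], str, list)
stats = TypedTool(lambda rows: {"n_rows": len(rows)}, list, dict)
visualizer = TypedTool(lambda summary: f"bar chart of {summary}", dict, str)
assert chain_compatible(parser, stats, visualizer)


# L3: conditional branching (error detector -> corrector) between parser and visualizer
def l3_pipeline(text: str) -> str:
    rows = parser(text)
    if rows and any(len(r) != len(rows[0]) for r in rows):   # error detector
        rows = [r[:len(rows[0])] for r in rows]              # corrector: trim ragged rows
    return visualizer(stats(rows))
```

Verifying chains this way is one simple way to operationalize "verified I/O compatibility"; MABITE could equally use richer schema or contract checks.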
2. Novelty & Research Gaps
2.1 Limitations of Existing Work
| Benchmark | Focus | Gaps Addressed by MABITE |
|-----------|-------|--------------------------|
| SWE-bench | Code patch correctness | No tool creation; static tasks |
| BrowseComp | Web navigation persistence | Ignores tool synthesis |
| HumanEval | Function-level code generation | Lacks iterative refinement |
| OpenManus (Agent Flow) | Plan execution | Measures workflow success, not tool evolution |
2.2 MABITE’s Innovations
- Mandatory Tool Chaining: Tasks require ≥3 interdependent tools with verified I/O compatibility (e.g., CSV parser → stats calculator → visualizer).
- Metamorphic Task Generation: Dynamic zero-shot domain shifts (e.g., "financial analysis" → "bioinformatics") using CETBench-style transformations (see the sketch after this list).
- Phase-Decoupled Evaluation: Strict isolation between creator (tool generator) and fixed executor.
- Tool Mutation Resilience: Measures adaptation to API signature changes.
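To make the metamorphic and mutation mechanisms concrete, the sketch below shows one plausible form of each perturbation. The `TaskSpec` fields, the single hard-coded domain mapping, and `mutate_signature` are assumptions for illustration; CETBench-style transformations are only approximated, not reproduced.

```python
# Hedged sketch of two perturbations applied between evaluation cycles:
# a zero-shot domain shift and an API signature mutation. All names are hypothetical.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    domain: str   # e.g., "financial analysis"
    entity: str   # e.g., "ticker"
    metric: str   # e.g., "quarterly return"


def metamorphic_shift(task: TaskSpec) -> TaskSpec:
    """Keep the task structure, swap the surface domain (zero-shot shift)."""
    shifts = {
        "financial analysis": TaskSpec("bioinformatics", "gene_id", "expression level"),
    }
    return shifts.get(task.domain, task)


def mutate_signature(signature: str, rng: random.Random) -> str:
    """Resilience probe: rename one parameter in a published tool signature."""
    name, _, params = signature.partition("(")
    parts = params.rstrip(")").split(",")
    if parts and parts[0]:
        i = rng.randrange(len(parts))
        parts[i] = parts[i].strip() + "_v2"   # simulate an upstream API change
    return f"{name}({', '.join(parts)})"


rng = random.Random(0)
print(metamorphic_shift(TaskSpec("financial analysis", "ticker", "quarterly return")))
print(mutate_signature("parse_csv(path, delimiter)", rng))
```

A creator that handles these perturbations must regenerate or patch its tools without any change to the executor, which is the adaptation signal MABITE aims to score.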
3. Technical Framework
3.1 Dataset Design