1. Introduction & Background
Current AI agent benchmarks (e.g., SWE-bench, GAIA) focus on task execution accuracy or tool selection efficiency but lack metrics for autonomous tool evolution. While frameworks like OpenManus enable dynamic tool generation, no benchmark exists to quantify agents' abilities to iteratively self-improve tools through multi-cycle refinement. MABITE addresses this gap by introducing a metamorphic testing environment where creator agents must:
- Create tools for unseen tasks,
- Refine tools based on static execution feedback,
- Compose tools into higher-order workflows,
- Generalize across shifting domains, all while being evaluated against a fixed, non-adaptive executor (a minimal protocol sketch follows).
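The sketch below illustrates one possible shape of this protocol in Python. The names (`Task`, `CreatorAgent`, `FixedExecutor`, `evaluate`) and the cycle count are illustrative assumptions, not MABITE's actual API; the only properties it encodes are that the executor returns a score plus static execution feedback and never adapts between cycles.

```python
# Hypothetical creator/executor protocol sketch. Interfaces are placeholders,
# not part of MABITE's specification or any existing framework.
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    domain: str
    test_inputs: list = field(default_factory=list)


class CreatorAgent:
    """Generates and refines tool code; the only adaptive component."""

    def create_tools(self, task: Task) -> dict[str, str]:
        raise NotImplementedError  # e.g., an LLM producing {tool_name: source_code}

    def refine_tools(self, tools: dict[str, str], feedback: list[str]) -> dict[str, str]:
        raise NotImplementedError  # refinement driven only by executor feedback


class FixedExecutor:
    """Non-adaptive: runs submitted tool code verbatim and reports static feedback."""

    def run(self, tools: dict[str, str], task: Task) -> tuple[float, list[str]]:
        raise NotImplementedError  # returns (score, error/trace messages)


def evaluate(creator: CreatorAgent, executor: FixedExecutor, task: Task, cycles: int = 3) -> float:
    tools = creator.create_tools(task)            # cycle 0: tools for an unseen task
    score, feedback = executor.run(tools, task)
    for _ in range(cycles - 1):                   # later cycles: refine from feedback only
        tools = creator.refine_tools(tools, feedback)
        score, feedback = executor.run(tools, task)
    return score                                  # executor never changes between cycles
```

In a real run the creator would typically be LLM-backed and the executor a sandboxed runner; neither detail is assumed here.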
MABITE organizes tasks into complexity levels according to the tool composition they require:

| Level | Requirement | Example (Data Analysis Task) |
|-------|-------------|------------------------------|
| L1 | Single tool | CSV parser (no dependencies) |
| L2 | Linear chain | Parser → Stats calculator → Visualizer |
| L3 | Conditional branching | Parser → (Error detector → Corrector) → Visualizer |
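The levels above can be made concrete with a small composition sketch; it also anticipates the I/O-compatibility requirement described in Section 2.2. The `TypedTool` wrapper, `chain_compatible`, and the toy parser/stats/visualizer tools are hypothetical illustrations, not part of the benchmark specification.

```python
# Illustrative L2/L3 tool compositions for the data-analysis example above.
from typing import Any, Callable


class TypedTool:
    """Wraps a tool function with declared input/output types."""

    def __init__(self, fn: Callable[[Any], Any], in_type: type, out_type: type):
        self.fn, self.in_type, self.out_type = fn, in_type, out_type

    def __call__(self, x: Any) -> Any:
        assert isinstance(x, self.in_type), f"expected {self.in_type}, got {type(x)}"
        return self.fn(x)


def chain_compatible(*tools: TypedTool) -> bool:
    """L2 check: each tool's output type must feed the next tool's input type."""
    return all(a.out_type is b.in_type or issubclass(a.out_type, b.in_type)
               for a, b in zip(tools, tools[1:]))


# L2: Parser -> Stats calculator -> Visualizer (linear chain)
parser = TypedTool(lambda text: [row.split(",") for row in text.splitlines()], str, list)
stats = TypedTool(lambda rows: {"n_rows": len(rows)}, list, dict)
visualizer = TypedTool(lambda summary: f"bar chart of {summary}", dict, str)
assert chain_compatible(parser, stats, visualizer)


# L3: conditional branching (error detector -> corrector) between parser and visualizer
def l3_pipeline(text: str) -> str:
    rows = parser(text)
    if rows and any(len(r) != len(rows[0]) for r in rows):   # error detector
        rows = [r[:len(rows[0])] for r in rows]              # corrector: trim ragged rows
    return visualizer(stats(rows))
```

Verifying chains this way is one simple way to operationalize "verified I/O compatibility"; MABITE could equally use richer schema or contract checks.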
2. Novelty & Research Gaps
2.1 Limitations of Existing Work
| Benchmark | Focus | Gaps Addressed by MABITE |
|-----------|-------|--------------------------|
| SWE-bench | Code patch correctness | No tool creation; static tasks |
| BrowseComp | Web navigation persistence | Ignores tool synthesis |
| HumanEval | Function-level code generation | Lacks iterative refinement |
| OpenManus (Agent Flow) | Plan execution | Measures workflow success, not tool evolution |
2.2 MABITE’s Innovations
- Mandatory Tool Chaining: Tasks require ≥3 interdependent tools with verified I/O compatibility (e.g., CSV parser → stats calculator → visualizer).
- Metamorphic Task Generation: Dynamic zero-shot domain shifts (e.g., "financial analysis" → "bioinformatics") using CETBench-style transformations (see the sketch after this list).
- Phase-Decoupled Evaluation: Strict isolation between creator (tool generator) and fixed executor.
- Tool Mutation Resilience: Measures adaptation to API signature changes.
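To make the metamorphic and mutation mechanisms concrete, the sketch below shows one plausible form of each perturbation. The `TaskSpec` fields, the single hard-coded domain mapping, and `mutate_signature` are assumptions for illustration; CETBench-style transformations are only approximated, not reproduced.

```python
# Hedged sketch of two perturbations applied between evaluation cycles:
# a zero-shot domain shift and an API signature mutation. All names are hypothetical.
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    domain: str   # e.g., "financial analysis"
    entity: str   # e.g., "ticker"
    metric: str   # e.g., "quarterly return"


def metamorphic_shift(task: TaskSpec) -> TaskSpec:
    """Keep the task structure, swap the surface domain (zero-shot shift)."""
    shifts = {
        "financial analysis": TaskSpec("bioinformatics", "gene_id", "expression level"),
    }
    return shifts.get(task.domain, task)


def mutate_signature(signature: str, rng: random.Random) -> str:
    """Resilience probe: rename one parameter in a published tool signature."""
    name, _, params = signature.partition("(")
    parts = params.rstrip(")").split(",")
    if parts and parts[0]:
        i = rng.randrange(len(parts))
        parts[i] = parts[i].strip() + "_v2"   # simulate an upstream API change
    return f"{name}({', '.join(parts)})"


rng = random.Random(0)
print(metamorphic_shift(TaskSpec("financial analysis", "ticker", "quarterly return")))
print(mutate_signature("parse_csv(path, delimiter)", rng))
```

A creator that handles these perturbations must regenerate or patch its tools without any change to the executor, which is the adaptation signal MABITE aims to score.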
3. Technical Framework
3.1 Dataset Design