1. Introduction & Background

Current AI agent benchmarks (e.g., SWE-bench, GAIA) focus on task execution accuracy or tool selection efficiency but lack metrics for autonomous tool evolution. While frameworks like OpenManus enable dynamic tool generation, no benchmark exists to quantify agents' abilities to iteratively self-improve tools through multi-cycle refinement. MABITE addresses this gap by introducing a metamorphic testing environment in which creator agents must construct and iteratively refine tool chains of increasing structural complexity, organized into the levels below (a brief sketch of these chain structures follows the table):


Level   Requirement              Example (Data Analysis Task)
L1      Single tool              CSV parser (no dependencies)
L2      Linear chain             Parser → Stats calculator → Visualizer
L3      Conditional branching    Parser → (Error detector → Corrector) → Visualizer
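To make the level definitions concrete, the sketch below shows one way the L1-L3 chain structures and a multi-cycle refinement loop could be represented. It is an illustrative assumption rather than part of the MABITE specification: ToolSpec, ToolChain, execute, and refinement_cycles are hypothetical names introduced here, and the score and refine callables stand in for whatever evaluation and creator-agent rewriting MABITE actually prescribes.

# Hypothetical sketch only: ToolSpec, ToolChain, and refinement_cycles are
# illustrative names, not part of the MABITE specification.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolSpec:
    """A single agent-created tool: a name plus a callable payload."""
    name: str
    run: Callable[[object], object]


@dataclass
class ToolChain:
    """A composition of tools; branch_on_error models L3-style branching."""
    tools: list[ToolSpec]
    branch_on_error: Optional[ToolSpec] = None  # e.g., an error corrector

    def execute(self, data: object) -> object:
        for tool in self.tools:
            try:
                data = tool.run(data)
            except Exception:
                if self.branch_on_error is None:
                    raise
                # L3: a detected error routes through the corrector, then continues.
                data = self.branch_on_error.run(data)
        return data


# L1: a single tool (CSV parser with no dependencies).
l1 = ToolChain(tools=[ToolSpec("csv_parser", lambda raw: raw.splitlines())])

# L2: a linear chain, Parser -> Stats calculator -> Visualizer.
l2 = ToolChain(tools=[
    ToolSpec("parser", lambda raw: [r.split(",") for r in raw.splitlines()]),
    ToolSpec("stats", lambda rows: {"n_rows": len(rows)}),
    ToolSpec("visualizer", lambda stats: f"bar chart of {stats}"),
])


def refinement_cycles(chain: ToolChain, task_input: object,
                      score: Callable[[object], float],
                      refine: Callable[[ToolChain, float], ToolChain],
                      max_cycles: int = 3) -> ToolChain:
    """Multi-cycle self-improvement: execute, score, then let the creator
    agent rewrite its own tools (abstracted here as the refine callable)."""
    for _ in range(max_cycles):
        output = chain.execute(task_input)
        chain = refine(chain, score(output))
    return chain

In this reading, an L1 task requires only a single ToolSpec, an L2 task a linear tools list, and an L3 task a chain with a conditional branch; the refinement loop captures the multi-cycle requirement by scoring each run and handing the chain back to the creator agent for revision.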

2. Novelty & Research Gaps

2.1 Limitations of Existing Work

Benchmark                 Focus                            Gaps Addressed by MABITE
SWE-bench                 Code patch correctness           No tool creation; static tasks
BrowseComp                Web navigation persistence       Ignores tool synthesis
HumanEval                 Function-level code generation   Lacks iterative refinement
OpenManus (Agent Flow)    Plan execution                   Measures workflow success, not tool evolution

2.2 MABITE’s Innovations


3. Technical Framework

3.1 Dataset Design