
Abstract

While multi-agent frameworks that map textual instructions to textual outputs are now common, systems that support multi-agent collaboration over image inputs and produce an image as the final artifact remain scarce. We propose Canvas-Native Multi-Agent Reasoning (CN-MAR): agents do not just talk; they act in image space by drawing, erasing, measuring, labeling, and validating on a shared canvas via MCP-exposed tools. Unlike prior "visual CoT," where sketches serve as internal reasoning aids, CN-MAR treats the image itself as the deliverable and builds in domain validators (e.g., accessibility or building-code checks) as first-class citizens. We target two task families: (1) map planning under legal/accessibility constraints (satellite/town map → annotated, ADA-compliant evacuation-route map) and (2) cost-constrained 2D housing layout (a floor plan that satisfies habitability rules under a budget). We will release (i) a canvas action API (MCP tools) and a shared-canvas runtime; (ii) task specifications and metrics for compliance-constrained, image-as-output problems; and (iii) reference agents trained with an SFT→RL pipeline.


1. Problem framing & contributions

Problem. Given an input image (a map or a base floor plan) and a textual instruction/spec, produce a final image that meets domain constraints, using a team of agents that manipulate a shared visual canvas.
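To make the shared-canvas interaction concrete, the following is a minimal pure-Python sketch of the setup under stated assumptions: a canvas that agents mutate through discrete tool calls (draw, erase) and a validator that checks a domain constraint on the final artifact. All names (`Canvas`, `draw_rect`, `validate_route_width`) and the minimum-width rule are illustrative assumptions, not the released MCP API.

```python
from dataclasses import dataclass, field

@dataclass
class Canvas:
    """Toy shared canvas: a sparse grid mapping (x, y) cells to labels.
    Stands in for the MCP-exposed canvas runtime (hypothetical names)."""
    width: int
    height: int
    pixels: dict = field(default_factory=dict)  # (x, y) -> label

    def draw_rect(self, x0, y0, x1, y1, label):
        """Tool call: label a rectangular region (e.g. a route segment or room)."""
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                self.pixels[(x, y)] = label

    def erase(self, x0, y0, x1, y1):
        """Tool call: clear a region so another agent can redraw it."""
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                self.pixels.pop((x, y), None)

def validate_route_width(canvas, label, min_width):
    """Toy validator: every row of the labeled region must span at least
    `min_width` cells — a stand-in for an ADA-style accessibility check."""
    rows = {}
    for (x, y), lab in canvas.pixels.items():
        if lab == label:
            rows.setdefault(y, []).append(x)
    if not rows:
        return False
    return all(max(xs) - min(xs) + 1 >= min_width for xs in rows.values())

# A drawing agent proposes a 6-cell-wide corridor; a validator agent checks it.
canvas = Canvas(32, 32)
canvas.draw_rect(4, 4, 9, 20, "route")
ok = validate_route_width(canvas, "route", 5)
```

The key design point this sketch illustrates is that validation operates on the canvas state itself rather than on agent dialogue, so compliance checks remain meaningful regardless of what the agents said while producing the image.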

Key ideas.