
Abstract

While multi-agent frameworks that map textual instructions to textual outputs are now common, systems that support multi-agent collaboration over image inputs and produce an image as the final artifact remain scarce. We propose Canvas-Native Multi-Agent Reasoning (CN-MAR): agents do not just talk; they act in image space by drawing, erasing, measuring, labeling, and validating on a shared canvas via MCP-exposed tools. Unlike prior "visual CoT," where sketches serve as internal reasoning aids, CN-MAR treats the image itself as the deliverable and builds in domain validators (e.g., accessibility or building-code checks) as first-class citizens. We target two task families: (1) map planning under legal/accessibility constraints (satellite/town map → annotated, ADA-compliant evacuation-route map) and (2) cost-constrained 2D housing layout (a floor plan that satisfies habitability rules under a budget). We will release (i) a canvas action API (MCP tools) and a shared-canvas runtime; (ii) task specifications and metrics for compliance-constrained, image-as-output problems; and (iii) reference agents trained with an SFT→RL pipeline.


1. Problem framing & contributions

Problem. Given an input image (a map or a base floor plan) and a textual instruction/spec, produce a final image that meets domain constraints, using a team of agents that manipulate a shared visual canvas.
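To make the shared-canvas interaction concrete, the following is a minimal pure-Python sketch of the setup under stated assumptions: a canvas that agents mutate through discrete tool calls (draw, erase) and a validator that checks a domain constraint on the final artifact. All names (`Canvas`, `draw_rect`, `validate_route_width`) and the minimum-width rule are illustrative assumptions, not the released MCP API.

```python
from dataclasses import dataclass, field

@dataclass
class Canvas:
    """Toy shared canvas: a sparse grid mapping (x, y) cells to labels.
    Stands in for the MCP-exposed canvas runtime (hypothetical names)."""
    width: int
    height: int
    pixels: dict = field(default_factory=dict)  # (x, y) -> label

    def draw_rect(self, x0, y0, x1, y1, label):
        """Tool call: label a rectangular region (e.g. a route segment or room)."""
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                self.pixels[(x, y)] = label

    def erase(self, x0, y0, x1, y1):
        """Tool call: clear a region so another agent can redraw it."""
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                self.pixels.pop((x, y), None)

def validate_route_width(canvas, label, min_width):
    """Toy validator: every row of the labeled region must span at least
    `min_width` cells — a stand-in for an ADA-style accessibility check."""
    rows = {}
    for (x, y), lab in canvas.pixels.items():
        if lab == label:
            rows.setdefault(y, []).append(x)
    if not rows:
        return False
    return all(max(xs) - min(xs) + 1 >= min_width for xs in rows.values())

# A drawing agent proposes a 6-cell-wide corridor; a validator agent checks it.
canvas = Canvas(32, 32)
canvas.draw_rect(4, 4, 9, 20, "route")
ok = validate_route_width(canvas, "route", 5)
```

The key design point this sketch illustrates is that validation operates on the canvas state itself rather than on agent dialogue, so compliance checks remain meaningful regardless of what the agents said while producing the image.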

Key ideas.