- single agent
- textual space → image space
- tools: draw / edit / crop
- visual CoT

While multi-agent frameworks that map textual instructions to textual outputs are now common, systems that enable multi-agent collaboration over image inputs and produce an image as the final artifact remain scarce. We propose Canvas-Native Multi-Agent Reasoning (CN-MAR): agents do not just talk; they act in image space by drawing, erasing, measuring, labeling, and validating on a shared canvas via MCP-exposed tools. Unlike prior "visual CoT," where sketches serve as internal reasoning aids, CN-MAR treats the image as the deliverable and builds in domain validators (e.g., accessibility or building-code checks) as first-class citizens. We target two task families: (1) map planning under legal/accessibility constraints (satellite/town map → annotated, ADA-compliant evacuation route map) and (2) cost-constrained 2D housing layout (a floor plan that satisfies habitability rules under a budget). We will release: (i) a canvas action API (MCP tools) and a shared-canvas runtime; (ii) task specs and metrics focused on compliance-constrained, image-as-output problems; and (iii) reference agents with SFT→RL training.
Problem. Given an input image (map or base floorplan) and a textual instruction/spec, produce a final image that meets domain constraints, using a team of agents that manipulate a shared visual canvas.
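To make the "meets domain constraints" requirement concrete, here is a minimal sketch of a compliance validator exposed as a first-class tool. The 36-inch minimum clear width for accessible routes is a commonly cited ADA figure; the pixel scale, function name, and return shape are illustrative assumptions, not part of the proposal's API.

```python
# Hypothetical validator sketch: checks measured route widths against an
# assumed ADA-style clearance minimum. PIXELS_PER_INCH and the function
# signature are illustrative assumptions.

PIXELS_PER_INCH = 0.5  # assumed map scale: each pixel spans 2 real inches


def check_route_width(widths_px, min_inches=36.0):
    """Return (segment_index, width_in_inches) for every segment that is
    narrower than the required clearance; empty list means compliant."""
    violations = []
    for i, w_px in enumerate(widths_px):
        w_in = w_px / PIXELS_PER_INCH
        if w_in < min_inches:
            violations.append((i, w_in))
    return violations


# Three measured segments: 40 in, 36 in, and 16 in of clearance.
assert check_route_width([20, 18, 8]) == [(2, 16.0)]
```

A validator like this would run after each drawing round, returning violations that agents must repair before the canvas is accepted as the final artifact.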
Key ideas. Agents collaborate on a shared canvas through MCP-exposed primitives: draw_line, draw_polygon, place_icon, label, erase, measure_width, sample_slope, etc.
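The tool surface above can be sketched as a shared-canvas runtime in which every tool call mutates one canvas visible to all agents and is logged for auditing. The `Canvas` class, the grid representation, and the tag-based layers are illustrative assumptions; only the tool names (draw_line, erase, measure_width) come from the list above.

```python
# Hypothetical shared-canvas runtime sketch. Tools mutate a single pixel
# grid and append to an audit log; an MCP server would expose these
# methods as tools. The data model is an assumption for illustration.

from dataclasses import dataclass, field


@dataclass
class Canvas:
    width: int
    height: int
    pixels: dict = field(default_factory=dict)  # (x, y) -> layer tag
    log: list = field(default_factory=list)     # audit trail of tool calls

    def draw_line(self, x0, y0, x1, y1, tag="route"):
        """Rasterize a segment with Bresenham's algorithm; record the call."""
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        x, y = x0, y0
        while True:
            self.pixels[(x, y)] = tag
            if (x, y) == (x1, y1):
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x += sx
            if e2 <= dx:
                err += dx
                y += sy
        self.log.append(("draw_line", x0, y0, x1, y1, tag))

    def erase(self, tag):
        """Remove everything drawn under a given layer tag."""
        self.pixels = {p: t for p, t in self.pixels.items() if t != tag}
        self.log.append(("erase", tag))

    def measure_width(self, y, tag="route"):
        """Count tagged pixels on one horizontal scanline: a crude clearance probe."""
        return sum(1 for (px, py), t in self.pixels.items() if py == y and t == tag)


canvas = Canvas(64, 64)
canvas.draw_line(0, 10, 20, 10)          # horizontal route segment
assert canvas.measure_width(10) == 21    # 21 pixels of clearance on row 10
```

Keeping every mutation in `log` is the design choice that makes validation and multi-agent review possible: a validator agent can replay or inspect the log rather than diffing raw pixels.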