Mozart: Orchestrating AI Agents with Discipline

May 4, 2026

Most AI agent orchestrators fail in the same predictable way. They throw every persona at every problem. You get planning, coding, security review, UX critique, infrastructure checks, and validation all at once, whether the task needs it or not. It sounds thorough on paper, but in practice it is expensive, slow, and full of noise. After a while you stop paying attention because most of what comes back is not actually relevant.

That is not thoroughness. It is a lack of discipline.

I built Mozart because I kept running into that problem in real workflows. Not because I needed a smarter coding agent, but because I needed something that could decide how work should flow. Mozart does not write code. It acts more like a senior delivery lead sitting above the process. It decides who should be involved, when they should be involved, and just as importantly, when they should not be involved.

This is not a framework or a concept sketch. It is an actual working system in the repo: mozart-orchestration. The behavior described here is exactly how it runs.

The system is built around a set of specialists, each with a narrow job. The list looks long at first, but the point is that most of them are not used most of the time.

  • sarah handles research and prior art
  • harry turns ideas into a structured plan
  • bob reviews architecture and sequencing
  • librarian checks for duplicate functionality in existing code
  • xander focuses on security
  • dexter looks at code health and refactoring concerns
  • ruby handles UI and UX
  • otto looks at Kubernetes and infrastructure posture
  • ian evaluates change impact and blast radius
  • dick investigates bugs and produces findings, but never fixes them
  • jackson implements the actual code
  • valerie validates that the implementation matches the plan
  • scott handles documentation across README, changelog, and wiki

There is also an external pass using the OpenAI Codex CLI when the work needs an independent read.

The important part is not the roster. Plenty of systems have roles like this. The difference is how they are used.

The first decision Mozart makes is whether orchestration is even necessary. If you ask for a security review on auth code, it routes directly to xander and returns the result. No plan. No pipeline. Same idea if you ask for a UI critique or a bug investigation. Those are single-agent problems, so they get single-agent answers. There is no reason to turn everything into a multi-stage workflow.
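As a sketch, that first routing decision can be pictured like this. The agent names are Mozart's; the `route` function, its keyword matching, and the request shape are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of the first routing decision: single-agent requests
# bypass orchestration entirely. Agent names are real; the matching
# logic here is an illustrative assumption.

SINGLE_AGENT_ROUTES = {
    "security review": "xander",
    "ui critique": "ruby",
    "bug investigation": "dick",
}

def route(request: str) -> str:
    """Return an agent name for direct dispatch, or 'orchestrate'."""
    normalized = request.lower()
    for task, agent in SINGLE_AGENT_ROUTES.items():
        if task in normalized:
            return agent          # no plan, no pipeline
    return "orchestrate"          # a multi-stage workflow is warranted

print(route("security review on the auth module"))   # xander
print(route("add rate limiting to the public API"))  # orchestrate
```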

When orchestration is needed, Mozart still does not involve everyone. It looks at what the work actually touches and selects specialists based on that. If the change involves authentication or secrets, security runs. If it introduces a shared abstraction in an existing codebase, the duplicate check runs. If there is a UI surface, the UI review runs. If it is straightforward backend logic with no security or infrastructure angle, it may only involve an architecture review. The pipeline is shaped by the work instead of being forced onto it.
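Here is a sketch of that impact-driven selection, assuming a hypothetical `Change` descriptor with surface flags. The flags and function names are mine; the selection rules are the ones just described:

```python
from dataclasses import dataclass

# Hypothetical descriptor of what a change touches. The flags are
# illustrative; the selection rules follow the prose above.
@dataclass
class Change:
    touches_auth_or_secrets: bool = False
    adds_shared_abstraction: bool = False
    touches_ui: bool = False
    touches_infra: bool = False

def select_reviewers(change: Change) -> list[str]:
    reviewers = ["bob"]                  # architecture review as the baseline
    if change.touches_auth_or_secrets:
        reviewers.append("xander")       # security
    if change.adds_shared_abstraction:
        reviewers.append("librarian")    # duplicate-functionality check
    if change.touches_ui:
        reviewers.append("ruby")         # UI/UX review
    if change.touches_infra:
        reviewers.append("otto")         # infrastructure posture
    return reviewers

# Straightforward backend logic: only the architecture review runs.
print(select_reviewers(Change()))                              # ['bob']
print(select_reviewers(Change(touches_auth_or_secrets=True)))  # ['bob', 'xander']
```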

There is also a simple tiering model. Small, contained changes are treated very differently from changes that touch schemas, infrastructure, or security boundaries.

  • TINY work stays lightweight and skips most gates
  • STANDARD work runs a full but selective pipeline
  • HEAVY work adds stricter checks, including impact analysis, security involvement, and an external review on the final result

When there is uncertainty, it leans toward the heavier path. The cost of an extra check is small compared to the cost of missing something important.
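In sketch form, the tiering rule plus the lean-heavy default might look like this. The input flags are assumptions; the tier names and the bias toward HEAVY come straight from the behavior described above:

```python
from enum import Enum

class Tier(Enum):
    TINY = "tiny"          # lightweight, skips most gates
    STANDARD = "standard"  # full but selective pipeline
    HEAVY = "heavy"        # adds impact analysis, security, external review

def classify(touches_schema: bool, touches_infra: bool,
             touches_security_boundary: bool,
             small_and_contained: bool, uncertain: bool) -> Tier:
    # Hypothetical inputs; the real rule is the bias:
    # when in doubt, take the heavier path.
    if touches_schema or touches_infra or touches_security_boundary:
        return Tier.HEAVY
    if uncertain:
        return Tier.HEAVY  # an extra check is cheap; a miss is not
    return Tier.TINY if small_and_contained else Tier.STANDARD

print(classify(False, False, False, small_and_contained=False, uncertain=True))
# Tier.HEAVY
```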

Another thing that mattered to me was not forcing everything into a “build the feature” shape. Sometimes you just want a plan. Sometimes you want an audit. Sometimes you want someone to investigate a problem and explain what is happening before anything gets fixed. Mozart treats those as complete outcomes. It will plan and stop. It will audit and return a report without touching the code. It will validate a diff without redoing the earlier work. It can also take an existing plan and go straight into implementation without re-running earlier stages.
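One way to picture that is a set of terminal intents rather than stages of one pipeline. The enum below is a hypothetical sketch of those outcomes, not Mozart's internals:

```python
from enum import Enum, auto

# Hypothetical intent set; each one is a complete outcome,
# not a FULL run that stopped early.
class Intent(Enum):
    PLAN_ONLY = auto()            # produce a plan and stop
    AUDIT_ONLY = auto()           # return a report, never touch the code
    INVESTIGATE = auto()          # explain the problem before any fix
    VALIDATE_ONLY = auto()        # check a diff without redoing earlier work
    IMPLEMENT_FROM_PLAN = auto()  # take an existing plan straight to code
    FULL = auto()                 # the whole flow
```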

There are two operating modes. In autonomous mode, it runs without stopping at every phase. In loop-in mode, it pauses before each commit and gives you a chance to review. Even in autonomous mode, it does not blindly push forward. It will stop if there are open questions, conflicting opinions, missing tooling, or anything that could lead to a bad or irreversible decision. Autonomy here means it does not interrupt you for routine progress. It does not mean it makes risky decisions on your behalf.
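A sketch of that stop logic, with hypothetical condition flags; the rule itself is the one just stated, that autonomy skips routine interruptions, not risky decisions:

```python
def should_pause(mode: str, about_to_commit: bool, open_questions: bool,
                 conflicting_reviews: bool, missing_tooling: bool,
                 irreversible: bool) -> bool:
    # Loop-in mode: always pause before a commit.
    if mode == "loop-in" and about_to_commit:
        return True
    # Both modes: stop on anything that could force a bad call.
    return (open_questions or conflicting_reviews
            or missing_tooling or irreversible)

# Autonomous mode does not pause for routine progress...
print(should_pause("autonomous", True, False, False, False, False))  # False
# ...but it will not push through an irreversible decision.
print(should_pause("autonomous", True, False, False, False, True))   # True
```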

A typical run starts with intake and classification. If needed, it does research, produces a plan, and runs targeted reviews based on what that plan actually touches. Implementation happens in phases, with specialists stepping in only where their perspective matters. For higher-risk work, there is an external review. Then the result is validated against the plan, and documentation is produced. Each step only exists if the work justifies it.
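Assembled end to end, a run could look like a conditionally built list of stages. The function and stage labels below are assumptions, but the shape mirrors the flow just described:

```python
def build_pipeline(tier: str, needs_research: bool,
                   reviewers: list[str]) -> list[str]:
    stages = ["intake", "classify"]
    if needs_research:
        stages.append("sarah:research")
    stages.append("harry:plan")
    stages += [f"{r}:review" for r in reviewers]  # only selected reviewers
    if tier == "heavy":
        stages.append("ian:impact-analysis")      # blast radius before code
    stages.append("jackson:implement")
    if tier == "heavy":
        stages.append("codex:external-review")    # independent read of result
    stages += ["valerie:validate", "scott:document"]
    return stages

print(build_pipeline("standard", needs_research=False, reviewers=["bob"]))
```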

Every run leaves behind a trail. There is a state file so the process can resume if something breaks. There is a flow summary that shows which agents ran and in what order. It also explicitly lists which agents were skipped and why. That part matters more than it sounds. In most systems, if something does not appear, you are left guessing whether it was forgotten or intentionally skipped. Here, that decision is visible. There is also a ticket that tracks the work in whatever system you are using.
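A flow summary in that spirit might look like the dict below. The exact format and paths are assumptions; the property that matters is real: skipped agents are recorded with reasons, never silently absent:

```python
# Hypothetical flow summary contents -- format assumed, property real.
flow_summary = {
    "ran": ["harry", "bob", "xander", "jackson", "valerie", "scott"],
    "skipped": {
        "ruby": "no UI surface in this change",
        "otto": "no infrastructure touched",
        "librarian": "no new shared abstraction introduced",
    },
    "tier": "STANDARD",
    "resumable_state": ".mozart/state.json",  # hypothetical path
}
```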

None of these ideas are completely new on their own. You can find pieces of this in different tools. What I have not seen, at least not in one cohesive system, is this kind of restraint applied across all of them at once. The system routes around itself when it is not needed. It selects specialists based on actual impact. It scales effort to risk. It treats partial workflows as valid outcomes. It limits autonomy when it matters. And it documents both what it did and what it chose not to do.

Most agent systems try to prove their value by doing more. More agents, more steps, more output.

Mozart is built around the opposite idea. The value comes from knowing when not to do something.

The industry is getting very good at generating code quickly. The bottleneck is already shifting toward understanding and trusting what gets produced. Throwing more agents at every problem makes that worse, not better. What actually helps is applying the right perspectives at the right time and being clear about what was skipped and why.

An orchestra does not work because every instrument plays all the time. It works because someone knows when to bring each part in, and when to leave it out.

That is the role Mozart is trying to fill.