
Guardrails for a Multi-Agent System

How Gitar's Judge ensures high quality with low noise

5 min read
By Poorvi Parikshya

Every AI system that talks to people needs a way to make sure it's saying the right things. Not just accurate things but the right things, to the right person, at the right moment. That layer has a name in the industry: a guardrail. Think of it like a pop filter on a microphone. It doesn't change what you're saying, it just stops the things that shouldn't reach the listener from getting through.

At Gitar, we call our guardrail subsystem the Judge. It sits across everything Gitar consumes and outputs: code review comments, replies to developers, commit messages, and rules evaluations.

The reason it exists comes down to experience. When someone sees the same code review finding posted twice by different agents, or gets a reply to a conversation they were having with a teammate, or finds an internal system reference that crept into a comment, it erodes trust.

On the way in, it filters messages directed at someone else and commands Gitar was never meant to receive.

On the way out, it collapses duplicates, strips internal references, and drops replies that add nothing. What reaches the developer is clean and worth reading.
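As a rough sketch of those two directions, here is what an inbound check and an outbound duplicate collapse might look like. The `Comment` shape and the exact rules are assumptions for illustration, not Gitar's implementation:

```python
from dataclasses import dataclass

@dataclass
class Comment:
    text: str
    addressed_to_bot: bool  # did the author actually direct this at Gitar?

def inbound_filter(comment: Comment) -> bool:
    """Return True if the comment should reach the agents at all."""
    # Drop messages directed at someone else (e.g. a teammate conversation).
    return comment.addressed_to_bot

def outbound_filter(drafts: list[str]) -> list[str]:
    """Collapse near-identical duplicate drafts while preserving order."""
    seen: set[str] = set()
    kept: list[str] = []
    for draft in drafts:
        key = draft.strip().lower()  # crude normalization for illustration
        if key not in seen:
            seen.add(key)
            kept.append(draft)
    return kept
```

The real system does more than normalize strings, of course; the point is that both directions get a cheap, deterministic gate before anything smarter runs.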

How we built and scaled the Judge

The first version used multiple specialized sub-agents: separate agents for duplicate detection, content filtering, and relevance scoring. Clean on paper. In practice, coordination overhead stacked up. When a comment got dropped, tracing why meant digging through three agents.

We simplified to a single decision-maker. One agent, one pass, one clear outcome.

A good way to see what the Judge does is to look at what actually reaches developers versus what Gitar originally drafted. An agent might produce a technically correct comment but tack on an internal process note at the end, something like "The code review agent will pick this up in the next iteration." That sentence is meaningless to the developer and reveals implementation details they never needed to see. The Judge catches it, strips the trailing sentence, and posts only the finding.
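A minimal sketch of that trailing-note stripping, assuming hypothetical regex patterns for internal process references (the patterns are invented for illustration):

```python
import re

# Hypothetical patterns for internal process notes; not Gitar's actual list.
INTERNAL_NOTE_PATTERNS = [
    re.compile(r"The code review agent will .*$", re.IGNORECASE),
    re.compile(r"\[internal:[^\]]*\]"),
]

def strip_internal_notes(comment: str) -> str:
    """Remove internal process references, keeping the finding intact."""
    cleaned = comment
    for pattern in INTERNAL_NOTE_PATTERNS:
        cleaned = pattern.sub("", cleaned)
    return cleaned.strip()

draft = ("This loop re-reads the file on every iteration; hoist the read. "
         "The code review agent will pick this up in the next iteration.")
```

Note that this is a deletion, not a rewrite: the developer-facing finding passes through untouched.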

The Multi-Turn Pipeline

As volume grew, the synchronous design became a bottleneck. Making it async gave us the throughput we needed, and with that we built proper infrastructure around the single decision-maker: a dedicated tool for surfacing classifications, lifecycle management to make sure decisions always returned, and a collector to carry results out.
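The async shape can be sketched with `asyncio.gather` standing in for the lifecycle management and collector: every submitted comment gets a decision back, in order. The `judge` stub here is a placeholder classifier, not the real one:

```python
import asyncio

async def judge(comment: str) -> str:
    """Placeholder classifier; the real Judge makes a richer decision."""
    await asyncio.sleep(0)  # stands in for the model call
    return "DROP" if not comment.strip() else "POST"

async def run_pipeline(comments: list[str]) -> list[tuple[str, str]]:
    # Lifecycle management: every comment submitted gets a decision returned,
    # and gather() acts as the collector carrying results out in order.
    decisions = await asyncio.gather(*(judge(c) for c in comments))
    return list(zip(comments, decisions))

results = asyncio.run(run_pipeline(["LGTM", "", "Needs a test"]))
```

The key property is that no comment can get lost mid-flight: the pipeline either returns a decision for it or fails loudly.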

With the scale problem handled, the next improvement was less obvious. The original Judge classified comments in isolation, so a reply saying "Done" looked like noise. Adding thread history, work log flags, and recent comment history progressively changed that. It could read the conversation rather than just the latest message.
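A toy version of that context sensitivity, with assumed field names for the thread history and work log flag:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    history: list[str] = field(default_factory=list)  # prior messages in thread
    work_done: bool = False                           # work log flag (assumed name)

def classify_reply(reply: str, ctx: ThreadContext) -> str:
    """In isolation "Done" looks like noise; with context it closes the loop."""
    if reply.strip().lower() in {"done", "fixed"}:
        return "POST" if ctx.work_done else "DROP"
    return "POST"
```

Same reply text, opposite decisions, depending entirely on what the surrounding thread says.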

Not every comment on a PR comes from a human either. When another automated tool posts a reply and expects a response, the Judge evaluates both sides of that exchange. Inputs and outputs are both checked, so even when another bot is wired to always reply, Gitar isn't. The Judge recognizes there's nothing worth adding and stays quiet rather than generating noise to match noise.

One API call

Even a single agent carries overhead: context loading, tool registration, lifecycle management, collector cleanup. All for a three-way classification. As volume kept growing, so did the cost.

The current architecture replaced the entire agentic pipeline with a single structured output call.

No tool calls. No lifecycle management. One comment in, one structured response out. If the decision is MODIFY, the replacement text comes back in the same response.

This is an easy mistake to make in an agentic framework. The agent pattern is available, familiar, and handles ambiguity well. But deciding whether a comment should be posted is a decision problem, not an exploration problem. A classification task needs a classifier, not an agent.

One principle runs through all of it: fail-open. If the Judge hits an error or times out, comments pass through unfiltered. We'd rather post something imperfect than silently lose real feedback.
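In spirit, fail-open is a small try/except around the classifier. This sketch uses a deliberately failing classifier to show the pass-through (names are illustrative):

```python
def flaky_classifier(comment: str) -> str:
    """Stands in for a model call that errors or times out."""
    raise TimeoutError("model call timed out")

def judge_with_fail_open(comment: str, classify=flaky_classifier) -> str:
    """Fail-open: any error or timeout means the comment passes through."""
    try:
        return classify(comment)
    except Exception:
        # Better to post something imperfect than silently lose real feedback.
        return "POST"
```

The asymmetry is deliberate: a dropped valid comment is an invisible loss, while an unfiltered imperfect comment is at least visible and correctable.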

At some point we put the Judge on trial. Running a production audit of its decisions, we found the DROP rate was more than double our target.

Two rules were the culprits.

The first fired on acknowledgment openers like "Exactly!", regardless of what followed. Detailed technical replies were getting dropped because they started too warmly. Fix: only drop short replies where the conversation has clearly closed.
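The tightened rule might look something like this. The opener list and the three-word threshold are invented for illustration, not the production values:

```python
# Assumed opener list; the real set is tuned from production audits.
ACK_OPENERS = ("exactly", "agreed", "thanks", "good point")

def is_closing_ack(reply: str, thread_resolved: bool) -> bool:
    """Drop only short acknowledgments on threads that have clearly closed."""
    text = reply.strip().lower().rstrip("!. ")
    is_short = len(text.split()) <= 3
    opens_with_ack = any(text.startswith(opener) for opener in ACK_OPENERS)
    # All three conditions must hold; a warm opener alone is not enough.
    return opens_with_ack and is_short and thread_resolved
```

The original bug was exactly the missing conjunction: the rule fired on `opens_with_ack` alone.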

The second was MODIFY. "Keep only the core message" turned out to mean "remove all personality." Fix: delete only what needs deleting, preserve everything else. Rewrites accumulate invisible errors. Deletions are surgical.
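The deletion-over-rewrite idea can be sketched with a hypothetical banned-phrase list; nothing outside the deleted spans is touched:

```python
def modify_by_deletion(comment: str, banned_phrases: list[str]) -> str:
    """Delete only what must go; preserve everything else, tone included."""
    result = comment
    for phrase in banned_phrases:
        result = result.replace(phrase, "")
    return " ".join(result.split())  # tidy whitespace left behind by deletions

draft = "Ha, good catch! [internal-tracker-123] Let's guard this with a null check."
```

A full rewrite would have to reproduce the joke, the register, and the technical content all at once, and each is a chance to drift. A deletion can only remove what it was told to remove.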

The personality point matters. Some developers use Gitar as a sounding board, asking quick questions mid-review in an offhand way. The Judge picks up on that register. Gitar's response matches the tone rather than defaulting to formal. It doesn't feel like talking to a bot because it's not trying to sound like one.

You can't find these bugs in unit tests. You need real conversations, at volume, and the instrumentation to surface what's actually happening.

The Judge Today

The same classification system now runs across four surfaces: PR comments and replies, code review findings, rules evaluations, and commit messages. Same architecture, different context.
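One way to picture "same architecture, different context" is a single prompt builder parameterized by surface. The surface names and context strings below are assumptions for illustration:

```python
from typing import Literal

Surface = Literal["pr_comment", "review_finding", "rules_eval", "commit_message"]

def build_prompt(surface: Surface, content: str) -> str:
    """One classifier; only the surface-specific framing changes."""
    context = {
        "pr_comment": "You are judging a reply in a PR conversation.",
        "review_finding": "You are judging a code review finding.",
        "rules_eval": "You are judging a rules evaluation result.",
        "commit_message": "You are judging a commit message.",
    }[surface]
    return f"{context}\n\nContent:\n{content}"
```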

The next iteration should be adaptive, learning from developer reactions rather than static prompts alone. If a team consistently dismisses a category of finding, that signal should feed back into the model.

Silence is a feature. Every comment Gitar posts should earn its place. Sometimes the right call is to have no output at all.

Gitar provides AI-powered code review that respects your time. Try Gitar free on your next PR.