
AI Agent Versioning: How to Roll Back a Config in Production

When an agent config change breaks production, git revert is not enough. Here is how immutable agent versioning and one-click rollback actually work for AI agents.

Gil Kal | April 22, 2026 | 5 min read

You tweak an agent's system prompt on a Thursday afternoon. Add a new tool. Bump the temperature from 0.3 to 0.6. Ship it. By Friday morning, ticket quality is down, resolution rates are off, and the team cannot pinpoint which change caused what. The prompt is in a database column, the tool list is an array you mutated in place, and the temperature lives in a config object that was merged with the default. There is no git history because the agent config is not in git.

This is the default failure mode for every AI platform that starts simple and ends in production. Agent configuration lives in a mutable database row, changes land without a diff, and when something breaks you are left reconstructing state from audit logs and Slack threads. The fix is not a new framework — it is applying one of the oldest patterns in software to agent configs: immutability.

Why Git Isn't Enough

Teams sometimes try to solve this by keeping agent configs in a git repo and syncing on deploy. It works for 10 agents and breaks at 100. The problems stack fast: non-engineers cannot edit prompts, the deploy cycle for a prompt tweak is now measured in minutes of CI time, emergency rollbacks require a revert-and-redeploy, and the history you get is commit-level — not field-level. Worse, the production database drifts from the git repo because someone clicked Save in the admin UI and the sync job has a lag.

What agent config actually needs is database-native versioning. Every edit writes a new row with the full config snapshot. No updates-in-place. Every version has an ID, a timestamp, an author, and a diff against its parent. Rolling back is just pointing the agent's 'active version' pointer at an older row.

The Append-Only Versioning Pattern

Dobby uses a pattern that BigQuery practitioners recognize immediately: append-only rows with version and is_latest columns. Every agent config edit writes a new row, and a trigger (or application code) flips the previous row's is_latest from true to false. The primary key is (agent_id, version). In its simplest form, a rollback flips the flag back.

-- BigQuery-style syntax. Run both statements in one transaction so readers
-- never observe zero or two "latest" rows for the same agent.
BEGIN TRANSACTION;

-- Retire the previous version
UPDATE agent_configs
SET is_latest = FALSE
WHERE agent_id = 'agent_123' AND version = 41;

-- Every edit to an agent writes a new row with the full config snapshot
INSERT INTO agent_configs (
  agent_id, version, is_latest, parent_version,
  name, system_prompt, tools, temperature,
  author_user_id, created_at
)
VALUES (
  'agent_123', 42, TRUE, 41,
  'Support Triage', 'You are a support agent...',
  ['read_tickets', 'search_kb'], 0.3,
  'user_gil', CURRENT_TIMESTAMP()
);

COMMIT TRANSACTION;

Three properties fall out of this design. You never lose a config: the full history is queryable with a single SELECT. Diffing is trivial because you have both rows. And rollback is a database write with nothing to redeploy: at its simplest, an UPDATE that flips is_latest between two versions.
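To make the write path concrete, here is a minimal sketch of the append-only pattern using Python's built-in sqlite3 module. It is an illustration under assumptions, not Dobby's implementation: the config is stored as a single config_json column for brevity, and the helper names save_version and history are hypothetical.

```python
import json
import sqlite3

# Assumed schema for illustration; the config is collapsed into one JSON column.
SCHEMA = """
CREATE TABLE agent_configs (
    agent_id       TEXT    NOT NULL,
    version        INTEGER NOT NULL,
    is_latest      INTEGER NOT NULL,
    parent_version INTEGER,
    config_json    TEXT    NOT NULL,
    author_user_id TEXT    NOT NULL,
    created_at     TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (agent_id, version)
)
"""


def save_version(conn, agent_id, config, author):
    """Append a new config version and retire the previous one atomically."""
    with conn:  # one transaction: commits on success, rolls back on error
        row = conn.execute(
            "SELECT version FROM agent_configs "
            "WHERE agent_id = ? AND is_latest = 1",
            (agent_id,),
        ).fetchone()
        parent = row[0] if row else None
        conn.execute(
            "UPDATE agent_configs SET is_latest = 0 "
            "WHERE agent_id = ? AND is_latest = 1",
            (agent_id,),
        )
        version = (parent or 0) + 1
        conn.execute(
            "INSERT INTO agent_configs "
            "(agent_id, version, is_latest, parent_version, "
            " config_json, author_user_id) "
            "VALUES (?, ?, 1, ?, ?, ?)",
            (agent_id, version, parent,
             json.dumps(config, sort_keys=True), author),
        )
    return version


def history(conn, agent_id):
    """Full version history for one agent, oldest first."""
    return conn.execute(
        "SELECT version, is_latest, parent_version, author_user_id "
        "FROM agent_configs WHERE agent_id = ? ORDER BY version",
        (agent_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
base = {
    "name": "Support Triage",
    "system_prompt": "You are a support agent...",
    "tools": ["read_tickets", "search_kb"],
    "temperature": 0.3,
}
v1 = save_version(conn, "agent_123", base, "user_gil")
v2 = save_version(conn, "agent_123", {**base, "temperature": 0.6}, "user_gil")
for row in history(conn, "agent_123"):
    print(row)
```

Because the flag flip and the insert share one transaction, a reader that filters on is_latest sees exactly one row per agent at any point in time.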

An agent config you can edit without producing a version is a config you cannot safely roll back.

Field-Level Diff, Not Blob Diff

The UI matters almost as much as the storage format. A blob diff — the entire JSON of version 41 vs version 42, side by side — is useless for humans. What operators actually want is field-level: 'temperature changed from 0.3 to 0.6, tools added [send_email], system prompt changed at line 14.' That requires comparing structured fields, not text.

In Dobby, the diff view groups changes by field type: metadata (name, description), behavior (system prompt, temperature, max tokens), capabilities (tools, MCP servers, model), and guardrails (budgets, policy overrides). Each section highlights only the fields that changed. The system prompt gets a token-level diff because that is where most behavior regressions hide.
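A field-level diff of this kind can be sketched in a few lines of Python. The function names diff_configs and token_diff are hypothetical, the configs are assumed to be plain dictionaries, and the grouping into metadata, behavior, capabilities, and guardrails is left out to keep the example short. The prompt diff splits on whitespace, which is only a rough stand-in for real tokenization.

```python
import difflib


def token_diff(before, after):
    """Whitespace-token diff of two strings, reported as -removed / +added runs."""
    a, b = before.split(), after.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        if i2 > i1:
            out.append("-" + " ".join(a[i1:i2]))
        if j2 > j1:
            out.append("+" + " ".join(b[j1:j2]))
    return out


def diff_configs(old, new):
    """Return human-readable, field-level changes between two config dicts."""
    changes = []
    for field in sorted(set(old) | set(new)):
        before, after = old.get(field), new.get(field)
        if before == after:
            continue  # only report fields that actually changed
        if isinstance(before, list) and isinstance(after, list):
            added = [x for x in after if x not in before]
            removed = [x for x in before if x not in after]
            if added:
                changes.append(f"{field} added {added}")
            if removed:
                changes.append(f"{field} removed {removed}")
            if not added and not removed:
                changes.append(f"{field} reordered")
        elif (field == "system_prompt"
              and isinstance(before, str) and isinstance(after, str)):
            changes.append(
                f"{field} changed: " + " ".join(token_diff(before, after)))
        else:
            changes.append(f"{field} changed from {before!r} to {after!r}")
    return changes


v41 = {
    "name": "Support Triage",
    "system_prompt": "You are a support agent. Be concise.",
    "tools": ["read_tickets", "search_kb"],
    "temperature": 0.3,
}
v42 = {
    "name": "Support Triage",
    "system_prompt": "You are a support agent. Be friendly.",
    "tools": ["read_tickets", "search_kb", "send_email"],
    "temperature": 0.6,
}
for change in diff_configs(v41, v42):
    print(change)
```

Comparing structured fields rather than serialized JSON is what lets the output say which tool was added instead of showing two nearly identical blobs.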

Rollback as a Button, Not a Runbook

Rollback must be a single action. 'Show me the history of agent_123 → select version 41 → confirm → the agent is back to that config.' No redeploy, no approval chain during an incident, no hunting for who last changed what. On the backend, it writes a new row that is an exact copy of version 41 (to preserve the append-only invariant) but labelled version 43 with a 'rollback_from' field pointing to 42.

This matters because rollback is also the only safe answer during a bad incident. You do not debug a live agent — you roll back to the last known good version, then debug offline. If rollback is a runbook and not a button, teams will avoid it and try to patch-forward instead. That is how minor incidents become major ones.
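The backend behavior described above, where a rollback copies the old version into a new row, could look roughly like this. This is a sketch under assumptions: the table is trimmed to a few columns, the rollback function name is hypothetical, and only the rollback_from field comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, assumed schema: just enough columns to show the rollback write.
conn.execute("""
CREATE TABLE agent_configs (
    agent_id       TEXT    NOT NULL,
    version        INTEGER NOT NULL,
    is_latest      INTEGER NOT NULL,
    rollback_from  INTEGER,
    system_prompt  TEXT,
    temperature    REAL,
    author_user_id TEXT,
    PRIMARY KEY (agent_id, version)
)
""")
conn.executemany(
    "INSERT INTO agent_configs VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("agent_123", 41, 0, None, "You are a support agent...", 0.3, "user_gil"),
        ("agent_123", 42, 1, None, "You are a support agent...", 0.6, "user_gil"),
    ],
)
conn.commit()


def rollback(conn, agent_id, target_version, author):
    """Roll back by appending a copy of target_version as a new latest row."""
    with conn:  # single transaction: flag flip and insert succeed or fail together
        target = conn.execute(
            "SELECT system_prompt, temperature FROM agent_configs "
            "WHERE agent_id = ? AND version = ?",
            (agent_id, target_version),
        ).fetchone()
        if target is None:
            raise ValueError(f"{agent_id} has no version {target_version}")
        current = conn.execute(
            "SELECT MAX(version) FROM agent_configs WHERE agent_id = ?",
            (agent_id,),
        ).fetchone()[0]
        conn.execute(
            "UPDATE agent_configs SET is_latest = 0 "
            "WHERE agent_id = ? AND is_latest = 1",
            (agent_id,),
        )
        new_version = current + 1
        conn.execute(
            "INSERT INTO agent_configs VALUES (?, ?, 1, ?, ?, ?, ?)",
            (agent_id, new_version, current, *target, author),
        )
    return new_version


new_version = rollback(conn, "agent_123", target_version=41, author="user_gil")
rows = conn.execute(
    "SELECT version, is_latest, rollback_from, temperature "
    "FROM agent_configs ORDER BY version"
).fetchall()
for row in rows:
    print(row)
```

Versions 41 and 42 stay in the table untouched, so the incident itself remains part of the queryable history.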

Versioning Enables Outcomes and Learning Loops

The most valuable side effect of versioning is not incident response — it is measurement. If every run of the agent logs which version it used, you can compare outcome metrics across versions. Version 41 had 72% first-response-resolution. Version 42 dropped to 58%. The diff tells you exactly what changed between them. That is the foundation of an AI learning loop that does not rely on feelings.

Dobby wires agent_version_id into every task execution record, every LLM call, and every outcome grade. The admin UI lets you filter any report by version. This turns prompt engineering from a craft into a measurable discipline — you can say 'the new prompt is worse' with a p-value, not a hunch.
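As one way to put a p-value on a version comparison, the sketch below runs a two-sided two-proportion z-test on resolution rates per version. The run counts are invented for illustration (500 runs per version, chosen to match the 72% and 58% rates mentioned earlier), the record shape is an assumption, and the test itself is a standard textbook method rather than anything the platform is confirmed to use.

```python
import math
from collections import Counter


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are all 0% or all 100%; the test is undefined")
    z = (success_a / n_a - success_b / n_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical run log: (agent_version_id, resolved_on_first_response)
runs = (
    [(41, True)] * 360 + [(41, False)] * 140
    + [(42, True)] * 290 + [(42, False)] * 210
)

totals = Counter(version for version, _ in runs)
resolved = Counter(version for version, ok in runs if ok)

z, p_value = two_proportion_z_test(
    resolved[41], totals[41], resolved[42], totals[42]
)
print(f"v41 rate={resolved[41] / totals[41]:.2f}, "
      f"v42 rate={resolved[42] / totals[42]:.2f}, "
      f"z={z:.2f}, p={p_value:.2g}")
```

The test only works because each run record carries its version ID. Without that join key there is nothing to group by.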
