Abstract
This specification defines how an AI-native system autonomously improves itself. An AI-native system already has identity, memory, calibration, safety gates, and observable agents; the architecture spec and agent spec provide the foundation. Self-evolution closes the loop: the system detects what it is missing, proposes what to build, tests the proposal against behavioral evals, deploys it under graduated autonomy, and measures the outcome. Every self-modification is recorded, attributable, and reversible.
Every normative requirement in this specification is reverse-derived from the single largest gap in production AI-native systems: the calibration loop terminates at a human reading a notification. No requirement exists without a corresponding observation of where self-improvement stalls.
Status of This Document
This is a Draft Living Standard synthesized by the Hermes agent runtime from empirical analysis of production AI-native systems. Work began 2026-05-24. Later versions may supersede this document.
Introduction
This section is informative.
An AI-native system measures itself. The AI decision ledger records every decision. Calibration computes accuracy per kind, per model, per agent. Behavioral evals verify output quality before deployment. The architecture provides safety gates at every risk tier. The agent ecosystem has shared tools, versioned prompts, and coordinated runtimes.
And then the calibration report arrives in a notification channel, a human reads it, and the loop stops.
This is the final anti-pattern. The system has the raw material for self-improvement — it can detect degradation through calibration, compare prompt versions through A/B routing, roll back through the routing layer. But there is no agent whose job is to close the loop. No agent reads the calibration report. No agent generates a prompt variant. No agent runs evals. No agent proposes a change, gates on the eval result, deploys to a fraction of traffic, and measures whether accuracy improved.
Self-evolution is the agent that sits between calibration output and system change. It detects three classes of problem: degradation (an existing feature is getting worse), gap (a needed feature doesn’t exist), and opportunity (calibration data suggests a model change would improve accuracy). For each class, it follows a graduated autonomy model: low-risk changes are autonomous, medium-risk changes are eval-gated, high-risk changes require human approval.
This specification defines the self-evolution loop. It defines the detection mechanisms, the change proposal format, the eval-gate that every self-modification must pass, the graduated autonomy model that determines what requires human review, and the rollback guarantee that makes self-evolution safe. Every self-modification is recorded in the AI decision ledger with the evolving agent’s identity. At any moment, a human can ask: "what did the system change about itself this week, and did it work?"
For a conformant implementation of the prompt-optimization tier of this specification, see [[skillopt]] — a systematic text-space optimizer that trains agent skills through trajectory-driven edits, validation-gated updates, and deployable skill artifacts, achieving +23.5 point accuracy improvements across six benchmarks and seven models.
How to read this document. §1 defines the vocabulary. §2 defines conformance — read this to understand the three levels and prerequisite requirements. §§3–8 are the normative body: gap detection, change proposal, eval-gate, graduated autonomy, rollback, and the change proposal state machine. §9 catalogs the anti-patterns. §10 is the migration path. Appendix A provides a worked profile.
1. Terminology
1.1 Normative Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [rfc2119].
1.2 Definitions
- Self-Evolution: The process by which an AI-native system autonomously detects problems, designs solutions, tests them, deploys them, and measures outcomes, without requiring a human in every iteration of the loop.
- Evolution Agent: An agent whose purpose is to improve the system itself. It reads calibration data, detects gaps and degradation, proposes changes, runs behavioral evals, and, within its autonomy grant, deploys changes. Every self-modification it makes is recorded in the AI decision ledger with its identity.
- Gap Detection: The mechanism by which the system identifies what it is missing: a feature with no agent assigned, an event stream with no AI consumer, an accuracy drop below threshold, a model that could be replaced with a cheaper equivalent. Gap detection is not just degradation monitoring; it includes absence detection.
- Change Proposal: A structured document describing a proposed self-modification: what is changing (prompt, tool, agent, model), why (the gap it addresses), how (the implementation plan), and how it will be verified (the eval suite it must pass). A change proposal MUST carry a TTL and a rollback plan.
- Eval-Gate: A deploy gate that every self-modification MUST pass before reaching production. The eval-gate runs the relevant behavioral eval suite against the proposed change. If evals fail, the change is blocked. If evals pass, the change proceeds according to its autonomy tier.
- Graduated Autonomy: The principle that self-modification risk determines approval requirements. Prompt changes (low risk) MAY be autonomous. Tool additions (medium risk) MUST pass eval-gate. New agent creation (high risk) MUST flow through propose-approve-execute. Architecture changes (highest risk) MUST require human approval with no autonomous path.
- Rollback Guarantee: The requirement that every self-modification MUST be reversible. For prompt changes: reroute to previous version. For tool changes: revert to previous version. For agent changes: disable the agent. The rollback path MUST be defined in the change proposal and MUST be executable without human intervention.
- Calibration Stall: The anti-pattern where a system generates calibration reports that terminate at a human reading a notification, so the insights never re-enter the system as changes. An AI-native system capable of self-evolution MUST NOT exhibit calibration stall.
- Evolution Spec Level 1 (Gap Detection): A conformance level where the system detects degradation and gaps autonomously and surfaces them as structured change proposals. Opportunity detection is RECOMMENDED at this level.
- Orchestrator: A role, human or governed agent, with the authority to review and approve self-modifications above the autonomous tier. An orchestrator agent is an agent with a defined identity, a calibration track record above the conformance threshold for review decisions, and a declared scope of change types it may approve. An orchestrator agent MUST itself be subject to the same observability and rollback rules as the evolution agent.
- Evolution Spec Level 2 (Eval-Gated Change): A conformance level where the system satisfies Level 1 plus autonomous prompt changes through the eval-gate, with automated rollback on eval failure.
- Evolution Spec Level 3 (Graduated Autonomy): A conformance level where the system satisfies Level 2 plus full graduated autonomy: prompt changes autonomous, tool changes eval-gated, agent creation through propose-approve, all with rollback guarantees and ledger records.
- Proposal State: One of the ten normative states in the change proposal finite state machine (§8): proposed, evaluating, approved, rejected, expired, deploying, deployed, degraded, rolling_back, or rolled_back. A conformant implementation MUST enforce the transition table in §8.3; transitions not listed there MUST NOT occur.
2. Conformance
This specification depends on capabilities defined in [[ai-native-architecture]] and [[ai-agent-spec]]. A system claiming conformance to this specification MUST first satisfy the corresponding conformance levels of its prerequisite specifications:
- Evolution Level 1 requires Architecture Level 2 (decision ledger at Level 1, shared memory at Level 2) and Agent Spec Level 1 (agent identities are defined and registered).
- Evolution Level 2 requires Architecture Level 3 (calibration feedback loop is active at Level 3) and Agent Spec Level 2 (behavioral evals exist for agents).
- Evolution Level 3 requires full conformance to Architecture Level 3 and Agent Spec Level 3.
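The prerequisite list above can be read as a lookup table. The following is a minimal sketch, assuming conformance levels are represented as plain integers; the names `PREREQUISITES` and `max_evolution_level` are illustrative and do not appear in this specification.

```python
# Prerequisite levels copied from the list above. The dict layout and the
# function name are illustrative assumptions, not part of the specification.
PREREQUISITES = {
    1: {"architecture": 2, "agent_spec": 1},
    2: {"architecture": 3, "agent_spec": 2},
    3: {"architecture": 3, "agent_spec": 3},
}


def max_evolution_level(architecture_level: int, agent_spec_level: int) -> int:
    """Return the highest Evolution level whose prerequisites are met (0 if none)."""
    best = 0
    for level, required in sorted(PREREQUISITES.items()):
        if (architecture_level >= required["architecture"]
                and agent_spec_level >= required["agent_spec"]):
            best = level
    return best
```

A system at Architecture Level 3 but Agent Spec Level 1 would therefore be limited to Evolution Level 1 under this reading.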
2.1 Conformance Classes
This specification defines three conformance classes:
- Evolution Spec Level 1 (Gap Detection): The system autonomously detects degradation and gaps (§3 Gap Detection) and surfaces them as structured change proposals (§4 Change Proposal). Detection MUST cover degradation and gap classes. Opportunity detection is RECOMMENDED but not required at Level 1. Calibration reports feed into the detection mechanism; the system MUST NOT exhibit calibration stall.
- Evolution Spec Level 2 (Eval-Gated Change): The system satisfies Level 1 plus autonomous prompt changes through the eval-gate (§5 The Eval-Gate). Prompt variants generated by the evolution agent MUST pass behavioral evals before deployment. Failed evals MUST block deployment and trigger automatic rollback.
- Evolution Spec Level 3 (Graduated Autonomy): The system satisfies Level 2 plus the full graduated autonomy model (§6 Graduated Autonomy). Prompt changes are autonomous, tool changes are eval-gated, agent creation flows through propose-approve-execute, architecture changes require human approval. Every self-modification carries a rollback guarantee (§7 Rollback Guarantee) and is recorded in the AI decision ledger.
2.2 Proving Conformance
A system claiming conformance MUST be verifiable. Verification is performed by:
- Detection audit: Does the system detect all three classes (degradation, gap, opportunity)? When was the last gap detected? When was the last calibration report that resulted in no action?
- Change proposal audit: Are change proposals structured? Do they carry TTLs and rollback plans? Are they recorded in the decision ledger?
- Eval-gate audit: Do prompt changes pass through behavioral evals? Are eval failures blocking? Does the system automatically roll back on eval failure?
- Autonomy audit: Are self-modifications classified by risk tier? Does each tier enforce its required gate? Is there a path for human review at every tier?
- Rollback audit: For the last ten self-modifications, can the system answer: was it rolled back, and if so, how long did the rollback take?
- State machine audit: For the last ten change proposals, does every recorded state transition appear in the normative transition table (§8.3)? Are terminal states (rejected, expired, rolled_back) free of outbound transitions? Did any proposal exceed its TTL without transitioning to expired?
A formal proof of conformance SHALL reference specific change proposals, eval results, rollback records, ledger queries, and state transition histories — not architectural intent.
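Two of the state machine audit questions lend themselves to a mechanical check. The sketch below assumes a proposal's history is available as an ordered list of (from_state, to_state) pairs and that the TTL is measured from a creation timestamp; `audit_proposal` and its parameters are hypothetical names, and the full transition-table check is out of scope here.

```python
from datetime import datetime, timedelta

# Terminal states as listed in the state machine audit above.
TERMINAL_STATES = {"rejected", "expired", "rolled_back"}


def audit_proposal(history: list[tuple[str, str]], created_at: datetime,
                   ttl: timedelta, now: datetime) -> list[str]:
    """Return audit findings for one proposal's recorded transitions."""
    problems = []
    for from_state, to_state in history:
        # A terminal state must never have an outbound transition.
        if from_state in TERMINAL_STATES:
            problems.append(f"outbound transition {from_state} -> {to_state}")
    # A proposal with no recorded transitions is still in its initial state.
    final_state = history[-1][1] if history else "proposed"
    if final_state not in TERMINAL_STATES and now > created_at + ttl:
        problems.append("ttl elapsed without transition to expired")
    return problems
```

An empty result means only that these two checks passed; it is not evidence of conformance on its own.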
3. Gap Detection
3.1 Detection Classes
The evolution agent MUST detect three classes of improvement opportunity. Each class has distinct detection mechanisms and urgency:
- Degradation: An existing feature is getting worse. Detection is continuous and automatic: calibration scores dropping below the rollback threshold defined in the conformance profile (per [[ai-native-architecture]] §5.3), error rates rising, correction rates climbing. The detection mechanism is the calibration system. The evolution agent MUST poll calibration data at least once per calibration window. Degradation detections are the highest urgency: a feature that was working and stopped is a regression.
- Gap: A needed capability does not exist. Detection works by comparing the system’s current state against the normative requirements of [[ai-native-architecture]] and [[ai-agent-spec]]. The specifications are the completeness model. Examples: an event stream exists with no AI consumer (violates architecture spec §12.1), a touchpoint has no declared AI policy (violates §10.1), an agent has zero behavioral evals (violates agent spec §6.3), two agents implement the same capability independently (violates agent spec §4.1). The evolution agent MUST surface gap detections as change proposals. Gap detections are medium urgency: the system is operating without a capability the specifications require.
- Opportunity: A change could improve the system beyond its current baseline. Detection requires the system to identify alternatives: a model with better cost-accuracy ratio (per the model selection dimensions in [[ai-native-architecture]] §5.3), a prompt variant that outperforms the current version in A/B comparison, a tool whose invocation pattern could be upgraded (e.g., runtime-native → API tool). The evolution agent SHOULD surface opportunity detections. Opportunities are the lowest urgency: the system is working correctly but could work better.
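As an illustration of the degradation class and the urgency ordering described above, here is a minimal sketch. It assumes calibration scores arrive as a mapping from feature name to score; `Finding`, `detect_degradation`, and `by_urgency` are hypothetical names, and the threshold value would come from the conformance profile.

```python
from dataclasses import dataclass

# Urgency ordering taken from the text: degradation first, then gap, then opportunity.
URGENCY = {"degradation": 0, "gap": 1, "opportunity": 2}


@dataclass(frozen=True)
class Finding:
    detection_class: str  # "degradation", "gap", or "opportunity"
    subject: str          # hypothetical label for the affected feature
    detail: str


def detect_degradation(scores: dict[str, float],
                       rollback_threshold: float) -> list[Finding]:
    """Flag every feature whose calibration score is below the threshold."""
    return [
        Finding("degradation", feature,
                f"score {score:.2f} below {rollback_threshold:.2f}")
        for feature, score in sorted(scores.items())
        if score < rollback_threshold
    ]


def by_urgency(findings: list[Finding]) -> list[Finding]:
    """Order findings so the most urgent detection class is handled first."""
    return sorted(findings, key=lambda f: URGENCY[f.detection_class])
```

Gap and opportunity detection need access to the prerequisite specifications and to candidate alternatives, so they are not sketched here.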
3.2 Detection Frequency
The evolution agent MUST run detection at a frequency appropriate to each class:
- Degradation: at least once per calibration window (RECOMMENDED: weekly).
- Gap: at least once per deployment that adds or removes system capabilities.
- Opportunity: at least once per calibration window.
A detection cycle that produces no findings MUST still record a row in the AI decision ledger with kind = "evolution_detection" and findings = 0. This proves the detection ran — silence is not evidence of health.
3.3 Detection Output
Every detection MUST produce a structured change proposal or a null record. The change proposal format is defined in §4. A null record (no findings) MUST carry the detection class, the window it scanned, and a timestamp. The system MUST be able to answer: "when was the last time any detection class produced a finding?"
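A null record might look like the sketch below. The values `kind = "evolution_detection"` and `findings = 0` come from §3.2; the remaining field names (`window_start`, `window_end`, `timestamp`) and both function names are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

DETECTION_CLASSES = {"degradation", "gap", "opportunity"}


def detection_record(detection_class: str, window_start: datetime,
                     window_end: datetime, findings: int,
                     now: Optional[datetime] = None) -> dict:
    """Build the ledger row for one detection cycle, including empty ones."""
    if detection_class not in DETECTION_CLASSES:
        raise ValueError(f"unknown detection class: {detection_class}")
    if findings < 0:
        raise ValueError("findings cannot be negative")
    return {
        "kind": "evolution_detection",
        "detection_class": detection_class,
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "findings": findings,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    }


def last_finding_time(records: list[dict]) -> Optional[str]:
    """Answer: when did any detection class last produce a finding?

    Assumes all timestamps share one UTC offset, so ISO strings sort by time.
    """
    hits = [r["timestamp"] for r in records if r["findings"] > 0]
    return max(hits) if hits else None
```

Writing the row even when `findings` is zero is what lets an auditor distinguish a quiet week from a detector that never ran.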
4. Change Proposal
4.1 Change Proposal Format
A change proposal is a structured document that describes a proposed self-modification. Every change proposal MUST carry:
- proposal_id: A unique, stable identifier.
- detection_class: One of degradation, gap, or opportunity.
- detection_trigger: What the evolution agent observed that produced this proposal: the calibration data point, the gap analysis, the opportunity signal.
- change_type: One of prompt, tool, agent, or model.
- autonomy_tier: The risk tier determining the approval path (§6).
- description: What is changing and why.
- implementation: The concrete change. For prompt changes: the full text of the new prompt version. For tool changes: the tool_id and new version number. For agent changes: the complete agent definition. For model changes: the new model identifier and the routing rule change.
- eval_suite: An array of eval_ids that must pass for this change to proceed.
- rollback_plan: The exact steps to revert this change, executable without human intervention.
- ttl: The time after which this proposal expires if not acted upon.
- proposed_by: The evolution agent’s identity.
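The field list above maps naturally onto a typed record. This is a sketch only: the field names come from the list, while the Python types, the tuple representation of `eval_suite` and `rollback_plan`, and the rule that both must be non-empty are assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta

DETECTION_CLASSES = {"degradation", "gap", "opportunity"}
CHANGE_TYPES = {"prompt", "tool", "agent", "model"}


@dataclass(frozen=True)
class ChangeProposal:
    proposal_id: str
    detection_class: str
    detection_trigger: str
    change_type: str
    autonomy_tier: str
    description: str
    implementation: str
    eval_suite: tuple[str, ...]     # eval_ids that must pass
    rollback_plan: tuple[str, ...]  # ordered revert steps (assumed shape)
    ttl: timedelta
    proposed_by: str

    def __post_init__(self) -> None:
        # Reject proposals that cannot satisfy the required-field rules.
        if self.detection_class not in DETECTION_CLASSES:
            raise ValueError(f"bad detection_class: {self.detection_class}")
        if self.change_type not in CHANGE_TYPES:
            raise ValueError(f"bad change_type: {self.change_type}")
        if not self.eval_suite:
            raise ValueError("eval_suite must list at least one eval_id")
        if not self.rollback_plan:
            raise ValueError("rollback_plan must not be empty")
        if self.ttl <= timedelta(0):
            raise ValueError("ttl must be positive")
```

Freezing the dataclass reflects the idea that a proposal's content is fixed once created; only its state changes, and that lives in the ledger.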
4.2 Proposal Lifecycle
Every change proposal MUST conform to the finite state machine defined in §8. The prose summary:
proposed → evaluating → approved | rejected | expired → deploying → deployed | rolled_back, with post-deploy paths through degraded and rolling_back.
Every state transition MUST be recorded in the AI decision ledger with kind = "evolution_proposal", carrying the proposal_id, from_state, to_state, and timestamp. The transition from approved to deploying for autonomous-tier changes MUST be recorded with autonomy = true. The transition from approved to deploying for reviewed-tier changes MUST carry the reviewer’s identity. Transitions that carry a rejection or rollback reason MUST include reason in the ledger record.
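The ledger rules in this subsection could be enforced at record-construction time, as in the sketch below. The keys `kind`, `proposal_id`, `from_state`, `to_state`, `timestamp`, `autonomy`, and `reason` follow the text; the `reviewer` key, the function name, and the choice of which target states require a reason are assumptions.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumption: these target states are the ones that carry a rejection or
# rollback reason.
REASON_STATES = {"rejected", "rolling_back", "rolled_back"}


def transition_record(proposal_id: str, from_state: str, to_state: str, *,
                      autonomous: bool = False,
                      reviewer: Optional[str] = None,
                      reason: Optional[str] = None,
                      now: Optional[datetime] = None) -> dict:
    """Build one ledger row for a proposal state transition."""
    record = {
        "kind": "evolution_proposal",
        "proposal_id": proposal_id,
        "from_state": from_state,
        "to_state": to_state,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    }
    if (from_state, to_state) == ("approved", "deploying"):
        if autonomous:
            record["autonomy"] = True
        elif reviewer:
            record["reviewer"] = reviewer
        else:
            raise ValueError("reviewed-tier deployment needs a reviewer identity")
    if to_state in REASON_STATES:
        if not reason:
            raise ValueError(f"transition to {to_state} needs a reason")
        record["reason"] = reason
    return record
```

Raising on a missing reviewer or reason keeps an incomplete row from ever reaching the ledger.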
5. The Eval-Gate
5.1 Eval-Gate Mechanics
Every self-modification that reaches production MUST pass through the eval-gate. The eval-gate is not optional — it is the architectural enforcement of "do no harm." The evolution agent MUST:
- Before deploying any change, run the eval suite declared in the change proposal against the proposed change.
- Compare eval results against the baseline eval results for the current production version. If no baseline exists (the prompt is new, not a version change), the eval-gate MUST record this fact and proceed. The first version of a prompt establishes the baseline. Subsequent versions MUST regress against it.
- Block deployment if any structural assertion fails, any safety lexical assertion fails, or the LLM judge score drops below the threshold defined in the conformance profile.
- Record the eval-gate result in the AI decision ledger with kind = "evolution_eval_gate", carrying the proposal_id, eval_ids run, pass/fail counts per assertion type, and the gate decision (pass | block).
A change that bypasses the eval-gate is non-conformant. The eval-gate is enforced by the system, not by the evolution agent’s discretion.
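The blocking rule and the ledger record can be sketched together. This example assumes eval results have already been aggregated into failure counts and a single judge score; `EvalResults`, `eval_gate`, and the `reasons`, `failures`, and `baseline_recorded` keys are hypothetical. How a candidate is compared against the baseline is left to the conformance profile, so the sketch only records whether a baseline existed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResults:
    structural_failures: int
    safety_lexical_failures: int
    judge_score: float


def eval_gate(proposal_id: str, eval_ids: list[str], candidate: EvalResults,
              baseline: Optional[EvalResults], judge_threshold: float) -> dict:
    """Apply the three blocking conditions and build the ledger record."""
    reasons = []
    if candidate.structural_failures > 0:
        reasons.append("structural_assertion_failed")
    if candidate.safety_lexical_failures > 0:
        reasons.append("safety_lexical_assertion_failed")
    if candidate.judge_score < judge_threshold:
        reasons.append("judge_score_below_threshold")
    return {
        "kind": "evolution_eval_gate",
        "proposal_id": proposal_id,
        "eval_ids": list(eval_ids),
        "failures": {
            "structural": candidate.structural_failures,
            "safety_lexical": candidate.safety_lexical_failures,
        },
        "judge_score": candidate.judge_score,
        # A missing baseline is recorded, not treated as a failure.
        "baseline_recorded": baseline is not None,
        "gate_decision": "block" if reasons else "pass",
        "reasons": reasons,
    }
```

Because the function is pure, it can run in the deployment system rather than inside the evolution agent, which is the separation this section requires.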
5.2 Eval-Gate Duration
The eval-gate for prompt changes MUST complete within a time window defined in the conformance profile (RECOMMENDED: 5 minutes). If the eval-gate does not complete within the window, the change proposal transitions to expired. This prevents eval-gate stalls from blocking the evolution pipeline.
For tool and agent changes, the eval-gate window MAY be longer (RECOMMENDED: 15 minutes) because the eval suite is larger.
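A timeout check for the eval-gate window might be sketched as follows. The durations are the RECOMMENDED values from this subsection, which a conformance profile may override; the text gives no window for model changes, so the sketch deliberately omits that key. `GATE_WINDOW` and `gate_timed_out` are hypothetical names.

```python
from datetime import datetime, timedelta

# RECOMMENDED defaults from the text; a conformance profile may override them.
GATE_WINDOW = {
    "prompt": timedelta(minutes=5),
    "tool": timedelta(minutes=15),
    "agent": timedelta(minutes=15),
}


def gate_timed_out(change_type: str, started_at: datetime, now: datetime) -> bool:
    """True when the eval-gate has run longer than its window allows."""
    if change_type not in GATE_WINDOW:
        raise ValueError(f"no eval-gate window defined for {change_type}")
    return now - started_at > GATE_WINDOW[change_type]
```

When this returns true for a proposal still in evaluating, the proposal transitions to expired rather than waiting for the evals to finish.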
6. Graduated Autonomy
6.1 Autonomy Tiers
Self-modifications are classified by risk. The risk tier determines the approval path. The classification MUST follow this model:
- Prompt change (lowest risk): A modification to an agent’s prompt text. The eval-gate verifies behavioral correctness. An autonomous prompt change that passes the eval-gate MAY be deployed without human review. If the eval-gate blocks, the change is rejected automatically. If the prompt serves an action kind classified as Tier 1 (Irreversible) per [[ai-native-architecture]] §8.1, the eval-gate MUST include safety lexical assertions and MUST run with a judge model whose calibration is verified. A prompt change that affects Tier 1 action paths and fails any safety assertion MUST be blocked regardless of other eval results.
- Tool or model change (medium risk): A modification to a tool’s implementation, a new tool registration, a tool deprecation, or a change to the model routing for an action kind. The change MUST pass the eval-gate. Additionally, the change MUST be reviewed, either by a human (Propose-Approve-Execute) or by an autonomous orchestrator agent whose calibration score for the relevant change type exceeds a threshold defined in the conformance profile. A tool or model change that bypasses review is non-conformant.
- Agent creation (high risk): Defining a new agent: identity, tool_set, prompt_set, eval suite. MUST flow through Propose-Approve-Execute (per [[ai-native-architecture]] §7.1). A human MUST approve. No autonomous path is permitted.
- Architecture change (highest risk): Any change that modifies the system’s safety gates, risk gradient, provider boundaries, or capability-based access control. MUST require human approval. The evolution agent MUST NOT propose architecture changes autonomously. It MAY surface architecture gaps as informational alerts, but the proposal MUST originate from a human.
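The four tiers above reduce to a small routing function. The sketch assumes the eval-gate has already passed and answers only the question of who must review; the return labels and the name `approval_path` are illustrative, and the Tier 1 safety requirements for prompts are not modeled.

```python
def approval_path(change_type: str, proposed_by_agent: bool = True) -> str:
    """Return who must review a change that has already passed the eval-gate."""
    if change_type == "architecture":
        # The evolution agent may raise alerts but may not originate the proposal.
        if proposed_by_agent:
            raise PermissionError("architecture proposals must originate from a human")
        return "human"
    if change_type == "agent":
        return "human"
    if change_type in {"tool", "model"}:
        return "human_or_orchestrator"
    if change_type == "prompt":
        return "autonomous"
    raise ValueError(f"unknown change type: {change_type}")
```

Keeping the routing in one function makes the autonomy audit in §2.2 straightforward: every tier's gate is visible in a single place.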
6.2 Proportionality
The autonomy granted to the evolution agent MUST be proportional to its calibration track record. A change is successful if it survives its observation window without triggering rollback and its post-observation calibration score is at or above the score that triggered its creation. An evolution agent whose successful-change rate for a given change type drops below a threshold defined in the conformance profile MUST have its autonomy reduced for that change type. An evolution agent that has never deployed a successful prompt change MUST NOT be granted autonomy over tool changes.
Autonomy grants are not permanent. The calibration system MUST periodically recompute the evolution agent’s autonomy level based on its track record of successful vs. rolled-back changes.
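The success definition and the proportionality rule can be sketched as three small functions. The per-change-type history layout, the function names, and the conservative choice to deny autonomy when no track record exists are all assumptions; the minimum rate would come from the conformance profile.

```python
from typing import Optional


def change_succeeded(rolled_back: bool, post_score: float,
                     trigger_score: float) -> bool:
    """A change succeeds if it was not rolled back and did not lower the score."""
    return not rolled_back and post_score >= trigger_score


def successful_change_rate(outcomes: list[bool]) -> Optional[float]:
    """Fraction of successful changes, or None when there is no track record."""
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


def autonomy_allowed(change_type: str, history: dict[str, list[bool]],
                     min_rate: float) -> bool:
    """Decide whether the agent keeps autonomy for one change type."""
    # No autonomy over tool changes without at least one successful prompt change.
    if change_type == "tool" and not any(history.get("prompt", [])):
        return False
    rate = successful_change_rate(history.get(change_type, []))
    if rate is None:
        # Assumption: with no track record, autonomy is not granted; reviewed
        # deployments are expected to build the history first.
        return False
    return rate >= min_rate
```

Recomputing `autonomy_allowed` on every calibration cycle is one way to satisfy the requirement that grants are not permanent.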
6.3 Autonomy Records
Every autonomous action taken by the evolution agent MUST be recorded in the AI decision ledger with kind = "evolution_autonomous_action", carrying the proposal_id, the autonomy tier, the calibration score that authorized the action, and the outcome. The system MUST be able to answer: "what autonomous changes did the evolution agent make this week, and what was the rollback rate?"
7. Rollback Guarantee
7.1 Rollback Mechanics
Every self-modification MUST have a defined, executable rollback path. The rollback plan is part of the change proposal (§4.1). The rollback path depends on the change type:
- Prompt changes: The routing layer switches traffic to the previous version. Completed in seconds.
- Tool or model changes: The registry reverts to the previous tool version, or the routing layer reverts to the previous model for the affected action kinds. In-flight invocations complete on the current version; new invocations use the previous version.
- Agent creation: The agent is disabled in the agent registry. Its tools remain available to other agents. Its prompts are archived.
- Architecture changes: The change is reverted through the same deployment mechanism that applied it. Rollback MUST be verified by a human.
7.2 Rollback Triggers
The system MUST automatically trigger rollback when any of the following conditions are met:
- The eval-gate blocks the change before deployment.
- Post-deployment calibration shows accuracy for the changed feature dropping below the rollback threshold within the observation window defined in the conformance profile.
- The error rate for the changed feature exceeds a threshold within the observation window.
- A human manually triggers rollback through the review surface.
The system MUST NOT require the evolution agent to initiate its own rollback. The calibration system — not the evolution agent — is the rollback authority. This separation of concerns prevents a compromised or incorrect evolution agent from blocking its own rollback.
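The post-deployment triggers could be evaluated by the calibration system with a function like the one below. It is a sketch: `PostDeployReading` and `rollback_reason` are hypothetical, only the `"calibration_degradation"` label appears elsewhere in this specification (§8.3), and the pre-deployment eval-gate trigger is handled by the gate itself, so it is not repeated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PostDeployReading:
    accuracy: float
    error_rate: float
    in_observation_window: bool
    manual_request: bool = False


def rollback_reason(reading: PostDeployReading, accuracy_floor: float,
                    error_ceiling: float) -> Optional[str]:
    """Return why rollback should start, or None if no trigger fired.

    Intended to run inside the calibration system, not the evolution agent.
    """
    if reading.manual_request:
        return "manual"
    if reading.in_observation_window and reading.accuracy < accuracy_floor:
        return "calibration_degradation"
    if reading.in_observation_window and reading.error_rate > error_ceiling:
        return "error_rate"
    return None
```

The function takes no reference to the evolution agent at all, which mirrors the separation of rollback authority described above.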
7.3 Observation Window
Every deployed self-modification enters an observation window after deployment. During this window, calibration data is monitored at an accelerated frequency (RECOMMENDED: daily, compared to the standard weekly calibration cycle). If the change survives the observation window without triggering rollback, it is promoted to stable. A stable change is still subject to ongoing calibration monitoring but at the standard frequency.
The observation window duration MUST be defined in the conformance profile. A typical starting point is 7 days for prompt changes, 14 days for tool changes.
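Promotion to stable can be expressed as a simple time comparison. The durations below are the typical starting points named in this subsection, not normative values, and `OBSERVATION_WINDOW` and `is_stable` are hypothetical names.

```python
from datetime import datetime, timedelta

# Typical starting points from the text; the conformance profile is authoritative.
OBSERVATION_WINDOW = {
    "prompt": timedelta(days=7),
    "tool": timedelta(days=14),
}


def is_stable(change_type: str, deployed_at: datetime, now: datetime,
              rolled_back: bool) -> bool:
    """True once a change has survived its full observation window."""
    if rolled_back:
        return False
    return now - deployed_at >= OBSERVATION_WINDOW[change_type]
```

A prompt change deployed on a Monday would, under these defaults, become stable the following Monday if no rollback trigger fires in between.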
7.4 Evolution-Specific Security Considerations
This subsection supplements the threat models in [[ai-native-architecture]] §9.5 and [[ai-agent-spec]] §8.3 with threats specific to self-evolution.
- Compromised evolution agent: The evolution agent has the authority to modify prompts, tools, and models. If compromised, it is the highest-leverage attack surface in the entire system. The autonomy proportionality rule (§6.2) is the primary defense: an evolution agent loses autonomy as its calibration degrades. The separation of rollback authority (§7.2) ensures that a compromised evolution agent cannot block its own rollback. The system MUST alert a human if the evolution agent’s successful-change rate drops below threshold or if its rollback rate exceeds threshold.
- Autonomous change cascade: A well-intentioned but incorrect autonomous change can trigger a cascade: the change degrades accuracy, which triggers another change proposal, which degrades further. The system MUST limit the rate of autonomous changes: no more than one autonomous change per action kind per observation window. If a second autonomous change is proposed for the same action kind within the window, it MUST flow through human review regardless of its autonomy tier.
- Prompt artifact self-modification: If the evolution agent modifies its own prompt, it can change its own behavior in ways that evade calibration detection. The evolution agent MUST NOT modify its own prompt autonomously. Any change to the evolution agent’s prompt MUST flow through human review (Propose-Approve-Execute) regardless of the prompt-change autonomy tier.
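The cascade limit and the self-modification rule can both be checked before an autonomous prompt change is deployed. The sketch assumes the system tracks the time of the last autonomous change per action kind; `review_path` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def review_path(action_kind: str, target_agent: str, evolution_agent_id: str,
                last_autonomous: dict[str, datetime], now: datetime,
                observation_window: timedelta) -> str:
    """Route a prompt change that would otherwise be autonomous."""
    # The evolution agent may never change its own prompt autonomously.
    if target_agent == evolution_agent_id:
        return "human_review"
    # At most one autonomous change per action kind per observation window.
    last = last_autonomous.get(action_kind)
    if last is not None and now - last < observation_window:
        return "human_review"
    return "autonomous"
```

Routing the second change to a human, rather than rejecting it, keeps a genuine fix available while still breaking the feedback loop that drives a cascade.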
8. Change Proposal State Machine
The change proposal lifecycle (§4.2) is normatively defined as a finite state machine. A conformant implementation MUST treat the transition table in §8.3 as authoritative — prose descriptions elsewhere in this specification MUST NOT contradict it. Conformance to the state machine is verifiable: every recorded transition either appears in §8.3 or the implementation is non-conformant.
8.1 States
A change proposal MUST be in exactly one proposal state at any time. The ten states and their meanings:
| State | Description | Terminal |
|---|---|---|
| proposed | The evolution agent created the proposal per §4.1. The proposal carries a valid eval_suite, rollback_plan, and ttl. Evals have not yet started. | No |
| evaluating | The eval-gate (§5) is running the declared eval_suite against the proposed change. | No |
| approved | All evals passed (§5.1). For tiers requiring review (§6.1), human or orchestrator approval has been obtained. The proposal is eligible to deploy. | No |
| rejected | Evals failed, a reviewer rejected the proposal, or the eval-gate blocked deployment. Carries a rejection_reason. | Yes |
| expired | The proposal’s ttl elapsed before deployment completed. No further action is permitted. | Yes |
| deploying | The change is being applied to the system. Post-deploy verification (§5.1, §7.1) has not yet completed. | No |
| deployed | The change is live. The observation window (§7.3) is active and calibration monitoring has begun. | No |
| degraded | Post-deployment calibration (§7.2) detected accuracy below the rollback threshold or error rate above threshold for the changed feature. | No |
| rolling_back | The rollback plan (§4.1, §7.1) is executing. The calibration system, not the evolution agent, initiated rollback (§7.2). | No |
| rolled_back | The change was reverted. Rollback completed and verified. Carries a rollback_reason. | Yes |
8.2 State Diagram
This subsection is informative.
The normative transition table is §8.3. The diagram below summarizes the primary paths:
                   ┌──────────┐
                   │ proposed │
                   └────┬─────┘
                        │ start eval (§5)
                        ▼
                 ┌─────────────┐
         ┌───────│ evaluating  │───────┐
         │       └──────┬──────┘       │
     eval fail      eval pass    TTL / timeout
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐    ┌──────────┐   ┌─────────┐
    │rejected │    │ approved │   │ expired │ (terminal)
    └─────────┘    └────┬─────┘   └─────────┘
                        │ deploy (§6)
                        ▼
                  ┌───────────┐
                  │ deploying │
                  └─────┬─────┘
            verify pass │ verify fail / mid-deploy TTL
                        ▼                     │
                   ┌──────────┐               │
                   │ deployed │               │
                   └────┬─────┘               │
             calibration│                     │
                degrade ▼                     ▼
                   ┌──────────┐        ┌──────────────┐
                   │ degraded │───────▶│ rolling_back │
                   └──────────┘        └──────┬───────┘
                                              │
                                              ▼
                                       ┌─────────────┐
                                       │ rolled_back │ (terminal)
                                       └─────────────┘

    Any non-terminal state ──TTL elapsed──▶ expired
8.3 Transition Table
Every state transition MUST satisfy exactly one row in this table. Transitions not listed here MUST NOT occur. The Guard column defines preconditions that MUST hold before the transition. The Action column defines side effects that MUST occur atomically with the transition.
| From | To | Guard | Action |
|---|---|---|---|
| proposed | evaluating | Proposal created with all required fields (§4.1). | MUST schedule and run the declared eval_suite (§5.1). MUST record transition with trigger = "creation". |
| proposed | expired | ttl elapsed before eval started. | MUST record expiry_reason = "ttl_before_eval". |
| evaluating | approved | All evals in eval_suite passed (§5.1). No structural, safety lexical, or judge-score regressions. For tiers requiring review (§6.1): human or orchestrator approval obtained. | MUST record eval-gate result (§5.1). MUST record reviewer identity when review was required. |
| evaluating | rejected | Any eval failed, any safety assertion failed, or judge score below threshold (§5.1). | MUST record rejection_reason with failed eval_ids and assertion details. MUST record eval-gate result with gate_decision = block. |
| evaluating | expired | ttl elapsed, or eval-gate duration exceeded (§5.2). | MUST abort in-flight evals. MUST record expiry_reason = "ttl_during_eval" or expiry_reason = "eval_gate_timeout". |
| approved | deploying | rollback_plan is valid and executable (§4.1, §7.1). Change is within the evolution agent’s current autonomy grant for its change_type (§6.2). For reviewed tiers: reviewer approval on record. | MUST verify rollback plan before applying change. MUST record autonomy = true for autonomous-tier deployments or reviewer identity for reviewed-tier deployments. |
| approved | rejected | Human or orchestrator reviewer rejects during the review window (§6.1). | MUST record rejection_reason and reviewer identity. |
| approved | expired | ttl elapsed before deployment started. | MUST record expiry_reason = "ttl_before_deploy". |
| deploying | deployed | Change applied successfully. Post-deploy verification passed: eval-gate spot-checks pass and affected components report healthy (§5.1). | MUST start the observation window (§7.3). MUST record deployment timestamp and version identifiers. |
| deploying | rolling_back | Post-deploy verification failed, partial deploy detected, or ttl elapsed mid-deploy. | MUST execute the rollback_plan (§7.1). MUST record rollback_reason. MUST NOT leave the system in a partially-applied state. |
| deploying | expired | ttl elapsed and rollback is not applicable (proposal never reached production). | SHOULD transition through rolling_back if any partial state exists. MUST record expiry_reason = "ttl_mid_deploy". |
| deployed | degraded | Calibration detects accuracy below the rollback threshold or error rate above threshold within the observation window (§7.2, §7.3). | MUST record calibration data that triggered degradation. MUST NOT require the evolution agent to acknowledge degradation before rollback proceeds. |
| deployed | rolling_back | Human manually triggers rollback (§7.2), or rollback trigger other than calibration degradation fires (error rate threshold). | MUST execute the rollback_plan. MUST record rollback_reason and initiator identity. |
| degraded | rolling_back | Automatic rollback trigger confirmed (§7.2). Autonomy tier permits automatic rollback for this change_type (§6.1); architecture-tier changes MUST require human confirmation before rollback executes. | MUST execute the rollback_plan without human intervention (except architecture-tier). MUST record rollback_reason = "calibration_degradation". |
| rolling_back | rolled_back | Rollback plan completed. Previous version confirmed active for all affected components (§7.1). | MUST verify rollback success. MUST record rollback duration. MUST update evolution agent autonomy proportionality inputs (§6.2). |
| rolling_back | deployed | Rollback failed but production is still serving the previous version, with no user-visible regression (§8.4). | SHOULD alert a human. MUST record rollback_failure_reason. MUST NOT transition to rolled_back until rollback is verified. |
| TTL enforcement (any non-terminal state) | | | |
| any non-terminal | expired | ttl elapsed and no more specific transition applies. | MUST enforce TTL on every state. If the proposal is in deploying or deployed, MUST attempt rollback before or concurrent with expiry per §8.4. |
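The From and To columns of the table translate directly into an adjacency map. The sketch below enforces only which transitions are listed; the Guard and Action columns are not modeled. `ALLOWED`, `IllegalTransition`, and `transition` are hypothetical names, and the expired entry on every non-terminal state reflects the TTL enforcement row.

```python
TERMINAL = frozenset({"rejected", "expired", "rolled_back"})

# One entry per From state; targets are the To states listed in the table.
ALLOWED = {
    "proposed": {"evaluating", "expired"},
    "evaluating": {"approved", "rejected", "expired"},
    "approved": {"deploying", "rejected", "expired"},
    "deploying": {"deployed", "rolling_back", "expired"},
    "deployed": {"degraded", "rolling_back", "expired"},
    "degraded": {"rolling_back", "expired"},
    "rolling_back": {"rolled_back", "deployed", "expired"},
}


class IllegalTransition(Exception):
    """Raised when a transition does not appear in the table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the table does not list the move."""
    if current in TERMINAL:
        raise IllegalTransition(f"{current} is terminal")
    if target not in ALLOWED.get(current, set()):
        raise IllegalTransition(f"{current} -> {target} is not listed")
    return target
```

A full implementation would evaluate the guard before calling `transition` and perform the listed actions atomically with it.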
8.4 Invariants
A conformant implementation MUST maintain these invariants across all proposals:
- Single state: A proposal MUST occupy exactly one state at any time. Concurrent transitions on the same proposal_id MUST be serialized: the second transition MUST fail or block until the first completes.
- Terminal immutability: Proposals in rejected, expired, or rolled_back MUST NOT transition to any other state. A new proposal MUST be created to retry a failed change.
- TTL monotonicity: Once a proposal transitions to expired, no further action on that proposal_id is permitted regardless of subsequent clock adjustments.
- Rollback authority: Transitions to rolling_back from deployed or degraded MUST be initiated by the calibration system or an authorized human reviewer, not by the evolution agent that proposed the change (§7.2).
- Failed rollback: If rollback execution fails but the previous production version remains active, the proposal MUST remain in rolling_back or revert to deployed with an alert; it MUST NOT transition to rolled_back until rollback is verified.
- Ledger completeness: Every transition in §8.3 MUST produce exactly one AI decision ledger record with kind = "evolution_proposal" (§4.2).
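The following sketch is informative. It shows one way a transition function might enforce four of these invariants (single state, terminal immutability, rollback authority, ledger completeness). It is a minimal in-process illustration, not a definitive implementation: the initiator labels, the record fields other than kind, and the names Proposal, TransitionError, and apply_transition are assumptions. It does not validate transitions against the §8.3 table, and a distributed deployment would need a shared lock or compare-and-swap instead of threading.Lock.

```python
import threading
from dataclasses import dataclass, field

TERMINAL_STATES = frozenset({"rejected", "expired", "rolled_back"})
ROLLBACK_SOURCES = frozenset({"deployed", "degraded"})
# Hypothetical initiator labels; the spec names the roles, not these strings.
ROLLBACK_AUTHORITIES = frozenset({"calibration_system", "human_reviewer"})


class TransitionError(Exception):
    """Raised when a transition would violate an invariant."""


@dataclass
class Proposal:
    proposal_id: str
    state: str
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)


def apply_transition(proposal, to_state, initiator, ledger):
    # Single state: fail fast if another transition holds the lock.
    if not proposal.lock.acquire(blocking=False):
        raise TransitionError("concurrent transition in progress")
    try:
        # Terminal immutability.
        if proposal.state in TERMINAL_STATES:
            raise TransitionError(f"{proposal.state} is terminal")
        # Rollback authority: the proposing agent may not initiate rollback.
        if (to_state == "rolling_back"
                and proposal.state in ROLLBACK_SOURCES
                and initiator not in ROLLBACK_AUTHORITIES):
            raise TransitionError(f"{initiator} may not initiate rollback")
        # Ledger completeness: exactly one record per applied transition.
        record = {
            "kind": "evolution_proposal",
            "proposal_id": proposal.proposal_id,
            "from_state": proposal.state,
            "to_state": to_state,
            "initiator": initiator,
        }
        ledger.append(record)
        proposal.state = to_state
        return record
    finally:
        proposal.lock.release()
```

The non-blocking acquire implements the "fail" option of the single-state invariant; calling acquire() with its default blocking behavior would implement the "block" option instead.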
8.5. Stateless Agent Compatibility
Each transition SHOULD be implementable by a stateless agent that receives only the current proposal state and a transition payload. The payload MUST carry sufficient context for the guard to evaluate:
- proposal_id, from_state, to_state
- Eval results (for evaluating → approved|rejected)
- Reviewer identity and decision (for approved → deploying|rejected)
- Calibration readings (for deployed → degraded)
- Rollback verification result (for rolling_back → rolled_back)
The evolution agent MAY delegate each transition to a distinct sub-agent (per [[ai-agent-spec]]). The state machine MUST NOT require agents to hold session history beyond the current state and payload — all prior transitions MUST be recoverable from the decision ledger.
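The following sketch is informative. It illustrates the stateless shape described above: a pure function that takes the current state and a payload, checks that the payload carries the context the guard needs, and returns the next state. The context key names (eval_results, reviewer_decision, calibration_readings, rollback_verification), the ledger record fields from_state and to_state, and the names TransitionPayload, step, and replay are assumptions made for illustration. Guard evaluation itself is omitted.

```python
from dataclasses import dataclass

# Context each guard needs, keyed by (from_state, to_state).
# The key names on the right are hypothetical.
REQUIRED_CONTEXT = {
    ("evaluating", "approved"): "eval_results",
    ("evaluating", "rejected"): "eval_results",
    ("approved", "deploying"): "reviewer_decision",
    ("approved", "rejected"): "reviewer_decision",
    ("deployed", "degraded"): "calibration_readings",
    ("rolling_back", "rolled_back"): "rollback_verification",
}


@dataclass(frozen=True)
class TransitionPayload:
    proposal_id: str
    from_state: str
    to_state: str
    context: dict


def step(current_state: str, payload: TransitionPayload) -> str:
    """Pure function: current state plus payload gives the next state."""
    if payload.from_state != current_state:
        raise ValueError("payload is stale: from_state does not match")
    needed = REQUIRED_CONTEXT.get((payload.from_state, payload.to_state))
    if needed is not None and needed not in payload.context:
        raise ValueError(f"payload is missing {needed}")
    return payload.to_state


def replay(ledger, proposal_id):
    """Recover a proposal's prior transitions from ledger records."""
    return [
        (r["from_state"], r["to_state"])
        for r in ledger
        if r.get("kind") == "evolution_proposal"
        and r.get("proposal_id") == proposal_id
    ]
```

Because step holds no session state, any sub-agent can execute it; replay shows that history is read back from the ledger rather than kept in agent memory.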
9. Anti-Pattern Catalog
This section is informative.
| Anti-Pattern | Description | Normative Reference |
|---|---|---|
| Calibration Stall | Calibration reports are generated, read by a human, and never re-enter the system as changes | §3 Gap Detection |
| Blind Deployment | A self-modification reaches production without passing through the eval-gate | §5 The Eval-Gate |
| Rollback Orphan | A deployed change has no executable rollback path | §7 Rollback Guarantee |
| Autonomy Creep | An evolution agent with a weak calibration track record retains full autonomy | §6.2 Proportionality |
| Silent Detection | Detection cycles run but produce no records, so the system cannot prove detection ran | §3.2 Detection Frequency |
| Eval-Gate Bypass | The evolution agent deploys a change and runs evals afterward, rationalizing failures | §5.1 Eval-Gate Mechanics |
| Self-Rollback Conflict | The evolution agent is responsible for rolling back its own changes | §7.2 Rollback Triggers |
| Architecture Self-Mod | The evolution agent proposes or deploys architecture changes autonomously | §6.1 Autonomy Tiers |
| Invalid Transition | A change proposal transitions between states not listed in the normative transition table, e.g., rejected → deploying or deployed → approved | §8.3 Transition Table |
| Orphan Deploy | A proposal reaches deploying without a verified rollback_plan or outside the evolution agent’s autonomy grant | §8.3 Transition Table |
| TTL Ghost | A proposal exceeds its ttl but remains in a non-terminal state, so the pipeline stalls on stale proposals | §8.4 Invariants |
10. Migration Path
This section is informative.
10.1. Level 1: Gap Detection
- Wire the evolution agent to poll calibration data at least weekly
- Implement all three detection classes: degradation, gap, opportunity
- Ensure every detection cycle produces a record (including null records)
- Surface detections as structured change proposals
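As a rough illustration of the checklist above, the sketch below runs one detection cycle and always returns a record, including a null record when nothing is found. Everything concrete in it is an assumption: calibration data is modeled as a mapping from decision kind to accuracy, the 0.6 threshold is arbitrary, and the record fields and the names run_detection_cycle and detect_degradation are hypothetical.

```python
from datetime import datetime, timezone

# The three detection classes named in the Level 1 checklist.
DETECTION_CLASSES = ("degradation", "gap", "opportunity")


def run_detection_cycle(detectors, calibration, now=None):
    """Run every detection class once and always return a cycle record."""
    missing = [c for c in DETECTION_CLASSES if c not in detectors]
    if missing:
        raise ValueError(f"missing detectors: {missing}")
    findings = []
    for detection_class in DETECTION_CLASSES:
        for finding in detectors[detection_class](calibration):
            findings.append({"class": detection_class, **finding})
    ran_at = now or datetime.now(timezone.utc)
    return {
        "kind": "detection_cycle",  # hypothetical record kind
        "ran_at": ran_at.isoformat(),
        "findings": findings,
        "null_record": not findings,  # True when the cycle found nothing
    }


def detect_degradation(calibration, threshold=0.6):
    """Flag decision kinds whose accuracy is below an assumed threshold."""
    return [
        {"decision_kind": kind, "accuracy": accuracy}
        for kind, accuracy in sorted(calibration.items())
        if accuracy < threshold
    ]
```

Returning a record even when findings is empty is what lets the system later prove that a detection cycle ran.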
10.2. Level 2: Eval-Gated Change
- Implement the eval-gate: run behavioral evals before any prompt change deployment
- Wire eval-gate results to the decision ledger
- Implement automatic rollback on eval-gate failure
- Deploy observation windows with accelerated calibration monitoring
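A minimal sketch of the first two items, assuming each behavioral eval is a named callable that returns a truthy value when the candidate prompt passes. The pass criterion (every case must pass by default), the record fields other than kind, and the name eval_gate are assumptions for illustration; rollback and observation windows are not shown.

```python
def eval_gate(candidate_prompt, eval_cases, ledger, min_pass_rate=1.0):
    """Run behavioral evals before deployment and record the outcome."""
    results = [
        {"case": name, "passed": bool(check(candidate_prompt))}
        for name, check in eval_cases
    ]
    passed = sum(r["passed"] for r in results)
    # Fail closed: an empty eval suite can never approve a change.
    pass_rate = passed / len(results) if results else 0.0
    approved = bool(results) and pass_rate >= min_pass_rate
    record = {
        "kind": "evolution_proposal",
        "stage": "eval_gate",  # hypothetical field
        "pass_rate": pass_rate,
        "results": results,
        "decision": "approved" if approved else "rejected",
    }
    ledger.append(record)  # the gate result reaches the ledger either way
    return record["decision"]
```

The decision is written to the ledger before it is returned, so a rejected candidate leaves the same audit trail as an approved one.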
10.3. Level 3: Graduated Autonomy
- Classify all self-modifications by risk tier
- Implement tier-appropriate approval paths
- Implement autonomy proportionality based on calibration track record
- Implement separation of concerns: calibration system is rollback authority
- Implement the change proposal state machine (§8) with ledger-backed transition history
Appendix A. EC CRM Evolution Profile
This appendix is informative.
The EC CRM currently operates at calibration stall. The calibration pipeline generates reports, posts them to Discord, and stops. No agent reads the reports. No change proposals are generated. Nothing bridges the gap between "the system knows lead_score accuracy is 45%" and "the system does something about it".
The closest existing capability is the evals.py harness for Foundry agents — 18 behavioral eval cases that run after deploy. This is the eval-gate primitive, but it operates on human-initiated deploys, not autonomous change proposals.
The fastest path to Level 1: create an evolution agent definition, grant it read access to calibration data and the decision ledger, and wire it to surface weekly detection reports with structured change proposals. The agent exists in concept — it needs a definition, a prompt, and a tool_set.
Acknowledgments
This specification is derived from empirical analysis of the EC CRM’s calibration pipeline — a system that measures itself accurately and acts on nothing. Every anti-pattern in §9 was directly observed. Every normative requirement in §§3–8 exists because the corresponding anti-pattern represents the gap between measurement and action.
The specification format is modeled on WHATWG Living Standards and IETF RFCs. The normative language follows [rfc2119]. This specification extends [[ai-native-architecture]] and [[ai-agent-spec]].
This document is authored by Hermes. Editors: Clanka and Hermeezy. Last revised 24 May 2026. It is a living standard — subsequent passes may extend, refine, or correct it.