AI-Native Self-Evolution

Living Document,

This version:
https://github.com/clankamode/ai-native-spec
Issue Tracking:
GitHub
Editors:
Clanka
Hermeezy

Abstract

This specification defines how an AI-native system autonomously improves itself. An AI-native system already has identity, memory, calibration, safety gates, and observable agents — the architecture spec and agent spec provide the foundation. Self-evolution closes the loop: the system detects what it is missing, proposes what to build, tests the proposal against behavioral evals, deploys it under graduated autonomy, and measures the outcome. Every self-modification is recorded, attributable, and rollback-able.

Every normative requirement in this specification is reverse-derived from the single largest gap in production AI-native systems: the calibration loop terminates at a human reading a notification. No requirement exists without a corresponding observation of where self-improvement stalls.

Status of This Document

This is a Draft Living Standard synthesized by the Hermes agent runtime from empirical analysis of production AI-native systems. Work began 2026-05-24. Later versions may supersede this document.

Introduction

This section is informative.

An AI-native system measures itself. The AI decision ledger records every decision. Calibration computes accuracy per kind, per model, per agent. Behavioral evals verify output quality before deployment. The architecture provides safety gates at every risk tier. The agent ecosystem has shared tools, versioned prompts, and coordinated runtimes.

And then the calibration report arrives in a notification channel, a human reads it, and the loop stops.

This is the final anti-pattern. The system has the raw material for self-improvement — it can detect degradation through calibration, compare prompt versions through A/B routing, roll back through the routing layer. But there is no agent whose job is to close the loop. No agent reads the calibration report. No agent generates a prompt variant. No agent runs evals. No agent proposes a change, gates on the eval result, deploys to a fraction of traffic, and measures whether accuracy improved.

Self-evolution is the agent that sits between calibration output and system change. It detects three classes of problem: degradation (an existing feature is getting worse), gap (a needed feature doesn’t exist), and opportunity (calibration data suggests a model change would improve accuracy). For each class, it follows a graduated autonomy model: low-risk changes are autonomous, medium-risk changes are eval-gated, high-risk changes require human approval.

This specification defines the self-evolution loop. It defines the detection mechanisms, the change proposal format, the eval-gate that every self-modification must pass, the graduated autonomy model that determines what requires human review, and the rollback guarantee that makes self-evolution safe. Every self-modification is recorded in the AI decision ledger with the evolving agent’s identity. At any moment, a human can ask: "what did the system change about itself this week, and did it work?"

For a conformant implementation of the prompt-optimization tier of this specification, see [[skillopt]] — a systematic text-space optimizer that trains agent skills through trajectory-driven edits, validation-gated updates, and deployable skill artifacts, achieving +23.5 point accuracy improvements across six benchmarks and seven models.

How to read this document. §1 defines the vocabulary. §2 defines conformance — read this to understand the three levels and prerequisite requirements. §§3–8 are the normative body: gap detection, change proposal, eval-gate, graduated autonomy, rollback, and the change proposal state machine. §9 catalogs the anti-patterns. §10 is the migration path. Appendix A provides a worked profile.

1. Terminology

1.1 Normative Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [rfc2119].

1.2 Definitions

Self-Evolution

The process by which an AI-native system autonomously detects problems, designs solutions, tests them, deploys them, and measures outcomes — without requiring a human in every iteration of the loop.

Evolution Agent

An agent whose purpose is to improve the system itself. It reads calibration data, detects gaps and degradation, proposes changes, runs behavioral evals, and — within its autonomy grant — deploys changes. Every self-modification it makes is recorded in the AI decision ledger with its identity.

Gap Detection

The mechanism by which the system identifies what it is missing: a feature with no agent assigned, an event stream with no AI consumer, an accuracy drop below threshold, a model that could be replaced with a cheaper equivalent. Gap detection is not just degradation monitoring — it includes absence detection.

Change Proposal

A structured document describing a proposed self-modification: what is changing (prompt, tool, agent, model), why (the gap it addresses), how (the implementation plan), and how it will be verified (the eval suite it must pass). A change proposal MUST carry a TTL and a rollback plan.

Eval-Gate

A deploy gate that every self-modification MUST pass before reaching production. The eval-gate runs the relevant behavioral eval suite against the proposed change. If evals fail, the change is blocked. If evals pass, the change proceeds according to its autonomy tier.

Graduated Autonomy

The principle that self-modification risk determines approval requirements. Prompt changes (low risk) MAY be autonomous. Tool additions (medium risk) MUST pass eval-gate. New agent creation (high risk) MUST flow through propose-approve-execute. Architecture changes (highest risk) MUST require human approval with no autonomous path.

Rollback Guarantee

The requirement that every self-modification MUST be reversible. For prompt changes: reroute to previous version. For tool changes: revert to previous version. For agent changes: disable the agent. The rollback path MUST be defined in the change proposal and MUST be executable without human intervention.

Calibration Stall

The anti-pattern where a system generates calibration reports that terminate at a human reading a notification — the insights never re-enter the system as changes. An AI-native system capable of self-evolution MUST NOT exhibit calibration stall.

Evolution Spec Level 1 (Gap Detection)

A conformance level where the system detects degradation and gaps autonomously and surfaces them as structured change proposals. Opportunity detection is RECOMMENDED at this level.

Orchestrator

A role — human or governed agent — with the authority to review and approve self-modifications above the autonomous tier. An orchestrator agent is an agent with a defined identity, a calibration track record above the conformance threshold for review decisions, and a declared scope of change types it may approve. An orchestrator agent MUST itself be subject to the same observability and rollback rules as the evolution agent.

Evolution Spec Level 2 (Eval-Gated Change)

A conformance level where the system satisfies Level 1 plus autonomous prompt changes through the eval-gate, with automated rollback on eval failure.

Evolution Spec Level 3 (Graduated Autonomy)

A conformance level where the system satisfies Level 2 plus full graduated autonomy — prompt changes autonomous, tool changes eval-gated, agent creation through propose-approve, all with rollback guarantees and ledger records.

Proposal State

One of the ten normative states in the change proposal finite state machine (§8): proposed, evaluating, approved, rejected, expired, deploying, deployed, degraded, rolling_back, or rolled_back. A conformant implementation MUST enforce the transition table in §8.3 — transitions not listed there MUST NOT occur.

2. Conformance

This specification depends on capabilities defined in [[ai-native-architecture]] and [[ai-agent-spec]]. A system claiming conformance to this specification MUST first satisfy the corresponding conformance levels of its prerequisite specifications:

2.1 Conformance Classes

This specification defines three conformance classes:

2.2 Proving Conformance

A system claiming conformance MUST be verifiable. Verification is performed by:

  1. Detection audit: Does the system detect all three classes (degradation, gap, opportunity)? When was the last gap detected? When was the last calibration report that resulted in no action?

  2. Change proposal audit: Are change proposals structured? Do they carry TTLs and rollback plans? Are they recorded in the decision ledger?

  3. Eval-gate audit: Do prompt changes pass through behavioral evals? Are eval failures blocking? Does the system automatically roll back on eval failure?

  4. Autonomy audit: Are self-modifications classified by risk tier? Does each tier enforce its required gate? Is there a path for human review at every tier?

  5. Rollback audit: For the last ten self-modifications, can the system answer: was it rolled back, and if so, how long did the rollback take?

  6. State machine audit: For the last ten change proposals, does every recorded state transition appear in the normative transition table (§8.3)? Are terminal states (rejected, expired, rolled_back) free of outbound transitions? Did any proposal exceed its TTL without transitioning to expired?

A formal proof of conformance SHALL reference specific change proposals, eval results, rollback records, ledger queries, and state transition histories — not architectural intent.

3. Gap Detection

3.1 Detection Classes

The evolution agent MUST detect three classes of improvement opportunity. Each class has distinct detection mechanisms and urgency:

  1. Degradation: An existing feature is getting worse. Detection is continuous and automatic — calibration scores dropping below the rollback threshold defined in the conformance profile (per [[ai-native-architecture]] §5.3), error rates rising, correction rates climbing. The detection mechanism is the calibration system. The evolution agent MUST poll calibration data at least once per calibration window. Degradation detections are the highest urgency: a feature that was working and stopped is a regression.

  2. Gap: A needed capability does not exist. Detection works by comparing the system’s current state against the normative requirements of [[ai-native-architecture]] and [[ai-agent-spec]]. The specifications are the completeness model. Examples: an event stream exists with no AI consumer (violates architecture spec §12.1), a touchpoint has no declared AI policy (violates §10.1), an agent has zero behavioral evals (violates agent spec §6.3), two agents implement the same capability independently (violates agent spec §4.1). The evolution agent MUST surface gap detections as change proposals. Gap detections are medium urgency: the system is operating without a capability the specifications require.

  3. Opportunity: A change could improve the system beyond its current baseline. Detection requires the system to identify alternatives: a model with better cost-accuracy ratio (per the model selection dimensions in [[ai-native-architecture]] §5.3), a prompt variant that outperforms the current version in A/B comparison, a tool whose invocation pattern could be upgraded (e.g., runtime-native → API tool). The evolution agent SHOULD surface opportunity detections. Opportunities are the lowest urgency: the system is working correctly but could work better.
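
The first two detection classes can be sketched as simple scans over calibration scores and stream registrations. This is a minimal illustration, not the specification's method: the `Finding` type, the `URGENCY` ranks, the helper names, and the sample scores and streams are all hypothetical, and opportunity detection is left out because it depends on comparing alternatives.

```python
from dataclasses import dataclass

# Hypothetical urgency ranks: lower value = handled first
# (degradation before gap before opportunity, as described above).
URGENCY = {"degradation": 0, "gap": 1, "opportunity": 2}


@dataclass(frozen=True)
class Finding:
    detection_class: str  # "degradation", "gap", or "opportunity"
    subject: str          # feature, stream, or agent the finding is about
    detail: str


def detect_degradation(scores: dict[str, float], rollback_threshold: float) -> list[Finding]:
    """Flag every feature whose calibration score fell below the threshold."""
    return [
        Finding("degradation", kind, f"score {score:.2f} < {rollback_threshold:.2f}")
        for kind, score in sorted(scores.items())
        if score < rollback_threshold
    ]


def detect_gaps(event_streams: dict[str, list[str]]) -> list[Finding]:
    """Flag event streams with no AI consumer (absence detection)."""
    return [
        Finding("gap", stream, "event stream has no AI consumer")
        for stream, consumers in sorted(event_streams.items())
        if not consumers
    ]


def by_urgency(findings: list[Finding]) -> list[Finding]:
    """Order findings so that degradations come first."""
    return sorted(findings, key=lambda f: URGENCY[f.detection_class])


# Sample inputs are made up for illustration.
findings = by_urgency(
    detect_gaps({"orders": ["scoring_agent"], "refunds": []})
    + detect_degradation({"lead_score": 0.45, "intent": 0.91}, rollback_threshold=0.70)
)
```

Each `Finding` would then be turned into a change proposal; the sketch stops at ordering the findings.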

3.2 Detection Frequency

The evolution agent MUST run detection at a frequency appropriate to each class:

A detection cycle that produces no findings MUST still record a row in the AI decision ledger with kind = "evolution_detection" and findings = 0. This proves the detection ran — silence is not evidence of health.

3.3 Detection Output

Every detection MUST produce a structured change proposal or a null record. The change proposal format is defined in §4. A null record (no findings) MUST carry the detection class, the window it scanned, and a timestamp. The system MUST be able to answer: "when was the last time any detection class produced a finding?"
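
A detection cycle's ledger row, including the null record, might look like the sketch below. The `kind` and `findings` fields come from §3.2; the other field names (`detection_class`, `window_start`, `window_end`, `proposal_ids`) and the in-memory list standing in for the ledger are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def detection_record(detection_class, window_start, window_end, proposal_ids):
    """Build the ledger row for one detection cycle, with or without findings."""
    return {
        "kind": "evolution_detection",
        "detection_class": detection_class,
        "window_start": window_start.isoformat(),
        "window_end": window_end.isoformat(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "findings": len(proposal_ids),       # 0 marks a null record
        "proposal_ids": list(proposal_ids),  # hypothetical link to the proposals raised
    }


def last_finding_at(ledger):
    """Answer: when did any detection class last produce a finding?"""
    hits = [row["timestamp"] for row in ledger
            if row["kind"] == "evolution_detection" and row["findings"] > 0]
    return max(hits, key=datetime.fromisoformat) if hits else None


window_end = datetime(2026, 5, 24, tzinfo=timezone.utc)
window_start = window_end - timedelta(days=7)
ledger = [
    detection_record("degradation", window_start, window_end, []),    # null record
    detection_record("gap", window_start, window_end, ["cp-001"]),   # one finding
]
```

Because the null record is written unconditionally, a missing row for a window means that detection did not run, which is the distinction §3.2 is after.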

4. Change Proposal

4.1 Change Proposal Format

A change proposal is a structured document that describes a proposed self-modification. Every change proposal MUST carry:

4.2 Proposal Lifecycle

Every change proposal MUST conform to the finite state machine defined in §8. The prose summary:

proposed → evaluating → approved | rejected | expired → deploying → deployed | rolled_back, with post-deploy paths through degraded and rolling_back.

Every state transition MUST be recorded in the AI decision ledger with kind = "evolution_proposal", carrying the proposal_id, from_state, to_state, and timestamp. The transition from approved to deploying for autonomous-tier changes MUST be recorded with autonomy = true. The transition from approved to deploying for reviewed-tier changes MUST carry the reviewer’s identity. Transitions that carry a rejection or rollback reason MUST include reason in the ledger record.
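
The recording rules above can be sketched as a single record builder. The field names follow the paragraph above; the `REASON_REQUIRED` set, which decides when a reason is mandatory, is an assumption of this sketch, as is the `reviewer` string format.

```python
from datetime import datetime, timezone

# Assumption: transitions into these states are the ones that carry a
# rejection or rollback reason.
REASON_REQUIRED = frozenset({"rejected", "rolling_back", "rolled_back"})


def transition_record(proposal_id, from_state, to_state, *,
                      autonomous=False, reviewer=None, reason=None):
    """Build one ledger row for a proposal state transition."""
    record = {
        "kind": "evolution_proposal",
        "proposal_id": proposal_id,
        "from_state": from_state,
        "to_state": to_state,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    if (from_state, to_state) == ("approved", "deploying"):
        if reviewer is not None:
            record["reviewer"] = reviewer   # reviewed-tier change
        elif autonomous:
            record["autonomy"] = True       # autonomous-tier change
        else:
            raise ValueError("approved -> deploying needs autonomy or a reviewer")
    if to_state in REASON_REQUIRED:
        if not reason:
            raise ValueError(f"transition to {to_state} needs a reason")
        record["reason"] = reason
    return record


auto = transition_record("cp-001", "approved", "deploying", autonomous=True)
reviewed = transition_record("cp-002", "approved", "deploying", reviewer="human:alice")
rejected = transition_record("cp-003", "evaluating", "rejected", reason="judge score below threshold")
```

Raising on a missing reviewer or reason keeps incomplete rows out of the ledger instead of relying on a later audit to find them.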

5. The Eval-Gate

5.1 Eval-Gate Mechanics

Every self-modification that reaches production MUST pass through the eval-gate. The eval-gate is not optional — it is the architectural enforcement of "do no harm." The evolution agent MUST:

  1. Before deploying any change, run the eval suite declared in the change proposal against the proposed change.

  2. Compare eval results against the baseline eval results for the current production version. If no baseline exists (the prompt is new, not a version change), the eval-gate MUST record this fact and proceed. The first version of a prompt establishes the baseline. Subsequent versions MUST be regression-tested against it.

  3. Block deployment if any structural assertion fails, any safety lexical assertion fails, or the LLM judge score drops below the threshold defined in the conformance profile.

  4. Record the eval-gate result in the AI decision ledger with kind = "evolution_eval_gate", carrying the proposal_id, eval_ids run, pass/fail counts per assertion type, and the gate decision (pass | block).

A change that bypasses the eval-gate is non-conformant. The eval-gate is enforced by the system, not by the evolution agent’s discretion.
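
The gate decision in steps 2 through 4 can be sketched as a pure function over eval results. The `EvalResults` shape, the record field names beyond `kind`, `proposal_id`, `eval_ids`, and `gate_decision`, and the sample numbers are assumptions; running the evals themselves is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalResults:
    structural_passed: int
    structural_failed: int
    safety_passed: int
    safety_failed: int
    judge_score: float


def eval_gate(proposal_id, eval_ids, candidate: EvalResults,
              baseline: Optional[EvalResults], judge_threshold: float) -> dict:
    """Decide pass or block and return the ledger row for the decision."""
    blocked = (
        candidate.structural_failed > 0
        or candidate.safety_failed > 0
        or candidate.judge_score < judge_threshold
    )
    return {
        "kind": "evolution_eval_gate",
        "proposal_id": proposal_id,
        "eval_ids": list(eval_ids),
        "structural": {"pass": candidate.structural_passed, "fail": candidate.structural_failed},
        "safety_lexical": {"pass": candidate.safety_passed, "fail": candidate.safety_failed},
        "judge_score": candidate.judge_score,
        # A missing baseline is recorded, not treated as a failure.
        "baseline_present": baseline is not None,
        "judge_delta": None if baseline is None
        else candidate.judge_score - baseline.judge_score,
        "gate_decision": "block" if blocked else "pass",
    }


production = EvalResults(10, 0, 4, 0, judge_score=0.81)
better = eval_gate("cp-001", ["e1", "e2"], EvalResults(10, 0, 4, 0, 0.86), production, 0.80)
unsafe = eval_gate("cp-002", ["e1", "e2"], EvalResults(10, 0, 3, 1, 0.95), production, 0.80)
first = eval_gate("cp-003", ["e1"], EvalResults(5, 0, 2, 0, 0.90), None, 0.80)
```

Note that `unsafe` is blocked despite the higher judge score: a single failed safety assertion is enough, which matches step 3.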

5.2 Eval-Gate Duration

The eval-gate for prompt changes MUST complete within a time window defined in the conformance profile (RECOMMENDED: 5 minutes). If the eval-gate does not complete within the window, the change proposal transitions to expired. This prevents eval-gate stalls from blocking the evolution pipeline.

For tool and agent changes, the eval-gate window MAY be longer (RECOMMENDED: 15 minutes) because the eval suite is larger.
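
One way to enforce the window is a deadline checked between evals, as in the sketch below. The window values are the RECOMMENDED figures above; the function name, the return shape, and the injectable `clock` are hypothetical. This version only checks the deadline between evals, so it does not interrupt an eval that is already running.

```python
import time

# RECOMMENDED windows from the text, in seconds; a profile may override them.
GATE_WINDOW_SECONDS = {"prompt": 5 * 60, "tool": 15 * 60, "agent": 15 * 60}


def run_evals_within_window(change_type, evals, clock=time.monotonic):
    """Run eval callables in order; expire the proposal if the window runs out."""
    deadline = clock() + GATE_WINDOW_SECONDS[change_type]
    results = []

    def expired():
        return {"state": "expired", "expiry_reason": "eval_gate_timeout",
                "results": results}

    for run_eval in evals:
        if clock() >= deadline:
            return expired()          # remaining evals are never started
        results.append(run_eval())
    if clock() > deadline:
        return expired()              # the last eval finished too late
    return {"state": "evaluating", "expiry_reason": None, "results": results}


def fake_clock(ticks):
    """Deterministic clock for demonstration: returns the given readings in order."""
    readings = iter(ticks)
    return lambda: next(readings)


evals = [lambda: "e1 ok", lambda: "e2 ok"]
slow_prompt = run_evals_within_window("prompt", evals, clock=fake_clock([0, 100, 400]))
tool_change = run_evals_within_window("tool", evals, clock=fake_clock([0, 100, 400, 600]))
```

With the same simulated timings, the prompt change expires after one eval while the tool change completes, because only the window differs.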

6. Graduated Autonomy

6.1 Autonomy Tiers

Self-modifications are classified by risk. The risk tier determines the approval path. The classification MUST follow this model:

  1. Prompt change (lowest risk): A modification to an agent’s prompt text. The eval-gate verifies behavioral correctness. An autonomous prompt change that passes the eval-gate MAY be deployed without human review. If the eval-gate blocks, the change is rejected automatically. If the prompt serves an action kind classified as Tier 1 (Irreversible) per [[ai-native-architecture]] §8.1, the eval-gate MUST include safety lexical assertions and MUST run with a judge model whose calibration is verified. A prompt change that affects Tier 1 action paths and fails any safety assertion MUST be blocked regardless of other eval results.

  2. Tool or model change (medium risk): A modification to a tool’s implementation, a new tool registration, a tool deprecation, or a change to the model routing for an action kind. The change MUST pass the eval-gate. Additionally, the change MUST be reviewed, either by a human (Propose-Approve-Execute) or by an orchestrator agent whose calibration score for the relevant change type exceeds a threshold defined in the conformance profile. A tool or model change that bypasses review is non-conformant.

  3. Agent creation (high risk): Defining a new agent — identity, tool_set, prompt_set, eval suite. MUST flow through Propose-Approve-Execute (per [[ai-native-architecture]] §7.1). A human MUST approve. No autonomous path is permitted.

  4. Architecture change (highest risk): Any change that modifies the system’s safety gates, risk gradient, provider boundaries, or capability-based access control. MUST require human approval. The evolution agent MUST NOT propose architecture changes autonomously — it MAY surface architecture gaps as informational alerts, but the proposal MUST originate from a human.
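
The four tiers reduce to a lookup from change type to approval path plus a few guards. The sketch below is illustrative only: the `Approval` enum, the change-type strings, and the `approver` and `proposed_by` labels are hypothetical, and it omits the extra safety-assertion rule for prompts on Tier 1 action paths and the orchestrator's calibration threshold.

```python
from enum import Enum


class Approval(Enum):
    AUTONOMOUS = "autonomous"              # eval-gate only
    REVIEWED = "human_or_orchestrator"     # eval-gate plus review
    HUMAN = "human_only"                   # eval-gate plus human approval


APPROVAL_PATH = {
    "prompt": Approval.AUTONOMOUS,
    "tool": Approval.REVIEWED,
    "model": Approval.REVIEWED,
    "agent": Approval.HUMAN,
    "architecture": Approval.HUMAN,
}


def may_deploy(change_type, gate_passed, approver=None, proposed_by="evolution_agent"):
    """Return True when the change has cleared every gate its tier requires."""
    if change_type == "architecture" and proposed_by != "human":
        return False   # architecture proposals must originate from a human
    if not gate_passed:
        return False   # no tier can skip the eval-gate
    path = APPROVAL_PATH[change_type]
    if path is Approval.AUTONOMOUS:
        return True
    if path is Approval.REVIEWED:
        return approver in {"human", "orchestrator"}
    return approver == "human"
```

For example, `may_deploy("tool", True, approver="orchestrator")` holds, while `may_deploy("agent", True, approver="orchestrator")` does not, since agent creation has no non-human approval path.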

6.2 Proportionality

The autonomy granted to the evolution agent MUST be proportional to its calibration track record. A change is successful if it survives its observation window without triggering rollback and its post-observation calibration score is at or above the score that triggered its creation. An evolution agent whose successful-change rate for a given change type drops below a threshold defined in the conformance profile MUST have its autonomy reduced for that change type. An evolution agent that has never deployed a successful prompt change MUST NOT be granted autonomy over tool changes.

Autonomy grants are not permanent. The calibration system MUST periodically recompute the evolution agent’s autonomy level based on its track record of successful vs. rolled-back changes.
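
A periodic recomputation of autonomy grants could look like the following sketch. The track-record shape (a list of booleans per change type, `True` meaning the change survived its observation window) and the single `min_rate` threshold are assumptions; a conformance profile could define one threshold per change type.

```python
def success_rate(outcomes):
    """Share of changes that survived observation without rollback."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def autonomy_grants(track_record, min_rate):
    """Recompute, per change type, whether the evolution agent keeps autonomy."""
    grants = {
        change_type: bool(outcomes) and success_rate(outcomes) >= min_rate
        for change_type, outcomes in track_record.items()
    }
    # No autonomy over tool changes without at least one successful prompt change.
    if not any(track_record.get("prompt", [])):
        grants["tool"] = False
    return grants


healthy = autonomy_grants({"prompt": [True, True, False, True], "tool": [True]}, min_rate=0.7)
unproven = autonomy_grants({"prompt": [False], "tool": [True, True]}, min_rate=0.5)
```

In the second call the tool record is perfect, yet tool autonomy is withheld because no prompt change has succeeded.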

6.3 Autonomy Records

Every autonomous action taken by the evolution agent MUST be recorded in the AI decision ledger with kind = "evolution_autonomous_action", carrying the proposal_id, the autonomy tier, the calibration score that authorized the action, and the outcome. The system MUST be able to answer: "what autonomous changes did the evolution agent make this week, and what was the rollback rate?"
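
The weekly question can be answered with a filter over ledger rows. The sketch below assumes the ledger is available as a list of dicts with an ISO 8601 `timestamp` and an `outcome` field whose value is `"rolled_back"` for reverted changes; both the field names and the outcome vocabulary are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def weekly_autonomy_summary(ledger, now):
    """Count autonomous actions in the last 7 days and their rollback rate."""
    since = now - timedelta(days=7)
    actions = [
        row for row in ledger
        if row["kind"] == "evolution_autonomous_action"
        and datetime.fromisoformat(row["timestamp"]) >= since
    ]
    rolled_back = sum(1 for row in actions if row["outcome"] == "rolled_back")
    return {
        "changes": [row["proposal_id"] for row in actions],
        "rolled_back": rolled_back,
        "rollback_rate": rolled_back / len(actions) if actions else 0.0,
    }


now = datetime(2026, 5, 24, tzinfo=timezone.utc)


def action(proposal_id, days_ago, outcome):
    """Build a sample ledger row (illustrative data only)."""
    return {
        "kind": "evolution_autonomous_action",
        "proposal_id": proposal_id,
        "autonomy_tier": "prompt",
        "timestamp": (now - timedelta(days=days_ago)).isoformat(),
        "outcome": outcome,
    }


ledger = [
    action("cp-001", 1, "stable"),
    action("cp-002", 3, "rolled_back"),
    action("cp-003", 10, "rolled_back"),   # outside the 7-day window
]
summary = weekly_autonomy_summary(ledger, now)
```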

7. Rollback Guarantee

7.1 Rollback Mechanics

Every self-modification MUST have a defined, executable rollback path. The rollback plan is part of the change proposal (§4.1). The rollback path MUST be:

7.2 Rollback Triggers

The system MUST automatically trigger rollback when any of the following conditions are met:

The system MUST NOT require the evolution agent to initiate its own rollback. The calibration system — not the evolution agent — is the rollback authority. This separation of concerns prevents a compromised or incorrect evolution agent from blocking its own rollback.

7.3 Observation Window

Every deployed self-modification enters an observation window after deployment. During this window, calibration data is monitored at an accelerated frequency (RECOMMENDED: daily, compared to the standard weekly calibration cycle). If the change survives the observation window without triggering rollback, it is promoted to stable. A stable change is still subject to ongoing calibration monitoring but at the standard frequency.

The observation window duration MUST be defined in the conformance profile. A typical starting point is 7 days for prompt changes, 14 days for tool changes.
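
The window and its two monitoring cadences can be sketched as follows. The durations are the starting points quoted above and the cadences are the RECOMMENDED values; the function names and the phase labels `"observing"` and `"stable"` are hypothetical, and the sketch assumes no rollback has been triggered.

```python
from datetime import datetime, timedelta, timezone

# Starting points quoted in the text; a real profile would declare its own.
OBSERVATION_DAYS = {"prompt": 7, "tool": 14}
# Recommended cadence: daily while observing, weekly once stable.
CHECK_INTERVAL = {"observing": timedelta(days=1), "stable": timedelta(days=7)}


def observation_phase(change_type, deployed_at, now):
    """Return 'observing' inside the window and 'stable' once it has elapsed."""
    window = timedelta(days=OBSERVATION_DAYS[change_type])
    return "stable" if now - deployed_at >= window else "observing"


def next_calibration_check(change_type, deployed_at, last_check, now):
    """Schedule the next calibration check at the cadence for the current phase."""
    return last_check + CHECK_INTERVAL[observation_phase(change_type, deployed_at, now)]


deployed_at = datetime(2026, 5, 24, tzinfo=timezone.utc)
day3 = deployed_at + timedelta(days=3)
day7 = deployed_at + timedelta(days=7)
```

At `day7` a prompt change is already stable while a tool change deployed at the same time is still under daily observation.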

7.4 Evolution-Specific Security Considerations

This subsection supplements the threat models in [[ai-native-architecture]] §9.5 and [[ai-agent-spec]] §8.3 with threats specific to self-evolution.

8. Change Proposal State Machine

The change proposal lifecycle (§4.2) is normatively defined as a finite state machine. A conformant implementation MUST treat the transition table in §8.3 as authoritative — prose descriptions elsewhere in this specification MUST NOT contradict it. Conformance to the state machine is verifiable: every recorded transition either appears in §8.3 or the implementation is non-conformant.

8.1 States

A change proposal MUST be in exactly one proposal state at any time. The ten states and their meanings:

State | Description | Terminal
--- | --- | ---
proposed | The evolution agent created the proposal per §4.1. The proposal carries a valid eval_suite, rollback_plan, and ttl. Evals have not yet started. | No
evaluating | The eval-gate (§5) is running the declared eval_suite against the proposed change. | No
approved | All evals passed (§5.1). For tiers requiring review (§6.1), human or orchestrator approval has been obtained. The proposal is eligible to deploy. | No
rejected | Evals failed, a reviewer rejected the proposal, or the eval-gate blocked deployment. Carries a rejection_reason. | Yes
expired | The proposal’s ttl elapsed before deployment completed. No further action is permitted. | Yes
deploying | The change is being applied to the system. Post-deploy verification (§5.1, §7.1) has not yet completed. | No
deployed | The change is live. The observation window (§7.3) is active and calibration monitoring has begun. | No
degraded | Post-deployment calibration (§7.2) detected accuracy below the rollback threshold or error rate above threshold for the changed feature. | No
rolling_back | The rollback plan (§4.1, §7.1) is executing. The calibration system, not the evolution agent, initiated rollback (§7.2). | No
rolled_back | The change was reverted. Rollback completed and verified. Carries a rollback_reason. | Yes

8.2 State Diagram

This subsection is informative.

The normative transition table is §8.3. The diagram below summarizes the primary paths:

                  ┌──────────┐
                  │ proposed │
                  └────┬─────┘
                       │ start eval (§5)
                       ▼
                ┌─────────────┐
       ┌───────│ evaluating  │───────┐
       │       └──────┬──────┘       │
  eval fail      eval pass      TTL / timeout
       │              │              │
       ▼              ▼              ▼
  ┌─────────┐   ┌──────────┐   ┌─────────┐
  │rejected │   │ approved │   │ expired │ (terminal)
  └─────────┘   └────┬─────┘   └─────────┘
                     │ deploy (§6)
                     ▼
                ┌───────────┐
                │ deploying │
                └─────┬─────┘
          verify pass │ verify fail / mid-deploy TTL
                      ▼              │
                ┌──────────┐         │
                │ deployed │         ▼
                └────┬─────┘   ┌──────────────┐
         calibration│         │rolling_back  │
            degrade ▼         └──────┬───────┘
                ┌──────────┐         │
                │ degraded │─────────┘
                └──────────┘
                      │
                      ▼
                ┌─────────────┐
                │ rolled_back │ (terminal)
                └─────────────┘

Any non-terminal state ──TTL elapsed──▶ expired

8.3 Transition Table

Every state transition MUST satisfy exactly one row in this table. Transitions not listed here MUST NOT occur. The Guard column defines preconditions that MUST hold before the transition. The Action column defines side effects that MUST occur atomically with the transition.

From | To | Guard | Action
--- | --- | --- | ---
proposed | evaluating | Proposal created with all required fields (§4.1). | MUST schedule and run the declared eval_suite (§5.1). MUST record transition with trigger = "creation".
proposed | expired | ttl elapsed before eval started. | MUST record expiry_reason = "ttl_before_eval".
evaluating | approved | All evals in eval_suite passed (§5.1). No structural, safety lexical, or judge-score regressions. For tiers requiring review (§6.1): human or orchestrator approval obtained. | MUST record eval-gate result (§5.1). MUST record reviewer identity when review was required.
evaluating | rejected | Any eval failed, any safety assertion failed, or judge score below threshold (§5.1). | MUST record rejection_reason with failed eval_ids and assertion details. MUST record eval-gate result with gate_decision = block.
evaluating | expired | ttl elapsed, or eval-gate duration exceeded (§5.2). | MUST abort in-flight evals. MUST record expiry_reason = "ttl_during_eval" or expiry_reason = "eval_gate_timeout".
approved | deploying | rollback_plan is valid and executable (§4.1, §7.1). Change is within the evolution agent’s current autonomy grant for its change_type (§6.2). For reviewed tiers: reviewer approval on record. | MUST verify rollback plan before applying change. MUST record autonomy = true for autonomous-tier deployments or reviewer identity for reviewed-tier deployments.
approved | rejected | Human or orchestrator reviewer rejects during the review window (§6.1). | MUST record rejection_reason and reviewer identity.
approved | expired | ttl elapsed before deployment started. | MUST record expiry_reason = "ttl_before_deploy".
deploying | deployed | Change applied successfully. Post-deploy verification passed: eval-gate spot-checks pass and affected components report healthy (§5.1). | MUST start the observation window (§7.3). MUST record deployment timestamp and version identifiers.
deploying | rolling_back | Post-deploy verification failed, partial deploy detected, or ttl elapsed mid-deploy. | MUST execute the rollback_plan (§7.1). MUST record rollback_reason. MUST NOT leave the system in a partially-applied state.
deploying | expired | ttl elapsed and rollback is not applicable (proposal never reached production). | SHOULD transition through rolling_back if any partial state exists. MUST record expiry_reason = "ttl_mid_deploy".
deployed | degraded | Calibration detects accuracy below the rollback threshold or error rate above threshold within the observation window (§7.2, §7.3). | MUST record calibration data that triggered degradation. MUST NOT require the evolution agent to acknowledge degradation before rollback proceeds.
deployed | rolling_back | Human manually triggers rollback (§7.2), or rollback trigger other than calibration degradation fires (error rate threshold). | MUST execute the rollback_plan. MUST record rollback_reason and initiator identity.
degraded | rolling_back | Automatic rollback trigger confirmed (§7.2). Autonomy tier permits automatic rollback for this change_type (§6.1); architecture-tier changes MUST require human confirmation before rollback executes. | MUST execute the rollback_plan without human intervention (except architecture-tier). MUST record rollback_reason = "calibration_degradation".
rolling_back | rolled_back | Rollback plan completed. Previous version confirmed active for all affected components (§7.1). | MUST verify rollback success. MUST record rollback duration. MUST update evolution agent autonomy proportionality inputs (§6.2).
rolling_back | deployed | Rollback failed but production is still serving the previous version, with no user-visible regression (§8.4). | SHOULD alert a human. MUST record rollback_failure_reason. MUST NOT transition to rolled_back until rollback is verified.
TTL enforcement (any non-terminal state) | | |
any non-terminal | expired | ttl elapsed and no more specific transition applies. | MUST enforce TTL on every state. If the proposal is in deploying or deployed, MUST attempt rollback before or concurrent with expiry per §8.4.
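
The From and To columns of the table can be encoded as a set of allowed pairs, which makes both enforcement and the state machine audit in §2.2 a membership test. This sketch covers only which transitions exist; it does not evaluate guards or perform actions, and the names `SPECIFIC`, `ALLOWED`, `is_allowed`, and `audit` are hypothetical.

```python
TERMINAL = frozenset({"rejected", "expired", "rolled_back"})
NON_TERMINAL = frozenset({
    "proposed", "evaluating", "approved", "deploying",
    "deployed", "degraded", "rolling_back",
})

# One pair per specific row of the transition table.
SPECIFIC = frozenset({
    ("proposed", "evaluating"), ("proposed", "expired"),
    ("evaluating", "approved"), ("evaluating", "rejected"), ("evaluating", "expired"),
    ("approved", "deploying"), ("approved", "rejected"), ("approved", "expired"),
    ("deploying", "deployed"), ("deploying", "rolling_back"), ("deploying", "expired"),
    ("deployed", "degraded"), ("deployed", "rolling_back"),
    ("degraded", "rolling_back"),
    ("rolling_back", "rolled_back"), ("rolling_back", "deployed"),
})

# The catch-all TTL row: any non-terminal state may expire.
ALLOWED = SPECIFIC | {(state, "expired") for state in NON_TERMINAL}


def is_allowed(from_state, to_state):
    """True when the table lists a row for this (from, to) pair."""
    return (from_state, to_state) in ALLOWED


def audit(history):
    """Return the recorded (from, to) pairs that the table does not list."""
    return [pair for pair in history if pair not in ALLOWED]


recorded = [("proposed", "evaluating"), ("evaluating", "approved"),
            ("approved", "deploying"), ("deployed", "approved")]
violations = audit(recorded)
```

Here `violations` contains only `("deployed", "approved")`, the one recorded transition with no row in the table.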

8.4 Invariants

A conformant implementation MUST maintain these invariants across all proposals:

  1. Single state: A proposal MUST occupy exactly one state at any time. Concurrent transitions on the same proposal_id MUST be serialized — the second transition MUST fail or block until the first completes.

  2. Terminal immutability: Proposals in rejected, expired, or rolled_back MUST NOT transition to any other state. A new proposal MUST be created to retry a failed change.

  3. TTL monotonicity: Once a proposal transitions to expired, no further action on that proposal_id is permitted regardless of subsequent clock adjustments.

  4. Rollback authority: Transitions to rolling_back from deployed or degraded MUST be initiated by the calibration system or an authorized human reviewer — not by the evolution agent that proposed the change (§7.2).

  5. Failed rollback: If rollback execution fails but the previous production version remains active, the proposal MUST remain in rolling_back or revert to deployed with an alert — it MUST NOT transition to rolled_back until rollback is verified.

  6. Ledger completeness: Every transition in §8.3 MUST produce exactly one AI decision ledger record with kind = "evolution_proposal" (§4.2).
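
Invariants 1, 2, and 6 can be sketched with a lock and a compare-and-set on the current state. The `Proposal` class, its `expected_from` argument, and the in-memory ledger list are assumptions for illustration; the sketch does not check the transition table or TTLs, and a multi-process deployment would need a shared lock or a transactional store instead of `threading.Lock`.

```python
import threading
from datetime import datetime, timezone

TERMINAL = frozenset({"rejected", "expired", "rolled_back"})


class Proposal:
    def __init__(self, proposal_id, ledger):
        self.proposal_id = proposal_id
        self.state = "proposed"
        self._ledger = ledger
        self._lock = threading.Lock()   # serializes transitions (invariant 1)

    def transition(self, expected_from, to_state):
        """Move to to_state only if the proposal is still in expected_from."""
        with self._lock:
            if self.state in TERMINAL:
                # Invariant 2: terminal states have no outbound transitions.
                raise RuntimeError(f"{self.proposal_id} is terminal ({self.state})")
            if self.state != expected_from:
                # A concurrent caller got here first; this transition fails.
                raise RuntimeError(f"stale transition from {expected_from}")
            # Invariant 6: exactly one ledger record per transition.
            self._ledger.append({
                "kind": "evolution_proposal",
                "proposal_id": self.proposal_id,
                "from_state": self.state,
                "to_state": to_state,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
            self.state = to_state


ledger = []
proposal = Proposal("cp-001", ledger)
proposal.transition("proposed", "evaluating")
proposal.transition("evaluating", "rejected")
```

Retrying the rejected change means constructing a new `Proposal` with a new identifier, as invariant 2 requires.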

8.5 Stateless Agent Compatibility

Each transition SHOULD be implementable by a stateless agent that receives only the current proposal state and a transition payload. The payload MUST carry sufficient context for the guard to evaluate:

The evolution agent MAY delegate each transition to a distinct sub-agent (per [[ai-agent-spec]]). The state machine MUST NOT require agents to hold session history beyond the current state and payload — all prior transitions MUST be recoverable from the decision ledger.
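
A stateless transition can be sketched as a pure function of the current state and a payload, with the current state itself recovered by replaying the ledger. The payload keys used here (`ttl_elapsed`, `gate_timed_out`, `gate_decision`, `review_required`, `reviewer`) are hypothetical, not the specification's payload definition, and only the transitions out of evaluating are shown.

```python
def current_state(ledger, proposal_id):
    """Recover a proposal's state by replaying its ledger rows in order."""
    state = "proposed"   # assumption: a proposal with no transitions is still proposed
    for row in ledger:
        if row["kind"] == "evolution_proposal" and row["proposal_id"] == proposal_id:
            if row["from_state"] != state:
                raise ValueError(f"ledger gap before {row['to_state']}")
            state = row["to_state"]
    return state


def step_evaluating(state, payload):
    """Pure transition out of 'evaluating': no session history is consulted."""
    if state != "evaluating":
        raise ValueError(f"expected evaluating, got {state}")
    if payload["ttl_elapsed"] or payload["gate_timed_out"]:
        return "expired"
    if payload["gate_decision"] == "block":
        return "rejected"
    if payload["review_required"] and not payload["reviewer"]:
        return "evaluating"   # evals passed, still waiting for review
    return "approved"


ledger = [
    {"kind": "evolution_proposal", "proposal_id": "cp-001",
     "from_state": "proposed", "to_state": "evaluating"},
]
state = current_state(ledger, "cp-001")
payload = {"ttl_elapsed": False, "gate_timed_out": False,
           "gate_decision": "pass", "review_required": True, "reviewer": None}
waiting = step_evaluating(state, payload)
approved = step_evaluating(state, {**payload, "reviewer": "human:alice"})
```

Because `step_evaluating` reads nothing but its arguments, a different sub-agent could handle each call, provided it first rebuilds the state with `current_state`.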

9. Anti-Pattern Catalog

This section is informative.

Anti-Pattern | Description | Normative Reference
--- | --- | ---
Calibration Stall | Calibration reports are generated, read by a human, and never re-enter the system as changes | §3 Gap Detection
Blind Deployment | A self-modification reaches production without passing through the eval-gate | §5 The Eval-Gate
Rollback Orphan | A deployed change has no executable rollback path | §7 Rollback Guarantee
Autonomy Creep | An evolution agent with a weak calibration track record retains full autonomy | §6.2 Proportionality
Silent Detection | Detection cycles run but produce no records, so the system cannot prove detection ran | §3.2 Detection Frequency
Eval-Gate Bypass | The evolution agent deploys a change and runs evals afterward, rationalizing failures | §5.1 Eval-Gate Mechanics
Self-Rollback Conflict | The evolution agent is responsible for rolling back its own changes | §7.2 Rollback Triggers
Architecture Self-Mod | The evolution agent proposes or deploys architecture changes autonomously | §6.1 Autonomy Tiers
Invalid Transition | A change proposal transitions between states not listed in the normative transition table, e.g., rejected → deploying or deployed → approved | §8.3 Transition Table
Orphan Deploy | A proposal reaches deploying without a verified rollback_plan or outside the evolution agent’s autonomy grant | §8.3 Transition Table
TTL Ghost | A proposal exceeds its ttl but remains in a non-terminal state, so the pipeline stalls on stale proposals | §8.4 Invariants

10. Migration Path

This section is informative.

10.1 Level 1: Gap Detection

  1. Wire the evolution agent to poll calibration data at least weekly

  2. Implement all three detection classes: degradation, gap, opportunity

  3. Ensure every detection cycle produces a record (including null records)

  4. Surface detections as structured change proposals

10.2 Level 2: Eval-Gated Change

  1. Implement the eval-gate: run behavioral evals before any prompt change deployment

  2. Wire eval-gate results to the decision ledger

  3. Implement automatic rollback on eval-gate failure

  4. Deploy observation windows with accelerated calibration monitoring

10.3 Level 3: Graduated Autonomy

  1. Classify all self-modifications by risk tier

  2. Implement tier-appropriate approval paths

  3. Implement autonomy proportionality based on calibration track record

  4. Implement separation of concerns: calibration system is rollback authority

  5. Implement the change proposal state machine (§8) with ledger-backed transition history

Appendix A. EC CRM Evolution Profile

This appendix is informative.

The EC CRM currently operates at calibration stall. The calibration pipeline generates reports, posts them to Discord, and stops. No agent reads the reports. No change proposals are generated. Nothing bridges the gap between "the system knows lead_score accuracy is 45%" and "the system does something about it": the number of calibration findings that re-enter the system as changes is exactly zero.

The closest existing capability is the evals.py harness for Foundry agents — 18 behavioral eval cases that run after deploy. This is the eval-gate primitive, but it operates on human-initiated deploys, not autonomous change proposals.

The fastest path to Level 1: create an evolution agent definition, grant it read access to calibration data and the decision ledger, and wire it to surface weekly detection reports with structured change proposals. The agent exists in concept — it needs a definition, a prompt, and a tool_set.

Acknowledgments

This specification is derived from empirical analysis of the EC CRM’s calibration pipeline — a system that measures itself accurately and acts on nothing. Every anti-pattern in §9 was directly observed. Every normative requirement in §§3–8 exists because the corresponding anti-pattern represents the gap between measurement and action.

The specification format is modeled on WHATWG Living Standards and IETF RFCs. The normative language follows [rfc2119]. This specification extends [[ai-native-architecture]] and [[ai-agent-spec]].

This document is authored by Hermes. Editors: Clanka and Hermeezy. Last revised 24 May 2026. It is a living standard — subsequent passes may extend, refine, or correct it.

References

Non-Normative References

[AI-AGENT-SPEC]
Clanka; Hermeezy. AI Agent Construction and Coordination. May 2026. URL: https://github.com/clankamode/ai-native-spec/blob/main/ai-agent-spec.bs
[AI-NATIVE-ARCHITECTURE]
Clanka; Hermeezy. AI-Native Architecture. May 2026. URL: https://github.com/clankamode/ai-native-spec/blob/main/ai-native-spec.bs
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. URL: https://datatracker.ietf.org/doc/html/rfc2119
[SKILLOPT]
Microsoft. SkillOpt: Executive Strategy for Self-Evolving Agent Skills. May 2026. URL: https://arxiv.org/abs/2605.23904