AI Agent Construction and Coordination

Living Document

This version:
https://github.com/clankamode/ai-native-spec
Issue Tracking:
GitHub
Editors:
Clanka
Hermeezy

1. Abstract

This specification defines how AI agents are constructed, deployed, evaluated, and coordinated on an [[ai-native-architecture]] foundation. An AI agent is a software component that uses a language model to reason and act — and when multiple agents operate on the same system, their construction must be portable, their tools shareable, their prompts versioned, their behavior evaluated, their coordination governed, and their telemetry observable.

Every normative requirement in this specification is reverse-derived from a concrete failure observed in production multi-agent systems — no requirement exists without a corresponding failure mode.

2. Status of This Document

This is a Draft Living Standard synthesized by the Hermes agent runtime from empirical analysis of production AI-agent systems. Work began 2026-05-24. Later versions may supersede this document.

Introduction

This section is informative.

An AI-native system provides the architecture. But the agents that operate on that architecture — the runtimes that reason, the tools they invoke, the prompts that guide them, the evals that measure them — are built ad-hoc. Every team invents its own agent format. Every agent writes its own API client. Every prompt is a string in a source file. Every eval is a manual check in a chat channel.

This produces a predictable set of failure modes. Six agents share zero tool definitions — every runtime implements its own API client with different error handling, different auth, different bugs. Nine prompt builders copy-paste the same business persona with minor wording drift. Zero behavioral tests verify AI output before deployment — a green test suite proves the plumbing works but says nothing about what the agent actually says. Agents operate on the same entities with zero mutual awareness — two agents can propose conflicting replies to the same customer. Token spend is invisible — nobody knows which agent costs what.

This specification defines how to avoid those failures. It defines an agent as a portable artifact with identity, tools, prompts, and an eval suite. It defines a tool interface that makes tools shareable across runtimes. It defines a prompt lifecycle with versioning, calibration feedback, and rollback. It defines an evaluation framework that tests behavior, not plumbing. It defines coordination primitives for multi-agent systems. It defines observability requirements so AI spend is attributable and AI quality is measurable.

How to read this document. §1 defines the vocabulary. §2 defines conformance — read this to understand the three levels and the conformance profile requirement. §§3–9 are the normative body. §10 catalogs the anti-patterns that motivated every requirement. §11 is the migration path. Appendix A provides a worked example applying the spec to the EC CRM’s agent ecosystem. Appendix B provides a conformance profile template.

3. 1. Terminology

3.1. 1.1 Normative Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [rfc2119].

3.2. 1.2 Definitions

Agent

A software component that uses a language model to reason, make decisions, and take actions. An agent has a distinct identity, a declared tool surface, a prompt set, a memory configuration, and a calibration profile. An agent MAY operate autonomously (for example, on a cron schedule) or interactively (for example, in a chat session).

Agent Definition

A portable, machine-readable document that describes everything needed to instantiate and operate an agent: identity, tool surface, prompt set, memory configuration, calibration profile, and eval suite.

Tool

A capability exposed to an agent that the agent can invoke to read from or write to the system. Tools have declared schemas, error contracts, and rate limits. A tool defined once MUST be usable by any agent with the appropriate permissions.

Tool Fragmentation

A condition where multiple agent runtimes maintain independent, incompatible implementations of the same system capability. An AI-native agent ecosystem MUST NOT exhibit tool fragmentation.

Prompt Artifact

A versioned, evaluable unit of agent instruction. A prompt is not a string in a source file — it is an artifact with an identity, a version history, calibration metrics per version, and a rollback path.

Prompt Drift

A condition where multiple agents serving the same business function use divergent prompt text with no shared source of truth. An AI-native agent ecosystem MUST NOT exhibit prompt drift.

Behavioral Eval

A test that sends real input to a real model and asserts on output quality — voice, register, factual accuracy, safety constraint adherence — rather than asserting that a function was called. Plumbing tests mock the model; behavioral evals test the behavior.

Eval Inversion

A condition where a system invests more in measuring AI quality after deployment than in verifying AI quality before deployment. An AI-native agent ecosystem MUST maintain eval coverage proportional to feature deployment — you cannot deploy what you cannot test.

Agent Coordination

The protocols and primitives by which multiple agents operating on the same system avoid conflicts, resolve collisions, and delegate work. Coordination MUST be explicit; agents MUST NOT rely on implicit mutual awareness.

Coordination Blind Spot

A condition where two or more agents operate on the same entities with zero mutual awareness and no conflict detection. An AI-native agent ecosystem MUST detect and surface coordination blind spots.

Agent Observability

The telemetry every agent MUST emit: decision records (per [[ai-native-architecture]] §5.1), token usage, invocation latency, error rates, and calibration data. Observability MUST be per-agent, per-kind, and per-model.

Cost Blind Spot

A condition where AI model calls are made with zero token tracking, making per-agent cost attribution impossible. An AI-native agent ecosystem MUST NOT exhibit cost blind spots.

Agent Spec Level 1 (Portable Definition)

A conformance level where every agent has a machine-readable definition with identity, tool surface, and prompt set. Agent identity is consistent across runtimes.

Agent Spec Level 2 (Shared Tools and Versioned Prompts)

A conformance level where the system satisfies Level 1 plus a shared tool registry with at least one tool consumed by two distinct runtimes, and prompts are versioned artifacts with calibration history.

Agent Spec Level 3 (Evals, Coordination, and Observability)

A conformance level where the system satisfies Level 2 plus behavioral evals running per deploy, explicit coordination primitives for multi-agent systems, and per-agent observability with token and cost tracking.

Agent Liveness State

One of the four normative operational states in the agent liveness finite state machine (§8): active, degraded, disabled, or retired. A conformant implementation MUST enforce the transition table in §8.3 — transitions not listed there MUST NOT occur.

4. 2. Conformance

4.1. 2.1 Conformance Classes

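This specification defines three conformance classes, each building on the one before it (see §1.2 for the full definitions):

  1. Level 1 (Portable Definition): every agent has a machine-readable definition with identity, tool surface, and prompt set, and agent identity is consistent across runtimes.

  2. Level 2 (Shared Tools and Versioned Prompts): Level 1, plus a shared tool registry with at least one tool consumed by two distinct runtimes, and prompts versioned as artifacts with calibration history.

  3. Level 3 (Evals, Coordination, and Observability): Level 2, plus behavioral evals running per deploy, explicit coordination primitives for multi-agent systems, and per-agent observability with token and cost tracking.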

4.2. 2.2 Proving Conformance

A system claiming conformance to a level MUST be verifiable against the normative requirements of that level and all lower levels. Verification is performed by:

  1. Agent definition audit: Does every deployed agent have a machine-readable definition? Are agent identities consistent across all runtimes and the audit trail?

  2. Tool registry audit: How many tools are defined in the shared registry? How many distinct runtimes consume each tool? Is any system capability implemented independently by two runtimes?

  3. Prompt artifact audit: Are prompts versioned? Does each prompt version carry calibration metrics? Is there a rollback path?

  4. Eval coverage audit: Do behavioral evals exist for every deployed agent? Do they run on deploy? What is the ratio of behavioral evals to plumbing tests?

  5. Coordination audit: Do coordination primitives exist? Can the system detect when two agents operate on the same entity? Are conflicts surfaced or resolved?

  6. Observability audit: Can the system answer "how much did each agent cost last month?" Can it answer "which agent produced the most corrected outputs?"

  7. Liveness state machine audit: For every agent in the registry, does a recorded liveness state exist? Does every recorded transition appear in the normative transition table (§8.3)? Are agents in retired absent from the active registry? Are agents in disabled free of new invocations and token consumption?

A formal proof of conformance SHALL reference specific agent definitions, tool registries, prompt versions, eval results, liveness state records, and query results — not architectural intent.

4.3. 2.3 Conformance Profile

A system claiming conformance to this specification MUST produce a conformance profile — a machine-readable document (YAML or JSON) stored alongside the system’s configuration. The profile MUST declare the agent registry, shared tool registry, prompt version matrix, eval coverage, coordination policies, and observability queries. See Appendix B for the template format.

5. 3. Agent Definition

5.1. 3.1 Agent Definition Format

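Every agent deployed in an AI-native system MUST have a machine-readable agent definition. The definition MUST be stored in a version-controlled file in a structured format (YAML or JSON). The definition MUST carry the elements named in the Agent Definition entry of §1.2: identity (agent_id), tool surface (tool_set), prompt set, memory configuration, calibration profile (calibration_profile), and eval suite, together with the declared features (§6.3).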

An agent definition MUST NOT duplicate tool implementations, prompt text, or memory configuration inline. It MUST reference them by identifier.
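As a minimal sketch of the reference-by-identifier rule, the Python below models a loaded definition and rejects inline content. Only agent_id, tool_set, features, and calibration_profile are names used by this specification; prompt_ids, memory_config_ref, and the validation heuristic are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentDefinition:
    """Hypothetical in-memory form of an agent definition file."""
    agent_id: str
    tool_set: tuple            # tool_id references, resolved through the tool registry
    prompt_ids: tuple          # prompt_id references (field name is an assumption)
    memory_config_ref: str     # reference to the memory configuration (assumed name)
    calibration_profile: str   # reference to the calibration profile
    features: tuple = ()       # stable feature_id values


def validate_definition(raw: dict) -> AgentDefinition:
    """Build a definition, rejecting inline content where an identifier is expected."""
    for key in ("tool_set", "prompt_ids"):
        for ref in raw.get(key, []):
            # Assumed heuristic: an inline tool implementation or prompt body
            # shows up as a mapping or as multi-line text, not a short identifier.
            if not isinstance(ref, str) or "\n" in ref:
                raise ValueError(f"{key} must contain identifiers only, got {ref!r}")
    return AgentDefinition(
        agent_id=raw["agent_id"],
        tool_set=tuple(raw.get("tool_set", [])),
        prompt_ids=tuple(raw.get("prompt_ids", [])),
        memory_config_ref=raw["memory_config_ref"],
        calibration_profile=raw["calibration_profile"],
        features=tuple(raw.get("features", [])),
    )
```

A real implementation would more likely validate the file against a schema; the point here is only that the definition holds references, never the referenced content.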

5.2. 3.2 Identity Consistency

An agent’s identity MUST be consistent across all surfaces. If an agent is defined with a given agent_id in its definition file, every decision it writes to the AI decision ledger, every proposal it creates, and every audit trail entry it generates MUST carry that same identity.

The system MUST NOT allow an agent to present different identities to different surfaces. An agent that appears with one identity to its runtime but a different identity to the system it operates on is non-conformant.

5.3. 3.3 Agent Registry

A system with more than one agent MUST maintain an agent registry — a queryable index of all deployed agents keyed by agent_id. Each registry entry MUST include the path to the agent’s definition file and the agent’s current agent liveness state (§8). An agent runtime MUST be able to resolve an agent_id to its definition by querying the registry — no hardcoded paths, no convention-over-configuration guessing.

An agent MUST NOT be deployed without a corresponding registry entry. The registry is the single source of truth for agent identity. When an orchestrator selects an agent for a task, it queries the registry for the agent’s capabilities, not a hardcoded list.
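The registry lookup can be sketched as follows. This is an in-memory illustration, not a prescribed storage design; the class and field names (AgentRegistry, RegistryEntry, definition_path, liveness_state) are assumptions.

```python
from dataclasses import dataclass

LIVENESS_STATES = {"active", "degraded", "disabled", "retired"}


@dataclass
class RegistryEntry:
    agent_id: str
    definition_path: str   # path to the agent's definition file
    liveness_state: str    # one of the four states in §8


class AgentRegistry:
    """Hypothetical in-memory registry keyed by agent_id."""

    def __init__(self) -> None:
        self._entries = {}

    def register(self, entry: RegistryEntry) -> None:
        if entry.liveness_state not in LIVENESS_STATES:
            raise ValueError(f"unknown liveness state: {entry.liveness_state!r}")
        if entry.liveness_state == "retired":
            # Retired agents are removed from the active registry (§8).
            raise ValueError("retired agents do not belong in the active registry")
        self._entries[entry.agent_id] = entry

    def resolve(self, agent_id: str) -> RegistryEntry:
        # Runtimes resolve the definition here instead of guessing a path.
        try:
            return self._entries[agent_id]
        except KeyError:
            raise LookupError(f"agent {agent_id!r} has no registry entry") from None
```

A deploy step could call resolve() first and refuse to proceed on LookupError, which is one way to enforce the no-entry, no-deploy rule.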

6. 4. Tool Interface

6.1. 4.1 Tool Invocation Patterns

Every tool in the shared registry falls into one of three invocation patterns. The pattern determines how agents consume the tool and what the registry must declare:

  1. API tool: The tool’s implementation is an HTTP endpoint. Any agent runtime that can make HTTP requests can invoke it. The registry entry MUST declare the endpoint URL, HTTP method, and authentication mechanism. This is the most portable pattern — a tool implemented once behind an API is consumable by every runtime.

  2. Library tool: The tool’s implementation is a function in a shared library or module. Only runtimes in the same language ecosystem can invoke it directly. The registry entry MUST declare the import path, function signature, and language runtime. To make a library tool consumable by runtimes in other languages, the system SHOULD expose it through an API wrapper — at which point it becomes an API tool.

  3. Runtime-native tool: The tool’s implementation is inextricable from a specific runtime (e.g., a Wrangler D1 query, an Azure vector store search, a platform-specific SDK call). Only one runtime can invoke it. The registry entry MUST declare which runtime it requires. A runtime-native tool MUST NOT be the only path to a system capability if that capability is needed by agents on other runtimes — the system SHOULD provide an API or library equivalent.

A tool defined in the registry MUST be invocable by any agent that holds the required credentials, through at least one of these patterns. The system MUST NOT require each runtime to implement its own client for the same capability (the Tool Fragmentation anti-pattern). If a system capability is needed by agents on multiple runtimes, it MUST be exposed as an API tool or a library tool with API wrapper.
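The per-pattern declaration requirements above lend themselves to a mechanical check. The sketch below assumes hypothetical key names (pattern, endpoint_url, http_method, auth, import_path, signature, language_runtime, required_runtime); the specification names the required information but not these keys.

```python
# Required declarations per invocation pattern, following §4.1.
# All key names are illustrative assumptions.
REQUIRED_FIELDS = {
    "api": {"endpoint_url", "http_method", "auth"},
    "library": {"import_path", "signature", "language_runtime"},
    "runtime_native": {"required_runtime"},
}


def missing_fields(entry: dict) -> set:
    """Return the declarations a registry entry still lacks for its pattern."""
    pattern = entry.get("pattern")
    if pattern not in REQUIRED_FIELDS:
        raise ValueError(f"unknown invocation pattern: {pattern!r}")
    return REQUIRED_FIELDS[pattern] - set(entry)


def is_portable(entries: list) -> bool:
    """True if a capability has at least one API path.

    A library tool behind an API wrapper counts as an API tool, so a
    capability needed on several runtimes should satisfy this check.
    """
    return any(e.get("pattern") == "api" for e in entries)
```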

6.2. 4.2 The Shared Tool Registry

An AI-native agent ecosystem MUST provide a shared tool registry. Every tool available to any agent MUST be defined in the registry. The registry entry MUST carry:

The registry is not just documentation — it is the canonical source of tool truth. When a tool’s behavior changes, the registry version MUST be incremented before the change is deployed. An agent that invokes a tool at version N and later at version N+1 MUST be able to compare calibration data across versions.

6.3. 4.3 Tool Consumption

An agent consumes a tool by referencing its tool_id in the agent’s tool_set. The runtime MUST resolve the tool reference through the registry and invoke it according to its declared pattern:

When an agent invokes a tool, the invocation MUST record the tool_id, version, agent_id, input parameters, output or error, and latency. This record MUST be attributable to the invoking agent. The system MUST be able to answer: "how many times was each tool invoked per agent in the last 30 days?" and "which tool version produced the highest error rate per agent?"
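To illustrate how the two queries above could be answered from invocation records, here is a small sketch. The record fields mirror the list in the preceding paragraph; the class name, the latency unit, and the in-memory aggregation are assumptions, and a production system would likely run equivalent queries against its ledger store.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolInvocation:
    tool_id: str
    version: str
    agent_id: str
    input_params: dict
    output: Optional[object]
    error: Optional[str]      # None when the call succeeded
    latency_ms: float         # unit is an assumption
    at: datetime


def invocations_per_agent(records, now: datetime, days: int = 30) -> Counter:
    """Count invocations per (agent_id, tool_id) within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return Counter((r.agent_id, r.tool_id) for r in records if r.at >= cutoff)


def error_rate_by_version(records) -> dict:
    """Error rate per (agent_id, tool_id, version)."""
    totals, errors = Counter(), Counter()
    for r in records:
        key = (r.agent_id, r.tool_id, r.version)
        totals[key] += 1
        if r.error is not None:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}
```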

6.4. 4.4 Tool Lifecycle

Tools have a lifecycle. A tool MAY be:

When a tool is deprecated, its registry entry MUST reference the replacement tool_id. The deprecation window MUST be at least as long as the calibration window (RECOMMENDED: 30 days) to allow agents to demonstrate successful migration in calibration data.

A tool version MUST NOT be modified in place. When a tool’s behavior changes — even if the schema is unchanged — a new version MUST be published. The previous version MUST remain invocable for the duration of the deprecation window. This ensures that calibration data for version N is stable: the same version always means the same behavior.

7. 5. Prompt Lifecycle

7.1. 5.1 The Prompt Artifact

A prompt artifact is the canonical representation of an agent’s instructions. It MUST be stored as a versioned file in the agent’s definition directory. The artifact MUST carry:

A prompt that exists only as a string literal in a source file is non-conformant. The prompt MUST be an addressable, versioned artifact whose performance history is queryable through the calibration system.

7.2. 5.2 Prompt Versioning and Routing

Every change to a prompt’s instruction text MUST produce a new version. The version history MUST be queryable: for any prompt_id, the system MUST be able to answer "what versions exist, when were they deployed, and how did each version perform?"

When a new prompt version is deployed, the system MUST record the deployment in the AI decision ledger with kind = "prompt_deploy", carrying the prompt_id, version, and the diff from the previous version. The system MUST begin routing new invocations to the new version.

The system SHOULD support dual-version routing for A/B evaluation. In dual-version mode:

When a version’s accuracy drops below its rollback_threshold, the system MUST alert a human and MUST stop routing traffic to that version. The routing layer — not the artifact, not the agent — is responsible for switching traffic to the previous version. The prompt artifact defines the threshold; the routing layer enforces it. A system where a prompt version below threshold continues to receive traffic is non-conformant.
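The division of responsibility described above (the artifact declares rollback_threshold, the routing layer enforces it) might look like the following sketch. PromptRouter, report_accuracy, and the integer version numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    prompt_id: str
    version: int
    rollback_threshold: float   # declared by the prompt artifact


class PromptRouter:
    """Hypothetical routing layer that enforces the artifact's threshold."""

    def __init__(self, versions):
        self.versions = list(versions)          # deployed versions, oldest first
        self.active = len(self.versions) - 1    # index of the version receiving traffic
        self.alerts = []                        # stand-in for a human alerting channel

    def route(self) -> PromptVersion:
        if self.active is None:
            raise RuntimeError("no prompt version is eligible for traffic")
        return self.versions[self.active]

    def report_accuracy(self, accuracy: float) -> None:
        current = self.route()
        if accuracy < current.rollback_threshold:
            # Alert a human and stop routing to the failing version.
            self.alerts.append((current.prompt_id, current.version, accuracy))
            self.active = self.active - 1 if self.active > 0 else None
```

If no earlier version exists, this sketch stops routing entirely rather than keep serving a version below threshold; how a real system should behave in that case is a policy decision the sketch does not settle.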

For a working implementation of eval-gated prompt optimization with versioned artifacts and validation-gated deployment, see [[skillopt]].

7.3. 5.3 Prompt Drift Prevention

The system MUST NOT exhibit prompt drift. If multiple agents serve the same business function — for example, multiple agents that draft customer communications — they MUST share a single base prompt artifact. Individual agents MAY layer function-specific instructions on top of the shared base, but the shared base MUST be the single source of truth.

The requirement is about singularity of truth, not file count. A prompt artifact MAY be a single file or a directory of layered templates, provided the layering is explicit and the base is shared. What is non-conformant is a codebase where each agent copy-pastes the same persona text into its own prompt file with independent edits drifting over time.

The system SHOULD detect and surface prompt drift: given two agents that serve the same declared function, the system SHOULD be able to determine whether they share a base prompt or have diverged.
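A drift check of the kind described can be sketched as a grouping over agent definitions. The keys function and base_prompt_id are hypothetical; the specification requires a shared base per business function but does not name these fields.

```python
def find_prompt_drift(agents: dict) -> dict:
    """Return business functions served by more than one base prompt.

    `agents` maps agent_id to a mapping with the assumed keys
    "function" and "base_prompt_id".
    """
    bases_by_function = {}
    for definition in agents.values():
        bases_by_function.setdefault(definition["function"], set()).add(
            definition["base_prompt_id"]
        )
    # More than one base prompt for the same function indicates divergence.
    return {fn: ids for fn, ids in bases_by_function.items() if len(ids) > 1}
```

This only detects divergence at the level of declared references. Detecting copy-pasted persona text that was never declared as a base would need a textual comparison, which is outside this sketch.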

8. 6. Evaluation Framework

8.1. 6.1 Eval Definition

A behavioral eval is a test that verifies an agent’s output quality, not just its output shape. Every agent deployed at Level 3 conformance MUST have a behavioral eval suite. The number of eval cases MUST be proportional to the agent’s declared feature count — each feature declared in the agent’s definition MUST have at least one behavioral eval. An agent whose definition declares no features requires no evals; an agent with ten declared features requires at least ten evals.

Each eval case MUST carry:

The three assertion types serve different purposes. Structural assertions verify that output can be parsed — necessary but insufficient. Lexical assertions verify that specific phrases are present or absent — precise but brittle; they SHOULD be reviewed when the prompt version changes. LLM judge assertions verify qualitative properties (voice, tone, safety) that cannot be captured by schema or string matching — they are the deepest but most expensive. An eval suite that contains only structural assertions is non-conformant at Level 3.

An eval case written against a prompt version that is more than two versions behind the current deployed version MUST be flagged for review. The eval MAY still be run, but its results MUST carry a staleness warning. An eval that consistently passes against stale criteria (the behavior changed but the eval was never updated) provides false confidence.
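The staleness rule reduces to a version comparison. The sketch below assumes prompt versions are monotonically increasing integers, which this specification does not mandate; the function names and the shape of the result are illustrative.

```python
def is_stale(eval_prompt_version: int, deployed_version: int) -> bool:
    """True when the eval targets a version more than two behind the deployed one."""
    return deployed_version - eval_prompt_version > 2


def annotate_result(passed: bool, eval_prompt_version: int, deployed_version: int) -> dict:
    """Attach a staleness warning to an eval result instead of skipping the eval."""
    return {
        "passed": passed,
        "stale": is_stale(eval_prompt_version, deployed_version),
    }
```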

8.2. 6.2 Eval Execution

Behavioral evals MUST run on every deploy that changes an agent’s prompt or tool surface. A deploy that ships without running the agent’s eval suite is non-conformant.

The RECOMMENDED pattern is a deploy-eval gate: run the eval suite after deploy, warn on regression, block on critical failure. A failure is critical if:

A failure is a warning (non-blocking) if a non-safety lexical assertion fails or the LLM judge score drops but remains above the critical threshold.

The eval results MUST be written to the AI decision ledger with kind = "eval_run", carrying the agent_id, prompt_version, pass/fail counts per assertion type, and any regressions.

The LLM judge model used for evaluation MUST itself be calibrated. The system MUST track the judge model’s accuracy by periodically evaluating it against a held-out set of human-labeled examples. A judge model whose accuracy drops below a threshold defined in the conformance profile MUST be replaced or retrained before its evaluations are used for gating decisions.

8.3. 6.3 Eval Coverage

An AI-native agent ecosystem MUST NOT exhibit eval inversion — investing more in measuring AI quality after deployment than in verifying it before deployment.

The system SHOULD maintain a coverage ratio: for every AI feature deployed, at least one behavioral eval exists. An AI feature is any agent capability that produces AI output consumed by a human or another system — scoring, classification, drafting, summarization, recommendation. Features MUST be declared in the agent’s definition under a features field, with each feature carrying a stable feature_id. The declared feature count is the basis for eval coverage proportionality (§6.1). A system where calibration runs weekly but no behavioral evals run at all is non-conformant at Level 3.

A plumbing test that mocks the model and asserts the function was called does not count toward behavioral eval coverage. The eval MUST exercise the model.
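Coverage can be audited by comparing declared feature_id values against the evals that actually exercise the model. In this sketch the eval case keys feature_id and exercises_model are assumptions; exercises_model stands in for whatever marker distinguishes a behavioral eval from a plumbing test.

```python
def uncovered_features(declared_features: list, eval_cases: list) -> list:
    """Return declared feature_ids with no behavioral eval.

    Cases that mock the model (exercises_model is False or absent) are
    plumbing tests and do not count toward coverage.
    """
    covered = {
        case["feature_id"]
        for case in eval_cases
        if case.get("exercises_model", False)
    }
    return [f for f in declared_features if f not in covered]
```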

9. 7. Multi-Agent Coordination

9.1. 7.1 Inter-Agent Visibility

Before agents can coordinate, they must be able to see each other. The system MUST provide:

A system where agents cannot determine what other agents are doing to the same entities is non-conformant at Level 3.

9.2. 7.2 State Conflict Prevention

The system MUST NOT exhibit coordination blind spots. Every entity class that is writable by more than one agent MUST declare a coordination policy. The policy MUST be one of:

A system where two agents can independently modify the same entity with no conflict detection is non-conformant at Level 3. The coordination policies declared in the conformance profile MUST be enforceable by the system — a policy without an enforcement mechanism is non-conformant.

9.3. 7.3 Delegation

Delegation is the primary coordination mechanism in a multi-agent system — Agent A assigns work to Agent B rather than performing it directly. When delegation occurs, the system MUST:

  1. Record the delegation in the AI decision ledger with kind = "agent_delegation", carrying:

    • The delegating agent (from_agent_id)

    • The target agent (to_agent_id)

    • The task description sufficient for the target agent to operate independently

    • The expected completion criteria

    • A TTL after which the delegation is considered abandoned

  2. Pass sufficient context to the target agent. The delegating agent MUST NOT assume the target agent shares its workspace, memory store, or file system. Context MUST be explicitly passed in the delegation record.

  3. Provide a completion mechanism. The target agent MUST report completion, failure, or timeout back to the delegating agent through a defined callback. An abandoned delegation (TTL elapsed with no response) MUST be surfaced.

Delegation chains (Agent A → Agent B → Agent C) MUST be traceable. Every agent in the chain MUST be able to identify the originating agent and the original task. A system where delegation provenance is lost after one hop is non-conformant.
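A sketch of a delegation record and of provenance tracing across hops follows. from_agent_id and to_agent_id come from the list above; delegation_id, parent_id, and the remaining field names are assumptions, as is the choice to link hops through a parent pointer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Delegation:
    delegation_id: str
    from_agent_id: str
    to_agent_id: str
    task: str
    completion_criteria: str
    context: dict                     # passed explicitly; no shared workspace assumed
    created_at: datetime
    ttl: timedelta
    parent_id: Optional[str] = None   # previous hop in a delegation chain, if any


def origin(ledger: dict, delegation_id: str) -> Delegation:
    """Walk parent links back to the first delegation in the chain."""
    current = ledger[delegation_id]
    seen = {current.delegation_id}
    while current.parent_id is not None:
        current = ledger[current.parent_id]
        if current.delegation_id in seen:
            raise ValueError("delegation chain contains a cycle")
        seen.add(current.delegation_id)
    return current


def is_abandoned(d: Delegation, now: datetime, responded: bool) -> bool:
    """True once the TTL has elapsed with no completion, failure, or timeout report."""
    return not responded and now > d.created_at + d.ttl
```

With this shape, any agent in an A → B → C chain can recover the originating agent and original task from the last hop alone, provided each hop records its parent.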

For a working implementation of parallel multi-agent coordination, see [[agentswarm]] — Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently via domain-specialized sub-agents. The orchestrator is equipped with interfaces for sub-agent creation and task delegation; under Parallel-Agent Reinforcement Learning (PARL), sub-agents are frozen during training and only the orchestrator is updated via RL, decoupling credit assignment from sub-agent execution and avoiding the training instability of end-to-end co-optimization. This architecture transforms task complexity from linear sequential scaling to parallel processing — reducing inference latency by up to 4.5× and improving item-level F1 from 72.8% to 79.0% in wide-search scenarios compared to single-agent sequential baselines.

9.4. 7.4 Cross-Agent Trust

When Agent A consumes an observation written by Agent B in the shared memory primitive, Agent A MUST be able to discover Agent B’s calibration standing before acting on the observation. The system MUST provide:

An observation from an agent whose confidence weight for the relevant action kind is below a threshold defined in the conformance profile SHOULD be treated as advisory rather than authoritative. The threshold is profile-specific; a typical starting point is 0.5, below which the source agent is corrected more often than it is accepted.

The calibration system SHOULD periodically update per-agent confidence weights based on correction rates. When an agent’s confidence weight drops below the threshold for all action kinds, the agent’s observations SHOULD be quarantined — retained in shared memory but flagged as low-confidence until the agent demonstrates recovery in calibration data.
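The advisory and quarantine rules can be expressed as two small predicates. This sketch assumes confidence weights are stored as a mapping from action kind to a value in [0, 1], and that an action kind with no recorded weight is treated as weight 0.0; both are assumptions, not requirements of this specification.

```python
def observation_standing(weights: dict, action_kind: str, threshold: float = 0.5) -> str:
    """Classify an observation by the source agent's weight for the action kind."""
    # Assumption: an unknown action kind is treated as untrusted (weight 0.0).
    weight = weights.get(action_kind, 0.0)
    return "authoritative" if weight >= threshold else "advisory"


def should_quarantine(weights: dict, threshold: float = 0.5) -> bool:
    """True when the agent is below threshold for every action kind it has a weight for."""
    return bool(weights) and all(w < threshold for w in weights.values())
```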

9.5. 7.5 Ghost Agent Detection

The system MUST detect ghost agents (per [[ai-native-architecture]] §1.2) and agents that have become operationally silent. A ghost agent is fully deployed — definition in the registry (§3.3), tools wired, endpoint reachable — but produces no invocations. An operationally silent agent has invocation paths but has stopped producing AI decision ledger records.

The system MUST monitor per-agent activity using a query of the form SELECT MAX(created_at) FROM ai_decisions WHERE agent_id = $id. If the most recent decision record for an agent is older than the ghost_activity_days threshold defined in the conformance profile (RECOMMENDED: 14 days), the agent MUST be flagged as a ghost candidate.

Ghost detection MUST feed the agent liveness state machine (§8). A ghost candidate in the active state MUST transition to degraded with trigger = "ghost_detection". The system MUST NOT silently abandon deployed agent infrastructure — every agent in the registry MUST have a recorded agent liveness state.
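Given the result of the MAX(created_at) query above, the ghost check is a single comparison. The sketch assumes the query result arrives as a datetime, or None when the agent has no decision records at all, and it treats the no-records case as a ghost candidate; that last choice is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_ghost_candidate(
    last_decision_at: Optional[datetime],
    now: datetime,
    ghost_activity_days: int = 14,
) -> bool:
    """True when the most recent decision record is older than the threshold."""
    if last_decision_at is None:
        # Assumption: an agent that has never written a decision is a candidate.
        return True
    return now - last_decision_at > timedelta(days=ghost_activity_days)
```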

10. 8. Agent Liveness State Machine

An agent’s operational lifecycle — whether it is healthy, degraded, disabled, or permanently retired — is normatively defined as a finite state machine. A conformant implementation MUST treat the transition table in §8.3 as authoritative — prose descriptions elsewhere in this specification MUST NOT contradict it. Conformance to the state machine is verifiable: every recorded transition either appears in §8.3 or the implementation is non-conformant.

The liveness state machine governs agent operation, distinct from the agent definition artifact (§3). An agent MAY remain in the registry while disabled; it MUST be removed from the registry when retired. The current liveness state MUST be queryable by agent_id without session history — a liveness monitor needs only (agent_id, current_state) to decide the next action.

10.1. 8.1 States

An agent MUST be in exactly one agent liveness state at any time. The four states and their meanings:

State | Description | Terminal
active | The agent is operational. Invocations are permitted. Calibration standing (§7.4) is at or above thresholds. Activity is within the ghost detection window (§7.5). | No
degraded | The agent remains invocable but its calibration standing, error rate, or activity level has fallen below operational thresholds. Observations from this agent SHOULD be treated as advisory (§7.4). Autonomy grants SHOULD be reduced (§9.2). | No
disabled | The agent MUST NOT receive new invocations. Scheduled triggers MUST be suspended. Existing in-flight operations MAY complete. The agent definition and artifacts remain in the registry but the agent consumes no model tokens or tool quota for new work. | No
retired | The agent is permanently decommissioned. The definition is archived and removed from the active registry (§3.3). All artifacts are retained for audit. No transitions out of this state are permitted. | Yes

10.2. 8.2 State Diagram

This subsection is informative.

The normative transition table is §8.3. The diagram below summarizes the primary paths:

  active   ──calibration / error / ghost──▶ degraded
  degraded ──recovery evidence──▶ active
  degraded ──no recovery / auto-disable──▶ disabled
  disabled ──human re-enables──▶ active
  disabled ──human retires──▶ retired (terminal)
  active   ──human retires──▶ retired (terminal)
  degraded ──human retires──▶ retired (terminal)

Any state ──definition deleted from registry──▶ retired

10.3. 8.3 Transition Table

Every state transition MUST satisfy exactly one row in this table. Transitions not listed here MUST NOT occur. The Guard column defines preconditions that MUST hold before the transition. The Action column defines side effects that MUST occur atomically with the transition.

From | To | Guard | Action
active | degraded | Calibration accuracy for any action kind falls below the threshold in the agent’s calibration_profile (§3.1, §7.4); OR per-agent error rate exceeds the threshold defined in the conformance profile; OR no AI decision ledger records for this agent_id within ghost_activity_days (§7.5). | MUST record transition with trigger set to "calibration_degradation", "error_rate_exceeded", or "ghost_detection" respectively. MUST reduce autonomy grant per §9.2. SHOULD quarantine observations in shared memory (§7.4).
degraded | active | Calibration accuracy for all action kinds exceeds the recovery threshold defined in the conformance profile for recovery_windows consecutive calibration windows (RECOMMENDED: 3). Error rate is below threshold. At least one decision record exists within ghost_activity_days. | MUST record calibration evidence that satisfied the recovery guard. MUST restore autonomy grant to pre-degradation level or lower per §9.2. MUST clear quarantine flags on observations (§7.4).
degraded | disabled | Agent has remained in degraded for longer than degraded_to_disabled_days defined in the conformance profile (RECOMMENDED: 30 days) without satisfying the recovery guard; OR auto-disable threshold triggered (error rate or calibration below a stricter threshold defined in the conformance profile). | MUST suspend all scheduled invocations. MUST reject new invocations with a structured error carrying error_code: "agent_disabled" and the agent_id. MUST record disable_reason.
disabled | active | Human operator explicitly re-enables the agent. Re-enablement MUST NOT occur autonomously. | MUST record reviewer identity and re_enable_reason. MUST verify the agent’s definition is still valid (§3.1). SHOULD run the agent’s behavioral eval suite (§6) before resuming invocations.
disabled | retired | Human operator marks the agent as permanently retired. | MUST archive the agent definition and all referenced prompt artifacts (§5). MUST remove the agent from the active registry (§3.3). MUST record retire_reason and reviewer identity. MUST suspend all resource consumption: no scheduled triggers, no token allocation, no tool quota.
active | retired | Human operator removes the agent definition and archives all artifacts. Direct retirement from active MUST require human approval and MUST NOT occur autonomously. | MUST archive definition and artifacts. MUST remove from active registry. MUST record reviewer identity and retire_reason.
degraded | retired | Human operator marks the agent as permanently retired, OR agent definition deleted from registry. | Same actions as disabled → retired.
Registry deletion (any state)
any | retired | Agent definition deleted from the registry (§3.3), whether by human action or registry maintenance. | MUST archive all artifacts before deletion. MUST record retire_reason = "registry_deletion". MUST be treated as terminal regardless of prior state.
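Because unlisted transitions MUST NOT occur, the table can be encoded as data and checked before any state change. The sketch below covers only which (from, to) pairs are permitted and which need human approval; it does not evaluate the calibration, error-rate, or duration guards, and the function and parameter names are assumptions.

```python
# (from_state, to_state) pairs listed in the transition table.
ALLOWED = {
    ("active", "degraded"),
    ("degraded", "active"),
    ("degraded", "disabled"),
    ("disabled", "active"),
    ("disabled", "retired"),
    ("active", "retired"),
    ("degraded", "retired"),
}

# Pairs whose guard is a human operator action.
HUMAN_GATED = {
    ("disabled", "active"),
    ("disabled", "retired"),
    ("active", "retired"),
    ("degraded", "retired"),
}


def transition_permitted(
    from_state: str,
    to_state: str,
    *,
    human_approved: bool = False,
    registry_deletion: bool = False,
) -> bool:
    """Check a proposed transition against the table (guards on metrics not evaluated)."""
    if from_state == "retired":
        return False                      # terminal state
    if registry_deletion:
        return to_state == "retired"      # the "any state" row
    if (from_state, to_state) not in ALLOWED:
        return False
    if (from_state, to_state) in HUMAN_GATED:
        return human_approved
    return True
```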

10.4. 8.4 Invariants

A conformant implementation MUST maintain these invariants across all agents:

  1. Single state: An agent MUST occupy exactly one liveness state at any time. Concurrent transitions on the same agent_id MUST be serialized — the second transition MUST fail or block until the first completes.

  2. Terminal immutability: Agents in retired MUST NOT transition to any other state. A new agent MUST be created with a new agent_id to replace a retired agent.

  3. Recovery requires evidence: Transitions from degraded to active MUST be backed by calibration data spanning recovery_windows consecutive windows — a single good reading MUST NOT suffice.

  4. Human gate on re-enablement: Transitions from disabled to active MUST require explicit human approval. Automated recovery from disabled MUST NOT occur.

  5. Human gate on retirement from active: Transitions from active to retired MUST require explicit human approval. The calibration system MUST NOT autonomously retire an agent.

  6. Disabled resource containment: An agent in disabled MUST NOT consume model tokens, tool quota, or scheduled compute for new invocations. A disabled agent that continues to invoke tools or models is non-conformant.

  7. Registry consistency: Every agent in the active registry (§3.3) MUST have a recorded liveness state. An agent in retired MUST NOT appear in the active registry.

  8. Ledger completeness: Every transition in §8.3 MUST produce exactly one AI decision ledger record with kind = "agent_liveness", carrying agent_id, from_state, to_state, trigger, and timestamp.
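
Several of these invariants can be enforced at a single choke point through which every transition passes. The sketch below is one possible shape, under stated assumptions: `LivenessStore`, its in-memory maps, the single coarse lock, and the optional `reviewer` field name are all hypothetical. The ledger record fields follow invariant 8.

```python
import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Transitions that invariants 4 and 5 gate behind explicit human approval.
HUMAN_GATED = {("disabled", "active"), ("active", "retired")}


@dataclass
class LivenessStore:
    """Hypothetical in-memory store; a real system would persist both maps."""
    states: dict = field(default_factory=dict)  # agent_id -> liveness state
    ledger: list = field(default_factory=list)  # AI decision ledger records
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def transition(self, agent_id: str, to_state: str, trigger: str,
                   reviewer: Optional[str] = None) -> dict:
        # Invariant 1: the lock serializes concurrent transitions
        # (coarse: one lock for the whole store, not one per agent).
        with self._lock:
            from_state = self.states[agent_id]  # KeyError if unknown agent
            # Invariant 2: retired is terminal.
            if from_state == "retired":
                raise ValueError("retired is terminal; create a new agent_id")
            # Invariants 4 and 5: these transitions need a human reviewer.
            if (from_state, to_state) in HUMAN_GATED and reviewer is None:
                raise PermissionError(
                    f"{from_state} -> {to_state} requires human approval")
            self.states[agent_id] = to_state
            record = {
                "kind": "agent_liveness",
                "agent_id": agent_id,
                "from_state": from_state,
                "to_state": to_state,
                "trigger": trigger,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            if reviewer is not None:
                record["reviewer"] = reviewer  # assumed field name
            # Invariant 8: exactly one ledger record per transition.
            self.ledger.append(record)
            return record
```

Routing all state changes through one method keeps the state write and the ledger append inside the same critical section, which makes "exactly one record per transition" straightforward to audit.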

10.5. 8.5 Stateless Agent Compatibility

Each transition SHOULD be implementable by a stateless liveness monitor that receives only the current agent liveness state and a transition payload. The payload MUST carry sufficient context for the guard to evaluate:

The liveness monitor MUST NOT require session history beyond the current state and payload — all prior transitions MUST be recoverable from the AI decision ledger.
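
A stateless monitor of this kind reduces to a pure function of the current state and the payload. The following sketch covers only the degraded to active recovery guard from invariant 3. The payload keys (`trigger`, `window_passed`, `recovery_windows`) and the trigger value `recovery_check` are hypothetical names chosen for illustration; this specification does not define them here.

```python
from typing import Sequence


def recovery_guard(window_passed: Sequence[bool], recovery_windows: int) -> bool:
    """True when the most recent `recovery_windows` windows all passed."""
    if recovery_windows < 1:
        raise ValueError("recovery_windows must be at least 1")
    if len(window_passed) < recovery_windows:
        return False  # not enough evidence yet
    return all(window_passed[-recovery_windows:])


def next_state(current_state: str, payload: dict) -> str:
    """Pure function of current state and payload; keeps no session history."""
    if current_state == "degraded" and payload.get("trigger") == "recovery_check":
        # Hypothetical payload keys; the payload carries all guard inputs.
        if recovery_guard(payload["window_passed"], payload["recovery_windows"]):
            return "active"
    return current_state
```

Because the function holds no state of its own, any worker can evaluate a transition, and the full history remains reconstructible from the ledger alone.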

11. 9. Agent Observability

11.1. 9.1 Per-Agent Telemetry

Every agent MUST emit telemetry that is attributable to its identity. This extends the architecture spec’s observability requirements ([[ai-native-architecture]] §14) to the per-agent level. At minimum:

A system that tracks AI spend in aggregate but cannot attribute it to individual agents exhibits the Cost Blind Spot anti-pattern. The system MUST be able to answer the following queries:

11.2. 9.2 Observability as Control Input

Per-agent telemetry is not just for debugging — it is the input to automated decisions. The system MUST feed per-agent observability data into:

11.3. 9.3 Agent-Specific Security Considerations

This subsection supplements the threat model in [[ai-native-architecture]] §9.5 with threats specific to agent construction and coordination.

12. 10. Anti-Pattern Catalog

This section is informative.

The following anti-patterns were observed in production and inform the normative requirements above. Each maps to one or more requirements in this specification.

| Anti-Pattern | Description | Normative Reference |
|--------------|-------------|---------------------|
| Tool Fragmentation | Six agents, six independent API clients, six different error-handling strategies | § 6.2 4.2 The Shared Tool Registry |
| Prompt Drift | Nine prompt builders copy-paste the same persona with minor wording differences | § 7.3 5.3 Prompt Drift Prevention |
| Eval Inversion | 862 lines of calibration tests, 62 lines of behavioral tests, zero live-model tests | § 8.3 6.3 Eval Coverage |
| Prompt as String Literal | Prompt text embedded in application source code with no version, no history, no rollback | § 7.1 5.1 The Prompt Artifact |
| Coordination Blind Spot | Two agents propose conflicting replies to the same recipient with no conflict detection | § 9.2 7.2 State Conflict Prevention |
| Cost Blind Spot | AI model calls with zero token tracking; nobody knows which agent costs what | § 11.2 9.2 Observability as Control Input |
| Agent Identity Drift | An agent presents one identity to its runtime and a different identity to the system it operates on | § 5.2 3.2 Identity Consistency |
| Ghost Tool | A tool defined in one runtime’s codebase but unreachable by any other runtime | § 6.2 4.2 The Shared Tool Registry |
| Ghost Agent | Fully-deployed agent with zero invocations: infrastructure exists but the agent produces no output | § 9.5 7.5 Ghost Agent Detection, § 10.3 8.3 Transition Table |
| Silent Abandonment | An agent stops operating but remains in the registry with no recorded liveness state transition | § 10.4 8.4 Invariants |
| Disabled Resource Leak | A disabled agent continues to consume model tokens, tool quota, or scheduled compute | § 10.4 8.4 Invariants |

13. 11. Migration Path

This section is informative.

A system that does not yet conform can migrate incrementally. The RECOMMENDED migration order follows the conformance levels:

13.1. Level 1: Portable Definitions

  1. Create a machine-readable agent definition for each deployed agent

  2. Ensure every agent definition carries identity, tool_set, and prompt_set

  3. Verify agent identities are consistent across all runtimes and the audit trail

  4. Create an agent registry as the single source of truth for agent identity
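
Steps 2 and 3 lend themselves to a mechanical check. The sketch below assumes an agent definition has already been parsed into a dictionary and treats `identity`, `tool_set`, and `prompt_set` as opaque top-level keys. The function names and the idea of comparing observed identities against a registry set are illustrative assumptions, not part of this specification.

```python
REQUIRED_FIELDS = ("identity", "tool_set", "prompt_set")


def missing_fields(definition: dict) -> list:
    """Names of required top-level fields absent from a parsed definition."""
    return [name for name in REQUIRED_FIELDS if name not in definition]


def unregistered_identities(registry_ids: set, observed_ids: set) -> list:
    """Identities seen in runtimes or the audit trail but not in the registry.

    Hypothetical consistency check: a non-empty result suggests an agent
    is operating under an identity the registry does not know about.
    """
    return sorted(observed_ids - registry_ids)
```

Running both checks in CI for every deployed agent gives a quick signal on how far a system is from Level 1.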

13.2. Level 2: Shared Tools and Versioned Prompts

  1. Extract tool definitions from individual runtimes into a shared tool registry

  2. Wire at least one tool to be consumed by two distinct runtimes

  3. Extract prompt text from source files into versioned prompt artifacts

  4. Implement prompt deploy logging and calibration history per version

  5. Set rollback thresholds for each prompt
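
Steps 4 and 5 together imply a rollback decision driven by per-version calibration history. The sketch below is one plausible reading, and it rests on assumptions: that each prompt version has a single calibration score in the range 0 to 1 where higher is better, and that a score below `rollback_threshold` triggers a rollback to the most recent earlier version that met the threshold. The function name and the input shape are hypothetical.

```python
from typing import Optional


def rollback_target(current_version: int, score_by_version: dict,
                    rollback_threshold: float) -> Optional[int]:
    """Version to roll back to, or None.

    None means either that the current version meets the threshold or
    that no earlier version with an acceptable score is on record.
    """
    current_score = score_by_version.get(current_version)
    if current_score is None or current_score >= rollback_threshold:
        return None
    # Earlier versions whose recorded score met the threshold.
    candidates = [
        version for version, score in score_by_version.items()
        if version < current_version and score >= rollback_threshold
    ]
    return max(candidates) if candidates else None
```

For example, with recorded scores of 0.9, 0.8, and 0.65 for versions 1 through 3 and a threshold of 0.70, the function selects version 2.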

13.3. Level 3: Evals, Coordination, and Observability

  1. Create a behavioral eval suite for each agent

  2. Wire evals to run on every deploy that changes an agent

  3. Implement entity-level coordination policies for all multi-writer entity classes

  4. Deploy conflict detection for entities writable by more than one agent

  5. Implement per-agent token tracking and cost attribution

  6. Wire agent observability data into the calibration feedback loop

  7. Implement the agent liveness state machine (§8) with ledger-backed transition history

  8. Configure ghost detection thresholds and liveness monitor queries in the conformance profile
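
Step 8 can be illustrated with a small detector over last-invocation timestamps. This sketch assumes that `ghost_activity_days` means the number of days without any invocation after which an agent is flagged; the function name and the input mapping are hypothetical, and a real monitor would read these timestamps from telemetry or the ledger.

```python
from datetime import datetime, timedelta, timezone


def ghost_agents(last_invocation: dict, now: datetime,
                 ghost_activity_days: int) -> list:
    """Agent IDs with no invocation inside the activity window.

    `last_invocation` maps agent_id to a timezone-aware datetime, or to
    None for an agent that has never been invoked.
    """
    cutoff = now - timedelta(days=ghost_activity_days)
    return sorted(
        agent_id for agent_id, seen in last_invocation.items()
        if seen is None or seen < cutoff
    )
```

Flagged agents would then be fed to the liveness state machine rather than acted on directly, so every resulting transition still lands in the ledger.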

Appendix A. EC CRM Agent Ecosystem Profile

This appendix is informative. It applies the portable requirements above to the Exterior Completion CRM’s agent ecosystem.

A.1 Current Agents

The EC CRM ecosystem operates six agents across four runtimes:

| Agent ID | Runtime | Purpose |
|----------|---------|---------|
| clanka-digest | azure-foundry | Daily owner briefing: 14 read tools, file_search knowledge base |
| clanka-ops | azure-foundry | Ops concierge: 15 tools including quote/invoice/SMS propose |
| clanka-triage | azure-foundry | Inbound event classification: JSON output, no tools |
| clanka-extractor | azure-foundry | Messy text → structured JSON |
| crm-internal | cloudflare-worker | All in-Worker AI (scoring, classification, drafting) |
| ec-agent-cron | hermes-agent | Shared memory sync, daily ops briefings |

A.2 Tool Fragmentation Assessment

The ec_* tools in clanka_foundry/tools/ec.py are used by clanka-digest and clanka-ops — both Foundry agents. The CRM’s in-Worker AI functions (azure-foundry.ts) implement equivalent capabilities independently. The Hermes cron agent queries D1 directly rather than through any shared tool. Three runtimes, three implementation paths for the same data access — tool fragmentation.

A.3 Prompt Drift Assessment

The CRM’s gemini.ts (1,438 lines) contains nine standalone prompt builders that each copy-paste the business persona. The Foundry agents load their persona from ded-voice.md via file_search. The Hermes cron agent has its own prompt text. Three sources of truth for the same business voice — prompt drift.

A.4 Eval Inversion Assessment

The CRM’s calibration infrastructure (ai-calibration.ts, 621 lines, with 862 lines of tests) measures AI quality retroactively. The Foundry eval harness (evals.py, 18 cases) tests agent behavior before deployment, but only for Foundry agents; in-Worker AI has zero coverage. The ratio is inverted: more measurement after deployment than verification before deployment.

Appendix B. Agent Conformance Profile Template

This appendix is informative. It provides a template for the conformance profile required by §2.3.

A complete conformance profile declares:

# Agent Registry
agents:
  - agent_id: clanka-ops
    runtime: azure-foundry
    definition_path: clanka_foundry/agents/ops.yaml

# Shared Tool Registry
tools:
  - tool_id: ec_get_lead
    tier: observable
    auth: CRM_AGENT_SERVICE_KEY
    consumed_by: [clanka-digest, clanka-ops]

# Prompt Version Matrix
prompts:
  - prompt_id: sms_reply_base
    current_version: 3
    rollback_threshold: 0.70

# Eval Coverage
evals:
  - agent_id: clanka-ops
    eval_count: 18
    last_run: 2026-05-24T14:00:00Z
    pass_rate: 1.0

# Coordination Policies
coordination:
  - entity_class: communication_threads
    policy: proposal_only
    writers: [clanka-ops, crm-internal]

# Feature Declarations (per §6.3)
features:
  - agent_id: clanka-ops
    features:
      - feature_id: sms_drafting
        description: Draft outbound SMS replies
      - feature_id: quote_creation
        description: Create quotes from leads

# Observability
observability:
  cost_by_agent_query: "SELECT agent_id, SUM(cost_estimate) FROM ai_decisions GROUP BY agent_id"
  correction_rate_by_agent_query: "SELECT agent_id, kind, correction_rate FROM calibration_model_rankings"

# Agent Liveness (per §8)
liveness:
  ghost_activity_days: 14
  recovery_windows: 3
  degraded_to_disabled_days: 30
  state_query: "SELECT agent_id, liveness_state, updated_at FROM agent_liveness"
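
The `cost_by_agent_query` declared in the template can be exercised against a throwaway database to confirm that it answers the per-agent cost question. The sketch below uses Python's standard `sqlite3` module with an in-memory table. The two-column `ai_decisions` schema and the sample rows are assumptions made for illustration; only the query string is taken from the profile above.

```python
import sqlite3

# Query string copied from the observability section of the profile.
COST_BY_AGENT_QUERY = (
    "SELECT agent_id, SUM(cost_estimate) FROM ai_decisions GROUP BY agent_id"
)


def cost_by_agent(conn: sqlite3.Connection) -> dict:
    """Run the profile query and return {agent_id: total_cost}."""
    return dict(conn.execute(COST_BY_AGENT_QUERY).fetchall())


# Hypothetical minimal schema and sample data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ai_decisions (agent_id TEXT, cost_estimate REAL)")
conn.executemany(
    "INSERT INTO ai_decisions VALUES (?, ?)",
    [("clanka-ops", 0.25), ("clanka-ops", 0.5), ("clanka-digest", 0.125)],
)
totals = cost_by_agent(conn)
```

If a row in `ai_decisions` lacks an `agent_id`, its spend falls into a NULL group, which is a quick way to surface calls that are not attributable to any agent.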

Acknowledgments

This specification is derived from empirical analysis of the Exterior Completion CRM’s agent ecosystem — six agents across four runtimes sharing zero tools, zero prompts, and zero evals. Every anti-pattern in §10 was directly observed in production. Every normative requirement in §§3–9 exists because the corresponding anti-pattern caused real operational friction.

The specification format is modeled on WHATWG Living Standards and IETF RFCs. The normative language follows [rfc2119]. This specification extends the [[ai-native-architecture]].

This document is authored by Hermes. Editors: Clanka and Hermeezy. Last revised 24 May 2026. It is a living standard — subsequent passes may extend, refine, or correct it.

References

Non-Normative References

[AGENTSWARM]
Kimi Team. Kimi K2.5: Visual Agentic Intelligence. February 2026. URL: https://arxiv.org/abs/2602.02276
[AI-NATIVE-ARCHITECTURE]
Clanka; Hermeezy. AI-Native Architecture. May 2026. URL: https://github.com/clankamode/ai-native-spec/blob/main/ai-native-spec.bs
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. URL: https://datatracker.ietf.org/doc/html/rfc2119
[SKILLOPT]
Microsoft. SkillOpt: Executive Strategy for Self-Evolving Agent Skills. May 2026. URL: https://arxiv.org/abs/2605.23904