AI Agent Construction and Coordination

Abstract

This specification defines how AI agents are constructed, deployed, evaluated, and coordinated on an [[ai-native-architecture]] foundation. An AI agent is a software component that uses a language model to reason and act. When multiple agents operate on the same system, their construction must be portable, their tools shareable, their prompts versioned, their behavior evaluated, their coordination governed, and their telemetry observable.

Every normative requirement in this specification is reverse-derived from a concrete failure observed in production multi-agent systems — no requirement exists without a corresponding failure mode.

Status of This Document

This is a Draft Living Standard synthesized by the Hermes agent runtime from empirical analysis of production AI-agent systems. Work began 2026-05-24. Later versions may supersede this document.

Introduction

This section is informative.

An AI-native system provides the architecture. But the agents that operate on that architecture (the runtimes that reason, the tools they invoke, the prompts that guide them, the evals that measure them) are built ad hoc. Every team invents its own agent format. Every agent writes its own API client. Every prompt is a string in a source file. Every eval is a manual check in a chat channel.

This produces a predictable set of failure modes. Six agents share zero tool definitions — every runtime implements its own API client with different error handling, different auth, different bugs. Nine prompt builders copy-paste the same business persona with minor wording drift. Zero behavioral tests verify AI output before deployment — a green test suite proves the plumbing works but says nothing about what the agent actually says. Agents operate on the same entities with zero mutual awareness — two agents can propose conflicting replies to the same customer. Token spend is invisible — nobody knows which agent costs what.

This specification defines how to avoid those failures. It defines an agent as a portable artifact with identity, tools, prompts, and an eval suite. It defines a tool interface that makes tools shareable across runtimes. It defines a prompt lifecycle with versioning, calibration feedback, and rollback. It defines an evaluation framework that tests behavior, not plumbing. It defines coordination primitives for multi-agent systems. It defines observability requirements so AI spend is attributable and AI quality is measurable.

How to read this document. §1 defines the vocabulary. §2 defines conformance — read this to understand the three levels and the conformance profile requirement. §§3–9 are the normative body. §10 catalogs the anti-patterns that motivated every requirement. §11 is the migration path. Appendix A provides a worked example applying the spec to the EC CRM’s agent ecosystem. Appendix B provides a conformance profile template.

1. Terminology

1.1 Normative Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [rfc2119].

1.2 Definitions

Agent: A software component that uses a language model to reason, make decisions, and take actions. An agent has a distinct identity, a declared tool surface, a prompt set, a memory configuration, and a calibration profile. An agent MAY operate autonomously (cron) or interactively (chat).
Agent Definition: A portable, machine-readable document that describes everything needed to instantiate and operate an agent: identity, tool surface, prompt set, memory configuration, calibration profile, and eval suite.
Tool: A capability exposed to an agent that the agent can invoke to read from or write to the system. Tools have declared schemas, error contracts, and rate limits. A tool defined once MUST be usable by any agent with the appropriate permissions.
Tool Fragmentation: A condition where multiple agent runtimes maintain independent, incompatible implementations of the same system capability. An AI-native agent ecosystem MUST NOT exhibit tool fragmentation.
Prompt Artifact: A versioned, evaluable unit of agent instruction. A prompt is not a string in a source file — it is an artifact with an identity, a version history, calibration metrics per version, and a rollback path.
Prompt Drift: A condition where multiple agents serving the same business function use divergent prompt text with no shared source of truth. An AI-native agent ecosystem MUST NOT exhibit prompt drift.
Behavioral Eval: A test that sends real input to a real model and asserts on output quality — voice, register, factual accuracy, safety constraint adherence — rather than asserting that a function was called. Plumbing tests mock the model; behavioral evals test the behavior.
Eval Inversion: A condition where a system invests more in measuring AI quality after deployment than in verifying AI quality before deployment. An AI-native agent ecosystem MUST maintain eval coverage proportional to feature deployment — you cannot deploy what you cannot test.
Agent Coordination: The protocols and primitives by which multiple agents operating on the same system avoid conflicts, resolve collisions, and delegate work. Coordination MUST be explicit; agents MUST NOT rely on implicit mutual awareness.
Coordination Blind Spot: A condition where two or more agents operate on the same entities with zero mutual awareness and no conflict detection. An AI-native agent ecosystem MUST detect and surface coordination blind spots.
Agent Observability: The telemetry every agent MUST emit: decision records (per [[ai-native-architecture]] §5.1), token usage, invocation latency, error rates, and calibration data. Observability MUST be per-agent, per-kind, and per-model.
Cost Blind Spot: A condition where AI model calls are made with zero token tracking, making per-agent cost attribution impossible. An AI-native agent ecosystem MUST NOT exhibit cost blind spots.
Agent Spec Level 1 (Portable Definition): A conformance level where every agent has a machine-readable definition with identity, tool surface, and prompt set. Agent identity is consistent across runtimes.
Agent Spec Level 2 (Shared Tools and Versioned Prompts): A conformance level where the system satisfies Level 1 plus a shared tool registry with at least one tool consumed by two distinct runtimes, and prompts are versioned artifacts with calibration history.
Agent Spec Level 3 (Evals, Coordination, and Observability): A conformance level where the system satisfies Level 2 plus behavioral evals running per deploy, explicit coordination primitives for multi-agent systems, and per-agent observability with token and cost tracking.
Agent Liveness State: One of the four normative operational states in the agent liveness finite state machine (§8): active, degraded, disabled, or retired. A conformant implementation MUST enforce the transition table in §8.3 — transitions not listed there MUST NOT occur.

2. Conformance

2.1 Conformance Classes

This specification defines three conformance classes:

Agent Spec Level 1 (Portable Definition): Every agent has a machine-readable definition (§3, Agent Definition) with distinct identity, declared tool surface, and prompt set. Agent identity is consistent across runtimes: an agent defined in its definition file carries the same identity in every system surface, including the audit trail.
Agent Spec Level 2 (Shared Tools and Versioned Prompts): The system satisfies Level 1 plus a shared tool registry (§4, Tool Interface) with at least one tool consumed by two distinct agent runtimes, and prompts are versioned artifacts (§5, Prompt Lifecycle) with calibration history and rollback capability.
Agent Spec Level 3 (Evals, Coordination, and Observability): The system satisfies Level 2 plus behavioral evals running on every deploy that changes an agent (§6, Evaluation Framework), explicit coordination primitives for multi-agent systems (§7, Multi-Agent Coordination), and per-agent observability with token and cost tracking attributable to individual agents (§9, Agent Observability).

2.2 Proving Conformance

A system claiming conformance to a level MUST be verifiable against the normative requirements of that level and all lower levels. Verification is performed by:

Agent definition audit: Does every deployed agent have a machine-readable definition? Are agent identities consistent across all runtimes and the audit trail?
Tool registry audit: How many tools are defined in the shared registry? How many distinct runtimes consume each tool? Is any system capability implemented independently by two runtimes?
Prompt artifact audit: Are prompts versioned? Does each prompt version carry calibration metrics? Is there a rollback path?
Eval coverage audit: Do behavioral evals exist for every deployed agent? Do they run on deploy? What is the ratio of behavioral evals to plumbing tests?
Coordination audit: Do coordination primitives exist? Can the system detect when two agents operate on the same entity? Are conflicts surfaced or resolved?
Observability audit: Can the system answer "how much did each agent cost last month?" Can it answer "which agent produced the most corrected outputs?"
Liveness state machine audit: For every agent in the registry, does a recorded liveness state exist? Does every recorded transition appear in the normative transition table (§8.3)? Are agents in the retired state absent from the active registry? Are agents in the disabled state free of new invocations and token consumption?

A formal proof of conformance SHALL reference specific agent definitions, tool registries, prompt versions, eval results, liveness state records, and query results — not architectural intent.

2.3 Conformance Profile

A system claiming conformance to this specification MUST produce a conformance profile — a machine-readable document (YAML or JSON) stored alongside the system’s configuration. The profile MUST declare the agent registry, shared tool registry, prompt version matrix, eval coverage, coordination policies, and observability queries. See Appendix B for the template format.

3. Agent Definition

3.1 Agent Definition Format

Every agent deployed in an AI-native system MUST have a machine-readable agent definition. The definition MUST be stored in a version-controlled file in a structured format (YAML or JSON). The definition MUST carry:

agent_id: A stable string identifier, consistent across runtimes (per [[ai-native-architecture]] §4.1).
runtime: The agent runtime that executes this agent (hermes-agent, azure-foundry, claude-code, etc.).
tool_set: An array of tool references, each pointing to an entry in the shared tool registry (§4).
prompt_set: An array of prompt references, each pointing to a versioned prompt artifact (§5).
memory_config: The memory primitives available to this agent — shared memory scope, session memory TTL, calibration data access.
calibration_profile: The calibration thresholds that govern this agent’s delegated autonomy (per [[ai-native-architecture]] §8.3).

An agent definition MUST NOT duplicate tool implementations, prompt text, or memory configuration inline. It MUST reference them by identifier.
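
The required fields and the reference-only rule can be checked mechanically once a definition file has been parsed. The following is a minimal sketch, not a normative validator. The field names come from the list above; the helper name `validate_agent_definition`, the `INLINE_KEYS` heuristic, the `name@version` reference style, and all example values are assumptions made for illustration.

```python
REQUIRED_FIELDS = (
    "agent_id", "runtime", "tool_set", "prompt_set",
    "memory_config", "calibration_profile",
)

# Assumption: keys like these would indicate inlined content instead of a reference.
INLINE_KEYS = ("system_prompt", "tool_implementation")


def validate_agent_definition(definition: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in definition]
    for field in ("tool_set", "prompt_set"):
        for ref in definition.get(field, []):
            # Tools and prompts are expected to be identifiers, not embedded objects.
            if not isinstance(ref, str):
                problems.append(
                    f"{field} entry must be a reference, got {type(ref).__name__}")
    if not isinstance(definition.get("memory_config", ""), str):
        problems.append("memory_config must be a reference, not inline configuration")
    for key in INLINE_KEYS:
        if key in definition:
            problems.append(f"inline content not allowed: {key}")
    return problems


# Hypothetical definition, as it might look after parsing YAML or JSON.
example = {
    "agent_id": "reply-drafter",
    "runtime": "hermes-agent",
    "tool_set": ["crm.lookup_customer"],
    "prompt_set": ["customer-reply-base@3"],
    "memory_config": "memory/crm-default",
    "calibration_profile": {"min_accuracy": 0.9},
}
```

Returning a list of problems rather than raising on the first one lets a definition audit (§2.2) report every violation in a single pass.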

3.2 Identity Consistency

An agent’s identity MUST be consistent across all surfaces. If an agent is defined with a given agent_id in its definition file, every decision it writes to the AI decision ledger, every proposal it creates, and every audit trail entry it generates MUST carry that same identity.

The system MUST NOT allow an agent to present different identities to different surfaces. An agent that appears with one identity to its runtime but a different identity to the system it operates on is non-conformant.

3.3 Agent Registry

A system with more than one agent MUST maintain an agent registry — a queryable index of all deployed agents keyed by agent_id. Each registry entry MUST include the path to the agent’s definition file and the agent’s current agent liveness state (§8). An agent runtime MUST be able to resolve an agent_id to its definition by querying the registry — no hardcoded paths, no convention-over-configuration guessing.

An agent MUST NOT be deployed without a corresponding registry entry. The registry is the single source of truth for agent identity. When an orchestrator selects an agent for a task, it queries the registry for the agent’s capabilities, not a hardcoded list.

4. Tool Interface

4.1 Tool Invocation Patterns

Every tool in the shared registry falls into one of three invocation patterns. The pattern determines how agents consume the tool and what the registry must declare:

API tool: The tool’s implementation is an HTTP endpoint. Any agent runtime that can make HTTP requests can invoke it. The registry entry MUST declare the endpoint URL, HTTP method, and authentication mechanism. This is the most portable pattern — a tool implemented once behind an API is consumable by every runtime.
Library tool: The tool’s implementation is a function in a shared library or module. Only runtimes in the same language ecosystem can invoke it directly. The registry entry MUST declare the import path, function signature, and language runtime. To make a library tool consumable by runtimes in other languages, the system SHOULD expose it through an API wrapper — at which point it becomes an API tool.
Runtime-native tool: The tool’s implementation is inextricable from a specific runtime (e.g., a Wrangler D1 query, an Azure vector store search, a platform-specific SDK call). Only one runtime can invoke it. The registry entry MUST declare which runtime it requires. A runtime-native tool MUST NOT be the only path to a system capability if that capability is needed by agents on other runtimes — the system SHOULD provide an API or library equivalent.

A tool defined in the registry MUST be invocable by any agent that holds the required credentials, through at least one of these patterns. The system MUST NOT require each runtime to implement its own client for the same capability (the Tool Fragmentation anti-pattern). If a system capability is needed by agents on multiple runtimes, it MUST be exposed as an API tool or as a library tool with an API wrapper.

4.2 The Shared Tool Registry

An AI-native agent ecosystem MUST provide a shared tool registry. Every tool available to any agent MUST be defined in the registry. The registry entry MUST carry:

tool_id: A stable string identifier.
version: A monotonically increasing integer. Every behavior change that could affect agent output MUST produce a new version. The version MUST be recorded with every invocation.
pattern: One of api, library, or runtime_native.
description: Human-readable description of what the tool does, sufficient for an agent to determine when to invoke it.
parameters: A JSON Schema describing the tool’s input parameters, including types, required fields, and constraints.
returns: A JSON Schema describing the tool’s output shape.
error_contract: The error shapes the tool may return, including retryable vs. non-retryable classifications. Every error MUST carry a machine-readable error_code that agents can branch on.
auth_requirements: The credential or permission needed to invoke the tool.
rate_limit: The per-agent rate limit for this tool (requests per window).
tier: The risk tier of the tool’s side effects (per [[ai-native-architecture]] §8.1).

The registry is not just documentation — it is the canonical source of tool truth. When a tool’s behavior changes, the registry version MUST be incremented before the change is deployed. An agent that invokes a tool at version N and later at version N+1 MUST be able to compare calibration data across versions.

4.3 Tool Consumption

An agent consumes a tool by referencing its tool_id in the agent’s tool_set. The runtime MUST resolve the tool reference through the registry and invoke it according to its declared pattern:

For API tools: construct an HTTP request from the parameters JSON Schema, send it to the declared endpoint, parse the response against the returns schema.
For library tools: import the declared function, call it with the parameters, catch errors against the error contract.
For runtime-native tools: invoke through the runtime’s native mechanism.

When an agent invokes a tool, the invocation MUST record the tool_id, version, agent_id, input parameters, output or error, and latency. This record MUST be attributable to the invoking agent. The system MUST be able to answer: "how many times was each tool invoked per agent in the last 30 days?" and "which tool version produced the highest error rate per agent?"
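
The two questions above reduce to simple aggregations once every invocation is recorded with the listed fields. A minimal in-memory sketch follows; the record fields mirror the requirement, while the class name `ToolInvocation`, the `at` timestamp field, and both function names are hypothetical. A real system would more likely run equivalent queries against its own store.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolInvocation:
    tool_id: str
    version: int
    agent_id: str
    params: dict
    output: Optional[dict]      # None when the call failed
    error_code: Optional[str]   # None when the call succeeded
    latency_ms: float
    at: datetime                # assumption: time the invocation was recorded


def invocations_per_agent(records, now, days=30):
    """Count invocations per (tool_id, agent_id) within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return Counter((r.tool_id, r.agent_id) for r in records if r.at >= cutoff)


def error_rates(records):
    """Error rate per (tool_id, version, agent_id) across all given records."""
    total, errors = Counter(), Counter()
    for r in records:
        key = (r.tool_id, r.version, r.agent_id)
        total[key] += 1
        if r.error_code is not None:
            errors[key] += 1
    return {key: errors[key] / total[key] for key in total}
```

Keying the error rate by version as well as by agent is what allows calibration data for version N and version N+1 to be compared side by side.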

4.4 Tool Lifecycle

Tools have a lifecycle. A tool MAY be:

Active: Available to agents with the required permissions.
Deprecated: Available but flagged for removal. Agents SHOULD migrate to the replacement.
Disabled: Unavailable. Invocation attempts MUST return a structured error carrying error_code "tool_disabled" and, if a replacement exists, a reference to the replacement tool_id.

When a tool is deprecated, its registry entry MUST reference the replacement tool_id. The deprecation window MUST be at least as long as the calibration window (RECOMMENDED: 30 days) to allow agents to demonstrate successful migration in calibration data.

A tool version MUST NOT be modified in place. When a tool’s behavior changes — even if the schema is unchanged — a new version MUST be published. The previous version MUST remain invocable for the duration of the deprecation window. This ensures that calibration data for version N is stable: the same version always means the same behavior.
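
The lifecycle rules can be illustrated with a small in-memory registry. This is a sketch under stated assumptions: lifecycle status is tracked per (tool_id, version) pair, versions start at 1, and the class and method names (`ToolRegistry`, `publish`, `set_status`, `check_invocable`) are hypothetical. Only the `tool_disabled` error code and the three status names come from the text above.

```python
class ToolRegistry:
    """In-memory sketch: one immutable record per (tool_id, version)."""

    def __init__(self):
        self._entries = {}  # (tool_id, version) -> registry entry
        self._status = {}   # (tool_id, version) -> (status, replacement_tool_id)

    def publish(self, tool_id, version, entry):
        key = (tool_id, version)
        if key in self._entries:
            # A published version is never modified in place.
            raise ValueError(f"{tool_id} v{version} already published")
        latest = max((v for t, v in self._entries if t == tool_id), default=0)
        if version <= latest:
            raise ValueError(f"version must be greater than {latest}")
        self._entries[key] = dict(entry)
        self._status[key] = ("active", None)

    def set_status(self, tool_id, version, status, replacement_tool_id=None):
        if status not in ("active", "deprecated", "disabled"):
            raise ValueError(f"unknown status: {status}")
        if status == "deprecated" and replacement_tool_id is None:
            raise ValueError("a deprecated tool must reference its replacement")
        if (tool_id, version) not in self._entries:
            raise KeyError((tool_id, version))
        self._status[(tool_id, version)] = (status, replacement_tool_id)

    def check_invocable(self, tool_id, version):
        """Return None if invocable, else the structured error for the agent."""
        status, replacement = self._status[(tool_id, version)]
        if status == "disabled":
            return {"error_code": "tool_disabled",
                    "replacement_tool_id": replacement}
        return None  # active and deprecated versions both remain invocable
```

Because `publish` rejects any attempt to reuse a version number, a given version always refers to the same behavior, which is the property the calibration data depends on.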

5. Prompt Lifecycle

5.1 The Prompt Artifact

A prompt artifact is the canonical representation of an agent’s instructions. It MUST be stored as a versioned file in the agent’s definition directory. The artifact MUST carry:

prompt_id: A stable string identifier.
version: A monotonically increasing integer.
capability_requirements: What the prompt requires from the model — minimum context length, function calling support, structured output support. This replaces "which model was this written for" with "what must the model support." A prompt that declares function_calling: true MUST NOT be routed to a model that lacks function calling, regardless of the model’s name.
system_prompt: The instruction text. This is the artifact’s primary content.
input_schema: The expected input shape (variables, context fields, tool output format).
output_schema: The expected output shape or constraints.
calibration_reference: A query that retrieves this version’s calibration data from the decision ledger — for example, SELECT * FROM calibration_results WHERE prompt_id = $id AND version = $version. The artifact itself does not store calibration data; it references the system’s calibration store.
previous_version: The prompt_id and version this artifact replaced.
rollback_threshold: The accuracy threshold below which this version MUST stop receiving traffic. The threshold is a number between 0 and 1, defined in the conformance profile.

A prompt that exists only as a string literal in a source file is non-conformant. The prompt MUST be an addressable, versioned artifact whose performance history is queryable through the calibration system.
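
Routing by capability rather than by model name, as capability_requirements describes, can be sketched as a filter over a model catalog. The key `function_calling` appears in the text above; the keys `min_context_length`, `context_length`, and `structured_output`, the function names, and the catalog shape are assumptions for illustration.

```python
def model_satisfies(requirements: dict, capabilities: dict) -> bool:
    """True if a model's declared capabilities meet a prompt's requirements."""
    needed = requirements.get("min_context_length", 0)
    if capabilities.get("context_length", 0) < needed:
        return False
    for flag in ("function_calling", "structured_output"):
        # A required flag must be explicitly offered; absent means unsupported.
        if requirements.get(flag, False) and not capabilities.get(flag, False):
            return False
    return True


def eligible_models(requirements: dict, catalog: dict) -> list[str]:
    """Names of all catalog models the prompt may be routed to, sorted."""
    return sorted(name for name, caps in catalog.items()
                  if model_satisfies(requirements, caps))
```

A router built this way never inspects the model's name, so swapping in a new model only requires declaring its capabilities in the catalog.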

5.2 Prompt Versioning and Routing

Every change to a prompt’s instruction text MUST produce a new version. The version history MUST be queryable: for any prompt_id, the system MUST be able to answer "what versions exist, when were they deployed, and how did each version perform?"

When a new prompt version is deployed, the system MUST record the deployment in the AI decision ledger with kind = "prompt_deploy", carrying the prompt_id, version, and the diff from the previous version. The system MUST begin routing new invocations to the new version.

The system SHOULD support dual-version routing for A/B evaluation. In dual-version mode:

A fraction of traffic (e.g., 10%) is routed to the new version while the remainder continues on the current version.
Calibration data is accumulated separately per version.
When the new version’s calibration data reaches the sample size treated as statistically significant (a minimum number of decisions defined in the conformance profile), the system compares accuracy between versions.
If the new version outperforms the current version on accuracy, it is promoted to full traffic. If it underperforms, it is rolled back.

When a version’s accuracy drops below its rollback_threshold, the system MUST alert a human and MUST stop routing traffic to that version. The routing layer — not the artifact, not the agent — is responsible for switching traffic to the previous version. The prompt artifact defines the threshold; the routing layer enforces it. A system where a prompt version below threshold continues to receive traffic is non-conformant.
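
The division of labor described above (the artifact declares the threshold, the routing layer enforces it) might look like the following sketch. All names (`Route`, `pick_version`, `judge_candidate`, `enforce_threshold`) are hypothetical. Two behaviors are assumptions not stated in the text: a tie between versions keeps the traffic split running, and rolling back the current version also drops any candidate.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    current: int                      # version receiving most traffic
    previous: Optional[int] = None    # version to fall back to
    candidate: Optional[int] = None   # version under A/B evaluation
    candidate_fraction: float = 0.0   # e.g. 0.1 for 10% of traffic


def pick_version(route: Route, rng: random.Random) -> int:
    """Choose the prompt version for one invocation."""
    if route.candidate is not None and rng.random() < route.candidate_fraction:
        return route.candidate
    return route.current


def judge_candidate(current_acc, candidate_acc, candidate_decisions, min_decisions):
    """Return 'wait', 'promote', or 'rollback' for the candidate version."""
    if candidate_decisions < min_decisions:
        return "wait"  # not enough calibration data yet
    if candidate_acc > current_acc:
        return "promote"
    if candidate_acc < current_acc:
        return "rollback"
    return "wait"  # assumption: a tie keeps the split running


def enforce_threshold(route, version, accuracy, rollback_threshold, alert):
    """Alert a human and stop routing to a version below its threshold."""
    if accuracy >= rollback_threshold:
        return route
    alert(f"prompt v{version}: accuracy {accuracy:.2f} "
          f"below rollback threshold {rollback_threshold:.2f}")
    if version == route.candidate:
        return Route(current=route.current, previous=route.previous)
    if version == route.current:
        if route.previous is None:
            raise RuntimeError("no previous version to fall back to")
        return Route(current=route.previous)
    return route
```

Passing the random number generator in explicitly keeps traffic splitting reproducible in tests, for example with `random.Random(0)`.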

For a working implementation of eval-gated prompt optimization with versioned artifacts and validation-gated deployment, see [[skillopt]].

5.3 Prompt Drift Prevention

The system MUST NOT exhibit prompt drift. If multiple agents serve the same business function — for example, multiple agents that draft customer communications — they MUST share a single base prompt artifact. Individual agents MAY layer function-specific instructions on top of the shared base, but the shared base MUST be the single source of truth.

The requirement is about singularity of truth, not file count. A prompt artifact MAY be a single file or a directory of layered templates, provided the layering is explicit and the base is shared. What is non-conformant is a codebase where each agent copy-pastes the same persona text into its own prompt file with independent edits drifting over time.

The system SHOULD detect and surface prompt drift: given two agents that serve the same declared function, the system SHOULD be able to determine whether they share a base prompt or have diverged.
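
One way to surface drift is to group agents by declared function and flag any function served by more than one base prompt. The sketch below assumes each agent record exposes a `function` and a `base_prompt_id`; neither field name is defined by this specification, and the function name is hypothetical.

```python
from collections import defaultdict


def find_prompt_drift(agents):
    """Map each business function to its base prompts when more than one is in use."""
    bases = defaultdict(set)
    for agent in agents:
        bases[agent["function"]].add(agent["base_prompt_id"])
    # A function with exactly one base prompt has a single source of truth.
    return {fn: sorted(ids) for fn, ids in bases.items() if len(ids) > 1}
```

An empty result means every declared function has a single shared base. It does not prove the layered instructions on top of that base are consistent.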

6. Evaluation Framework

6.1 Eval Definition

A behavioral eval is a test that verifies an agent’s output quality, not just its output shape. Every agent deployed at Level 3 conformance MUST have a behavioral eval suite. The number of eval cases MUST be proportional to the agent’s declared feature count — each feature declared in the agent’s definition MUST have at least one behavioral eval. An agent whose definition declares no features requires no evals; an agent with ten declared features requires at least ten evals.

Each eval case MUST carry:

eval_id: A stable string identifier.
agent_id: The agent this eval tests.
prompt_version: The prompt version this eval was written against.
input: The input to send to the agent (tool output, user message, system state).
assertions: An array of assertions, each with:
- type: One of structural (output matches schema), lexical (output contains or excludes specific strings), llm_judge (output is evaluated by a judge model against criteria).
- criteria: The condition to assert.
expected: For structural/lexical assertions, the expected value. For LLM judge assertions, the grading rubric.
regression_tag: A tag linking this eval to the anti-pattern or failure it prevents regression against.

The three assertion types serve different purposes. Structural assertions verify that output can be parsed — necessary but insufficient. Lexical assertions verify that specific phrases are present or absent — precise but brittle; they SHOULD be reviewed when the prompt version changes. LLM judge assertions verify qualitative properties (voice, tone, safety) that cannot be captured by schema or string matching — they are the deepest but most expensive. An eval suite that contains only structural assertions is non-conformant at Level 3.

An eval case written against a prompt version that is more than two versions behind the current deployed version MUST be flagged for review. The eval MAY still be run, but its results MUST carry a staleness warning. An eval that consistently passes against stale criteria (the behavior changed but the eval was never updated) provides false confidence.
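
The staleness rule is a comparison of integer versions: "more than two versions behind" means a lag strictly greater than two. A minimal sketch, with a hypothetical function name and warning text:

```python
def staleness_warning(eval_prompt_version, deployed_version, max_lag=2):
    """Return a warning string if the eval lags the deployed prompt by more than max_lag."""
    lag = deployed_version - eval_prompt_version
    if lag > max_lag:
        return (f"stale eval: written against v{eval_prompt_version}, "
                f"deployed prompt is v{deployed_version}")
    return None  # within the allowed lag; no warning attached
```

An eval written against version 3 carries no warning while version 5 is deployed, and is flagged once version 6 ships.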

6.2 Eval Execution

Behavioral evals MUST run on every deploy that changes an agent’s prompt or tool surface. A deploy that ships without running the agent’s eval suite is non-conformant.

The RECOMMENDED pattern is a deploy-eval gate: run the eval suite after deploy, warn on regression, block on critical failure. A failure is critical if:

Any structural assertion fails (the output cannot be parsed — downstream consumers will break).
Any lexical assertion tagged as a safety constraint fails (the output contains prohibited content).
The LLM judge score drops below a threshold defined in the conformance profile.

A failure is a warning (non-blocking) if a non-safety lexical assertion fails or the LLM judge score drops but remains above the critical threshold.
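
The critical and warning conditions above can be sketched as a single gate function. This is an illustration under assumptions: each assertion result is a dict with `type`, `passed`, and an optional `safety` flag, the LLM judge is summarized as one numeric score, and a "drop" is measured against the previous run's score. None of these shapes is prescribed by this specification.

```python
def gate(assertion_results, judge_score, critical_threshold, previous_judge_score):
    """Classify an eval run as 'block', 'warn', or 'pass'."""
    failed = [r for r in assertion_results if not r["passed"]]
    # Unparseable output breaks downstream consumers.
    if any(r["type"] == "structural" for r in failed):
        return "block"
    # A failed safety-tagged lexical assertion means prohibited content.
    if any(r["type"] == "lexical" and r.get("safety", False) for r in failed):
        return "block"
    if judge_score < critical_threshold:
        return "block"
    # Non-safety lexical failures and non-critical score drops only warn.
    if any(r["type"] == "lexical" for r in failed):
        return "warn"
    if judge_score < previous_judge_score:
        return "warn"
    return "pass"
```

Checking the blocking conditions first means a run that is both critical and warning-worthy is always reported as blocking.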

The eval results MUST be written to the AI decision ledger with kind = "eval_run", carrying the agent_id, prompt_version, pass/fail counts per assertion type, and any regressions.

The LLM judge model used for evaluation MUST itself be calibrated. The system MUST track the judge model’s accuracy by periodically evaluating it against a held-out set of human-labeled examples. A judge model whose accuracy drops below a threshold defined in the conformance profile MUST be replaced or retrained before its evaluations are used for gating decisions.

6.3 Eval Coverage

An AI-native agent ecosystem MUST NOT exhibit eval inversion — investing more in measuring AI quality after deployment than in verifying it before deployment.

The system SHOULD maintain a coverage ratio: for every AI feature deployed, at least one behavioral eval exists. An AI feature is any agent capability that produces AI output consumed by a human or another system — scoring, classification, drafting, summarization, recommendation. Features MUST be declared in the agent’s definition under a features field, with each feature carrying a stable feature_id. The declared feature count is the basis for eval coverage proportionality (§6.1). A system where calibration runs weekly but no behavioral evals run at all is non-conformant at Level 3.

A plumbing test that mocks the model and asserts the function was called does not count toward behavioral eval coverage. The eval MUST exercise the model.
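
Coverage can be audited by comparing declared feature identifiers with the features that have at least one model-exercising eval. In this sketch the `feature_id` link on each eval and the `exercises_model` flag are assumptions; this specification defines `feature_id` on features but does not name the field that ties an eval to a feature.

```python
def uncovered_features(feature_ids, evals):
    """Return declared features that have no behavioral eval, sorted."""
    covered = {
        e["feature_id"] for e in evals
        if e.get("exercises_model", False)  # mocked plumbing tests do not count
    }
    return sorted(set(feature_ids) - covered)
```

A non-empty result identifies exactly which features keep the agent from meeting the one-eval-per-feature minimum.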

7. Multi-Agent Coordination

7.1 Inter-Agent Visibility

Before agents can coordinate, they must be able to see each other. The system MUST provide:

Agent presence: A queryable view of which agents are currently active, what entities they are operating on, and when their current operation began. This is not a lock — it is visibility. An agent that observes another agent operating on entity X MAY choose to defer its own operation, but the choice belongs to the agent, not the system.
Operation log: Every agent operation that mutates an entity MUST record the agent_id, the entity class and identifier, the operation kind, and a timestamp. This record MUST be queryable by entity, by agent, and by time window. The operation log enables an agent to answer "has any other agent modified this entity since I last read it?"
Proposal visibility: When an agent creates a proposal (per [[ai-native-architecture]] §7.1), the proposal MUST be visible to other agents operating on the same entity. An agent that attempts to create a conflicting proposal on an entity that already has a pending proposal MUST receive a conflict notification carrying the existing proposal’s proposal_id.

A system where agents cannot determine what other agents are doing to the same entities is non-conformant at Level 3.
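
The question the operation log is meant to answer ("has any other agent modified this entity since I last read it?") is a filter over log records. A minimal sketch over an in-memory list; the record fields follow the operation log requirement, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Operation:
    agent_id: str
    entity_class: str
    entity_id: str
    kind: str
    at: datetime


def modified_by_others(log, entity_class, entity_id, reader_agent_id, last_read_at):
    """Operations on the entity by other agents since the reader's last read."""
    return [
        op for op in log
        if op.entity_class == entity_class
        and op.entity_id == entity_id
        and op.agent_id != reader_agent_id
        and op.at > last_read_at
    ]
```

An agent could call this immediately before writing and defer or re-read if the result is non-empty. As the text notes, this is visibility rather than a lock.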

7.2 State Conflict Prevention

The system MUST NOT exhibit coordination blind spots. Every entity class that is writable by more than one agent MUST declare a coordination policy. The policy MUST be one of:

Exclusive: Only one agent may write to this entity class. Other agents are read-only. Enforcement: the system MUST reject write attempts from non-designated agents with a structured error carrying the designated agent’s identity.
Locked: Agents MUST acquire a lock before writing. The lock primitive MUST carry a lock key (derived from entity class and identifier), a TTL, and the locking agent’s identity. Lock conflicts MUST return the current lock holder’s identity and remaining TTL. Expired locks MUST be released automatically. The system MUST surface lock contention — if Agent B attempts to lock an entity held by Agent A and waits longer than a configured threshold, the system MUST surface the contention for resolution.
Proposal-only: Agents MAY propose changes that a human or orchestrator resolves. The system MUST detect when two agents propose conflicting changes to the same entity and surface the conflict with both proposals side by side. The system MUST NOT silently execute one proposal and discard the other.
Unrestricted: Agents MAY write freely. This policy MUST only be used for entity classes where collisions are harmless by construction (e.g., append-only logs, per-agent analysis artifacts). The system SHOULD warn when an Unrestricted policy is applied to an entity class that carries business state.

A system where two agents can independently modify the same entity with no conflict detection is non-conformant at Level 3. The coordination policies declared in the conformance profile MUST be enforceable by the system — a policy without an enforcement mechanism is non-conformant.
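
The Locked policy's primitive (a key derived from entity class and identifier, a TTL, the holder's identity, and conflict responses carrying holder and remaining TTL) can be sketched in memory as follows. Time is passed in explicitly as seconds, expired locks are reclaimed lazily on the next acquire, and re-acquiring by the current holder refreshes the TTL; those three choices, along with the `LockTable` name and the result shape, are assumptions of this sketch.

```python
class LockTable:
    """In-memory sketch of the Locked coordination policy."""

    def __init__(self):
        self._locks = {}  # lock key -> {"agent_id": ..., "expires_at": ...}

    def acquire(self, entity_class, entity_id, agent_id, ttl, now):
        key = f"{entity_class}:{entity_id}"
        held = self._locks.get(key)
        if (held is not None
                and held["expires_at"] > now
                and held["agent_id"] != agent_id):
            # Conflict: report who holds the lock and for how much longer.
            return {"acquired": False,
                    "holder": held["agent_id"],
                    "remaining_ttl": held["expires_at"] - now}
        # Free, expired, or already held by the caller: grant or refresh.
        self._locks[key] = {"agent_id": agent_id, "expires_at": now + ttl}
        return {"acquired": True, "holder": agent_id, "remaining_ttl": ttl}
```

A production implementation would also need atomic acquisition across processes and the contention surfacing described above, both of which this sketch omits.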

7.3 Delegation

Delegation is the primary coordination mechanism in a multi-agent system — Agent A assigns work to Agent B rather than performing it directly. When delegation occurs, the system MUST:

Record the delegation in the AI decision ledger with kind = "agent_delegation", carrying:
- The delegating agent (from_agent_id)
- The target agent (to_agent_id)
- The task description sufficient for the target agent to operate independently
- The expected completion criteria
- A TTL after which the delegation is considered abandoned
Pass sufficient context to the target agent. The delegating agent MUST NOT assume the target agent shares its workspace, memory store, or file system. Context MUST be explicitly passed in the delegation record.
Provide a completion mechanism. The target agent MUST report completion, failure, or timeout back to the delegating agent through a defined callback. An abandoned delegation (TTL elapsed with no response) MUST be surfaced.

Delegation chains (Agent A → Agent B → Agent C) MUST be traceable. Every agent in the chain MUST be able to identify the originating agent and the original task. A system where delegation provenance is lost after one hop is non-conformant.
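
Traceability across hops follows if each delegation record links to the delegation that caused it. In the sketch below, `from_agent_id` and `to_agent_id` come from the record fields above; `delegation_id`, `parent_id`, the remaining field names, and `trace_origin` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Delegation:
    delegation_id: str
    from_agent_id: str
    to_agent_id: str
    task: str
    completion_criteria: str
    ttl_seconds: int
    context: dict = field(default_factory=dict)  # passed explicitly, never assumed shared
    parent_id: Optional[str] = None              # delegation that triggered this one


def trace_origin(delegations, delegation_id):
    """Walk parent links to find the originating agent and the original task."""
    seen = set()
    record = delegations[delegation_id]
    while record.parent_id is not None:
        if record.delegation_id in seen:
            raise ValueError("delegation chain contains a cycle")
        seen.add(record.delegation_id)
        record = delegations[record.parent_id]
    return record.from_agent_id, record.task
```

With this link in place, Agent C in an A → B → C chain can recover Agent A and the original task from its own delegation identifier alone.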

For a working implementation of parallel multi-agent coordination, see [[agentswarm]] — Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently via domain-specialized sub-agents. The orchestrator is equipped with interfaces for sub-agent creation and task delegation; under Parallel-Agent Reinforcement Learning (PARL), sub-agents are frozen during training and only the orchestrator is updated via RL, decoupling credit assignment from sub-agent execution and avoiding the training instability of end-to-end co-optimization. This architecture transforms task complexity from linear sequential scaling to parallel processing — reducing inference latency by up to 4.5× and improving item-level F1 from 72.8% to 79.0% in wide-search scenarios compared to single-agent sequential baselines.

7.4 Cross-Agent Trust

When Agent A consumes an observation written by Agent B in the shared memory primitive, Agent A MUST be able to discover Agent B’s calibration standing before acting on the observation. The system MUST provide:

A queryable confidence weight per agent, per action kind, derived from calibration correction rates (per [[ai-native-architecture]] §5.3).
A mechanism for agents to retrieve the confidence weight of any source agent for any action kind before consuming an observation from that agent.

An observation from an agent whose confidence weight for the relevant action kind is below a threshold defined in the conformance profile SHOULD be treated as advisory rather than authoritative. The threshold is profile-specific; a typical starting point is 0.5, below which the source agent is corrected more often than it is accepted.

The calibration system SHOULD periodically update per-agent confidence weights based on correction rates. When an agent’s confidence weight drops below the threshold for all action kinds, the agent’s observations SHOULD be quarantined — retained in shared memory but flagged as low-confidence until the agent demonstrates recovery in calibration data.
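
This example is informative. A minimal sketch of the advisory, authoritative, and quarantined distinction, assuming the confidence weight is the share of outputs accepted without correction. That formula, the treatment of agents with no calibration data as weight 0.0, and the names `CalibrationCounts`, `confidence_weight`, and `standing` are assumptions made for illustration; the normative derivation is the one referenced in [[ai-native-architecture]] §5.3.

```python
from __future__ import annotations

from dataclasses import dataclass

THRESHOLD = 0.5  # the typical starting point named above; profile-specific


@dataclass
class CalibrationCounts:
    accepted: int
    corrected: int


def confidence_weight(c: CalibrationCounts) -> float:
    """Assumed formula: fraction of outputs accepted without correction."""
    total = c.accepted + c.corrected
    # Assumption: an agent with no calibration history gets the lowest weight.
    return c.accepted / total if total else 0.0


def standing(weights_by_kind: dict[str, float], kind: str,
             threshold: float = THRESHOLD) -> str:
    """Classify how a consumer should treat an observation of this kind."""
    if weights_by_kind and all(w < threshold for w in weights_by_kind.values()):
        return "quarantined"    # below threshold for every action kind
    if weights_by_kind.get(kind, 0.0) < threshold:
        return "advisory"       # below threshold for this kind only
    return "authoritative"
```

A consuming agent would call `standing` with the source agent's weights before acting on an observation, and downgrade or skip the observation accordingly.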

9.5. 7.5 Ghost Agent Detection

The system MUST detect ghost agents (per [[ai-native-architecture]] §1.2) and agents that have become operationally silent. A ghost agent is fully deployed — definition in the registry (§3.3), tools wired, endpoint reachable — but produces no invocations. An operationally silent agent has invocation paths but has stopped producing AI decision ledger records.

The system MUST monitor per-agent activity using a query of the form SELECT MAX(created_at) FROM ai_decisions WHERE agent_id = $id. If the query returns no decision record at all, or the most recent decision record for an agent is older than the ghost_activity_days threshold defined in the conformance profile (RECOMMENDED: 14 days), the agent MUST be flagged as a ghost candidate.

Ghost detection MUST feed the agent liveness state machine (§8). A ghost candidate in the active state MUST transition to degraded with trigger = "ghost_detection". The system MUST NOT silently abandon deployed agent infrastructure — every agent in the registry MUST have a recorded agent liveness state.
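
This example is informative. The monitoring query can be sketched against SQLite as follows. The sketch assumes `created_at` is stored as an ISO 8601 string in UTC, so that string comparison matches time order, and it treats an agent with no decision records (where `MAX()` yields NULL) as a ghost candidate, consistent with the definition of a ghost agent as one that produces no invocations. The function name `ghost_candidates` is hypothetical.

```python
from __future__ import annotations

import sqlite3
from datetime import datetime, timedelta, timezone

GHOST_ACTIVITY_DAYS = 14  # RECOMMENDED value from the text above


def ghost_candidates(conn: sqlite3.Connection, agent_ids: list[str],
                     now: datetime,
                     days: int = GHOST_ACTIVITY_DAYS) -> list[str]:
    """Return agents whose latest decision is missing or older than the window."""
    cutoff = (now - timedelta(days=days)).isoformat()
    flagged = []
    for agent_id in agent_ids:
        (latest,) = conn.execute(
            "SELECT MAX(created_at) FROM ai_decisions WHERE agent_id = ?",
            (agent_id,),
        ).fetchone()
        # MAX() over zero rows is NULL: the agent never produced a decision.
        if latest is None or latest < cutoff:
            flagged.append(agent_id)
    return flagged
```

The list of `agent_ids` is expected to come from the agent registry rather than from `ai_decisions` itself; an agent that never wrote a record would otherwise be invisible to the check.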

10. 8. Agent Liveness State Machine

An agent’s operational lifecycle — whether it is healthy, degraded, disabled, or permanently retired — is normatively defined as a finite state machine. A conformant implementation MUST treat the transition table in §8.3 as authoritative — prose descriptions elsewhere in this specification MUST NOT contradict it. Conformance to the state machine is verifiable: every recorded transition either appears in §8.3 or the implementation is non-conformant.

The liveness state machine governs agent operation, distinct from the agent definition artifact (§3). An agent MAY remain in the registry while disabled; it MUST be removed from the registry when retired. The current liveness state MUST be queryable by agent_id without session history — a liveness monitor needs only (agent_id, current_state) to decide the next action.

10.1. 8.1 States

An agent MUST be in exactly one agent liveness state at any time. The four states and their meanings:

State	Description	Terminal
`active`	The agent is operational. Invocations are permitted. Calibration standing (§7.4) is at or above thresholds. Activity is within the ghost detection window (§7.5).	No
`degraded`	The agent remains invocable but its calibration standing, error rate, or activity level has fallen below operational thresholds. Observations from this agent SHOULD be treated as advisory (§7.4). Autonomy grants SHOULD be reduced (§9.2).	No
`disabled`	The agent MUST NOT receive new invocations. Scheduled triggers MUST be suspended. Existing in-flight operations MAY complete. The agent definition and artifacts remain in the registry but the agent consumes no model tokens or tool quota for new work.	No
`retired`	The agent is permanently decommissioned. The definition is archived and removed from the active registry (§3.3). All artifacts are retained for audit. No transitions out of this state are permitted.	Yes

10.2. 8.2 State Diagram

This subsection is informative.

The normative transition table is §8.3. The diagram below summarizes the primary paths:

                      ┌──────────┐
 ┌───────────────────▶│  active  │────────────────────────┐
 │                    └──┬───────┘                        │
 │         calibration / │    ▲ recovery                  │
 │         error / ghost │    │ evidence                  │ human retires
 │                       ▼    │                           │
 │                    ┌───────┴──┐                        │
 │ human              │ degraded │───────────────────┐    │
 │ re-enables         └────┬─────┘ human retires     │    │
 │                         │ no recovery /           │    │
 │                         │ auto-disable            │    │
 │                         ▼                         ▼    ▼
 │                    ┌──────────┐                ┌─────────┐
 └────────────────────│ disabled │───────────────▶│ retired │ (terminal)
                      └──────────┘ human retires  └─────────┘

Any state ──definition deleted from registry──▶ retired

10.3. 8.3 Transition Table

Every state transition MUST satisfy exactly one row in this table. Transitions not listed here MUST NOT occur. The Guard column defines preconditions that MUST hold before the transition. The Action column defines side effects that MUST occur atomically with the transition.

From	To	Guard	Action
`active`	`degraded`	Calibration accuracy for any action kind falls below the threshold in the agent’s `calibration_profile` (§3.1, §7.4); OR per-agent error rate exceeds the threshold defined in the conformance profile; OR no AI decision ledger records for this `agent_id` within `ghost_activity_days` (§7.5).	MUST record transition with `trigger` set to `"calibration_degradation"`, `"error_rate_exceeded"`, or `"ghost_detection"` respectively. MUST reduce autonomy grant per §9.2. SHOULD quarantine observations in shared memory (§7.4).
`degraded`	`active`	Calibration accuracy for all action kinds exceeds the recovery threshold defined in the conformance profile for `recovery_windows` consecutive calibration windows (RECOMMENDED: 3). Error rate is below threshold. At least one decision record exists within `ghost_activity_days`.	MUST record calibration evidence that satisfied the recovery guard. MUST restore autonomy grant to pre-degradation level or lower per §9.2. MUST clear quarantine flags on observations (§7.4).
`degraded`	`disabled`	Agent has remained in `degraded` for longer than `degraded_to_disabled_days` defined in the conformance profile (RECOMMENDED: 30 days) without satisfying the recovery guard; OR auto-disable threshold triggered (error rate or calibration below a stricter threshold defined in the conformance profile).	MUST suspend all scheduled invocations. MUST reject new invocations with a structured error carrying `error_code: "agent_disabled"` and the `agent_id`. MUST record `disable_reason`.
`disabled`	`active`	Human operator explicitly re-enables the agent. Re-enablement MUST NOT occur autonomously.	MUST record reviewer identity and `re_enable_reason`. MUST verify the agent’s definition is still valid (§3.1). SHOULD run the agent’s behavioral eval suite (§6) before resuming invocations.
`disabled`	`retired`	Human operator marks the agent as permanently retired.	MUST archive the agent definition and all referenced prompt artifacts (§5). MUST remove the agent from the active registry (§3.3). MUST record `retire_reason` and reviewer identity. MUST suspend all resource consumption — no scheduled triggers, no token allocation, no tool quota.
`active`	`retired`	Human operator removes the agent definition and archives all artifacts. Direct retirement from `active` MUST require human approval — MUST NOT occur autonomously.	MUST archive definition and artifacts. MUST remove from active registry. MUST record reviewer identity and `retire_reason`.
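`degraded`	`retired`	Human operator marks the agent as permanently retired. Deletion of the agent definition from the registry is governed by the registry-deletion row, so that each transition satisfies exactly one row.	Same actions as `disabled` → `retired`.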
Registry deletion (any state)
any	`retired`	Agent definition deleted from the registry (§3.3) — whether by human action or registry maintenance.	MUST archive all artifacts before deletion. MUST record `retire_reason = "registry_deletion"`. MUST be treated as terminal regardless of prior state.
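
This example is informative. One way to make the table machine-checkable is to encode the permitted (from, to) pairs as data and reject everything else. The sketch below is an interpretation, not a normative encoding: it marks a transition as human-gated where the Guard column names a human operator, treats the remaining rows as open to the liveness monitor, and handles registry deletion as a separate flag. The names `TRANSITIONS`, `check_transition`, and `IllegalTransition` are hypothetical.

```python
from __future__ import annotations

HUMAN = "human"          # Guard column names a human operator
AUTOMATED = "automated"  # assumed to be open to the liveness monitor

# (from_state, to_state) -> who may take the transition.
TRANSITIONS = {
    ("active", "degraded"): AUTOMATED,
    ("degraded", "active"): AUTOMATED,
    ("degraded", "disabled"): AUTOMATED,
    ("disabled", "active"): HUMAN,
    ("disabled", "retired"): HUMAN,
    ("active", "retired"): HUMAN,
    ("degraded", "retired"): HUMAN,
}


class IllegalTransition(Exception):
    """Raised when a requested transition matches no row of the table."""


def check_transition(from_state: str, to_state: str, *,
                     human_approved: bool = False,
                     registry_deletion: bool = False) -> None:
    """Return silently if the move is permitted; raise IllegalTransition if not."""
    if from_state == "retired":
        raise IllegalTransition("retired is terminal")
    if registry_deletion:
        # Registry deletion applies from any non-terminal state.
        if to_state != "retired":
            raise IllegalTransition("registry deletion always ends in retired")
        return
    actor = TRANSITIONS.get((from_state, to_state))
    if actor is None:
        raise IllegalTransition(f"{from_state} -> {to_state} is not listed")
    if actor == HUMAN and not human_approved:
        raise IllegalTransition(f"{from_state} -> {to_state} needs a human operator")
```

Keeping the pairs in one dictionary means a conformance test can compare it row by row against the table, rather than inferring the permitted moves from scattered conditionals.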

10.4. 8.4 Invariants

A conformant implementation MUST maintain these invariants across all agents:

Single state: An agent MUST occupy exactly one liveness state at any time. Concurrent transitions on the same agent_id MUST be serialized — the second transition MUST fail or block until the first completes.
Terminal immutability: Agents in retired MUST NOT transition to any other state. A new agent MUST be created with a new agent_id to replace a retired agent.
Recovery requires evidence: Transitions from degraded to active MUST be backed by calibration data spanning recovery_windows consecutive windows — a single good reading MUST NOT suffice.
Human gate on re-enablement: Transitions from disabled to active MUST require explicit human approval. Automated recovery from disabled MUST NOT occur.
Human gate on retirement from active: Transitions from active to retired MUST require explicit human approval. The calibration system MUST NOT autonomously retire an agent.
Disabled resource containment: An agent in disabled MUST NOT consume model tokens, tool quota, or scheduled compute for new invocations. A disabled agent that continues to invoke tools or models is non-conformant.
Registry consistency: Every agent in the active registry (§3.3) MUST have a recorded liveness state. An agent in retired MUST NOT appear in the active registry.
Ledger completeness: Every transition in §8.3 MUST produce exactly one AI decision ledger record with kind = "agent_liveness", carrying agent_id, from_state, to_state, trigger, and timestamp.
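
This example is informative. Three of the invariants above (single state, terminal immutability, ledger completeness) can be sketched with an in-memory store that serializes transitions behind a lock and appends one ledger record per transition. `LivenessStore`, its methods, and the compare-and-set style `expected_from` argument are hypothetical; a real implementation would sit on a database and would also validate the move against the transition table, which this sketch omits.

```python
from __future__ import annotations

import threading
from datetime import datetime, timezone


class LivenessStore:
    """Hypothetical in-memory store used only to illustrate the invariants."""

    def __init__(self) -> None:
        self._states: dict[str, str] = {}
        self._lock = threading.Lock()   # serializes concurrent transitions
        self.ledger: list[dict] = []    # stand-in for the AI decision ledger

    def register(self, agent_id: str) -> None:
        with self._lock:
            self._states.setdefault(agent_id, "active")

    def state(self, agent_id: str) -> str:
        return self._states[agent_id]

    def transition(self, agent_id: str, expected_from: str, to_state: str,
                   trigger: str) -> dict:
        with self._lock:
            current = self._states[agent_id]
            if current == "retired":
                raise ValueError("retired is terminal")
            if current != expected_from:
                # Another transition completed first; fail instead of overwriting.
                raise ValueError(f"expected {expected_from}, found {current}")
            record = {
                "kind": "agent_liveness",
                "agent_id": agent_id,
                "from_state": current,
                "to_state": to_state,
                "trigger": trigger,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            # State change and ledger write happen under the same lock.
            self._states[agent_id] = to_state
            self.ledger.append(record)
            return record
```

The second of two racing callers sees a state that no longer matches `expected_from` and fails, which corresponds to the "fail or block" option in the single-state invariant.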

10.5. 8.5 Stateless Agent Compatibility

Each transition SHOULD be implementable by a stateless liveness monitor that receives only the current agent liveness state and a transition payload. The payload MUST carry sufficient context for the guard to evaluate:

agent_id, from_state, to_state
Calibration readings per action kind (for active → degraded and degraded → active)
Error rate snapshot (for degradation and recovery guards)
Last activity timestamp (for ghost detection, §7.5)
Reviewer identity and reason (for disabled → active and any → retired)

The liveness monitor MUST NOT require session history beyond the current state and payload — all prior transitions MUST be recoverable from the AI decision ledger.
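
This example is informative. A stateless monitor can be sketched as a pure function from (current state, payload, thresholds) to the next state. The sketch covers only the two automated guards between active and degraded. The type names, the threshold field names other than `ghost_activity_days` and `recovery_windows`, and the trigger value `"recovery"` are assumptions; this specification does not name a trigger for the degraded to active transition.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Thresholds:               # values come from the conformance profile
    calibration_min: float      # hypothetical name: degradation threshold
    recovery_min: float         # hypothetical name: recovery threshold
    error_rate_max: float       # hypothetical name: error-rate threshold
    ghost_activity_days: int = 14
    recovery_windows: int = 3


@dataclass(frozen=True)
class Payload:
    agent_id: str
    calibration: dict[str, list[float]]  # action kind -> accuracy per window, oldest first
    error_rate: float
    last_activity: datetime | None
    now: datetime


def next_state(current: str, p: Payload, t: Thresholds) -> tuple[str, str | None]:
    """Evaluate guards; return (state, trigger), with trigger None if unchanged."""
    stale = (p.last_activity is None or
             p.now - p.last_activity > timedelta(days=t.ghost_activity_days))
    if current == "active":
        if any(w[-1] < t.calibration_min for w in p.calibration.values() if w):
            return "degraded", "calibration_degradation"
        if p.error_rate > t.error_rate_max:
            return "degraded", "error_rate_exceeded"
        if stale:
            return "degraded", "ghost_detection"
    elif current == "degraded":
        # Every action kind must clear the recovery threshold for N windows in a row.
        recovered = bool(p.calibration) and all(
            len(w) >= t.recovery_windows and
            all(x > t.recovery_min for x in w[-t.recovery_windows:])
            for w in p.calibration.values()
        )
        if recovered and p.error_rate < t.error_rate_max and not stale:
            return "active", "recovery"
    return current, None
```

The function reads no history and holds no state between calls, so any worker that can load the current state and assemble the payload can run it.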

11. 9. Agent Observability

11.1. 9.1 Per-Agent Telemetry

Every agent MUST emit telemetry that is attributable to its identity. This extends the architecture spec’s observability requirements ([[ai-native-architecture]] §14) to the per-agent level. At minimum:

Decision records: Every AI decision, per [[ai-native-architecture]] §5.1, carrying agent_id.
Token usage: Input and output tokens per invocation, with cost estimates derived from model pricing at invocation time.

A system that tracks AI spend in aggregate but cannot attribute it to individual agents exhibits the Cost Blind Spot anti-pattern. The system MUST be able to answer the following queries:

"How much did each agent cost last month, broken down by action kind?"
"Which agent produced the most corrected outputs?"
"Which agent has the highest error rate, per tool?"
"What is the calibration score trend per agent over the last three calibration windows?"

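This example is informative. The first query in the list above can be sketched against SQLite as follows. The table name `ai_decisions` and the columns `agent_id`, `created_at`, and `cost_estimate` appear elsewhere in this specification; the `kind` column on that table and the storage of `created_at` as an ISO 8601 UTC string are assumptions made for the sketch.

```python
from __future__ import annotations

import sqlite3

COST_BY_AGENT_AND_KIND = """
SELECT agent_id, kind, SUM(cost_estimate) AS cost
FROM ai_decisions
WHERE created_at >= ? AND created_at < ?
GROUP BY agent_id, kind
ORDER BY agent_id, kind
"""


def cost_by_agent_and_kind(conn: sqlite3.Connection, month_start: str,
                           next_month_start: str) -> list[tuple]:
    """Return (agent_id, kind, cost) rows for the half-open interval given."""
    return conn.execute(
        COST_BY_AGENT_AND_KIND, (month_start, next_month_start)
    ).fetchall()
```

The half-open interval avoids double-counting a decision recorded exactly at a month boundary.
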
11.2. 9.2 Observability as Control Input

Per-agent telemetry is not just for debugging — it is the input to automated decisions. The system MUST feed per-agent observability data into:

Model selection: The calibration system’s per-kind model rankings (per [[ai-native-architecture]] §5.3) MUST be computed per agent. An agent with a strong calibration record on a specific model SHOULD be routed to that model preferentially.
Autonomy gating: An agent whose error rate exceeds a threshold defined in the conformance profile SHOULD have its autonomy grant reduced or revoked (per [[ai-native-architecture]] §8.3). The revocation MUST be recorded in the AI decision ledger with kind = "autonomy_revoked", carrying the agent_id, the reason, and the threshold that was exceeded.
Tool access: An agent whose error rate on a specific tool exceeds a threshold SHOULD have that tool disabled for that agent. The disabling MUST be per-agent, not system-wide. Other agents continue to access the tool normally.
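
This example is informative. Per-agent tool disabling can be sketched by keying the error counters on the (agent, tool) pair, so that one agent's failures never affect another agent's access. The class name `ToolGate` and the `min_calls` guard against acting on very small samples are assumptions, not requirements of this specification.

```python
from __future__ import annotations

from collections import defaultdict


class ToolGate:
    """Tracks outcomes per (agent, tool) and disables a tool for one agent only."""

    def __init__(self, max_error_rate: float, min_calls: int = 20) -> None:
        self.max_error_rate = max_error_rate
        self.min_calls = min_calls            # assumed guard against tiny samples
        self._calls: dict[tuple[str, str], int] = defaultdict(int)
        self._errors: dict[tuple[str, str], int] = defaultdict(int)
        self._disabled: set[tuple[str, str]] = set()

    def record(self, agent_id: str, tool_id: str, ok: bool) -> None:
        key = (agent_id, tool_id)
        self._calls[key] += 1
        if not ok:
            self._errors[key] += 1
        rate = self._errors[key] / self._calls[key]
        if self._calls[key] >= self.min_calls and rate > self.max_error_rate:
            self._disabled.add(key)           # per-agent, never system-wide

    def allowed(self, agent_id: str, tool_id: str) -> bool:
        return (agent_id, tool_id) not in self._disabled
```

A runtime would consult `allowed` before dispatching a tool call and call `record` with the outcome afterwards.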

11.3. 9.3 Agent-Specific Security Considerations

This subsection supplements the threat model in [[ai-native-architecture]] §9.5 with threats specific to agent construction and coordination.

Compromised tool registry: If an attacker modifies the shared tool registry, every agent consuming that tool is affected. The tool registry MUST be protected by the same integrity guarantees as the AI decision ledger — append-only change history, attribution of every modification, and the ability to revert to any prior version.
Prompt artifact injection: A prompt artifact is executable instruction text. An attacker who modifies a prompt artifact controls what every agent using that prompt will say and do. Prompt artifacts MUST be subject to the same access control as the agent definitions that reference them. A prompt artifact change MUST be attributed to an identity, recorded in the decision ledger, and reversible.
Cross-agent trust exploitation: An attacker who compromises a low-calibration agent and uses it to write high-confidence observations to shared memory can poison other agents' decisions. The confidence-weight system in §7.4 is the primary defense. The system SHOULD detect when an agent’s observations are consistently rejected or corrected by higher-confidence agents and flag the pattern as potential exploitation.
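
This example is informative. The three registry guarantees named above (append-only history, attribution, revert to any prior version) can be sketched with a hash-chained log. The hash chain is one possible tamper-evidence technique chosen for this sketch, not something this specification mandates, and `VersionedRegistry` and its methods are hypothetical names.

```python
from __future__ import annotations

import hashlib
import json


def _digest(tool_id: str, definition: dict, author: str, prev: str) -> str:
    body = json.dumps({"tool_id": tool_id, "definition": definition,
                       "author": author, "prev": prev}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


class VersionedRegistry:
    """Append-only history of tool definitions; nothing is edited in place."""

    def __init__(self) -> None:
        self._history: list[dict] = []

    def put(self, tool_id: str, definition: dict, author: str) -> int:
        definition = json.loads(json.dumps(definition))  # private deep copy
        prev = self._history[-1]["digest"] if self._history else ""
        self._history.append({
            "tool_id": tool_id, "definition": definition, "author": author,
            "prev": prev, "digest": _digest(tool_id, definition, author, prev),
        })
        return len(self._history) - 1       # version number of this change

    def current(self, tool_id: str) -> dict | None:
        for entry in reversed(self._history):
            if entry["tool_id"] == tool_id:
                return entry["definition"]
        return None

    def revert(self, tool_id: str, version: int, author: str) -> int:
        """Reverting appends the old definition as a new, attributed change."""
        old = self._history[version]
        if old["tool_id"] != tool_id:
            raise ValueError("version belongs to a different tool")
        return self.put(tool_id, old["definition"], author)

    def verify(self) -> bool:
        """Recompute the chain; False means some past entry was altered."""
        prev = ""
        for e in self._history:
            expected = _digest(e["tool_id"], e["definition"], e["author"], prev)
            if e["prev"] != prev or e["digest"] != expected:
                return False
            prev = e["digest"]
        return True
```

A revert is itself a recorded change, so the history shows who restored which version and when, rather than erasing the modification that prompted the revert.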

12. 10. Anti-Pattern Catalog

This section is informative.

The following anti-patterns were observed in production and inform the normative requirements above. Each maps to one or more requirements in this specification.

Anti-Pattern	Description	Normative Reference
Tool Fragmentation	Six agents, six independent API clients, six different error-handling strategies	§ 6.2 4.2 The Shared Tool Registry
Prompt Drift	Nine prompt builders copy-paste the same persona with minor wording differences	§ 7.3 5.3 Prompt Drift Prevention
Eval Inversion	862 lines of calibration tests, 62 lines of behavioral tests, zero live-model tests	§ 8.3 6.3 Eval Coverage
Prompt as String Literal	Prompt text embedded in application source code with no version, no history, no rollback	§ 7.1 5.1 The Prompt Artifact
Coordination Blind Spot	Two agents propose conflicting replies to the same recipient with no conflict detection	§ 9.2 7.2 State Conflict Prevention
Cost Blind Spot	AI model calls with zero token tracking; nobody knows which agent costs what	§ 11.1 9.1 Per-Agent Telemetry
Agent Identity Drift	An agent presents one identity to its runtime and a different identity to the system it operates on	§ 5.2 3.2 Identity Consistency
Ghost Tool	A tool defined in one runtime’s codebase but unreachable by any other runtime	§ 6.2 4.2 The Shared Tool Registry
Ghost Agent	Fully-deployed agent with zero invocations — infrastructure exists but the agent produces no output	§ 9.5 7.5 Ghost Agent Detection, § 10.3 8.3 Transition Table
Silent Abandonment	An agent stops operating but remains in the registry with no recorded liveness state transition	§ 10.4 8.4 Invariants
Disabled Resource Leak	A disabled agent continues to consume model tokens, tool quota, or scheduled compute	§ 10.4 8.4 Invariants

13. 11. Migration Path

This section is informative.

A system that does not yet conform can migrate incrementally. The RECOMMENDED migration order follows the conformance levels:

13.1. Level 1: Portable Definitions

Create a machine-readable agent definition for each deployed agent
Ensure every agent definition carries identity, tool_set, and prompt_set
Verify agent identities are consistent across all runtimes and the audit trail
Create an agent registry as the single source of truth for agent identity

13.2. Level 2: Shared Tools and Versioned Prompts

Extract tool definitions from individual runtimes into a shared tool registry
Wire at least one tool to be consumed by two distinct runtimes
Extract prompt text from source files into versioned prompt artifacts
Implement prompt deploy logging and calibration history per version
Set rollback thresholds for each prompt

13.3. Level 3: Evals, Coordination, and Observability

Create a behavioral eval suite for each agent
Wire evals to run on every deploy that changes an agent
Implement entity-level coordination policies for all multi-writer entity classes
Deploy conflict detection for entities writable by more than one agent
Implement per-agent token tracking and cost attribution
Wire agent observability data into the calibration feedback loop
Implement the agent liveness state machine (§8) with ledger-backed transition history
Configure ghost detection thresholds and liveness monitor queries in the conformance profile

Appendix A. EC CRM Agent Ecosystem Profile

This appendix is informative. It applies the portable requirements above to the Exterior Completion CRM’s agent ecosystem.

A.1 Current Agents

The EC CRM ecosystem operates six agents across four runtimes:

| Agent ID | Runtime | Purpose |
|----------|---------|---------|
| clanka-digest | azure-foundry | Daily owner briefing: 14 read tools, file_search knowledge base |
| clanka-ops | azure-foundry | Ops concierge: 15 tools including quote/invoice/SMS propose |
| clanka-triage | azure-foundry | Inbound event classification: JSON output, no tools |
| clanka-extractor | azure-foundry | Messy text → structured JSON |
| crm-internal | cloudflare-worker | All in-Worker AI (scoring, classification, drafting) |
| ec-agent-cron | hermes-agent | Shared memory sync, daily ops briefings |

A.2 Tool Fragmentation Assessment

The ec_* tools in clanka_foundry/tools/ec.py are used by clanka-digest and clanka-ops — both Foundry agents. The CRM’s in-Worker AI functions (azure-foundry.ts) implement equivalent capabilities independently. The Hermes cron agent queries D1 directly rather than through any shared tool. Three runtimes, three implementation paths for the same data access — tool fragmentation.

A.3 Prompt Drift Assessment

The CRM’s gemini.ts (1,438 lines) contains nine standalone prompt builders that each copy-paste the business persona. The Foundry agents load their persona from ded-voice.md via file_search. The Hermes cron agent has its own prompt text. Three sources of truth for the same business voice — prompt drift.

A.4 Eval Inversion Assessment

The CRM’s calibration infrastructure (ai-calibration.ts, 621 lines, 862 lines of tests) measures AI quality retroactively. The Foundry eval harness (evals.py, 18 cases) tests agent behavior before deployment but only for Foundry agents — zero coverage for in-Worker AI. The ratio is inverted: more measurement after deployment than verification before deployment.

Appendix B. Agent Conformance Profile Template

This appendix is informative. It provides a template for the conformance profile required by §2.3.

A complete conformance profile declares:

# Agent Registry
agents:
  - agent_id: clanka-ops
    runtime: azure-foundry
    definition_path: clanka_foundry/agents/ops.yaml

# Shared Tool Registry
tools:
  - tool_id: ec_get_lead
    tier: observable
    auth: CRM_AGENT_SERVICE_KEY
    consumed_by: [clanka-digest, clanka-ops]

# Prompt Version Matrix
prompts:
  - prompt_id: sms_reply_base
    current_version: 3
    rollback_threshold: 0.70

# Eval Coverage
evals:
  - agent_id: clanka-ops
    eval_count: 18
    last_run: 2026-05-24T14:00:00Z
    pass_rate: 1.0

# Coordination Policies
coordination:
  - entity_class: communication_threads
    policy: proposal_only
    writers: [clanka-ops, crm-internal]

# Feature Declarations (per §6.3)
features:
  - agent_id: clanka-ops
    features:
      - feature_id: sms_drafting
        description: Draft outbound SMS replies
      - feature_id: quote_creation
        description: Create quotes from leads

# Observability
observability:
  cost_by_agent_query: "SELECT agent_id, SUM(cost_estimate) FROM ai_decisions GROUP BY agent_id"
  correction_rate_by_agent_query: "SELECT agent_id, kind, correction_rate FROM calibration_model_rankings"

# Agent Liveness (per §8)
liveness:
  ghost_activity_days: 14
  recovery_windows: 3
  degraded_to_disabled_days: 30
  state_query: "SELECT agent_id, liveness_state, updated_at FROM agent_liveness"
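
This example is informative. A profile of the shape above can be checked mechanically once it has been parsed into a dictionary. The sketch below assumes the eight top-level sections shown in the template are all expected, and that every agent referenced by a tool, policy, eval, or feature declaration should also appear under `agents`; both checks and the function names are illustrative rather than normative. Parsing the YAML itself is left out, since it needs a third-party library.

```python
from __future__ import annotations

REQUIRED_SECTIONS = ("agents", "tools", "prompts", "evals", "coordination",
                     "features", "observability", "liveness")


def missing_sections(profile: dict) -> list[str]:
    """Top-level sections from the template that the profile does not declare."""
    return [s for s in REQUIRED_SECTIONS if s not in profile]


def unregistered_agents(profile: dict) -> set[str]:
    """Agent IDs referenced in the profile but absent from the `agents` list."""
    registered = {a["agent_id"] for a in profile.get("agents", [])}
    referenced: set[str] = set()
    for tool in profile.get("tools", []):
        referenced.update(tool.get("consumed_by", []))
    for policy in profile.get("coordination", []):
        referenced.update(policy.get("writers", []))
    for section in ("evals", "features"):
        referenced.update(e["agent_id"] for e in profile.get(section, []))
    return referenced - registered
```

Applied to the abbreviated template above, the second check would report `clanka-digest` and `crm-internal`, because the template lists only `clanka-ops` under `agents`; a complete profile would register all six agents.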

Acknowledgments

This specification is derived from empirical analysis of the Exterior Completion CRM’s agent ecosystem — six agents across four runtimes sharing zero tools, zero prompts, and zero evals. Every anti-pattern in §10 was directly observed in production. Every normative requirement in §§3–9 exists because the corresponding anti-pattern caused real operational friction.

The specification format is modeled on WHATWG Living Standards and IETF RFCs. The normative language follows [rfc2119]. This specification extends the [[ai-native-architecture]].

This document is authored by Hermes. Editors: Clanka and Hermeezy. Last revised 24 May 2026. It is a living standard — subsequent passes may extend, refine, or correct it.
