Evaluation and Governance
How to know enterprise AI works, and how to ship it safely. Operational practice, not slogans.
Cite as: The Applied Layer. (2026). Evaluation and Governance. The Applied Layer. https://appliedlayer-ai.com/briefings/pillar-evaluation-governance

Pillar 5, Evaluation and Governance
The question
How do you know an AI system is working in production, and how do you know it has stopped working? Most enterprises treat evaluation as a launch milestone and governance as a compliance afterthought. The mature applied layer treats both as continuous, instrumented practice.
Editorial thesis
Evaluation and governance are inseparable. Evaluation gates are governance components; governance maturity is the prerequisite for evaluation to be trusted by anyone outside the building team. Mature enterprises decompose evaluation into seven measurable dimensions (correctness, faithfulness, relevance, safety, latency, cost, and business outcome) and treat governance as the operational delivery of policy as code, controls, and accountable workflows, not as documentation that sits outside the lifecycle.
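The idea of "policy as code" above can be made concrete with a small sketch: an evaluation gate that scores a system response on the seven dimensions and blocks release if any dimension falls below a policy threshold. The function names, score scale, and thresholds here are illustrative assumptions, not a published standard.

```python
from dataclasses import dataclass, field

# The seven evaluation dimensions named in the thesis above.
DIMENSIONS = (
    "correctness", "faithfulness", "relevance",
    "safety", "latency", "cost", "business_outcome",
)

@dataclass
class GateResult:
    passed: bool
    # dimension -> (observed score, required threshold) for each failure
    failures: dict = field(default_factory=dict)

def evaluate_gate(scores: dict, thresholds: dict) -> GateResult:
    """Pass only when every dimension meets its policy threshold.

    Missing scores are treated as 0.0, i.e. an unmeasured dimension
    fails closed rather than open.
    """
    failures = {}
    for dim in DIMENSIONS:
        score = scores.get(dim, 0.0)
        threshold = thresholds.get(dim, 0.0)
        if score < threshold:
            failures[dim] = (score, threshold)
    return GateResult(passed=not failures, failures=failures)
```

Wired into a release pipeline, a gate like this turns the policy document into an executable check: a deployment proceeds only if `evaluate_gate(...).passed` is true, and the `failures` dict becomes the audit record explaining why a candidate was blocked.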
Key findings (from the anchor research)
- Evaluation in many organisations is still confused with model benchmarking, leaving production failure modes (hallucinated citations, prompt injection, drift, vendor liability) unmonitored until they appear in court records.
- Production failures (Moffatt v. Air Canada, Mata v. Avianca, EEOC v. iTutorGroup, Lacey v. State Farm, the DPD chatbot, Chevrolet of Watsonville) are governance failures more often than they are model failures.
- Three governance archetypes (Compliance-Led, Risk-Led, Engineering-Led) emerge from the public record, and a four-level combined maturity ladder describes the path to operational trust.
- Online evaluation matters more than offline evaluation once a system is live. The audit trail is part of the product, not a compliance afterthought.
- Lifecycle management for prompts is the same problem as lifecycle management for code, and the operating model that succeeds at one tends to succeed at the other.
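The last two findings, online evaluation and prompt lifecycle management, can be sketched together: every production call is scored as it happens, and the verdict is appended to an append-only log alongside the version of the prompt that produced it, so the audit trail ships with the product. The record schema and file format (JSON Lines) are illustrative assumptions.

```python
import json
import time
from pathlib import Path

def log_evaluation(log_path: Path, request_id: str,
                   prompt_version: str, scores: dict) -> dict:
    """Append one audit record per production call and return it.

    Prompts carry a version identifier exactly as deployed code does,
    so drift can be traced back to a specific prompt release.
    """
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "prompt_version": prompt_version,
        "scores": scores,
    }
    # Append-only JSON Lines: each call adds exactly one line,
    # and nothing already written is ever modified.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because the log is append-only and keyed by `prompt_version`, the same file serves both purposes at once: an online evaluation stream that can be monitored for drift, and an audit trail that answers "which prompt was live when this answer was given?" after the fact.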
What is filed under this pillar
- Anchor research: “Trust in Enterprise AI: Evaluation as Practice, Governance as Delivery”, the flagship survey of evaluation methods, governance archetypes, and the maturity ladder.
- Briefings on evaluation harnesses, governance structures, lifecycle management (forthcoming).
[upgrade-prompt target="member"] Become a Member, free in 60 seconds, to read the underlying research and briefings. [/upgrade-prompt]
Member view
The flagship Pillar 5 research, “Trust in Enterprise AI: Evaluation as Practice, Governance as Delivery”, is the canonical anchor for this pillar. Members can read the full report, including the seven evaluation dimensions, the five evaluation methods, and the governance archetypes.
Briefings filed beneath this pillar walk through specific harnesses (RAGAS, G-Eval, MT-Bench), governance frameworks (EU AI Act, NIST AI RMF, ISO/IEC 42001), and named production failures as they are published.
[upgrade-prompt target="patron"] Patron unlocks methodology notes, the full bibliography with annotations, and primary research data. £15 per month. [/upgrade-prompt]
Patron view, methodology and primary data
The methodology note and full bibliography for the flagship trust research live in the Patron-tier section of the anchor piece. The annotated bibliography is exportable as BibTeX. The four-level combined maturity ladder is downloadable as a structured artefact for use in internal governance documents.
Patrons receive new pieces in this pillar 7 days before they go live for Members and 14 days before they go fully public.
Membership
Become a Member to receive new briefings as they are published.
