Evaluation & Governance, Pillar 5 of 5

Key findings

Evaluation in many organizations is still confused with model benchmarking, leaving production failure modes (hallucinated citations, prompt injection, drift, vendor liability) unmonitored until they appear in court records.
Production failures (Moffatt v. Air Canada, Mata v. Avianca, EEOC v. iTutorGroup, Lacey v. State Farm, the DPD chatbot, Chevrolet of Watsonville) are governance failures more often than they are model failures.
Three governance archetypes (Compliance-Led, Risk-Led, Engineering-Led) emerge from the public record, and a four-level combined maturity ladder describes the path to operational trust.
Online evaluation matters more than offline evaluation once a system is live. The audit trail is part of the product, not a compliance afterthought.
Lifecycle management for prompts is the same problem as lifecycle management for code, and the operating model that succeeds at one tends to succeed at the other.
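
One way to read the last finding is that a prompt can be handled like a source file: identified by its content, published through a controlled step, and rolled back when it misbehaves. The sketch below illustrates that idea only. `PromptVersion`, `PromptRegistry`, and their methods are hypothetical names invented for this example, not part of the report or of any library, and a real system would persist history and record approvals.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical record for one published prompt (illustrative only)."""
    name: str
    template: str
    author: str

    @property
    def digest(self) -> str:
        # Content hash: identical template text always yields the same identifier,
        # much as a commit hash identifies a code change.
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]


class PromptRegistry:
    """In-memory history of prompt versions, keyed by prompt name."""

    def __init__(self) -> None:
        self._history = {}

    def publish(self, version: PromptVersion) -> str:
        history = self._history.setdefault(version.name, [])
        # Skip the append when the text is unchanged from the latest version.
        if not history or history[-1].digest != version.digest:
            history.append(version)
        return version.digest

    def current(self, name: str) -> PromptVersion:
        return self._history[name][-1]

    def rollback(self, name: str) -> PromptVersion:
        history = self._history[name]
        if len(history) < 2:
            raise ValueError("no earlier version to roll back to")
        history.pop()
        return history[-1]
```

Logging the digest alongside each production response would let an audit trail tie an output back to the exact prompt text that produced it.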

Anchor research

Trust in Enterprise AI: Evaluation as Practice, Governance as Delivery

Two disciplines determine whether enterprise AI earns operational trust: evaluation, the practice of measuring whether a system actually works in production; and governance, the delivery of policy as code, controls, and accountable workflows. Both remain underspecified. Evaluation in many organizations is still confused with model benchmarking, leaving production failure modes (hallucinated citations, prompt injection, drift, vendor liability) unmonitored until they appear in court records. Governance is still treated as documentation that sits outside the AI delivery lifecycle and produces artifacts that engineers ignore.

This report argues that the two disciplines are inseparable. Evaluation gates are governance components. Governance maturity is the prerequisite for evaluation to be trusted by anyone outside the building team. The report decomposes evaluation into seven measurable dimensions (correctness, faithfulness, relevance, safety, latency, cost, and business outcome) and surveys five evaluation methods (workload-specific golden datasets, LLM-as-judge, human-in-the-loop review, online evaluation, and adversarial red-teaming), with citations to the original methods literature including RAGAS,¹ G-Eval,² MT-Bench,³ BLEURT,⁴ and BERTScore.⁵
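
As a rough illustration of an evaluation gate acting as a governance control, the sketch below checks measured scores for the seven dimensions against release thresholds and fails closed when a required measurement is missing. The function name, the dimension keys, the split into higher-is-better and lower-is-better groups, and the threshold values in the usage note are assumptions made for this example; the report does not prescribe them.

```python
from dataclasses import dataclass

# Assumed direction of each dimension: quality scores should be high,
# latency and cost should be low.
HIGHER_IS_BETTER = {"correctness", "faithfulness", "relevance", "safety", "business_outcome"}
LOWER_IS_BETTER = {"latency", "cost"}


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: dict  # dimension -> (observed value or None, threshold)


def evaluate_gate(scores: dict, thresholds: dict) -> GateResult:
    """Compare measured scores with thresholds; any failure blocks release."""
    failures = {}
    for dim, limit in thresholds.items():
        if dim not in HIGHER_IS_BETTER and dim not in LOWER_IS_BETTER:
            raise ValueError(f"unknown dimension: {dim}")
        if dim not in scores:
            # Fail closed: a dimension that was not measured cannot pass.
            failures[dim] = (None, limit)
            continue
        observed = scores[dim]
        ok = observed <= limit if dim in LOWER_IS_BETTER else observed >= limit
        if not ok:
            failures[dim] = (observed, limit)
    return GateResult(passed=not failures, failures=failures)
```

For example, with hypothetical thresholds `{"correctness": 0.9, "latency": 2.0, "safety": 0.99}`, a candidate scoring `{"correctness": 0.95, "latency": 3.0}` would be blocked on two counts: latency is over its limit, and safety was never measured. Returning the failing dimensions rather than a bare boolean gives reviewers outside the building team something to inspect.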

It then maps governance onto seven operational components (model registry, evaluation gates, human checkpoints, observability, incident response, data and identity controls, decommissioning) and maps those against the EU AI Act,⁶ NIST AI RMF 1.0,⁷ the NIST Generative AI Profile (AI 600-1),⁸ ISO/IEC 42001:2023,⁹ and the post-EO 14110 US executive-order landscape.¹⁰ Three governance archetypes (Compliance-Led, Risk-Led, Engineering-Led) and a four-level combined maturity ladder are proposed as original contributions of this publication. Documented production failures (Moffatt v. Air Canada,¹¹ Mata v. Avianca,¹² EEOC v. iTutorGroup,¹³ Lacey v. State Farm,¹⁴ DPD chatbot,¹⁵ Chevrolet of Watsonville)¹⁶ anchor the analysis.

Filed under Evaluation & Governance

2 pieces filed under this pillar. Members read the body.

2 May 2026, Briefing, Locked
Evaluation and Governance
