The First Year of the Applied Layer: A Synthesis
Cite as: The Applied Layer. (2026). The First Year of the Applied Layer: A Synthesis. The Applied Layer. https://appliedlayer-ai.com/research/first-year-applied-layer-synthesis

Stratification, calibrated claims, and the questions ahead
Executive Summary
The most consequential layer of the AI buildout is not the foundation models themselves but what sits between them and the organizations that deploy them: architecture, integration, evaluation, and governance. The public record has clarified the picture rather than settled it. The applied layer is in a phase of stratification, not consolidation. Organizations doing this work well (disciplined about evaluation, deliberate about retrieval, organizationally honest about what AI replaces and what it augments) are pulling decisively away from those treating AI as a procurement category.
The field has produced several confirmed claims. Evaluation has emerged as the binding constraint on production systems. Governance, where it works, behaves as a delivery practice rather than a compliance checklist. Organizational design dominates technology choice in determining outcomes; the same model in two different operating models produces opposite results. Retrieval did not die; it matured into a conditional, hybrid, multimodal discipline. The Model Context Protocol moved from a single-vendor experiment to multi-vendor infrastructure in roughly eighteen months.
The year also falsified or substantially qualified several confident claims circulating at the start of 2025. “Agents will replace workflows” collided with the empirical reality of agentic failure rates and Klarna’s public reversal. “RAG is over” did not survive contact with cost, audit, and accuracy data. “Foundation models will commoditize” is true at the inference layer but misleading about platform power, which has concentrated rather than diffused. “Open source will dominate enterprise” overstated open weights’ position even as Meta retreated from full openness. “Governance is a regulatory checklist” was contradicted by the EU AI Act’s phased rollout, where the binding work has been operational, not paperwork.
The maturity ladder articulated in the applied-layer framework holds. The middle of the ladder needs refinement: too many organizations sit at “scaled pilots” without crossing into the production discipline that defines the upper rungs. The publication’s forward research agenda follows from the open questions that remain.
Section 1: Where the Applied Layer Is
Headline state of the field
A year of compounding evidence supports a single, blunt characterization: the applied layer is real, it is the binding constraint on enterprise value capture from AI, and the gap between organizations that treat it seriously and those that do not is widening at an accelerating pace.
The 2026 AI Index from Stanford’s Institute for Human-Centered Artificial Intelligence reports organizational adoption at 88%, a number that, taken alone, would suggest the technology has arrived. The Index also reports that responsible AI reporting lags badly, that documented AI incidents rose to 362 in the year, and that the Foundation Model Transparency Index average dropped to 40 from 58 a year earlier.1 Adoption, in other words, is not the same as integration. The MIT NANDA initiative’s State of AI in Business 2025, based on 52 executive interviews, 153 senior-leader survey responses, and analysis of 300 publicly disclosed AI initiatives, concluded that 95% of enterprise generative AI pilots produced no measurable impact on profit and loss.2
The 95% figure travels poorly. It has been cited as proof that AI has failed and as proof that organizations are failing to use AI. Both readings are wrong. The NANDA authors are explicit: the failure mode is structural, not technological. Pilots stalled most often where organizations bought tools rather than partnered with vendors, where central AI labs owned deployment rather than line managers, and where investment concentrated in sales and marketing despite higher returns in back-office automation.2 These are operating-model failures, mediated through the applied layer.
[Figure 1: The applied layer one year in, two-panel before/after diagram. Left panel shows the framework as articulated in the applied-layer framework: foundation models at the bottom, a thin “applied layer” middle (architecture, integration, evaluation, governance), and applications/users at the top. Right panel shows the same framework one year on with thickened sub-layers: retrieval (now hybrid + late-interaction + long-context routing), orchestration (now MCP-native), evaluation (now eval-driven development with golden datasets), governance (now operationalized through NIST AI RMF + EU AI Act + ISO/IEC 42001), integration (now context engineering), and a new horizontal “human-AI workflow” band crossing all sub-layers. Annotations mark where each sub-layer thickened, where the boundary between applied layer and foundation models blurred, and where leader-laggard divergence is largest.]
The stratification thesis
The clearest finding of the year is that aggregate statistics conceal an increasingly bimodal distribution. Anthropic’s Economic Index reports that Claude usage remains highly concentrated by geography and sector: Singapore and Canada use Claude at 4.6x and 2.9x the rate predicted by working-age population.3 On the enterprise side, Anthropic disclosed that more than 1,000 customers spend over $1 million annually on Claude as of April 2026, more than doubling from approximately 500 in February 2026.4 The same period saw OpenAI cross $20 billion in annual recurring revenue with more than nine million paying business users.5
These are not numbers that describe a market in equilibrium. They describe a small set of organizations consuming AI at intensities orders of magnitude higher than the median, often accompanied by deeper structural changes in how work is organized. Microsoft’s third-quarter fiscal 2026 earnings disclosed that the number of customers with more than 50,000 paid Microsoft 365 Copilot seats quadrupled year over year, and that the company’s AI business surpassed a $37 billion annual revenue run rate.6 At the same time, Recon Analytics tracked Copilot’s accuracy Net Promoter Score at -3.5 in July 2025, deteriorating to -24.1 in September 2025, and partially recovering to -19.8 in January 2026, an indicator that volume of seats does not equate to value capture.7
What distinguishes leaders from laggards is not access to better models. The frontier model gap, where it exists, is measured in months. What distinguishes them is how seriously they treat the applied layer.
What pulls leaders ahead
Three patterns recur in the public record from organizations producing measurable returns.
First, leaders treat evaluation as a first-order engineering discipline, not a quality-assurance afterthought. The eval-driven development pattern that emerged from Databricks’ 2025 Data + AI Summit and from independent practitioner reports describes a workflow in which golden datasets are versioned, custom LLM judges are calibrated against domain-specific rubrics, and statistical significance, not vibes, gates production deployment.8 The 2025 NeurIPS workshop on reliable machine learning from unreliable data documented systematic biases in LLM-as-judge setups: position bias, verbosity bias, self-enhancement bias, agreeableness bias.9 Leaders have absorbed these findings. Laggards still rely on single-judge evaluation with default prompts and call the result a benchmark.
Second, leaders treat retrieval as architecture, not a feature toggle. The “RAG is dead” discourse that surrounded Gemini 1.5 Pro’s million-token context window did not survive contact with cost data. Independent analysis of production deployments suggests RAG queries cost roughly 1,200x less than equivalent long-context queries at scale.10 More importantly, 2025 research showed that long-context models continue to lose information in the middle of their windows, with recall hovering near 60% for the median frontier model at full load.11 Leaders combine retrieval and long context conditionally. Laggards pick one and defend it as identity.
Third, leaders absorb organizational change. Shopify’s April 2025 memo from Tobi Lütke, that teams must demonstrate why AI cannot do the work before requesting headcount, was widely covered as an HR policy. It was infrastructure: Shopify backed it with internal LLM proxies, dozens of MCP servers, and reflexive AI usage embedded in performance review.12 Other companies adopted the language without the infrastructure and produced predictable results.
What stalls laggards
The MIT NANDA work identifies the modal failure pattern: AI projects that aspire to “transformation” but lack measurement, lack line-manager ownership, and lack integration with the systems where work actually happens.2 Deloitte’s 2026 Tech Trends survey reported that 48% of organizations cited data searchability and 47% data reusability as obstacles to AI automation, both architectural deficits that predate AI but are now binding constraints on it.13 Gartner’s October 2025 forecast that more than 40% of agentic AI projects will be cancelled by end of 2027 was widely reported as a prediction; the more useful framing is that it describes the consequence of skipping the applied layer.14
The stratification thesis follows. As leaders compound their applied-layer investments (better evaluation, cleaner retrieval, deeper governance, more honest organizational design), laggards face increasing structural disadvantages. The market has moved from “is your organization using AI?” to “is your organization using AI in a way that compounds?”
Section 2: What the Year Proved
This section examines five claims that have been substantially confirmed by the year’s evidence. Each is paired with at least two independent sources from the period.
Confirmed claim 1: Evaluation is the binding constraint on production AI
The applied-layer framework argued, against the prevailing discourse, that evaluation, not model capability, would emerge as the binding constraint on production systems. The year produced strong corroboration.
The 2025 NeurIPS workshop on reliable machine learning included multiple papers documenting that LLM-as-judge methods, the dominant approach to scaling evaluation, exhibit systematic biases that compound when used in iterative development. Position bias produces a measurable preference for the first option presented; verbosity bias rewards longer responses regardless of quality; agreeableness bias yields true-positive rates above 96% paired with true-negative rates below 25% on class-imbalanced datasets.9,15 Independent work at NeurIPS and ICML 2025 introduced calibration techniques, quantitative LLM judges trained via regression on human labels, agentic context engineering for evaluator stability, and causal judge evaluation frameworks for valid policy ranking, but no consensus solution emerged.16,17
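Two of the simplest mitigations in that literature, randomizing presentation order and voting across an ensemble of judges, fit in a few lines. The sketch below is illustrative rather than any paper's method; the judges are stub callables standing in for LLM calls.

```python
import random

def debiased_pairwise(judges, answer_a: str, answer_b: str,
                      trials: int = 10, seed: int = 0) -> str:
    """Majority vote over judges and randomized presentation orders,
    damping position bias. Each judge is a callable
    (first, second) -> "first" | "second"."""
    rng = random.Random(seed)
    votes = {"a": 0, "b": 0}
    for _ in range(trials):
        judge = rng.choice(judges)
        if rng.random() < 0.5:  # present in order (a, b)
            votes["a" if judge(answer_a, answer_b) == "first" else "b"] += 1
        else:                   # swapped order (b, a)
            votes["b" if judge(answer_b, answer_a) == "first" else "a"] += 1
    return max(votes, key=votes.get)

# Stub judges: one with pure position bias, one that keys on content.
def position_biased(first, second):
    return "first"

def prefers_good(first, second):
    return "first" if first == "good" else "second"

print(debiased_pairwise([position_biased, prefers_good], "good", "bad"))
```

With the position-swapped repeats, the position-biased judge's votes split roughly evenly between the two answers, so the content-sensitive signal dominates the vote.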
The applied implication is direct: an organization that has not invested in calibrated, domain-specific evaluation cannot reliably tell whether its AI system is improving. The MIT NANDA report’s finding that 95% of pilots produce no measurable P&L impact is, in significant part, a measurement story.2 You cannot manage what you cannot measure, and the default measurement infrastructure is not yet adequate.
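What adequate measurement infrastructure means at the deployment gate can be made concrete with a deliberately minimal sketch (the numbers and threshold are invented): require a candidate system to beat the deployed baseline on a versioned golden dataset at conventional statistical significance before promotion.

```python
import math

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion z-statistic for pass rates p1 (candidate) vs p2 (baseline)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

def gate_deployment(candidate_passes: int, baseline_passes: int, n: int,
                    z_threshold: float = 1.645) -> bool:
    """Promote only if the candidate beats the baseline at ~95% one-sided confidence."""
    z = two_proportion_z(candidate_passes / n, baseline_passes / n, n, n)
    return z > z_threshold

# Example: 412 vs 380 passes on a 500-item golden dataset.
print(gate_deployment(412, 380, 500))  # True: the gap clears significance
```

The point is not the particular test; it is that promotion is gated by a versioned dataset and an explicit statistical criterion rather than a demo.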
Confirmed claim 2: Governance is a delivery practice, not a compliance checklist
The EU AI Act entered into force on August 1, 2024, with phased applicability. Prohibitions on unacceptable-risk systems took effect February 2, 2025. General-purpose AI obligations applied from August 2, 2025. The European Commission’s enforcement powers over GPAI providers begin August 2, 2026, and high-risk system rules apply from the same date.18,19 DLA Piper’s August 2025 analysis emphasized that the August 2025 milestone activated penalty regimes of up to €35 million or 7% of global turnover for prohibited practices.20
The phased rollout had a predictable effect: organizations that treated the Act as a paperwork exercise produced paperwork. Organizations that treated it as a forcing function for documentation, data lineage, model cards, incident reporting, and human oversight produced governance that materially affected delivery practice. NIST’s December 2025 preliminary draft Cyber AI Profile (NIST IR 8596) and the August 2025 concept paper on SP 800-53 Control Overlays for Securing AI Systems follow the same pattern: outcome-oriented controls embedded in delivery, not compliance attestations.21,22
The Cloud Security Alliance’s July 2025 AI Controls Matrix, 243 control objectives across 18 domains, vendor-agnostic and cloud-native, provides operational guidance that maps directly onto build pipelines.22 Leaders use it that way. Laggards fill out a checklist and continue.
Confirmed claim 3: Organizational design dominates technology choice
The clearest single piece of evidence for this claim is the contrast between Klarna and Shopify, both of which made high-profile AI-first organizational moves and both of which are now examined to within an inch of their lives. Klarna’s reversal, disclosed by CEO Sebastian Siemiatkowski to Bloomberg in May 2025, followed the pattern of optimizing operational metrics (“AI handled two-thirds of customer chats; equivalent of 700 agents”) while underweighting quality metrics.23 Shopify’s program, supported by internal LLM proxy infrastructure, more than 24 MCP servers, AI usage embedded in performance reviews, and an explicit “Red Queen” framing of continuous adaptation, has not produced a comparable reversal.12
This is not a story about model quality. Klarna and Shopify both used frontier models from major labs. The differentiator was operating model: how AI is integrated into the work, how quality is measured, how humans and AI escalate to one another, how feedback loops back into the system. The MIT NANDA report’s finding that purchased AI tools deployed via vendor partnership succeed roughly 67% of the time, versus one-third of that rate for internally-built systems, points in the same direction: the binding constraint is organizational capacity to integrate, not engineering capacity to build.2
Confirmed claim 4: Retrieval matured rather than died
The “RAG is dead” discourse peaked in early 2024 with Gemini 1.5 Pro’s million-token context window and surfaced again with Anthropic’s Model Context Protocol release in November 2024. The year produced strong evidence that retrieval did not die, but did substantially mature.
Two ICML 2025 papers, LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs and Long Context vs. RAG for LLMs: An Evaluation and Revisits, concluded that neither approach dominates universally. Long context tends to win on quality for self-contained corpora; RAG remains substantially cheaper and more auditable; chunk-based retrieval lags both; hybrid systems combining lexical (BM25), dense, and reranking layers consistently outperform single-strategy approaches.11,24 Anthropic’s September 2024 Contextual Retrieval engineering post reported that hybrid contextual embedding plus contextual BM25 reduced top-20 retrieval failure rates from 5.7% to 2.9%, with reranking taking the failure rate to 1.9%.25
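One common way such hybrid pipelines combine a lexical ranking and a dense ranking before reranking is reciprocal rank fusion. A toy sketch (the document IDs and rankings are invented; k=60 is the conventional RRF constant):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs: each list contributes
    1 / (k + rank) per document, and fused order is by total score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]   # lexical retriever
dense_ranking = ["doc_b", "doc_a", "doc_d"]  # embedding retriever
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents that rank well in both lists float to the top without any score normalization across retrievers, which is why RRF is a popular first fusion step before a learned reranker.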
Late-interaction architectures (ColBERT, ColBERTv2, ColPali) gained substantial production traction in 2025, with ICML papers documenting MUVERA and PLAID indexing schemes that enable production-scale deployment with retrieval latency in the sub-millisecond range.26 Multi-modal retrieval, handling figures, charts, and tables alongside text, moved from research curiosity to deployment requirement.
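The scoring rule behind late interaction is compact enough to sketch: each query token takes its best similarity over all document tokens, and the per-token maxima are summed (MaxSim). A toy version with hand-made 2-d vectors; real systems use learned per-token embeddings and the compressed indexes named above.

```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: sum over query tokens of the
    maximum dot product against any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-d "embeddings": two query tokens, three document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(round(maxsim(query, doc), 3))  # 1.7
```

Because each query token matches independently, the score preserves token-level signal that single-vector embeddings average away, which is what the compression and indexing work (PLAID, MUVERA) makes affordable at scale.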
Confirmed claim 5: Standardization at the integration layer accelerated faster than expected
The Model Context Protocol, introduced by Anthropic in November 2024, achieved cross-vendor adoption faster than any comparable open standard in living memory. By March 2025, OpenAI had announced support across the Agents SDK, Responses API, and ChatGPT desktop. By April 2025, Google DeepMind had committed to support in upcoming Gemini models. In December 2025, Anthropic donated MCP to the newly formed Agentic AI Foundation under the Linux Foundation, with OpenAI and Block as co-founders and AWS, Google, Microsoft, Cloudflare, and Bloomberg as supporting members.27,28
The pace of adoption is verifiable: MCP turned one year old in November 2025 with a published specification update, and Anthropic, OpenAI, GitHub, Microsoft, Block, and Google publicly attested to production usage in the anniversary post.28 OpenAPI/Swagger took roughly five years to reach comparable cross-vendor adoption; OAuth 2.0 took four; HTTP and HTML took most of the 1990s.29
Two caveats are warranted. First, MCP’s security maturity has lagged its adoption: CVE-2025-6514 (a shell command injection vulnerability in mcp-remote affecting more than 437,000 developer environments) and CVE-2025-49596 (a browser-based attack against Anthropic’s MCP Inspector) document the gap.30 Second, MCP’s enterprise readiness (audit trails, SSO-integrated authentication, gateway behavior, configuration portability) remains a published priority of the protocol’s 2026 roadmap, not a solved problem.31 Adoption, again, is not integration.
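For readers unfamiliar with the protocol's mechanics, MCP messages are JSON-RPC 2.0; a client invokes a server-exposed tool with a tools/call request shaped roughly as below. The tool name and arguments here are hypothetical, chosen only to show the envelope.

```python
import json

# An MCP client calls a server-exposed tool via a JSON-RPC 2.0 request.
# The "query_database" tool and its arguments are invented for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",
        "arguments": {"sql": "SELECT count(*) FROM orders"},
    },
}
print(json.dumps(request, indent=2))
```

The thinness of this envelope is much of the point: any server that can speak JSON-RPC and describe its tools can plug into any MCP-capable client, which is what collapsed the N-by-M connector problem.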
Section 3: What the Year Falsified
This section examines five confident claims that circulated at the start of the year and that the evidence has falsified or substantially qualified. Each pairs the original claim with a citation and the disconfirming evidence with a citation.
Falsified claim 1: “Agents will replace workflows”
Original claim. Marc Benioff, CEO of Salesforce, told the World Economic Forum in Davos in January 2025 that “from this point forward… we will be managing not only human workers but also digital workers.” In an Axios interview that month, he characterized AI agents as “digital labor capable of handling roles previously reserved for human workers.” The framing, that 2025 was the year agents replaced workflows, was widely repeated.32
Disconfirming evidence. Three independent sources have substantially qualified the claim. METR’s randomized controlled trial of 16 experienced open-source developers across 246 tasks (February-June 2025) found that early-2025 AI tools made developers 19% slower on tasks within their existing repositories, despite developers estimating a 24% speedup beforehand and a 20% speedup afterward.33 METR’s follow-up August 2025 study suffered from selection bias as developers refused to participate in the no-AI control condition, but the published methodology shift suggests the headline finding has not been overturned.34
The MIT NANDA report’s 95% pilot-failure finding maps directly onto agentic workflows: the report found that fully autonomous agent deployments fail more often than human-in-the-loop hybrid systems.2 OSWorld, the standard benchmark for computer-use agents, shows top systems reaching 72.6% (Simular’s Agent S3, December 2025), above the 72.36% human baseline on this benchmark, but with substantial caveats about latency: agents take 30 seconds or longer for tasks humans complete in seconds, and OSWorld-Gold analysis at ICML 2025 showed that successive steps in an agent trajectory can take 3x longer than initial steps.35,36 On Terminal-Bench 2.0, frontier models still score below 65% on hard tasks.37
Calibrated current view. Agents work for narrow, verifiable, repetitive tasks where escalation paths to humans exist. They do not yet replace workflows in any honest reading of the evidence. Klarna’s reversal, chronicled in detail below, is the clearest public example of a company that bet otherwise and unwound the bet.23
Falsified claim 2: “RAG is dead / over”
Original claim. Posts and presentations through 2024 and into 2025 declared RAG obsolete. A widely-circulated Medium piece titled “RAG is DEAD!” argued that million-token context windows had eliminated the need for retrieval; another framing, “MCP has killed RAG!”, followed Anthropic’s November 2024 release.38 Yao Fu’s tweet that “the 10M context kills RAG” was widely cited.39
Disconfirming evidence. ICML 2025’s LaRA benchmark, with 2,326 test cases across four QA tasks and three long-context types evaluated on 11 LLMs, concluded that there is “no silver bullet for LC or RAG routing” and that the optimal choice depends on model size, task type, context length, and retrieval quality.11 Independent practitioner analysis estimated RAG average query cost at $0.00008 versus long-context average at $0.10, roughly 1,250x cheaper at the query level, with comparable quality differences in the other direction depending on task.10 Production requirements for audit trails, citation provenance, and the ability to update knowledge without retraining, all of which retrieval supports natively, have kept RAG architectures dominant in regulated industries.
Calibrated current view. RAG is not dead. Naive 2023-era RAG (chunk + embed + top-k) is decisively outperformed by hybrid systems. Long context wins where corpora are static, small, and audit-permissive. Conditional retrieval (agents that decide when, what, and how to retrieve) is the production-grade pattern.
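The conditional pattern can be sketched as a simple router. The flags and thresholds below are illustrative, not canonical; production routers typically learn such boundaries from evaluation data rather than hard-coding them.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    needs_citations: bool   # audit / provenance requirement
    corpus_tokens: int      # size of the candidate knowledge base

def route(q: Query, context_budget: int = 200_000) -> str:
    """Decide between long-context stuffing, retrieval, or a hybrid.
    Thresholds are illustrative; real routers learn them from evals."""
    if q.needs_citations:
        return "rag"              # provenance requires retrieval
    if q.corpus_tokens <= context_budget:
        return "long_context"     # small, static corpus: stuff it all in
    return "hybrid"               # retrieve first, then reason in-context

print(route(Query("summarize the handbook", False, 80_000)))  # long_context
```

Even this crude version captures the calibrated view: the choice is per-query, driven by audit requirements and corpus size, not a one-time architectural identity.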
Falsified claim 3: “Foundation models will commoditize”
Original claim. A widely shared thesis throughout 2024 and into early 2025 held that foundation models would rapidly commoditize as multiple labs converged on similar architectures and benchmark gaps narrowed. DeepSeek-R1’s January 2025 release at roughly 95% lower training cost than comparable Western models was treated as proof.40 Industry analysts repeatedly described foundation models as “the new commodity layer.”41
Disconfirming evidence. Inference pricing did fall sharply: the Stanford AI Index 2025 documented a 280-fold drop in inference costs for GPT-3.5-equivalent performance in 18 months.42 DeepSeek’s pricing, $0.28 per million input tokens for V3.2, represents a 20-50x advantage over GPT-class APIs.43
But platform power has concentrated, not diffused. Anthropic disclosed run-rate revenue of $30 billion as of April 2026, up from $9 billion at end of 2025, roughly 3x growth in four months, with more than 1,000 customers spending over $1 million annually.4 OpenAI reached $20 billion ARR by end of 2025 and reported nine million paying business users.5 Microsoft’s AI business reached a $37 billion annual revenue run rate in Q3 FY2026.6 These are not the financial profiles of commodity providers.
The Menlo Ventures State of Generative AI in the Enterprise 2025 report estimated enterprise generative AI spending at $37 billion in 2025, up 3.2x from $11.5 billion in 2024, with copilot platforms capturing 86% of that spend.44 Meta’s reported abandonment of full open-source frontier model development in favor of a proprietary Avocado/Muse Spark program (announced late 2025 / early 2026) further qualifies the commoditization thesis at the open-weight tier.45
Calibrated current view. Inference is commoditizing rapidly. Foundation model providers are not. The platforms that combine model access with distribution, security boundary, and integration tooling are concentrating power, not diffusing it.
Falsified claim 4: “Open source will dominate enterprise AI”
Original claim. Mark Zuckerberg’s October 2024 manifesto, “Open Source AI is the Path Forward,” argued that open weights would inevitably win enterprise share. Industry analysts, citing DeepSeek-R1’s performance and Llama’s billion download milestone, predicted open weights would constitute the majority of enterprise deployments by end of 2025.46
Disconfirming evidence. Menlo Ventures’ mid-year 2025 LLM market update, based on a survey of 150 technical decision-makers, found Anthropic at 32% enterprise market share, OpenAI at 25%, Google at 20%, Meta’s Llama at 9%, and DeepSeek at 1%.47 Closed-weight commercial models thus held 77% of enterprise production usage. Meta’s December 2025 reporting indicated abandonment of full Llama development in favor of proprietary models, with the company’s Q1 2026 announcement of Muse Spark formalizing the shift.45,48 BentoML’s analysis estimated open-weight models lag state-of-the-art proprietary models by approximately three months on average, meaningful but not dominant.49
Calibrated current view. Open weights play a substantial role in enterprise AI, particularly for regulated industries requiring on-premises deployment, for cost-sensitive high-volume workloads, and for fine-tuning. They have not dominated, and the trend through 2025 was toward partial rather than full openness.
Falsified claim 5: “Governance will be a regulatory checklist”
Original claim. A common 2024 framing held that AI governance would be reducible to compliance attestations once the EU AI Act and equivalent regulations clarified requirements, that the binding work would be paperwork, not engineering.
Disconfirming evidence. The EU AI Act’s phased rollout produced the opposite. The August 2025 GPAI obligations required providers to publish detailed model documentation, copyright compliance attestations, and systemic-risk notifications, none of which can be produced retroactively without substantial engineering investment in data lineage, model cards, and evaluation infrastructure.18,20 NIST’s December 2025 Cyber AI Profile (NIST IR 8596) and the SP 800-53 AI control overlays under development map AI-specific controls onto outcome-oriented engineering practices, not compliance forms.21 The Cloud Security Alliance’s AI Controls Matrix, 243 controls across 18 domains, operationalizes governance as cloud-native infrastructure.22
The independent corroboration comes from the failure pattern: organizations that approached the August 2025 deadline with a checklist mindset produced documentation, not governance. Those that approached it as a forcing function for engineering hygiene produced both.
Calibrated current view. Governance is a delivery practice. The regulatory text matters; the engineering response matters more. Organizations that hold this view are pulling away.
[Figure 3: What proved, what falsified, tabular figure. Two-column table. Left column lists the five confirmed claims with a brief defense and primary citation. Right column lists the five falsified or qualified claims with the original confident claim, the disconfirming evidence, and the calibrated current view. Color coding distinguishes “fully confirmed” (dark green) from “partially confirmed” (light green) on the left, and “fully falsified” (dark red) from “substantially qualified” (light red) on the right.]
Section 4: The Applied Layer One Year On
This section walks each component of the applied layer and offers an honest, evidence-anchored assessment of where it stands at the end of the period.
Retrieval
Retrieval has become a more interesting and more disciplined field than its 2023 form. Three structural shifts define its current state.
First, the field has moved beyond the dense-vector-only orthodoxy. Anthropic’s Contextual Retrieval methodology, combining contextual embedding with contextual BM25 and a reranker, became a widely-adopted reference architecture, with reported failure-rate reductions from 5.7% to 1.9% on Anthropic’s internal benchmarks.25 Independent ICML 2025 analysis confirmed that hybrid lexical-plus-dense-plus-rerank pipelines outperform either approach alone across most benchmarks.11,24
Second, late-interaction retrieval moved from research curiosity to production. ColBERTv2’s residual compression reduced storage footprint by 6-10x while preserving retrieval quality; ColPali extended late interaction to vision-language documents; PLAID and MUVERA indexing schemes demonstrated sub-millisecond retrieval latency at production scale.26 These advances matter because most enterprise corpora are not pure text: they include figures, tables, diagrams, and scanned documents.
Third, long context and retrieval are now treated as complements, not substitutes. The dominant production pattern routes simple, audit-light queries to long-context calls; complex, citation-required queries to retrieval; and hybrid queries to systems that retrieve, then reason over the retrieved set within a long context. RAGFlow’s year-end 2025 review described the shift as “from RAG to context engineering”, the integration of retrieval, prompt assembly, memory, and tool invocation into a unified discipline.50
Orchestration
Orchestration is the part of the applied layer where the most surprising progress occurred and where the ground is still most unstable.
The Model Context Protocol’s adoption arc (single vendor in November 2024 to Linux Foundation governance in December 2025, with cross-vendor SDK downloads in the tens of millions monthly) has no recent precedent.28 Production deployments at Block, Bloomberg, Salesforce, and across Microsoft’s Windows ecosystem are documented.27 The ecosystem now includes thousands of MCP servers across categories from databases to development tools.
Three caveats are warranted. First, security maturity lagged adoption. The MCP 2026 roadmap explicitly prioritizes audit trails, SSO-integrated authentication, gateway and proxy patterns, and configuration portability, all of which are problems enterprises hit at scale, not features available out of the box.31 Second, the protocol has not yet absorbed the agent-to-agent coordination problem; AGENTS.md (OpenAI) and A2A protocols address adjacent surfaces, and the boundary between MCP and these is unsettled. Third, the relationship between MCP and frameworks like LangGraph, CrewAI, AutoGen, and provider-native agent frameworks remains a layering question rather than a solved architecture.
The “context engineering” rebrand, popularized by Anthropic’s September 2025 engineering post and reinforced by Gartner’s July 2025 declaration that “context engineering is in, prompt engineering is out,” reflects the integration story.51,52 The October 2025 Agentic Context Engineering paper at NeurIPS introduced ACE, a framework for evolving contexts that prevents collapse during iterative refinement, with reported +10.6% benchmark improvements.53
Evaluation
Evaluation is, by the year’s evidence, the binding constraint on production AI.
The state of the practice combines three threads. First, eval-driven development as a workflow: golden datasets versioned alongside code, custom domain-specific judges calibrated against human labels, statistical significance gating production deployment. Databricks’ 2025 Data + AI Summit talks codified this pattern; Arize, Maxim, Langfuse, Braintrust, and LangSmith provide the tooling layer.8,54
Second, the calibrated critique of LLM-as-judge methods. Multiple NeurIPS 2025 and ICML 2025 papers documented systematic biases (position, verbosity, self-enhancement, agreeableness, length) and proposed mitigations including order randomization, voting ensembles across model families, regression-based bias correction calibrated on small human-annotated sets, and multi-agent debate frameworks (MAJ-Eval).9,15,16 No consensus solution emerged; the cost of bad evaluation is now widely understood, and good evaluation remains expensive.
Third, the recognition that benchmarks lie in two directions. The Scale AI SWE-Bench Pro work demonstrated that frontier model performance on the public Verified subset (>70% for GPT-5 and Claude Opus 4.1) drops to 23-24% on the harder Pro variant, and to 14.9-17.8% on the private subset.55 Anthropic’s reported scores of 87.6% on SWE-bench Verified for Opus 4.7 (December 2025) sit above 64.3% on SWE-bench Pro and 78.0% on OSWorld-Verified, with each benchmark measuring something materially different about a system’s capability.56 Treating any single benchmark as proof of production readiness is now a recognized failure mode.
The November 2025 Beyond Accuracy paper at NeurIPS proposed CLEAR (Cost, Latency, Efficacy, Assurance, Reliability) as a framework for enterprise agent evaluation, finding that optimizing for accuracy alone produces agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance, and that expert evaluators predict production success at correlation ρ = 0.83 with CLEAR versus ρ = 0.41 with accuracy-only metrics.57
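The shape of a CLEAR-style composite can be illustrated with toy numbers. The equal weighting and the budget-fraction normalization below are assumptions for illustration, not the paper's published methodology:

```python
def clear_score(metrics, weights=None):
    """Combine the five CLEAR axes into one score in [0, 1].
    Cost and latency are 'lower is better', so they enter inverted;
    efficacy, assurance, and reliability are already in [0, 1].
    Equal default weights are illustrative only."""
    weights = weights or {k: 0.2 for k in
                          ("cost", "latency", "efficacy", "assurance", "reliability")}
    normalized = {
        "cost": 1.0 - metrics["cost"],        # fraction of cost budget consumed
        "latency": 1.0 - metrics["latency"],  # fraction of latency budget consumed
        "efficacy": metrics["efficacy"],
        "assurance": metrics["assurance"],
        "reliability": metrics["reliability"],
    }
    return sum(weights[k] * normalized[k] for k in weights)

# An accuracy-maximizing agent vs. a cost-aware one (toy numbers echoing
# the finding that accuracy-only optimization inflates cost).
accuracy_only = {"cost": 0.90, "latency": 0.70, "efficacy": 0.93,
                 "assurance": 0.80, "reliability": 0.85}
cost_aware    = {"cost": 0.15, "latency": 0.30, "efficacy": 0.90,
                 "assurance": 0.80, "reliability": 0.85}
```

On these toy inputs the cost-aware agent wins the composite despite a slightly lower efficacy number, which is the qualitative behavior the framework is designed to surface.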
Governance
Governance, where it works, is engineering. The frameworks that defined the period, from NIST AI RMF 1.0 and the EU AI Act to ISO/IEC 42001, NIST AI 600-1 (the Generative AI Profile), the December 2025 NIST IR 8596 Cyber AI Profile, and the Cloud Security Alliance AI Controls Matrix, converge on a structurally similar set of practices: data lineage, model cards, evaluation infrastructure, incident reporting, human oversight, and lifecycle monitoring.212258
The phased EU AI Act rollout (prohibitions February 2025; GPAI obligations August 2025; high-risk systems August 2026; embedded high-risk systems August 2027) functions as a forcing schedule.1819 Organizations that built infrastructure ahead of August 2025 had documentation when it was demanded; those that did not faced escalating costs in the months following.
The signal that governance has matured into a delivery practice rather than a checklist is operational: at leading organizations, governance reviews are integrated into the deployment pipeline, not bolted on at production gates. NIST’s emphasis on “GOVERN applies to all stages of organizations’ AI risk management processes and procedures” reinforces this framing.58
Integration
Integration is the part of the applied layer that has most clearly shifted from a static problem to an active engineering discipline. Three patterns dominate year-end 2025.
First, retrieval and tool invocation increasingly happen through MCP rather than custom connectors, reducing what BCG characterized as the quadratic explosion of integration complexity to a linear one.59 Second, identity and authorization patterns (Auth0’s MCP server work, Cloudflare’s approval workflows, Salesforce’s interoperability anchoring) moved from prototype to production through summer 2025.27 Third, observability tooling (New Relic’s MCP monitoring, Pomerium and SGNL gateway products, MCPTotal) emerged as a recognized category.27
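The arithmetic behind the quadratic-to-linear characterization is simple: point-to-point integration scales with the product of AI clients and backend systems, while a shared protocol scales with their sum. Illustrative figures only:

```python
def point_to_point_connectors(n_clients, n_systems):
    """Custom integrations: every AI client wired to every enterprise system."""
    return n_clients * n_systems

def mcp_connectors(n_clients, n_systems):
    """Shared protocol: each client implements MCP once,
    each system exposes one MCP server."""
    return n_clients + n_systems

# Ten AI applications against forty internal systems:
# 400 bespoke connectors collapse to 50 protocol endpoints.
print(point_to_point_connectors(10, 40), mcp_connectors(10, 40))  # 400 50
```

The gap widens as either count grows, which is why the standard's value compounds with adoption rather than saturating.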
The integration challenge that remains unsolved at year-end is data architecture. Deloitte’s 2026 Tech Trends survey reported that nearly half of organizations cite data searchability and reusability as obstacles to AI automation; Gartner attributes more than 40% of agentic AI project failures to legacy systems unable to support modern AI execution demands.1314 No standard architecture has emerged, and the problem remains an area of active research.
Human-AI workflow
The most underrated finding of the year sits in this category. Anthropic’s Economic Index reports document a steady shift from delegation to augmentation: as of February 2026, 52% of users worked with Claude as a thinking partner rather than delegating tasks for full automation; directive conversations rose from 27% to 39%; the share of program creation in coding usage grew while debugging share declined.360 The pattern indicates users converging on workflows where AI handles bounded subtasks and humans retain decision authority, a different end-state from the autonomous-agent narrative that dominated early 2025.
Klarna’s hybrid model, AI handling routing and FAQ resolution, humans handling escalation and judgment-required interactions, converged on the same pattern from a different starting point.23 Microsoft’s Copilot integration of Agent Mode as a default in Word, Excel, and PowerPoint, paired with retention data showing 44.2% of lapsed users cite distrust of answers as the primary reason for stopping use, points in the same direction: the durable integration is collaborative, not autonomous.7
[Figure 2: Stratification chart, leader-laggard gap visualization. Horizontal axis represents applied-layer maturity (left = procurement-only treatment of AI, right = full eval-driven development with operationalized governance). Vertical axis represents measured business outcome (cost savings, revenue growth, productivity per worker). Two scatter clouds: a tight high-performing cluster on the upper right (organizations like Shopify, Anthropic enterprise customers spending >$1M, Microsoft 365 Copilot deployments at >50,000 seats), and a broader, lower cluster on the lower left (95% of pilots from MIT NANDA findings). A widening gap between the two cluster trend lines is highlighted. Annotations call out “the divergence accelerated through 2025.”]
Section 5: The Vendor Map at Year End
This section offers an honest read of the major platform vendors as of May 2026. The competitive dynamics revealed by the period are stratified and asymmetric.
Microsoft
Microsoft is the year’s most visible enterprise AI commercial winner. Q3 FY2026 results disclosed Microsoft 365 Copilot at more than 20 million paid enterprise seats, with the number of customers holding more than 50,000 seats quadrupling year over year, and an AI business annual revenue run rate above $37 billion.6 Azure’s “and other cloud services” segment grew 40%, with AI services driving the acceleration.61
The honest assessment is more textured. Recon Analytics’ tracked Copilot accuracy Net Promoter Score moved from -3.5 in July 2025 to -24.1 in September 2025, recovering only partially to -19.8 in January 2026; 44.2% of lapsed Copilot users cite distrust of answers as the primary reason for stopping use.7 Single-platform-only adoption rates of 68% indicate that significant adoption is driven by employer provisioning rather than user preference. Microsoft’s multi-model strategy, routing queries across GPT-5, Claude Sonnet 4.5, Claude Opus 4.7, and Gemini in Smart Mode, represents a structural recognition that no single model dominates all workloads.62
AWS
AWS’s Bedrock has emerged as the most diverse multi-model marketplace, hosting Anthropic’s Claude family, Meta’s Llama, Mistral, Cohere, AI21, and Amazon’s Titan and Nova models behind a single API.63 The October 2025 launch of AgentCore as a full-scale agent builder, combined with Bedrock’s MCP server integrations with Cursor and Kiro, positions AWS as the platform of choice for organizations wanting maximum model optionality with deep cloud integration.
AWS’s strategic position is structurally distinct from Microsoft’s. Microsoft monetizes AI primarily through productivity software seat licenses; AWS monetizes through inference token consumption. Anthropic’s gross revenue accounting, recognizing AWS-channel sales at full end-customer value, produced an April 2026 dispute with OpenAI about reported figures, but the underlying dynamic is real: Bedrock is the primary distribution channel for non-OpenAI frontier models in enterprise.64
Google
Google’s position improved through 2025 but has not achieved the commercial breakout of OpenAI or Anthropic. Vertex AI’s Model Garden hosts Gemini, Claude (notably absent from earlier Bedrock-equivalent positioning), Llama, and Mistral.63 Google’s own disclosure of more than $150 billion in 2025 capital expenditures signals long-term commitment.65 Gemini 3 Pro’s December 2025 release achieved competitive performance on coding (SWE-bench Verified at 80.6%) and the highest score on BrowseComp at 85.9%, indicating particular strength in web-research workflows.56
The honest assessment: Google retains the deepest research bench, the largest first-party data assets, and substantial enterprise distribution through Workspace, but has not yet converted these advantages into the commercial momentum of its closest competitors at the application layer.
Anthropic
Anthropic’s trajectory is the year’s most striking commercial story. Run-rate revenue grew from approximately $1 billion in January 2025 to $9 billion at end of 2025 to $30 billion by April 2026, a 30x trajectory in 15 months.466 Eight of the Fortune 10 are reported as Claude customers; more than 1,000 enterprise customers spend over $1 million annually.4 Claude Code’s standalone product revenue reached a $2.5 billion run-rate by February 2026, with business subscriptions quadrupling since the start of 2026.4
The structural position rests on three pillars: enterprise distribution through all three major clouds (AWS Bedrock, Google Vertex AI, Microsoft Azure Foundry); leadership on agentic coding benchmarks (Claude Opus 4.7 leading SWE-bench Verified at 87.6% and MCP-Atlas at 77.3% as of December 2025); and governance positioning that tracks regulatory direction.56 The dispute with the U.S. government documented in Anthropic’s Series G filing, and the company’s March 2026 legal challenge to its national security designation, is a structural feature of Anthropic’s position, not a transient incident.4
OpenAI
OpenAI ended 2025 at $20 billion ARR with more than nine million paying business users, weekly active users above 800 million, and more than one million organizations using its technology.5 Enterprise revenue reached more than 40% of total revenue, on track to reach parity with consumer revenue by end of 2026.67
The honest read is that OpenAI retains category leadership in distribution and breadth of usage, but has lost share at the enterprise foundation model layer (from 50% to 25-34% enterprise market share over 18 months) and is operating at a 33% gross margin, constrained by inference costs of $8.4 billion in 2025.475 OpenAI’s bet on being a fully integrated platform (consumer, developer, enterprise, agents, advertising) is the most ambitious strategic positioning of the cohort. Whether that ambition resolves into financial sustainability is the open commercial question of the period ahead.
Databricks and Snowflake
Databricks and Snowflake represent the data-platform vector of applied-layer competition. Both companies converged on remarkably similar 2025 announcements: AI assistants for business users (Snowflake Intelligence, Databricks AI/BI Genie), agentic frameworks for builders (Cortex Code, Agent Bricks), and managed AI lifecycle tooling (Mosaic AI on Databricks, Cortex AISQL on Snowflake).68 The strategic logic, that whoever controls enterprise data controls enterprise AI, is shared.
The differentiation is structural. Databricks has invested in custom model training and agent orchestration, with deep MLflow integration and the Mosaic AI Agent Framework. Snowflake has invested in governance-first integration, with Cortex Search, Document AI, and the April 2026 expansion of Snowflake Intelligence and Cortex Code as the “control plane for the agentic enterprise.”69 Foundation Capital’s analysis characterized the rivalry as a battle for the unstructured data layer, which comprises 80-90% of enterprise data by MIT’s estimate.70
[Figure 4: Vendor map at year end, two-axis positioning of major platform vendors. Horizontal axis: model breadth (single-model platform left, multi-model marketplace right). Vertical axis: integration depth (shallow / API-only at the bottom, deep / workflow-integrated at the top). Bubbles sized by approximate enterprise AI revenue run rate as of April 2026. Microsoft positioned upper-right (multi-model via Smart Mode routing, deep workflow integration). AWS positioned mid-right (broadest model marketplace, mid integration depth). Google positioned mid-mid. Anthropic positioned upper-left (single platform, deep enterprise workflow integration via Claude Code, Cowork, healthcare/finance verticals). OpenAI positioned upper-left adjacent (single platform, deep distribution but contested integration depth). Databricks positioned upper-mid (multi-model + deep data integration). Snowflake positioned upper-mid adjacent. Annotations call out the strategic positions and note that no vendor occupies the upper-right corner unambiguously.]
Section 6: Where Enterprises Actually Are
Aggregate adoption statistics flatten meaningful variation. The honest reading of the year’s evidence is that enterprises occupy a wide spectrum of operational maturity, with sector-specific patterns that reward sector-specific analysis.
Maturity ladder
The applied-layer framework’s maturity ladder, from “exploring” through “scaled pilots” to “production discipline” to “operational integration”, held through the period with one important refinement: the middle of the ladder has thickened.
The MIT NANDA report’s 95% pilot-failure finding describes the bottleneck: most organizations are stalled at “scaled pilots.” They have multiple AI initiatives, executive sponsorship, and meaningful spend, but the systems do not produce measurable P&L impact.2 The crossing into the “production discipline” rung (eval-driven development, calibrated retrieval, operationalized governance, organizational adaptation) appears to be where the stratification gap opens.
Sector patterns
Where evidence supports specific sector findings, the picture is uneven.
Software engineering. Coding remains the most penetrated AI workflow. Anthropic’s Economic Index reports coding at 36% of Claude.ai usage as of February 2026.60 Microsoft Copilot’s developer-tools line (GitHub Copilot at more than 1.3 million paid subscribers) and Cursor / Windsurf / Replit deployments compound the pattern.61 The METR study’s finding that experienced developers were 19% slower with early-2025 AI tools is the strongest available randomized evidence and complicates the “AI accelerates engineering” narrative without falsifying it for less experienced developers or different task types.33
Customer service. Klarna’s reversal is the most documented case study. The data path: 700 outsourced positions reduced; AI handling two-thirds to three-quarters of chats; customer satisfaction reportedly dropping 22%; CEO acknowledgment of “lower quality”; reversal to a hybrid model with rural Sweden / customer-base “Uber-style” gig hires.2371 The pattern, operational metrics measured carefully, quality metrics measured loosely, reversal when the gap became visible, is recurring across the sector. Gartner’s prediction that half of companies that cut customer service staff because of AI will need to rehire by 2027 was widely cited.72
Financial services. Adoption is bifurcated. Front-office use cases (research summarization, due diligence) are widespread. Customer-facing automation has lagged due to regulatory and trust constraints. Anthropic’s Claude for Excel beta (with connectors to LSEG, Moody’s, Aiera/Third Bridge) and Snowflake’s banking control-plane positioning indicate where vendor investment is concentrated.469
Healthcare and life sciences. Anthropic’s Claude for Healthcare, including HIPAA-ready connectors to the CMS Coverage Database, ICD-10 codes, and PubMed, and the April 2026 acquisition of Coefficient Bio, signals platform investment.4 Stanford AI Index 2026 reported AI-enabled FDA-approved medical devices at 223 by 2023, with the pace accelerating.1
Public sector. Maryland’s November 2025 partnership with Anthropic for state-agency deployment across food aid, Medicaid, and cash assistance enrollment is a documented public sector signal. The European Commission’s emphasis on the AI Act’s deployer obligations adds another structural pressure.418
What survey data does not show
A note on calibration: surveys reporting “78% of organizations are using AI in at least one business function” (Stanford AI Index 2025) and “88% of organizations” (Stanford AI Index 2026) document deployment, not value capture.142 The gap between deployment and capture is the operational reality this section addresses. Treating survey adoption as a measure of impact has been a recurring analytical mistake of the year.
Section 7: The Questions Year Two Will Answer
The publication’s research agenda for the period ahead follows from the open questions the period surfaced. These are framed as questions, not predictions.
Question 1: Will the eval-driven development pattern stabilize into a profession, or fragment into vendor-specific tooling? The infrastructure exists (Arize, Maxim, Langfuse, Braintrust, LangSmith, MLflow, Vertex Evaluation, Bedrock Evaluations), and the literature on LLM-as-judge calibration is rapidly maturing. The open question is whether evaluation engineering becomes a distinct discipline with shared methodology, certifications, and practitioner identity, or whether it remains a feature set within larger platforms. Evidence that would resolve: the emergence (or absence) of cross-vendor evaluation standards; conference presence; whether eval-driven development appears in technical hiring as a named competency.
Question 2: Where does the agent-versus-workflow boundary stabilize? The year’s evidence falsifies the strongest version of the agentic claim, but agents are real and improving. OSWorld, Terminal-Bench, SWE-bench Pro, and the GDPval suite suggest a cluster of capabilities developing in parallel. The open question is which workflow categories (sales, finance, customer service, IT support) accommodate agentic decomposition with acceptable reliability and which require human-in-the-loop architectures permanently. Evidence that would resolve: comparable longitudinal studies of human-only, human-AI hybrid, and AI-autonomous workflows in matched settings.
Question 3: Does the Model Context Protocol’s adoption produce durable interoperability, or does proprietary extension fragment the standard? MCP’s current trajectory is more positive than the historical norm for open standards. The 2026 roadmap signals work on transports, governance, security, and enterprise readiness. But the protocol’s success creates incentives for proprietary extensions. Evidence that would resolve: rate of MCP-native versus MCP-compatible-but-extended deployments; the Linux Foundation’s governance trajectory through 2026.
Question 4: When and how do the gross-margin economics of foundation model providers change? OpenAI’s 33% gross margin and inference-cost trajectory ($8.4 billion in 2025, projected $14.1 billion in 2026) and Anthropic’s reliance on cloud-channel gross-revenue accounting raise structural questions about whether foundation model providers can achieve software-margin economics.54 The open question is whether algorithmic and hardware efficiency improvements outpace the demand for more capable, more expensive inference. Evidence that would resolve: published margin trajectories; pricing patterns; whether DeepSeek-class open-weight efficiency gains continue.
Question 5: Does the EU AI Act’s August 2026 enforcement milestone produce material change in deployer behavior, or does enforcement remain modest? GPAI obligations have been in effect since August 2025; enforcement powers begin August 2026. The published penalty regime (up to €35 million or 7% of global turnover) is large enough to change behavior if applied. Evidence that would resolve: the first published enforcement actions; the AI Office’s investigation patterns; deployer attestation completeness.
Question 6: Where does the human-AI workflow durably settle for knowledge work? Anthropic’s Economic Index finding that augmentation now slightly leads automation in usage patterns is a meaningful signal but not yet a steady state. The open question is which knowledge-work tasks converge on AI-as-thinking-partner workflows, which converge on AI-as-autonomous-executor workflows, and how the boundary moves with model capability. Evidence that would resolve: longitudinal usage data from Anthropic, OpenAI, and Microsoft; published productivity studies that distinguish task types.
Question 7: Does data architecture become the next binding constraint? Deloitte and Gartner both identify legacy systems and data architecture as the dominant impediments to agentic AI deployment. No standard architecture has emerged. The open question is whether existing data warehouse / lakehouse / vector database architectures stretch to accommodate agentic workloads or whether a new pattern emerges. Evidence that would resolve: vendor convergence patterns; reference architectures; whether the Databricks-Snowflake rivalry produces interoperable governance or competing closed ecosystems.
[Figure 5: The year-two research agenda, tabular figure. Five columns. Column 1: Question (the seven questions above, condensed). Column 2: Why it matters (the strategic stake). Column 3: Evidence that would resolve. Column 4: Earliest plausible resolution (Q3 2026, end of 2026, mid-2027, etc.). Column 5: This publication’s research commitment (which questions will receive dedicated coverage in which pillars). Color coding distinguishes commercial questions, technical questions, and governance questions.]
Section 8: What the Field Has Clarified
This section is reflexive. The publication’s editorial voice is rigorously third-person elsewhere; here, “this publication” is the appropriate framing.
This publication began the year with five editorial positions. Three have hardened with evidence. Two have shifted.
Hardened. The thesis that the applied layer matters more than foundation models for enterprise value capture has hardened. The year’s evidence, from MIT NANDA’s pilot-failure finding, to the stratification visible in Anthropic’s customer concentration data, to Klarna’s reversal, supports the position more strongly than the applied-layer framework originally argued. Where the framework initially treated the claim as defensible, it now treats it as well-supported.
The thesis that evaluation is the binding constraint has also hardened. NeurIPS 2025 and ICML 2025 published more substantive work on LLM-as-judge calibration than the publication anticipated; the field’s progress on the problem outpaced its emergence as a popular topic.
The thesis that governance behaves as a delivery practice has hardened, with the EU AI Act’s August 2025 milestone serving as a forcing function. The publication’s initial framing was correct in direction but understated in operational specifics, which the year clarified.
Shifted. The publication’s initial position on agentic systems was more cautious than the prevailing 2024 discourse but still expected steady progress through 2025. The METR study, OSWorld latency findings, Klarna’s reversal, and Gartner’s project-cancellation forecast collectively produced a more sober view: agentic systems are real, narrow, and improving, but the boundary between agents and workflows is more conservative than this publication expected.
The publication’s initial position on open-source dominance in enterprise was hedged but slightly more bullish than the year’s evidence justified. Meta’s late-2025 retreat from full Llama development, combined with closed-weight commercial models retaining 77% of enterprise production usage in the Menlo Ventures survey, supports a more measured view: open weights are durable and important, but enterprise dominance is no longer the modal scenario.47
False consistency avoided. Two positions this publication held that proved harder than expected to defend: the assumption that Model Context Protocol adoption would face stiffer cross-vendor resistance (it did not), and the assumption that long-context models would not displace retrieval (they did, partially, but the market settled on hybrid rather than displacement). Both of these are nearer to “the market moved faster than expected” than to “the analytical framework was wrong,” but the editorial standpoint has updated accordingly.
The publication has also been wrong about pace in one direction. The expectation that EU AI Act compliance would produce significant chaos in August 2025 was overstated. Most organizations covered by the GPAI provisions met the deadline through documentation that was in some cases substantive and in some cases boilerplate, and enforcement has been measured. The next phase will test whether this calm holds.
Section 9: A Concluding Observation
A year of evidence supports a single observation that this report will hold lightly but durably: the most important thing that happened in the applied layer in the period is not visible in any single benchmark, vendor announcement, or quarterly earnings disclosure.
What happened is that a working theory of how to do this well has emerged from the public record. The theory is not original to this publication or to any one source. It is an aggregation of methodology, eval-driven development from Databricks engineering and from independent practitioners; context engineering from Anthropic’s engineering posts and from Atlan, Thoughtworks, and others; governance-as-delivery from NIST, the EU AI Office, and the Cloud Security Alliance; multi-model architecture from IDC’s research and from production deployments at Microsoft and others; the maturity ladder from this and other editorial publications working in adjacent territory.
The theory is not consensus. It is convergence. Different sources, working from different angles, are arriving at recognizably similar conclusions about what works and what does not.
Theories of practice converge before they harden. They harden before they ossify. The window in which the applied layer is being defined, by what works, by what is taught, by what is hired for, by what is funded, is the window enterprises currently occupy. The organizations doing this work seriously will define the terms by which future organizations evaluate their own work. The organizations not doing this work seriously will, in some number of cases, encounter the gap years from now and wonder when it opened.
The gap opened in the period. That is the lasting observation.
Footnotes
1. Stanford Institute for Human-Centered Artificial Intelligence, The 2026 AI Index Report (Stanford, CA: Stanford HAI, April 2026), https://hai.stanford.edu/ai-index/2026-ai-index-report.
2. MIT NANDA Initiative, The GenAI Divide: State of AI in Business 2025 (Cambridge, MA: MIT Media Lab, July 2025).
3. Anthropic, “Anthropic Economic Index Report: Uneven Geographic and Enterprise AI Adoption,” September 2025, https://www.anthropic.com/research/anthropic-economic-index-september-2025-report.
4. Anthropic, “Anthropic Raises $30 Billion in Series G Funding at $380 Billion Post-Money Valuation,” April 2026, https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation.
5. Sarah Friar (OpenAI CFO), “A Business That Scales with the Value of Intelligence,” OpenAI, January 2026, https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/.
6. Microsoft Corporation, Q3 FY2026 earnings disclosure and conference call transcript, April 29, 2026.
7. Recon Analytics, Microsoft Copilot accuracy NPS tracking data, July 2025-January 2026, as cited in Stackmatix and AI Business Weekly secondary reporting.
8. Databricks, “Evaluation-Driven Development Workflows: Best Practices and Real-World Scenarios,” Data + AI Summit 2025 conference session, June 2025.
9. Anonymous authors, “The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge,” presented at NeurIPS 2025 Workshop on Reliable ML from Unreliable Data, https://arxiv.org/pdf/2509.26072.
10. Independent practitioner cost analyses aggregating production deployment data, RAGFlow year-end review and byteiota analysis, late 2025.
11. Xinze Li, Yixin Cao, Yubo Ma, and Aixin Sun, “Long Context vs. RAG for LLMs: An Evaluation and Revisits,” arXiv:2501.01880, January 2025.
12. Tobias Lütke (Shopify CEO), internal memo published on X, April 7, 2025, https://x.com/tobi/status/1909251946235437514; CNBC and TechCrunch coverage; field reports including “Shopify’s AI Memo Was a Filter, Not a Productivity Play.”
13. Deloitte Insights, Tech Trends 2026: Agentic AI Strategy (Deloitte, 2025).
14. Gartner, “Why Agentic AI Projects Fail, and How to Set Yours Up for Success,” as cited in Harvard Business Review, October 2025.
15. Ashley Jain et al., “Agreeableness Bias in LLM-as-a-Judge,” October 2025, as discussed in survey on LLM-as-Judge Evaluation Techniques, Emergent Mind compilation.
16. Anonymous, “Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems,” arXiv:2512.11150, December 2025.
17. Qizheng Zhang et al., “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models,” arXiv:2510.04618, October 2025 (revised March 2026).
18. European Commission, “Timeline for the Implementation of the EU AI Act,” AI Act Service Desk, https://ai-act-service-desk.ec.europa.eu/en/ai-act/eu-ai-act-implementation-timeline.
19. Future of Life Institute, “Implementation Timeline | EU Artificial Intelligence Act,” https://artificialintelligenceact.eu/implementation-timeline/.
20. DLA Piper, “Latest Wave of Obligations Under the EU AI Act Take Effect: Key Considerations,” August 2025, https://www.dlapiper.com/en-us/insights/publications/2025/08/latest-wave-of-obligations-under-the-eu-ai-act-take-effect.
21. NIST, NIST IR 8596 preliminary draft “Cybersecurity Framework Profile for Artificial Intelligence (Cyber AI Profile),” December 2025; SP 800-53 Control Overlays for Securing AI Systems (COSAiS) concept paper, August 2025.
22. Cloud Security Alliance, “AI Controls Matrix (AICM) Bundle,” July 2025, https://cloudsecurityalliance.org.
23. Sebastian Siemiatkowski (Klarna CEO), interview with Bloomberg, May 2025; subsequent reporting by Entrepreneur, Fortune, Tech.co, FinTech Weekly, and the Klarna Group F-1/A-3 SEC filing dated September 2, 2025.
24. ICML 2025 Conference Proceedings, “LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs, No Silver Bullet for LC or RAG Routing,” May 2025, https://openreview.net/forum?id=CLF25dahgA.
25. Anthropic, “Introducing Contextual Retrieval,” September 2024.
26. O. Khattab et al., “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction” (foundational, with 2025 production-scale deployments documented in MUVERA and PLAID indexing literature, ICML/SIGIR 2025).
27. Pento Engineering, “A Year of MCP: From Internal Experiment to Industry Standard,” November 2025, https://www.pento.ai/blog/a-year-of-mcp-2025-review.
28. MCP Core Maintainers, “One Year of MCP: November 2025 Spec Release,” Model Context Protocol Blog, November 25, 2025, https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/.
29. The New Stack, “Why the Model Context Protocol Won,” 2025, https://thenewstack.io/why-the-model-context-protocol-won/.
30. CVE-2025-6514 (mcp-remote shell command injection); CVE-2025-49596 (Anthropic MCP Inspector RCE), as documented in MCP enterprise adoption analysis.
31. David Soria Parra and MCP Core Maintainers, “The 2026 MCP Roadmap,” March 9, 2026, https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/.
32. Marc Benioff (Salesforce CEO), public statements at World Economic Forum Davos January 2025 and Axios interview; Salesforce Agentforce product disclosures from September 2024 forward.
33. Joel Becker, Nate Rush, Beth Barnes, and David Rein, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR / arXiv:2507.09089, July 2025.
34. METR, “We Are Changing Our Developer Productivity Experiment Design,” February 24, 2026, https://metr.org/blog/2026-02-24-uplift-update/.
35. Simular AI, “Simular’s Computer-Use Agent Outperforms Humans,” December 16, 2025, https://www.simular.ai/articles/simulars-computer-use-agent-outperforms-humans.
36. ICML 2025, “OSWorld-Gold: Benchmarking the Efficiency of Computer-Use Agents,” https://icml.cc/virtual/2025/49774.
37. Mike A. Merrill et al., “Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces,” arXiv:2601.11868, 2026 (Stanford / Laude Institute).
38. Various, “RAG is Dead!” discourse, peaking February 2024 with Gemini 1.5 Pro release; representative summaries in Medium, LightOn, and AkitaOnRails analyses.
39. Yao Fu, public Twitter/X post on Gemini 1.5 Pro, February 2024; Zilliz analysis “Will RAG Be Killed by Long-Context LLMs?”.
40. DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” January 2025; subsequent V3.2 pricing disclosures.
41. Multiple investor and analyst publications, including Generative Value newsletter and Amadeus Capital “AI Commoditisation Curve” analyses, 2024-2025.
42. Stanford Institute for Human-Centered Artificial Intelligence, The 2025 AI Index Report (Stanford, CA: Stanford HAI, April 2025).
43. AInvest analysis of DeepSeek pricing impact on AI infrastructure economics, 2026.
44. Menlo Ventures, “2025: The State of Generative AI in the Enterprise,” https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/.
45. WinBuzzer, “Meta Pivots from Llama to Closed AI Models, Abandoning Open Source Roots,” December 9, 2025; Digitimes, December 11, 2025; SiliconANGLE, April 6, 2026.
46. Mark Zuckerberg, “Open Source AI Is the Path Forward,” Meta press release, October 2024.
47. Menlo Ventures, “2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics,” https://menlovc.com/perspective/2025-mid-year-llm-market-update/.
48. The New Stack, “Meta Abandons Open-Source Llama for Proprietary Muse Spark,” April 2026, https://thenewstack.io/meta-abandons-llama-spark/.
49. BentoML, “The Best Open-Source LLMs in 2026,” citing Epoch AI lag estimates, https://www.bentoml.com/blog/navigating-the-world-of-open-source-large-language-models.
50. RAGFlow, “From RAG to Context, A 2025 Year-End Review of RAG,” https://ragflow.io/blog/rag-review-2025-from-rag-to-context.
51. Anthropic, “Effective Context Engineering for AI Agents,” Engineering Blog, September 29, 2025, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.
52. Gartner, “Context Engineering Is In, and Prompt Engineering Is Out,” July 2025 declaration as cited in Atlan and FlowHunt practitioner guides.
53. Zhang et al., “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models,” arXiv:2510.04618 (NeurIPS 2025 / 2026 venue).
- Maxim AI and adjacent practitioner guides on golden datasets and eval-driven development workflows, 2025. ↩
- Scale Labs, “SWE-Bench Pro Leaderboard AI Coding Benchmark,” 2025, https://labs.scale.com/leaderboard/swe_bench_pro_public. ↩
- Vellum AI, “Claude Opus 4.7 Benchmarks Explained,” 2025, https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained. ↩↩↩
- Sushant Mehta, “Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems,” arXiv:2511.14136, November 2025. ↩
- NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023; AI 600-1 Generative AI Profile, July 2024. ↩↩
- Boston Consulting Group analysis of MCP integration complexity, 2025, as referenced in Deepak Gupta’s MCP enterprise adoption guide.
- Anthropic, “Anthropic Economic Index Report: Learning Curves,” March 2026, https://www.anthropic.com/research/economic-index-march-2026-report.
- PYMNTS, “Microsoft’s AI Bet Keeps Paying Off Across Cloud, Copilot and Code,” 2026, https://www.pymnts.com/earnings/2025/microsofts-ai-bet-keeps-paying-off-across-cloud-copilot-and-code/.
- First AI Movers, “Microsoft Copilot 2025: Model Options, Smart Routing & Enterprise Integration Guide,” December 2025.
- MyEngineeringPath, “AWS Bedrock vs Google Vertex AI: Cloud AI Platforms Compared (2026),” https://myengineeringpath.dev/tools/bedrock-vs-vertex-ai/.
- Bloomberg, “Anthropic Tops $30 Billion Run Rate, Seals Broadcom Deal,” April 6, 2026, https://www.bloomberg.com/news/articles/2026-04-06/broadcom-confirms-deal-to-ship-google-tpu-chips-to-anthropic.
- Stanford AI Index 2026, Economy chapter, citing Google capex disclosures.
- Anthropic on X, “Run-rate revenue surpasses $30 billion,” April 2026, https://x.com/AnthropicAI/status/2041275563466502560.
- OpenAI, “OpenAI Raises $122 Billion to Accelerate the Next Phase of AI,” https://openai.com/index/accelerating-the-next-phase-ai/.
- B EYE, “Databricks vs Snowflake 2025: The Complete Buyer’s Guide,” 2025, https://b-eye.com/blog/databricks-vs-snowflake-guide/.
- Snowflake, “Snowflake Expands Snowflake Intelligence and Cortex Code to Power the Control Plane for the Agentic Enterprise,” April 21, 2026, https://www.snowflake.com/en/news/press-releases/snowflake-expands-snowflake-intelligence-and-cortex-code-to-power-the-control-plane-for-the-agentic-enterprise/.
- Foundation Capital, “Databricks vs. Snowflake: What Their Rivalry Reveals About AI’s Future,” https://foundationcapital.com/databricks-vs-snowflake-what-their-rivalry-reveals-about-ais-future/.
- Klarna Group plc, Form F-1/A Amendment No. 3, U.S. Securities and Exchange Commission, September 2, 2025, https://www.sec.gov/Archives/edgar/data/2003292/000200329225000024/klarnagroupplcf-1a3.htm.
- Gartner predictions on customer service AI rehiring, as cited in the Vibe Graveyard postmortem and Fortune coverage.
- Stanford HAI, “AI Index 2025: State of AI in 10 Charts,” April 2025, https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts.
- Stanford HAI, “Inside the AI Index: 12 Takeaways from the 2026 Report,” April 2026, https://hai.stanford.edu/news/inside-the-ai-index-12-takeaways-from-the-2026-report.
- IBM Think, “Key Findings from Stanford’s 2025 AI Index Report,” April 2025, https://www.ibm.com/think/news/stanford-hai-2025-ai-index-report.
- METR Research page, https://metr.org/research/.
- Sean Goedecke, “METR’s AI Productivity Study Is Really Good,” 2025, https://www.seangoedecke.com/impact-of-ai-study/.
- Tianbao Xie et al., “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,” NeurIPS 2024 / arXiv:2404.07972.
- XLANG Lab, “Introducing OSWorld-Verified,” 2025, https://xlang.ai/blog/osworld-verified.
- Salesforce Q2 FY25 SEC filing, https://www.sec.gov/Archives/edgar/data/0001108524/000110852424000020/crm-q2fy25xexhibit991.htm.
- Snowflake Inc., Form 10-Q FY2025, U.S. Securities and Exchange Commission, July 2025, https://www.sec.gov/Archives/edgar/data/0001640147/000164014725000187/snow-20250731.htm.
- NIST, “AI Risk Management Framework,” official site, https://www.nist.gov/itl/ai-risk-management-framework.
- European Commission, “Guidelines for Providers of General-Purpose AI Models,” July 2025, https://digital-strategy.ec.europa.eu/en/policies/guidelines-gpai-providers.
- European Commission AI Act enforcement framework, “Enforcement of Chapter V Under the EU AI Act,” https://artificialintelligenceact.eu/enforcement-of-chapter-v-under-the-eu-ai-act/.
- European Parliament Research Service, “AI Act Implementation Timeline,” 2025, https://www.europarl.europa.eu/RegData/etudes/ATAG/2025/772906/EPRS_ATA(2025)772906_EN.pdf.
- Sacra, “OpenAI Revenue, Valuation & Funding,” accessed 2026, https://sacra.com/c/openai/.
- Sacra, “Anthropic Revenue, Valuation & Funding,” accessed 2026, https://sacra.com/c/anthropic/.
- TechCrunch, “Sources: Anthropic Could Raise a New $50B Round at a Valuation of $900B,” April 29, 2026, https://techcrunch.com/2026/04/29/sources-anthropic-could-raise-a-new-50b-round-at-a-valuation-of-900b/.
- OpenAI, “The State of Enterprise AI 2025 Report,” https://cdn.openai.com/pdf/7ef17d82-96bf-4dd1-9df2-228f7f377a29/the-state-of-enterprise-ai_2025-report.pdf.
- Wikipedia, “Model Context Protocol,” accessed April 2026, https://en.wikipedia.org/wiki/Model_Context_Protocol.
- IDC FutureScape 2026, “AI & Automation Predictions: The Future of AI Is Model Routing,” https://www.idc.com/resource-center/blog/the-future-of-ai-is-model-routing/.
- Thoughtworks Technology Radar Vol. 33, “The Model Context Protocol’s Impact on 2025,” November 2025, https://www.thoughtworks.com/en-us/insights/blog/generative-ai/model-context-protocol-mcp-impact-2025.
- Stanford HAI, 2026 AI Index Report, Economy chapter, https://hai.stanford.edu/ai-index/2026-ai-index-report/economy.
- Lightcast / Stanford, “The Stanford AI Index Report 2026: Labor Market Data,” https://lightcast.io/resources/research/stanford-ai-index-2026.
- IEEE Spectrum, “Stanford’s AI Index for 2026 Shows the State of AI,” 2026, https://spectrum.ieee.org/state-of-ai-index-2026.
- Domenic Denicola, “My Participation in the METR AI Productivity Study,” https://domenic.me/metr-ai-productivity/.
- GAICC, “NIST AI Risk Management Framework: A Complete Guide for US Organisations,” https://gaicc.org/blog/nist-ai-risk-management-framework/.
- Mirantis, “Securing Model Context Protocol for Mass Enterprise Adoption,” https://www.mirantis.com/blog/securing-model-context-protocol-for-mass-enterprise-adoption/.
- Snorkel AI, “Terminal-Bench 2.0: Raising the Bar for AI Agent Evaluation,” November 7, 2025, https://snorkel.ai/blog/terminal-bench-2-0-raising-the-bar-for-ai-agent-evaluation/.
- AI Native Dev, “Terminal-Bench: Benchmarking AI Agents on CLI Tasks,” 2025, https://ainativedev.io/news/terminal-bench-benchmarking-ai-agents-on-cli-tasks.
- PYMNTS, “Klarna Targets $14B Valuation as It Readies IPO,” September 2025, https://www.pymnts.com/news/ipo/2025/klarna-targets-14-billion-dollar-valuation-disruptive-brand-readies-ipo/.
- Morningstar, “What’s Behind Klarna’s $14 Billion IPO Valuation?” September 2025, https://www.morningstar.com/stocks/whats-behind-klarnas-14-billion-ipo-valuation.
- Internative, “Klarna’s AI Reversal: A Postmortem in 3 Lessons,” 2025, https://internative.net/insights/blog/klarna-ai-reversal-postmortem.
- Atlan, “What Is Context Engineering? Complete 2026 Guide,” https://atlan.com/know/what-is-context-engineering/.
- Architecture & Governance Magazine, “Stop Marrying Your Model: Why Enterprise AI Needs a Multi-Model Architecture,” March 31, 2026, https://www.architectureandgovernance.com/modeling/stop-marrying-your-model-why-enterprise-ai-needs-a-multi-model-architecture/.
- Beam AI, “The 19-Model Problem: Enterprise Multi-Model Orchestration,” https://beam.ai/agentic-insights/the-19-model-problem-why-enterprise-ai-is-moving-to-multi-model-orchestration.
- DeepLearning.AI, “Reasoning Models, Beginning with OpenAI’s o1 and DeepSeek’s R1, Transformed the Industry,” https://www.deeplearning.ai/the-batch/reasoning-models-beginning-with-openais-o1-and-deepseeks-r1-transformed-the-industry/.
- HBR, “Why Agentic AI Projects Fail, and How to Set Yours Up for Success,” October 2025, https://hbr.org/2025/10/why-agentic-ai-projects-fail-and-how-to-set-yours-up-for-success.