The Applied Layer
Research

Production AI Architecture: Patterns That Distinguish Production from Demo


2 May 2026 · 41 min read · 9,900 words · Architecture & Patterns

Cite as: The Applied Layer. (2026). Production AI Architecture: Patterns That Distinguish Production from Demo. The Applied Layer. https://appliedlayer-ai.com/research/production-ai-architecture-patterns

Cover image: engraving plate, public domain. See manifest for source attribution.

Executive Summary

By 2026, enterprise AI systems are no longer differentiated primarily by which large language model they use. The frontier models (Anthropic’s Claude Opus 4.7, OpenAI’s GPT-5.2, Google’s Gemini 3 Pro) are converging on capability for the median enterprise workload. What separates production-grade systems from impressive demos is the architecture wrapped around the model: how documents are chunked and indexed, how queries are understood and rewritten, how candidates are reranked, how tools are exposed, how agents are orchestrated, how outputs are evaluated, and how failures are detected and recovered.

This report surveys the architectural patterns that, in our assessment, define this gap. Part A documents retrieval architecture, from naive RAG to hybrid retrieval, query rewriting, reranking, and hierarchical and graph-based approaches. Part B addresses agentic patterns through a five-tier maturity taxonomy: deterministic-with-LLM-glue (Tier 1), tool-using single agents (Tier 2), multi-step planners (Tier 3), multi-agent systems (Tier 4), and fully autonomous agents (Tier 5). It documents which tiers are production-ready for which workloads and which remain experimental.

The report’s central thesis is that production AI quality is determined by architectural composition, not model selection; that naive RAG and naive agentic patterns fail in production for predictable reasons; that agentic patterns sit on a maturity ladder where Tiers 1-2 are broadly production-ready and Tiers 3-5 are workload-specific and frequently over-claimed; and that the choice between patterns can be made deterministically given a workload’s corpus characteristics, query complexity, latency budget, error tolerance, and verifiability surface. A unified decision framework synthesizes the evidence into explicit criteria architects can apply.

1. Why Retrieval Performance Varies

In early 2026, two enterprises in adjacent industries deployed customer-support assistants on the same underlying model, Claude Sonnet 4.5 (claude-sonnet-4-5-20250929). Both organizations indexed comparable corpora: a few hundred thousand support documents, internal runbooks, and product manuals. One organization, a fintech using LangGraph for orchestration and a hybrid retrieval pipeline with cross-encoder reranking, reported full end-to-end resolution rates well above 50% within months of launch, in line with public comparables: Klarna reported 67% resolution and an estimated $40 million USD profit improvement for 2024 from its OpenAI-powered assistant, and Anthropic’s own published case via Intercom Fin reported a 50.8% resolution rate, with the AI involved in 96% of conversations, within roughly a month of deployment. The other organization, with a naive embed-and-retrieve pipeline, hovered around the figure that has become widely cited in production retrospectives: roughly 60% retrieval accuracy, meaning that approximately 40% of generated answers were grounded in irrelevant or partially-relevant documents.123

That spread (same model, same corpus class, materially different outcomes) is the thesis of this report in miniature. It is also consistent with the emerging consensus among practitioners writing engineering retrospectives: production RAG systems fail at the retrieval step far more often than at generation, and the variance between systems is dominated by architectural choices made before the language model is invoked at all.34

The variance has structural roots. A retrieval system is a composition of decisions: how documents are parsed, how they are segmented (chunked), which embedding model encodes them, whether sparse signals (BM25) are also indexed, how queries are interpreted and possibly rewritten, how candidates from multiple retrievers are fused, whether a reranker re-scores the candidate set, and how the final passages are packaged for the generator. Each decision compounds. A naive pipeline that uses fixed-size chunks with a generic embedding model and pure vector retrieval can answer a wide range of queries acceptably for a demo. The same pipeline, when faced with a corpus containing entity-heavy queries (product SKUs, error codes, customer IDs, version strings), specialized vocabulary, or multi-hop questions, will degrade unpredictably.45

The architectural perspective also explains why model upgrades produce smaller-than-expected gains in retrieval-bound systems. Replacing GPT-4 with Claude Sonnet 4.5 in a system whose retriever returns the wrong documents 40% of the time will not move the resolution rate by more than a few percentage points: the bottleneck is upstream of the model. Conversely, the same model upgrade in a system with a well-tuned hybrid retriever, query rewriter, and cross-encoder reranker can produce double-digit improvements, because the model is finally being given the right context to reason over.

For the rest of Part A, the report adopts a structural lens. We treat each retrieval pattern as a layer that can be present or absent, well-tuned or poorly tuned, and we document, with public benchmark and engineering-blog evidence, what each layer contributes, what it costs, and when it earns its place.

[Figure 1: The architectural pattern landscape, schematic showing retrieval and agentic layers as composable decisions, from document parsing through chunking, embedding, hybrid retrieval, query rewriting, reranking, generation, and (where applicable) agentic orchestration.]

Caption: Production AI systems are compositions of layered architectural choices. Each layer can be omitted, but omissions compound: pipelines that skip query understanding, hybrid retrieval, and reranking simultaneously typically perform far worse than pipelines that include all three.


2. The Naive Baseline and Why It Fails

“Naive RAG” is a term of art rather than a technical specification, but in practice it refers to a recognizable pipeline introduced in the original Retrieval-Augmented Generation paper of Lewis et al. (2020): documents are partitioned into fixed-size chunks, each chunk is encoded via a dense embedding model, the chunks are stored in a vector index, and at query time the user’s query is encoded with the same embedding model and the top-k nearest neighbors are retrieved and concatenated into the model’s prompt.6 In its simplest implementations (a vector database, an embedding model, and a generator), the pipeline can be assembled in a weekend.
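
To make the baseline concrete, here is a minimal sketch of the naive pipeline: fixed-size chunking, a single dense embedding model, and top-k cosine retrieval over an in-memory index. It is illustrative only; the model name, chunk size, and numpy index are assumptions standing in for whatever a team would actually deploy.

import numpy as np
from sentence_transformers import SentenceTransformer

CHUNK_TOKENS = 512  # fixed-size chunking, whitespace tokens as a crude proxy
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose embedding model

def chunk(text: str, size: int = CHUNK_TOKENS) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = model.encode(chunks, normalize_embeddings=True)  # unit vectors: dot product = cosine
    return chunks, np.asarray(vectors)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vectors @ q)[::-1][:k]  # top-k nearest neighbors by cosine similarity
    return [chunks[i] for i in top]  # concatenated into the generator's prompt downstream

Every failure mode discussed in this section traces back to a decision this sketch makes implicitly: where the chunk boundaries fall, which embedding space is used, and the absence of any sparse signal, reranking, or evaluation.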

The pattern works for demos for a specific reason: demos are typically over corpora that have been pre-curated to match the queries used in the demo. The questions are paraphrases of statements that exist verbatim in the corpus. Embedding similarity is sufficient because the semantic gap between query and answer is small. The pattern also works for a class of production workloads where this property is naturally satisfied: internal Q&A over a small, well-edited knowledge base where users phrase questions the way the corpus is written.

The pattern fails in production for predictable reasons, documented in retrospectives from teams at Neo4j, Databricks, Redis, Vespa, and Weaviate.4578 Five failure modes recur:

First, vector-only retrieval is semantically robust but lexically fragile. Pure dense retrieval can match “transaction error recovery” to a query about “payment failures,” which is the desired behavior. It can also fail to retrieve a document containing “iPhone 15 Pro Max 256GB Space Black” when the user types that string verbatim, because the embedding of the long entity-string is not necessarily the closest neighbor of itself in a Wikipedia-trained embedding space. Production corpora are full of such entity strings: error codes, SKUs, version numbers, customer identifiers, function names. The original BEIR benchmark of Thakur et al. (2021) demonstrated that BM25, the canonical sparse method described in Robertson and Zaragoza’s Probabilistic Relevance Framework, remains a robust zero-shot baseline that dense retrievers do not consistently dominate, and that the strongest systems on average are reranking-based or late-interaction architectures, “however, at high computational costs.”910

Second, fixed-size chunking cuts across structure. A chunk that begins mid-paragraph and ends mid-table loses the context required to interpret either fragment. Engineering retrospectives at Redis and Databricks both document this failure mode explicitly: chunks that ignore document boundaries (sections, headings, lists, tables) systematically degrade retrieval quality.711

Third, cosine similarity rewards semantic proximity, not usefulness. A passage that mentions every word in the query but does not answer it can rank above a passage that answers the question in different words. Without a reranking stage, the top-k passages passed to the model are noisy.12

Fourth, single-pass retrieval cannot handle multi-hop questions. Questions that require composing facts from multiple documents (“Which board members of S&P 500 IT companies also sit on healthcare boards?”) do not have a single document whose embedding is close to the query embedding. Without query decomposition or graph-based traversal, naive RAG silently returns plausible-looking but incomplete context.

Fifth, there is no evaluation loop. The pipeline returns answers; there is no measurement of whether the right documents were retrieved or whether the answer is grounded in them. The system degrades silently as the corpus grows and queries diversify.

In our assessment, naive RAG remains a useful starting point for prototyping and for narrow internal use cases over small, well-edited corpora. It is rarely the right architecture for a customer-facing or revenue-bearing production system over a corpus larger than tens of thousands of documents.


3. Chunking and Embedding Strategy

Chunking is the most underestimated decision in retrieval architecture. It happens once, at indexing time, and its consequences propagate to every subsequent query. Yet most production teams adopt the default settings of their framework, typically 512 or 1,000 tokens with 10-20% overlap, without auditing whether those defaults match their corpus.

Three chunking families dominate practice. Fixed-size chunking splits text into equal blocks by character or token count. It is simple, fast, parallelizable, and predictable. It is also blind to document structure: a chunk boundary can fall mid-sentence, mid-table, or mid-code block. Engineering blogs from Redis, Databricks, and Unstructured.io document the resulting quality loss for technical and structured corpora.71113 Semantic chunking computes embeddings for adjacent sentences and places boundaries where similarity drops below a threshold, on the theory that topic shifts mark natural breakpoints. Empirical studies on biomedical and clinical corpora suggest semantic chunking can outperform fixed-size methods on retrieval precision, but with computational overhead that is often substantial relative to the marginal quality gain.14 Structural chunking, sometimes called document-aware or hierarchical chunking, uses parsers to detect document elements (headings, paragraphs, lists, tables, code blocks, captions) and chunks at element boundaries, with metadata preserved. Tools like Docling and Unstructured.io implement this approach explicitly.13

In our assessment, structural chunking is the right default for most enterprise corpora. Fixed-size with sensible overlap is acceptable for unstructured text (transcripts, notes, free-form writing). Semantic chunking earns its added cost only when corpora are heterogeneous, poorly segmented, and high-value enough to justify the embedding overhead at indexing time.
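
A minimal sketch of the structural approach, assuming markdown-style headings and using word count as a rough token proxy, looks roughly like the following; a production implementation would delegate parsing to a tool such as Docling or Unstructured.io rather than a regular expression, and would count tokens with the embedding model's own tokenizer.

import re

MAX_TOKENS = 512  # assumed per-chunk budget (word count as a rough proxy for tokens)

def structural_chunks(doc_text: str, source_uri: str) -> list[dict]:
    # Break the document before each markdown heading so chunks respect section boundaries.
    sections = re.split(r"(?m)^(?=#{1,6} )", doc_text)
    chunks = []
    for section in filter(str.strip, sections):
        heading = section.splitlines()[0].lstrip("# ").strip()
        words = section.split()
        # Fall back to fixed-size splitting only when a single section exceeds the budget.
        for i in range(0, len(words), MAX_TOKENS):
            chunks.append({
                "text": " ".join(words[i:i + MAX_TOKENS]),
                "metadata": {"source": source_uri, "section": heading},
            })
    return chunks

The metadata dictionary is what later layers (filtering, reranking, citation) rely on; dropping it at indexing time is difficult to recover from.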

The companion decision is embedding model choice. The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff et al. (2022), has become the standard reference for embedding quality, with the maintained leaderboard now covering more than 1,000 languages and including domain-specialized splits.15 As of MTEB’s 2026 leaderboard rotation, Google’s Gemini Embedding 2 reported the leading retrieval score (~67.71 MTEB retrieval), with strong showings from Cohere embed-v4 (~65.2), OpenAI text-embedding-3-large (~64.6), Voyage’s v4 family, and open-weight options including Qwen3-Embedding-8B and BGE-M3 (~63.0).1617 The MTEB authors themselves caution that no single embedding model dominates across all tasks; their original paper demonstrated that performance varies substantially by domain and task type, and the framework’s maintainers have published extensive analyses of reproducibility and benchmark drift.1518

Three operational implications follow. First, the practical decision is rarely “best on MTEB” but “best on my corpus”: small-scale evaluation with held-out queries and labels (or LLM-judged surrogates) on a representative sample is more diagnostic than leaderboard position. Second, dimensionality matters: 3,072-dimensional embeddings cost more to store and search than 768-dimensional ones, and Matryoshka-style truncation is a common cost-quality lever. Third, domain adaptation (fine-tuning on a few thousand in-domain query-document pairs) frequently outperforms switching to a larger generic model, particularly in specialized domains (legal, biomedical, code).
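
A corpus-specific evaluation need not be elaborate. The sketch below computes recall@k over a small held-out set of labelled queries; search is a placeholder for whichever retriever or embedding model is under comparison, and the label format is an assumption for illustration.

def recall_at_k(search, labelled_queries: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of queries for which at least one known-relevant chunk appears in the top k."""
    hits = 0
    for query, relevant_ids in labelled_queries.items():
        retrieved_ids = {hit["id"] for hit in search(query, k=k)}
        if retrieved_ids & relevant_ids:  # at least one relevant chunk surfaced
            hits += 1
    return hits / len(labelled_queries)

Running the same labelled set against each candidate embedding model (or chunking configuration) gives a more decision-relevant number than any leaderboard position.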

The cost-quality frontier across chunking and embedding choices is steep but tractable. In our assessment, the median enterprise team should default to structural chunking with token-budget enforcement, a current-generation general-purpose embedding model in the 768-1,024 dimension range, domain adaptation if the corpus exceeds a million documents, and explicit metadata preservation (source URI, section heading, document type, last-modified date) so downstream layers can filter and rerank.


4. Hybrid Retrieval

The single highest-ROI architectural change to a naive RAG pipeline, by repeated independent observation, is the addition of sparse retrieval running in parallel with dense retrieval, with results merged via Reciprocal Rank Fusion (RRF). This is the configuration that engineering blogs from Weaviate, Vespa, Pinecone, Elastic, and Qdrant all document and recommend, with broadly consistent quantitative claims.81920

The argument is structural. Sparse retrieval (BM25, the algorithm formalized in Robertson and Zaragoza’s Probabilistic Relevance Framework) scores documents based on exact term overlap weighted by term frequency and inverse document frequency.10 It is deterministic, interpretable, requires no training data, and excels at exact-match queries: error codes, function names, product identifiers, rare terminology. It fails on semantic paraphrase: “how does the system handle payment failures?” cannot match a document titled “transaction error recovery process” if there is no lexical overlap.

Dense retrieval is the mirror image. It excels at semantic matching, generalizes across paraphrases, and handles synonymy. It fails on entity-heavy queries: long product strings, specific identifiers, and rare tokens are poorly represented in embedding spaces optimized for distributional semantics. The BEIR benchmark explicitly documents this asymmetry: BM25 is a robust baseline that dense retrievers do not uniformly beat, particularly in zero-shot, out-of-domain settings.9

Hybrid retrieval runs both retrievers in parallel and merges their ranked lists. The dominant fusion algorithm in production today is Reciprocal Rank Fusion, which scores each document by the sum of 1/(k + rank_i) across each retriever’s ranked list. RRF has the operational advantage of ignoring the absolute scores from each retriever (which live on different scales) and using only ranks, which makes it robust to retriever drift and avoids the calibration problems of weighted-score fusion.819 Weaviate’s blog explicitly walks through this calculation; Vespa’s published BEIR experiments show their hybrid Vespa-ColBERT plus BM25 combination outperforming the strong Vespa BM25 baseline on 12 of 13 BEIR datasets, improving average nDCG@10 from 0.453 to 0.481.820
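
The fusion step itself is a few lines of code. The sketch below applies RRF to the ranked document-ID lists returned by the sparse and dense retrievers; the smoothing constant k = 60 is the conventional default, and both it and the cutoff are tunable assumptions.

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Merge ranked lists using only ranks, never the retrievers' incompatible raw scores."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Typical usage: fuse the top-50 from each retriever, pass the fused top-20 downstream.
# fused_ids = rrf_fuse([bm25_top50_ids, dense_top50_ids])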

The implementation surface varies by vector database. Weaviate, Qdrant, and Elasticsearch support hybrid search natively with RRF; Pinecone offers sparse-dense hybrid with separate indexing; pgvector users typically implement RRF in application code; Vespa’s ranking framework supports BM25, dense, and ColBERT-style late interaction in a single declarative profile.192021

The latency cost of hybrid retrieval is usually modest. Sparse and dense retrievers run in parallel; the merge step is a constant-time sort over the top-k from each. The infrastructure cost of maintaining two indexes is real, but it is dominated by the dense index in most deployments. In our assessment, hybrid retrieval should be the default configuration for any corpus larger than a few thousand documents that contains entity-heavy queries, code, technical specifications, identifiers, or specialized vocabulary, which is to say nearly every enterprise corpus.

The pattern’s limit is that it improves recall and exposure of relevant documents in the candidate set; it does not solve the problem of ranking those candidates correctly for the generator. That is the role of reranking, addressed in Section 6.


5. Query Understanding and Rewriting

The naive assumption that the user’s query is a good retrieval key holds less often than practitioners expect. Users phrase queries tersely, ambiguously, in jargon, or with intent that does not map cleanly to the lexical or semantic structure of the corpus. Query understanding patterns interpose a transformation between the user’s input and the retrieval system, with the goal of producing one or more retrieval-friendly variants.

Four families of query rewriting earn discussion: query expansion, query decomposition, hypothetical document embeddings (HyDE), and step-back prompting.

Query expansion generates synonyms, paraphrases, or related terms and issues the expanded query to the retriever. It is the oldest pattern in the family, with a long history in classical IR. In modern RAG, expansion is typically performed by the LLM itself (“rewrite this query to include synonyms and likely terminology”), then the expanded query is issued. The benefit is robustness to vocabulary mismatch; the cost is one extra LLM call (~200ms-1s of latency) and a small amount of generation cost.

Query decomposition splits a multi-hop or compound query into simpler sub-queries, retrieves for each, and fuses results. It is essential for questions like “Which board members of S&P 500 IT companies also sit on healthcare boards?”, because a single embedding cannot represent both halves of the question simultaneously. Decomposition adds latency proportional to the number of sub-queries.

HyDE, Hypothetical Document Embeddings, introduced by Gao et al. (2022), instructs the LLM to generate a hypothetical answer to the query, embeds the hypothetical answer, and retrieves documents whose embeddings are nearest to the hypothesized answer.22 The premise is that the embedding space of answers is a better retrieval key for answers than the embedding space of questions. The original paper reported that HyDE significantly outperformed the unsupervised dense retriever Contriever and was competitive with fine-tuned retrievers on TREC and BEIR, for instance, achieving nDCG@10 of 61.3 on TREC DL-20 versus 44.5 for Contriever, in the original paper’s reported configurations.22 The pattern’s caveats are documented in subsequent work: HyDE adds 200-600ms of latency for the hypothetical generation step, can hallucinate when the LLM lacks domain knowledge, and may underperform when the query is already well-specified. Independent evaluations have noted both the latency overhead and increased hallucination on personal or temporally specific queries.23
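
Schematically, HyDE is a thin wrapper around the existing dense retriever. In the sketch below, generate, embed, and dense_search are placeholders for the deployment's own LLM call, embedding model, and vector search, not any specific library API.

HYDE_PROMPT = (
    "Write a short passage that plausibly answers the question below, "
    "as it might appear in our documentation.\n\nQuestion: {query}\n\nPassage:"
)

def hyde_retrieve(query: str, generate, embed, dense_search, k: int = 10) -> list[dict]:
    hypothetical = generate(HYDE_PROMPT.format(query=query))  # one extra LLM call (~200-600ms)
    return dense_search(embed(hypothetical), k=k)  # nearest neighbors of the hypothetical answer

Downstream stages should still score candidates against the original user query rather than the hypothetical passage, a point Section 6 returns to.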

Step-back prompting, introduced by Zheng et al. of Google DeepMind (2023), instructs the model to first abstract the query to a higher-level concept or principle, retrieve at that level, then reason with the abstraction. The original paper reported PaLM-2L gains on MMLU Physics and Chemistry of 7% and 11%, on TimeQA of 27% (revised to 34% in the v2 paper), and on MuSiQue of 7%, substantial improvements on knowledge-intensive and multi-hop reasoning benchmarks.24 Step-back is most valuable when the corpus is structured around general principles (textbooks, regulatory frameworks, scientific literature) and the query is a specific instance.

The decision criterion for query rewriting is latency budget. Each rewriting pattern adds an LLM call. For interactive applications with sub-second latency budgets, only lightweight patterns (expansion via small models, or no rewriting) are viable. For asynchronous workloads (research assistants, batch analysis, ticket triage), the latency cost is acceptable and the quality gains can be substantial. In our assessment, HyDE and step-back are workload-specific: deploy HyDE when retrieval recall is the bottleneck on knowledge-intensive QA; deploy step-back when the corpus is organized around abstractions and the queries are concrete instances; use decomposition for multi-hop. Default expansion is a reasonable always-on tactic for the median enterprise pipeline.


6. Reranking

Reranking is the second-stage scoring layer that re-orders candidate passages from the first-stage retriever. It is, in our assessment, the single highest-ROI improvement available to a hybrid-retrieval pipeline, with multiple independent practitioner reports of 15-35% precision improvements when added on top of a hybrid retriever.2526

Two architectures dominate.

Cross-encoder rerankers score each (query, candidate) pair by jointly encoding both through a transformer with full attention between the two, producing a single relevance score. Because every (query, candidate) pair must be encoded jointly at query time, cross-encoder scores cannot be precomputed, which makes them more expensive than embedding-based retrieval but qualitatively more accurate. The canonical open-weight reranker is the ms-marco MiniLM cross-encoder family, with models like BGE-reranker-v2-m3 (51.8 nDCG@10 on BEIR at 278M parameters) representing strong open-source baselines.27 Cohere’s commercial offering, Cohere Rerank, has progressed through multiple generations; Cohere reports that Rerank 4 Pro improves by +170 ELO over v3.5, with +400 ELO on business and finance tasks (Cohere internal benchmarks; pair with independent evaluation).25

LLM-as-judge reranking uses a generative LLM, prompted with the query and candidates, to output relevance scores or rankings directly. It is more flexible (can incorporate task-specific rubrics and complex relevance definitions) and frequently more accurate on hard cases, but materially slower and more expensive than cross-encoder reranking. In a practical pipeline, LLM-as-judge reranking is typically used at top-5 to top-10, after a cross-encoder has reduced the candidate pool from 50-100 to a manageable size.

The cost-quality ladder is well-documented. A typical recommendation, repeated across engineering blogs from Cohere, ZeroEntropy, Vespa, and Weaviate, is: retrieve 50-100 candidates with hybrid search, rerank to 10 with a cross-encoder, optionally LLM-rerank to top 3-5 if the application is high-value and tolerates added latency.2526 Cross-encoder latency on CPU for the BGE family is roughly 8ms per pair on a single thread, scaling to ~130ms for a batch of 16 pairs; on GPU it is sub-millisecond per pair. Cohere Rerank typically adds 100-300ms via API call.2627

A commonly underestimated failure mode: when a system uses query rewriting (HyDE, expansion) before retrieval, the reranker should score against the original user query, not the rewritten variant. The reranker is measuring relevance to what the user asked, not to the system’s internal reformulation. This pitfall is documented in production engineering retrospectives.25
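
A minimal two-stage sketch, assuming the open-weight ms-marco MiniLM cross-encoder loaded through the sentence-transformers library, illustrates both the retrieve-then-rerank flow and the original-query rule above.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed open-weight reranker

def rerank(original_query: str, candidates: list[dict], top_n: int = 10) -> list[dict]:
    # Score against the ORIGINAL user query, even if a rewritten query drove first-stage retrieval.
    pairs = [(original_query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one joint query-passage forward pass per candidate
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

In the typical configuration described above, candidates would be the 50-100 documents from hybrid retrieval, and the top 3-10 reranked passages would go to the generator or to an optional LLM-as-judge pass.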

In our assessment, cross-encoder reranking is the right default when retrieval candidate pools exceed ~20 documents and the application can tolerate 50-300ms of added latency. LLM-as-judge reranking is justified for high-value, low-volume workloads (legal review, medical literature search, regulatory compliance) where added cost per query is amortizable.


7. Hierarchical and Graph Approaches

Two recent retrieval architectures deserve careful, evidence-grounded treatment because both have been heavily marketed and both have non-trivial limitations: RAPTOR and GraphRAG.

RAPTOR, Recursive Abstractive Processing for Tree-Organized Retrieval, published by Sarthi et al. of Stanford at ICLR 2024, addresses a structural limitation of flat chunked retrieval: the inability to integrate context across long documents. RAPTOR builds a hierarchical tree by recursively clustering chunks and summarizing each cluster, producing summary nodes at progressively higher levels of abstraction. At query time, retrieval can pull from leaf chunks, intermediate summaries, or root-level summaries, integrating information across the hierarchy. The original paper reported state-of-the-art results on QASPER, NarrativeQA, and QuALITY; coupling RAPTOR with GPT-4 improved best performance on the QuALITY benchmark by 20% absolute accuracy in their reported configuration.28

In our assessment, RAPTOR earns adoption when the corpus consists of long documents (books, lengthy reports, technical manuals) where queries can require either fine-grained detail or global synthesis. It pays a substantial indexing-time cost (every node above the leaves requires an LLM call to summarize) and adds complexity to the retrieval logic. For corpora of short or homogeneous documents, the cost rarely justifies the benefit.

GraphRAG, introduced by Edge et al. of Microsoft Research in April 2024 (arXiv:2404.16130, with v2 published February 2025), takes a different approach: it uses an LLM to extract a knowledge graph from the corpus, applies community-detection algorithms (Leiden) to partition the graph, pre-generates summaries for each community, and at query time produces “global” answers by composing community summaries.29 Microsoft’s paper claims GraphRAG “leads to substantial improvements over a conventional RAG baseline for both the comprehensiveness and diversity of generated answers” on global sensemaking queries over million-token corpora.29

These claims warrant careful framing. The Microsoft paper’s evaluation methodology relies heavily on LLM-as-judge comparisons of answer comprehensiveness and diversity, useful but not equivalent to ground-truth retrieval metrics. Independent evaluations have raised pointed concerns. A 2025 evaluation framework paper, “How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG,” argues that current GraphRAG evaluation suffers from “unrelated questions” (questions generated from vague summaries that do not require fine-grained graph retrieval) and “evaluation biases” (position, length, and trial bias in LLM-as-judge), and concludes that “performance gains are much more moderate than reported previously” when these biases are corrected.30 A separate systematic comparison, “RAG vs. GraphRAG: A Systematic Evaluation,” found that “RAG is consistently effective for single-hop, detail-oriented queries that require precise evidence, whereas GraphRAG is more advantageous for multi-hop, reasoning-intensive QA and tends to produce more corpus-level, diverse summaries,” and explicitly flagged practical challenges including “incomplete or noisy graph construction, additional computation and storage overhead, and evaluation artifacts such as position effects in LLM-as-Judge for summarization.”31

In our assessment, the honest reading is: GraphRAG has demonstrated value for global sensemaking and corpus-level synthesis tasks; it is not a general replacement for hybrid retrieval; vendor benchmarks should be paired with the independent evaluations cited above; and the indexing cost (LLM calls to extract entities and relations across the entire corpus, plus community summarization) is substantial, frequently 10-100x the cost of vector indexing the same corpus. For most enterprise workloads, GraphRAG is worth piloting only when the workload genuinely requires multi-document synthesis (“what are the main themes in this dataset?”) rather than precise document retrieval.

Parent-child retrieval (sometimes called sentence-window retrieval) is a simpler hierarchical pattern: index small chunks for retrieval precision but return larger parent chunks for context. It is cheap to implement, well-supported by frameworks like LlamaIndex, and produces consistent gains on technical corpora where local context matters. In our assessment, parent-child retrieval is a reasonable default to combine with hybrid retrieval and reranking.
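
The pattern reduces to a lookup from child chunks to their parents. The sketch below is a plain-Python illustration rather than the LlamaIndex implementation; child_search and the hit format are assumptions.

def retrieve_parents(query: str, child_search, parents: dict[str, str],
                     k: int = 20, max_parents: int = 5) -> list[str]:
    children = child_search(query, k=k)  # small chunks indexed for retrieval precision
    seen, context = set(), []
    for child in children:
        pid = child["parent_id"]  # each child chunk stores the id of its enclosing parent
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])  # return the larger parent passage for generation
        if len(context) == max_parents:
            break
    return context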

[Figure 2: Performance vs cost frontier, schematic scatter plot positioning naive RAG, hybrid retrieval, hybrid+rerank, hybrid+rerank+rewriting, parent-child, RAPTOR, and GraphRAG along axes of relative quality (y) and relative cost/latency (x).]

Caption: Architectural patterns occupy a Pareto frontier. Hybrid retrieval with reranking provides the largest quality gains for the lowest marginal cost; RAPTOR and GraphRAG add capability at substantial indexing-time cost; LLM-as-judge reranking offers the highest quality at the highest per-query cost.


8. A Maturity Taxonomy of Agents

The agentic-AI discourse in 2026 conflates radically different system designs under the same vocabulary. Before assessing what works, the report introduces an explicit five-tier taxonomy that distinguishes patterns by their architectural commitments. The taxonomy is informed by Anthropic’s “Building Effective Agents” engineering note, which draws an architectural distinction between “workflows” (LLMs and tools orchestrated through predefined code paths) and “agents” (LLMs dynamically directing their own processes), and by Cognition Labs’ published essays on multi-agent architecture.323334

Tier 1: Deterministic with LLM Glue. The control flow is deterministic and authored by humans. The LLM is invoked at specific, scoped steps to perform tasks the model is reliably good at: classification, extraction, summarization, paraphrase, formatting. Failure modes are bounded; observability is straightforward; the system reduces to a conventional software pipeline with LLM nodes. This is the architecture Anthropic recommends starting with: “find the simplest solution possible, and only [increase] complexity when needed.”32

Tier 2: Tool-Using Single Agent. A single LLM is given a defined set of tools and a loop in which it can decide which tool to invoke, observe results, and continue or terminate. The canonical pattern is ReAct (Yao et al., 2022), in which the model interleaves reasoning traces with tool calls.35 Toolformer (Schick et al., 2023) demonstrated that models could be trained to use APIs autonomously.36 Tier 2 covers most of what production teams in 2026 actually deploy under the “agent” label: customer-support assistants that look up orders, code assistants that read files and run linters, internal copilots that query databases.
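
The architectural commitment of Tier 2 is visible in a few lines: a bounded loop in which the model either requests a tool call or returns a final answer. In the sketch below, call_model, the reply format, and the tool registry are placeholders, not any vendor's SDK.

def run_agent(user_message: str, call_model, tools: dict, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = call_model(messages, tools=list(tools))
        if reply["type"] == "final":  # the model chose to answer directly
            return reply["content"]
        result = tools[reply["tool"]](**reply["arguments"])  # execute the requested tool
        messages.append({"role": "tool", "name": reply["tool"], "content": str(result)})
    return "Escalating to a human: step budget exhausted."

The explicit step budget and the escalation fallback are what separate this loop from the unbounded recovery loops discussed in Section 9.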

Tier 3: Multi-Step Planning. A single agent or a planner+executor pair decomposes a goal into a multi-step plan, executes the plan, and may revise it based on intermediate results. Patterns include Plan-and-Solve prompting (Wang et al., ACL 2023), Tree of Thoughts (Yao et al., NeurIPS 2023), and the orchestrator-worker patterns described in Anthropic’s engineering writing.373832 Plan-and-Solve reported substantial gains over Zero-shot-CoT on arithmetic and commonsense reasoning benchmarks; Tree of Thoughts reported a 74% success rate on Game of 24 versus 4% for Chain-of-Thought with GPT-4, on the specific tasks the paper evaluated.3738 Production deployments include Cursor’s planning mode, Cognition’s Devin, and Replit Agent’s plan mode.394041

Tier 4: Multi-Agent. Multiple LLM-driven agents, typically with distinct roles or specializations, coordinate through messages. Frameworks include AutoGen (Wu et al., 2023; Microsoft Research), CrewAI, and LangGraph multi-agent constructs.424344 The architectural debate at Tier 4 is unsettled: Cognition’s Walden Yan argued in June 2025 that “multi-agent architectures lead to fragile systems due to poor context sharing and conflicting decisions” and recommended single-agent designs with read-only sub-agents only.45 Anthropic’s internal evaluations of their multi-agent Research feature reported that a Claude Opus 4 lead agent with Claude Sonnet 4 sub-agents outperformed single-agent Opus 4 by 90.2% on research evaluations, with the caveat that multi-agent systems “use approximately 15× more tokens than chats.”46 Cognition’s later blog “Multi-Agents: What’s Actually Working” softened the original position to “a narrower class works, where agents contribute intelligence while writes stay single-threaded.”47

Tier 5: Fully Autonomous. An agent or system of agents operates over long horizons with minimal human intervention, recovering from errors, adapting plans, and producing results from open-ended inputs. Computer-use agents (Anthropic Claude Computer Use, OpenAI Operator/CUA, Google’s Project Mariner) and autonomous coding agents (Cognition Devin in fully-autonomous mode, Replit Agent 3 with up to 200-minute sessions) sit at this tier.48495041 In our assessment, Tier 5 is workload-specific in 2026: best-in-class on narrow benchmarks (Claude Opus 4.7 reaching 78.0% on OSWorld-Verified versus 72.4% human-expert baseline as of April 2026, per Anthropic’s published system-card-tier data), but with substantial reliability gaps on real-world workflows.5152

[Figure 3: Agentic maturity taxonomy, five tiers, what each does, infrastructure required, capabilities and limits.]

Caption: The maturity ladder is not a strict progression: most production systems combine tiers (a Tier 1 deterministic backbone invoking Tier 2 sub-agents at specific steps). The tier numbers describe the architectural commitment of each component, not the complexity of the overall system.


9. Tiers 1 and 2: What Works in Production

Tier 1 and Tier 2 patterns account for the substantial majority of revenue-bearing production AI systems documented in named engineering retrospectives in 2025-2026. The pattern set is robust, the failure modes are well understood, and the operational practices for evaluation, observability, and recovery are mature.

Klarna’s AI Assistant, deployed February 2024 in partnership with OpenAI, is the most-cited Tier 1/2 reference deployment in fintech support. Klarna’s published numbers are striking: in the first month, the assistant handled 2.3 million conversations, equivalent to two-thirds of Klarna’s customer service volume; resolution time fell from 11 minutes to under 2 minutes; the company estimated a $40 million USD profit improvement for 2024.153 By 2025, Klarna re-architected on LangGraph and LangSmith, with a reported 80% reduction in average resolution time over the prior nine months, according to a published LangChain customer case.54 The honest second chapter, which Klarna’s CEO Sebastian Siemiatkowski publicly acknowledged in 2025, is that the company partially walked back the “AI replaces 700 agents” framing and reintroduced human-agent capacity for complex cases, with Siemiatkowski saying, in his own words, that “cost was a predominant evaluation factor when organizing this, what you end up having is lower quality.”55 In our assessment, the Klarna trajectory is representative: Tier 1/2 succeeds at ~50-67% resolution on routine queries; the remaining 33-50% require either Tier 3 patterns or human escalation, and underinvestment in escalation quality degrades CSAT.

Intercom Fin, the most widely deployed third-party customer-service agent, reports resolution rates that vary substantially by customer, according to self-attested customer numbers documented on Intercom’s own pages: Lightspeed’s VP of Global Support is quoted as saying Fin is “involved in 99% of conversations and successfully resolves up to 65% end-to-end”; Anthropic’s own deployment of Fin reportedly hit a 50.8% resolution rate within roughly a month; and Sharesies reported 70% resolution within 12 weeks across email and chat.256 These are vendor-attested numbers and should be paired with the consistent industry observation that resolution rate is highly sensitive to corpus quality, escalation logic, and the heterogeneity of the customer base. Industry averages cited by Intercom for Fin across its customer base have shifted from 41% to 51% over the past year as the product has improved.2

Shopify Sidekick, documented in Shopify Engineering’s 2025 ICML talk and accompanying blog post by Director of Applied ML Andrew McNamara, is a particularly transparent Tier 2 production retrospective. Shopify describes evolving Sidekick “from a simple tool-calling system into a sophisticated agentic platform,” with explicit discussion of architectural decisions including just-in-time tool instructions, LLM-based evaluation with ground-truth sets, and Group Relative Policy Optimization (GRPO) training.57 The blog explicitly addresses reward hacking failures encountered during training and the migration from “vibe testing” to statistically rigorous LLM-judge evaluation correlated with human review.

Other named Tier 1/2 deployments include Notion AI (writing and reasoning over a workspace), and GitHub Copilot Workspace (issue-to-PR automation grounded in repository context).

The Tier 1/2 failure modes documented across these deployments are consistent:

  • Tool selection errors. When the agent has more than ~10-15 tools, it begins to mis-route requests. Mitigations include Anthropic’s “code execution with MCP” pattern, which lets agents discover and load tool definitions on demand rather than receiving them all upfront, and Cursor’s MoE-routed approach.7139
  • Parameter hallucination. The agent calls a tool with structurally correct but semantically invented parameters (a non-existent customer ID, a malformed date range). Mitigations include strict input schemas (Pydantic models, JSON Schema validation), tool-result feedback loops, and Anthropic’s published guidance on “Writing tools for AI agents.”70
  • Recovery loops. The agent calls a tool, fails, calls again with a slightly different parameter, fails again, and eventually exhausts its budget without resolving. Mitigations include explicit retry budgets, escalation triggers on confidence thresholds, and Anthropic’s task-budget API parameter for capping agent loop spend (a minimal sketch of the schema-validation and retry-budget mitigations follows this list).72
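
A minimal sketch of the schema-validation and retry-budget mitigations, assuming Pydantic v2; the field formats, the order-lookup tool, and the repair callback are invented for illustration.

from pydantic import BaseModel, Field, ValidationError

class OrderLookupArgs(BaseModel):
    customer_id: str = Field(pattern=r"^CUST-\d{8}$")  # reject hallucinated identifier formats
    order_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")

def call_with_budget(raw_args: dict, lookup_order, repair_args, max_retries: int = 2) -> dict:
    """Validate, execute, and retry at most max_retries times before escalating to a human."""
    for _ in range(max_retries + 1):
        try:
            args = OrderLookupArgs(**raw_args)  # schema validation before the tool ever runs
            return {"status": "ok", "result": lookup_order(args.customer_id, args.order_date)}
        except (ValidationError, RuntimeError) as err:
            raw_args = repair_args(raw_args, err)  # e.g. ask the model to correct its own call
    return {"status": "escalate", "reason": "retry budget exhausted", "args": raw_args}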

In our assessment, Tier 1/2 is production-ready for: customer support over well-curated knowledge bases; internal Q&A and copilots; structured data lookup and CRUD operations; document classification, extraction, and summarization at scale; and code assistance bounded to file-level edits with human review. It is not production-ready, without substantial additional engineering, for open-ended research, autonomous multi-application workflows, or any workload where the cost of an undetected error is high relative to the value of automation.


10. Tiers 3-5: An Honest Assessment

Tier 3, 4, and 5 patterns occupy the most over-claimed region of the agentic landscape. The published benchmark results are real but narrow; the field experience reported by practitioners, both directly and via independent reproductions, diverges from vendor headlines in instructive ways.

Tier 3 (multi-step planning) is production-ready for code, with caveats. SWE-bench Verified, the human-validated 500-issue subset of Jimenez et al.’s SWE-bench (ICLR 2024), has become the dominant benchmark for autonomous coding agents.58 As of April 2026, Anthropic reports that Claude Opus 4.7 reached 87.6% on SWE-bench Verified, up from 80.8% for Opus 4.6; Claude Sonnet 4.5 reached 77.2%; the trajectory reaches back to 48.5% for GPT-4 Turbo in November 2023.5152 Independent leaderboards (Epoch AI’s SWE-bench Verified tracker, Vals AI, and the official SWE-bench leaderboard) confirm steady progress, though with consistent caveats: Anthropic reports its own custom harness can add roughly ten percentage points over a generic harness, meaning headline numbers are not directly comparable across vendors without harness-controlled methodology.5859 SWE-bench Pro, released by Scale, evaluates frontier models on harder, multi-language enterprise tasks; top scores fall to roughly 23% on the public set and lower on the private subset, demonstrating that “verified” subsets understate generalization to private codebases.60

The Cognition Devin trajectory is instructive. The company’s March 2024 launch reported 13.86% end-to-end resolution on SWE-bench, well ahead of contemporaneous baselines of 1.96% unassisted and 4.80% assisted.40 Subsequent independent attempts to reproduce Devin’s claims have been complicated by the lack of verifiable benchmark scores on later versions.61 Field tests from independent reviewers in 2025 (one reviewer, for instance, documented Devin failing 14 of 20 tasks in their own workload) indicate substantial gaps between benchmark and field performance, consistent with the general observation that benchmark tasks under-represent the messiness of real codebases.62

The most consequential architectural debate at Tier 3-4 was Cognition’s June 2025 essay “Don’t Build Multi-Agents,” authored by Walden Yan, which argued that naive multi-agent setups produce “fragile systems due to poor context sharing and conflicting decisions” and recommended single-agent designs.45 The argument was grounded in two principles: agents must share full context including complete agent traces, and parallel actions create implicit conflicting decisions. Anthropic’s contemporaneous “How we built our multi-agent research system” report provided the counter-evidence: their multi-agent Research feature outperformed single-agent Opus 4 by 90.2% on internal research evaluations, with token usage explaining 80% of performance variance, but with the system using approximately 15× more tokens than standard chats.46 Cognition’s follow-up post a year later, “Multi-Agents: What’s Actually Working,” published April 2026, refined the position: “10 months ago I argued against building multi-agent systems. Today, a narrower class works, where agents contribute intelligence while writes stay single-threaded.”47 In our assessment, the synthesis is that multi-agent systems provide genuine gains for breadth-first, parallelizable workloads (research, sensemaking, broad search) but introduce substantial reliability cost on tightly-interdependent workloads (coding, multi-step state-dependent tasks); the choice is workload-specific.

Tier 4 (multi-agent) frameworks such as AutoGen, CrewAI, and LangGraph’s multi-agent constructs are mature enough for prototyping but uneven in production maturity. AutoGen, originally released by Microsoft Research in fall 2023 (Wu et al., arXiv:2308.08155), shipped its v0.4 redesign in late 2024 with a more modular event-driven architecture.4263 CrewAI, with substantial enterprise traction including documented PwC, DocuSign, and Gelato deployments, has a clear opinionated abstraction (Crews + Flows) but limited checkpointing for long-running workflows in the open-source version.4364 LangGraph, developed by LangChain, provides lower-level graph orchestration with durable execution, human-in-the-loop checkpoints, and observability via LangSmith; it is the framework Klarna explicitly migrated to for its customer-service agent.4454 In our assessment, LangGraph has the most mature production tooling among the three; CrewAI has the cleanest abstractions for role-based teams; AutoGen 0.4 is best for research and asymmetric multi-agent setups. None of the three eliminates the underlying reliability challenges of multi-agent coordination.

Tier 5 (fully autonomous) is bench-strong, field-modest in 2026. OSWorld and OSWorld-Verified, the standard benchmarks for computer-use agents, document the rapid progress: Anthropic’s Claude 3.5 Sonnet scored 14.9% on the screenshot-only OSWorld in October 2024; Claude Sonnet 4.5 reached 61.4% on OSWorld-Verified in September 2025; Claude Sonnet 4.6 hit 72.5% (matching the 72.4% human-expert baseline) in March 2026; Claude Opus 4.7 reached 78.0% in April 2026.48495152 OpenAI’s Computer-Using Agent (CUA), launched January 2025 and surfaced as Operator, scored 38.1% on OSWorld and 87% on WebVoyager at launch.50 These are real capability gains.

The honest caveat is twofold. First, an independent study, “Emergence WebVoyager,” found that OpenAI Operator’s task success rate varied from 100% on Apple.com to 35% on Booking.com, with average performance of 68.6% across the full benchmark, substantially below OpenAI’s reported 87%, with the discrepancy attributed to evaluation methodology issues that the authors describe as threatening “the robustness and integrity” of WebVoyager.65 Second, on WebArena, a peer-reviewed re-evaluation paper “WebArena Verified” found that “widely used benchmarks can misestimate performance due to underspecified goals and fragile evaluators,” consistent with a broader pattern where benchmark numbers overstate field reliability.6667

For coding, Cursor 2.0/3 (launched October 2025 / 2026), with its Composer agentic model and harness-engineering blog series, is the most extensively documented production agentic-coding system. Cursor’s engineering team has published transparently on harness evolution, failure modes including tool-call errors and recovery loops, and the operational practice of running parallel agents on git worktrees.3968 Replit Agent 3, launched in 2025, can run autonomously for up to 200 minutes per session with subagent spawning, but independent reviewers report unpredictable credit consumption and autonomy loops.41

In our assessment, fully-autonomous Tier 5 patterns are appropriate today for: bounded sandboxed workflows where errors are cheap and reversible; research and information-gathering tasks where breadth-first exploration adds value; and well-instrumented coding assistance where humans review every PR. They are not appropriate for: financial transactions, irreversible state changes, sensitive customer interactions, or any workflow lacking robust verification.

[Figure 4: Production readiness matrix, rows: agentic tiers (1-5); columns: workload classes (knowledge worker, code, customer service, data analysis, autonomous research); cells: production-ready / emerging / experimental / not yet.]

Caption: Maturity is workload-specific. Tier 1/2 is production-ready across most knowledge-worker and customer-service workloads; Tier 3 is production-ready for code with careful evaluation; Tier 4 is emerging for research and customer service, experimental elsewhere; Tier 5 is bench-strong but field-modest in 2026.


11. Six Anti-Patterns

Across documented production retrospectives, six recurring anti-patterns degrade systems with surprising consistency. Each is named, defined, and illustrated below.

1. The Evaluation Cliff. The system has no rigorous evaluation suite. Quality assessment is performed by ad-hoc inspection (“vibe testing”). The system passes the demo, ships, and degrades silently as corpus and queries diverge from the demo distribution. Diagnostic signs: no held-out evaluation set, no version-tracked metrics, no regression tests, “it worked yesterday” complaints from operators. Root cause: treating LLM systems as deterministic software where unit tests and integration tests suffice. Mitigation: invest in LLM-judged evaluation correlated with human review, as Shopify documented for Sidekick; tools like RAGAS provide reference metrics for faithfulness, answer relevance, and context precision, though they should be calibrated against ground truth on the specific corpus.5769

2. Debugging Opacity. When the system produces a wrong answer, the team cannot reconstruct the chain of decisions that led to it. There is no trace of which retriever returned which candidates, which reranker scored them how, which tool calls the agent made, or what the agent reasoned about. Diagnostic signs: bug reports met with shrugs; quality regressions traced to “the model changed”; no production traces. Root cause: skipping observability infrastructure (“LangSmith costs money”). Mitigation: structured logging of every retrieval, every rerank, every tool call, every model output, with persistent traces accessible by query ID; Cursor’s harness team describes this as treating the agent system the way any ambitious software product would be treated.68

3. Latency Drift. Each architectural addition (HyDE, reranking, multi-step planning) adds 100-500ms. After a year of incremental “improvements,” the system’s p50 latency has tripled and the p99 latency has crossed the user’s tolerance threshold. Diagnostic signs: a slow creeping decline in usage, complaints from interactive users, batch-only adoption. Root cause: optimizing each step in isolation without a latency budget. Mitigation: explicit p50/p99 latency budgets per architectural layer, with a route or fast-path for queries that do not need full processing.

4. Integration Impedance. The agent or RAG system works in isolation but fails to integrate with the systems of record (CRMs, ticketing systems, ERPs, code repositories). The integration layer is fragile, undocumented, and unmaintained. Diagnostic signs: most production “incidents” are integration failures rather than model failures. Root cause: treating the LLM as the hard problem and integration as plumbing. Mitigation: invest as much engineering in the tool/integration surface as in the model layer; Anthropic’s “Writing effective tools for AI agents” guidance reflects this principle: tool definitions, error messages, and pagination defaults are first-class engineering.70

5. The POC-to-Production Gap. The proof-of-concept worked on 50 hand-picked queries. Production sees 50,000 queries with long-tail distributions the POC never considered. The system fails on the long tail and there is no graceful degradation. Diagnostic signs: success metrics look good in aggregate but distributions of per-query quality are bimodal; the failure cases cluster in particular query archetypes. Root cause: testing on the median, deploying to the distribution. Mitigation: stratified evaluation on query types, including adversarial cases; investment in escalation paths from automated to human handling, with high-quality handoff context (the lesson Klarna’s later reporting underscored).55

6. The Verification Vacuum. The agent acts. There is no second system that verifies its actions. The agent hallucinates, mis-clicks, deletes the wrong row, sends the wrong email. Diagnostic signs: rare but severe production incidents; mistakes that humans would have caught but the system did not. Root cause: trust in a single model’s output. Mitigation: independent verification, for example Cognition’s Devin Review pattern, in which a separate review agent sharing no context with the coding agent catches an average of 2 bugs per PR (58% severe) on PRs already written by Devin; Anthropic’s evaluator-optimizer loops; and human-in-the-loop review for irreversible actions.4732

[Figure 5: Decision tree for pattern selection, flowchart from workload characteristics (corpus size, query complexity, latency budget, error tolerance, verifiability surface) to recommended retrieval and agentic pattern combinations.]

Caption: Pattern selection is deterministic given workload characteristics. Most production failures stem from deploying patterns whose preconditions are not met, not from intrinsic limitations of the patterns themselves.

[Figure 6: Six anti-patterns, tabular: anti-pattern, diagnostic signs, root cause, mitigation.]

Caption: Production retrospectives show consistent recurrence of the same six anti-patterns. Each has documented mitigations; the difficulty is recognition before incident, not cure after.


12. A Unified Decision Framework

The decisions architects must make can be reduced to six workload characteristics, each of which maps to specific pattern recommendations. The framework that follows is derived entirely from the evidence presented in earlier sections; no new claims are introduced.

Corpus characteristics determine the retrieval architecture. For corpora under ~10,000 documents with homogeneous, well-edited content and queries phrased close to the corpus’s vocabulary, naive RAG is acceptable. For corpora over ~10,000 documents, with entity-heavy content (codes, identifiers, names), or with users who phrase queries idiosyncratically, hybrid retrieval (BM25 + dense + RRF) is the default. For corpora over a million tokens with global-sensemaking queries, GraphRAG-style approaches earn a pilot, paired with skepticism toward vendor benchmarks and the independent evaluations cited in Section 7. For corpora of long documents requiring both detail and synthesis, RAPTOR earns its added indexing cost.
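
The corpus branch of the framework can be written down as an explicit function, which keeps the thresholds auditable. The sketch below mirrors the rough figures above; the thresholds and recommendation strings are starting points to tune per organization, not hard rules.

from dataclasses import dataclass

@dataclass
class CorpusProfile:
    num_documents: int
    entity_heavy: bool        # SKUs, error codes, identifiers, version strings
    long_documents: bool      # books, lengthy reports, technical manuals
    global_sensemaking: bool  # "what are the main themes in this corpus?"

def retrieval_recommendation(c: CorpusProfile) -> str:
    if c.global_sensemaking:
        return "pilot GraphRAG alongside hybrid retrieval; validate with independent evaluations"
    if c.long_documents:
        return "hybrid retrieval + reranking, with RAPTOR-style hierarchical indexing"
    if c.num_documents > 10_000 or c.entity_heavy:
        return "hybrid retrieval (BM25 + dense + RRF) with cross-encoder reranking"
    return "naive RAG is acceptable; add hybrid retrieval as the corpus and query mix grow"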

Query complexity determines query understanding. Short, direct queries need minimal preprocessing. Multi-hop or compositional queries require decomposition. Knowledge-intensive abstract queries benefit from step-back. Queries where the corpus phrases answers very differently from how users ask questions benefit from HyDE. Each adds latency; the latency budget is the constraint.

Verifiability determines agentic tier. If the agent’s outputs can be deterministically verified (code that compiles and passes tests; structured outputs that match a schema; SQL that returns the expected rows), Tier 3+ becomes tenable because verification provides a reliable termination signal. If verification requires human judgment that does not scale, Tier 1 or Tier 2 with human review is more honest. The Devin Review pattern, using a separate verification agent without shared context, is a way to manufacture verifiability where it does not exist natively.47

Latency budget determines how many architectural layers the pipeline can support. Sub-second interactive workloads (search-as-you-type, voice agents, real-time copilots) can support hybrid retrieval and lightweight rerank, but not multi-step planning, HyDE, or LLM-as-judge rerank. Asynchronous workloads (research assistants, ticket triage, batch analysis) can absorb 5-60 seconds of orchestration cost and unlock Tier 3+.

Error tolerance determines verification investment. High-stakes workloads (financial transactions, medical recommendations, legal advice) require defense in depth: multiple retrievers, multiple verification passes, human-in-the-loop, audit trails. Low-stakes workloads (internal search, content drafting) can accept higher error rates in exchange for simpler architecture.

Integration surface (how many external systems the agent must touch) determines the case for or against multi-agent architectures and complex tool-using designs. Light integration (1-5 tools) favors single-agent designs. Heavy integration (dozens of tools, multiple systems of record) favors orchestrator-worker patterns or just-in-time tool loading via MCP; Anthropic’s published “Code execution with MCP” pattern reduces token overhead by 98.7% in their reported example.71

The framework’s prescriptive output: for the median enterprise workload in 2026, the right architecture is hybrid retrieval (BM25 + dense + RRF) with cross-encoder reranking, structural chunking with metadata, a current-generation embedding model, optional query expansion or HyDE depending on retrieval recall, a Tier 1 deterministic backbone with Tier 2 tool-using sub-agents at scoped steps, rigorous LLM-judge evaluation calibrated against human review, structured tracing of every retrieval and tool call, human-in-the-loop for irreversible actions, and a verification pass, independent of the primary agent, for high-stakes outputs. Higher-tier patterns earn their added complexity only when the workload’s characteristics make their preconditions hold.



  1. Klarna, “Klarna AI assistant handles two-thirds of customer service chats in its first month,” Klarna press release, February 27, 2024, https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/, accessed May 2, 2026. 
  2. Intercom, “Fin: The #1 AI Agent for customer service,” fin.ai, accessed May 2, 2026, https://fin.ai/. 
  3. “RAG Production Guide 2026: Retrieval-Augmented Generation,” Lushbinary, 2026, https://lushbinary.com/blog/rag-retrieval-augmented-generation-production-guide/, accessed May 2, 2026. 
  4. “Advanced RAG Techniques for High-Performance LLM Applications,” Neo4j Engineering Blog, https://neo4j.com/blog/genai/advanced-rag-techniques/, accessed May 2, 2026. 
  5. “Why Naive RAG Fails in Production,” dasroot.net, February 2026, https://dasroot.net/posts/2026/02/why-naive-rag-fails-production/, accessed May 2, 2026. 
  6. Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020, https://arxiv.org/abs/2005.11401, accessed May 2, 2026. 
  7. “Best Chunking Strategies for RAG Pipelines,” Redis Engineering Blog, https://redis.io/blog/chunking-strategy-rag-pipelines/, accessed May 2, 2026. 
  8. “Hybrid Search Explained,” Weaviate Blog, https://weaviate.io/blog/hybrid-search-explained, accessed May 2, 2026. 
  9. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych, “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models,” NeurIPS Datasets and Benchmarks 2021, https://arxiv.org/abs/2104.08663, accessed May 2, 2026. 
  10. Stephen Robertson and Hugo Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond,” Foundations and Trends in Information Retrieval 3, no. 4 (2009): 333-389, https://dl.acm.org/doi/10.1561/1500000019, accessed May 2, 2026. 
  11. “The Ultimate Guide to Chunking Strategies for RAG Applications with Databricks,” Databricks Community Blog, https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089, accessed May 2, 2026. 
  12. “Optimizing RAG with Hybrid Search & Reranking,” Superlinked VectorHub, https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking, accessed May 2, 2026. 
  13. Rost Glukhov, “Chunking Strategies in RAG Comparison: Alternatives, Trade-offs, and Examples,” https://www.glukhov.org/rag/retrieval/chunking-strategies-in-rag/, accessed May 2, 2026. 
  14. “Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support,” PMC, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12649634/, accessed May 2, 2026. 
  15. Niklas Muennighoff et al., “MTEB: Massive Text Embedding Benchmark,” https://arxiv.org/abs/2210.07316, accessed May 2, 2026. 
  16. “Top embedding models on the MTEB leaderboard,” Modal Blog, https://modal.com/blog/mteb-leaderboard-article, accessed May 2, 2026. 
  17. “Best Embedding Models 2025: MTEB Scores & Leaderboard,” Ailog RAG, April 2026, https://app.ailog.fr/en/blog/guides/choosing-embedding-models, accessed May 2, 2026. 
  18. “Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks,” arXiv:2506.21182, https://arxiv.org/html/2506.21182v1, accessed May 2, 2026. 
  19. “Hybrid Search: Combining Keyword and Vector Search for Better Retrieval,” March 5, 2026, https://pr-peri.github.io/blogpost/2026/03/05/blogpost-hybrid-search.html, accessed May 2, 2026. 
  20. “Improving Zero-Shot Ranking with Vespa Hybrid Search, part two,” Vespa Blog, https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa-part-two/, accessed May 2, 2026. 
  21. “Search Mode Benchmarking,” Weaviate Blog, https://weaviate.io/blog/search-mode-benchmarking, accessed May 2, 2026. 
  22. Luyu Gao, Xueguang Ma, Jamie Callan, “Precise Zero-Shot Dense Retrieval without Relevance Labels,” ACL 2023, https://arxiv.org/abs/2212.10496, accessed May 2, 2026. 
  23. “HyDE: Hypothetical Document Embeddings,” Emergent Mind, https://www.emergentmind.com/topics/hypothetical-document-embeddings-hyde, accessed May 2, 2026. 
  24. Huaixiu Steven Zheng et al., “Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models,” ICLR 2024, https://arxiv.org/abs/2310.06117, accessed May 2, 2026. 
  25. “Reranking in RAG: Cross-Encoders, Cohere Rerank & FlashRank,” Medium (Vaibhav Dixit), March 2026, https://medium.com/@vaibhav-p-dixit/reranking-in-rag-cross-encoders-cohere-rerank-flashrank-c7d40c685f6a, accessed May 2, 2026. 
  26. “Latency Benchmark: Cohere rerank 3.5 vs. ZeroEntropy zerank-1,” ZeroEntropy Blog, https://zeroentropy.dev/articles/lightning-fast-reranking-with-zerank-1/, accessed May 2, 2026. 
  27. “Build BGE Reranker: Cross-Encoder Reranking for Better RAG 2026,” Markaicode, https://markaicode.com/bge-reranker-cross-encoder-reranking-rag/, accessed May 2, 2026. 
  28. Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning, “RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval,” ICLR 2024, https://arxiv.org/abs/2401.18059, accessed May 2, 2026. 
  29. Darren Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization,” Microsoft Research, arXiv:2404.16130, April 24, 2024 (v2 February 19, 2025), https://arxiv.org/abs/2404.16130, accessed May 2, 2026. 
  30. “How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG,” arXiv:2506.06331, https://arxiv.org/abs/2506.06331, accessed May 2, 2026. 
  31. “RAG vs. GraphRAG: A Systematic Evaluation and Key Insights,” arXiv:2502.11371, https://arxiv.org/abs/2502.11371, accessed May 2, 2026. 
  32. Anthropic, “Building Effective Agents,” Anthropic Research, December 2024, https://www.anthropic.com/research/building-effective-agents, accessed May 2, 2026. 
  33. Anthropic, “Effective context engineering for AI agents,” Anthropic Engineering Blog, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents, accessed May 2, 2026. 
  34. Anthropic, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents, accessed May 2, 2026. 
  35. Shunyu Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” arXiv:2210.03629, October 2022, https://arxiv.org/abs/2210.03629, accessed May 2, 2026. 
  36. Timo Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools,” NeurIPS 2023, arXiv:2302.04761, https://arxiv.org/abs/2302.04761, accessed May 2, 2026. 
  37. Lei Wang et al., “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models,” ACL 2023, arXiv:2305.04091, https://aclanthology.org/2023.acl-long.147/, accessed May 2, 2026. 
  38. Shunyu Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” NeurIPS 2023, arXiv:2305.10601, https://arxiv.org/abs/2305.10601, accessed May 2, 2026. 
  39. “How Cursor Shipped its Coding Agent to Production,” ByteByteGo Blog, https://blog.bytebytego.com/p/how-cursor-shipped-its-coding-agent, accessed May 2, 2026. 
  40. Cognition Labs, “Introducing Devin, the first AI software engineer,” Cognition Blog, March 2024, https://cognition.ai/blog/introducing-devin, accessed May 2, 2026; “SWE-bench technical report,” Cognition Blog, https://cognition.ai/blog/swe-bench-technical-report, accessed May 2, 2026. 
  41. Replit, “Replit Agent, The best Agent for building Production-Ready apps,” https://replit.com/products/agent, accessed May 2, 2026; “2025: Replit in Review,” Replit Blog, https://blog.replit.com/2025-replit-in-review, accessed May 2, 2026. 
  42. Qingyun Wu et al., “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework,” arXiv:2308.08155, August 2023, https://arxiv.org/abs/2308.08155, accessed May 2, 2026. 
  43. CrewAI, “CrewAI Documentation,” https://docs.crewai.com/, accessed May 2, 2026. 
  44. LangChain, “LangGraph: Agent Orchestration Framework for Reliable AI Agents,” https://www.langchain.com/langgraph, accessed May 2, 2026. 
  45. Walden Yan, “Don’t Build Multi-Agents,” Cognition Blog, June 12, 2025, https://cognition.ai/blog/dont-build-multi-agents, accessed May 2, 2026. 
  46. “Anthropic: Building a Multi-Agent Research System for Complex Information Tasks,” ZenML LLMOps Database, https://www.zenml.io/llmops-database/building-a-multi-agent-research-system-for-complex-information-tasks, accessed May 2, 2026; “How Anthropic Built a Multi-Agent Research System,” ByteByteGo, https://blog.bytebytego.com/p/how-anthropic-built-a-multi-agent, accessed May 2, 2026. 
  47. Cognition Labs, “Multi-Agents: What’s Actually Working,” Cognition Blog, April 2026, https://cognition.ai/blog/multi-agents-working, accessed May 2, 2026. 
  48. Anthropic, “Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku,” Anthropic News, October 22, 2024, https://www.anthropic.com/news/3-5-models-and-computer-use, accessed May 2, 2026. 
  49. Anthropic, “Introducing Claude Sonnet 4.6,” Anthropic News, https://www.anthropic.com/news/claude-sonnet-4-6, accessed May 2, 2026. 
  50. OpenAI, “Computer-Using Agent,” https://openai.com/index/computer-using-agent/, accessed May 2, 2026; OpenAI, “Introducing Operator,” https://openai.com/index/introducing-operator/, accessed May 2, 2026. 
  51. Anthropic, “System Card: Claude Opus 4.5,” November 2025, https://www.anthropic.com/claude-opus-4-5-system-card, accessed May 2, 2026; “System Card: Claude Opus 4.6,” February 2026, https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf, accessed May 2, 2026. 
  52. “Claude Opus 4.7 Benchmarks Explained,” Vellum, https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained, accessed May 2, 2026. 
  53. “Klarna’s AI assistant does the work of 700 full-time agents,” OpenAI Customer Stories, https://openai.com/index/klarna/, accessed May 2, 2026. 
  54. “How Klarna’s AI assistant redefined customer support at scale for 85 million active users,” LangChain Blog, https://blog.langchain.com/customers-klarna/, accessed May 2, 2026. 
  55. “Klarna Customer Service: From AI-First to Human-Hybrid Balance,” PromptLayer Blog, https://blog.promptlayer.com/klarna-customer-service-from-ai-first-to-human-hybrid-balance/, accessed May 2, 2026; “Klarna Isn’t Backing Down from AI in Customer Service; It’s Getting Smarter About It,” CX Today, https://www.cxtoday.com/contact-center/klarnas-ai-merry-go-round-enough-to-put-anyones-head-in-a-spin/, accessed May 2, 2026. 
  56. “How Intercom’s Fin AI Agent Redefines CX,” Faye Digital, https://fayedigital.com/blog/fin-ai-agent/, accessed May 2, 2026. 
  57. Andrew McNamara, Ben Lafferty, and Michael Garner, “Building production-ready agentic systems: Lessons from Shopify Sidekick (2025),” Shopify Engineering, https://shopify.engineering/building-production-ready-agentic-systems, accessed May 2, 2026. 
  58. Carlos E. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, ICLR 2024, https://arxiv.org/abs/2310.06770, accessed May 2, 2026; SWE-bench Leaderboards, https://www.swebench.com/, accessed May 2, 2026. 
  59. “SWE-bench Verified,” Epoch AI, https://epoch.ai/benchmarks/swe-bench-verified, accessed May 2, 2026. 
  60. “SWE-Bench Pro Leaderboard AI Coding Benchmark,” Scale, https://labs.scale.com/leaderboard/swe_bench_pro_public, accessed May 2, 2026. 
  61. “Devin AI vs Engine, Compare Software Engineer Tools,” Engine Labs Blog, https://blog.enginelabs.ai/devin-ai-vs-engine-compare-software-engineer-tools, accessed May 2, 2026. 
  62. “Devin AI Review: The Good, Bad & Costly Truth (2025 Tests),” Trickle Blog, https://trickle.so/blog/devin-ai-review, accessed May 2, 2026. 
  63. Microsoft, “AutoGen 0.4, Multi-agent Conversation Framework,” https://microsoft.github.io/autogen/0.2/docs/Use-Cases/agent_chat/, accessed May 2, 2026. 
  64. “CrewAI Framework 2025: Complete Review,” Latenode Blog, https://latenode.com/blog/ai-frameworks-technical-infrastructure/crewai-framework/crewai-framework-2025-complete-review-of-the-open-source-multi-agent-ai-platform, accessed May 2, 2026. 
  65. “Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild,” arXiv:2603.29020, https://arxiv.org/pdf/2603.29020, accessed May 2, 2026. 
  66. Shuyan Zhou et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” arXiv:2307.13854, https://arxiv.org/abs/2307.13854, accessed May 2, 2026. 
  67. “WebArena Verified,” OpenReview, https://openreview.net/forum?id=CSIo4D7xBG, accessed May 2, 2026. 
  68. “Continually improving our agent harness,” Cursor Blog, https://cursor.com/blog/continually-improving-agent-harness, accessed May 2, 2026. 
  69. Shahul Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation,” arXiv:2309.15217, https://arxiv.org/abs/2309.15217, accessed May 2, 2026; Ragas documentation, https://docs.ragas.io/en/stable/, accessed May 2, 2026. 
  70. Ken Aizawa, “Writing effective tools for AI agents, using AI agents,” Anthropic Engineering Blog, https://www.anthropic.com/engineering/writing-tools-for-agents, accessed May 2, 2026. 
  71. Anthropic, “Code execution with MCP: building more efficient AI agents,” Anthropic Engineering Blog, https://www.anthropic.com/engineering/code-execution-with-mcp, accessed May 2, 2026. 
  72. “Claude Opus 4.7: Full Review, Benchmarks & Features (2026),” Build Fast with AI, https://www.buildfastwithai.com/blogs/claude-opus-4-7-review-benchmarks-2026, accessed May 2, 2026. 
