
RAG Patterns 2026: How Long Context Windows Are Reshaping Retrieval

Million-token context windows had everyone declaring RAG dead in 2025. They were wrong – but the classic RAG recipe is dead too. A look at the retrieval patterns that actually hold up in production today.

Reports of RAG's Death Were Exaggerated

In 2024 and 2025 it became fashionable to declare Retrieval-Augmented Generation finished. The argument: if a model can process a million tokens at once, just dump the whole corpus into the context and skip the vector database gymnastics altogether.

In practice, none of that played out. Long contexts are expensive, slow and – the part that gets glossed over – less robust than the marketing slides suggest. Models do find the needle in an 800,000-token haystack surprisingly often, but they reason worse about what they find as more noise piles up next to it.

That said, RAG as we built it in 2023 is obsolete almost everywhere. Naive top-k vector search produces worse results than the new patterns – and isn't even cheaper. What's changed is not RAG's right to exist, but the recipe.

What Naive RAG Used to Look Like

A reminder, because plenty of systems still look exactly like this: documents get sliced into 500-token chunks, each chunk runs through an embedding model, the vectors land in a database. At query time, the question itself gets embedded, the five nearest neighbours are pulled, stuffed into the prompt, done.
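For reference, a minimal sketch of that recipe in Python. Everything here – `embed`, `vector_db`, `llm`, the chunking – is a placeholder interface, not any particular library's API:

```python
# Naive RAG, roughly as built in 2023. `embed`, `vector_db` and `llm`
# stand in for whatever embedding model, vector store and LLM client you use.

CHUNK_TOKENS = 500
TOP_K = 5

def chunk(text, size=CHUNK_TOKENS):
    # Crude fixed-size split on whitespace; real pipelines split on structure
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def index(documents, embed, vector_db):
    for doc_id, text in documents.items():
        for piece in chunk(text):
            vector_db.insert(vector=embed(piece), payload={"doc_id": doc_id, "text": piece})

def answer(question, embed, vector_db, llm):
    hits = vector_db.search(vector=embed(question), limit=TOP_K)
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    return llm(f"Answer from this context only:\n\n{context}\n\nQuestion: {question}")
```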

This recipe works for simple FAQ bots. For anything else it falls apart at the edges:

Lost context across chunks: The decisive sentence sits in chunk 3, the definition of the term in chunk 1. Top-k retrieves chunk 3, the model hallucinates the definition.

Wrong neighbours: Vector similarity is not semantic relevance. A question about "termination notice periods" in a contract surfaces paragraphs about firing employees because the two sit close together in embedding space, even though contract law means something completely different.

Brittleness to phrasing: Asking "How do I cancel?" returns different hits than "right of return" – even though the answer should be the same.

No iteration: A single search step only sees what the original question gives it. Real research needs several passes.

The Patterns That Hold Up in 2026

Agentic Retrieval

Instead of searching once and taking whatever comes back, we let the model decide when and what to search for. Search becomes a tool the agent calls as often as it needs: a first search returns rough leads, the model formulates more precise follow-up queries and drills deeper.

This is significantly more expensive per request – and significantly better at non-trivial questions. For research tasks, legal review or technical support it has become the default.
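A sketch of that loop, assuming a chat-style `llm_step` that returns either a tool call or a final answer – the interface is illustrative, not a specific framework:

```python
# Agentic retrieval: the model decides whether to search again or answer.
# `llm_step` and `search` are assumed interfaces, not a particular SDK.

MAX_SEARCHES = 5

def agentic_answer(question, llm_step, search):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_SEARCHES):
        step = llm_step(messages, tools=["search"])
        if step.type != "tool_call":              # model is ready to answer
            return step.content
        # Model wants more evidence: run the query and feed the results back
        messages.append({"role": "assistant", "content": step.raw})
        messages.append({"role": "tool", "content": search(step.arguments["query"])})
    # Search budget exhausted: force an answer from what was gathered so far
    return llm_step(messages, tools=[]).content
```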

Hierarchical Retrieval

We index at two levels: a summary per document and the full text. Search runs first across summaries – fast and good at finding the right documents. Only then do we look inside the chosen documents in detail.

The effect: the model never loses track of where a snippet came from, and the context of a paragraph stays intact. For structured corpora – contracts, manuals, scientific literature – this is a step change in quality.
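Schematically, and again with placeholder search interfaces rather than a specific library, the two levels look like this:

```python
# Hierarchical retrieval: pick documents via their summaries first,
# then search in detail only inside the chosen documents.

def hierarchical_retrieve(question, summary_index, passage_index,
                          n_docs=3, n_passages=8):
    # Level 1: cheap search over one summary per document
    doc_ids = [hit.doc_id for hit in summary_index.search(question, limit=n_docs)]

    # Level 2: detailed search, restricted to those documents
    passages = passage_index.search(question, limit=n_passages,
                                    filter={"doc_id": doc_ids})

    # Keep provenance attached so the model always knows where a snippet came from
    return [{"doc_id": p.doc_id, "section": p.section, "text": p.text}
            for p in passages]
```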

Hybrid Search with a Reranker

Vector search alone loses to classic BM25 the moment proper nouns, case numbers or rare terms enter the picture. Searching for "judgment 4 AZR 123/24" doesn't want an embedding match – it wants an exact hit.

The fix is simple and old: run BM25 and vector search in parallel, merge the results, then run a reranker over them. Rerankers are smaller models that score the relevance of a question-document pair. They add latency but push the genuinely relevant hits to the top. Every project where we've added a reranker has seen a tangible jump in answer quality.
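The merge step is usually reciprocal rank fusion. A sketch, with `bm25_search`, `vector_search` and `rerank` as placeholders for your own components:

```python
# Hybrid retrieval: BM25 and vector search in parallel, merged with
# reciprocal rank fusion (RRF), then reordered by a reranker.

def hybrid_search(question, bm25_search, vector_search, rerank,
                  k=60, fetch=50, final=10):
    scores, docs = {}, {}
    for results in (bm25_search(question, fetch), vector_search(question, fetch)):
        for rank, doc in enumerate(results):
            # RRF: a document ranked highly in either list floats to the top
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank + 1)
            docs[doc.id] = doc

    fused = sorted(scores, key=scores.get, reverse=True)[:fetch]
    # The reranker scores each (question, document) pair individually
    return rerank(question, [docs[i] for i in fused])[:final]
```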

Whole Documents Instead of Chunks

When the context window allows, we no longer retrieve chunks but entire documents. A 30-page contract fits comfortably into a 200,000-token window. The model sees the contract the way a lawyer would – with preamble, definitions and the small print.

This eliminates classic RAG's biggest weakness: lost context. The prerequisite is that the retrieval layer finds the right document. Hierarchical and hybrid search are exactly what gets you there.
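Mechanically this is mostly a budgeting problem. A minimal sketch, assuming a retrieval step that returns ranked document ids and a `count_tokens` helper:

```python
# Whole-document retrieval: include full documents in ranked order
# until the context budget is spent, instead of stitching chunks together.

def build_context(question, find_documents, load_document, count_tokens,
                  budget_tokens=150_000):
    parts, used = [], 0
    for doc_id in find_documents(question):   # e.g. hierarchical + hybrid search
        text = load_document(doc_id)
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            break
        parts.append(text)
        used += cost
    return "\n\n---\n\n".join(parts)
```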

Context Caching for Stable Corpora

Vendors like Anthropic and Google bill cached input far more cheaply than fresh input. For corpora that barely change – product manuals, legal standard works, codebases – you can keep large parts permanently in cache and pay a fraction per request.

In two ongoing projects we've cut inference costs by 60 to 80 percent this way, without fundamentally changing the architecture. Context caching is one of the underrated levers of the year.
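For the concrete mechanics, here is roughly what it looks like with Anthropic's prompt caching (Google's context caching follows a similar pattern): the stable corpus goes into a system block marked for caching, and only the question changes per request. The model string and corpus loading are placeholders.

```python
import anthropic

client = anthropic.Anthropic()
corpus = load_corpus_text()  # placeholder: the stable part – manuals, standards, a codebase

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-5",   # placeholder; use whatever model you actually run
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": corpus,
            # The large, rarely changing prefix gets cached and is billed
            # at the reduced cached-input rate on subsequent requests
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
```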

The Decision Matrix

When do you reach for what? The rule of thumb we use internally:

Long context, when: The relevant corpus is small and stable, every request potentially needs access to all of it, and latency is acceptable. Example: a 50-page whitepaper a bot answers questions about.

RAG, when: The corpus is too large, changes frequently, or only a small, dynamically chosen part is relevant per request. Example: an internal knowledge hub with tens of thousands of documents.

Fine-tuning, when: It's not about factual knowledge but about style, format or behaviour. Fine-tuning is surprisingly bad at teaching new facts and surprisingly good at making a model write consistently in a particular tone.

Most real systems combine two or three of these. Pure RAG systems or pure long-context systems are the exception.
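Condensed into something executable – the thresholds are illustrative, not measurements:

```python
# Toy version of the rule of thumb above. Thresholds are placeholders;
# adjust to your corpus size, change rate and latency budget.

def choose_approach(corpus_tokens, changes_often, needs_all_of_it, goal):
    if goal in ("style", "format", "behaviour"):
        return "fine-tuning"
    if corpus_tokens < 200_000 and not changes_often and needs_all_of_it:
        return "long context"
    return "RAG"
```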

Common Mistakes

Vector database worship: The choice of vector DB is almost never the bottleneck. Pinecone vs. Weaviate vs. Postgres with pgvector – it makes practically no difference to answer quality. What matters is the retrieval logic on top.

Chasing better embeddings: A new embedding model might bring two to five percent on benchmarks. Hierarchical retrieval, a reranker or a second search step often bring ten times that.

Eval as an afterthought: Without an evaluation set against real questions, every optimisation is gut feeling. A hundred hand-curated question-answer pairs from the actual use case are worth more than any synthetic benchmark.

Chunking over-optimisation: Hours spent tuning the perfect chunk size buy you less than half a day spent on a better index structure. Chunks are an implementation detail, not an architectural foundation.

At nh labs

We essentially never build RAG systems as a pure top-k vector setup any more. The default stack is hierarchical retrieval with hybrid search, a reranker, an agentic loop for more complex queries, and context caching for the parts of the corpus that don't change often. That sounds more complex than it is – most of these components are off-the-shelf libraries or managed services today.

More important than the stack is discipline around evaluation. Without a set of real questions to measure every change against, you build in the dark. With eval in place, iteration is fast and improvements are measurable.
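Even a small harness goes a long way. A sketch, assuming a hand-curated list of question-answer pairs and whatever grading function fits the use case:

```python
# Minimal eval: run every curated question through the pipeline
# and track how often the expected answer or document shows up.

def evaluate(eval_set, answer, grade):
    """eval_set: list of {"question": ..., "expected": ...} dicts."""
    passed = sum(
        1 for case in eval_set
        if grade(answer(case["question"]), case["expected"])  # exact match, doc-id hit or LLM judge
    )
    return passed / len(eval_set)
```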

Conclusion

RAG isn't dead, but the 2023 recipe is. Long context windows have expanded the toolbox, not replaced it. If you're building today, start with hierarchical retrieval, hybrid search and a reranker – and treat long context as a targeted complement, not a replacement. The systems that work well in production in 2026 aren't the ones with the largest context window or the most expensive embedding model. They're the ones with the most thought-through retrieval logic and an honest evaluation harness. You needed both two years ago too – it just matters more now.