A Buzzword in Retreat
In 2023, everyone suddenly had a new job title on LinkedIn: prompt engineer. There were 2,000-dollar courses, "prompt libraries" sold as SaaS products, and job postings advertising salaries north of 200,000 dollars. The pitch: whoever found the right words for a model held a decisive competitive advantage.
Three years later, little of that hype is left. Not because the underlying problem has gone away – but because it turned out the choice of words was only a tiny part of the job. The real leverage sits elsewhere.
What Prompt Engineering Got Right
It would be unfair to dismiss prompt engineering in retrospect. The term popularised an important insight: LLMs aren't search engines, they're language systems that respond to context. The way you frame a question dramatically affects the answer. Few-shot examples, clear instructions, structured task descriptions – these are now part of the standard toolkit for any serious AI development.
The problem wasn't the idea. The problem was reducing the whole discipline to a single block of text – as if the job were just polishing three sentences in an input field. Real applications don't work that way. They have system prompts, tools, retrieved data, conversation history, structured output formats, caching strategies and escalation paths. The "prompt" is maybe ten percent of that.
What Context Engineering Actually Means
Context engineering is the broader term for what actually happens when you put an LLM into production. It covers everything that lands in the context window before the model responds:
System prompt: The role, the guardrails, the style guide. Rarely the lever that decides success or failure, but the foundation everything else sits on.
Retrieval: Which documents, datasets or code snippets are pulled from internal sources before the model answers. In most applications, the quality of this step matters more than any prompt tweaking.
Tools: Which functions the model can call – database queries, API calls, calculations, searches. The right tool at the right moment replaces pages of prompt instructions.
Conversation state: What gets carried forward from previous turns, what gets summarised, what gets dropped. In longer sessions, often the deciding factor for consistency.
Output schema: Structured JSON responses instead of free-form prose. Saves parsing, reduces errors, makes pipelines robust.
Memory: Persistent information between sessions – user preferences, prior decisions, project-specific context. Now its own subsystem, no longer part of the prompt.
When people say "prompt engineering" today, they usually mean one of these areas – or confuse them. Context engineering separates them cleanly.
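Put together, the components above amount to a single assembly step before every model call. Here is a minimal sketch, assuming a generic chat-style messages API – the function and field names are illustrative, not any specific vendor's SDK:

```python
def build_context(system_prompt, retrieved_docs, history, user_input, schema):
    """Assemble the full context window, keeping roles cleanly separated."""
    messages = [{"role": "system", "content": system_prompt}]
    # Retrieved documents go in as clearly labelled context, not as user text
    for doc in retrieved_docs:
        messages.append({"role": "system", "content": f"[retrieved] {doc}"})
    messages.extend(history)  # prior turns, possibly already summarised
    messages.append({"role": "user", "content": user_input})
    return {"messages": messages, "response_format": schema}

request = build_context(
    system_prompt="You are a support assistant.",
    retrieved_docs=["Refund policy: 30 days."],
    history=[],
    user_input="Can I return my order?",
    schema={"type": "json_object"},
)
```

Note that the system prompt is one argument among five – which is roughly the "ten percent" proportion described above.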
Where the Real Leverage Sits
In our practice, the order of priority is clear: better data beats better wording. Concretely:
Better retrieval > better wording. A RAG system that surfaces the right three documents will produce better answers with a mediocre prompt than a perfectly worded prompt with the wrong sources. We've repeatedly seen days poured into prompt tuning when the actual problem was a poorly configured vector store.
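To make the retrieval step concrete, here is a toy version of it. The bag-of-words "embedding" is a deliberate stand-in for a real embedding model, and the documents are invented – the point is that the ranking logic, not the prompt, decides which sources the model sees:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector – stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, docs, k=3):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refund policy: returns accepted within 30 days.",
    "Shipping times: 2-4 business days within the EU.",
    "Careers: we are hiring backend engineers.",
]
print(top_k("return policy days", docs, k=1))
```

If this step surfaces the wrong document, no amount of prompt wording downstream can recover the answer.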
Right tools > clever instructions. If an agent needs to do arithmetic, give it a calculator. If it needs current data, give it an API call. A thousand words of prompt designed to coax the model into doing maths correctly can be replaced by a ten-line tool definition – and it'll be more reliable.
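A ten-line tool definition of the kind meant here might look like this. The JSON-schema shape follows the convention most chat APIs use for function calling; the tool name and fields are illustrative, and the evaluator is a safe stdlib sketch rather than any particular framework's implementation:

```python
import ast
import operator

# Hypothetical tool definition in the JSON-schema style most chat APIs expect
calculator_tool = {
    "name": "calculate",
    "description": "Evaluate a basic arithmetic expression.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "e.g. '17 * 23 + 4'"},
        },
        "required": ["expression"],
    },
}

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str):
    """Safely evaluate +, -, *, / without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

print(calculate("17 * 23 + 4"))  # 395
```

The model decides when to call the tool; the arithmetic itself is deterministic code – which is exactly why it beats a thousand words of "think step by step" instructions.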
Structured output > prose parsing. "Respond in the following JSON format" is a workaround. Forced output schemas via the API are the standard. If you're still running regular expressions over LLM output in production, you have an architecture problem, not a prompt problem.
Eviction strategy > stuffing the window. Even with a million tokens, you have to decide what goes in. Summarise older turns, drop irrelevant tool outputs, load memory selectively. If you dump everything in, you get worse results, higher costs and slower responses.
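A minimal eviction sketch, assuming a rough four-characters-per-token estimate – a real system would use a proper tokenizer and an LLM-generated summary instead of a placeholder:

```python
def evict(messages, budget, n_tokens=lambda m: len(m["content"]) // 4):
    """Keep the system prompt and the most recent turns within a token budget.
    Older turns collapse into a single summary placeholder."""
    system, rest = messages[0], messages[1:]
    kept, used = [], n_tokens(system)
    for msg in reversed(rest):  # walk newest-first
        cost = n_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    dropped = len(rest) - len(kept)
    summary = ([{"role": "system",
                 "content": f"[summary of {dropped} earlier messages]"}]
               if dropped else [])
    return [system] + summary + list(reversed(kept))

history = [{"role": "system", "content": "You are helpful."}] + \
          [{"role": "user", "content": "x" * 40} for _ in range(5)]
print(evict(history, budget=30))
```

The important design choice is that eviction is explicit and deterministic – what gets dropped is a decision, not an accident of hitting the context limit.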
Why Long Context Windows Didn't Kill the Discipline
In 2024, many people thought the problem would solve itself. If a model can handle a million tokens, just throw the entire company wiki in there and you're done. That turned out to be a fallacy.
First, models suffer from attention problems at full context length – they don't reliably find relevant information buried in 800,000 tokens. Second, costs and latency explode. Third, the "lost in the middle" effect is empirically documented: anything between the start and end gets systematically less weight.
Long contexts didn't abolish context engineering, they shifted it. Instead of "how do we fit it all in?", the question is now "what belongs where, so the model can actually use it?". That's more work, not less.
Common Mistakes
From our projects, patterns emerge that go wrong again and again:
Over-stuffed context: Teams throw in everything that might be relevant, just in case. The model gets slower, more expensive and less accurate. Less is almost always more.
No eviction strategy: In long sessions, the context grows unchecked until it hits the limit. Then something gets dropped at random – usually the wrong thing. If you don't actively manage eviction, you've handed that decision to chance.
Mixed roles: System prompt, user input and tool outputs get smashed into a single text blob instead of cleanly separating roles. Models are sensitive to this – and security gaps like prompt injection become much harder to prevent.
Static retrieval: A one-shot vector search at the start of the conversation, and nothing after. In complex tasks, the information need shifts – retrieval has to be dynamic and multi-stage.
No context logging: When an output goes wrong and nobody knows what was actually in the context window, debugging is impossible. Logging the full context isn't optional, it's a precondition for stable systems.
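Such a context log can be as simple as one JSONL record per model call. A minimal sketch – the field names and file path are illustrative:

```python
import json
import time
import uuid

def log_context(messages, output, path="context_log.jsonl"):
    """Append the full context window and the model's output as one record,
    so any bad output can be traced back to exactly what the model saw."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "context": messages,  # everything that was in the window
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["trace_id"]
```

In practice you would add the model name, token counts and latency to each record – but even this bare version turns "the model said something weird" into a reproducible bug report.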
What Teams Need Now
The skills that make a difference today aren't prompt wizardry. They're classical engineering disciplines applied to a new stack:
Information retrieval: Solid grasp of embedding models, hybrid search strategies, reranking, chunk sizes. Less glamorous than prompt tweaking, but more impactful.
API design: Tool definitions are APIs for models. If you've designed clean REST APIs before, you're ahead.
Data modelling: Output schemas, structured inputs, Pydantic or Zod definitions. Classical backend craft.
Observability: Logging, tracing, evaluation pipelines. Without them, no context system can be improved in operation.
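One concrete example of the retrieval skills listed above: reciprocal rank fusion (RRF), a standard technique for merging keyword and vector search rankings into one list. The document IDs here are invented; k=60 is the conventional smoothing constant:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked result lists.
    A document scores 1/(k + rank + 1) in each list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc7", "doc2", "doc9"]   # e.g. from BM25
vector_hits = ["doc2", "doc5", "doc7"]    # e.g. from embedding search
print(rrf([keyword_hits, vector_hits]))
```

Documents that rank well in both lists float to the top – which is precisely the "hybrid search" behaviour that no prompt instruction can provide.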
At nh labs, we stopped hiring "prompt engineers" in mid-2025 – not because the work has gone away, but because it has become an aspect of software engineering. The best results come from developers who can build clean systems, not from copywriters with ChatGPT experience.
Conclusion
Prompt engineering isn't disappearing – it's shrinking back to its actual size. It's one tool in the toolbox, not a standalone career. What remains, and grows in importance, is the ability to deliberately shape the entire context window: what enters, in what form, in what order, with which tools, in which schema. Master that and you build reliable AI systems. Keep polishing prompts and you optimise the last one percent while ignoring the first ninety. The term context engineering won't stick around forever – the discipline behind it will.