
AI Agents in Production: What Actually Works

Everyone's talking about autonomous AI agents. But what happens when you actually deploy them into production systems? An honest look at what works – and what's still just marketing.

The Agent Hype

2026 is the year of AI agents. Every conference, every newsletter, every LinkedIn post is about autonomous systems that complete tasks on their own, make decisions and take over entire workflows. The vision: give an agent a goal, and it does the rest.

Reality looks rather different. We've been deploying AI agents in real projects for months – in software development, data processing, and monitoring. Some things work surprisingly well. Others are still a long way from production-ready.

What Works Today

There are areas where AI agents deliver genuine value – not as demos, but in day-to-day operations:

Code agents: Tools like Claude Code write, refactor and debug code at a quality that was unthinkable a year ago. Not as a toy, but as a serious tool in a developer's daily workflow. The key: the human stays in the loop. The agent proposes, the developer decides. This works because code is verifiable – you can immediately see whether the result is correct.

Data processing and analysis: Agents that extract structured data from unstructured sources, classify and prepare it, run reliably in production. Parsing emails, categorising documents, generating reports – repetitive tasks with clear rules and verifiable outcomes.

Monitoring and alerting: AI agents that analyse logs, detect anomalies and produce initial diagnoses significantly reduce incident response times. Not because they're better than experienced ops engineers, but because they watch around the clock and filter out the obvious cases.
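"Filtering out the obvious cases" often amounts to something like the following sliding-window check. This is a deliberately simplified stand-in – the window size, the 20% threshold, and matching on the literal string `ERROR` are illustrative assumptions, not what a real monitoring agent does internally – but it shows the watch-continuously, flag-the-obvious shape of the task.

```python
from collections import deque

def anomaly_alerts(lines, window=100, threshold=0.2):
    """Yield (index, error_rate) whenever the fraction of ERROR lines
    in the last `window` log lines exceeds `threshold`."""
    recent = deque(maxlen=window)  # rolling view of the newest lines
    for i, line in enumerate(lines):
        recent.append("ERROR" in line)
        if len(recent) == window and sum(recent) / window > threshold:
            yield i, sum(recent) / window

# A burst of errors at the end of an otherwise quiet log:
logs = ["INFO ok"] * 150 + ["ERROR boom"] * 50
first_alert = next(anomaly_alerts(logs))
print(first_alert)
```

The agent's value here isn't sophistication – it's that this check runs on every log line, around the clock, and only the lines that cross the threshold reach a human.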

Test generation: Agents that analyse existing code and automatically generate test cases have doubled our test coverage across several projects. Not perfect tests, but a solid foundation that gets refined manually.

What Doesn't Work Yet

And here's the part that rarely gets discussed at conferences:

Fully autonomous workflows: The idea of telling an agent "build a complete web application" or "optimise our marketing strategy" and walking away simply doesn't work. Not because the models are poor, but because complex tasks require context that can't be packed into a prompt. Business logic, stakeholder expectations, implicit domain knowledge – the agent has none of it.

Decisions with consequences: As soon as an agent needs to make decisions that are hard to reverse – transferring money, sending emails to customers, deleting data – things get dicey. The 2–5% error rate that's acceptable for text generation becomes a dealbreaker for financial transactions.

Long-chain tasks: Agents that need to execute ten or more steps autonomously accumulate errors. Each step has a small probability of failure, and the per-step success probabilities multiply across the chain, so end-to-end reliability decays quickly. By step eight, the agent is working on the basis of false assumptions from step three.
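A back-of-the-envelope calculation makes the decay concrete. The 95% per-step success rate below is an illustrative assumption, not a measured figure, and it treats steps as independent – in practice, errors that feed into later steps make things worse, not better.

```python
def chain_success(p: float, n: int) -> float:
    """Probability an n-step chain completes without error,
    assuming each step succeeds independently with probability p."""
    return p ** n

# A step that works 95% of the time looks fine in isolation...
print(f"{chain_success(0.95, 1):.0%}")   # one step
print(f"{chain_success(0.95, 10):.0%}")  # ...but ten in a row do not
```

With these numbers, a single 95%-reliable step chained ten times completes cleanly only about 60% of the time – which is why long autonomous chains need checkpoints, not hope.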

Multi-agent systems: The idea of having multiple agents communicate and collaborate is fascinating – and in practice, a debugging nightmare. When Agent A gives Agent B the wrong instructions and Agent B then feeds Agent C incorrect data, troubleshooting becomes exponentially harder than with a single system.

The Patterns That Work

From our experience, clear patterns emerge:

Human-in-the-loop: The most successful agent setups have a human at a defined point in the process. Not as a formality, but as a genuine decision point. The agent prepares, the human approves, the agent executes.

Narrow scope: Agents that handle a clearly defined task work better than generalists. An agent that exclusively reviews pull requests is more useful than one that's supposed to handle "everything to do with code."

Verifiable outputs: Tasks where the result can be automatically validated – tests pass, data format is correct, API responds properly – are excellent for agents. Tasks whose quality can only be judged subjectively, less so.

Graceful degradation: Good agent systems know when they're stuck and escalate to a human rather than guessing further. This sounds trivial, but it's the difference between a useful tool and a source of errors.
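These patterns compose naturally into one control flow: automatic verification first, a human decision point second, execution only after both gates, and escalation instead of guessing when the check fails. The sketch below is a generic illustration of that flow, assuming nothing about any particular agent framework – all the callables are placeholders supplied by the caller.

```python
from typing import Callable

def run_with_oversight(
    propose: Callable[[], str],        # agent drafts an action
    verify: Callable[[str], bool],     # automatic check: tests pass, schema valid, ...
    approve: Callable[[str], bool],    # human-in-the-loop decision point
    execute: Callable[[str], None],    # only reached after both gates
    escalate: Callable[[str], None],   # graceful degradation: hand off, don't guess
) -> None:
    draft = propose()
    if not verify(draft):
        escalate(draft)   # the agent knows it's stuck -> a human takes over
        return
    if approve(draft):
        execute(draft)    # the agent prepares, the human approves, the agent executes
    # a rejected draft simply stops here; nothing irreversible has happened

# Demo with stub callables:
run_with_oversight(
    propose=lambda: "refactor module X",
    verify=lambda d: True,
    approve=lambda d: True,
    execute=lambda d: print(f"executing: {d}"),
    escalate=lambda d: print(f"escalating: {d}"),
)
```

Note that every path either passes both gates or terminates without side effects – the structural property that makes "decisions with consequences" tolerable at all.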

The Build-vs-Buy Mistake

Many companies make the same mistake: they buy a generic "AI agent service" and expect it to solve their specific problems. That rarely works.

The agents that deliver real value in production are almost always bespoke. Not because they use proprietary models, but because they're deeply integrated into existing infrastructure. They know the database schemas, the API endpoints, the business rules. This contextual knowledge makes the difference – not the choice of model.

This doesn't mean every company needs to build its own agents from scratch. But it does mean that integration and configuration are at least as important as the AI component itself.

What This Means for Businesses

AI agents are no longer a future topic – they're a now topic. But getting the entry point right is crucial:

Start small: Identify a specific, repeatable process. Not the most complex one, but the one where the benefit is clearly measurable and error tolerance is high.

Measure, don't assume: Before an agent goes into production, it must be clear what success looks like. Time saved? Error reduction? Throughput? Without metrics, any agent deployment is guesswork.

Gradual autonomy: Agents shouldn't run autonomously from day one. First supervised, then semi-autonomous, then autonomous – and only where the data shows it works.

At nh labs, we build AI agents not as technology demos, but as tools that solve concrete problems. That sounds less spectacular than "fully autonomous AI" – but it actually delivers results.

Conclusion

The AI agent hype overstates the short-term possibilities and underestimates the long-term ones. Today, agents work best where they handle clearly defined tasks with verifiable outcomes – supported by humans, not as replacements for them. In two years, that will look different. But those who wait for fully autonomous systems now will miss the value that agents already deliver today. The companies that start pragmatically now are building the experience and infrastructure that will give them a real competitive edge.