The Weekend Feature and the Bill at Month's End
Building an AI feature is a weekend's work today. A few API calls, a prompt, a bit of glue code – and suddenly the chatbot answers support questions, the assistant summarises documents, the tool writes product descriptions. The prototype feels like magic, and that tempts you into a dangerous conclusion: that the hard part is done.
It isn't. Building was never the expensive bit. The expensive bit is running that feature for every user, every day, forever. And that's the calculation almost nobody runs before they ship. They run it when the first real cloud bill lands – and by then the product is already live, already priced, already promised to the customer.
The uncomfortable truth behind every AI product: every token costs money. And unlike in traditional software, that cost doesn't shrink towards nothing as you grow. It adds up with every single click.
The Break in the Cost Model
To see why AI behaves differently here, it helps to remember why classic SaaS was such a dream business in the first place.
With normal software, the marginal cost of the next user is close to zero. The code is written, the servers are running anyway – serving a page to the tenth user or the ten-thousandth costs essentially the same, which is to say almost nothing. That's why growth in SaaS is pure upside: every new customer brings revenue but barely any extra cost. The bigger you get, the fatter the margin.
AI inverts that principle. Every interaction calls a model, and every model call costs money – per token, on every single request. The ten-thousandth user isn't free. They cost the same per request as the first one did, multiplied by however many requests they make. Here, scaling doesn't just multiply revenue, it multiplies cost. Anyone bringing the SaaS instinct of "more users = more profit" to an AI product is leaving half the equation off the page.
The Maths Nobody Runs
The arithmetic isn't complicated. Which is exactly what makes it so striking how rarely anyone writes it down. The cost of an AI feature comes, roughly, from:
(input tokens × input price per token + output tokens × output price per token) × calls per task × tasks per user × users.
Let's run it with round, illustrative numbers. Take a simple request: 2,000 tokens go in (the question plus a little context), 500 come out. On a model that costs, say, £3 per million input tokens and £15 per million output tokens, a single request lands at about 1.5 pence. Laughably small. This is where most teams stop doing the maths in their heads – and it's exactly where the problem begins.
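The single-request figure is easy to check in code. This is a minimal sketch using the illustrative prices from the paragraph above; the function name `request_cost` and its parameters are invented for this example and don't belong to any provider's API. The exact result is 1.35 pence, in line with the rough figure quoted.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0) -> float:
    """Cost of one model call in pounds, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# The simple request from the text: 2,000 tokens in, 500 out.
simple = request_cost(2_000, 500)
print(f"{simple * 100:.2f} pence")  # 1.35 pence
```

Input and output tokens are priced separately here because the example quotes two different rates; swap in your own provider's prices before trusting any number.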
Now layer reality on top:
- Long context and RAG. To get good answers, you stuff relevant documents into every call. Those 2,000 input tokens quickly become 20,000. The request now costs nearer 7 pence instead of 1.5 – a fivefold jump, from context alone.
- Agentic loops. Modern AI features rarely consist of a single call. The agent plans, calls a tool, evaluates the result, corrects itself, calls the next tool. Twenty model calls for one task is nothing unusual. Suddenly that one task costs not 7 pence but £1.40 – twenty times the long-context request, and roughly a hundred times the naive estimate.
- Retries. A timeout, a truncated response, a failed tool call – every retry costs again. In poorly guarded systems, retry storms can quietly double the bill.
Now multiply upward. An active user might trigger 10 such tasks a day, across 20 working days a month – that's 200 tasks at £1.40 each, or £280 per user per month. If the product sells for £29 a month, you lose roughly £250 on every active user. Not through abuse. Through normal use.
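The whole escalation fits in one small function. The sketch below reuses the illustrative prices and volumes from the text; the names `call_cost` and `monthly_cost_per_user` are made up for this example, and retries are left out. Without the intermediate rounding to 7 pence, the monthly figure comes out at £270 rather than £280, which changes nothing about the conclusion.

```python
INPUT_PRICE_PER_M = 3.0    # pounds per million input tokens (illustrative)
OUTPUT_PRICE_PER_M = 15.0  # pounds per million output tokens (illustrative)


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call in pounds."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def monthly_cost_per_user(input_tokens: int, output_tokens: int,
                          calls_per_task: int, tasks_per_day: int,
                          days_per_month: int) -> float:
    """Inference cost in pounds for one active user over a month."""
    tasks = tasks_per_day * days_per_month
    return call_cost(input_tokens, output_tokens) * calls_per_task * tasks


# Long context (20,000 tokens in), 20 calls per task, 10 tasks a day, 20 days.
cost = monthly_cost_per_user(20_000, 500, calls_per_task=20,
                             tasks_per_day=10, days_per_month=20)
price = 29.0  # monthly subscription price from the text
print(f"cost £{cost:.2f}, margin £{price - cost:.2f}")
```

Changing one argument at a time shows which layer hurts most: with these numbers, dropping `calls_per_task` from 20 to 5 saves far more than any discount on the token price is likely to.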
This isn't a contrived extreme. It's the perfectly ordinary escalation from "works in the prototype" to "runs in production" – and it happens quietly, layer by layer, until the bill arrives.
The All-You-Can-Eat Problem
Even if your average cost checks out, the next trap hides in the distribution. Because users aren't equal. In almost every product, usage follows a brutal skew: a small percentage of power users generates most of the load.
In classic SaaS that doesn't matter – the heavy user costs nothing extra. In AI it's fatal. Sell a flat rate and meet power users, and you've opened an all-you-can-eat restaurant where a handful of guests clear the entire buffet. The heaviest 5% of users can rack up more inference cost than the other 95% combined.
The average doesn't smooth it out. It gets dominated by the heavy user. And the most expensive customer isn't the one paying least – it's often the one who loves the product most, who uses it all day precisely because it's so good. On a fixed price, your most loyal fan costs more than they bring in. You end up punished for your own success.
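To see how far the skew bends the numbers, compute the share of total cost that comes from your heaviest users. The sketch below is an assumption-laden illustration: the function `top_share` is invented for this example, and the per-user costs are made-up figures, not measurements.

```python
def top_share(costs: list[float], top_fraction: float = 0.05) -> float:
    """Fraction of total cost generated by the most expensive users."""
    if not costs:
        raise ValueError("costs must not be empty")
    total = sum(costs)
    if total == 0:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Always count at least one user, even for very small samples.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / total


# Made-up monthly inference costs in pounds: 95 light users, 5 power users.
costs = [2.0] * 95 + [60.0] * 5
print(f"{top_share(costs):.0%}")  # 61%
```

In this invented sample the average cost is under £5 per user, comfortably below a £29 flat price, yet each of the five power users costs about twice what they pay.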
Where the Costs Hide
Most cost explosions aren't laws of nature; they're avoidable decisions. The usual suspects:
- Context bloat. Every call re-sends the entire conversation history or half the knowledge base. You pay for the whole context on every single request – including the part that hasn't changed in ten messages.
- No caching. The same system prompt, the same documents, the same instructions go over the wire unchanged a thousand times and get billed in full each time, even though the result would be identical.
- A frontier model for trivia. The largest, priciest model classifies an email as "urgent / not urgent" or extracts a date from a sentence. Tasks a model ten times cheaper would handle just as well.
- No output limit. Without a cap, the model writes three paragraphs where one would have done – and output tokens are usually the most expensive kind.
- Agentic loops with no budget cap. An agent that gets stuck keeps calling the model until it's done, which may be never. Without a hard ceiling on calls per task, that's an open tap.
- Retry storms. When a call fails and the system retries aggressively, with no backoff and no cap, costs multiply at exactly the moment something is already going wrong.
None of these is visible in the prototype. Every one of them becomes a line on the bill in production.
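Of the suspects above, the retry storm is the quickest to defuse. Here is a minimal sketch of a retry wrapper with a hard attempt cap and capped exponential backoff with jitter. The name `call_with_retries` and its defaults are assumptions for illustration, and a real system would retry only on errors it knows to be transient instead of catching every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on failure with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # Attempts exhausted: surface the error, stop paying.
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the wrapper can be tested without waiting. The point of the cap is that the worst case per request is now `max_attempts` model calls, a number you can put in the cost formula.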
The Levers
The good news: every one of those cost drivers has a counter-lever. Teams that take unit economics seriously build them in from the start – not as emergency surgery once the bill already hurts.
- Right-size the model to the task. Not every task needs the flagship. A small, cheap model for classification, extraction, and simple answers; the frontier model only where it earns its price. That tiering alone can cut the cost of the routed calls by an order of magnitude.
- Use caching. Prompt caching for stable system prompts and documents, result caching for recurring requests. What doesn't change shouldn't be paid for every time.
- Cap context and output. Pass only the genuinely relevant parts into the call, not the whole history. Set a hard ceiling on response length.
- Budget limits per user and per task. A maximum number of calls per task, a maximum token budget per user per day. That turns the worst-case bill from infinite into calculable.
- Batch where you can. Anything that doesn't have to happen in real time can be processed in bulk, often far more cheaply.
- At high, steady volume, evaluate self-hosting. Past a certain sustained load, a self-run open-weight model can come in below the API meter – provided you count the operational cost honestly.
- Tie price to usage. Usage-based or tiered rather than flat. When revenue grows with cost, the power user can no longer become a source of losses.
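Two of these levers, model tiering and per-task budgets, take only a few lines to sketch. Everything named here is hypothetical: the tier labels `small` and `large`, the task types, the `TaskBudget` class, and the limits are placeholders to show the shape of the idea, not a recommended configuration.

```python
from dataclasses import dataclass

# Hypothetical tiers; map them to whatever models you actually use.
MODEL_FOR_TASK = {"classify": "small", "extract": "small", "draft": "large"}


def pick_model(task_type: str) -> str:
    """Route known cheap tasks to the small tier; default to the large one."""
    return MODEL_FOR_TASK.get(task_type, "large")


class BudgetExceeded(RuntimeError):
    """Raised when a task tries to spend past its ceiling."""


@dataclass
class TaskBudget:
    max_calls: int
    max_tokens: int
    calls: int = 0
    tokens: int = 0

    def reserve_call(self) -> None:
        """Call before each model request; raises if the task is out of budget."""
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call limit of {self.max_calls} reached")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token limit of {self.max_tokens} reached")
        self.calls += 1

    def record_tokens(self, tokens_used: int) -> None:
        """Call after each model request with the tokens it consumed."""
        self.tokens += tokens_used


budget = TaskBudget(max_calls=20, max_tokens=500_000)
steps = 0
try:
    while True:  # stand-in for an agent that never finishes on its own
        budget.reserve_call()
        budget.record_tokens(20_500)  # pretend each call used 20,500 tokens
        steps += 1
except BudgetExceeded as exc:
    print(f"stopped after {steps} calls: {exc}")
```

Because the token check happens before a call, a task can overshoot `max_tokens` by at most one call's worth. Defaulting unknown task types to the large tier favours quality over cost; flip the default if you would sooner fail cheap.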
The Honest Counter-Case: Don't Optimise Too Early
So much for the cost sermon – now the counterpoint, because otherwise you draw the wrong conclusions.
Optimising cost on a product that has no users yet is wasted time. In the early phase, two things matter: does it work at all, and does anyone want it? Correctness and product-market fit beat efficiency. An expensive prototype that proves people love the feature is infinitely more valuable than a cheap one nobody wants. Fretting over token caching in week one instead of sharpening the product is optimising the wrong variable.
There's a structural comfort on top of that: prices per token have trended steadily downward. What's "too expensive for this use case" today can be unremarkable a year from now. Some ideas that fail on the inference bill today become viable simply by waiting. That's real, and it means a high bill today shouldn't scare you off a good idea.
But – and this is the crucial part – it's no licence to ignore your unit economics. It's an argument for the right order: first prove it works and is wanted, then understand the cost before you scale. The disaster doesn't happen with the expensive prototype. It happens when you ramp a negative-margin product to ten thousand users without ever having worked out the unit cost. Scale makes a good business better and a broken one broken faster.
The Uncomfortable Questions
Before an AI feature goes wide, run a sober check. Five questions that deserve honest answers:
- Do you know your cost per request and per active user? Measured, not guessed. If you can't state that number, you're flying blind.
- What does your worst-case user cost? Take the heaviest 5% and run the numbers. That's your real risk figure, not the average.
- Does your pricing cover that user? If the most expensive customer costs more than they pay, the pricing model is broken – not the customer.
- Where is context bloat hiding? Are you re-sending the full history and half the knowledge base on every call? Are you caching what doesn't change?
- Are you paying for a frontier model where a small one would do? Walk through every model call and ask whether it really needs the most expensive tool in the box.
Answer all five cleanly and you have a product that gets more profitable as it grows, instead of driving into a cost wall.
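The first two questions need measured numbers, and those fall out of usage logs with very little code. A minimal sketch, assuming you already log one record per model call with a user id and token counts: the `Usage` record, the `cost_report` function, and the default prices are all invented for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Usage(NamedTuple):
    user_id: str
    input_tokens: int
    output_tokens: int


def cost_report(records: list[Usage], input_price_per_m: float = 3.0,
                output_price_per_m: float = 15.0) -> dict:
    """Summarise logged model calls into per-request and per-user cost."""
    if not records:
        raise ValueError("no usage records to report on")
    per_user: dict[str, float] = defaultdict(float)
    for r in records:
        per_user[r.user_id] += (r.input_tokens * input_price_per_m
                                + r.output_tokens * output_price_per_m) / 1_000_000
    total = sum(per_user.values())
    return {
        "cost_per_request": total / len(records),
        "cost_per_active_user": total / len(per_user),
        "most_expensive_user": max(per_user, key=per_user.get),
        "per_user": dict(per_user),
    }


# Three made-up log entries: two short calls by "a", one long-context call by "b".
report = cost_report([Usage("a", 2_000, 500), Usage("a", 2_000, 500),
                      Usage("b", 20_000, 500)])
print(report["most_expensive_user"], round(report["cost_per_active_user"], 5))
```

The `per_user` figures are the ones to compare against your price: sort them, look at the top few percent, and you have the worst-case number the second question asks for.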
Conclusion
Classic software got cheaper per head with every user. AI won't – every user, every request, every token carries a real, recurring price. That's not a reason to build no AI products. It's a reason to build them with your eyes open.
Unit economics aren't a job for later, once growth arrives. They're a decision on the first day of design: which model, how much context, what cap, what pricing model. Think it through from the front and you build a product that earns as it scales. Ignore it and you build one that bleeds as it scales – and only finds out when the bill arrives.
That's exactly where we come in at nh labs: designing AI products from day one so the unit economics hold. Not so the feature dazzles in the demo, but so it still makes money at ten thousand users. Building has become cheap. Running it with judgement is where the real craft lives.