Essay № 47 · April 14, 2026 · 12 min read · Berlin

Why small models matter more than you think.

The 70B-vs-405B argument misses the point. What you actually want is predictable behavior on a narrow slice of reality — and that's where smaller models live.

Every few months someone benchmarks a new frontier model, a number goes up, and a cohort of builders decides that their product is now obsolete. The argument goes: if the big model does 93% on the eval and the small one does 81%, why would you ever ship the small one? This is the wrong question, and it has been the wrong question for a decade. I wrote most of the Rasa docs around this confusion.

The right question is: what share of your traffic does the small model handle correctly, at what cost, with what latency, under what failure mode? That is the only question a production system actually answers. The eval number is a crude summary statistic, useful for the press release, nearly useless at runtime.

Your model’s job is not to be impressive. Your model’s job is to be predictable on a slice of reality you care about.

The shape of the actual problem

Take a customer-support bot for a mid-sized bank. The tail of weird, open-ended queries — the stuff a frontier model would genuinely handle better — is maybe 4% of traffic. The other 96% is password resets, card status, transaction disputes, and variations on “can I speak to a human.” You don’t need a 400B-parameter model to know that “reset password” means reset the password. You need a system that:

  • Handles the 96% at ~50ms and ~$0.0002 per turn;
  • Knows, loudly, when it’s in the 4% tail;
  • Escalates cleanly to either a bigger model or a human.

This is not a model problem. It’s a routing problem. And routing is where small models are catastrophically, unreasonably good — because “is this request in-distribution for me” is a much easier question than “what is the correct answer,” and it’s a question a small model can answer confidently.
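The routing idea above can be sketched in a few lines. This is a minimal illustration, not Rasa's implementation — every name, threshold, and the toy classifier are hypothetical, standing in for a real small model that emits a confidence score:

```python
# Confidence-based routing sketch. All names and values are illustrative.
FALLBACK_THRESHOLD = 0.7  # assumed tuning knob; set per deployment

def small_model(query: str) -> tuple[str, float]:
    """Stand-in for a small intent classifier returning (intent, confidence)."""
    known = {
        "reset password": ("reset_password", 0.98),
        "card status": ("card_status", 0.95),
    }
    for phrase, (intent, conf) in known.items():
        if phrase in query.lower():
            return intent, conf
    return "unknown", 0.2  # loudly out-of-distribution

def route(query: str) -> str:
    """Handle in-distribution traffic locally; escalate the rest."""
    intent, confidence = small_model(query)
    if confidence >= FALLBACK_THRESHOLD:
        return f"handle:{intent}"  # the cheap, fast 96% path
    return "escalate"              # bigger model, or a human

print(route("I need to reset password"))     # handle:reset_password
print(route("explain derivatives pricing"))  # escalate
```

The point is that `route` never needs to be *right* about the hard query — it only needs to know the query is hard, which is the easier question.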

“Is this in-distribution for me” is a much easier question than “what is the correct answer.”

Cost, specifically

The cost argument is the loudest one and usually the weakest. A frontier model costs 30× more than a 7B open-weights model at the API, but if your volume is small, 30× of nothing is still nothing. So people dismiss cost and keep shipping the expensive thing.

But cost isn’t really about the invoice — it’s about what you’re allowed to do because you’re cheap. When inference is effectively free, you can run the model on every keystroke. You can A/B test prompts by running both. You can evaluate every production turn against a held-out judge. You can retry on low-confidence outputs. You can N=8 sample and vote. All of the actually useful tricks in the production-ML playbook become available when you stop treating each call as an economic event.
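One of those tricks, sample-and-vote, is worth making concrete. A hedged sketch, with a simulated model call standing in for a real cheap inference (the 70% accuracy figure is invented for the demo):

```python
import random
from collections import Counter

def sample_model(prompt: str) -> str:
    """Stand-in for one stochastic call to a cheap model (hypothetical):
    simulates a model that answers this prompt correctly ~70% of the time."""
    return "reset_password" if random.random() < 0.7 else "card_status"

def sample_and_vote(prompt: str, n: int = 8) -> str:
    """N-way sampling with majority vote. Only economical when each
    call costs fractions of a cent -- 8x nothing is still nothing."""
    votes = Counter(sample_model(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

random.seed(0)  # deterministic demo
print(sample_and_vote("I forgot my password"))  # reset_password
```

A 70%-accurate model voted 8 ways is right far more often than 70% of the time; the same ensemble move priced at frontier rates would be eight economic events per turn.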

The boring argument, for completeness

Small models run on your own hardware. That used to be a nerd argument; in 2026, with half the industry now re-learning what “data residency” means, it is a sales argument. Every enterprise procurement cycle I’ve watched this year has ended with some variation of “can we run this on-prem?”. If your answer is no, someone else gets the deal.


None of this is new. We said most of it in 2019 and got laughed at, then said it again in 2022 and got ignored, and now it’s finally fashionable. The substance hasn’t moved much. What moved is the distance between “state of the art” and “good enough to ship,” which has collapsed to something like a pleasant afternoon of finetuning. That collapse is the real story of the last eighteen months, and nobody is telling it because it doesn’t sound like an announcement.

So: small models. Use them. Not as a compromise. As the point.

— Alan

Alan Nichol

Co-founder & CTO of Rasa. Writing about conversational AI and the unglamorous middle of the stack since 2015. Based in Berlin.