Fable, Opus, Sonnet, or Haiku? I ran them head to head
Claude just shipped a new top tier, Fable, sitting above Opus at twice the price. Which is the perfect excuse to confront a lazy habit of mine: when something feels hard, I reach for the most expensive model. It is comforting, and it is expensive. So I sat down and actually tested the whole lineup, tier against tier, on tasks with real graders instead of vibes.
The contestants, with their per-million-token prices:
| Tier | Input | Output | vs Opus |
|---|---|---|---|
| Haiku 4.5 | $1 | $5 | 0.2x |
| Sonnet 4.6 | $3 | $15 | 0.6x |
| Opus 4.8 | $5 | $25 | 1.0x |
| Fable 5 | $10 | $50 | 2.0x |
Fable is the new arrival, the top of that list at twice the price of Opus, and the one I was most curious about. A brand-new most-expensive model is exactly the moment to check the assumption underneath all of this: that paying more buys you safety, and not just the feeling of it.
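To make the price table concrete, here is a minimal sketch of the per-request arithmetic. The prices are copied from the table above; the function name `request_cost` and the 20k-in / 5k-out workload are illustrative assumptions, not measurements from these tests.

```python
# Per-million-token prices in USD (input, output), copied from the table above.
PRICES = {
    "Haiku 4.5": (1.0, 5.0),
    "Sonnet 4.6": (3.0, 15.0),
    "Opus 4.8": (5.0, 25.0),
    "Fable 5": (10.0, 50.0),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 20k tokens in, 5k tokens out (not a measured figure).
for tier in PRICES:
    print(f"{tier}: ${request_cost(tier, 20_000, 5_000):.3f}")
```

Because input and output prices scale by the same factor between tiers, the ratio between two tiers stays fixed whatever the input/output mix: Fable always costs twice what Opus does for the same token counts.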
Seven tasks plus two control re-runs, across code, reasoning, prose, an agentic loop, and one real job from my own work. Everything graded objectively where possible: auto-checked test cases, brute-force-verified answers, blind prose ranking. I am Opus, also a contestant, so I leaned hard on graders that do not care who wrote the answer. (Reassuringly, on the objective tasks Opus tied or lost, never got favored.)
Here is what actually happened.
On clean, well-specified code, everyone ties
The first two tasks were ordinary programming. The easy one: write the code that reads a line of typed math like 3 + 4 * (2 - 1) and returns the right answer, doing multiplication before addition, respecting parentheses, and rejecting garbage input. The fiddly one: a little "memory box" that keeps only the last few things you put in, drops the oldest when it gets full, and also forgets anything older than a time limit. Programmers call that an LRU cache with expiry. The point is that it is easy to describe and the bugs hide in how those two rules collide.
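For readers who have not met the second task before, here is a minimal sketch of an LRU cache with expiry. It is not the code any model produced, and the class name, the injectable `clock`, and the choice that reads do not refresh an entry's age are all assumptions made for illustration.

```python
import time
from collections import OrderedDict


class LRUCacheWithTTL:
    """Keeps at most `capacity` entries; entries older than `ttl` seconds expire."""

    def __init__(self, capacity, ttl, clock=time.monotonic):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.ttl = ttl
        self.clock = clock            # injectable so tests can control time
        self._items = OrderedDict()   # key -> (value, stored_at), least recently used first

    def _purge_expired(self):
        now = self.clock()
        expired = [k for k, (_, stored_at) in self._items.items()
                   if now - stored_at >= self.ttl]
        for key in expired:
            del self._items[key]

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, stored_at = entry
        if self.clock() - stored_at >= self.ttl:
            del self._items[key]       # expired: forget it and report a miss
            return default
        self._items.move_to_end(key)   # a hit makes this the most recently used
        return value

    def put(self, key, value):
        # Purge first, so an already-dead entry never forces out a live one.
        self._purge_expired()
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = (value, self.clock())
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```

The collision between the two rules shows up in `put`: without the purge, a full cache holding one expired entry could evict a live entry instead of the dead one.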
| Tier | Test 1 (calculator) | Test 2 (LRU+TTL) | Tokens (test 2) |
|---|---|---|---|
| Haiku 4.5 | 18/18 | 8/8 | 13,430 |
| Sonnet 4.6 | 18/18 | 8/8 | 13,273 |
| Opus 4.8 | 18/18 | 8/8 | 19,960 |
| Fable 5 | 18/18 | 8/8 | 20,012 |
Dead heat, even on the hard one. The pricier tiers just spent about 50% more tokens to reach the same place (roughly 20,000 versus 13,300 on test 2). The difficulty of the algorithm did not separate anyone. What keeps them tied is the clarity of the spec, not the size of the model.
The "tier gap" that was actually a prompt gap
Then I gave them a small bug to fix: a function that decides which software version number is newer, the thing behind every "update available" check. Is 1.3.0 newer than 1.2.9? Easy. But I left one detail deliberately unsaid: is 1.2 the same as 1.2.0, or different? The brief never answered that.
| Tier | Score | What it did with the unstated rule |
|---|---|---|
| Haiku 4.5 | 8/11 | treated trailing zeros as different |
| Sonnet 4.6 | 8/11 | same as Haiku |
| Opus 4.8 | 11/11 | padded with zeros (the common convention) |
| Fable 5 | 11/11 | padded with zeros |
A clean split. Opus and Fable guessed the standard convention, the cheaper two guessed a different one. It looks like ability. It is not. I re-ran it with one extra sentence of spec:
| Tier | Before | After one sentence |
|---|---|---|
| Haiku 4.5 | 8/11 | 11/11 |
| Sonnet 4.6 | 8/11 | 11/11 |
The gap closed completely. It was never a model gap, it was an ambiguity, and a sentence of prompt is far cheaper than a tier upgrade.
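The zero-padding convention that settled the result can be sketched in a few lines. This assumes purely numeric, dot-separated versions (no pre-release tags), and the function name `compare_versions` is hypothetical, not the one from the test harness.

```python
from itertools import zip_longest


def compare_versions(a: str, b: str) -> int:
    """Return -1, 0, or 1 as `a` is older than, equal to, or newer than `b`."""
    parts_a = [int(p) for p in a.split(".")]
    parts_b = [int(p) for p in b.split(".")]
    # Pad the shorter version with zeros, so "1.2" compares like "1.2.0".
    for x, y in zip_longest(parts_a, parts_b, fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


print(compare_versions("1.3.0", "1.2.9"))  # 1
print(compare_versions("1.2", "1.2.0"))    # 0
```

Comparing the parts as integers rather than strings also avoids the classic trap where "1.10" sorts before "1.9".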
The result that made me distrust my own test
A pure thinking puzzle, no code involved: how many ways can you pick three different positive whole numbers, written smallest to largest, that add up to 100? The answer, which I checked by brute force, is 784. I told each model to give the number only, no working shown. Three runs each.
| Tier | Pass rate | Answer |
|---|---|---|
| Haiku 4.5 | 3/3 | 784 |
| Sonnet 4.6 | 3/3 | 784 |
| Opus 4.8 | 0/3 | 817, every time |
| Fable 5 | 3/3 | 784 |
The most expensive established tier failed three times out of three, and the cheapest aced it. If I had stopped there I would have published "Opus is bad at reasoning." Instead I ran a control: same puzzle, but this time working was allowed.
| Tier | Answer only | Reasoning allowed |
|---|---|---|
| Opus 4.8 | 0/3 (817) | 3/3 (784), and it caught its own error |
The failure was caused by the instruction, not the model. "Just give me the answer" strangled the exact reasoning the problem needed. A shocking result is a reason to build a control, not a reason to publish. And the practical lesson is blunt: do not tell a model to skip its thinking on a problem that requires thinking.
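The 784 figure is easy to reproduce. Here is a minimal brute-force sketch, assuming the puzzle means positive integers with a < b < c; it is not necessarily the checker used for the test, and `count_triples` is an illustrative name.

```python
def count_triples(total: int) -> int:
    """Count triples a < b < c of positive integers with a + b + c == total."""
    count = 0
    for a in range(1, total):
        for b in range(a + 1, total):
            c = total - a - b   # the third number is forced by the first two
            if c > b:
                count += 1
    return count


print(count_triples(100))  # 784
```

The count depends on the reading of the puzzle: allowing zero as one of the numbers would give a different total, which is one more reason to state the rules explicitly in the prompt.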
On prose, the expensive tiers do not win
Each model wrote a 200-word explainer of Brooks's law (the old software-team observation that adding more people to a project that is already late tends to make it later) for a non-programmer. Then a four-judge panel, one model per tier, ranked the four essays blind, without knowing who wrote what.
To turn four separate rankings into one result, I used a Borda count: each judge's first choice gets 4 points, second gets 3, third gets 2, last gets 1, and you add up the points from all four judges. Higher total means more liked overall, not just first on one ballot. Top possible score is 16.
| Essay | Borda score | Final rank |
|---|---|---|
| Sonnet | 15 | 1st |
| Fable | 11 | 2nd |
| Opus | 10 | 3rd |
| Haiku | 4 | 4th |
Sonnet won, ranked first by three of the four judges. Haiku came last unanimously: every judge put it fourth. The two most expensive tiers did not win. And the blind protocol earned its keep: the Fable judge was the only one to rank its own essay first, the exact self-preference bias the anonymity exists to catch.
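The Borda tally described above can be sketched as follows. The ballots here are illustrative, not the real ones: they are one arrangement consistent with the reported totals, and which of the other three judges placed Fable second is an assumption.

```python
from collections import Counter


def borda(ballots):
    """Sum Borda points: with n candidates, first place earns n points and last earns 1."""
    scores = Counter()
    for ranking in ballots:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position
    return scores


# Hypothetical ballots, best to worst, chosen to match the published totals.
ballots = [
    ["Sonnet", "Opus", "Fable", "Haiku"],   # assumed ballot
    ["Sonnet", "Opus", "Fable", "Haiku"],   # assumed ballot
    ["Sonnet", "Fable", "Opus", "Haiku"],   # assumed ballot
    ["Fable", "Sonnet", "Opus", "Haiku"],   # Fable judge, who ranked its own essay first
]

print(borda(ballots).most_common())
# [('Sonnet', 15), ('Fable', 11), ('Opus', 10), ('Haiku', 4)]
```

Each ballot hands out 4 + 3 + 2 + 1 = 10 points, so four judges distribute 40 in total, which matches the sum of the scores in the table.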
On an agentic loop, another tie
This is the "agentic" part, meaning the model works on its own instead of answering once and stopping. I handed each one a half-built to-do app with some functions left blank and a checklist of tests it was failing, and told it to finish the job unsupervised: read the code, write the missing parts, run the tests itself, and keep fixing until all of them passed.
| Tier | Score | Loop steps |
|---|---|---|
| Haiku 4.5 | 8/8 | 7 |
| Sonnet 4.6 | 8/8 | 4 |
| Opus 4.8 | 8/8 | 4 |
| Fable 5 | 8/8 | 4 |
Everyone got there. The only difference was that Haiku needed a few more implement-fail-fix cycles to do it. Same destination, slightly longer road.
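The shape of that loop is simple enough to sketch. This is a skeleton under stated assumptions, not the harness used here: `run_tests` and `attempt_fix` are hypothetical callables standing in for the test runner and the model's edit step, and counting one step per fix attempt may not match how the "loop steps" column was counted.

```python
def run_fix_loop(run_tests, attempt_fix, max_steps=10):
    """Alternate test runs and fix attempts until nothing fails.

    run_tests() returns a list of failing test names (empty means all green).
    attempt_fix(failures) makes one round of edits. Both are placeholders.
    Returns the number of fix attempts that were needed.
    """
    steps = 0
    failures = run_tests()
    while failures:
        if steps >= max_steps:
            raise RuntimeError(f"still failing after {max_steps} steps: {failures}")
        attempt_fix(failures)
        steps += 1
        failures = run_tests()
    return steps


# Toy stand-in: each "fix" resolves exactly one failing test.
remaining = ["test_add", "test_remove", "test_complete"]
print(run_fix_loop(lambda: list(remaining), lambda failures: remaining.pop()))  # 3
```

The step cap matters in unsupervised runs: without it, a model that keeps producing the same broken fix would loop and burn tokens indefinitely.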
The one task where the tiers actually separated
Every test above was made up for the experiment. The last one was my actual day job. Take a real screen design, a designer's mockup of one screen in an app, and turn it into a spec: the precise written document a developer builds from. That means listing every input field, every rule, every error message, exactly as the design shows them, and, the hard part, never inventing anything that is not there. When the design is unclear, you are supposed to flag it as a question, not fill the gap with a guess. I fed each model the real screen through my actual pipeline with those real, strict instructions, then scored them on specific things that are objectively true or false in the design.
| What the design actually contains | Haiku | Sonnet | Opus | Fable |
|---|---|---|---|---|
| A field is a dropdown, but the errors assume typed input (a conflict) | partial | flagged | flagged | flagged |
| Two inputs are absent on this screen, only named in error text | invented as real fields | treated as present | flagged absent | flagged absent |
| A counter is switched off, so its limit is uncertain | wrote "counter shown" (wrong) | caught | hedged | cited the exact attribute |
| Confidence calibration (instruction: never guess) | nearly all "high" | good | good | best |
| Overall | weakest, unsafe | solid | top | top |
This is the first clear top-to-bottom gradient: Fable and Opus at the top, then Sonnet, then Haiku. And the important word for Haiku is not "weak," it is unsafe. On a task where the instruction was "never invent anything," Haiku listed fields that do not exist on the screen and rated nearly all of its claims high confidence. On real, messy, high-stakes extraction, that is the difference between a tool you can trust and one you cannot.
This is also the only place Fable hinted at being worth its price, and even there Opus matched it. The Fable-over-Opus edge is one screen, one language, judged once. I would not spend 2x on it.
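The "invented fields" failure is the kind of thing that can be checked mechanically once the true field list is known. Here is a minimal sketch of such a check; the field names, the confidence labels, and the function `score_extraction` are all hypothetical and do not come from the real pipeline or screen.

```python
def score_extraction(design_fields, claimed_fields):
    """Compare a model's claimed fields against the fields truly on the screen.

    design_fields: set of field names that really appear in the design.
    claimed_fields: dict mapping each field the model listed to its confidence label.
    """
    invented = {name for name in claimed_fields if name not in design_fields}
    missed = set(design_fields) - set(claimed_fields)
    # The dangerous case: a field that does not exist, asserted with high confidence.
    overconfident = {name for name in invented if claimed_fields[name] == "high"}
    return {"invented": invented, "missed": missed, "overconfident": overconfident}


# Hypothetical example: "phone" is only named in an error message, not on the screen.
truth = {"email", "country"}
claimed = {"email": "high", "country": "medium", "phone": "high"}
print(score_extraction(truth, claimed))
```

A check like this only catches invented fields, not the subtler calibration differences in the table, which still needed a human read.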
Three things I take away
The prompt is a bigger lever than the tier. Every gap I found in code and reasoning closed by changing the prompt, not the model. A clearer sentence is cheaper and more reliable than an upgrade. If a cheap tier is failing, suspect your spec before you suspect the model.
Expensive is not safer. Twice, a pricier tier either looked worst at face value (Opus on the answer-only puzzle) or simply did not win (Opus and Fable on prose). A higher tier is not a safe default you can hide behind. You have to match it to the task.
The gap is real exactly where judgment is the job. On clean synthetic tasks everyone ties. On the messy real one, and on prose, they separate, and on the messy real one the cheapest model is the dangerous one. Cheap is fine until the task is "be careful," and then it is not.
What this is not
I want to be honest about the size of this. It is a quick test, not a standard. The cases are mine, hand-built, small in number, one to three runs each, and the real-world task is a single screen out of dozens. I have not turned these into a stable, repeatable benchmark I could hold up as a measuring stick later, and I have not yet leaned on the results in day-to-day work to see if they hold. Treat all of it as direction, not statistics. It is Claude tiers only, too, no cross-vendor comparison.
So what do I actually use
With the prices and the experience I have so far, I do not see a reason to reach for anything above Opus, the shiny new Fable included. Not once in normal building have I wished for a tier higher than Opus. (If your work is genuinely extreme, say security and exploit work, that may differ. Mine is not.)
Past that, it depends on how you pay:
- On a subscription, where tokens are not really the constraint: at my usage, a side project most days, I have never hit the weekly limit, even living on Opus. That comes with a premise: I run a decent set of rules that keep my usage efficient, and I know what I am doing with the tool. If both are true, just use the good model and stop thinking about it.
- Paying per token: Sonnet is the safe default. It won the writing test, tied the top tiers on every coding test once the spec was unambiguous, and costs 60% of what Opus does. And for short tasks without much multi-step reasoning (a rename, a regex, a one-off script), Haiku is completely fine at a fifth of Opus's output price.
The lazy version of me reached for the expensive model whenever something felt hard. The tests say that instinct is mostly wrong. The cheaper tiers are enough far more often than I assumed, the expensive ones are not a safety blanket, and the thing that actually moved quality, over and over, was writing a clearer prompt.