Prompting through an API is not the same as chatting

April 25, 2026

When people get good at the chat box they assume they are good at prompting. Then they call a model through the API and the output is worse, flakier, and more expensive, and they cannot figure out why. The reason is simple: the chat box was quietly doing nine tenths of the job. Through the API you own the whole message array, every parameter, and all the parts the UI was handling for you. More power, more rope.

Here is how I think about it after wiring a lot of these.

You are not writing a message, you are assembling a request

A chat prompt is one blob of text. An API prompt is a list of messages with roles, plus a pile of settings. Treat the roles as structure, not decoration.

Put the stable stuff in the system message: who the model is, the rules, the format, the examples. Put only the variable input in the user message. This is not style, it is money and reliability. The stable part is identical every call, so it can be cached and it cannot drift. The variable part is small and isolated, so the model is not re-reading your whole rulebook to find the one new thing.

Order it so the provider can cache it

Most providers cache a stable prefix of your prompt and charge a fraction for the cached part. You only get that discount if the front of your prompt is byte-identical from call to call.

So I put everything invariant first: the system rules, the persona, the examples, the format spec. Then the changing context. Then the new input, last. Reorder those and you break the cache every call and pay full price for tokens you already sent yesterday. On a hot path that is the difference between a sustainable feature and a scary bill. Get the layout right once and it pays you back on every request.

Lower the temperature for work, raise it for voice

The chat UI picks a temperature for you. The API does not, and the default is rarely what you want.

If the job is extraction, classification, JSON, anything with a right answer, push temperature down near zero. You want boring and repeatable. If the job is a reply with personality, let it up. I run extraction calls cold and conversational calls warm in the same app, because they are different jobs with different definitions of good. One number, big effect, almost free to tune.

Show the shape, not the script

Examples are the strongest tool you have for steering output through an API. They are also a trap.

Put a vivid, concrete example in the prompt and the model will lift it word for word at the worst possible moment. I have watched a model parrot an example line verbatim in production because it was the nearest thing to hand. The fix is to make examples abstract: describe the shape with a bracketed placeholder instead of a real sentence. Same goes for anti-examples, the "do not do this" cases, which are underrated. Show the failure shape, not a quotable failure.

For anything multi-step, write the prompt like a procedure

For a one-shot task, a clear paragraph is enough. For anything the model has to do in stages, stop writing a wish and write a procedure. The difference in reliability is not subtle.

Number the steps and make each one self-contained. For every step, spell out four things: what it does, what it takes as input, what it must produce as output, and the constraints it has to respect. Then, where the step is easy to get wrong, add an example, the shape not the script. A model told "summarize this" wanders. A model told "Step 1: extract the named people. Input: the message. Output: a JSON array of proper names only. Constraint: roles like 'boss' do not count" does the same thing every time.

Where the task is analysis, hand it a frame to think in instead of hoping it brings its own. A fixed lens like 5W1H (who, what, when, where, why, how), or whatever dimensions actually matter for your problem, turns "analyze this" from vibes into a checklist the model fills in. Structured questions get structured answers, and the output stays comparable run to run, which is what you want the moment you are processing more than one item.

Then end the prompt with a self-audit checklist: a short list of conditions the model has to verify its own output against before it returns. "Before you answer: is every name a real proper noun? Is the JSON valid? Did you leave out the character's own lines? If any check fails, fix it and check again." It costs a few tokens and catches a surprising amount, because a model is much better at grading a draft against explicit criteria than at nailing it on the first pass. Same reason you ask a person to proofread against a rubric instead of "looks good?"

None of this is exotic. It is the shape of a good runbook or a well-written ticket: numbered steps, clear inputs and outputs, constraints, examples, and a final pass to confirm the work. If you have ever written instructions for a teammate, you already know how. Just write them for a teammate who is fast, literal, and has no memory of the last time you explained it.

Force the structure you are going to parse, then verify it anyway

If your code is going to read the output, do not hope for clean output, demand it. Ask for strict JSON, give the exact keys, cap the length. Then, the part people skip, do not trust it.

Let the model interpret and let your code keep the books. The model is good at reading messy intent and bad at being reliable. So I let it produce the judgment, but anything that has to be exact (a flag, a total, a fallback when a field is missing) is computed in plain code, not asked for in the prompt. The prompt classifies. The code guarantees.

Do not call the model when you do not need to

The cheapest, fastest, most reliable call is the one you never make. A surprising amount of input does not need a model at all: an "ok," a "lol," an empty edit. I short-circuit those before any API call. If the input is filler, skip the model and move on. Every call you avoid is latency and money you keep, and one less chance for the model to do something dumb.

Turn reasoning off until a task earns it

Newer models have a reasoning mode, and it is tempting to leave it on for "better" answers. On a latency-sensitive path it is usually a bad trade. I measured it on a real chat path: turning reasoning on roughly doubled the latency, added cost, and produced no quality difference a user could feel. Reasoning is for genuinely hard, non-interactive tasks. For a real-time reply it is a tax. Default it off and switch it on only where you can prove it earns its keep.

The throughline

Chatting is talking to the model. Prompting through an API is engineering the request around it: roles for cache and clarity, temperature for the kind of job, examples that show shape not script, structure you enforce in code, and the discipline to not call it at all when you do not have to. The model is one ingredient. The reliability is everything you put around it.