Async LLM pipelines: getting structure out of a model you can't trust
For a while now I've been building the backend of an AI product — the kind that reads your emails, meetings and chats and turns them into structured things: tasks, decisions, draft replies. People assume the hard part is the prompt. It isn't. The prompt is maybe a day of work. The other six months is everything around the model.
The single idea that made the rest fall into place:
Treat the LLM as a slow, expensive, occasionally-lying network dependency — not as "the smart part" of your system.
Once you stop treating it as magic and start treating it like a flaky third-party API that bills you per call, the right backend basically designs itself. Here's what that looks like.
Ask for structure, never parse prose
If you let the model answer in free text and then regex your way into a data model, you've already lost. Pin the output to a schema (response_format with a JSON schema, or tool/function calling) so the contract is explicit. But a schema is a request, not a guarantee: the model can still hand you invalid or truncated JSON, miss a required field, or invent an enum value. So every response goes through the same gate: validate against the schema; on failure, repair or retry; if that fails too, fall back.
- Validate with the same data-validation library you use for the rest of your backend (for me, Pydantic): parse, don't trust.
- On a validation error, one cheap option is to feed the error back and ask for a correction. Cap the attempts.
- If it still fails, degrade gracefully — drop the feature for that item, don't 500 the whole request.
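The gate described above fits in a few lines. What follows is a minimal, stdlib-only sketch under stated assumptions, not a definitive implementation: Task, parse_task and extract_task are hypothetical names, call_model stands in for whatever client you use, and the hand-written checks in parse_task sit where a Pydantic model's validation would normally go.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

PRIORITIES = {"low", "medium", "high"}  # hypothetical enum for this sketch


@dataclass(frozen=True)
class Task:
    title: str
    priority: str


def parse_task(raw: str) -> Task:
    """Parse and validate one reply; raise ValueError on any contract violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title = data.get("title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("missing or empty required field 'title'")
    priority = data.get("priority")
    if not isinstance(priority, str) or priority not in PRIORITIES:
        raise ValueError(
            f"invalid priority {priority!r}; expected one of {sorted(PRIORITIES)}"
        )
    return Task(title=title.strip(), priority=priority)


def extract_task(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[Task]:
    """Validate, feed the error back on failure, and give up after max_attempts."""
    current = prompt
    for _ in range(max_attempts):
        raw = call_model(current)
        try:
            return parse_task(raw)
        except ValueError as err:
            # Repair step: show the model what was wrong and ask again.
            current = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                "Return only corrected JSON."
            )
    return None  # fallback: the caller drops the feature for this item
```

Returning None instead of raising is the "degrade gracefully" branch: one bad item costs you one missing task, not a failed request.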
The mindset: the unhappy path isn't an edge case here, it's a regular Tuesday. Design for malformed output the way you'd design for a payment provider timing out.
Get the model off the request path
LLM calls are slow and variable — hundreds of milliseconds to several seconds, and sometimes they just hang. You cannot put that in front of a user pressing a button. So the user action does almost nothing: it writes a row and enqueues a job. Workers pull the job, call the model, and the result arrives later over SSE / a push / a webhook.
- Every job carries an idempotency key — the same email must not spawn five task-extractions because a worker retried or a webhook fired twice.
- Retries with backoff for the transient stuff (rate limits, 5xx, timeouts); a dead-letter path for the rest so one poisoned input doesn't wedge the queue.
- Partial failure is normal: a batch of ten emails might yield seven clean extractions, two retries, and one fallback. Persist per-item state, not per-batch.
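The three bullets above can be sketched together. This is an illustration under assumptions rather than a production worker: JobStore is an in-memory stand-in for a jobs table with a unique key column, TransientError, idempotency_key and run_job are hypothetical names, and exponential backoff with full jitter is one reasonable choice among several.

```python
import hashlib
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class TransientError(Exception):
    """Rate limit, 5xx or timeout: worth retrying."""


def idempotency_key(source_id: str, task_type: str, content: str) -> str:
    """The same source, task and content always map to the same key."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{task_type}:{source_id}:{digest}"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter; attempt counts from 0."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


@dataclass
class JobStore:
    """Per-item state, keyed by idempotency key (in-memory stand-in)."""
    status: Dict[str, str] = field(default_factory=dict)
    dead_letters: List[str] = field(default_factory=list)

    def enqueue(self, key: str) -> bool:
        if key in self.status:
            return False  # duplicate delivery: a job for this key already exists
        self.status[key] = "queued"
        return True


def run_job(
    store: JobStore,
    key: str,
    handler: Callable[[], None],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry transient failures with backoff; dead-letter everything else."""
    for attempt in range(max_attempts):
        try:
            handler()
        except TransientError:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
            continue
        except Exception:
            break  # non-transient: retrying a poisoned input will not help
        store.status[key] = "done"
        return "done"
    store.status[key] = "dead"
    store.dead_letters.append(key)
    return "dead"
```

In a real system the enqueue check would be a unique constraint or an atomic insert, not a dictionary lookup. The point is that a duplicate is rejected at write time, before any model call is paid for.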
None of this is AI-specific. It's the same queue-and-worker hygiene you'd use for any slow external dependency. The LLM just makes it non-optional.
Latency and cost are product features
Two cheap wins go a long way. First, cache by input: the same content asked the same question, under the same prompt and model, gives a cacheable structured result, and in a product that re-processes overlapping threads the hit rate is real money saved. Second, route per task: a small fast model is plenty for "pull the action items out of this"; save the expensive one for generation that the user actually reads. On top of both, set a token budget per task type so a pathological input can't quietly cost you a dollar.
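Caching, routing and budgets combine naturally in one small wrapper. The sketch below is illustrative only: the model names, budgets and PROMPT_VERSION are placeholders, the cache is a plain dict standing in for a real store, and estimate_tokens uses a rough characters-per-token heuristic where a real tokenizer belongs.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Route:
    model: str
    max_input_tokens: int


# Hypothetical routing table: names and limits are placeholders.
ROUTES: Dict[str, Route] = {
    "extract_action_items": Route("small-fast-model", 4000),
    "draft_reply": Route("large-model", 8000),
}
PROMPT_VERSION = "v1"  # bump when the prompt changes so stale entries miss


class BudgetExceeded(Exception):
    """Input is larger than the budget for its task type."""


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); an assumption."""
    return len(text) // 4 + 1


def cache_key(task_type: str, model: str, content: str) -> str:
    payload = json.dumps([task_type, model, PROMPT_VERSION, content])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_task(
    task_type: str,
    content: str,
    call_model: Callable[[str, str], dict],
    cache: Dict[str, dict],
) -> dict:
    route = ROUTES[task_type]
    if estimate_tokens(content) > route.max_input_tokens:
        raise BudgetExceeded(f"{task_type}: input over {route.max_input_tokens} tokens")
    key = cache_key(task_type, route.model, content)
    if key not in cache:  # only pay for the call on a miss
        cache[key] = call_model(route.model, content)
    return cache[key]
```

Keying on the model and prompt version as well as the content is a deliberate choice: changing either one should invalidate old results without anyone having to remember to flush the cache.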
Personalize with examples, not fine-tuning
For drafts that should sound like the user, the instinct is "fine-tune a model per user." You almost never need to. People edit the drafts you generate — and those edits are the highest-quality signal you'll ever get for free. Store them, and feed a handful of the most relevant past (original → edited) pairs back as few-shot examples. The model adapts to someone's voice from a dozen examples far faster, cheaper and more reversibly than any training run. Fine-tuning is a later optimization, not a starting point.
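Selecting the few-shot pairs is the only moving part here. A minimal sketch, assuming a naive word-overlap score as the relevance measure (an embedding similarity would be the more usual choice); EditPair, select_examples and build_prompt are hypothetical names, and the prompt wording is a placeholder.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class EditPair:
    original: str  # the draft the system generated
    edited: str    # what the user actually sent


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for embedding similarity."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_examples(new_draft: str, history: List[EditPair], k: int = 3) -> List[EditPair]:
    """Pick the k stored pairs whose original draft is closest to the new one."""
    ranked = sorted(
        history, key=lambda pair: similarity(new_draft, pair.original), reverse=True
    )
    return ranked[:k]


def build_prompt(new_draft: str, examples: List[EditPair]) -> str:
    parts = ["Rewrite the draft in the user's voice, following the examples."]
    for ex in examples:
        parts.append(f"Draft:\n{ex.original}\nUser's version:\n{ex.edited}")
    parts.append(f"Draft:\n{new_draft}\nUser's version:")
    return "\n\n".join(parts)
```

Because the examples live in a table and not in model weights, undoing a bad one is a delete, which is what "more reversibly" means in practice.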
You can't unit-test a coin flip
The same input can give different output twice in a row, so a normal assertion-based test is useless. What works is an eval set: a few dozen real inputs with the structured output you expect, and a metric you actually believe — field-level accuracy for extraction, a rubric for generation. Run it on every prompt and model change. This is the only thing that catches the silent regression where a "harmless" prompt tweak quietly tanks accuracy on a category you weren't looking at. Keep a small golden set under version control and do manual spot-checks on top; numbers don't catch "this reply is technically correct but sounds unhinged."
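For extraction, the eval loop can be very small. The sketch below assumes exact-match field accuracy as the metric and reports it per category, so a regression in one corner cannot hide inside a healthy average; EvalCase, run_eval and regressions are hypothetical names, and the tolerance value is a placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    category: str
    text: str
    expected: dict


def field_accuracy(expected: dict, actual: dict) -> float:
    """Fraction of expected fields the extractor got exactly right."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


def run_eval(cases: List[EvalCase], extract: Callable[[str], dict]) -> Dict[str, float]:
    """Mean field accuracy per category."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for case in cases:
        scores[case.category].append(field_accuracy(case.expected, extract(case.text)))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


def regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> List[str]:
    """Categories whose score dropped by more than tolerance since the baseline."""
    return sorted(
        cat for cat, old in baseline.items() if current.get(cat, 0.0) < old - tolerance
    )
```

Run against a stored baseline on every prompt or model change, a non-empty regressions list is the signal to stop and look before shipping.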
The boring truth
Add it up and the actual model call is maybe ten percent of the system. The other ninety is validation, queues, idempotency, caching, retries and evals — the same backend engineering it's always been. The LLM didn't replace that work. It raised the stakes on getting it right, because now your dependency is non-deterministic and your failure modes are weirder. If you're a backend engineer eyeing AI products and worried you're behind on the ML: you're probably closer than you think. The hard part is the part you already know.