LLMs are functions, not brains.

The AI industry wants you to build agents. Autonomous systems that reason, plan, and act on your behalf. I think that's wrong for 90% of what developers actually need.

This is my case for a simpler, more reliable way to build with language models.


The Agent Problem

Right now, everything is agents. Open Twitter: agents. Read the docs for any new framework: agents. Ask how to add AI to your product and the first answer is "let the model decide what to do. Let it plan its own steps. Let it reason about what to call next." Agents are cool. But somewhere along the way they became the default starting point, and nobody stops to ask whether they actually need the model to make decisions or just need it to do what it's told.

I've shipped several LLM-powered tools to production. Not one of them is an agent. Some of them started that way and got rewritten once I realized the autonomy wasn't adding anything except latency and unpredictability.

The problem with agents isn't that they don't work. It's that they work unpredictably. You trade a known execution path for "autonomy" that mostly means "I don't know what it's going to do." When an agent-powered feature breaks in production, you're debugging a conversation transcript, not a stack trace.

Most "agent" use cases are actually workflows, a known sequence of steps where one or two of those steps happen to involve an LLM. You don't need autonomy for that. You need a function call.


LLMs as Functions

Here's the reframe: an LLM call is a function. It takes typed input, runs a transformation, and returns typed output. If you treat it that way, everything gets simpler. You get predictable behavior through structured outputs. Composability, because piping the output of one function into another is just code. Testability, because you can mock the function. Debuggability, because you can see exactly what went in and what came out.
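That last claim is worth making concrete. If classify() is an ordinary function, your pipeline tests never need to touch a model. A minimal pytest-style sketch; the pipeline module, its classify(), and route_ticket() are all hypothetical stand-ins:

Sketch · mocking an LLM function
# test_pipeline.py
from unittest.mock import patch

import pipeline  # hypothetical module defining classify() and route_ticket()

def test_routing_without_a_model():
    # Swap the LLM-backed classify() for a canned answer.
    with patch.object(pipeline, "classify", return_value="billing"):
        assert pipeline.route_ticket("I was charged twice") == "billing-queue"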

You don't need a framework to compose LLM calls. You need well-defined functions and regular code.

There are patterns that come up constantly — verbs I reach for over and over: extract, classify, rewrite, expand, compress. These aren't the only things you can do with a language model. They're just the ones I find myself writing most often. In practice, my codebase has dozens of specialized variants — different flavors of compress tuned for different contexts, different expand functions with different output shapes. But the mental model is always the same: typed input in, typed output out, one job per call.
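In code, that mental model is just a set of narrow signatures. These are illustrative, not a library; the names and parameters mirror the pseudocode used throughout this post:

Sketch · the verbs as signatures
def extract(text: str, what: str) -> dict: ...                # pull named fields out
def classify(text: str, categories: list[str]) -> str: ...    # pick one label
def rewrite(text: str, style: str) -> str: ...                # same content, new voice
def expand(text: str, format: str, audience: str | None = None) -> str | list[str]: ...  # less in, more out
def compress(text: str, maxLength: int, preserve: list[str] | None = None) -> str: ...   # more in, less out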


What an LLM Function Looks Like

An LLM function isn't just a prompt. It's a typed input schema, a system prompt, a typed output schema, validation, and retry logic, all wired together so the LLM does one transformation and your code does everything else. The prompt is an implementation detail, not the interface.
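Concretely, here's one shape that can take in Python. Everything below is a sketch: call_model() stands in for whichever client you actually use, the schema is specific to this classify() example, and retries are deferred to the failure-modes section later.

Sketch · classify() as an LLM function
from dataclasses import dataclass
import json

@dataclass
class Classification:      # typed output schema: this is the real interface
    category: str
    confidence: float
    reasoning: str

SYSTEM_PROMPT = (
    "Classify the text into exactly one of the given categories. "
    "Respond with ONLY a JSON object with keys: category, confidence, reasoning."
)

def classify(text: str, categories: list[str]) -> Classification:
    raw = call_model(  # hypothetical client wrapper, not a real SDK call
        system=SYSTEM_PROMPT,
        user=f"Categories: {categories}\n\nText: {text}",
    )
    data = json.loads(raw)                    # validation starts here
    if data["category"] not in categories:
        raise ValueError(f"category {data['category']!r} not in allowed set")
    return Classification(**data)             # typed output, not a raw string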


Typed parameters in, structured JSON out. The prompt is generated, sent, validated, and retried without leaving your pipeline. Here's what a real call returns:

Actual output · Claude Sonnet 4.6
{
  "categories": [
    { "category": "billing",   "confidence": 0.95, "reasoning": "User reports being charged twice for subscription" },
    { "category": "technical", "confidence": 0.85, "reasoning": "User cannot log into account since an update" },
    { "category": "account",   "confidence": 0.75, "reasoning": "Login issue directly relates to account access" }
  ]
}
claude-sonnet-4-6 · 233 tokens · 2.1s · 0 retries

The LLM is a utility in your pipeline. The prompt is how you configure it.


Functions Compose

The whole point of treating LLM calls as functions is that functions compose. You don't need an orchestration framework. You don't need a DAG. You just call them.

Blog post pipeline
outline = expand(
    "trade school benefits",
    format="bullets",
    audience="parents",
)
paragraphs = [expand(point, format="paragraphs") for point in outline]
post = compress("\n\n".join(paragraphs), maxLength=800)

Ticket router
category = classify(
    ticket,
    categories=["billing", "technical", "general"],
)
priority = classify(
    ticket,
    categories=["urgent", "normal", "low"],
)
summary = compress(ticket, maxLength=50)
route(category, priority, summary)

Different applications. Same patterns. No agent decided what to do. The developer decided. The LLM just executed the steps it's good at.


The Showdown

OK, but what about when agents work? What about the happy path, where the agent picks the right tools in the right order, doesn't loop, doesn't hallucinate? Surely then it's at least as good as the workflow?

Let's test it. Same task. Same tools. Same model. The only difference: who decides what to call and when.

The task · process an inbound sales lead
Subject: Interested in your platform

Hi there,

I'm the Director of Enrollment at Westfield College (about 2,500 students).
We're currently using a basic WordPress form for inquiries and it's a mess —
no tracking, no follow-up automation, leads just go to a shared inbox.

We'd need something that integrates with our Slate CRM. Budget is probably
$15-25k/year. We'd want to be up and running by fall enrollment season
(August). Could also use help with our program pages — they're just walls
of text right now.

Let me know if this is a fit.

Thanks,
Maria Chen
mchen@westfield.edu
(555) 312-8847
The tools · identical for both
extract()
classify()
compress()
expand()
The workflow · 4 lines of code
details  = extract(lead, what="contact, company, budget, timeline, needs")
priority = classify(lead, categories=["hot", "warm", "cold"])
crm_note = compress(lead, maxLength=60, preserve=["budget", "timeline"])
reply    = expand(details, format="email", audience="enrollment director")
The results · workflow vs. agent

                    Workflow    Agent
LLM calls           4           4
Reasoning steps     0           3
Total tokens        1,793       10,585
Time                16s         44s
Est. cost           $0.008      $0.047
Output quality      identical   identical

The agent got it right. Every tool call was correct. The order was optimal; it even ran the first three calls in parallel. This is the best case for agents, and it still used 490% more tokens, took nearly 3× longer, and produced the same output.

Those three reasoning steps added zero value. The developer already knew the steps. The agent was explaining to itself what four lines of code already specified. And the final reasoning step? The model wrote a full markdown report with tables, headers, and emoji. Completely invisible to the end user. Pure overhead.

And this is the happy path. In production, agents loop. They call tools with wrong parameters. They second-guess themselves. The workflow does none of this because there's nothing to decide. Just execute.


What Goes Wrong

LLM functions aren't magic. There are three failure modes I hit regularly in production, and being honest about them is part of the pitch. At least they're predictable.

Malformed output. LLMs don't always return valid JSON. A well-built LLM function validates the response and retries with a reinforcement prompt. Not a fresh request, a targeted correction:

First attempt · malformed
Here's the classification result:
```json
{ "category": "billing", "confidence": 0.92 }
```
The text clearly relates to billing issues.
Reinforcement prompt
Your response was not valid JSON.
Issue: Response contains text outside the JSON object.
Respond with ONLY the corrected JSON. No markdown fences, no explanation.
Second attempt · valid
{ "category": "billing", "confidence": 0.92, "reasoning": "Mentions being charged twice" }

Confidence scores are vibes, not probabilities. A 0.95 doesn't mean the model is right 95% of the time. It means the model feels strongly. On ambiguous inputs, you'll get high-confidence wrong answers. Treat confidence as a relative signal for ranking or routing, not an absolute threshold for trust. If the decision matters, route low-confidence results to a human reviewer instead of hard-coding if confidence > 0.8 and trusting whatever clears it.
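In code, the threshold decides who looks at the result, not whether the result is right. A sketch building on the classify() from earlier; send_to_human_review() is a hypothetical queue:

Sketch · route on confidence, don't trust it
result = classify(ticket, categories=["billing", "technical", "general"])
if result.confidence < 0.8:
    send_to_human_review(ticket, result)   # hypothetical queue; tune the cutoff
else:
    route(result.category)                 # high confidence still gets spot-checked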

Compression drops things. compress() without a preserve list will make judgment calls about what matters. In the sales lead example, a naive compress might drop the Slate CRM requirement because it's one line out of many. That's the make-or-break detail. The fix is explicit: pass preserve=["CRM", "budget", "timeline"] and validate that those topics appear in the output. Don't assume the model's editorial instincts match yours.
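The validation half of that is a few lines. This assumes the compress() signature from earlier; what you do on failure (retry with a reinforcement prompt, or fail loudly) is up to you:

Sketch · validate what compress() kept
MUST_KEEP = ["Slate", "budget", "timeline"]

note = compress(lead, maxLength=60, preserve=MUST_KEEP)
missing = [topic for topic in MUST_KEEP if topic.lower() not in note.lower()]
if missing:
    # Retry with a reinforcement prompt naming the dropped topics, or fail
    # loudly. Silently losing the CRM requirement is the worst outcome.
    raise ValueError(f"compress() dropped required topics: {missing}")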

Two of these three are schema and config problems, not model problems. That's the point. When an LLM function fails, you know exactly where to look.


The Takeaway

LLMs are powerful, but they're not decision-makers. They're transformers, in the colloquial sense. They transform text from one shape to another. The right abstraction isn't an autonomous agent. It's a function: typed input, typed output, composable with code you already know how to write.

Agents will have their place for genuinely open-ended tasks. But for the vast majority of LLM-powered features developers are building today, you don't need autonomy. You need well-defined functions.

The best tools disappear into the workflow. An LLM function should feel like calling JSON.parse(). You don't think about how it works. You just trust the output.

SuedePritch · GitHub