Agentic-First: Designing Systems for AI From Day One

Most teams integrating AI into their workflows hit a ceiling and assume the AI is the problem. The model isn't smart enough. The prompts need tuning. The output needs more human review.

I hit that ceiling too. Spent months on it. Built increasingly elaborate tooling to try to push through it. The breakthrough came when I realized the AI was never the bottleneck. The system around it was.

This isn't a build guide. It's about a design philosophy I stumbled into by failing first, and why I think it applies to way more than my specific use case.


The Insight: Traditional Development Instincts Don't Apply Here

This is the counterintuitive thing I learned, and it's the core of everything that follows.

I built a website migration tool — crawl a client's existing site, use AI to extract content and map it into page builder templates, deploy to WordPress. My instinct was to make it simple: point it at a site, let the AI do the work. But our traditional development practices pushed the design in a different direction. You don't ship without review stages. You don't automate without human checkpoints. You give users control at every step. More options, more configuration, more levers to pull.

So the system got human checkpoints at every stage. Select which templates the AI could use before processing. Review each page after processing. Manually deploy. Tweak in the page builder afterward.

Here's the thing about giving people control: it only works if they know what to do with it. Every configuration option is a decision someone has to make, and most of the time they don't have enough context to make it well. Which templates should the AI use for this page? I don't know — I haven't analyzed the content yet, that's what the AI is supposed to do. But the system demanded an answer before it would start. So people would guess, or pick defaults, or select everything and hope for the best. Then review the output, decide it doesn't look right, change the selections, reprocess, review again. Each "control point" felt like it was adding precision. It was actually just adding cycles of trial and error.

The more levers you expose, the more ways there are to get it wrong. And every lever needs documentation, or training, or at minimum someone remembering what it does and when to use it. That overhead compounds fast. By the time you've configured the system correctly for a given page, you've spent most of the time you were trying to save.

These are sound instincts for traditional software. When humans are the operators, review stages and manual control points can improve quality — assuming the humans are trained, have context, and are making informed decisions. The problem is that in AI-assisted workflows, most of these control points exist because we don't trust the AI to make the decision. But the humans making the decisions don't have better context than the AI would — they're just guessing with a nicer UI.

Every checkpoint added complexity, not quality. Template pre-selection meant making design decisions before seeing the content — premature choices that constrained the AI. Per-page review meant evaluating output against a mental model that was never communicated to the AI, which led to reprocessing not because the output was wrong, but because it didn't match unstated expectations. Manual deploy meant context-switching into WordPress admin for every page. Post-deploy tweaking meant the "final" output was actually just a starting point for more manual work.

We thought we were guiding the AI. We were adding friction and inconsistency. The AI couldn't learn from any of the corrections because they happened outside its context — in a page builder UI, in someone's head, in a Slack message saying "this doesn't look right." And the tool kept growing in complexity to manage the overhead. Bulk processing, search, filters, status management, bulk status overrides — more and more UI to manage a process that shouldn't have needed humans in the middle in the first place.

The AI wasn't the problem. It could analyze content and make reasonable structural decisions. The problem was the system: a rigid pipeline of isolated, hyper-specialized prompts. "Here's some content, pick a template." OK, now "put this content into this template's slots." The AI never saw the full page, never understood how sections related to each other, never had context about the site as a whole. Each prompt was disconnected from the last, and a human had to mediate every gap.

The new system gives the agent the full picture — the entire crawl snapshot, the component library, the site's design tokens — and lets it make holistic decisions about how to reconstruct a page. And humans do better QA when they're evaluating a complete page rather than approving isolated sections one at a time.
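Roughly, the agent's input looks something like this (a hypothetical sketch; the type names and fields are illustrative, not the actual system's):

```ts
// Hypothetical shape of the input handed to the agent.
// Names and fields are illustrative, not the actual system's types.
interface MigrationContext {
  crawl: CrawlSnapshot;           // full crawl of the source site
  components: ComponentDoc[];     // the entire component library
  tokens: Record<string, string>; // design tokens, e.g. { "color.primary": "#1a3c6e" }
}

interface CrawlSnapshot {
  pages: { url: string; title: string; html: string }[];
  assets: { url: string; localPath: string }[];
}

interface ComponentDoc {
  name: string;        // e.g. "Hero", "CardGrid"
  propsSchema: string; // what the component accepts
  whenToUse: string;   // guidance the agent can reason over
}
```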

The practices that make traditional software development reliable — review stages, human approval gates, manual control points — actively degrade AI agent performance. Human-in-the-loop isn't inherently a quality measure. It depends on where the human is and what they're doing. The test is simple: is this person evaluating an output, or making a decision the agent should be making? Setting constraints up front and reviewing results at the end is quality control. Inserting yourself into the middle of the agent's process to approve, adjust, or redirect at each step is just manual labor with extra steps — and it fragments the AI's context in ways that make the output worse, not better.


The Flip: Design the System for the Agent

Once I stopped thinking about the AI as an assistant that needed supervision and started thinking about it as an operator that needed a well-designed system, everything changed.

The question shifted from "how do I add more human checkpoints to improve quality?" to "how do I design a system where the agent can operate autonomously and produce good output without being interrupted?"

That question led to three design principles that made the new system work.

1. Pick a platform the agent can navigate

Every AI agent is constrained by how well it can understand the system it's operating on. The old tool interacted with WordPress through the Oxygen page builder. WordPress is complex — PHP templates, database entries, plugin interactions, page builder JSON. That complexity is exactly why the old tool had to be so narrow and rigid. I couldn't give the AI any real autonomy because there were too many ways for it to break things. So instead I built a tightly scoped pipeline where the AI only ever did one small thing at a time, and a human handled everything else.

The new system targets Astro — a static site framework where a page at /programs/nursing is literally a file at src/pages/programs/nursing.astro. Content is a YAML entry. Layout is an explicit import. No database, no plugin interactions, no hidden assembly.

I didn't pick Astro because it's trendy. I picked it because files and folders are what AI coding agents reason about best — and because the review surface stays clean. Astro separates content from presentation and shares components across pages, so even substantial changes produce small, focused diffs. A reviewer can approve a PR at a glance.
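To make "content is a YAML entry" concrete: Astro's content collections can validate every entry against a typed schema. A minimal sketch, with illustrative fields rather than the project's actual schema:

```ts
// src/content/config.ts: illustrative schema, not the real one
import { defineCollection, z } from 'astro:content';

const pages = defineCollection({
  type: 'data', // entries are YAML/JSON files (Astro data collections)
  schema: z.object({
    title: z.string(),
    layout: z.string(), // an explicit layout, named rather than hidden
    sections: z.array(
      z.object({
        component: z.string(),       // must match the component library
        props: z.record(z.unknown()),
      })
    ),
  }),
});

export const collections = { pages };
```

If the agent writes an entry that doesn't match the schema, the build fails loudly, which feeds directly into the guardrail layers described below.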

2. Give the agent the right context for the job

Most people write one big instructions file for their AI tools. The agent reads everything and tries to figure out which parts matter right now.

This breaks down fast. Migration and maintenance have completely different instructions. Load both and you waste the agent's working memory on irrelevant info — or worse, the agent applies migration-mode thinking when it should be doing maintenance.

I treat the agent's context as a compiled configuration. Documentation lives as modular sections. Each operational mode has a manifest declaring which sections it needs. A build step assembles them into a single context file. A setup script swaps the active config when switching modes. Skills, hooks, agent definitions — all mode-specific.
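A minimal sketch of that build step, with the manifest format and file layout simplified for illustration:

```ts
// build-context.ts: a sketch; manifest format and paths are illustrative
import { readFileSync, writeFileSync } from 'node:fs';
import { join } from 'node:path';

interface Manifest {
  mode: string;       // "migration" | "maintenance"
  sections: string[]; // doc sections this mode needs, in order
}

function buildContext(manifestPath: string, docsDir: string, outFile: string): void {
  const manifest: Manifest = JSON.parse(readFileSync(manifestPath, 'utf8'));

  // Only the sections this mode declares make it into the agent's
  // working memory; everything else stays out.
  const compiled = manifest.sections
    .map((name) => readFileSync(join(docsDir, `${name}.md`), 'utf8'))
    .join('\n\n');

  writeFileSync(outFile, compiled);
}

// Switching modes is just recompiling against a different manifest.
buildContext('manifests/migration.json', 'docs/sections', 'CONTEXT.md');
```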

Sounds like a minor optimization. It was the single biggest improvement to agent accuracy. Just removing irrelevant instructions.

3. Guardrails at edit time, not after the fact

Traditional workflows validate at commit or deploy time. Fine for humans. Expensive for an autonomous agent — by the time it gets feedback, it's moved on and backtracking is costly.

The system has two layers of automated enforcement. The first fires on every file edit — the agent writes a file, and instantly gets validation back. Schema checks, component linting, structural rules. If it writes invalid YAML or uses a raw color instead of a design token, it knows immediately and self-corrects in the same turn.
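A stripped-down sketch of that first layer (the real rules are broader; the checks shown are illustrative):

```ts
// on-edit-check.ts: illustrative sketch of the edit-time layer
import { readFileSync } from 'node:fs';
import { parse } from 'yaml'; // npm package "yaml"

const RAW_HEX = /#[0-9a-fA-F]{3,8}\b/; // a raw color where a token belongs

// Runs immediately after the agent writes a file; anything returned
// goes straight back to the agent in the same turn.
export function checkEdit(path: string): string[] {
  const problems: string[] = [];
  const text = readFileSync(path, 'utf8');

  if (/\.ya?ml$/.test(path)) {
    try {
      parse(text); // schema validation would hang off the parsed value
    } catch (e) {
      problems.push(`invalid YAML: ${(e as Error).message}`);
    }
  }

  if (RAW_HEX.test(text)) {
    problems.push('raw color value found; use a design token instead');
  }

  return problems; // empty means the edit passes
}
```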

The second layer gates commits. Before anything gets committed, a preflight script runs the full Astro build, validates all internal links against the built output, lints every component, and checks page-level patterns. If any step fails, the commit is blocked. The agent can't ship broken work.
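Conceptually, the gate is a short sequence of commands where any failure aborts the commit. A sketch with placeholder script names:

```ts
// preflight.ts: placeholder commands; the real gate runs project-specific scripts
import { execSync } from 'node:child_process';

const steps: Array<[label: string, cmd: string]> = [
  ['full Astro build', 'npx astro build'],
  ['internal link check', 'node scripts/check-links.js dist/'],
  ['component lint', 'npx eslint src/components'],
  ['page-level pattern check', 'node scripts/check-pages.js src/pages'],
];

for (const [label, cmd] of steps) {
  try {
    execSync(cmd, { stdio: 'inherit' });
  } catch {
    console.error(`preflight failed at: ${label}`);
    process.exit(1); // non-zero exit blocks the commit
  }
}
```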

Both layers are mode-aware. Migration mode locks shared resources — the design system and component library can't be touched. Maintenance mode has different boundaries. And the agent can't modify its own configuration — context, hooks, and skills are protected. Changes to agent instructions go through the doc source, get rebuilt, get reinstalled by a human.
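The boundary logic itself can be a simple per-mode list of locked path prefixes that both layers consult. Paths and mode names here are illustrative:

```ts
// mode-boundaries.ts: illustrative paths, not the real layout
const PROTECTED: Record<string, string[]> = {
  // Off-limits in every mode: the agent can't rewrite its own rules.
  '*': ['CONTEXT.md', '.hooks/', '.skills/'],
  // Migration mode: shared resources are read-only.
  migration: ['src/design-tokens/', 'src/components/'],
  // Maintenance mode draws different boundaries.
  maintenance: ['src/content/archive/'],
};

export function isProtected(mode: string, path: string): boolean {
  const locked = [...PROTECTED['*'], ...(PROTECTED[mode] ?? [])];
  return locked.some((prefix) => path.startsWith(prefix));
}
```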

Clear contracts. Defined scope. Immediate feedback. No drift.


The Evidence: Automate the Process, Not the Steps

Most teams adopting AI zoom in. They find a painful step — this report takes forever, this data entry is tedious, this review process is slow — and point an AI at it. That feels like progress. But it locks in the assumption that the step should exist at all.

The bigger question is: does this workflow only look like this because it was built around human specialization? Most processes are broken into steps so different people can own different pieces. An agent doesn't need that division. It can hold the full context and do the work end to end. If you're breaking a task into steps for the agent, ask why. Is it because the work requires it, or because that's how humans had to organize it?

I learned this building the website migration tool. The old system automated each step of a human workflow — a hundred-page site meant five hundred human decisions. The new system focused on the actual goal: migrate a website. One command: point it at a URL and the agent migrates the site.

The numbers

A simple 20-page site is typically quoted at around 30 dev hours. A larger site can run to 120 hours. The success metric for the original WordPress tool was saving 20% of that time. We thought that was ambitious.

In test migrations, the new system handles virtually 100% of the initial build. The output still needs QA — design refinement, content tweaks, client-specific adjustments — but even those go through the agent. The whole lifecycle is agentic-first, not just the initial build.

                   Old system              New system
Migration cost     30–120 dev hours        ~$20 compute
Capacity           ~2 sites / month        unlimited
Bottleneck         production              sales
Support tickets    manual labor            agent-generated PRs
Time to deliver    weeks                   days

The business impact is structural, not incremental. The bottleneck shifted from production capacity to sales capacity. That's a different company.

Migration isn't even the bigger story. Three support engineers spend a significant portion of their time on website-related tickets — recurring labor on every client, indefinitely. In the new system, those tickets become agent-generated PRs, reviewed and approved rather than manually built. That's a permanent shift in how support time gets spent.


The Bigger Point

I built website migration tools. But the discovery isn't about website migration.

AI agents require a fundamentally different design philosophy than traditional software. The way we've always built things — add oversight, add configuration, add control — assumes a human operator who benefits from those options. An AI agent doesn't. It benefits from clarity, constraints, and context. The traditional playbook doesn't transfer. It's not just unhelpful in an agentic context. It's actively counterproductive.

Every team integrating AI right now is facing the same fork: do you add AI to your existing system and existing practices, or do you design a system for the AI to operate? Most teams pick the first option because it feels safer and doesn't require questioning established practices. I picked the first option too. It produced months of work and a tool that never delivered.

The second option requires rethinking some foundational assumptions:

Start from the goal, not the workflow
Most AI integration starts by automating steps in an existing process. But those steps often only exist because a human was doing the work. Before automating anything, ask: if an agent were handling this from scratch, would this process even look like this?

Architecture is an AI performance decision
The biggest gains don't come from prompt engineering or model selection. They come from giving the agent a system it can reason about clearly. If your agent is struggling, look at the system before the model.

Human oversight belongs at the boundaries
Set up the agent's scope, constraints, and validation rules. Let it run. Evaluate the output. Adding review stages at every step fragments the AI's context and introduces inconsistency.

Context is working memory
Every irrelevant instruction in the agent's context degrades performance. Compile what it needs for the current task, strip out everything else. Highest-ROI optimization. Probably the easiest one to apply right now.

The models don't need to get smarter. You'll never know what they're actually capable of until you stop burying them in systems that were designed for humans. The same model that struggles in one architecture will thrive in another. The difference was never intelligence. It was design.

If your system is too complex for today's models, throwing a smarter model at it won't solve anything. Complexity doesn't become manageable just because the AI got an upgrade. Simplify the system.

It was never about the model, the prompts, or the RAG pipeline. It was always the architecture.

SuedePritch · GitHub