AI coding is extraordinary. Let me get that out of the way first. For small, well-scoped tasks, it’s magic. “Write me a function that parses this date format.” “Add a test for this edge case.” “Refactor this component to use hooks.” These land. The code works. It’s better than what I’d write myself, and it arrives in seconds.
But there’s a ceiling, and you hit it the moment the task gets complex. Multi-file changes. Features that touch several systems. Anything that requires sustained context, judgement calls, or a coordinated sequence of steps. That’s where the dream collides with reality.
People keep asking me: “Can’t we rewrite the whole app with AI?” “Can we replace that vendor in a week?” “Why do we still need a team this size?” Every engineering leader I know is fielding these questions. The excitement is real, and good! The understanding of what the work entails is not.
The gap between the promise and the reality isn’t an AI problem. The models are incredible — the one-shot results prove that. It’s an infrastructure problem. The tooling around the AI hasn’t caught up to what the AI can do.
Over the past few months, I’ve gone through a progression that I think most developers will recognise. There are four stages. Most of us are stuck at stage three.
Stage 1: Prompting
Like most people, I was sceptical at first.
I asked ChatGPT to build me an app. It produced code. I pasted it into my editor, ran it — errors everywhere. Back to the chat window, paste the error, get a fix, paste that into the editor, run it again. Another error. Rinse and repeat.
At that point, I couldn’t see AI taking over engineering jobs. I still don’t — but our jobs are changing in ways I didn’t expect back then.
The “aha” moment came later, with smaller asks. A function here, a regex pattern there, boilerplate I’d spend twenty minutes on. For simple, contained tasks, the output was good. Real, usable code — not toy code.
But the workflow was absurd. I was a human copy-paste pipeline between a chat window and a terminal. The AI was fast; I was the bottleneck. I spent more time shuttling text between windows than thinking about the problem.
So I went looking for something better.
Stage 2: Pair programming
I tried Cursor. It didn’t click.
The idea is great: AI lives in the editor, sees your code, suggests the next line or block. Copilot, Cursor inline completion — autocomplete on steroids. The productivity bump is real for certain things. Boilerplate, tests, repetitive patterns. You move faster.
But it’s your pace. You type, it suggests. You drive, it navigates. You can’t step away from the keyboard.
This is where much of the early AI scepticism came from. Studies claimed engineers weren’t actually getting faster, and at this stage they had a point: you’re still doing the work. The AI is just autocomplete.
For me as an engineering manager with a packed meeting schedule, this wasn’t the unlock. I didn’t have hours of uninterrupted coding time. I needed something I could point at a problem and walk away from. I wanted to describe the whole task, not guide the AI through it line by line.
Stage 3: Delegating
Then came Claude Code. The breakthrough.
I describe a task, and it writes the code, runs it, and fixes its own errors. “Build me an invoice exporter that pulls from that API and archives to S3.” All in my terminal. Two hours later, 500k PDFs on my laptop. I shipped more useful tools in two weeks than in my previous four years at the company: invoice exporters, GDPR handlers, Jira-to-Linear migration scripts, an OCR receipt scanner — real tools that helped the business. All while keeping up my meeting schedule.
This feels like the dream. At first.
Then I try building bigger things and reality sets in. I call it the handholding problem.
You have to watch it. Close your laptop, the agent stops. Go to a meeting, it sits there waiting. Your working hours are its working hours.
You have to re-explain things. “Remember to run the linter.” “Don’t modify anything outside the /src directory.” Every session. On one PR, I relayed “Code review bot found more issues” forty times; the PR took a week to close. Sometimes the re-explaining happens mid-session, after the context window compresses and the agent forgets your instructions without warning. You gave it clear rules ten minutes ago. Now it acts like it never heard them.
You have to catch it. It does things you didn’t ask for. Modifies files outside scope. Skips review steps you specified. Not out of malice — it doesn’t remember, or it decided it knew better.
You have to be there. The whole thing is synchronous, terminal-bound, tied to one machine. There’s no “fire and forget.”
I found myself running four or five Claude Code sessions at once, juggling git worktrees: coordinating which agent works on what, switching between terminal tabs non-stop. Less engineer, more factory floor manager, walking between stations, checking output, correcting course, putting out small fires.
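The worktree juggling above can be sketched in a few lines. This is a toy helper, not a real orchestrator: the branch prefix, task names, and path scheme are all made up, and it only generates the `git worktree` commands rather than running them. The point is that each agent session gets its own branch and its own checkout, sharing one `.git` object store.

```python
# Sketch: one git worktree per agent session, so parallel agents
# never edit the same checkout. Naming conventions are hypothetical.
def worktree_commands(repo_dir, tasks):
    """Return the shell commands that create one worktree per task."""
    commands = []
    for task in tasks:
        branch = f"agent/{task}"
        path = f"{repo_dir}-{task}"
        # `git worktree add -b <branch> <path>` creates a new branch
        # checked out in its own directory, backed by the same repo
        commands.append(f"git worktree add -b {branch} {path}")
    return commands

for cmd in worktree_commands("myapp", ["invoice-export", "ocr-scanner"]):
    print(cmd)
```

Each agent then runs inside its own directory, and merges happen later, one branch at a time.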
The data backs this up. Surveys show developers save ten or more hours per week with AI tools, yet report no decrease in workload. The time they save writing code, they spend supervising it.
Delegating promised to free up my time. Instead I spend all of it babysitting.
And the worst part? I can see what stage four looks like. Fire-and-forget. Asynchronous execution. Agents running overnight while I sleep. No tool that exists today gets me there.
Stage 4: Orchestrating
This is where I want to be. Where I think the whole industry is heading.
Here’s what it looks like: I write a spec on my phone during my morning coffee. I queue it up. Agents pick it up, execute in sandboxed containers, run through an enforced review pipeline. A structured process that runs every single time and doesn’t forget a single step. The agents collect proof that the work meets spec. I check the results when I’m ready. After lunch or the next morning.
The shift: synchronous to asynchronous. Babysitting to reviewing. “I watch the agent work” becomes “I check what the agent produced.”
Single agent to fleet. Multiple specs running in parallel, not one terminal at a time.
Implicit trust — “I hope the agent did it right” — to verified trust. Review pipelines the system enforces, not the LLM’s memory. Proof collection that happens whether the agent feels like doing it or not.
Laptop-bound to infrastructure. Runs on a server. Accessible from anywhere. Doesn’t stop when I close my lid, lose wifi, or go to bed.
Few people operate here today. Not because the AI isn’t good enough. The models are extraordinary. The infrastructure to run them this way doesn’t exist yet.
The infrastructure gap
Why are we stuck at stage three?
The models can write code, fix bugs, refactor systems, and reason through complex tasks. They’re not the bottleneck. Every tool is. Every tool assumes a human is in the loop. Watching. On the same machine. Ready to click “approve” or paste an error or re-explain a constraint.
Four pieces of infrastructure are missing:
Autonomy. Fire-and-forget execution. Not terminal babysitting. Queue a task, walk away, come back to results.
Sandboxing. Isolated execution environments. Not agents running with full access to your personal machine, your SSH keys, your .env files. Containers that limit what the agent can touch — so you don’t have to watch its every move.
Pipelines. Enforced review steps that run every time. Not “the AI remembers to lint and test.” A structured process: code generation, automated review, human approval gates, proof collection. The system enforces it regardless of what the LLM decides to do.
Hosting. Server-based, remote-accessible infrastructure. Not your laptop. Not a terminal that stops when your lid closes. Something that runs while you sleep, while you’re in meetings, while you’re on a plane.
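The pipelines piece is the easiest to sketch. In this toy version the stage names and check results are illustrative, and the “agent” is a placeholder function — but the structural point holds: the stage list lives in the system, not in the LLM’s context window, so every stage runs and records proof whether or not the agent remembers it.

```python
# Sketch of an enforced pipeline: the system owns the stage list,
# so no stage can be skipped by a forgetful agent. Stage names and
# check results are illustrative placeholders.
def run_pipeline(task, agent, stages):
    evidence = {}
    for name, check in stages:
        # every stage runs and records proof, unconditionally
        evidence[name] = check(task, agent)
    return evidence

stages = [
    ("generate", lambda task, agent: agent(task)),
    ("lint",     lambda task, agent: "lint: clean"),
    ("test",     lambda task, agent: "tests: passed"),
    ("review",   lambda task, agent: "awaiting human approval"),
]

toy_agent = lambda task: f"patch for {task!r}"
proof = run_pipeline("add invoice exporter", toy_agent, stages)
for stage, result in proof.items():
    print(stage, "->", result)
```

The human approval gate stays in the loop; what disappears is the need to remind the agent that the gate exists.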
These aren’t product features. They’re infrastructure primitives that should exist for any AI agent workflow. Their absence is why we’re all stuck babysitting.
Where are you?
The progression from prompting to orchestrating is inevitable. The industry is moving there. The question is how fast the infrastructure catches up.
And it won’t be one stage for everything. I still one-shot simple tasks in a chat window — a quick function, a regex, analysing a CSV. Pair programming in an IDE works for flow-state feature work. Delegating complex tasks to Claude Code is where the leverage is today. Different stages for different problems. The point isn’t to leave earlier stages behind. It’s to have stage four available when you need it.
So — what stage are you at? Copying and pasting from chat windows? Pair programming with autocomplete? Running multiple terminal sessions, wishing you could walk away?
I’ve gone from sceptic to orchestrating multiple AI agents in a matter of months. Now I’m building toward stage four. Things move fast — follow me if you want to see where it goes.