When Code Gets Cheap, Comprehension Gets Expensive

A year ago, I built this blog entirely with Claude Code. Zero manual coding. I wrote a post about it celebrating the fact — “every single line of code powered by AI collaboration.” I was genuinely excited.

I’m still excited about AI-assisted development. I build my startup’s backend with it every day — Minty, a conversational AI recruitment platform that’s grown to over 800 source files, 400 tests, and 69 feature PRDs. I teach other developers how to use it. I built a framework around it.

But I was wrong about something important: I thought the hard part was getting the code written. It isn’t. The hard part is — and has always been — understanding the code. AI didn’t make that easier. In some ways, it made it harder.

The cost didn’t disappear. It migrated.

Mark Downie made an observation that stuck with me: when code production gets cheap, the cost doesn’t disappear. It moves. It shifts from creation to comprehension.

He drew a parallel to the outsourcing wave of the early 2000s. When companies offshored development, code got cheaper to produce. But the hidden cost was everything that came after — maintenance, debugging, knowledge transfer, the constant struggle to understand code written by people you’d never meet in a timezone eight hours away.

Here’s what’s different now. With outsourcing, there was always a human somewhere who understood the code. You could email them. You could schedule a call. The knowledge existed in someone’s head, even if it wasn’t yours.

With AI-generated code, that human might not exist at all.

An agent can produce 500 lines of working code that nobody — including the person who prompted it — fully understands. It compiles. The tests pass. The feature works. But if you asked someone three weeks later why it’s structured that way, what invariants it depends on, or what happens when you change this one function, you’d get a shrug.

I learned this the hard way

A few weeks ago, I implemented email infrastructure for Minty — transactional outbound via SendGrid, inbound reply capture, an agent loop for email replies. Two days. Roughly 20,000 lines of code across the adapter, the webhook handler, the signature verification, the suppression management, the template system. AI-assisted, fast, intense. It worked. The feature shipped.

Then I had to go back.

A week later I needed to modify a piece of that email flow. I opened the files, read the code, and realized something uncomfortable: I didn’t fully understand my own codebase. The agent had made reasonable decisions — architectural patterns, naming conventions, error handling approaches — but the reasoning behind those decisions lived in a chat session that was already buried in context history.

The code was there. The intent wasn’t.

That’s when the lesson from Downie’s article stopped being theoretical and became operational. I went back and added what I now call intent trails to every significant piece of agent-generated code: why this structure exists, what invariants matter, what tests prove, and where the next maintainer should look first. Not as a courtesy — as a survival mechanism.
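
To make that concrete, here is a minimal sketch of what an intent trail can look like when it lives in a docstring. Everything in it is hypothetical: the function, the suppression rule, and the test name are invented for illustration and are not taken from Minty's codebase.

```python
def is_suppressed(recipient: str, suppressed: set[str]) -> bool:
    """Return True if no email may be sent to this recipient.

    Intent trail (all details here are hypothetical, for illustration):
      Why:        one shared check, so every outbound path applies the same
                  suppression rule instead of re-implementing it.
      Invariants: entries in `suppressed` are stored stripped and lowercased;
                  the lookup normalizes the recipient the same way.
      Tests:      test_suppression_ignores_case_and_whitespace (made-up name)
                  proves " Ana@Example.com " matches "ana@example.com".
      Start here: if a suppressed address still receives mail, check whether
                  the caller skipped this function before changing it.
    """
    return recipient.strip().lower() in suppressed
```

The code is one line. The docstring is the part that would have saved me the archaeology.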

Functional correctness is the floor, not the ceiling

There’s a tempting assumption that if the code works and the tests pass, we’re good. Two recent research papers challenge this directly.

BUILD-AND-FIND (a recent evaluation protocol) frames the problem precisely: a builder agent creates a codebase, then a separate finder agent is given only the repository and must recover the intended behavior and design decisions. The benchmark doesn’t just measure whether the finder gets the right answer; it also measures how much inspection effort that recovery requires. How many files did it have to read? How many tool calls? How confident was it?

The insight is that a generated repository isn’t just a program. It’s a communication artifact for future agents and humans. A repo that passes all tests but scatters its intent across undocumented modules is a poor communication artifact — even if it’s functionally correct.

Constraint Decay makes the same point from a different angle. When agents generate backend code, they can produce something that’s behaviorally correct while silently violating architectural constraints — layering rules, ORM choices, module boundaries, API contracts. The tests pass because the behavior is right. But the architecture is eroding, and nobody notices until the next change becomes twice as hard.
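
A layering rule is the easiest of these constraints to check mechanically. The sketch below is an assumption-heavy illustration, not the paper's method: the three layer names and the `ALLOWED` table are hypothetical, and the check only looks at absolute imports in a single module's source.

```python
import ast

# Hypothetical rule: each layer may import only from the layers listed for it.
ALLOWED = {
    "api": {"api", "services"},
    "services": {"services", "repositories"},
    "repositories": {"repositories"},
}


def layering_violations(module_layer: str, source: str) -> list[str]:
    """Return imported module names that break the layering rule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            # Only imports of known layers are judged; stdlib and
            # third-party imports are ignored by this sketch.
            if top in ALLOWED and top not in ALLOWED[module_layer]:
                violations.append(name)
    return violations
```

An `api` module that imports straight from `repositories` can return exactly the right response and pass every behavioral test. This check flags it anyway, because the service layer was bypassed.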

Both papers point to the same conclusion: functional correctness is the floor. It’s the minimum acceptable standard, not the goal. The goal is a codebase that future agents and humans can understand, extend, and safely modify.

Sebastian Raschka’s analysis of coding agent architecture lands on the same truth from yet another angle: the harness — context management, tool design, validation, the agent loop — determines capability more than the model itself. The same model in a bare chat interface versus a well-designed harness produces dramatically different results. The harness is where comprehension either gets preserved or lost.

What I’m actually doing about it

This isn’t an abstract concern for me. I just shipped version 2.0 of Loom — a framework that orchestrates AI agents for software development. And I’m not testing it on toy projects. Minty — 800 files, 400 tests, 29 domain modules, 69 PRDs — is built and delivered through it. When the engine runs, it produces the actual commits you can see in the git history: merge: loom/builder-XXX/minty-chat-YYY. Every feature in that product went through the system I’m about to describe.

Here’s what’s in the framework now:

The Deterministic Floor. After an agent builds something, plain Python code runs eleven automated checks before the change is accepted. Five are what you’d expect: does it compile, do types pass, did lint get worse, does the test suite pass, does coverage meet a threshold on changed lines. The other six are integration-level checks that single-file tools miss: duplicate top-level symbols, unused imports, dead definitions that nothing references, dependency cycles introduced by the change, migration head conflicts, and committed build artifacts. These catch exactly the kind of slow architectural erosion that the Constraint Decay paper warns about: code that passes tests because the behavior is right while the structure is quietly rotting.
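
To give a feel for what an integration-level check involves, here is a rough sketch of one of them, duplicate top-level symbols. It is not Loom's implementation; the function name and the choice to pass sources in as a dict of path to text are assumptions made to keep the example self-contained.

```python
import ast
from collections import defaultdict


def duplicate_top_level_symbols(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each top-level function or class name defined more than once
    to the files that define it."""
    defined_in = defaultdict(list)
    for path, source in sources.items():
        # Only module-level definitions count; nested ones are ignored.
        for node in ast.parse(source).body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                defined_in[node.name].append(path)
    return {name: paths for name, paths in defined_in.items() if len(paths) > 1}
```

No single-file linter can see this, because each file is fine on its own. The problem only exists across the tree.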

And yes, the floor itself suffered from the exact failure mode it’s designed to catch. The checks register themselves as import side-effects, but nothing imported the package that triggered registration. So when the floor ran, it checked against an empty registry, and everything passed. The floor shipped as a silent no-op that validated nothing, twice. The code was correct. The wiring was wrong. This is the failure class I now call unit-green, integration-dormant: every unit test passes, but the code is never connected to the live runtime. It took an end-of-run full-suite verifier, the one mechanism that ran against the real integration tree rather than the unit test fixtures, to catch it.
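
The mechanics of that failure fit in a few lines. The sketch below is a simplified reconstruction under assumed names (`CHECKS`, `register`, `run_floor_naive`), not Loom's code. The guarded variant is one possible extra safeguard, separate from the full-suite verifier that actually caught the bug.

```python
from typing import Callable

# Registry that check modules fill in as a side effect of being imported.
CHECKS: dict[str, Callable[[], bool]] = {}


def register(name: str):
    """Decorator that adds a check to the registry at import time."""
    def wrap(fn: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = fn
        return fn
    return wrap


def run_floor_naive(checks: dict[str, Callable[[], bool]]) -> bool:
    # all() over an empty collection is True, so an empty registry "passes".
    return all(check() for check in checks.values())


def run_floor_guarded(checks: dict[str, Callable[[], bool]], expected: int) -> bool:
    # Refuse to report success unless the expected number of checks is wired in.
    if len(checks) != expected:
        raise RuntimeError(
            f"floor has {len(checks)} checks registered, expected {expected}"
        )
    return all(check() for check in checks.values())
```

If no module using `@register` is ever imported, `run_floor_naive(CHECKS)` returns True without running anything. The guarded version turns the same situation into a loud error.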

Two human gates. When you tell Loom to orchestrate a feature end-to-end, it doesn’t just blast out code. It pauses twice. Gate #1 is a spec red-team: before any code is written, the agent decomposes the feature into stories and a human reviews the decomposition. Does the plan make sense? Are the stories actually independent? Are they the right size? If the decomposition is bad, you catch it before a single line is generated. Gate #2 is a confidence report: after all stories are built and floored, the agent produces an assessment of what it delivered and a human reviews before the PR opens.

Per-story floors, not just end-of-run checks. The floor doesn’t just run at the end. It runs after every single story. If story #3 introduces a dead definition, the run stops there. You don’t get to story #8 with six accumulated violations and a mess to untangle. Halt-and-surface: the agent tells you what broke, and you decide whether to fix it or adjust the plan.
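
The control flow is simple enough to sketch. This is a minimal illustration, not the engine's code: `build` and `floor` are hypothetical callables standing in for the agent's build step and the checks, and `floor` is assumed to return a list of problem descriptions.

```python
from typing import Callable, Iterable


class FloorViolation(Exception):
    """Raised when a story fails the floor; the run halts at that story."""


def build_stories(
    stories: Iterable[str],
    build: Callable[[str], None],
    floor: Callable[[str], list[str]],
) -> list[str]:
    """Build stories in order and run the floor after each one."""
    completed = []
    for story in stories:
        build(story)
        problems = floor(story)
        if problems:
            # Halt and surface: later stories are never started.
            raise FloorViolation(f"{story}: {', '.join(problems)}")
        completed.append(story)
    return completed
```

The important property is what does not happen: once a story fails, nothing after it gets built on top of the violation.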

Decomposition as a gate, not a suggestion. One of the first bugs I hit in production was the engine re-planning completed stories — silently re-doing work because the decomposition logic didn’t filter finished items. The fix was making decomposition a first-class gate: the engine validates that stories are properly decomposed, independent, and scoped before building anything. If they’re not, it refuses to start.
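
Here is a rough sketch of that gate. The `Story` shape and the independence rule (no pending story may depend on another pending story) are assumptions chosen for the example. The real engine's notion of "independent" and "scoped" may well be richer.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    id: str
    done: bool = False
    depends_on: list[str] = field(default_factory=list)


def stories_to_build(stories: list[Story]) -> list[Story]:
    """Validate the decomposition, then return only the unfinished stories."""
    ids = [s.id for s in stories]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate story ids in decomposition")

    pending = {s.id for s in stories if not s.done}
    for story in stories:
        if story.done:
            continue
        blocked_by = [d for d in story.depends_on if d in pending]
        if blocked_by:
            # Assumed independence rule: pending stories may not depend
            # on other pending stories.
            raise ValueError(f"{story.id} depends on unfinished {blocked_by}")

    # Filtering finished items is what prevents silent re-planning.
    return [s for s in stories if not s.done]
```

The last line is the original bug fix. Everything above it is the gate: a bad decomposition raises before anything is built.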

The freshness guard. The engine refuses to orchestrate when your local main is behind origin. This sounds trivial, but it’s the kind of guardrail that prevents the exact scenario where an agent builds on stale assumptions, generates code that conflicts with upstream changes, and leaves you debugging a merge that should never have been necessary.
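
A guard like this can be built on two ordinary git commands. The sketch below is my own illustration rather than Loom's code: the function names are hypothetical, and the `run` parameter exists only so the git calls can be swapped out in tests.

```python
import subprocess


def commits_behind(branch: str = "main", remote: str = "origin", run=subprocess.run) -> int:
    """Count commits on remote/branch that the local branch does not have."""
    # Refresh the remote-tracking ref so the comparison is not itself stale.
    run(["git", "fetch", remote, branch], check=True, capture_output=True, text=True)
    result = run(
        ["git", "rev-list", "--count", f"{branch}..{remote}/{branch}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def assert_fresh(branch: str = "main", remote: str = "origin", run=subprocess.run) -> None:
    """Refuse to continue when the local branch is behind its remote."""
    behind = commits_behind(branch, remote, run)
    if behind:
        raise RuntimeError(
            f"local {branch} is {behind} commit(s) behind {remote}/{branch}; "
            "pull before orchestrating"
        )
```

Calling `assert_fresh()` at the start of a run, from inside the repository, is enough to turn a confusing merge later into a one-line error now.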

The scarce resource shifted

Here’s what I think most people miss when they talk about AI and software development:

If generating average code is nearly free, then the ability to generate code is no longer the scarce resource. The scarce resource is the ability to read it, navigate it, understand which parts matter, and know why.

This is true for AI-generated code. But it was becoming true before AI, too. Modern codebases are enormous. Dependencies are deep. Nobody reads the whole thing. The skill that separates senior engineers from junior ones has never really been writing code — it’s been understanding systems.

AI didn’t create this problem. It just made it impossible to ignore.

What this means for how we build

If you’re building software with AI agents — whether you’re using Claude Code, Cursor, Copilot, or your own framework — here’s the practical shift:

Stop optimizing for code production speed. Start optimizing for code comprehension speed. The bottleneck isn’t getting code written anymore. It’s understanding what was written, verifying it’s safe, and knowing what happens when you change it.

This means:

Write the why alongside the what, every time
Treat commit messages and PR descriptions as load-bearing documentation, not paperwork
When an agent generates code, review it the same way you’d review a junior developer’s PR — for understanding, not just correctness
Build evaluation around whether future agents can recover intent from your repo, not just whether tests pass

The teams and tools that figure this out will build faster and maintain velocity over time. The ones that don’t will hit a wall — a pile of working code that nobody can safely change.

I learned this the hard way. The 20,000 lines of email code work. Minty runs — customers are on it. But the cost of that speed showed up later, when I had to go back and understand what I’d built.

Code got cheap. Comprehension got expensive. And the tools and practices that account for that shift are the ones that will actually work in production.

That’s what Loom is. Not a chat wrapper — a deterministic harness that treats comprehension as a first-class concern, enforced by code. Eleven automated checks. Two human gates. Per-story floors that halt the moment architectural integrity slips.

The harness matters more than the model. That’s what I’ll be writing about here.