AI Product Building · evals · ai-product-management · product-builder · ai-architecture

Tests Pass. Does It Think?

25 March 2026 · 8 min read

TL;DR

  • When AI writes most of the code, passing tests confirms the code works. It doesn't confirm the architecture is sound.
  • The new code review discipline is proof of thoughtfulness: can the engineer explain and defend the choices the model made, not just the choices they made?
  • A tower of AI-made assumptions accumulates quietly until something changes and the whole thing needs to be understood at once.

There's a type of PR review that's becoming more common. An engineer submits code. Tests pass. The diff looks reasonable. You ask a question about one architectural decision — why this approach rather than that one, why this data structure rather than the simpler alternative — and the engineer says: "I'm not sure, that's what the model generated."

That answer used to be implausible. Engineers wrote the code. They had to understand it, at least partially, to write it. Now an engineer can genuinely not know why a system is structured the way it is, because they didn't structure it. They accepted the structure the model chose.

This isn't laziness. It's a new situation that existing code review processes aren't designed for.

What tests actually verify

Test suites answer a specific question: does this code produce the expected output for the cases we thought to test? That's a useful answer. It's not a complete answer.

Tests don't verify that the data model is the right shape for the problem. They don't verify that the abstraction boundaries are in the right places. They don't verify that the approach will still make sense in six months when the requirements shift. They don't verify that the implicit assumptions baked into the architecture are assumptions you'd agree with if you made them explicitly.

When an engineer writes code manually, they at least have to think through those questions to produce working code. The test coverage gap is narrower because the writing process forces some architectural reasoning.

When AI writes the code, the tests can pass and the architecture can still be wrong. The model made reasonable-seeming choices. Nothing broke. The reviewer sees green and moves on. The architectural problem is now load-bearing in production.

I ran into this building OpenChair. I'd have the model generate a system for handling appointment state transitions, tests would pass, and two weeks later when I needed to add a new state I'd discover the model had made an implicit assumption about state linearity that wasn't documented anywhere in the code. Nothing had broken. The assumption was just there, invisible, until the requirement that violated it arrived.
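The shape of that hidden assumption is worth making concrete. This is a reconstructed sketch, not OpenChair's actual code; the state names and helper functions are illustrative:

```python
# The kind of code the model generates: transitions encoded as a list,
# so "next state" is just index + 1. The linearity assumption is never
# stated anywhere -- it's baked into the data structure.
STATES = ["requested", "confirmed", "completed"]

def next_state(current: str) -> str:
    # Implicit assumption: every appointment moves strictly forward.
    return STATES[STATES.index(current) + 1]

# Works for the happy path, and every test passes...
assert next_state("requested") == "confirmed"

# ...until a non-linear state arrives. "rescheduled" follows "confirmed"
# but loops back to "confirmed", which a list index cannot express.
# Making the assumption explicit means an explicit transition map:
TRANSITIONS = {
    "requested":   {"confirmed", "cancelled"},
    "confirmed":   {"completed", "rescheduled", "cancelled"},
    "rescheduled": {"confirmed", "cancelled"},
    "completed":   set(),
    "cancelled":   set(),
}

def can_transition(current: str, target: str) -> bool:
    return target in TRANSITIONS.get(current, set())
```

The second version is barely more code, but the assumption is now visible in the diff, where a reviewer can disagree with it.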

The tower of assumptions

Each AI-generated decision that isn't interrogated is a brick in a tower. Individually they're fine. The model makes reasonable choices. But each choice constrains the next one. A data model shape implies certain query patterns. Those query patterns imply certain index structures. Those indexes imply certain performance characteristics. None of this is wrong until you need the system to behave differently than the chain of assumptions predicted.

The problem isn't that the model made bad choices. Usually it didn't. The problem is that nobody made those choices consciously, so nobody knows they exist until something breaks or until the system needs to do something the assumptions didn't account for.

Traditional code review caught this because reviewers could ask "why did you do it this way?" and get an answer that revealed the reasoning, correct or not. With AI-generated code, that conversation still needs to happen — but now the engineer needs to reconstruct the reasoning retrospectively from the output, not recall it from the process of writing.

That's a fundamentally different skill.

[Figure: two PRs, both with tests passing. One offers a hollow architectural rationale; the other explains the full decision trail.]

What proof of thoughtfulness looks like

The standard I've started applying to non-trivial AI-generated systems: can you explain the three or four architectural decisions the model made and articulate why they're correct for this context?

Not "why did you make this decision" — that question leads to "the model chose it." The right question is "do you agree with this decision, and why?"

That reframe changes what the engineer has to do. They can't just accept what the model generated. They have to evaluate it. The test for the PR isn't just "did it work" — it's "do you understand it well enough to defend it."

At Cotality, the expectation for significant architectural decisions was always that the person proposing could explain why this approach rather than the alternatives, what it assumed, and what would break those assumptions. That standard didn't change when the code started being generated rather than written. If anything, it matters more now, because the model will often generate something that works without it being the right choice for your specific context.

The failure mode I saw at Cotality during our move toward AI-assisted development was engineers submitting PRs with "Claude built this, tests pass" as the complete description. The code worked. The architecture was often fine. But nobody had done the work of understanding it. When it needed to be extended three months later, the cost of that deferred understanding arrived all at once.

The prompt architecture problem

There's a specific variant of this problem that affects AI products directly. When something isn't working quite right in an AI system, the easy fix is to add more instructions to the system prompt. "Always format responses this way." "Never mention this topic." "When the user asks X, respond with Y."

Each addition makes the immediate problem go away. The accumulation creates a system that nobody fully understands. The model starts short-circuiting earlier instructions with later ones. Behaviour becomes unpredictable in combinations the instructions didn't anticipate.

I've watched this happen in production AI systems. The prompt grows to five hundred lines. The team knows roughly what each section is for but not how they interact. A new hire tries to add a simple instruction and introduces a conflict nobody expected. The "fix" is to add another instruction to resolve the conflict.

The correct intervention is usually to step back and ask: are these actually two separate capabilities that should be two separate tools or two agents with smaller, cleaner contexts? The bitter lesson for AI product architecture applies equally to prompt engineering: the quick patch that avoids structural thinking accumulates into debt you'll pay at the worst time.

That decision — "is this a prompt fix or a structural fix?" — requires understanding what you've actually built, not just whether it currently works.

What changes about code review

Code review for AI-generated systems should add one question to the standard checklist: for every non-trivial architectural choice, can the author explain it?

Not "did Claude make this choice" but "is this the right choice for this context, and here's why."

That question forces the engineer to actually understand the code rather than accept it. It often reveals assumptions that should be made explicit in comments or documentation. Occasionally it reveals that the model's choice was reasonable but not optimal, and a quick adjustment produces something better.

Eval infrastructure catches output failures. This catches input failures — architectural decisions made before the system ran. Both matter. The output evals tell you when something is wrong. The architectural review tells you where it might go wrong later.

Green CI means the code works. Proof of thoughtfulness means the code can be trusted.


Frequently Asked Questions

How do you build this discipline without slowing down velocity?

Focus it on non-trivial decisions. Not every AI-generated line warrants interrogation. New database schemas, API contracts, state management approaches, and abstraction boundaries warrant interrogation. Utility functions, configuration, and boilerplate don't. The question "do you understand and agree with this?" should be calibrated to decisions that will be hard to change later.

What if the engineer honestly doesn't know why the model made a choice?

That's the correct starting point, not the end point. The engineer should go back to the model and ask it to explain the choice. If the explanation reveals that the choice was arbitrary (the model had multiple reasonable options and picked one), that's useful — you can now evaluate whether this option is better than the alternatives. If the explanation reveals that the choice rests on an assumption, you can evaluate whether that assumption is correct for your context.

Does this create an unworkable overhead for AI-native development?

Only if applied uniformly to everything. The overhead comes from applying the same level of scrutiny to every line the model generates. The skill is knowing which decisions are load-bearing — the ones that will constrain everything built on top of them — and scrutinising those specifically. That skill is worth developing. The alternative is discovering the wrong choices at the worst time.

Logan Lincoln

Product executive and AI builder based in Brisbane, Australia. Nine years in regulated B2B SaaS, currently shipping production AI platforms. Written from experience building evaluation frameworks at OpenChair.