50 Comments
Oliver Schoenborn's avatar

I don't agree. I think people who believe in spec driven development are naive about how hard it is to write a full spec, compared to just writing code. There's a reason we use symbols with elaborate underlying meanings in mathematics: writing the equivalent in words is tedious.

Look at the progression of the level of *abstraction*: assembly -> C -> C++ -> Python -> specs-for-LLMs. No, that doesn't work. There's a step missing there. It's something between a high-level language like Python, which is still precise and deterministic, and a spec, which is a combination of words with various meanings, cultural biases, and ambiguities.

Spec driven dev is an interim phase between pre AI languages, and genAI ability to generate code in said languages.

In 5 years there will be a new language that will be the real bridge and spec driven dev will be seen as a nice but rather feeble attempt at taking advantage of this new reality.

Alex's avatar

Reviewing specs doesn't work because the spec is already longer than the Python diff.

Dusan Omercevic's avatar

We're developing such a language. Check https://plainlang.org/

Erick G. Hagstrom's avatar

I take it that you don't like Gherkin (commonly used in BDD). What do you see as its shortcomings?

James Irwin's avatar

Agreed. "The agent implements. The BDD framework verifies" is doing a lot of work here. We can't have the spec be separate from the program, because we can't trust the LLM to interpret "Then they receive an email with a reset link" correctly and write `sendEmailSpy.calledWith(...)`.

We should help the human write an enforceable "spec", basically programming lite, and then the LLM implements within those constraints.
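A minimal sketch of what such an enforceable "spec" could look like: the human binds the natural-language Then-step to an exact assertion, so the agent's implementation either satisfies it or fails. All names here (`emailSpy`, the step function, the URL pattern) are illustrative, not a real BDD framework's API.

```javascript
// Record of every email the implementation "sends".
const sentEmails = [];

// Stand-in for the app's mailer, spied on by the test harness.
const emailSpy = {
  send(to, body) { sentEmails.push({ to, body }); },
};

// The human-authored, enforceable Then-step: it encodes exactly what
// "they receive an email with a reset link" means, so the LLM has no
// room to reinterpret it.
function thenUserReceivesResetLink(address) {
  const match = sentEmails.find(
    (e) => e.to === address && /https:\/\/\S+\/reset\?token=\S+/.test(e.body)
  );
  if (!match) throw new Error(`no reset email with a link sent to ${address}`);
  return match;
}

// Whatever the agent implements must drive the spied mailer correctly.
emailSpy.send("user@example.com", "Reset here: https://app.example.com/reset?token=abc123");
thenUserReceivesResetLink("user@example.com");
```

The point is that the verification lives with the human, as code, while the implementation space stays open to the agent.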

8Lee's avatar
Mar 2 (edited)

It’s funny to see how this dynamic plays out, over time:

1. Agents almost never get it right.

2. Agents sometimes get it right.

3. Agents often get it right.

4. Agents get it right most of the time.

5. Agents get it right well enough to teach other agents.

… which actually feels like how you onboard and upskill a new engineer.

Funny how that works.

Latent.Space's avatar

To me it's not so much funny as "agents are becoming more and more able to do human tasks," which is pretty cool in the grand scheme of things, and pretty scary that we got here in ~3-5 years.

Patrick Senti's avatar

Except that junior engineers actually learn and become senior engineers who use their experience to take responsibility for larger systems and lead other engineers. LLMs and AI agents can't, and they won't.

8Lee's avatar

Wait, but why couldn't they? Onboarding into a new system or infrastructure takes a few requests through your favorite CLI tool and voila, I've been educated and informed. Is there a reason an agent couldn't take "responsibility" at some point for a few (many?) things like this?

Kyle Smith's avatar

Generative AI does not 'learn' (unless you're training the models, perhaps), at best you encode 'learnings' into the context buffer and hope you get the right outcomes.

8Lee's avatar

I don't plan on hope. You can actually program this stuff to get deterministic outcomes. https://blog.yen.chat/p/skills

Kyle Smith's avatar

You have an interesting definition of "deterministic" there.

paradigm's avatar

Do you really want A.I. to take responsibility? That's where we begin to give them true power. Is that what we want?

8Lee's avatar

Tell me you've never seriously used AI tools for SWE without telling me that you've never used them…

The knowledge worker's avatar

I am not reviewing code anymore. But I am user testing my software 10x more.

Peter's avatar

Agreed! This is the direction I took for rearranging around AI code too.

QA experience is about to become the bolded line on a lot of SW dev resumes.

Laurie @ Role Call's avatar

This is the shift! So many engineers (#NotAllEngineers) used to spend all their time looking at code and none of their time looking at the actual product users are using. When Claude is doing the coding for you, it leaves a lot more time for dogfooding.

Ankit Jain's avatar

It's been 3 weeks since this article.. thanks everyone who engaged in the conversation. Attempting to address some of the comments broadly: https://www.aviator.co/blog/code-review-dead/

Volodymyr Stelmakh's avatar

Thanks for putting this together. That's something I've been thinking about for a while now. Code reviews have become a bottleneck, and we have to change the mindset to fix it. It won't be easy, but it has to be done to move forward.

Immanuel Giulea's avatar

Excellent that you quoted from StrongDM!

More awareness is needed for "Code must not be written by humans / Code must not be reviewed by humans".

richardstevenhack's avatar

"Human-written code died in 2025. Code reviews will die in 2026."

What won't go away: humans maintaining bad AI code. Cognitive debt.

LLMs generate security-vulnerable code thirty percent of the time. That may improve if models are specifically trained on proper DevSecOps, but it won't go away.

Bottom line: LLMs are not deterministic. They are not AGI. They are trained on bad human coding practices, which is endemic in the software industry, so they produce bad code.

Ankit Jain's avatar

If you look around, a lot of code written by humans also has vulnerabilities and tech debt. Even if models don't get significantly better, some of the quality problems can be solved by building better guardrails using both predictive and deterministic rules.

richardstevenhack's avatar

Code written by humans with vulnerabilities is precisely my point. That's where AI learned to do the same. Cognitive debt is the new term and it's worse than tech debt.

I doubt the quality problems can be "solved". Improved, no doubt. Not "solved" - until we get much better AI than LLMs.

As for guardrails, they don't exist, inherently. See here:

If You Bought Anthropic's Make Believe, It's Time To Grow Up

https://disesdi.substack.com/p/if-you-bought-anthropics-make-believe

Ankit Jain's avatar

Do I believe that AI can ever write code without bugs? No.

Do I believe that AI can ever write code better than most humans? Soon!

Guardrails will follow similar patterns. So if millions of lives depend on it, you cannot rely on automated systems.

Suzanne Margolis's avatar

Maybe this is because humans are not building deterministic systems? Guardrails are part of the deterministic system. Guardrails for AI mean both non-negotiable constraints and guardrails that allow decision-making within boundaries. It also means understanding what causes drift, and how to mitigate it.

Martin H Berwanger's avatar

Interesting post, but I'd push back on the core premise.

We can agree that coding costs have dropped. But that's an argument for investing in better human-review tooling, not for removing humans from the loop.

Spec-driven development sounds clean until you're actually in the problem space. Your password reset example has dozens of unstated assumptions baked in: token TTL, session invalidation, MFA behavior, rate limiting, and what happens to suspended accounts. The iteration is the discovery. Removing that loop means misaligned outcomes, customer impact, and costs that don't show up until it's too late to fix them cheaply.

And the reviewer's job isn't catching bugs in a diff. It's pressure-testing the solution against intent. Does this reset flow need to be auditable for compliance reasons? What are the failure modes? Does this actually solve what the spec writer meant? Those aren't mechanical questions. An adversarial agent probes what it was told to look for. A human asks what wasn't thought of yet.

On permission scoping as architecture, the footguns aren't as clearly labeled as "touches auth logic." Blast radius isn't defined by file boundaries. A utility function can be part of the call path of a payment flow. File-level scoping tells you what the agent touched, not what it broke.

The volume problem is real, but it isn't infinite. There's a natural ceiling on how much meaningful change a system can absorb at once. Tiered review lanes, informed by tooling that identifies low-impact changes, can handle throughput without removing human judgment from the changes that actually matter.

The future of software engineering is judgment.

Jeff Hiner's avatar

> The agent can’t negotiate with a failing test.

It can, and it absolutely does. I can't count the number of times Claude has said "oh that was failing before, it has nothing to do with my changes" or "I'll just comment out this broken test".

If you are watching like a hawk you MIGHT catch it when the agent is making changes. If you're not, you have a chance to catch it in PR. If you're solely relying on agents to review your code, you won't catch it at all.

Patrick Senti's avatar

The solution to "too many PRs and they are too big to review" is not to get humans out of the loop. At least not if the PRs are generated by unreliable AI that cannot be verified otherwise.

If we want to use AI to generate code that can be verified, we have to use a more formal approach to specification so that the generated code can be verified with accuracy to match expectations.

That is we need better specification languages from which executable code can be generated - aka programming languages. If we have that, perhaps AI code generation will get to a level of reliability that is acceptable.

I would argue that once we get there, however, we don't really need AI anymore. We can quite simply translate the specification into executable code - aka using compilers.

Arnav's avatar

Major question here that you didn't address: AI agents can't take accountability. If OpenAI or Anthropic took accountability for the code their AI breaks, then we could start talking about exponentials. If I'm the one taking accountability for anything the AI breaks, there is no way I'm going to blindly let it generate slop. I'm not stupid enough to hurt my own career.

Ankit Jain's avatar

Fair point, but can be viewed from a different reference point.

If humans make mistakes, we (1) teach them how to write better code and also (2) build feedback loops as guardrails that can catch errors. If we want accountability so that we can fire the human the moment they make a mistake, then you would not have a team for very long.

Sonal Goyal's avatar

This argument is taking us in the direction of: do we need developers at all? If all it needs is the spec, what is the use of developers? Coding guidelines and invariants can be universal. Product managers can write the spec, thinking of the functionality they need.

"You never have to read the implementation unless something fails." That's why you have to read it before it fails. :D

Jeff Huckaby's avatar

With the multi-agent comparison approach, I would be worried about context drift feeding the agents. If these are different LLMs, or use different context inputs, are you getting the best result, or simply the best result with the context provided? You select the best output, but how does this compare to, say, iteratively improving the inputs to a single LLM through multiple runs?

Jimmy Pang's avatar

so TL;DR: this would be a massive ultra fast test in production

NΞRD₿ΞN's avatar

I think you’re raising a valid concern, but I’m not sure the trade-off is quite “specs vs code.”

Historically, every step up the abstraction ladder didn’t eliminate the lower layer, instead it changed who has to reason about it. Assembly didn’t disappear when C arrived; it just became a compiler problem. Python didn’t remove C; it moved the complexity into interpreters and runtimes.

Spec-driven workflows with agents feel quite similar. The spec isn't trying to replace a programming language in the same sense that Python replaced C. It's closer to defining intent, constraints, and verification criteria, so that machines can safely explore the implementation space.

You’re right that natural language specs are messy and ambiguous. But the interesting thing happening now is that the “spec” is increasingly structured: constraints, acceptance tests, invariants, contracts, guardrails, etc. In practice it starts looking less like prose and more like a layered artifact:

(1) intent (human readable)

(2) constraints and invariants

(3) deterministic verification (tests, contracts, linters)

At that point the spec becomes less about describing the algorithm and more about describing the boundaries of acceptable solutions.
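The layered artifact described above can be sketched as a structure where only the first layer is prose and the rest is machine-checkable. This is an illustrative sketch, not any particular tool's format; the spec, the invariant names, and the execution-log shape are all hypothetical.

```javascript
// A "layered spec" as a first-class artifact: human-readable intent,
// machine-checkable invariants, and a deterministic verifier. The
// algorithm itself is left entirely to the agent.
const passwordResetSpec = {
  intent: "A user who forgets their password can request a reset link by email.",
  invariants: [
    { name: "token is single-use", check: (log) => log.tokenUses <= 1 },
    { name: "token expires within 15 min", check: (log) => log.ttlSeconds <= 900 },
    { name: "old sessions invalidated", check: (log) => log.activeOldSessions === 0 },
  ],
};

// Deterministic verification: run the agent's implementation, capture an
// execution log, and report every violated invariant by name. No prose
// interpretation is involved at this layer.
function verify(spec, executionLog) {
  return spec.invariants
    .filter((inv) => !inv.check(executionLog))
    .map((inv) => inv.name);
}

// An execution log a candidate implementation might produce.
const log = { tokenUses: 1, ttlSeconds: 600, activeOldSessions: 0 };
const violations = verify(passwordResetSpec, log); // empty array: all invariants hold
```

The spec describes the boundaries of acceptable solutions, exactly as above: any implementation whose log passes `verify` is inside the boundary.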

I actually agree with your broader prediction though: we’re probably missing a layer in the stack. Something more formal than natural language specs but more declarative than traditional programming languages. Almost a kind of “intent language” that sits between product requirements and implementation.

Spec-driven development might just be the transitional phase where we’re discovering what that layer needs to look like - which is also why tools like Specularis are exploring how to operationalize specs as first-class artifacts rather than just documents: https://specularis.org