50 Comments
Oliver Schoenborn's avatar

I don't agree. I think people who believe in spec driven development are naive about how hard it is to write a full spec, compared to just writing code. There's a reason we use symbols with elaborate underlying meanings in mathematics: writing the equivalent in words is tedious.

Look at the progression of the level of *abstraction*: assembly -> C -> C++ -> Python -> specs-for-LLMs. No, that doesn't work. There's a step missing there. It's something between a high-level language like Python, which is still precise and deterministic, and a spec, which is a combination of words with various meanings, cultural biases, and ambiguities.

Spec driven dev is an interim phase between pre AI languages, and genAI ability to generate code in said languages.

In 5 years there will be a new language that will be the real bridge and spec driven dev will be seen as a nice but rather feeble attempt at taking advantage of this new reality.

Alex's avatar

Reviewing specs doesn't work because the spec is already longer than the Python diff.

Dusan Omercevic's avatar

We're developing such a language. Check https://plainlang.org/

Erick G. Hagstrom's avatar

I take it that you don't like Gherkin (commonly used in BDD). What do you see as its shortcomings?

James Irwin's avatar

Agreed. "The agent implements. The BDD framework verifies" is doing a lot of work here. We can't have the spec be separate from the program, because we can't trust the LLM to interpret "Then they receive an email with a reset link" correctly and write `sendEmailSpy.calledWith(...)`.

We should help the human write an enforceable "spec", basically programming lite, and then the LLM implements within those constraints.
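A minimal sketch of what such an enforceable "spec" could look like: the human binds the natural-language Then-step to an exact assertion, so the agent's implementation either satisfies it or fails. All names here (`emailSpy`, the step function, the URL pattern) are illustrative, not a real BDD framework's API.

```javascript
// Record of every email the implementation "sends".
const sentEmails = [];

// Stand-in for the app's mailer, spied on by the test harness.
const emailSpy = {
  send(to, body) { sentEmails.push({ to, body }); },
};

// The human-authored, enforceable Then-step: it encodes exactly what
// "they receive an email with a reset link" means, so the LLM has no
// room to reinterpret it.
function thenUserReceivesResetLink(address) {
  const match = sentEmails.find(
    (e) => e.to === address && /https:\/\/\S+\/reset\?token=\S+/.test(e.body)
  );
  if (!match) throw new Error(`no reset email with a link sent to ${address}`);
  return match;
}

// Whatever the agent implements must drive the spied mailer correctly.
emailSpy.send("user@example.com", "Reset here: https://app.example.com/reset?token=abc123");
thenUserReceivesResetLink("user@example.com");
```

The point is that the verification lives with the human, as code, while the implementation space stays open to the agent.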

8Lee's avatar
Mar 2 (edited)

It’s funny to see how this dynamic plays out, over time:

1. Agents almost never get it right.

2. Agents sometimes get it right.

3. Agents often get it right.

4. Agents get it right most of the time.

5. Agents get it right well enough to teach other agents.

… which actually feels like how you onboard and upskill a new engineer.

Funny how that works.

Latent.Space's avatar

To me it's not so much funny as "agents are becoming more and more able to do human tasks," which is pretty cool in the grand scheme of things, and pretty scary that we got here in ~3-5 years.

Patrick Senti's avatar

Except that junior engineers actually learn and become senior engineers who use their experience to take responsibility for larger systems and lead other engineers. LLMs and AI agents can't, and they won't.

8Lee's avatar

Wait, but why couldn't they? Onboarding into a new system or infrastructure takes a few requests through your favorite CLI tool and voila, I've been educated and informed. Is there a reason an agent couldn't take "responsibility" at some point for a few (many?) things like this?

Kyle Smith's avatar

Generative AI does not 'learn' (unless you're training the models, perhaps), at best you encode 'learnings' into the context buffer and hope you get the right outcomes.

8Lee's avatar

I don't plan on hope. You can actually program this stuff to get deterministic outcomes. https://blog.yen.chat/p/skills

Kyle Smith's avatar

You have an interesting definition of "deterministic" there.

paradigm's avatar

Do you really want A.I. to take responsibility? That's where we begin to give them true power. Is that what we want?

8Lee's avatar

Tell me you've never seriously used AI tools for SWE without telling me that you've never used them…

The knowledge worker's avatar

I am not reviewing code anymore. But I am user testing my software 10x more.

Peter's avatar

Agreed! This is the direction I took for rearranging around AI code too.

QA experience is about to become the bolded line on a lot of SW dev resumes.

Laurie @ Role Call's avatar

This is the shift! So many engineers (#NotAllEngineers) used to spend all their time looking at code and none of their time looking at the actual product users are using. When Claude is doing the coding for you, it leaves a lot more time for dogfooding.

Ankit Jain's avatar

It's been 3 weeks since this article.. thanks everyone who engaged in the conversation. Attempting to address some of the comments broadly: https://www.aviator.co/blog/code-review-dead/

Volodymyr Stelmakh's avatar

Thanks for putting this together. That's something I've been thinking about for a while now. Code reviews have become a bottleneck, and we have to change the mindset to fix it. It won't be easy, but it has to be done to move forward.

Immanuel Giulea's avatar

Excellent that you quoted from StrongDM!

More awareness is needed for "Code must not be written by humans / Code must not be reviewed by humans".

richardstevenhack's avatar

"Human-written code died in 2025. Code reviews will die in 2026."

What won't go away: humans maintaining bad AI code. Cognitive debt.

LLMs generate security-vulnerable code thirty percent of the time. That may improve if models are specifically trained on proper DevSecOps, but it won't go away.

Bottom line: LLMs are not deterministic. They are not AGI. They are trained on bad human coding practices, which is endemic in the software industry, so they produce bad code.

Ankit Jain's avatar

If you look around, a lot of code written by humans also has vulnerabilities and tech debt. Even if models don't get significantly better, some of the quality problems can be solved by building better guardrails using both predictive and deterministic rules.

richardstevenhack's avatar

Code written by humans with vulnerabilities is precisely my point. That's where AI learned to do the same. Cognitive debt is the new term and it's worse than tech debt.

I doubt the quality problems can be "solved". Improved, no doubt. Not "solved" - until we get much better AI than LLMs.

As for guardrails, they don't exist, inherently. See here:

If You Bought Anthropic's Make Believe, It's Time To Grow Up

https://disesdi.substack.com/p/if-you-bought-anthropics-make-believe

Ankit Jain's avatar

Do I believe that AI can ever write code without bugs? No.

Do I believe that AI can ever write code better than most humans? Soon!

Guardrails will follow similar patterns. So if millions of lives depend on it, you cannot rely on automated systems.

Suzanne Margolis's avatar

Maybe this is because humans are not building deterministic systems? Guardrails are part of the deterministic system. Guardrails for AI mean both non-negotiable constraints and guardrails that allow decision-making within boundaries. It also means understanding what causes drift, and how to mitigate it.

Martin H Berwanger's avatar

Interesting post, but I'd push back on the core premise.

We can agree that coding costs have dropped. But that's an argument for investing in better human-review tooling, not for removing humans from the loop.

Spec-driven development sounds clean until you're actually in the problem space. Your password reset example has dozens of unstated assumptions baked in: token TTL, session invalidation, MFA behavior, rate limiting, and what happens to suspended accounts. The iteration is the discovery. Removing that loop means misaligned outcomes, customer impact, and costs that don't show up until it's too late to fix them cheaply.

And the reviewer's job isn't catching bugs in a diff. It's pressure-testing the solution against intent. Does this reset flow need to be auditable for compliance reasons? What are the failure modes? Does this actually solve what the spec writer meant? Those aren't mechanical questions. An adversarial agent probes what it was told to look for. A human asks what wasn't thought of yet.

On permission scoping as architecture, the footguns aren't as clearly labeled as "touches auth logic." Blast radius isn't defined by file boundaries. A utility function can be part of the call path of a payment flow. File-level scoping tells you what the agent touched, not what it broke.

The volume problem is real, but it isn't infinite. There's a natural ceiling on how much meaningful change a system can absorb at once. Tiered review lanes, informed by tooling that identifies low-impact changes, can handle throughput without removing human judgment from the changes that actually matter.

The future of software engineering is judgment.

Jeff Hiner's avatar

> The agent can’t negotiate with a failing test.

It can, and it absolutely does. I can't count the number of times Claude has said "oh that was failing before, it has nothing to do with my changes" or "I'll just comment out this broken test".

If you are watching like a hawk you MIGHT catch it when the agent is making changes. If you're not, you have a chance to catch it in PR. If you're solely relying on agents to review your code, you won't catch it at all.

Patrick Senti's avatar

The solution to "too many PRs and they are too big to review" is not to get humans out of the loop. At least not if the PRs are generated by unreliable AI that cannot be verified otherwise.

If we want to use AI to generate code that can be verified, we have to use a more formal approach to specification so that the generated code can be verified with accuracy to match expectations.

That is we need better specification languages from which executable code can be generated - aka programming languages. If we have that, perhaps AI code generation will get to a level of reliability that is acceptable.

I would argue that once we get there, however, we don't really need AI anymore. We can quite simply translate the specification into executable code - aka using compilers.

Arnav's avatar

Major question here that you didn't address: AI agents can't take accountability. If OpenAI or Anthropic took accountability for the code their AI breaks, then we could start talking about exponentials. If I'm the one taking accountability for anything the AI breaks, there is no way I'm going to blindly let it generate slop. I'm not stupid enough to hurt my own career.

Ankit Jain's avatar

Fair point, but can be viewed from a different reference point.

If humans make mistakes, we (1) teach them how to write better code and also (2) build feedback loops as guardrails that can catch errors. If we want accountability so that we can fire the human the moment they make a mistake, then you would not have a team for very long.

Sonal Goyal's avatar

This argument is taking us in the direction of: do we need developers at all? If all it needs is the spec, what is the use of developers? Coding guidelines and invariants can be universal. Product managers can write the spec, thinking of the functionality they need.

"You never have to read the implementation unless something fails." That's why you have to read it before it fails. :D

Jeff Huckaby's avatar

With the multi-agent comparison approach, I would be worried about context drift feeding the agents. If these are different LLMs, or use different context inputs, are you getting the best result, or simply the best result with the context provided? You select the best output, but how does this compare to, say, iteratively improving the inputs to a single LLM through multiple runs?

Jimmy Pang's avatar

so TL;DR: this would be a massive ultra fast test in production

NΞRD₿ΞN's avatar

I think you’re raising a valid concern, but I’m not sure the trade-off is quite “specs vs code.”

Historically, every step up the abstraction ladder didn’t eliminate the lower layer, instead it changed who has to reason about it. Assembly didn’t disappear when C arrived; it just became a compiler problem. Python didn’t remove C; it moved the complexity into interpreters and runtimes.

Spec-driven workflows with agents feel quite similar. The spec isn't trying to replace a programming language in the same sense that Python replaced C. It's closer to defining intent, constraints, and verification criteria, so that machines can safely explore the implementation space.

You’re right that natural language specs are messy and ambiguous. But the interesting thing happening now is that the “spec” is increasingly structured: constraints, acceptance tests, invariants, contracts, guardrails, etc. In practice it starts looking less like prose and more like a layered artifact:

(1) intent (human readable)

(2) constraints and invariants

(3) deterministic verification (tests, contracts, linters)

At that point the spec becomes less about describing the algorithm and more about describing the boundaries of acceptable solutions.
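The layered artifact described above can be sketched as a structure where only the first layer is prose and the rest is machine-checkable. This is an illustrative sketch, not any particular tool's format; the spec, the invariant names, and the execution-log shape are all hypothetical.

```javascript
// A "layered spec" as a first-class artifact: human-readable intent,
// machine-checkable invariants, and a deterministic verifier. The
// algorithm itself is left entirely to the agent.
const passwordResetSpec = {
  intent: "A user who forgets their password can request a reset link by email.",
  invariants: [
    { name: "token is single-use", check: (log) => log.tokenUses <= 1 },
    { name: "token expires within 15 min", check: (log) => log.ttlSeconds <= 900 },
    { name: "old sessions invalidated", check: (log) => log.activeOldSessions === 0 },
  ],
};

// Deterministic verification: run the agent's implementation, capture an
// execution log, and report every violated invariant by name. No prose
// interpretation is involved at this layer.
function verify(spec, executionLog) {
  return spec.invariants
    .filter((inv) => !inv.check(executionLog))
    .map((inv) => inv.name);
}

// An execution log a candidate implementation might produce.
const log = { tokenUses: 1, ttlSeconds: 600, activeOldSessions: 0 };
const violations = verify(passwordResetSpec, log); // empty array: all invariants hold
```

The spec describes the boundaries of acceptable solutions, exactly as above: any implementation whose log passes `verify` is inside the boundary.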

I actually agree with your broader prediction though: we’re probably missing a layer in the stack. Something more formal than natural language specs but more declarative than traditional programming languages. Almost a kind of “intent language” that sits between product requirements and implementation.

Spec-driven development might just be the transitional phase where we’re discovering what that layer needs to look like - which is also why tools like Specularis are exploring how to operationalize specs as first-class artifacts rather than just documents: https://specularis.org