4 Comments
Pawel Jozefiak

The physics framing for taste is the part that stuck with me. I keep seeing builders treat model selection as a benchmark question when in practice it is closer to a feel question, which is hard to defend in a postmortem but usually right. Lupsasca's point about iteration speed mattering more than raw correctness early on lines up with what I see when I switch from Opus to a cheaper model for prototyping. The cheap model is wrong more often but the loop is fast enough that I catch it.

Alec Pritzos

The 30-minute paper reproduction is the unlock and the warning at the same time. The jagged-frontier story is real but actually jagged: people writing email get a modest gain while a Breakthrough Prize physicist gets a step-change one. Standard benchmark suites cannot capture this kind of long-tail capability without overstating it for the median user and understating it for the research user.

The CryptoJitt Brief

The Lupsasca episode is the cleanest case study yet of why theoretical physics is the early winner of agentic research — verification is symbolic, so the prover-verifier gap is tiny. FrontierMath and Putnam have been the fastest-improving GPT-5.x benchmarks for exactly this reason. The corollary nobody is pricing yet: the next leg of capex makes sense for domains where the loss function is checkable in closed form. Most of biology and ML interp don't qualify. Quantum gravity and pure math do.

Ex-Consultant in Tech

The real scarce skill may become less “can I personally grind through every derivation?” and more “do I have enough taste to ask the right question, enough domain depth to recognize a nontrivial move, and enough paranoia to know when the model is producing beautiful nonsense?”