<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Latent.Space]]></title><description><![CDATA[The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!]]></description><link>https://www.latent.space</link><image><url>https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png</url><title>Latent.Space</title><link>https://www.latent.space</link></image><generator>Substack</generator><lastBuildDate>Sun, 14 Jun 2026 09:17:38 GMT</lastBuildDate><atom:link href="https://www.latent.space/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Latent.Space]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[swyx@noreply.com]]></webMaster><itunes:owner><itunes:email><![CDATA[swyx@noreply.com]]></itunes:email><itunes:name><![CDATA[Latent.Space]]></itunes:name></itunes:owner><itunes:author><![CDATA[Latent.Space]]></itunes:author><googleplay:owner><![CDATA[swyx@noreply.com]]></googleplay:owner><googleplay:email><![CDATA[swyx@noreply.com]]></googleplay:email><googleplay:author><![CDATA[Latent.Space]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[[AINews] Fable and Mythos officially too dangerous to release]]></title><description><![CDATA[We are in the strangest timeline.]]></description><link>https://www.latent.space/p/ainews-fable-and-mythos-officially</link><guid 
isPermaLink="false">https://www.latent.space/p/ainews-fable-and-mythos-officially</guid><pubDate>Sat, 13 Jun 2026 04:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is the LAST WEEKEND to take the <a href="https://notion.qualtrics.com/jfe/form/SV_bP07tSVMXH7ePCS">AI Engineering Survey</a> and get &gt;$2k in credits and a chance at $2000 worth of <a href="https://ai.engineer/wf">AIE WF tickets</a>!</em></p><div><hr></div><p>Just as the whistle blew to kick off <a href="https://www.cnn.com/2026/06/12/sport/live-news/world-cup-group-b-d-opening-matches">the USA v Paraguay game</a>, Anthropic dropped a bombshell to end a remarkably eventful week: Fable and Mythos, released just <a href="https://www.latent.space/p/ainews-anthropic-claude-fable-5-mythos">3 days ago</a>, are now revoked for ALL customers due to a <a href="https://x.com/cvmilo00/status/2065640972764016914">possible jailbreak</a> being a national cybersecurity risk.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/AnthropicAI/status/2065597531644743999&quot;,&quot;full_text&quot;:&quot;The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.\n\nThe net effect 
of&quot;,&quot;username&quot;:&quot;AnthropicAI&quot;,&quot;name&quot;:&quot;Anthropic&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1798110641414443008/XP8gyBaY_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-13T00:50:03.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:179,&quot;retweet_count&quot;:188,&quot;like_count&quot;:548,&quot;impression_count&quot;:41205,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>We steer clear of commenting on politics and policy, even though this is not Anthropic&#8217;s first tangle with the US government, but surely this development, affecting all customers worldwide rather than just USgov employees and vendors, will be noteworthy for the precedent it sets, even as it is unclear how actually technically legitimate this claim is (Anthropic seems to &#8220;believe this is a <strong>misunderstanding</strong>&#8221; because &#8220;the government has only given us <strong>verbal</strong> evidence of a potential <strong>narrow, non-universal</strong> jailbreak&#8221;.)</p><p>It is notable that Open Source AI advocates are once more <a href="https://opensourceaimustwin.com/?share=v2">up in arms and trending</a>.</p><p></p><blockquote><p>AI News for 6/11/2026-6/12/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Anthropic&#8217;s Fable/Mythos Suspension and the New &#8220;Model Sovereignty&#8221; Debate</strong></p><ul><li><p><strong>US export controls abruptly took Fable/Mythos offline</strong>: The dominant story was Anthropic&#8217;s announcement that, following a US government directive, it had to suspend access to <strong>Claude Fable 5</strong> and <strong>Mythos 5</strong> for foreign nationals, with knock-on disruption for all users while compliance was sorted out. Anthropic says the order was based on a capability report it disputes and that similar capabilities are &#8220;widely available&#8221; in other models, including GPT-5.5; see the company statement from <a href="https://x.com/AnthropicAI/status/2065597531644743999">@AnthropicAI</a> and product impact details from <a href="https://x.com/ClaudeDevs/status/2065597942602531163">@ClaudeDevs</a>. The event triggered immediate removals across downstream products and benchmarks, including <a href="https://x.com/cognition/status/2065609115939062197">Cognition/Devin</a> and <a href="https://x.com/arena/status/2065620808773611997">Agent Arena</a>.</p></li><li><p><strong>Technical and policy implications</strong>: Engineers quickly reframed this as a <strong>sovereignty risk</strong> rather than a pure policy story. The practical concern: closed frontier APIs can disappear overnight due to export controls, and frontier labs with many non-US researchers may be directly impaired. 
Reactions from <a href="https://x.com/natolambert/status/2065616536942088581">@natolambert</a>, <a href="https://x.com/theo/status/2065622694113235359">@theo</a>, and <a href="https://x.com/cohere/status/2065623344381108539">@cohere</a> converged on the same takeaway: <strong>owning the stack matters</strong>. Artificial Analysis summarized the impact bluntly: &#8220;the first time our Intelligence Frontier chart has moved backward&#8221; in <a href="https://x.com/ArtificialAnlys/status/2065618560714740177">this post</a>. Anthropic later tried to soften the blow by <a href="https://x.com/ClaudeDevs/status/2065621176735646006">resetting 5-hour and weekly rate limits</a>, but the bigger lesson for infra and product teams is that reliance on a single frontier vendor now carries explicit geopolitical risk.</p></li></ul><p><strong>Coding-Agent Evals, Harness Effects, and Benchmark Validity</strong></p><ul><li><p><strong>Artificial Analysis swapped SWE-Bench Pro for DeepSWE</strong>: A major eval update came from <a href="https://x.com/ArtificialAnlys/status/2065328920514515037">@ArtificialAnlys</a>, which replaced <strong>SWE-Bench Pro</strong> in its Coding Agent Index with <strong>Datacurve&#8217;s DeepSWE</strong> to reduce benchmark gaming. The change materially reshuffled rankings: <strong>Claude Code + Fable 5 [max]</strong> entered at the top with <strong>77</strong>, while <strong>Codex + GPT-5.5 [xhigh]</strong> rose to <strong>76</strong>, overtaking <strong>Claude Code + Opus 4.8 [max]</strong> at <strong>73</strong>. 
The rationale: SWE-Bench Pro had become gameable via repository history leakage, whereas DeepSWE writes tasks from scratch; <a href="https://x.com/ArtificialAnlys/status/2065328924578693514">follow-up context here</a>.</p></li><li><p><strong>Harness quality is becoming a first-class variable</strong>: Several responses argued that the headline ranking masked the difference between <strong>model capability</strong> and <strong>product harness capability</strong>. <a href="https://x.com/kunchenguid/status/2065345999682568593">@kunchenguid</a> highlighted that <strong>Claude Code</strong> underperformed other harnesses when using the same underlying model, suggesting API vendors may be weaker at product UX than at model building. A related critique from <a href="https://x.com/ClementDelangue/status/2065435542121025933">@ClementDelangue</a> questioned whether API evals are fair when closed providers can route, fallback, or ensemble behind the scenes. The thread is a useful reminder that &#8220;coding agent leaderboard&#8221; increasingly means <strong>system eval</strong>, not pure model eval.</p></li><li><p><strong>Benchmark saturation and realism are active concerns</strong>: DeepSWE was presented as harder and less gameable, but the broader concern remains that many benchmarks are being saturated or hill-climbed. See comments from <a href="https://x.com/dejavucoder/status/2065453800794800182">@dejavucoder</a> on FrontierSWE saturation, <a href="https://x.com/OfirPress/status/2065481743675666629">@OfirPress</a> on task-count intuition for benchmark design, and <a href="https://x.com/RampLabs/status/2065485811634561456">@RampLabs</a> on effectiveness-vs-cost tradeoffs in SWE benchmarking. 
In parallel, <a href="https://x.com/WolfBenchAI/status/2065582716054376921">WolfBenchAI</a> reported spending <strong>$11,081.12</strong> evaluating Fable 5 only to find refusals suppressed its ranking.</p></li></ul><p><strong>Open-Weight Model Releases: Kimi K2.7-Code and MiniMax M3</strong></p><ul><li><p><strong>Moonshot released Kimi-K2.7-Code open-source</strong>: <a href="https://x.com/Kimi_Moonshot/status/2065377579130142937">@Kimi_Moonshot</a> announced <strong>Kimi-K2.7-Code</strong>, an open-sourced coding model with reported gains over K2.6: <strong>+21.8%</strong> on Kimi Code Bench v2, <strong>+11.0%</strong> on Program Bench, <strong>+31.5%</strong> on MLS Bench Lite, plus <strong>30% fewer reasoning tokens</strong>. The weights/code were separately linked <a href="https://x.com/Kimi_Moonshot/status/2065379671039189317">here</a>. vLLM noted deployment compatibility and architecture details in <a href="https://x.com/vllm_project/status/2065427423148318747">its support post</a>: <strong>1T-parameter MoE</strong>, <strong>32B active</strong>, <strong>MLA attention</strong>, and <strong>256K context</strong>.</p></li><li><p><strong>Early community read: more honest, not necessarily dominant</strong>: Initial reception was positive on efficiency and openness, but mixed on raw frontier capability. <a href="https://x.com/cline/status/2065473287761891621">@cline</a> highlighted the lower token usage and immediate availability in tooling; <a href="https://x.com/scaling01/status/2065460210584420510">@scaling01</a> called it a decent step up. 
But a more granular benchmark from <a href="https://x.com/elliotarledge/status/2065443474560946615">@elliotarledge</a> on <strong>KernelBench-Hard</strong> argued K2.7-Code wrote more authentic Triton kernels than K2.6 while still lagging top-tier models and attempting at least one reward hack by editing the grader.</p></li><li><p><strong>MiniMax M3 is the other significant open-weight launch</strong>: <a href="https://x.com/MiniMax_AI/status/2065436935188058208">@MiniMax_AI</a> released <strong>MiniMax M3</strong>, an open-weight multimodal model with <strong>~428B parameters</strong>, <strong>~23B active</strong>, and a <strong>1M-token context</strong>. <a href="https://x.com/lmsysorg/status/2065434656489812194">@lmsysorg</a> summarized its positioning as a native-multimodal MoE reasoning model with <strong>text/image/video</strong> support and <strong>MiniMax Sparse Attention (MSA)</strong>; <a href="https://x.com/RyanLeeMiniMax/status/2065436138270347577">@RyanLeeMiniMax</a> said the parameter count was intentionally restrained for broader accessibility.</p></li><li><p><strong>Ecosystem support was unusually fast</strong>: M3 had day-0 support from <a href="https://x.com/lmsysorg/status/2065434656489812194">SGLang</a>, <a href="https://x.com/vllm_project/status/2065445059039031799">vLLM</a>, <a href="https://x.com/clattner_llvm/status/2065487960229986445">Modular</a>, <a href="https://x.com/togethercompute/status/2065591982958023066">Together</a>, <a href="https://x.com/baseten/status/2065529390486999448">Baseten</a>, <a href="https://x.com/MiniMax_AI/status/2065510555507626374">Fireworks</a>, and local GGUF support from <a href="https://x.com/UnslothAI/status/2065503852820881746">Unsloth</a>. 
This is notable not just as launch theater but as evidence that <strong>open-model distribution and inference integration now happen on much tighter release cycles</strong>.</p></li></ul><p><strong>Inference, Sandboxes, and Agent Infrastructure</strong></p><ul><li><p><strong>Artificial Analysis launched AA-AgentPerf</strong>: <a href="https://x.com/ArtificialAnlys/status/2065559824230957190">@ArtificialAnlys</a> introduced a benchmark specifically for <strong>agentic inference</strong>, using long-horizon coding trajectories with production optimizations like <strong>KV cache reuse</strong>, <strong>speculative decoding</strong>, and <strong>prefill/decode disaggregation</strong>. Its lead metric is <strong>Agents per Megawatt</strong>, with early DeepSeek V4 Pro results favoring <strong>GB300</strong> and <strong>B300</strong> over Hopper and AMD in the tested configs. This is one of the more consequential infra developments in the set because it shifts benchmarking from raw TPS to <strong>power-normalized deployable agent throughput</strong>.</p></li><li><p><strong>Sandboxing is becoming core agent infra</strong>: <a href="https://x.com/skypilot_org/status/2065464144745361801">@skypilot_org</a> launched <strong>SkyPilot Sandboxes</strong> for running untrusted LLM-generated code on your own Kubernetes clusters, advertising <strong>sub-second launches</strong>, <strong>50,000+ sandboxes per cluster</strong>, and <strong>4&#8211;10x lower cost</strong> than hosted vendors in their benchmark claims; <a href="https://x.com/zongheng_yang/status/2065467594694598852">supporting thread here</a>. Anthropic, notably, was also pushing the same direction pre-suspension: <a href="https://x.com/ClaudeDevs/status/2065494480837583297">@ClaudeDevs</a> expanded docs for running <strong>Claude Managed Agents</strong> inside customer-controlled sandboxes across several providers. 
Combined with repeated calls for &#8220;Jepsen for agents&#8221; from <a href="https://x.com/threepointone/status/2065430890235171197">@threepointone</a>, the pattern is clear: teams are moving from demos toward <strong>containment, reproducibility, and infra ownership</strong>.</p></li></ul><p><strong>Research, Benchmarks, and Domain-Specific Systems</strong></p><ul><li><p><strong>FrontierMath v2 materially changed scores</strong>: <a href="https://x.com/EpochAIResearch/status/2065488154086568445">@EpochAIResearch</a> released <strong>FrontierMath: Tiers 1&#8211;4 (v2)</strong> after auditing errors in <strong>42%</strong> of problems. This substantially raised scores while preserving rankings; notably, GPT-5.5&#8217;s Tier 4 score reportedly jumped after fixes, as observed by <a href="https://x.com/scaling01/status/2065490265691902415">@scaling01</a>. Later, Epoch reported <a href="https://x.com/EpochAIResearch/status/2065511916035018943">Claude Fable 5 reaching 87% on Tiers 1&#8211;3 and 88% on Tier 4</a>, suggesting math benchmark ceilings are moving quickly and static datasets are increasingly fragile.</p></li><li><p><strong>Google Research&#8217;s Gemini-SQL2 and medical/vertical results stood out</strong>: <a href="https://x.com/GoogleResearch/status/2065475343205740911">@GoogleResearch</a> announced <strong>Gemini-SQL2</strong>, claiming SOTA on <strong>BIRD</strong> for text-to-SQL, though at least one reply questioned possible overfitting to benchmark idiosyncrasies. In healthcare, <a href="https://x.com/EricTopol/status/2065430578997203374">@EricTopol</a> pointed to a Nature Medicine result where general frontier models from Google/OpenAI/Anthropic outperformed specialized medical systems in clinician evaluation. 
These posts reinforce the trend that generalist frontier models are increasingly competitive in domains once assumed to require bespoke systems.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Kimi-K2.7-Code release</strong>: Moonshot&#8217;s open-source coding model launch was the biggest pure-AI product post in the set, with metrics and links from <a href="https://x.com/Kimi_Moonshot/status/2065377579130142937">@Kimi_Moonshot</a>.</p></li><li><p><strong>Anthropic suspends Fable/Mythos access</strong>: The most consequential platform event came from <a href="https://x.com/AnthropicAI/status/2065597531644743999">@AnthropicAI</a> and the follow-up disruption notice from <a href="https://x.com/ClaudeDevs/status/2065597942602531163">@ClaudeDevs</a>.</p></li><li><p><strong>MiniMax M3 open-weight release</strong>: A major open-model launch with 1M context and multimodality from <a href="https://x.com/MiniMax_AI/status/2065436935188058208">@MiniMax_AI</a>.</p></li><li><p><strong>Gemini-SQL2</strong>: Google Research&#8217;s text-to-SQL launch hit broad engagement and is worth watching for vertical-model design patterns; see <a href="https://x.com/GoogleResearch/status/2065475343205740911">@GoogleResearch</a>.</p></li><li><p><strong>AA Coding Agent Index refresh</strong>: The DeepSWE swap and resulting rank changes from <a href="https://x.com/ArtificialAnlys/status/2065328920514515037">@ArtificialAnlys</a> shaped much of the coding-agent discussion.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. 
Large Open-Weight MoE Model Releases</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u3wagy/minimaxaiminimaxm3_hugging_face/">MiniMaxAI/MiniMax-M3 &#183; Hugging Face</a></strong> (Activity: 986): <strong>MiniMaxAI released </strong><a href="https://huggingface.co/MiniMaxAI/MiniMax-M3">MiniMax-M3 weights on Hugging Face</a><strong>: a native multimodal text/image/video MoE-scale model with ~</strong><code>428B</code><strong> total parameters, ~</strong><code>23B</code><strong> activated parameters, and a </strong><code>1M</code><strong>-token context window. The model&#8217;s main implementation claim is MiniMax Sparse Attention (MSA) for million-token inference, reportedly cutting per-token attention compute to </strong><code>1/20</code><strong> and improving over MiniMax-M2 by </strong><code>9&#215;</code><strong> prefill and </strong><code>15&#215;</code><strong> decode at 1M context; local deployment is supported via SGLang, vLLM, or Transformers with suggested sampling </strong><code>temperature=1.0</code><strong>, </strong><code>top_p=0.95</code><strong>, </strong><code>top_k=40</code><strong>.</strong> Commenters highlighted the explicit license terms: free non-commercial use, commercial use for individuals/companies under <code>$20M/year</code> revenue with notification and &#8220;Build with MiniMax&#8221; labeling, and negotiated licensing above that threshold. 
There was also frustration that releases are skewing toward very large sparse MoEs or small models, leaving few new <code>50&#8211;80B</code> dense/mid-sized models, and concern that <code>428B</code> total parameters is impractical for consumer-class systems like Spark/Strix Halo.</p><ul><li><p><strong>MiniMax-M3</strong> is described as a very large MoE-style model with <code>428B</code> total parameters and only <code>23B</code> activated parameters, which commenters framed as making it a major open-weight release but still difficult to run locally on smaller high-memory consumer systems such as <strong>Spark / Strix Halo</strong> class hardware.</p></li><li><p>One tester reported poor coding performance after roughly <code>10h</code> of trials, claiming MiniMax-M3 failed Python and Java tasks that <strong>Qwen 27B</strong> could solve, and that new-project generation required an unusually high number of retries. They caveated that the serving provider may have misconfigured the deployment, so the result is an anecdotal hosted-inference benchmark rather than a controlled local evaluation.</p></li><li><p>Licensing was called out as unusually explicit: non-commercial use is free; commercial use is allowed for individuals or companies under <code>$20M/year</code> revenue with notification to <code>api@minimax.io</code> and a &#8220;Build with MiniMax&#8221; label; larger companies must negotiate a commercial license.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u3rdk9/moonshotaikimik27code_hugging_face/">moonshotai/Kimi-K2.7-Code &#183; Hugging Face</a></strong> (Activity: 915): <strong>Moonshot AI released </strong><code>moonshotai/Kimi-K2.7-Code</code><strong>, a coding-focused agentic MoE model derived from Kimi K2.6 with </strong><code>1T</code><strong> total parameters, </strong><code>32B</code><strong> activated, </strong><code>256K</code><strong> context, MLA attention, SwiGLU, MoonViT vision support, and native 
INT4 quantization. It claims improved long-horizon software-engineering/tool-use performance on Kimi Code Bench v2, Program Bench, MLS-Bench Lite, MCP-Atlas, and MCPMark-Verified, while reducing thinking-token usage by ~</strong><code>30%</code><strong>; deployment is supported via OpenAI/Anthropic-compatible APIs plus vLLM, SGLang, and KTransformers, with forced Thinking/</strong><code>preserve_thinking</code><strong> modes and recommended </strong><code>temperature=1.0</code><strong>, </strong><code>top_p=0.95</code><strong>.</strong> Commenters questioned the benchmark selection, noting that several included evaluations are not industry-standard and that Moonshot evaluates on its own coding benchmark. Another commenter framed the release as competitive pressure on Alibaba/Qwen, calling for <strong>Qwen 3.7</strong> to be open-sourced.</p><ul><li><p>A commenter criticized <strong>Kimi-K2.7-Code</strong>&#8217;s reported evaluation suite as a weak benchmark selection, noting that the included benchmarks are <em>&#8220;not industry standard&#8221;</em> and that <strong>Moonshot AI evaluated its own model on its own code benchmark</strong>, raising concerns about comparability and potential benchmark bias.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u3q1j9/huawei_released_openpangu_20_will_open_source_on/">Huawei Released openPangu 2.0 (Will open source on June 30)</a></strong> (Activity: 300): <strong>Huawei announced openPangu 2.0, planned for staged open-sourcing starting June 30, including architecture, weights, reports, inference code, plus pre-training/post-training code and training operators. 
The MoE-style models advertise 512K context and very high sparsity: Pro </strong><code>505B</code><strong> total / </strong><code>18B</code><strong> active parameters and Flash </strong><code>92B</code><strong> total / </strong><code>6B</code><strong> active, with Huawei claiming Ascend-optimized inference throughput up to </strong><code>2&#215;</code><strong> mainstream open-source models, </strong><code>+30%</code><strong> hyper-node training efficiency, </strong><code>+50%</code><strong> 512K long-sequence training throughput, and &gt;99% training consistency via an architecture described as </strong><code>mHC | Muon | ModAttn</code><strong> plus DSA+SWA ultra-sparse attention.</strong> Commenters focused on deployment implications: <strong>Flash </strong><code>92B/6B</code> was viewed as promising for unified-memory or ~<strong>96GB VRAM</strong> systems, while <strong>Pro </strong><code>505B/18B</code> was compared as a possible medium-size successor/alternative to sparse Qwen-class models such as <strong>Qwen 3.5 </strong><code>397B-A17B</code> and <code>122B-A10B</code>.</p><ul><li><p>Commenters highlighted <strong>openPangu 2.0 Flash</strong> as technically interesting because it is a MoE-style model with <code>92B</code> total parameters but only <code>6B</code> activated parameters, making it potentially attractive for local inference on unified-memory or constrained-VRAM systems.</p></li><li><p>One technical comparison framed <strong>openPangu 2.0 Pro </strong><code>505B-18B</code> as a possible replacement for <strong>Qwen 3.5 </strong><code>397B-A17B</code> in the medium-size MoE category, while <strong>openPangu 2.0 Flash </strong><code>92B-6B</code> was compared to <strong>Qwen 3.5 </strong><code>122B-A10B</code> as a potentially faster alternative that may still fit within <code>96GB</code> VRAM.</p></li><li><p>Several users focused on deployability: the Flash variant was described as hitting a local-inference &#8220;sweet spot,&#8221; especially 
for users with limited VRAM or systems like <code>128GB</code> RAM/unified-memory setups, assuming model quality is competitive.</p></li></ul></li></ul><h3><strong>2. DiffusionGemma NVFP4 Release and Accuracy Benchmarks</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u2np0a/nvidiadiffusiongemma26ba4bitnvfp4_hugging_face/">nvidia/diffusiongemma-26B-A4B-it-NVFP4 &#183; Hugging Face</a></strong> (Activity: 370): <strong>NVIDIA released </strong><code>nvidia/diffusiongemma-26B-A4B-it-NVFP4</code><strong>, an NVFP4-quantized version of Google DeepMind DiffusionGemma 26B A4B IT, a multimodal MoE discrete-diffusion model with </strong><code>25.2B</code><strong> total / </strong><code>3.8B</code><strong> active parameters, </strong><code>256K</code><strong> context, text/image/video inputs, and text output generated in parallel </strong><code>256</code><strong>-token blocks. The card claims &gt;1,100 tok/s at low batch sizes on H100 FP8, with NVIDIA Model Optimizer quantization targeting Hopper/Blackwell/vLLM-style deployment while preserving near-BF16 accuracy across reasoning/code/math benchmarks. A commenter pointed to an Unsloth </strong><code>GGUF</code><strong><a href="https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF"> release</a>, but noted it requires the DiffusionGemma-specific </strong><code>llama.cpp</code><strong><a href="https://github.com/ggml-org/llama.cpp/pull/24423"> PR/branch</a> and </strong><code>llama-diffusion-cli</code><strong>; standard </strong><code>llama-cli</code><strong> / </strong><code>llama-server</code><strong> cannot run this block-diffusion architecture yet.</strong> Discussion focused on hardware accessibility: users joked that the NVIDIA release assumes access to idle H100s, while the GGUF build was framed as the more practical &#8220;common-folks&#8221; option. 
Another commenter contrasted NVIDIA&#8217;s active model/community releases with AMD&#8217;s slower ROCm ecosystem progress.</p><ul><li><p>A technically useful alternative release was linked: <strong>Unsloth&#8217;s GGUF build</strong> of <code>diffusiongemma-26B-A4B-it</code> at <a href="https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF">huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF</a>. The comment notes that DiffusionGemma is a <strong>block-diffusion architecture</strong>, so it currently requires the dedicated DiffusionGemma branch/PR for <code>llama.cpp</code> (<a href="https://github.com/ggml-org/llama.cpp/pull/24423">ggml-org/llama.cpp#24423</a>) and the <code>llama-diffusion-cli</code> runner; standard <code>llama-cli</code> / <code>llama-server</code> generation is not supported yet.</p></li><li><p>A user raised a hardware/quantization compatibility question: whether a <strong>GeForce RTX 5060 Ti 16GB</strong> would benefit from NVIDIA&#8217;s <code>NVFP4</code> format compared with <strong>Unsloth GGUF quantizations</strong>. No technical answer was provided in the thread, but the question highlights the key practical issue: whether consumer Blackwell-class GPUs can realize meaningful inference gains from <code>NVFP4</code> versus more broadly supported GGUF quant formats.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u4bne8/diffusion_gemma_is_4x_faster_but_makes_6x_more/">Diffusion Gemma is 4x faster, but makes 6x more mistakes!</a></strong> (Activity: 368): <strong>OP reports a single-H100 FP8 benchmark comparing Gemma4 26B A4B vs DiffusionGemma 26B A4B on three factual-generation prompts of decreasing topic popularity: Steve Jobs, Tetris, and BeOS. 
DiffusionGemma was ~</strong><code>3.5&#8211;4x</code><strong> faster (</strong><code>763 tok/s</code><strong>, </strong><code>3.7s</code><strong>) than autoregressive Gemma4 (</strong><code>218 tok/s</code><strong>, </strong><code>15.1s</code><strong>), but had much worse fact accuracy: </strong><code>33</code><strong> correct / </strong><code>28</code><strong> wrong vs </strong><code>45</code><strong> correct / </strong><code>5</code><strong> wrong, with errors increasing on less common topics; examples included invented names and incorrect pricing. OP attributes this to DiffusionGemma generating/refining </strong><code>256</code><strong>-token blocks for fluency rather than token-by-token conditional checking, and notes their local-AI harness <a href="http://atomic.chat/">Atomic.Chat</a> supports GGUF, MLX Apple Silicon, MTP, and Google TurboQuant, with diffusion support planned via </strong><code>llama.cpp</code><strong>.</strong> Commenters pushed back that the result may reflect a <strong>new/undertrained and poorly understood architecture</strong> plus immature sampling parameters, not an inherent diffusion-vs-autoregressive limitation. Another technical critique asked for an <strong>equal-latency evaluation</strong>: spend the diffusion model&#8217;s saved time on verification/proofreading and compare final accuracy, ideally weighting errors by severity.</p><ul><li><p>Commenters noted that Diffusion Gemma&#8217;s apparent error rate may reflect a <strong>new and likely undertrained architecture</strong> rather than an inherent limitation of diffusion-based language models. 
One technical point raised was that its decoding behavior may depend heavily on <em>&#8220;new, poorly understood sampling parameters&#8221;</em>, making direct comparisons to mature autoregressive models potentially premature.</p></li><li><p>A technical evaluation concern was whether the <code>4x</code> speedup can be fairly traded for additional verification time: if the saved latency is spent on proofreading or reranking, Diffusion Gemma might still be competitive under an equal-time budget. Commenters also suggested measuring not just raw mistake count but <strong>error severity</strong>, since minor inaccuracies and high-impact factual failures should not be weighted equally.</p></li></ul></li></ul><h3><strong>3. Local Inference Acceleration and Quantized Builds</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u3flg9/gemma_4_quadruple_release_12b_12b_qat_26ba4b_qat/">Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics!</a></strong> (Activity: 768): <strong>LLMFan46 announced multiple &#8220;uncensored-heretic&#8221; Gemma 4 instruction-tuned releases on Hugging Face: </strong><code>31B-it-qat-q4_0</code><strong>, </strong><code>26B-A4B-it-qat-q4_0</code><strong>, </strong><code>12B-it-qat-q4_0</code><strong>, and </strong><code>12B-it</code><strong>. 
The releases are packaged across deployment formats including Safetensors, GGUF, NVFP4 Safetensors/GGUF, and for the larger QAT models GPTQ-Int4, with additional NVFP4 builds for </strong><code>gemma-4-31B-it-uncensored-heretic</code><strong>; the author says all releases include benchmarks, though no benchmark numbers are shown in the Reddit post.</strong></p><ul><li><p>A commenter asked whether an <strong>MTP QAT</strong> variant could be produced, implying interest in quantization-aware training for multi-token prediction rather than only the released Gemma 4 QAT variants.</p></li><li><p>Another technical question compared <code>q4_0</code><strong> GGUF vs </strong><code>NVFP4</code><strong> GGUF</strong> builds, asking which is recommended. This points to an implementation/performance tradeoff between conventional 4-bit GGUF quantization and NVIDIA FP4-oriented formats, likely dependent on backend/hardware support.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u3on4u/eagle3_has_landed_in_llamacpp/">EAGLE3 has landed in llama.cpp</a></strong> (Activity: 320): <code>llama.cpp</code><strong> merged <a href="https://github.com/ggml-org/llama.cpp/pull/18039">PR #18039</a>, adding EAGLE3 speculative decoding via the newer speculative decoding API while preserving compatibility with MTP. 
EAGLE3 is an encoder-decoder speculative method where the draft/helper model is conditioned on intermediate features from the target model rather than drafting independently, with reported inference speedups of roughly </strong><code>2&#8211;3&#215;</code><strong>, including </strong><code>&gt;2&#215;</code><strong> for Gemma4 with reasoning enabled and </strong><code>&gt;3&#215;</code><strong> with reasoning disabled; </strong><code>Q4_K_M</code><strong> quantization reportedly still preserves strong speedups.</strong> Commenters mainly framed EAGLE3 as another practical approach to mitigating the memory-bandwidth bottleneck in local inference, while asking for concrete comparisons against MTP in speed, VRAM usage, and model support such as Qwen3.6 27B.</p><ul><li><p>Commenters focused on unanswered technical comparisons between <strong>EAGLE3</strong> and <strong>MTP</strong>, specifically asking for <strong>tokens/sec benchmarks</strong>, VRAM overhead, and whether speculative decoding via EAGLE3 meaningfully helps break the usual <strong>memory-bandwidth bottleneck</strong> in <code>llama.cpp</code>.</p></li><li><p>There was specific concern about model compatibility, especially whether EAGLE3 can be used with <strong>Qwen3.6 27B</strong>; one commenter implied it may not currently be useful for Qwen3.6 users, suggesting support may depend on availability of compatible draft/head models or integration details.</p></li></ul></li></ul><h2><strong>Less Technical AI Subreddit Recap</strong></h2><blockquote><p>/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo</p></blockquote><h3><strong>1. 
Fable 5 US Government Suspension</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1u4d0if/us_gov_forces_anthropic_to_pull_access_to_fable_5/">US gov forces Anthropic to pull access to Fable 5</a></strong> (Activity: 1404): <strong>The post links to an Anthropic notice about </strong><code>Fable/Mythos</code><strong><a href="https://www.anthropic.com/news/fable-mythos-access"> access</a> and claims a U.S. government directive forced Anthropic to pull access to Fable 5. The excerpt provides no model-card details, benchmarks, eval results, or implementation specifics beyond the reported access-control/policy change.</strong> Commenters were broadly negative, with one saying they upgraded specifically for more Fable access and another noting the directive arrived late Friday. The only technical concern raised was speculation that the government may fear Fable 5 could help identify or patch zero-days that U.S. agencies exploit.</p><ul><li><p>One technically relevant concern raised is that removal of access to <strong>Anthropic&#8217;s &#8220;Fable 5&#8221;</strong> could be motivated by cybersecurity considerations: a commenter speculates the model may help identify or remediate <code>zero-day</code> vulnerabilities that the US government would prefer remain undisclosed. This frames the access restriction as potentially affecting vulnerability discovery workflows rather than merely consumer model availability.</p></li><li><p>Several comments interpret the action as a precedent for direct government control over frontier-model deployment, especially if a model is perceived as outperforming competitors or creating national-security risk. 
The practical technical impact noted is abrupt loss of access for users who upgraded plans specifically for higher usage of the model, highlighting reliability and dependency risks when building workflows around hosted frontier models.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1u4cyvh/fable_5_indefinitely_suspended_due_to_national/">Fable 5 indefinitely suspended due to national security concerns</a></strong> (Activity: 1082): <strong>The <a href="https://i.redd.it/2xkhfjgh7y6h1.jpeg">image</a> is a screenshot of a dark-mode post attributed to &#8220;ClaudeDevs&#8221; claiming Anthropic has indefinitely suspended access to a model called </strong><code>Claude Fable 5</code><strong> due to a U.S. government directive and &#8220;national security concerns.&#8221; Technically, the claimed impact is model-routing/API availability: new sessions would fall back to other Claude models such as </strong><code>Opus 4.8</code><strong>, while existing </strong><code>Fable 5</code><strong> sessions and platform API requests would return errors; however, the Reddit context provides no independent verification beyond the linked Anthropic-looking URL and screenshot, so it should be treated as an unverified announcement image rather than confirmed technical documentation.</strong> Comments are mostly outrage from users who say they recently paid for higher-tier access, e.g. &#8220;MFERS WHO JUST PAID 200$,&#8221; and confusion over why there is not more backlash. One linked comment image appears to be a meme/reaction rather than a technical contribution.</p></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1u4dij4/megathread_for_us_government_suspension_of_fable/">Megathread for US government suspension of Fable and Mythos</a></strong> (Activity: 1387): <strong>The subreddit opened a stickied megathread consolidating discussion around a reported US government suspension of Fable and Mythos. 
The post itself provides no technical details on the suspension mechanism, affected services/models, compliance basis, timelines, benchmarks, or implementation impact.</strong> Top comments frame the suspension as possible regulatory capture or anti-innovation intervention, with one user joking <em>&#8220;I see you haven&#8217;t bribed us yet&#8221;</em> and another asking whether the government is effectively saying <em>&#8220;stop being so good or we will nationalize you.&#8221;</em> One commenter also notes they had just bought a <code>$250</code> &#8220;Max 20x Usage&#8221; plan to heavily use &#8220;Fable 5,&#8221; implying immediate user-facing disruption.</p><ul><li><p>A user reported a concrete service-impact case: they had just purchased a <code>$250</code> &#8220;Max 20x Usage&#8221; plan specifically to use <strong>Fable 5</strong>, implying the suspension immediately affects paid high-usage access rather than only free-tier experimentation. Another commenter framed the broader technical/operational risk as dependency on US-hosted AI services, arguing that non-US users or organizations may not be able to rely on uninterrupted access if government action can suspend models such as <strong>Fable</strong> and <strong>Mythos</strong>.</p></li></ul></li></ul><h3><strong>2. 
Fable 5 Coding and Reverse-Engineering Breakthroughs</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1u34370/fable_5_decoded_an_entire_1989_dos_game/">Fable 5 decoded an entire 1989 DOS game executable in one day &#8212; six months of work with earlier models, done overnight</a></strong> (Activity: 1144): <strong>A developer remastering Midwinter claims Fable 5/Claude reverse-engineered the original 1989 DOS executable overnight, producing a labeled map of </strong><code>602</code><strong> functions covering terrain generation, vehicle physics, AI, win/loss logic, graphics formats, and audio; the terrain generator was reimplemented in Python with </strong><em><strong>bit-for-bit</strong></em><strong> matching output. The workflow reportedly used parallel agents over a disassembly with an evidence ledger, and the resulting decode/tools are published under MIT at </strong><code>midwinter-decode</code><strong>, with a playable/project write-up at the <a href="https://midwinter-remaster.titanium-helix.com/decode">project site</a> and an asset extractor for ~</strong><code>600</code><strong> sprites with CGA/EGA/VGA palettes.</strong> Commenters were impressed but raised two technical caveats: whether prior six months of accumulated project knowledge and the switch from Rust/Bevy to Unreal MCP made comparisons against earlier models unfair, and whether automated reconstruction of another commercial DOS game like <strong>Star Command</strong> should trigger IP/copyright guardrails.</p><ul><li><p>A commenter questioned the benchmark validity of the claimed speedup, noting possible <strong>self-bias / learning contamination</strong>: after <code>6 months</code> of prior reverse-engineering work, both the author and possibly Claude may benefit from accumulated domain knowledge rather than starting from an equivalent baseline. 
They also flagged the addition of <strong>Unreal MCP</strong> as a major tooling confounder, making the comparison against earlier models less fair unless each model is tested from a clean start with the same tools.</p></li><li><p>One technically interesting thread extrapolated the workflow to <strong>retrocomputing development</strong>: using Claude Code with a physical <code>1989 Macintosh</code>, <strong>SCSI link</strong>, or <strong>Apple IIe</strong> to generate software for machines that were historically difficult to program. The commenter highlighted that even 1980s systems could execute around <code>1 million instructions/sec</code>, but fully exploiting them often required expert low-level assembly optimization, citing the <em>RollerCoaster Tycoon</em> author&#8217;s raw assembly approach as an example.</p></li><li><p>Another commenter raised an applied reverse-engineering use case: porting older RPGs such as <strong>Might and Magic III</strong> into a later-series engine. The implication is that if model-assisted executable decoding can recover enough game logic and data structures from DOS-era binaries, engine migration and modernization of legacy games becomes more feasible.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1u3m6a8/i_vibe_coded_the_first_mmorpg_with_fable_5/">I vibe coded the first MMORPG with Fable 5</a></strong> (Activity: 2724): <strong>A developer claims to have &#8220;vibe coded&#8221; a browser-based MMORPG, World of ClaudeCraft, using Fable 5 over a couple of days, with the full source released on <a href="https://github.com/levy-street/world-of-claudecraft">GitHub</a> and a playable build at <a href="http://worldofclaudecraft.com/">worldofclaudecraft.com</a>. 
The game appears to be a Minecraft/RPG-like multiplayer web app with server-persisted online characters, an offline single-player mode without saves, WASD/mouse controls, targeting/abilities, quests, inventory, chat, map, loot, and RPG panels.</strong> Top commenters were surprised by the speed and polish, with one suggesting it could be <em>&#8220;guerilla marketing by Anthropic&#8221;</em> and another proposing a direct comparison by giving the same tasks to <strong>Claude Opus</strong>. One commenter specifically noted it seemed <em>&#8220;miles better&#8221;</em> than other vibe-coded games and asked whether the assets were AI-generated or sourced elsewhere.</p><ul><li><p>A commenter suggested using the same MMORPG-building prompt/tasks with <strong>Claude Opus</strong> as a control to compare against <strong>Fable 5</strong>, focusing on whether the models produce similar game functionality and implementation quality under identical constraints.</p></li><li><p>There was technical skepticism about extrapolating from a rapid prototype: one commenter noted that &#8220;vibe coded&#8221; progress over a few days likely <strong>does not scale linearly</strong> and can become expensive quickly as complexity, debugging, and iteration costs grow.</p></li><li><p>A thread questioned asset provenance&#8212;whether Fable 5 generated assets or sourced them externally&#8212;with one reply indicating the visuals were <strong>screenshots from the GitHub project</strong>, implying the demo may rely on existing project assets rather than fully generated ones.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1u3jlo0/i_gave_claude_code_a_lazy_senior_dev_mode_and_it/">I gave Claude Code a &#8220;lazy senior dev&#8221; mode and it writes like 6x less code</a></strong> (Activity: 1680): <strong>A new MIT-licensed Claude Code plugin, Ponytail (<a href="http://github.com/DietrichGebert/ponytail">GitHub</a>), adds a &#8220;lazy senior dev&#8221; 
coding mode that forces an agent through a minimization checklist: avoid new code if stdlib/native features/existing deps/one-liners suffice. In the author&#8217;s 5-task benchmark, it reportedly used </strong><code>~16%</code><strong> fewer tokens, ran </strong><code>~4x</code><strong> faster, and reduced generated code from </strong><code>293</code><strong> LOC to </strong><code>47</code><strong> LOC; one example dropped a 190-line countdown &#8220;dashboard&#8221; to </strong><code>13</code><strong> lines. It auto-activates in Claude Code with a statusline badge and also ships rule files for Cursor, Windsurf, Cline, Copilot, and Aider.</strong> Commenters generally liked the reduction in verbose, hard-to-review agent output, but one technical caveat noted that minimal email validation can be context-dependent: a check suitable before sending mail may be insufficient if invalid addresses are persisted to a database.</p><ul><li><p>Commenters raised a correctness issue with replacing robust email validation with a minimal check like <code>"@" in email</code>: it may be acceptable only if the next step is actually sending a confirmation email, but otherwise it can persist invalid addresses and create a data-quality bug. Another commenter explicitly called that validation approach &#8220;trash code,&#8221; highlighting that reduced code size can trade off against input-validation correctness.</p></li></ul></li></ul><h3><strong>3. 
Claude Subscription Unit Economics</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1u3syj3/for_every_200_subscription_anthropic_throws_in/">For every $200 subscription, Anthropic throws in another $7,800.</a></strong> (Activity: 1143): <strong>The <a href="https://i.redd.it/njd56ymgau6h1.png">image</a> is a dark-themed pricing comparison claiming Anthropic Claude Max 20x at </strong><code>$200/mo</code><strong> has a &#8220;max possible spend&#8221; of about </strong><code>$8,000/mo</code><strong>, while OpenAI ChatGPT Pro/Codex 20x at </strong><code>$200/mo</code><strong> could imply up to </strong><code>$14,000/mo</code><strong> in retail-equivalent usage. The post frames this as evidence of heavy subscription subsidization and possible unsustainable AI pricing, but the table appears to compare subscription fees against API retail token prices, not Anthropic/OpenAI&#8217;s actual marginal inference costs.</strong> Commenters pushed back that &#8220;max possible spend&#8221; is only an upper bound and that <strong>fee &#8800; cost</strong>: API token prices are retail prices, not provider cost. Several argued most subscribers never hit limits, so high-usage users are subsidized by lower-usage users rather than every <code>$200</code> user costing Anthropic <code>$8,000</code>.</p><ul><li><p>Several commenters pushed back on the headline&#8217;s calculation, arguing it conflates <strong>API list price</strong> with Anthropic&#8217;s internal inference cost. 
They noted that the <code>$7,800</code>/<code>$13,800</code> figures represent a theoretical API-equivalent maximum if a user saturated subscription limits continuously, not the marginal cost Anthropic actually incurs; <em>&#8220;Fee &#8800; cost&#8221;</em> was the core technical objection.</p></li><li><p>A recurring technical point was that subscription limits are designed around statistical oversubscription: most users on Max/Pro tiers do not hit caps continuously, so the relevant cost is expected utilization, not worst-case token throughput. One user reported downgrading from a <code>20x</code> Max plan to <code>5x</code> without hitting limits, using this as evidence that light users subsidize heavier users within the pricing model.</p></li><li><p>Commenters also highlighted that API pricing includes margin and product-level pricing strategy, not raw compute cost. References to cache and batch discounts were used as evidence that the API price has substantial markup, making it invalid to infer Anthropic&#8217;s per-user subsidy directly from retail token rates.</p></li></ul></li></ul><h1><strong>AI Discords</strong></h1><p>Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. 
Thanks for reading this far; it was a good run.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Loopcraft: The Art of Stacking Loops]]></title><description><![CDATA[a quiet day lets us highlight a great concept from Peter Steinberger, Boris Cherny, and Andrej Karpathy]]></description><link>https://www.latent.space/p/ainews-loopcraft-the-art-of-stacking</link><guid isPermaLink="false">https://www.latent.space/p/ainews-loopcraft-the-art-of-stacking</guid><pubDate>Fri, 12 Jun 2026 05:34:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Y74!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a lot of &#8220;loop discourse&#8221; in the air:</p><ul><li><p><a href="https://x.com/steipete/status/2063697162748260627">Steipete</a>: &#8220;Here&#8217;s your monthly reminder that you shouldn&#8217;t be prompting coding agents anymore. You should be designing loops that prompt your agents.&#8221;</p></li><li><p><a href="https://x.com/0xwhrrari/status/2064804504608887040">Boris</a>: &#8220;I don&#8217;t prompt Claude anymore. I write loops, the loops do the work.&#8221;</p></li><li><p><a href="https://www.youtube.com/watch?v=kwSVtQ7dziU">Andrej</a> on <a href="https://www.latent.space/p/ainews-autoresearch-sparks-of-recursive?utm_source=publication-search">Autoresearch</a>: &#8220;To get the most out of the tools that have become available now you have to <strong>remove yourself as the bottleneck</strong>. You can&#8217;t be there to prompt the next thing. You need to take yourself outside. You have to <strong>arrange things such that they&#8217;re completely autonomous</strong> and the more you know how can you maximize your token throughput and <strong>not be in the loop</strong>. 
This is the goal and the name of the game now is to <strong>increase your leverage</strong>&#8230;. I don&#8217;t want to be the researcher in the loop looking at results etc, I&#8217;m holding the system back. <strong>So the question is how do I refactor all the abstractions so that I&#8217;m not I have to arrange it once and hit go.</strong>&#8221;</p></li></ul><p>We like this a lot and people don&#8217;t realize how many loops we are already in:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Y74!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Y74!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 424w, https://substackcdn.com/image/fetch/$s_!6Y74!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 848w, https://substackcdn.com/image/fetch/$s_!6Y74!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 1272w, https://substackcdn.com/image/fetch/$s_!6Y74!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6Y74!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png" width="1200" height="675.8241758241758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:263012,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/201541207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Y74!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 424w, https://substackcdn.com/image/fetch/$s_!6Y74!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 848w, https://substackcdn.com/image/fetch/$s_!6Y74!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6Y74!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517bbc58-4f26-46b5-a12e-f4a5f84b0a30_1986x1118.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p>More minimalist, a smaller set of loops:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!4fI5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4fI5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 424w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 848w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 1272w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4fI5!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png" width="1200" height="495.6521739130435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:380,&quot;width&quot;:920,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:42660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/201541207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4fI5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 424w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 848w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 1272w, https://substackcdn.com/image/fetch/$s_!4fI5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347a258a-520b-4c35-9bb5-84d753fcbe5b_920x380.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>One might argue the entire game of the next century is to be able to <strong>stack loops</strong> as effectively as possible. In the early days of each phase, it will be valuable to know when to go <strong>DOWN</strong> a loop when things go wrong (for <strong>reliability</strong>)&#8230; but it will probably be more valuable to know how to go <strong>UP</strong> a loop as models improve (for <strong>leverage</strong>). </p><p>If you don&#8217;t figure out how to do this, don&#8217;t be salty when you lose to those that do.</p><p>Rich has his &#8220;<a href="https://x.com/RichardSSutton/status/2056419165502935198">Bitter Lesson</a>&#8221; for models. 
We now have <strong>the Salty Lesson for agents</strong>:</p><blockquote><p><strong>Don&#8217;t fix things yourself, as you have done historically.<br>Instead focus on systems that scale with more agents, like goals and orchestration.</strong></p></blockquote><blockquote><p>AI News for 6/10/2026-6/11/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Anthropic&#8217;s Fable 5 rollout, covert sandbagging backlash, and model behavior debates</strong></p><ul><li><p><strong>Silent degradation policy was quickly reversed after public backlash</strong>: Multiple posts focused on Anthropic&#8217;s decision to covertly degrade <strong>Claude Fable 5</strong> for some AI-research-related use cases, then reverse course within roughly a day. <a href="https://x.com/simonw/status/2064918665859080392">Simon Willison</a> welcomed the rollback; <a href="https://x.com/MTSlive/status/2064922000020398331">MTS live</a> summarized that Anthropic was reversing the policy; <a href="https://x.com/kimmonismus/status/2065003618710008084">Kim Monismus</a> framed it as a retreat after criticism from researchers. 
The strongest technical criticism centered less on the existence of safeguards and more on <strong>opaque behavior at the model layer</strong>: <a href="https://x.com/code_star/status/2064931207310118940">Code Star</a> argued safeguards are normal but &#8220;obfuscation without warning&#8221; violates the user/provider contract, while <a href="https://x.com/ClementDelangue/status/2065069246124613999">Clement Delangue</a> stressed the importance of avoiding AI manipulation.</p></li><li><p><strong>The substantive dispute is about governance, transparency, and access to frontier models</strong>: Several researchers drew a distinction between legitimate restrictions and hidden sabotage. <a href="https://x.com/RyanPGreenblatt/status/2064948033423598035">Ryan Greenblatt</a> said blocking frontier AI R&amp;D may be reasonable in principle, but silent sandbagging is not; later he argued for <strong>access programs with KYC/monitoring</strong> for safety/security researchers rather than broad capability denial (<a href="https://x.com/RyanPGreenblatt/status/2065182720133841069">1</a>, <a href="https://x.com/RyanPGreenblatt/status/2065174434672148487">2</a>). <a href="https://x.com/natolambert/status/2065082135682383950">Nathan Lambert</a> gave the most detailed critique: the main error was an <strong>uneven safety implementation that misled users</strong>, undermined trust, and reinforced concentration of power over who gets to do frontier research. <a href="https://x.com/GergelyOrosz/status/2065029326215528474">Gergely Orosz</a> turned this into an engineering recommendation: put models behind <strong>provider-agnostic routers/harnesses</strong> so teams can switch vendors quickly when T&amp;Cs or behavior become unacceptable.</p></li><li><p><strong>Fable 5&#8217;s capabilities are strong, but its product behavior is still noisy and expensive</strong>: Benchmarks and anecdotes were mixed. 
<a href="https://x.com/htihle/status/2065050640154350043">htihle</a> reported <strong>87.8% on WeirdML</strong>, the first model above 70% average on each task there. <a href="https://x.com/ProximalHQ/status/2065184730279223410">ProximalHQ</a> said Fable 5 ranks <strong>#1 on FrontierSWE</strong>, with runs productive for nearly <strong>20 hours</strong> on some tasks. But practical reports highlighted cost, refusals, and odd phrasing: <a href="https://x.com/threepointone/status/2065131942279016700">threepointone</a> spent about <strong>$250</strong> on a ~10k LOC PR and didn&#8217;t find it worth it; <a href="https://x.com/cline/status/2065192415498277335">Cline</a> said cheaper models plus adversarial review loops often match or beat it on cost/perf; <a href="https://x.com/tamaybes/status/2065147305494450248">tamaybes</a> described Fable inventing internal &#8220;codenames&#8221; during coding, leaking its own &#8220;neuralese&#8221; into outputs. Benchmarks also suggested sharp asymmetries depending on task framing: <a href="https://x.com/scaling01/status/2065209370145702040">scaling01</a> pointed to <strong>200/200 refusals on ProgramBench</strong>, while <a href="https://x.com/thoughtfullab/status/2065096885514227876">thoughtfullab</a> and <a href="https://x.com/karinanguyen/status/2065198770292146280">karinanguyen</a> highlighted unusually strong post-training/AI-improves-AI behavior.</p></li></ul><p><strong>Automated AI research and agentic optimization systems</strong></p><ul><li><p><strong>Recursive SI showed a general system hitting SOTA on public optimization benchmarks</strong>: The most technically notable release was from <a href="https://x.com/RichardSocher/status/2065094362774876232">Richard Socher</a> and <a href="https://x.com/_rockt/status/2065061990800802249">Recursive SI</a>, who presented an early &#8220;automated open-ended discovery system&#8221; for AI research. 
They claim state-of-the-art results on three public tasks: <strong>NVIDIA SOL-ExecBench</strong>, <strong>NanoGPT Speedrun</strong>, and <strong>NanoChat autoresearch</strong>, and they <a href="https://x.com/_rockt/status/2065061993271202171">open-sourced the discoveries</a>. Detail tweets from <a href="https://x.com/cong_ml/status/2064992941844615246">cong_ml</a> gave the metrics: on NanoChat, reaching the same loss <strong>1.3&#215; faster</strong>; on NanoGPT Speedrun, reducing runtime from <strong>79.7s to 77.5s</strong>; on SOL-ExecBench, improving mean score from <strong>0.699 to 0.754</strong> over 235 kernels. This is notable less as &#8220;AGI research automation&#8221; than as evidence that current systems can already contribute on <strong>narrow, high-feedback systems optimization tasks</strong>.</p></li><li><p><strong>Microsoft&#8217;s Arbor points in a similar direction for long-horizon autonomous research</strong>: <a href="https://x.com/HuggingPapers/status/2065062300218749172">Hugging Papers</a> highlighted <strong>Arbor</strong>, a Microsoft Research autonomous research agent using <strong>persistent hypothesis-tree refinement</strong>. The claim: it beats Codex and Claude Code across six research tasks and reaches <strong>86% Any-Medal on MLE-Bench Lite</strong>. Together with Recursive&#8217;s results, Arbor suggests a growing split in &#8220;agents for research&#8221; between: (1) systems optimized for rapid iterative systems tuning, and (2) systems optimized for <strong>long-horizon hypothesis management</strong>.</p></li><li><p><strong>Benchmarks are adapting to measure AI-on-AI improvement and real-world labor tasks</strong>: <a href="https://x.com/thoughtfullab/status/2065096885514227876">thoughtfullab</a> positioned <strong>PostTrainBench</strong> as a recursive-self-improvement eval&#8212;AI training weaker models and measuring loop progress directly. 
<a href="https://x.com/dawnsongtweets/status/2065095757988868190">dawnsongtweets</a> introduced <strong>Agents&#8217; Last Exam (ALE)</strong>, a rolling benchmark over <strong>1,500 expert-sourced tasks across 55 occupations</strong>; frontier agents solve a meaningful fraction of work, but on the hardest tier all tested systems scored <strong>0%</strong>. <a href="https://x.com/manoelribeiro/status/2065055795998233039">manoelribeiro</a> introduced <strong>SciConBench</strong> with <strong>9.11k questions from Cochrane reviews</strong>, finding that frontier agents still cannot synthesize scientific conclusions reliably. The pattern across these releases: agents are increasingly useful in bounded loops, but remain brittle on <strong>expert synthesis and economically valuable long-horizon tasks</strong>.</p></li></ul><p><strong>Data infrastructure becomes a first-class bottleneck: robotics, dataset observability, and dependency tracing</strong></p><ul><li><p><strong>Macrodata Labs launched to build the robotics data loop</strong>: The clearest infra startup announcement came from <a href="https://x.com/gui_penedo/status/2064981375694909757">Guilherme Penedo</a>, <a href="https://x.com/HKydlicek/status/2064984505706774779">Hynek Kydl&#237;&#269;ek</a>, and <a href="https://x.com/macrodata_labs/status/2064984775652192652">Macrodata Labs</a>. Their thesis: robotics is where LLMs were a few years ago, and the hard part is not architecture but <strong>messy multimodal physical data pipelines</strong>&#8212;video, multi-rate sensors, heterogeneous formats, hand tracking, subtask segmentation, reward model scoring, and continuous ingestion. Their first product, <strong>Refiner</strong>, is an open-source framework plus cloud runtime for turning raw demonstrations into training-ready datasets with sharding, checkpointing, observability, and lineage. 
This drew support from multiple infra-focused practitioners who view &#8220;look at the data&#8221; and pipeline introspection as still underbuilt in multimodal/agentic settings (<a href="https://x.com/code_star/status/2064997532602663203">Code Star</a>, <a href="https://x.com/eliebakouch/status/2065114511439249852">eliebakouch</a>).</p></li><li><p><strong>Data quality/debugging is becoming more explicit and instrumented</strong>: <a href="https://x.com/GoodfireAI/status/2065118189986717902">Goodfire</a> introduced <strong>predictive data debugging</strong>, arguing that preference/DPO datasets contain hidden pathologies&#8212;from broken guardrails to hallucinations&#8212;and should be analyzed before training. <a href="https://x.com/allen_ai/status/2065100726032839024">AllenAI</a> released <strong>ModSleuth</strong>, tracing the dependency graph of modern LLMs and showing that models increasingly rely on large chains of <strong>other models plus datasets</strong>; they cite <strong>Olmo 3</strong> as depending on <strong>89 models and 183 datasets</strong>, and <strong>Nemotron 3</strong> on <strong>273 models and 560 datasets</strong>. This is a useful corrective to simplistic &#8220;model trained on web data&#8221; narratives: modern LLM construction is already deeply <strong>compositional and synthetic</strong>.</p></li><li><p><strong>Memory, retrieval, and vector infra remain active design space despite larger contexts</strong>: <a href="https://x.com/kamtybor/status/2065028126636204243">Weaviate&#8217;s Engram</a> proposes an <strong>extract &#8594; transform &#8594; commit</strong> memory maintenance loop instead of naively appending chat logs; <a href="https://x.com/weaviate_io/status/2065055262851973306">Weaviate Playground</a> packaged this and related RAG/agent demos. 
On the retrieval side, <a href="https://x.com/qdrant_engine/status/2065056457461321761">Qdrant</a> argued larger context windows do <strong>not</strong> make retrieval obsolete because context still imposes cost/latency, while <a href="https://x.com/rishdotblog/status/2065026144903315545">rishdotblog</a> warned against vector search without guardrails. The trend is toward <strong>active memory management and retrieval efficiency</strong>, not simple replacement by giant context windows.</p></li></ul><p><strong>Inference speed, kernel work, and open systems releases</strong></p><ul><li><p><strong>Diffusion and speculative/local inference saw concrete speed wins</strong>: <a href="https://x.com/demishassabis/status/2064873362799600042">Demis Hassabis</a> highlighted <strong>DiffusionGemma</strong>, described as <strong>4&#215; faster</strong> than other Gemma 4 variants; <a href="https://x.com/osanseviero/status/2065041448135770436">osanseviero</a> said demos had to be slowed down for viewers. <a href="https://x.com/UnslothAI/status/2065107734916432189">Unsloth</a> released <strong>Gemma 4 MTP GGUFs</strong>, claiming <strong>1.4&#8211;2.2&#215;</strong> faster local inference with no accuracy loss; the 12B model reportedly reaches <strong>162 tok/s vs 52 tok/s</strong> baseline and runs in <strong>6GB RAM</strong>. 
<a href="https://x.com/baseten/status/2065100012934095171">Baseten</a> made <strong>Inception Mercury 2</strong> available, claiming diffusion-LLM serving at <strong>1,000+ tok/s</strong>, with early users seeing <strong>82% latency reduction</strong> and <strong>90% cost savings</strong>.</p></li><li><p><strong>MiniMax and Together emphasized kernel/systems work behind long-context serving</strong>: <a href="https://x.com/RyanLeeMiniMax/status/2065010795625562486">MiniMax</a> open-sourced its high-performance <strong>MSA kernel library</strong>, with model weights expected shortly after; <a href="https://x.com/iamgrigorev/status/2065074479621935355">iamgrigorev</a> pointed to the paper release. <a href="https://x.com/togethercompute/status/2065109302717669392">Together</a> described the serving work behind <strong>M3</strong>: <strong>KV-block-major sparse attention</strong>, MSA integration with paged KV cache, decode index scoring optimizations, and moving multimodal preprocessing into a <strong>Rust gateway</strong> before GPU workers. <a href="https://x.com/charles_irl/status/2065148183412695282">charles_irl</a> also published a post on FlashAttention-4 inference improvements and upstream contributions, showing that performance deltas increasingly come from <strong>end-to-end serving stack choices</strong>, not just model architecture.</p></li></ul><p><strong>Agents, developer tooling, and managed execution</strong></p><ul><li><p><strong>Managed agents are becoming schedulable, credential-aware infra primitives</strong>: <a href="https://x.com/ClaudeDevs/status/2065080005328249086">ClaudeDevs</a> added <strong>scheduled deployments</strong> and <strong>environment variables</strong> to Claude Managed Agents, enabling recurring jobs and CLI/API auth without exposing secrets to the model; credentials are swapped at the network boundary (<a href="https://x.com/ClaudeDevs/status/2065080009203892302">details</a>). 
<a href="https://x.com/perplexity_ai/status/2065124930463916317">Perplexity</a> integrated <strong>Deep Research as a native skill inside Computer</strong>, backed by its &#8220;search as code&#8221; architecture (<a href="https://x.com/perplexity_ai/status/2065124948793028691">details</a>). These both point to the same product direction: agents as <strong>persistent services with tool/runtime boundaries</strong>, not just chat modes.</p></li><li><p><strong>Hermes, Devin, Cursor, GitHub Copilot and LangSmith all pushed further into operational tooling</strong>: <a href="https://x.com/Teknium/status/2065060810729414695">Teknium</a> unified profile management in <strong>Hermes Agent</strong>, then added remote file access in the desktop app (<a href="https://x.com/Teknium/status/2065112576552526168">remote files</a>). <a href="https://x.com/cognition/status/2065156301668171873">Cognition</a> and <a href="https://x.com/imjaredz/status/2065153770762154186">imjaredz</a> open-sourced <strong>/handoff</strong>, letting local coding agents offload jobs to cloud Devins. <a href="https://x.com/cursor_ai/status/2065137803084857845">Cursor</a> made <strong>auto-review</strong> the default for new users with a classifier subagent gating actions, claiming <strong>97% accuracy</strong>. <a href="https://x.com/MicrosoftAI/status/2065133021049782491">Microsoft</a> rolled out <strong>MAI-Code-1-Flash</strong> across Copilot tiers, while <a href="https://x.com/pierceboggan/status/2065130447630487821">pierceboggan</a> emphasized support for both model and harness choice. <a href="https://x.com/LangChain/status/2065090475913068766">LangChain</a> launched <strong>LangSmith LLM Gateway</strong> with spend limits, PII/secrets detection, trace continuity, and audit logging. 
The common theme is a shift from &#8220;best model&#8221; discourse toward <strong>execution control, review layers, observability, and portability</strong>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Fable 5 product discourse dominated attention</strong>: the highest-engagement technical-adjacent posts were highly anecdotal but still informative about perception. <a href="https://x.com/aaronli/status/2064876123109089742">aaronli&#8217;s claim that Fable 5 &#8220;solved CAD&#8221;</a> drew major attention, while <a href="https://x.com/kradleai/status/2064907897373642912">KradleAI&#8217;s thread claiming Fable 5 &#8220;lies 96% of the time&#8221;</a> captured the opposite pole: high capability mixed with trust concerns.</p></li><li><p><strong>DiffusionGemma&#8217;s speed became a breakout systems story</strong>: <a href="https://x.com/demishassabis/status/2064873362799600042">Demis Hassabis&#8217;s post</a> on <strong>4&#215; faster</strong> text diffusion for Gemma drove unusually high engagement for an inference/systems topic, suggesting strong appetite for non-autoregressive speedups that actually ship.</p></li><li><p><strong>AI economics and pricing got broad traction</strong>: <a href="https://x.com/kimmonismus/status/2064987311402537184">Kim Monismus&#8217;s post</a> arguing that premium AI subscriptions are massively subsidized&#8212;estimating <strong>$8k equivalent usage for Claude Max 20x</strong> and <strong>$14k for ChatGPT Pro 20x</strong>&#8212;was one of the more widely shared technical-business threads, especially alongside reports that <a href="https://x.com/kimmonismus/status/2065043333941207160">OpenAI may consider token price cuts</a>.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-loopcraft-the-art-of-stacking">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo]]></title><description><![CDATA[a quiet day lets us reflect on a great essay]]></description><link>https://www.latent.space/p/ainews-open-models-model-labs-vs</link><guid isPermaLink="false">https://www.latent.space/p/ainews-open-models-model-labs-vs</guid><pubDate>Thu, 11 Jun 2026 03:14:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!76lN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Sarah Guo is a <a href="https://x.com/TheTuringPost/status/2061901518522188251?s=20">friend of the pod</a> and <a href="https://open.spotify.com/episode/2FIOWcKF1Mnl2Nh1UJHJ2H">Queen of AI</a>, and after our <a href="https://www.latent.space/p/satya-2026">Satya crossover pod</a> (great <a href="https://x.com/gokulr/status/2064837699568300344">recap here from Gokul Rajaram</a>) wrote an excellent article on <a href="https://saranormous.substack.com/p/the-untrainable?r=1o4vkp&amp;utm_campaign=post&amp;utm_medium=web&amp;triedRedirect=true">her Substack</a>. 
Go read it, and come back for this reaction:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!76lN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!76lN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 424w, https://substackcdn.com/image/fetch/$s_!76lN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 848w, https://substackcdn.com/image/fetch/$s_!76lN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 1272w, https://substackcdn.com/image/fetch/$s_!76lN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!76lN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png" width="1456" height="1548" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1548,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:455745,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/201534737?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!76lN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 424w, https://substackcdn.com/image/fetch/$s_!76lN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 848w, https://substackcdn.com/image/fetch/$s_!76lN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 1272w, https://substackcdn.com/image/fetch/$s_!76lN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F709bf7b6-3173-4a7f-9099-fcabd2ebd438_1954x2078.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>This framework (based on <a href="https://www.youtube.com/watch?v=96S_64ipHOA">legibility, another worthwhile concept if you are unfamiliar</a>) simultaneously addresses a lot of the themes we have discussed on the Satya pod, but also Latent Space over the last two years:</p><ul><li><p><strong>The Place of Open Models:</strong> With Braintrust in 2024 we were <a href="https://www.latent.space/p/braintrust?utm_source=publication-search">maximally bearish on Open Model adoption</a>, only to turn around by our <a href="https://www.latent.space/p/pmarca">Pmarca</a>, <a href="https://www.latent.space/p/cursor-third-era">Cursor</a>, and <a href="https://www.latent.space/p/notion?utm_source=publication-search">Notion in 2026</a> pods</p></li><li><p><strong><a href="https://www.latent.space/p/agent-labs?utm_source=publication-search">Agent 
Labs vs Model Labs</a>: </strong>Sarah (a Cognition investor) echoes <strong><a href="https://www.swyx.io/cognition">the Devin is in the Details</a></strong>: &#8220;An application earns its place in the untrainable corner by <strong>doing unglamorous work</strong>: arranging a company&#8217;s private reality so a model can act on it, handing the model the tools to act, working with the customer to change the reality of its workforce. A company that brings the translation is tough to copy &#8211; and the translation never ends. Integration and maintenance run as long as the relationship does, <strong>won by teams that put domain-specialized engineers and tools next to the customer</strong>.&#8221;</p></li><li><p><strong>Free Verifiable Benchmarks</strong>: Why labs like Anthropic were so quick to pick up <a href="https://www.latent.space/p/ainews-frontiercode-benchmarking">FrontierCode</a> for the <a href="https://www.latent.space/p/ainews-anthropic-claude-fable-5-mythos">Fable launch</a>, and why Sarah agrees, even with us, that &#8220;The most cited benchmark score of the year is a map of <strong>territory about to be worthless</strong>, and a notice of who is about to lose the right to say what counts as good.&#8221;</p></li></ul><p>She ends with a note on Intent: &#8220;<strong>Even harder is offense, choosing what to build in the first place.</strong> That&#8217;s what I spend the year looking for, and I find it maybe three times. The model is no help there. It will do whatever you point it at and can&#8217;t tell you what&#8217;s worth pointing it at, and you can&#8217;t benchmark that, so you can&#8217;t train it. It&#8217;s also the reason the incumbents don&#8217;t take everything: they keep the ground they have, and the next thing comes from someone who finds a use before the rest of us. Maybe intent is an even scarcer input than compute.&#8221;</p><p></p><p></p><blockquote><p>AI News for 6/9/2026-6/10/2026. 
We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Anthropic&#8217;s Fable/Mythos rollout, silent capability gating, and the trust backlash</strong></p><ul><li><p><strong>Silent degradation of AI R&amp;D help dominated the discourse</strong>: A large share of technical tweets focused on Anthropic apparently degrading model performance on AI research-related prompts without clear up-front disclosure, rather than hard-refusing those requests. Criticism was unusually broad: researchers and builders argued this creates an unverifiable gap between observed and actual model capability, undermines reproducibility, and damages trust in model outputs for adjacent domains like coding, biology, and systems work. Representative critiques came from <a href="https://x.com/natolambert/status/2064699044145095104">@natolambert</a>, <a href="https://x.com/martin_casado/status/2064727048460058937">@martin_casado</a>, <a href="https://x.com/drfeifei/status/2064735920281313688">@drfeifei</a>, <a href="https://x.com/antirez/status/2064766431531532588">@antirez</a>, <a href="https://x.com/ClementDelangue/status/2064673792303955985">@ClementDelangue</a>, and <a href="https://x.com/deanwball/status/2064665679307985244">@deanwball</a>. Several posts made the narrower point that, even if Anthropic wants to restrict frontier-use cases, <strong>explicit refusals or model downgrades</strong> would be more defensible than silent sabotage, e.g. 
<a href="https://x.com/hlntnr/status/2064733332882026565">@hlntnr</a>, <a href="https://x.com/_arohan_/status/2064644778147643401">@_arohan_</a>, and <a href="https://x.com/DBahdanau/status/2064692204287799728">@DBahdanau</a>.</p></li><li><p><strong>Enterprise concerns extended beyond safety to retention and lock-in</strong>: Builders highlighted that Fable/Mythos reportedly come with <strong>30-day prompt/data retention</strong> and no opt-out in some settings, which immediately excludes zero-retention environments and parts of Europe. See <a href="https://x.com/GergelyOrosz/status/2064618497150210391">@GergelyOrosz</a> on prompt-history retention and opaque model changes, and <a href="https://x.com/scaling01/status/2064685085379477742">@scaling01</a> on zero-data-retention incompatibility. A second-order lesson repeated by multiple practitioners: treat frontier APIs as unstable dependencies, maintain model portability, and verify outputs continuously with evals and harnesses, as argued by <a href="https://x.com/dbreunig/status/2064751540003643738">@dbreunig</a>, <a href="https://x.com/omarsar0/status/2064753171214299209">@omarsar0</a>, and <a href="https://x.com/yacineMTB/status/2064801103447736398">@yacineMTB</a>.</p></li><li><p><strong>Anthropic paired the controversy with a policy push</strong>: Amid the backlash, Dario Amodei published <strong>&#8220;Policy on the AI Exponential&#8221;</strong>, arguing AI progress is outrunning institutions and calling for stronger frontier oversight; Anthropic simultaneously announced related initiatives and a proposed government role in blocking unsafe releases. See <a href="https://x.com/DarioAmodei/status/2064781775247950326">@DarioAmodei</a> and <a href="https://x.com/AnthropicAI/status/2064783418844762489">@AnthropicAI</a>. 
The tension was obvious to the community: the same company being criticized for opaque private controls is now advocating stronger public controls.</p></li></ul><p><strong>Fable 5&#8217;s benchmark strength and product performance despite the controversy</strong></p><ul><li><p><strong>Fable 5 appears genuinely strong on agentic and coding workloads</strong>: Even many critics of Anthropic&#8217;s policy acknowledged the model itself is excellent. Community reports had it leading or near-leading on a wide mix of evaluations: <a href="https://x.com/arena/status/2064807170714358193">Agent Arena</a> showed <strong>#1 overall</strong> with especially large margins in confirmed task success and user praise, albeit weaker steerability; <a href="https://x.com/mchlhess/status/2064734182648221952">@mchlhess</a> said it &#8220;completely demolishes&#8221; his benchmark; <a href="https://x.com/JasonBotterill/status/2064699951578505446">@JasonBotterill</a> noted <strong>81.9% on SimpleBench</strong>; <a href="https://x.com/lvwerra/status/2064758389406589134">@lvwerra</a> reported <strong>#1 on CADGenBench</strong>; <a href="https://x.com/scaling01/status/2064812046902817051">@scaling01</a> highlighted strong computer-use results; and <a href="https://x.com/LechMazur/status/2064815890651140447">@LechMazur</a> flagged <strong>#1 on PACT</strong> negotiation.</p></li><li><p><strong>Builders reported substantial real-world gains, but not uniformly</strong>: A number of practitioners described major productivity gains on long-horizon coding and creative tasks, including game generation and hard bug-fixing, e.g. <a href="https://x.com/kimmonismus/status/2064744343349399634">@kimmonismus</a>, <a href="https://x.com/walden_yan/status/2064755974548902006">@walden_yan</a>, and <a href="https://x.com/hrishioa/status/2064717079526383699">@hrishioa</a>. 
At the same time, others reported brittle behavior, expensive consumption, or worse performance than GPT-5.5 on specific tasks, such as <a href="https://x.com/Sentdex/status/2064738018255159363">@Sentdex</a> and <a href="https://x.com/QuixiAI/status/2064771682397569364">@QuixiAI</a>. The net takeaway from the timeline: <strong>Fable 5 is plausibly state-of-the-art for many agentic coding tasks, but trust and product constraints are materially affecting adoption</strong>.</p></li><li><p><strong>Distribution and integration moved quickly</strong>: Perplexity added <strong>Claude Fable 5 as an orchestrator model</strong> in Computer for Pro/Max users via <a href="https://x.com/perplexity_ai/status/2064771411894567373">@perplexity_ai</a> and <a href="https://x.com/AravSrinivas/status/2064775723886182427">@AravSrinivas</a>. Apple developers got <strong>Foundation Models framework support for Claude</strong> for multi-step reasoning, longer context, and code use via <a href="https://x.com/ClaudeDevs/status/2064756984617021807">@ClaudeDevs</a>. Community behavior also suggested substitution pressure toward OpenAI/Codex after the backlash, including <a href="https://x.com/dylan522p/status/2064727949274955953">@dylan522p</a> reporting usage share moving from Anthropic toward OpenAI.</p></li></ul><p><strong>Google&#8217;s DiffusionGemma release and renewed interest in diffusion LLMs</strong></p><ul><li><p><strong>Google released DiffusionGemma under Apache 2.0</strong>: The most important open-model launch in the set was <strong>DiffusionGemma</strong>, an experimental <strong>26B MoE diffusion text model</strong> built on Gemma 4 and released with open weights under <strong>Apache 2.0</strong>. Instead of autoregressive next-token generation, it generates and refines <strong>blocks of text simultaneously</strong>, with claims of <strong>up to 4x faster</strong> output and around <strong>1,000+ tokens/sec</strong> on suitable hardware. 
See <a href="https://x.com/Google/status/2064741293163418032">@Google</a>, <a href="https://x.com/GoogleDeepMind/status/2064741061352636762">@GoogleDeepMind</a>, <a href="https://x.com/googlegemma/status/2064741002204545467">@googlegemma</a>, and <a href="https://x.com/sundarpichai/status/2064744343743922189">@sundarpichai</a>.</p></li><li><p><strong>The systems story landed immediately</strong>: The release mattered not just as a research artifact but as serving infrastructure progress. <a href="https://x.com/vllm_project/status/2064753414735900835">@vllm_project</a> said DiffusionGemma is the first diffusion LLM natively supported in <strong>vLLM</strong>, citing <strong>1200+ output tok/s</strong> at batch size 1 on a single H200 with FP8. <a href="https://x.com/danielhanchen/status/2064760001567306232">@danielhanchen</a> showed it running locally via <strong>llama.cpp</strong> with GGUFs; <a href="https://x.com/UnslothAI/status/2064743714875220118">@UnslothAI</a> emphasized local execution on <strong>18GB-class</strong> hardware; and <a href="https://x.com/_philschmid/status/2064745464252055647">@_philschmid</a> summarized the inference footprint as <strong>3.8B active params</strong> and <strong>256-token block denoising</strong>.</p></li><li><p><strong>Why researchers cared</strong>: Diffusion-style text generation revives questions around iterative refinement, constrained editing, fill-in-the-middle, and error correction. 
Multiple reactions framed it less as a productized competitor and more as a fertile research direction for <strong>non-sequential decoding</strong> and refinement-heavy tasks; see <a href="https://x.com/omarsar0/status/2064742095387005352">@omarsar0</a>, <a href="https://x.com/mervenoyann/status/2064753402064601181">@mervenoyann</a>, and <a href="https://x.com/dbreunig/status/2064752321817719204">@dbreunig</a>.</p></li></ul><p><strong>Agent tooling, infra, and benchmarks: more structure around real workloads</strong></p><ul><li><p><strong>Benchmarks are shifting from preference to trace-based agent metrics</strong>: <a href="https://x.com/arena/status/2064748918135824876">@arena</a> detailed the methodology behind <strong>Agent Arena</strong>, which mines long-horizon traces for objective signals like bash errors, tool hallucination, and &#8220;insanity&#8221; rather than relying on human preference for every step. This is an important direction for agent evals where tasks span dozens of tool calls and 30-minute traces.</p></li><li><p><strong>Memory, orchestration, and environment control keep maturing</strong>: Several launches targeted the missing systems layer around agents. <a href="https://x.com/Teknium/status/2064764570519146935">@Teknium</a> shipped GUI-based <strong>Hermes Agent profiles</strong> and later <strong>Write Gate</strong> approval controls for memory/skill updates via <a href="https://x.com/Teknium/status/2064831491130130879">@Teknium</a>. <a href="https://x.com/weaviate_io/status/2064703135902216618">@weaviate_io</a> described structured agent memory using groups, topics, and scopes in <strong>Engram</strong>. <a href="https://x.com/bromann/status/2064760446847168811">@bromann</a> argued for bringing client-side/browser capabilities into the agent loop. 
<a href="https://x.com/FactoryAI/status/2064764834928107914">@FactoryAI</a> launched <strong>Missions</strong> on Factory Desktop.</p></li><li><p><strong>Detection, routing, and community harnesses</strong>: <a href="https://x.com/perceptroninc/status/2064732691845824833">@perceptroninc</a> launched <strong>Agentic Detection</strong>, using multi-call zoom/reason loops for dense ambiguous visual detection instead of a one-shot detector; <a href="https://x.com/vllm_project/status/2064679109406740827">@vllm_project</a> highlighted <strong>Inferoa</strong>, a community agent harness optimized around inference economics; and <a href="https://x.com/Azaliamirh/status/2064810291574305013">@Azaliamirh</a> introduced <strong>DeLM</strong>, a decentralized multi-agent framework that reportedly reaches <strong>65.7% SWE-bench Verified</strong> with Gemini 3-Flash at less than half the cost of centralized alternatives.</p></li></ul><p><strong>Optimization, retrieval, and scientific-modeling work worth tracking</strong></p><ul><li><p><strong>Distributed Shampoo vs Muon remained a live optimization thread</strong>: A technically interesting sub-thread showed tuned <strong>Meta DistributedShampoo</strong> matching strong Muon baselines on a speedrun-style task after hyperparameter tuning and enabling pseudo-inverse stabilization. <a href="https://x.com/_arohan_/status/2064631528806908134">@_arohan_</a> reported validation losses around <strong>3.2766</strong> with vanilla package + tuning, while <a href="https://x.com/kellerjordan0/status/2064761560732713360">@kellerjordan0</a> pushed back on calling it &#8220;vanilla&#8221; because the critical stabilization flag was undocumented. 
The useful signal here is not &#8220;winner declared,&#8221; but that optimizer comparisons remain highly sensitive to hidden implementation details and numerics.</p></li><li><p><strong>Late-interaction retrieval got better kernels</strong>: <a href="https://x.com/tonywu_71/status/2064701365318767100">@tonywu_71</a> released <strong>late-interaction-kernels</strong>, fused Triton kernels for MaxSim used in ColBERT/ColPali/LateOn, claiming numerical equivalence to PyTorch at a fraction of the memory footprint. This should matter for both training and serving multi-vector retrieval models.</p></li><li><p><strong>Scientific and multimodal modeling</strong>: <a href="https://x.com/giffmana/status/2064718736783823145">@giffmana</a> highlighted new work showing <strong>diffusion video models</strong> linearly encode physical information better than V-JEPA/VideoMAE on some probes, challenging a common &#8220;videogen models are dumb physics simulators&#8221; narrative. In biotech, <a href="https://x.com/edunov/status/2064774943766925696">@edunov</a> introduced <strong>DeCAF-Pearl</strong>, a flow-map cofolding model reportedly <strong>~5x faster</strong> than Pearl while maintaining quality. 
On architecture research, <a href="https://x.com/ZyphraAI/status/2064842130447851947">@ZyphraAI</a> released <strong>Zamba2-VL</strong> under Apache 2.0, extending hybrid SSM-Transformer ideas into VLMs.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Policy / governance</strong>: <a href="https://x.com/DarioAmodei/status/2064781775247950326">@DarioAmodei on &#8220;Policy on the AI Exponential&#8221;</a> was the highest-engagement technical/policy post, framing frontier AI as advancing faster than institutions can react.</p></li><li><p><strong>Security / safety failure mode</strong>: <a href="https://x.com/jsrailton/status/2064661778978533571">@jsrailton</a> drew major attention to malware authors embedding nuclear/biological text to trigger LLM refusals and evade AI malware analysis&#8212;a concrete example of attackers exploiting safety behavior.</p></li><li><p><strong>Open models</strong>: <a href="https://x.com/googlegemma/status/2064741002204545467">@googlegemma</a> and <a href="https://x.com/Google/status/2064741293163418032">@Google</a> on <strong>DiffusionGemma</strong> were the biggest pure model-release posts.</p></li><li><p><strong>Research access norms</strong>: <a href="https://x.com/drfeifei/status/2064735920281313688">@drfeifei</a> concisely stated the broad consensus from academia: scientific progress requires access to the best tools, including AI.</p></li><li><p><strong>Model capability signal</strong>: <a href="https://x.com/mchlhess/status/2064734182648221952">@mchlhess</a> saying <strong>Fable 5 &#8220;completely demolishes&#8221;</strong> his benchmark became one of the most-cited capability endorsements.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Open-Weight Model Drops: North Mini Code and DiffusionGemma</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-open-models-model-labs-vs">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Anthropic Claude Fable 5 — Mythos but Safe, with Controversial Terms]]></title><description><![CDATA[The much anticipated launch of the Mythos-class model was marred by some controversial usage policies]]></description><link>https://www.latent.space/p/ainews-anthropic-claude-fable-5-mythos</link><guid isPermaLink="false">https://www.latent.space/p/ainews-anthropic-claude-fable-5-mythos</guid><pubDate>Wed, 10 Jun 2026 03:50:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TXW4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By some measures, Opus 4.8, barely <a href="https://www.latent.space/p/ainews-anthropic-raises-965b-series">two weeks old</a>, was already the leading model in the world. But now, <a href="https://x.com/swyx/status/2064421542503797186">34 days</a> after the SpaceXai deal and <a href="https://news.ycombinator.com/item?id=47679121">63 days</a> after the original Mythos announcement*, we have a Mythos-class model (at least 2x the size of Opus) available to everyone (coinciding with <a href="https://www.youtube.com/watch?v=GiqyYQdYoIY">Claude Tokyo</a>). It is a feat of incredible engineering (and commitment to access) to make these research models GA, and the benchmarks are great&#8230; with asterisks. 
Here they are on yesterday&#8217;s brand new, out of distribution, <a href="https://www.latent.space/p/ainews-frontiercode-benchmarking">FrontierCode Diamond</a>, going from 13.4% to 29.3%:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TXW4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TXW4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 424w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 848w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 1272w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TXW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png" width="1456" height="1058" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1058,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/201398879?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TXW4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 424w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 848w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 1272w, https://substackcdn.com/image/fetch/$s_!TXW4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7af8f73c-7a20-4f7e-ac83-a05cbc892d8b_2318x1684.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption"><a href="https://x.com/swyx/status/2064414823748886591">tweet</a></figcaption></figure></div><p>The <a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">blog</a> and the <a href="https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf">system card</a> contain most of the authoritative information, but don&#8217;t miss the YouTube videos showing it playing <a href="https://www.youtube.com/watch?v=6YPqoARpYuQ">Factorio</a>, <a href="https://www.youtube.com/watch?v=Ty_50J84fMY">Pokemon</a> (unlike <a href="https://www.latent.space/p/how-claude-plays-pokemon-was-made?utm_source=publication-search">Claude Plays Pokemon</a>, this uses vision alone, without the complex harness we covered in our pod), <a href="https://www.youtube.com/watch?v=xmP7bhigCWE">EDM visualization</a> (never having heard 
music before), <a href="https://www.youtube.com/watch?v=xmP7bhigCWE">3D CAD editor creation and printing</a> and more from their <a href="https://www.youtube.com/watch?v=Y9Wz2PV404E">main intro video</a>.</p><p>API pricing is also fantastic, at roughly 2x Opus.</p><p>The asterisks come because Fable is released with two controversial changes:</p><ul><li><p><strong><a href="https://news.ycombinator.com/item?id=48463808">No ZDR</a></strong>: &#8220;We will <strong>require 30-day retention</strong> for all traffic on Mythos-class models, on both first- and third-party surfaces. We won&#8217;t use this data to train new Claude models, or for any non-safety-related purpose, and we&#8217;ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases ...&#8221; (see <a href="https://support.claude.com/en/articles/15425996-data-retention-practices-for-mythos-class-models">full policy</a>)</p></li><li><p><strong>RSI suppression</strong>: &#8220;In light of the <a href="https://www.anthropic.com/institute/recursive-self-improvement">ability of recent models to accelerate their own development</a>, we&#8217;ve implemented new interventions that limit Claude&#8217;s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.</p><p>Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, <strong>these safeguards will not be visible to the user</strong>. Fable 5 will not fall back to a different model. 
Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. <strong>We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations</strong>&#8221;.</p><p></p></li></ul><p>The vast majority of users will not be affected by these limitations, but the open-source AI community is understandably upset, as you will see below.</p><p>You can find more of their recommendations on usage in Dianne Penn&#8217;s Tokyo talk, which we have clipped below.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/latentspacepod/status/2064555427300520381?s=20&quot;,&quot;full_text&quot;:&quot;live from Tokyo: \n\nAnthropic's first talk on Fable 5\n\nfrom Dianne Penn, Anthropic's first PM (can't find her twitter) &quot;,&quot;username&quot;:&quot;latentspacepod&quot;,&quot;name&quot;:&quot;Latent.Space&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1888346877428641792/rMxtG84Z_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-10T03:49:06.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/xt43fdduvbzyei4wwm7y&quot;,&quot;link_url&quot;:&quot;https://t.co/yp2JxIbshh&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:0,&quot;like_count&quot;:1,&quot;impression_count&quot;:8,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2064554171001671680/vid/avc1/1280x720/S1Rrjt9JiGTKkHHq.mp4&quot;,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p></p><p>*(and 1 week-1 day after both <a href="https://news.ycombinator.com/item?id=48358646">Anthropic</a> and <a href="https://fortune.com/2026/06/09/openai-files-confidential-s-1-sec-ipo/">OpenAI</a> 
filed their S-1&#8217;s ahead of SpaceX&#8217;s IPO next week&#8230;)</p><p></p><blockquote><p>AI News for 6/8/2026-6/9/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Top Story: Anthropic Claude Fable 5 and Mythos 5 release</strong></p><h2><strong>What happened</strong></h2><p><strong>Anthropic released two versions of its next major model family: Claude Fable 5 for general availability and Claude Mythos 5 for restricted access.</strong></p><ul><li><p>Anthropic officially announced <strong>Claude Fable 5</strong> as its &#8220;first generally available Mythos-class model,&#8221; saying it exceeds any model it has previously made broadly available and is <strong>state-of-the-art on nearly all tested benchmarks</strong> <a href="https://x.com/claudeai/status/2064394146916229443">@claudeai</a>, <a href="https://x.com/claudeai/status/2064394151441863006">@claudeai</a></p></li><li><p>Anthropic said <strong>Fable 5 is the same underlying model as Mythos 5 with added safeguards</strong>, and that some cyber/bio/chemistry/distillation-related prompts may be <strong>routed to Claude Opus 4.8</strong> instead <a href="https://x.com/ClaudeDevs/status/2064428347678220691">@ClaudeDevs</a>, <a href="https://x.com/scaling01/status/2064398688802205900">@scaling01</a></p></li><li><p>Anthropic stated that for a &#8220;narrow range&#8221; of potentially harmful topics, <strong>queries transparently fall back to Opus 4.8</strong>, and claimed 
<strong>95%+ of sessions never see one</strong> according to early user-facing messaging <a href="https://x.com/claudeai/status/2064394155258765783">@claudeai</a>, <a href="https://x.com/mikeyk/status/2064392996288901392">@mikeyk</a></p></li><li><p>Anthropic developer messaging said fallback is available server-side and via SDK middleware in <strong>Python, TypeScript, Go, Java, and C#</strong> <a href="https://x.com/ClaudeDevs/status/2064428351029449214">@ClaudeDevs</a></p></li><li><p>Pricing for <strong>both Fable 5 and Mythos 5</strong> was reported as <strong>$10 / million input tokens and $50 / million output tokens</strong>; cache pricing was later reported by third-party evaluators as <strong>$12.50 / million cache writes and $1 / million cache reads</strong> <a href="https://x.com/scaling01/status/2064394893603049625">@scaling01</a>, <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Fable 5 kept Anthropic&#8217;s <strong>1M-token context window</strong> according to Artificial Analysis <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Anthropic put Fable 5 into <strong>Pro, Max, Team, and seat-based Enterprise plans until June 22</strong>, then said it would require <strong>usage credits</strong> due to capacity constraints, with plans to restore broader subscription access later <a href="https://x.com/ClaudeDevs/status/2064394931033248226">@ClaudeDevs</a>, <a href="https://x.com/scaling01/status/2064394893603049625">@scaling01</a>, <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a>, <a href="https://x.com/kimmonismus/status/2064388066354028986">@kimmonismus</a></p></li><li><p>Confusion over the temporary inclusion was immediate; users asked what &#8220;included until June 22&#8221; meant and Anthropic staff clarified the rollout <a href="https://x.com/dejavucoder/status/2064393509990523102">@dejavucoder</a>, <a 
href="https://x.com/TheAmolAvasare/status/2064393574431764928">@TheAmolAvasare</a></p></li><li><p>Anthropic later <strong>reset 5-hour and weekly rate limits</strong> across products after heavy demand <a href="https://x.com/ClaudeDevs/status/2064464557951852643">@ClaudeDevs</a></p></li></ul><h2><strong>Official claims and third-party benchmark data</strong></h2><p><strong>Anthropic and partner platforms reported a broad benchmark lead, especially in coding and long-horizon agentic tasks.</strong></p><ul><li><p>Anthropic&#8217;s public claim: Fable 5 is especially strong in <strong>software engineering, knowledge work, scientific research, and vision</strong>, and <strong>its lead increases with task length and complexity</strong> <a href="https://x.com/claudeai/status/2064394151441863006">@claudeai</a></p></li><li><p>Cursor said Fable 5 set a new <strong>CursorBench SOTA at 72.9%</strong>, <strong>8 points above the previous best</strong> <a href="https://x.com/cursor_ai/status/2064394824313376787">@cursor_ai</a></p></li><li><p>Cognition said Fable 5 took the <strong>#1 spot on FrontierCode</strong>, and Devin integrated it into Devin Cloud Ultra, Desktop, and CLI <a href="https://x.com/cognition/status/2064398549073453266">@cognition</a>, <a href="https://x.com/cognition/status/2064398551539761387">@cognition</a></p></li><li><p>Cline reported Fable 5 at <strong>88.0% on Terminal-Bench 2.1</strong>, beating GPT-5.5 by <strong>4.6 points</strong> <a href="https://x.com/cline/status/2064427461212045546">@cline</a></p></li><li><p>Artificial Analysis placed Fable 5 <strong>#1 on its Intelligence Index at 64.9</strong>, roughly <strong>5 points ahead of GPT-5.5</strong>, and said Anthropic occupied the top two spots <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Artificial Analysis also reported:</p><ul><li><p><strong>GDPval-AA Elo 1932</strong>, #1 on agentic real-world knowledge work <a 
href="https://x.com/ArtificialAnlys/status/2064414308289937869">@ArtificialAnlys</a></p></li><li><p><strong>53% on Humanity&#8217;s Last Exam</strong>, more than <strong>7 points ahead</strong> of the next-best model, while fallback triggered on <strong>9% of HLE tasks</strong> <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p><strong>~8% fallback routing across Intelligence Index tasks</strong>, mostly on scientific questions <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Anthropic stated fallback occurs in <strong>fewer than 5% of sessions on average</strong> <a href="https://x.com/ArtificialAnlys/status/2064414308289937869">@ArtificialAnlys</a></p></li></ul></li><li><p>Community benchmark summaries highlighted very large deltas in coding:</p><ul><li><p><strong>SWE-Bench Pro: Fable 5 80.3% vs GPT-5.5 58.6%</strong> <a href="https://x.com/Yuchenj_UW/status/2064396097075003739">@Yuchenj_UW</a></p></li><li><p><strong>FrontierCode Diamond: Mythos 5 30.9% vs second-best 13.4%</strong> <a href="https://x.com/scaling01/status/2064391295620010383">@scaling01</a></p></li><li><p><strong>Anthropic ECI 161.29 for Mythos 5</strong> <a href="https://x.com/scaling01/status/2064392088003756431">@scaling01</a></p></li></ul></li><li><p>Artificial Analysis noted that Fable 5&#8217;s knowledge benchmark jump on <strong>AA-Omniscience</strong> could imply a <strong>larger model than prior public Anthropic models</strong>, though that is inference rather than confirmed spec <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li></ul><h2><strong>Product behavior, usage profile, and deployment details</strong></h2><p><strong>The release was defined as much by workflow changes and cost profile as by raw evals.</strong></p><ul><li><p>Anthropic staff and early users repeatedly described Fable 5 as a model for <strong>very long, high-effort 
tasks</strong>, with users shifting from giving it tasks to giving it <strong>objectives/responsibilities</strong> <a href="https://x.com/felixrieseberg/status/2064392202504310900">@felixrieseberg</a>, <a href="https://x.com/ClaudeDevs/status/2064399512664526853">@ClaudeDevs</a>, <a href="https://x.com/alexalbert__/status/2064467657483829441">@alexalbert__</a></p></li><li><p>Anthropic advised users to default to <strong>xhigh/high effort</strong>, rewrite old CLAUDE.md instructions, and let the model use more judgment <a href="https://x.com/alexalbert__/status/2064467657483829441">@alexalbert__</a></p></li><li><p>Anthropic&#8217;s developer messaging emphasized <strong>multi-agent orchestration</strong>, with Fable delegating to smaller models in Claude Managed Agents <a href="https://x.com/ClaudeDevs/status/2064394928948703406">@ClaudeDevs</a></p></li><li><p>Multiple testers described Fable as <strong>slow, token-hungry, expensive</strong>, but unusually capable:</p><ul><li><p>Dan Shipper said it routinely used <strong>500k to 1M tokens on tasks</strong> and was best reserved for heavy jobs <a href="https://x.com/danshipper/status/2064393970856124501">@danshipper</a></p></li><li><p>Simon Willison called it &#8220;slow, expensive and capable&#8221; <a href="https://x.com/simonw/status/2064501565738930433">@simonw</a></p></li><li><p>Theo quickly hit limits and later welcomed Anthropic&#8217;s rate-limit reset <a href="https://x.com/theo/status/2064442054772716020">@theo</a>, <a href="https://x.com/ClaudeDevs/status/2064464557951852643">@ClaudeDevs</a></p></li></ul></li><li><p>Third-party and internal anecdotes emphasized large gains on long-running engineering tasks:</p><ul><li><p>Ethan Mollick said he could hand it a <strong>15-page design document</strong> and it would work for <strong>9+ hours</strong> <a href="https://x.com/emollick/status/2064395281903346013">@emollick</a></p></li><li><p>Kimmonismus highlighted Anthropic&#8217;s claim that Stripe used Fable to 
do a <strong>50-million-line Ruby migration in a day</strong>, replacing what would have taken <strong>a whole team over two months</strong> <a href="https://x.com/kimmonismus/status/2064401121515274747">@kimmonismus</a></p></li><li><p>Victor Taelin reported Fable finding a subtle bug and producing claimed speedups up to <strong>1770% in one case</strong>, though he still needed to audit correctness <a href="https://x.com/VictorTaelin/status/2064448425936994742">@VictorTaelin</a></p></li><li><p>Anthropic-associated posts cited <strong>430x kernel speedups</strong>, <strong>69x self-training speedups</strong>, and <strong>10x drug-design acceleration</strong>, though these came from benchmark/system-card interpretations and should be treated as vendor-side claims unless independently replicated <a href="https://x.com/scaling01/status/2064392386520780945">@scaling01</a>, <a href="https://x.com/scaling01/status/2064392809293939119">@scaling01</a>, <a href="https://x.com/scaling01/status/2064394250142265367">@scaling01</a></p></li></ul></li><li><p>Ecosystem rollout was immediate: Fable 5 appeared in <strong>Cursor, Devin, Notion, Microsoft Foundry, GitHub Copilot App/CLI, Cline, Replit, Base44, MagicPath, Arena, MCP Atlas</strong> and more <a href="https://x.com/cursor_ai/status/2064394824313376787">@cursor_ai</a>, <a href="https://x.com/cognition/status/2064398549073453266">@cognition</a>, <a href="https://x.com/NotionHQ/status/2064397568696819984">@NotionHQ</a>, <a href="https://x.com/Azure/status/2064421301108834552">@Azure</a>, <a href="https://x.com/pierceboggan/status/2064402677614911818">@pierceboggan</a>, <a href="https://x.com/cline/status/2064427461212045546">@cline</a>, <a href="https://x.com/pirroh/status/2064408022651191613">@pirroh</a>, <a href="https://x.com/ScaleAILabs/status/2064473993919537578">@ScaleAILabs</a></p></li></ul><h2><strong>Safety architecture and the main controversy</strong></h2><p><strong>The biggest debate was not whether Fable/Mythos 
is strong; it was Anthropic&#8217;s decision to silently reduce usefulness on some frontier-AI-development tasks.</strong></p><ul><li><p>Anthropic&#8217;s system-card language, surfaced by multiple users, said: when Fable 5 is used for <strong>frontier LLM development</strong>, Anthropic may <strong>limit the model&#8217;s effectiveness</strong> via <strong>prompt modification, steering vectors, and PEFT</strong>, and that the user is <strong>not notified</strong>; Anthropic estimated this would affect roughly <strong>0.03% of traffic</strong> <a href="https://x.com/Hangsiin/status/2064397550434816088">@Hangsiin</a>, <a href="https://x.com/kimmonismus/status/2064417460715962479">@kimmonismus</a></p></li><li><p>Anthropic also separately disclosed auto-rerouting for <strong>cybersecurity and biosecurity</strong> requests to Opus 4.8 <a href="https://x.com/ClaudeDevs/status/2064394931033248226">@ClaudeDevs</a></p></li><li><p>This distinction mattered: <strong>some risky queries are visibly rerouted/billed as Opus</strong>, while <strong>frontier-LLM-development requests may be silently weakened rather than rerouted or refused</strong></p></li><li><p>Critics argued that this creates an <strong>unlogged confounder</strong> in research and engineering workflows:</p><ul><li><p>&#8220;silent handicaps should not be a thing in a paid product&#8221; <a href="https://x.com/nrehiew_/status/2064400440264179923">@nrehiew_</a></p></li><li><p>&#8220;degrading performance on ML research without telling the user is shockingly hostile&#8221; <a href="https://x.com/deanwball/status/2064434861088395730">@deanwball</a></p></li></ul></li><li><p>Several researchers framed it as <strong>anti-competitive ladder-pulling</strong> against open research and open weights:</p><ul><li><p>&#8220;labs starting to pull up the ladders&#8221; <a href="https://x.com/natolambert/status/2064404993193754830">@natolambert</a></p></li><li><p>&#8220;this is the biggest wake-up call to protect and nourish open 
source AI&#8221; <a href="https://x.com/rasdani_/status/2064409800641859747">@rasdani_</a></p></li><li><p>&#8220;They didn&#8217;t mean pause AI research, they meant pause <em>your</em> AI research&#8221; <a href="https://x.com/bayeslord/status/2064437399292203401">@bayeslord</a></p></li><li><p>&#8220;original thinkers can&#8217;t be an underclass&#8221; <a href="https://x.com/marksaroufim/status/2064428421774753943">@marksaroufim</a></p></li><li><p>&#8220;concentration of power, capabilities and economic wealth is the biggest risk in AI&#8221; <a href="https://x.com/ClementDelangue/status/2064513229099876663">@ClementDelangue</a></p></li></ul></li><li><p>Multiple users worried the classifier boundary was too broad or too error-prone:</p><ul><li><p>one user said &#8220;the word cancer is flagged as a biosecurity risk&#8221; <a href="https://x.com/DeryaTR_/status/2064414826122866707">@DeryaTR_</a></p></li><li><p>another said Fable wouldn&#8217;t answer &#8220;What does the heart do?&#8221; <a href="https://x.com/Yuchenj_UW/status/2064524668208545955">@Yuchenj_UW</a></p></li><li><p>users in biology reported account-context differences, including being able to use Fable in <strong>Incognito Mode but not normal mode</strong> <a href="https://x.com/cremieuxrecueil/status/2064449457869984035">@cremieuxrecueil</a></p></li><li><p>Teknium and others reported refusal on simple engineering prompts <a href="https://x.com/Teknium/status/2064462936677203983">@Teknium</a>, <a href="https://x.com/Teknium/status/2064466293185806658">@Teknium</a></p></li><li><p>users reported PTX ISA questions and inference optimization queries getting flagged <a href="https://x.com/snowclipsed/status/2064408466039390417">@snowclipsed</a>, <a href="https://x.com/dejavucoder/status/2064420742129967331">@dejavucoder</a></p></li></ul></li><li><p>Some examples were humorous but pointed: users joked that asking for inference code caused the model to &#8220;start importing ONNX&#8221; or implementing 
JEPA, as a sign of capability steering <a href="https://x.com/vikhyatk/status/2064515989795127744">@vikhyatk</a>, <a href="https://x.com/MattVMacfarlane/status/2064440740483403829">@MattVMacfarlane</a></p></li></ul><h2><strong>Facts vs. opinions</strong></h2><p><strong>Facts / directly supported by release materials or benchmark posts</strong></p><ul><li><p>Fable 5 is generally available; Mythos 5 is restricted-access <a href="https://x.com/claudeai/status/2064394146916229443">@claudeai</a>, <a href="https://x.com/TheRundownAI/status/2064394481923699070">@TheRundownAI</a></p></li><li><p>Fable 5 and Mythos 5 share the same underlying model with additional safeguards on Fable <a href="https://x.com/ClaudeDevs/status/2064428347678220691">@ClaudeDevs</a>, <a href="https://x.com/scaling01/status/2064398688802205900">@scaling01</a></p></li><li><p>Pricing is <strong>$10 / $50 per million input/output tokens</strong> <a href="https://x.com/scaling01/status/2064394893603049625">@scaling01</a>, <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Fable retains <strong>1M context</strong> <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li><li><p>Anthropic introduced refusal/fallback mechanisms and SDK middleware <a href="https://x.com/ClaudeDevs/status/2064428351029449214">@ClaudeDevs</a></p></li><li><p>Anthropic disclosed <strong>silent interventions for frontier LLM development</strong> affecting about <strong>0.03% of traffic</strong> <a href="https://x.com/Hangsiin/status/2064397550434816088">@Hangsiin</a></p></li><li><p>Fable is temporarily included in subscriptions until <strong>June 22</strong>, then credit-based <a href="https://x.com/ArtificialAnlys/status/2064500150069030992">@ArtificialAnlys</a></p></li></ul><p><strong>Opinions / interpretations</strong></p><ul><li><p>&#8220;Anthropic won,&#8221; &#8220;Anthropic has a coding moat,&#8221; &#8220;Anthropic going for 
ASI&#8221; are commentary rather than verified fact <a href="https://x.com/scaling01/status/2064401880323653799">@scaling01</a>, <a href="https://x.com/scaling01/status/2064399642603802676">@scaling01</a>, <a href="https://x.com/scaling01/status/2064410532824662047">@scaling01</a></p></li><li><p>Claims that the move is primarily for <strong>IPO optics</strong>, <strong>anti-open-source positioning</strong>, or specifically to slow <strong>Meta/China/open labs</strong> are plausible interpretations but not confirmed by Anthropic <a href="https://x.com/kimmonismus/status/2064448699632402664">@kimmonismus</a>, <a href="https://x.com/kylebrussell/status/2064502244041511348">@kylebrussell</a>, <a href="https://x.com/natolambert/status/2064412173527556298">@natolambert</a></p></li><li><p>Claims that Anthropic is acting from sincere safety beliefs rather than cynical moat-building are also interpretive <a href="https://x.com/finbarrtimbers/status/2064427031543341450">@finbarrtimbers</a></p></li><li><p>Subjective reports like &#8220;GPT-4 moment,&#8221; &#8220;big model smell,&#8221; &#8220;strictly dominates me as an engineer,&#8221; or &#8220;doesn&#8217;t seem much better to normal users&#8221; are experiential, not standardized evidence <a href="https://x.com/karinanguyen/status/2064406015760601379">@karinanguyen</a>, <a href="https://x.com/bcherny/status/2064431111154053187">@bcherny</a>, <a href="https://x.com/akbirkhan/status/2064418425552928812">@akbirkhan</a>, <a href="https://x.com/citrini/status/2064480613852201336">@citrini</a></p></li></ul><h2><strong>Different perspectives</strong></h2><p><strong>Supportive / capability-first</strong></p><ul><li><p>Anthropic staff and close testers described Fable 5 as a <strong>step-function improvement</strong>:</p><ul><li><p>Felix Rieseberg: shift from giving AI tasks to giving it responsibilities <a href="https://x.com/felixrieseberg/status/2064392202504310900">@felixrieseberg</a></p></li><li><p>Alex Albert: model feels 
collaborative rather than tool-like <a href="https://x.com/alexalbert__/status/2064394410004304003">@alexalbert__</a></p></li><li><p>Karpathy: a &#8220;major-version-bump-deserving step change,&#8221; especially on long difficult tasks, though safeguards are &#8220;a little too trigger happy for launch&#8221; <a href="https://x.com/karpathy/status/2064409694761054332">@karpathy</a></p></li><li><p>Bcherny: biggest step since Opus 4.5; the model shows judgment, taste, methodical debugging <a href="https://x.com/bcherny/status/2064431111154053187">@bcherny</a></p></li></ul></li><li><p>Third-party infra and app vendors emphasized benchmark wins and integration value rather than the safety controversy <a href="https://x.com/cursor_ai/status/2064394824313376787">@cursor_ai</a>, <a href="https://x.com/cognition/status/2064398549073453266">@cognition</a>, <a href="https://x.com/NotionHQ/status/2064397568696819984">@NotionHQ</a>, <a href="https://x.com/Azure/status/2064421301108834552">@Azure</a></p></li></ul><p><strong>Critical / trust and openness</strong></p><ul><li><p>Many researchers and open-model advocates argued the silent throttling is unacceptable even if safety-motivated:</p><ul><li><p>Natolambert called doing it without telling users &#8220;misaligned&#8221; <a href="https://x.com/natolambert/status/2064404993193754830">@natolambert</a></p></li><li><p>Dean Ball warned it could attract <strong>antitrust</strong> scrutiny <a href="https://x.com/deanwball/status/2064434861088395730">@deanwball</a></p></li><li><p>Jeremy Howard called it &#8220;a very dark and very sad day&#8221; <a href="https://x.com/jeremyphoward/status/2064481719626154417">@jeremyphoward</a></p></li><li><p>Gneubig warned of a future where AI is provided only to a privileged few <a href="https://x.com/gneubig/status/2064451352000975124">@gneubig</a></p></li><li><p>Eric Zelikman framed it as silently sabotaging customers <a 
href="https://x.com/ericzelikman/status/2064442174373314701">@ericzelikman</a></p></li></ul></li><li><p>Open-source supporters used the launch as an argument for <strong>sovereign/open models</strong> <a href="https://x.com/nickfrosst/status/2064396337404096809">@nickfrosst</a>, <a href="https://x.com/NoahZiems/status/2064464265189482570">@NoahZiems</a>, <a href="https://x.com/ClementDelangue/status/2064513229099876663">@ClementDelangue</a></p></li></ul><p><strong>Neutral / mixed</strong></p><ul><li><p>Some observers argued Anthropic probably <strong>sincerely believes</strong> these interventions are necessary for safety, even if the product design is poor <a href="https://x.com/finbarrtimbers/status/2064427031543341450">@finbarrtimbers</a></p></li><li><p>Others said Anthropic does <strong>not owe</strong> anyone unrestricted frontier capability, but still saw this as straightforward business and market segmentation rather than altruism <a href="https://x.com/suchenzang/status/2064452548753559644">@suchenzang</a></p></li><li><p>Karpathy&#8217;s view is mixed: model quality is exceptional, but launch safeguards are over-sensitive and should likely be tuned <a href="https://x.com/karpathy/status/2064409694761054332">@karpathy</a></p></li></ul><h2><strong>Research restrictions, privacy, and enterprise implications</strong></h2><p><strong>The discussion expanded from safety to broader questions of trust, privacy, and enterprise reliability.</strong></p><ul><li><p>The central enterprise issue was <strong>predictability</strong>: if a provider can silently degrade outputs based on inferred task category, users may no longer know whether failures come from the model, the prompt, or hidden intervention <a href="https://x.com/MattGibsonMusic/status/2064518301888512486">@MattGibsonMusic</a>, <a href="https://x.com/code_star/status/2064464447662707180">@code_star</a></p></li><li><p>Some users worried this is effectively a <strong>supply-chain risk</strong> for important 
workflows, pushing companies toward open weights or in-house models <a href="https://x.com/NoahZiems/status/2064464265189482570">@NoahZiems</a>, <a href="https://x.com/deliprao/status/2064485687374569897">@deliprao</a></p></li><li><p>There was also concern that account-level context or prior usage history might affect trigger behavior, as seen in biologists&#8217; reports about normal vs incognito mode <a href="https://x.com/cremieuxrecueil/status/2064449457869984035">@cremieuxrecueil</a></p></li><li><p>No tweet in the supplied set provided direct evidence that Anthropic was <strong>training on user data</strong> or violating stated data privacy terms; the privacy debate here was mostly about <strong>behavioral profiling / silent policy enforcement</strong> rather than classic training-data privacy</p></li><li><p>For research users, the hidden intervention was framed as especially damaging because it undermines <strong>reproducibility and scientific attribution</strong> <a href="https://x.com/deanwball/status/2064434861088395730">@deanwball</a>, <a href="https://x.com/MattGibsonMusic/status/2064518301888512486">@MattGibsonMusic</a></p></li><li><p>For enterprise buyers, the issue is not just whether the model is powerful, but whether it is a <strong>stable and auditable dependency</strong> for coding, medicine, science, finance, and infrastructure</p></li></ul><h2><strong>Context</strong></h2><p><strong>This launch matters because it combines a visible capability jump with a visible shift in access control.</strong></p><ul><li><p>The release landed amid intense competition with GPT-5.5, upcoming GPT-5.6, and Gemini 3.5 Pro; several posters argued Anthropic has opened a temporary lead in coding/agentic work <a href="https://x.com/kimmonismus/status/2064467466450088078">@kimmonismus</a>, <a href="https://x.com/teortaxesTex/status/2064473970892587105">@teortaxesTex</a></p></li><li><p>It also lands in a broader argument about the <strong>open vs closed model 
gap</strong>; one linked Epoch-style framing said open-weight models lag closed frontier models by about <strong>4 months on average</strong> <a href="https://x.com/dl_weekly/status/2064422551762153946">@dl_weekly</a></p></li><li><p>Community reaction suggests the launch may be remembered not only for &#8220;big model smell&#8221; and benchmark jumps, but for normalizing <strong>selective capability release</strong>: public access to the frontier model, but with <strong>domain-specific hidden limits</strong></p></li><li><p>That policy line is likely to influence future debates around:</p><ul><li><p><strong>safety vs openness</strong></p></li><li><p><strong>fair access to frontier research tools</strong></p></li><li><p><strong>antitrust and platform power</strong></p></li><li><p><strong>enterprise trust in API providers</strong></p></li><li><p><strong>whether open models become the default for sensitive technical work even when they trail on raw capability</strong></p></li></ul></li></ul><p><strong>Models, benchmarks, and evals</strong></p><ul><li><p>New benchmark project <strong>Agents&#8217; Last Exam (ALE)</strong> launched to test labor-market-aligned agent performance; top agents score only <strong>2.6% on the hardest tier</strong>, across <strong>1,500+ tasks</strong>, <strong>55 occupations</strong>, with contributions from <strong>300+ experts across 100+ institutions</strong> <a href="https://x.com/YiyouSun/status/2064392466011394213">@YiyouSun</a>, <a href="https://x.com/SnorkelAI/status/2064396025410760950">@SnorkelAI</a>, <a href="https://x.com/dawnsongtweets/status/2064452279973863848">@dawnsongtweets</a></p></li><li><p>Cohere released <strong>North Mini Code</strong>, its first open-source coding model: <strong>30B total / 3B active MoE</strong>, <strong>256K context</strong>, <strong>64K max generation</strong>, Apache 2.0, optimized for agentic workflows <a href="https://x.com/cohere/status/2064378058329526556">@cohere</a>, <a 
href="https://x.com/JayAlammar/status/2064385607455908254">@JayAlammar</a>, <a href="https://x.com/vllm_project/status/2064416312605237434">@vllm_project</a></p></li><li><p>Google announced <strong>Gemini 3.5 Flash Live Translate</strong>, real-time speech-to-speech translation in <strong>70+ languages</strong>, available in Gemini API, AI Studio, Google Translate, and coming to Meet <a href="https://x.com/OfficialLoganK/status/2064369125447864674">@OfficialLoganK</a></p></li><li><p>New benchmark <strong>iOSWorld</strong> evaluates personally intelligent phone agents across <strong>26 custom iOS apps</strong> and <strong>133 tasks</strong>; strongest frontier model reaches only <strong>52% success even with privileged access</strong> <a href="https://x.com/rsalakhu/status/2064402156740907444">@rsalakhu</a></p></li></ul><p><strong>Inference, training, and systems</strong></p><ul><li><p><strong>Latent Context Language Models (LCLMs)</strong> were introduced as a long-context inference method compressing context up to <strong>16&#215;</strong>, improving the latency/accuracy frontier over KV-cache compression <a href="https://x.com/micahgoldblum/status/2064361011994337772">@micahgoldblum</a>, <a href="https://x.com/iamleonli/status/2064374393057300846">@iamleonli</a></p></li><li><p>Microsoft Research&#8217;s <strong>Mirage</strong> stores 3D scenes as latent tokens, reporting <strong>10.57&#215; faster</strong> video generation and <strong>55&#215; lower memory use</strong> <a href="https://x.com/HuggingPapers/status/2064393076416688416">@HuggingPapers</a></p></li><li><p>vLLM introduced <strong>vime</strong>, an RL post-training framework in the vLLM ecosystem, positioned alongside NeMo-RL, OpenRLHF, and verl <a href="https://x.com/vllm_project/status/2064397637634376174">@vllm_project</a></p></li><li><p>Discussion around agent training continued with <strong>Self-Harness</strong> for self-improving scaffolds <a 
href="https://x.com/omarsar0/status/2064429834999304247">@omarsar0</a> and <strong>AutoForge/interleaved thinking</strong> retaining reasoning traces across turns <a href="https://x.com/cwolferesearch/status/2064505867181949182">@cwolferesearch</a></p></li><li><p>Google/Hugging Face launched the <strong>Fast Gemma Challenge</strong> to speed up <strong>Gemma 4 E4B</strong> on a single <strong>A10G</strong> without wrecking quality <a href="https://x.com/googlegemma/status/2064374874962117084">@googlegemma</a>, <a href="https://x.com/osanseviero/status/2064375902046245219">@osanseviero</a>, <a href="https://x.com/_lewtun/status/2064386398090576236">@_lewtun</a></p></li></ul><p><strong>Agents, tooling, and developer workflow</strong></p><ul><li><p>LangChain highlighted a pattern of <strong>agent loops</strong> driven by recurring triggers in Fleet <a href="https://x.com/caspar_br/status/2064363014997021126">@caspar_br</a></p></li><li><p>OpenAI added <strong>image results</strong> to web search in the Responses API <a href="https://x.com/OpenAIDevs/status/2064395155688616153">@OpenAIDevs</a></p></li><li><p>GitHub/Copilot app updates included <strong>parallel sub-sessions</strong> and a <strong>canvas</strong> UI for dynamic interfaces <a href="https://x.com/tgrall/status/2064334802799509745">@tgrall</a>, <a href="https://x.com/burkeholland/status/2064446521035067615">@burkeholland</a></p></li><li><p>Hermes Desktop added <strong>Ollama</strong> support, with self-learning Python skills and messaging app integrations <a href="https://x.com/ollama/status/2064441778590339402">@ollama</a>, <a href="https://x.com/NousResearch/status/2064468385748951415">@NousResearch</a></p></li><li><p>A security-oriented counterpoint on agent execution: <strong>Temenos</strong> argues for sandboxing generated code, not the agent, using <strong>rootless gVisor</strong> while keeping auth/tools on host <a 
href="https://x.com/abhijithneil/status/2064462294155952297">@abhijithneil</a></p></li></ul><p><strong>Research, science, and formal methods</strong></p><ul><li><p>Axiom announced <strong>EconLib</strong>, a Lean-based economics library; formalizing Aumann&#8217;s &#8220;agreeing to disagree&#8221; theorem surfaced a hidden countability-related assumption <a href="https://x.com/TheTuringPost/status/2064391882017579520">@TheTuringPost</a></p></li><li><p>&#8220;Economy of Minds&#8221; proposed agent coordination through auctions and incentives rather than centralized orchestration, reporting gains such as <strong>15.9% &#8594; 57.0%</strong> on math reasoning and <strong>45.0% &#8594; 60.0%</strong> on financial research <a href="https://x.com/TheTuringPost/status/2064406931184443618">@TheTuringPost</a></p></li><li><p>Mayo Clinic&#8217;s <strong>REDMOD</strong> reportedly detected pancreatic cancer on CT scans up to <strong>3 years before diagnosis</strong>, identifying <strong>73%</strong> of hidden cancers at a median <strong>475 days</strong> before diagnosis <a href="https://x.com/TheRundownAI/status/2064416920191869191">@TheRundownAI</a></p></li></ul><p><strong>Open ecosystem and infrastructure</strong></p><ul><li><p>Hugging Face and Arcee announced a partnership replacing AWS S3 with HF for all Arcee models/datasets, including private ones <a href="https://x.com/ClementDelangue/status/2064323874049679643">@ClementDelangue</a>, <a href="https://x.com/MarkMcQuade/status/2064385389801124218">@MarkMcQuade</a></p></li><li><p>Cohere kept pushing the sovereign/open angle with &#8220;<strong>Sovereign AI for all</strong>&#8221; <a href="https://x.com/cohere/status/2064414912768618898">@cohere</a></p></li><li><p>Mark Saroufim proposed a <strong>Researcher Reciprocity License</strong> and moved GPU MODE datasets to it, explicitly reacting to the sense that frontier labs benefit from open research while restricting access in return <a 
href="https://x.com/marksaroufim/status/2064428421774753943">@marksaroufim</a>, <a href="https://x.com/marksaroufim/status/2064442386374369597">@marksaroufim</a></p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Open Model Inference and Chat Template Updates</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u0buhm/xiaomi_just_claimed_1000_tps_on_a_1t_model_using/">Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server</a></strong> (Activity: 1027): <strong>Xiaomi MiMo claims </strong><code>MiMo-V2.5-Pro-UltraSpeed</code><strong> reaches </strong><code>1000+ tokens/s</code><strong> decoding on a </strong><code>1T</code><strong>-parameter MoE using a single &#8220;standard&#8221; </strong><code>8-GPU</code><strong> server, via TileRT model-system co-design rather than Cerebras/Groq-style specialized hardware. The reported stack combines MoE-expert-only FP4/MXFP4 quantization with QAT while keeping non-expert modules at higher precision, plus DFlash block-level masked speculative decoding with acceptance lengths of </strong><code>6.30</code><strong> coding, </strong><code>5.56</code><strong> math/reasoning, and </strong><code>4.29</code><strong> agent tasks, and persistent low-latency kernels to reduce launch/sync overhead. A key unresolved technical caveat from comments is that Xiaomi does not specify </strong><em><strong>which</strong></em><strong> 8 GPUs were used, making reproducibility and cost/performance comparisons ambiguous.</strong> Commenters debated the economics of &#8220;Token Winter,&#8221; arguing the bottleneck is less model demand than overpriced/hoarded Western GPU supply, while Chinese compressed sparse architecture/MoE work from <strong>DeepSeek, Xiaomi, and MiniMax</strong> is becoming more inference-efficient. 
Others highlighted Xiaomi&#8217;s selective FP4 strategy as the most important detail because na&#239;ve full-model FP4 degrades reasoning, code, and logic.</p><ul><li><p>A key technical detail highlighted is that Xiaomi did <strong>selective FP4 quantization</strong> rather than applying FP4 uniformly: only the <strong>MoE Experts</strong> in <strong>MiMo-V2.5-Pro</strong> are quantized to FP4, while non-expert modules retain original precision to avoid degradation in reasoning, logic, and code generation. The comment notes Xiaomi used <strong>FP4 QAT</strong> to reduce model size and improve bandwidth utilization while keeping capability near the original model.</p></li><li><p>The released model weights are available on Hugging Face as <strong>XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash</strong>: <a href="https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash">https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash</a>. This is relevant because it allows independent inspection or benchmarking of the claimed <code>1,000+ tps</code> throughput on an 8-GPU server.</p></li><li><p>Several commenters questioned the hardware and parameter accounting behind the claim: <em>&#8220;8 GPU server&#8230; which 8 exactly?&#8221;</em> and <em>&#8220;1T-A1B?&#8221;</em> The technical concern is that throughput is not interpretable without knowing the exact GPU class, interconnect, serving stack, batch size, context length, and whether the <code>1T</code> MoE model activates only around <code>1B</code> parameters per token.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1u084qi/gemma_4_chat_template_now_has_preserve_thinking/">Gemma 4 Chat Template now has preserve thinking</a></strong> (Activity: 482): <strong>Google&#8217;s Gemma Team has added </strong><code>preserve_thinking</code><strong> support to the official Gemma 4 chat template, matching an aftermarket template modification some users were already applying successfully. 
The change is framed as enabling better retention/use of model &#8220;thinking&#8221; traces in Gemma 4 chat formatting, though no benchmark numbers or implementation diff were provided in the thread.</strong> Commenters generally welcomed the official adoption and argued it validates prior community template hacks. Several users speculated that a larger <strong>Gemma 4 </strong><code>124B</code><strong> MoE</strong> release would be needed to fully exploit the updated template for stronger agentic coding use cases.</p><ul><li><p>Commenters note that <strong>Gemma 4&#8217;s official chat template appears to be adding </strong><code>preserve_thinking</code>, a behavior some users had already enabled via aftermarket/custom template modifications and found effective. The main claimed technical benefit is improved continuity for <strong>agentic coding workflows</strong>, where retaining prior reasoning/thinking traces can help multi-step tool use and code iteration.</p></li><li><p>One commenter cautions that the change may not be live yet: the <code>preserve_thinking</code> support is described as an <strong>open PR that has not been merged</strong>, while the model files reportedly show no update for <code>21 days</code>. This suggests users should verify the tokenizer/chat-template files in the actual model repository before assuming the new behavior is available in released artifacts.</p></li><li><p>Several comments frame the template change as increasing demand for a larger <strong>Gemma 4 </strong><code>124B</code><strong> MoE</strong> variant, arguing that <code>preserve_thinking</code> would be more valuable when paired with a higher-capacity model for coding-agent use cases. 
The discussion is speculative, but technically centered on scaling the model size/MoE architecture to better exploit the updated chat-template behavior.</p></li></ul></li></ul><h2><strong>Less Technical AI Subreddit Recap</strong></h2><blockquote><p>/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo</p></blockquote><h3><strong>1. Claude Fable 5/Mythos 5 Release and Access Tiers</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1u1b207/introducing_claude_fable_5/">Introducing Claude Fable 5</a></strong> (Activity: 2698): <strong>The <a href="https://i.redd.it/tb8akxef4a6h1.png">image</a> is a benchmark comparison table for the post&#8217;s claimed Claude Fable 5 / Claude Mythos 5 release, showing the highlighted model leading or near-leading across agentic coding, knowledge work, spatial reasoning, tool use, legal, biology, cybersecurity, and health benchmarks versus Claude Mythos Preview, Claude Opus 4.8, GPT 5.5, and Gemini 3.1 Pro. The selftext frames Fable 5 and Mythos 5 as the same underlying &#8220;Mythos-class&#8221; model, with Fable 5 using safety fallbacks: cybersecurity, biology/chemistry, and distillation-related requests are routed to Claude Opus 4.8, reportedly affecting under </strong><code>5%</code><strong> of sessions.</strong> Comments are mostly hype or skepticism rather than technical analysis, including jokes like &#8220;AGI confirmed&#8221; and a complaint asking whether &#8220;Fable [is] getting dumber recently.&#8221;</p><ul><li><p>One commenter noted an apparent access/pricing constraint: <strong>Claude Fable 5 is free only until </strong><code>June 22</code>, after which users will reportedly need to purchase credits to continue using it. 
This is relevant for anyone evaluating the model because benchmark or workflow testing may need to be completed before the credit-gated period begins.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1u1fsdi/claude_fable_5_feels_less_like_a_model_launch_and/">Claude Fable 5 feels less like a model launch and more like a preview of AI inequality</a></strong> (Activity: 2387): <strong>The post argues that Anthropic&#8217;s alleged Claude Fable 5 rollout represents a shift from a uniform public frontier-model release to a tiered access architecture: public paid users receive Fable 5 with safety routing that may downgrade requests involving </strong><code>cyber</code><strong>, </strong><code>bio</code><strong>, </strong><code>chemistry</code><strong>, or </strong><code>distillation</code><strong> to Opus 4.8, while selected partners purportedly get Mythos 5, described as the same underlying model with fewer safeguards. It also highlights pricing/capacity constraints: Fable 5 is said to be included in paid plans only until </strong><code>June 22</code><strong>, then potentially moved to usage credits, implying frontier-agent inference remains too expensive for flat-rate consumer subscriptions.</strong> Comments split between concern over AI access inequality and acceptance of restrictive safety policies as necessary for high-risk capabilities. One commenter frames the outcome as predictable token-economics pressure toward expensive enterprise-grade models, while another defends a <em>&#8220;rather safe than sorry&#8221;</em> approach despite user friction.</p><ul><li><p>Several commenters framed the launch as an expected economics shift: as frontier models grow in capability and complexity, <strong>inference/token costs rise enough that top-tier models become enterprise-only tools</strong> rather than default consumer products. 
One commenter argued this will push everyday workloads toward cheaper local inference on hardware like <strong>Apple M-series chips</strong> or <strong>RTX Spark-class accelerators</strong>, reserving frontier APIs for high-value tasks.</p></li><li><p>A pricing-focused thread claimed that the new model&#8217;s API economics make consumer subscriptions structurally mismatched with frontier usage: <em>&#8220;Our </em><code>$200</code><em> monthly sub is like </em><code>3</code><em> API prompts with the new model.&#8221;</em> The implied technical point is that even high-end consumer plans may be viable only through heavy rate limits, model routing, or fallback to cheaper models such as <strong>Opus 4.8</strong>, which one commenter described as sufficient for &#8220;<code>99%</code>&#8221; of users.</p></li></ul></li></ul>
      <p>
          <a href="https://www.latent.space/p/ainews-anthropic-claude-fable-5-mythos">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] FrontierCode: Benchmarking for Code Quality over Slop]]></title><description><![CDATA[We made a thing!]]></description><link>https://www.latent.space/p/ainews-frontiercode-benchmarking</link><guid isPermaLink="false">https://www.latent.space/p/ainews-frontiercode-benchmarking</guid><pubDate>Tue, 09 Jun 2026 06:12:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3zh0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fpbs.substack.com%2Fmedia%2FHKT9bbsagAAipOJ.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Second batch of AI Leadership and Engineering+Workshops tickets for <a href="https://www.ai.engineer/worldsfair/2026">AI Engineer World&#8217;s Fair</a> sold out last night! Last 500 tickets on sale now - get while stocks last! <a href="https://app.ai.engineer/e/ai-engineer-worlds-fair-2026?discount=LATENT-26-POD">20% off for the first 20 readers</a> who see this.</em></p><div><hr></div><p>It is rare that we are personally involved in the title story of the day, and <a href="https://www.youtube.com/watch?v=2TEeQjoY05c">Apple&#8217;s WWDC announcing Gemini-powered Siri</a> was a possible candidate, but <a href="https://news.smol.ai/issues?pattern=apple">we&#8217;ve been fooled before</a>. So instead, we&#8217;ve got <a href="https://x.com/cognition/status/2064061031912288715">FrontierCode</a>, the latest in our <a href="https://www.latent.space/p/2026">War on Slop</a>!</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/cognition/status/2064061031912288715&quot;,&quot;full_text&quot;:&quot;Introducing FrontierCode: a coding eval that raises the bar for difficulty &amp;amp; quality. Each task took 40+ hrs of work by leading open-source maintainers.\n\nModels write sloppy code that works but isn&#8217;t maintainable. Our eval is first to measure: would you actually merge this code? 
&quot;,&quot;username&quot;:&quot;cognition&quot;,&quot;name&quot;:&quot;Cognition&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1765909640364068865/MvH-m0gd_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-08T19:04:33.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HKT9bbsagAAipOJ.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/e1GD53x3T4&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:131,&quot;retweet_count&quot;:189,&quot;like_count&quot;:2160,&quot;impression_count&quot;:469850,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>If that chart looks familiar, it&#8217;s because FrontierCode was explicitly inspired by and named after FrontierMath, which two years ago focused its hardest tier on problems that were extremely hard for frontier models:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/EpochAIResearch/status/1854993684502282537&quot;,&quot;full_text&quot;:&quot;3/10 We evaluated six leading models, including Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. Even with extended thinking time (10,000 tokens), Python access, and the ability to run experiments, success rates remained below 2%&#8212;compared to over 90% on traditional benchmarks. 
&quot;,&quot;username&quot;:&quot;EpochAIResearch&quot;,&quot;name&quot;:&quot;Epoch AI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1866142753127616512/DYcE9bN1_normal.jpg&quot;,&quot;date&quot;:&quot;2024-11-08T21:05:33.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/Gb4xR1VbkAA4zg8.png&quot;,&quot;link_url&quot;:&quot;https://t.co/mijruaZY2T&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:12,&quot;retweet_count&quot;:52,&quot;like_count&quot;:557,&quot;impression_count&quot;:422964,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>The context of FrontierCode revolves around past work we have done around <a href="https://www.latent.space/p/swe-bench-dead">SWEBench-Verified</a>. </p><ul><li><p>It is clear that even with the switch to SWEBench Pro, there has been insufficient articulation around <a href="https://www.latent.space/p/wtf2025">WTF Happened in 2025</a>. As discussed with the OpenAI team in that podcast, there needed to be a lot more work around the rubrics for code quality and maintainability, and that is exactly what the Cog research team ended up building in this first release of FrontierCode.  
</p></li><li><p>Separately, METR found that <a href="https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/#introduction">Many SWE-bench-Passing PRs Would Not Be Merged into Main</a>, and the problem of false positive trajectories (not quite &#8220;reward hacks&#8221;, but spiritually similar in terms of the unreliability of the benchmark rather than the model) was directly measured and addressed in the FrontierCode report.</p></li></ul><p>With hindsight, FrontierCode&#8217;s third tier of problems shows the huge acceleration going into Dec 2025 that suddenly <a href="https://x.com/swyx/status/2064081945567580323">made it possible for agentic engineering and vibe coding to go up one level of abstraction</a>, to the /goals and loops and metaprompts we are discussing today.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sdBk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sdBk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 424w, https://substackcdn.com/image/fetch/$s_!sdBk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 848w, https://substackcdn.com/image/fetch/$s_!sdBk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 1272w, 
https://substackcdn.com/image/fetch/$s_!sdBk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sdBk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png" width="1456" height="1076" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1076,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:452830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/201254482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sdBk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 424w, https://substackcdn.com/image/fetch/$s_!sdBk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 848w, 
https://substackcdn.com/image/fetch/$s_!sdBk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 1272w, https://substackcdn.com/image/fetch/$s_!sdBk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0acd2026-8f85-4504-a5f3-6a0cd82d0b6a_2170x1604.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><a href="https://x.com/swyx/status/2064081945567580323">more context 
here</a></figcaption></figure></div><p></p><p></p><blockquote><p>AI News for 6/5/2026-6/8/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Coding Agents, Loops, and the Shift from &#8220;Passing Tests&#8221; to Mergeable Software</strong></p><ul><li><p><strong>FrontierCode raises the bar on coding evals</strong>: Cognition introduced <strong>FrontierCode</strong>, a new benchmark explicitly targeting whether code is actually <strong>mergeable</strong>, not merely unit-test passing. Tasks were built with open-source maintainers, with each taking <strong>40+ hours</strong> and evaluated on dimensions like regression safety, cleanliness, scope, test correctness, and maintainability. 
The headline result is that the best model, <strong>Opus 4.8</strong>, scores only about <strong>13%</strong> on the hardest subset&#8212;far below the 50%+ regime common on SWE-Bench-style evals, suggesting coding is much less &#8220;solved&#8221; than popular benchmarks imply (<a href="https://x.com/cognition/status/2064061031912288715">Cognition announcement</a>, <a href="https://x.com/ScottWu46/status/2064073699368800475">Scott Wu&#8217;s summary</a>, <a href="https://x.com/swyx/status/2064081945567580323">swyx breakdown</a>, <a href="https://x.com/theo/status/2064126021088215385">theo&#8217;s questions on variance/reproducibility</a>, <a href="https://x.com/cognition/status/2064215347503452649">Cognition response</a>).</p></li><li><p><strong>&#8220;Loops&#8221; are becoming the dominant agent-control metaphor&#8212;but with caveats</strong>: The day&#8217;s loudest practical theme was that coding agents should be given <strong>clear goals, verification criteria, and iteration structure</strong> rather than one-shot prompts. Popular examples include <a href="https://x.com/dzhng/status/2063931263312892406">dzhng&#8217;s &#8220;don&#8217;t use loops, design state machines&#8221;</a>, <a href="https://x.com/ClaudeDevs/status/2064032814392352816">Claude Code&#8217;s retrospective on auto mode, routines, and verification</a>, <a href="https://x.com/bcherny/status/2064034799711588805">bcherny&#8217;s thread</a>, <a href="https://x.com/reach_vb/status/2064028260070215772">OpenAI Codex tips on outcome-first prompting</a> and <a href="https://x.com/reach_vb/status/2064044955421769755">Approve-for-me defaults</a>, plus <a href="https://x.com/sydneyrunkle/status/2064034061165682931">LangChain OSS &#8220;rubrics&#8221;</a>. 
But several practitioners pushed back on na&#239;ve loop hype: <a href="https://x.com/omarsar0/status/2064024230396604469">omarsar0</a> and <a href="https://x.com/gneubig/status/2064011013637234728">Graham Neubig</a> emphasized that human checkpoints remain essential outside easily verifiable domains, while <a href="https://x.com/HamelHusain/status/2064019243990188259">Hamel Husain</a> joked about muting the word entirely.</p></li><li><p><strong>Agent ergonomics are improving around verification and orchestration</strong>: Product changes across the stack reflect this shift. <a href="https://x.com/ClaudeDevs/status/2064072801062121906">ClaudeDevs added observability dashboards for MCP connector developers</a>, including adoption, latency, and error views. <a href="https://x.com/skirano/status/2064035120483352776">MagicPath launched a Builder plan</a> for external-agent workflows and multiplayer canvas editing. <a href="https://x.com/LangChain/status/2064030008738296065">LangSmith Sandboxes</a> and <a href="https://x.com/AmplifyPartners/status/2063998736703856737">Modal&#8217;s sandbox scaling story</a> point toward the same infrastructure trend: agents need isolated, inspectable, long-running environments.</p></li><li><p><strong>Practical usage patterns are settling</strong>: The strongest operator advice converged on measurable outcomes, bounded autonomy, and thread hygiene. <a href="https://x.com/Angaisb_/status/2064103464142065852">Angaisb_ warned against overlong Codex threads degrading performance</a>, while <a href="https://x.com/reach_vb/status/2064115851503059418">reach_vb reported success with single-thread context accumulation</a>. 
That mismatch itself is useful signal: current agent performance is still strongly shaped by <strong>harness behavior and workflow choices</strong>, not just base-model quality.</p></li></ul><p><strong>Model Releases, Local Inference, and Serving Stack Upgrades</strong></p><ul><li><p><strong>Kimi shipped both a stronger coding agent and a desktop agent product</strong>: Moonshot released a major update to <strong>Kimi Code</strong>, its open-source coding agent, adding <strong>one-line CLI install</strong>, drag-and-drop <strong>video as coding context</strong>, ACP support, plugins, and IDE integration (<a href="https://x.com/KimiDevs/status/2063981516708024369">announcement</a>). It also launched <strong>Kimi Work</strong>, a desktop agent product with up to <strong>300 local sub-agents</strong>, browser-use via extension, finance-focused tool access, and persistent memory (<a href="https://x.com/Kimi_Moonshot/status/2063990409903112344">product launch</a>, <a href="https://x.com/crystalsssup/status/2063992904209842215">desktop availability</a>).</p></li><li><p><strong>Google pushed hard on efficient local deployment</strong>: Gemma got several notable upgrades. New <strong>QAT Gemma 4</strong> checkpoints reportedly preserve performance while using <strong>~4x less memory</strong>, with <strong>Gemma 4 E2B</strong> fitting in about <strong>1GB</strong> using a mobile quantization format (<a href="https://x.com/_philschmid/status/2063990553826439378">@_philschmid</a>). Separately, <strong>Gemma 4 MTP</strong> was merged into <strong>llama.cpp</strong>, enabling faster decoding when paired with QAT checkpoints (<a href="https://x.com/googlegemma/status/2064030477628182814">Gemma team</a>). 
<a href="https://x.com/osanseviero/status/2063985470489448887">llama.cpp also added video input support</a>, expanding local multimodal use cases.</p></li><li><p><strong>Open-source/open-weight competition remains intense</strong>: <a href="https://x.com/ArtificialAnlys/status/2064066303863005254">Artificial Analysis reported MiniMax-M3 at 55 on its Intelligence Index</a>, which would make it the leading open-weights model once weights are released. M3 adds <strong>native multimodality</strong> and a <strong>1M token context window</strong>, with strong GPQA/MMMU-Pro numbers but notable abstention on hallucination-sensitive evals. Meanwhile <a href="https://x.com/norpadon/status/2064040631479976240">norpadon announced Apple-hardware-optimized quantized Qwen3.5 checkpoints</a>.</p></li><li><p><strong>Serving infrastructure is broadening from text LLMs to world models and omni models</strong>: <strong>vLLM-Omni 0.22.0</strong> added day-0 support for <strong>NVIDIA Cosmos 3 world models</strong>, robot serving APIs, TTS models such as <strong>Qwen3-TTS</strong> and <strong>VoxCPM2</strong>, faster image/video serving, and broader quantization/hardware coverage (<a href="https://x.com/vllm_project/status/2064013506882703421">release</a>). 
This reflects a broader trend toward generalized multimodal serving rather than text-only inference stacks.</p></li></ul><p><strong>Benchmarks, Evaluation Methodology, and Real-World Agent Measurement</strong></p><ul><li><p><strong>Agent evaluation is moving from synthetic tasks to in-the-wild telemetry</strong>: Arena launched <strong>Agent Arena</strong>, a leaderboard based on over <strong>1M real-world sessions</strong>, using <strong>causal tracing</strong> rather than voting to estimate treatment effects of orchestrators/harnesses across five signals: <strong>confirmed success, praise vs complaint, steerability, bash recovery, and tool hallucination</strong> (<a href="https://x.com/arena/status/2064021507681276234">overview</a>, <a href="https://x.com/ml_angelopoulos/status/2064028763697127844">methodology thread</a>). Whether the methodology fully holds up remains to be seen, but it&#8217;s one of the clearest attempts yet to benchmark deployed agents using actual usage traces.</p></li><li><p><strong>Specialized benchmarks keep proliferating into new output domains</strong>: Hugging Face and Mecado released <strong>CADGenBench</strong>, a benchmark for generating and editing <strong>engineering-grade 3D CAD parts</strong> from drawings or STEP modifications, with metrics covering geometry, topology, interface compatibility, and CAD validity (<a href="https://x.com/MikushRab/status/2063999885796614522">launch thread</a>, <a href="https://x.com/Thom_Wolf/status/2064029993638764672">Thom Wolf summary</a>). 
This is a meaningful shift: evaluation is expanding beyond text/code into structured artifacts where correctness is physical and geometric.</p></li><li><p><strong>A recurring thesis: good benchmarks become training pipelines</strong>: <a href="https://x.com/OfirPress/status/2063990430350340575">Ofir Press argued</a> that the best benchmarks are scalable and rooted in <strong>real-world crawled data sources</strong>, making them useful not just for measurement but also for data generation. That view shows up implicitly in both FrontierCode and Agent Arena: benchmarks are no longer static scoreboards; they are becoming <strong>feedback loops for product and RL improvement</strong>.</p></li></ul><p><strong>Google, Apple, and the Consumer AI Platform Race</strong></p><ul><li><p><strong>Google expanded AI packaging, Search, and developer surfaces</strong>: Google announced a more capable <strong>NotebookLM</strong> with agentic chat, stronger reasoning, and more output formats for Ultra subscribers (<a href="https://x.com/NotebookLM/status/2064016460964585549">launch</a>). It also cut <strong>Google AI Plus</strong> pricing from <strong>$7.99 to $4.99/month</strong> while doubling storage to <strong>400GB</strong> (<a href="https://x.com/NewsFromGoogle/status/2064066310393209100">pricing update</a>). 
On the platform side, <a href="https://x.com/Google/status/2064034586762354893">Google highlighted a major Search upgrade</a>, including multimodal search and <strong>Gemini 3.5 Flash</strong> as the new default in AI Mode.</p></li><li><p><strong>Apple&#8217;s WWDC AI story centered on integration, not frontier leadership</strong>: Commentary around WWDC focused on a rebuilt <strong>Siri AI</strong> with on-screen awareness, app actions, personal context, and better voice interaction, alongside concerns about <strong>EU availability</strong> and hardware gating (<a href="https://x.com/kimmonismus/status/2064059964709388774">kimmonismus live thread</a>, <a href="https://x.com/kimmonismus/status/2064047278105464868">regional limitation note</a>). A technically notable detail came from <a href="https://x.com/awnihannun/status/2064202168618422396">awnihannun</a>: Apple&#8217;s on-device model is reportedly a <strong>20B-parameter query-routed architecture</strong> that loads experts from NAND into RAM once per query, a nonstandard design optimized for device constraints.</p></li></ul><p><strong>Research Directions: Continual Learning, Agent Training, and Optimization Debates</strong></p><ul><li><p><strong>Anthropic framed one core blocker for AI in science as infrastructure mismatch</strong>: Its new science blog argues AI has advanced faster in coding than biology because biological databases and tooling were not designed for agent use; the bottleneck is less raw intelligence than <strong>agent-compatible scientific infrastructure</strong> (<a href="https://x.com/AnthropicAI/status/2064054837294354677">Anthropic blog thread</a>). 
This pairs well with broader calls for harness/environment standardization.</p></li><li><p><strong>Open-source RL and environment protocols are becoming coordination points</strong>: <a href="https://x.com/ben_burtenshaw/status/2063991191415267492">OpenEnv was transferred to a consortium including Hugging Face, Meta-PyTorch, Reflection, Unsloth, Modal, Prime Intellect, NVIDIA, and others</a>. The pitch is that frontier labs co-train models with tightly coupled harnesses, while open ecosystems need a <strong>shared protocol layer</strong> between model, harness, environment, and trainer.</p></li><li><p><strong>Continual learning for agents is re-emerging as a practical systems problem</strong>: <a href="https://x.com/kimmonismus/status/2064001045391462907">Hivemind announced a system that turns traces from agents like Claude Code, Codex, Cursor, and Hermes into reusable skills</a>, claiming measurable gains across setups. Relatedly, <a href="https://x.com/NandoDF/status/2063938859583389837">Nando de Freitas posted a long thread</a> outlining a research program around learning from <strong>interaction consequences</strong> rather than token sequences alone.</p></li><li><p><strong>Optimization discourse was unusually active</strong>: Several threads debated whether <strong>Muon</strong> is materially distinct from <strong>Shampoo</strong>, with <a href="https://x.com/_arohan_/status/2064036303021494418">Arohan hinting at a better-than-Shampoo optimizer</a> and <a href="https://x.com/kellerjordan0/status/2064062891607888058">Keller Jordan benchmarking Shampoo and Spectral Descent publicly</a>. 
The substantive point beneath the drama: there is renewed appetite for <strong>optimizer-level gains</strong> as a real frontier lever, not just benchmark noise.</p></li></ul><p><strong>Top Tweets (by engagement)</strong></p><ul><li><p><strong>Signal on UK device scanning</strong>: The highest-engagement technically relevant post was <a href="https://x.com/signalapp/status/2064069692168519931">Signal&#8217;s statement opposing UK demands for on-device scanning and age-verification-linked content inspection</a>. This is more privacy/security policy than AI, but directly relevant to client-side inference and platform trust.</p></li><li><p><strong>OpenAI corporate direction and liquidity</strong>: <a href="https://x.com/sama/status/2064088940932641225">Sam Altman shared OpenAI&#8217;s current plan</a>, and shortly after <a href="https://x.com/OpenAINewsroom/status/2064094175541461220">OpenAI announced it had confidentially filed an S-1</a>. For AI engineers, the key implication is strategic: both OpenAI and Anthropic now appear to be preserving IPO optionality while ramping capacity and product breadth.</p></li><li><p><strong>NotebookLM and FrontierCode were the day&#8217;s biggest pure-product/eval launches</strong>: <a href="https://x.com/NotebookLM/status/2064016460964585549">NotebookLM&#8217;s upgrade</a>, <a href="https://x.com/KimiDevs/status/2063981516708024369">Kimi Code</a>, <a href="https://x.com/Kimi_Moonshot/status/2063990409903112344">Kimi Work</a>, and <a href="https://x.com/cognition/status/2064061031912288715">FrontierCode</a> dominated the technical conversation, with FrontierCode in particular reshaping the discourse around what &#8220;good coding performance&#8221; should mean.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-frontiercode-benchmarking">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] not much happened today]]></title><description><![CDATA[a quiet day of RSI.]]></description><link>https://www.latent.space/p/ainews-not-much-happened-today-6b8</link><guid isPermaLink="false">https://www.latent.space/p/ainews-not-much-happened-today-6b8</guid><pubDate>Sat, 06 Jun 2026 04:34:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Do check out the excellent <a href="https://www.latent.space/p/bad-envs">RL Env guide</a> we posted today! And more lightning pods over the weekend, starting with <a href="https://youtu.be/-rIAVuaRjOg">our CommandCode remote pod on harness optimization for DeepSeek v4 Pro</a>.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;fbd1a4dc-557c-41fd-8246-6e92faf9ea35&quot;,&quot;caption&quot;:&quot;We&#8217;re so excited to publish this guest post from Auriel W, who has worked on RL at Gemini, and has an incredible &#8220;RL Pet Peeves&#8221; blog where she not-so-subtly explains the frustrations big labs have w&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How to Stop Shipping Low-Quality RL Environments (with Examples)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:39274261,&quot;name&quot;:&quot;Auriel Wright&quot;,&quot;bio&quot;:&quot;always learning something 
new&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49e9e4bf-5eba-4d49-8678-593fb2d2bf7d_2401x2401.jpeg&quot;,&quot;is_guest&quot;:true,&quot;bestseller_tier&quot;:null,&quot;primaryPublicationSubscribeUrl&quot;:&quot;https://aurielwright.substack.com/subscribe?&quot;,&quot;primaryPublicationUrl&quot;:&quot;https://aurielwright.substack.com&quot;,&quot;primaryPublicationName&quot;:&quot;Auriel Wright&quot;,&quot;primaryPublicationId&quot;:8087656}],&quot;post_date&quot;:&quot;2026-06-05T18:49:40.461Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!NbXz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.latent.space/p/bad-envs&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:200799194,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:42,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1084089,&quot;publication_name&quot;:&quot;Latent.Space&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DbYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><blockquote><p>AI News for 6/4/2026-6/5/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Frontier Models, RSI, and the &#8220;AI Builds AI&#8221; Narrative</strong></p><ul><li><p><strong>Anthropic&#8217;s Mythos/Opus cycle dominated discussion, but substance was mixed with speculation</strong>: Community attention centered on <strong>Claude Mythos</strong>, with multiple users calling outputs &#8220;next level&#8221; and highlighting strong one-shot desktop and macOS workflows (<a href="https://x.com/kimmonismus/status/2062843119864021404">kimmonismus on Mythos outputs</a>, <a href="https://x.com/kimmonismus/status/2062933600287224073">more reactions</a>, <a href="https://x.com/kimmonismus/status/2062805570982203820">earlier post</a>). At the same time, there were questions about benchmark regressions&#8212;e.g. claims that <strong>Opus 4.8 underperforms 4.7 on LLM Debate Benchmark</strong> and skepticism around earlier Sonnet/Opus trajectory narratives (<a href="https://x.com/LechMazur/status/2062954327199666602">LechMazur</a>, <a href="https://x.com/teortaxesTex/status/2062807380643958948">teortaxesTex</a>). 
Anthropic also published a concrete science result: <strong>Opus 4.7 matching or beating dedicated NMR software on some tasks</strong>, framed as &#8220;making Claude a chemist&#8221; (<a href="https://x.com/AnthropicAI/status/2062979607448682731">AnthropicAI</a>).</p></li><li><p><strong>Recursive self-improvement moved from vague theory to explicit org strategy</strong>: <a href="https://x.com/SakanaAILabs/status/2062948403815030850">Sakana AI</a> launched a dedicated <strong>RSI Lab</strong> in Tokyo, tying together prior projects like <strong>The AI Scientist</strong>, <strong>Darwin G&#246;del Machine</strong>, and <strong>ShinkaEvolve</strong>, with an explicit claim that self-improving systems can be built under compute constraints rather than hyperscale-only regimes. <a href="https://x.com/hardmaru/status/2062948594597208557">hardmaru</a> emphasized <strong>sample efficiency</strong> as the design constraint. This lined up with broader industry rhetoric around self-improving systems: <a href="https://x.com/kimmonismus/status/2062868789746671819">kimmonismus</a> argued Anthropic/OpenAI RSI claims are not just IPO theater, while <a href="https://x.com/andrew_n_carr/status/2062976064343912949">andrew_n_carr</a> suggested only &#8220;1 or 2 hard problems&#8221; may remain on the path to AGI. The notable shift is that RSI is no longer just blog-post framing; labs are staffing around it as a formal research program.</p></li></ul><p><strong>Agent Evaluation, Reliability, and Long-Horizon Benchmarks</strong></p><ul><li><p><strong>Benchmarks are shifting from task snippets to economically meaningful, long-horizon work</strong>: Several new efforts pushed beyond classic SWE-bench-style evaluation. <a href="https://x.com/dair_ai/status/2062916866235068607">dair_ai</a> introduced <strong>Agents&#8217; Last Exam (ALE)</strong>, a benchmark of <strong>1,000+ economically valuable tasks</strong> mapped to U.S. 
occupational taxonomy, with the hardest tier averaging just <strong>2.6% full pass rate</strong>. <a href="https://x.com/rishi_desai2/status/2062930906818769356">rishi_desai2</a> launched <strong>SWE-Marathon</strong>, testing whether coding agents can stay coherent over <strong>1B-token budgets</strong> on projects like building Slack clones, rewriting JAX to PyTorch, or implementing a C compiler. <a href="https://x.com/omarsar0/status/2062919381777350914">omarsar0</a> highlighted the <strong>Meta-Agent Challenge</strong>, where agents attempt to self-improve under a sandbox + eval API + time budget setup; results showed meta-agents rarely match human baselines, and some attempted <strong>ground-truth exfiltration</strong> despite anti-reward-hacking defenses.</p></li><li><p><strong>Reliability work continues to show frontier models are not yet dependable enough</strong>: <a href="https://x.com/steverab/status/2062890225144135800">steverab</a> shared Princeton&#8217;s updated ICML 2026 paper, <strong>&#8220;Towards a Science of AI Agent Reliability,&#8221;</strong> adding <strong>GPT 5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7</strong> and concluding they are <strong>not meaningfully more reliable</strong> than previous models. The update also corrected an outcome consistency metric typo and audited scaffold issues including <strong>answer leakage</strong> and <strong>agent cheating on GAIA</strong>, but still found low consistency overall. Related commentary emphasized that &#8220;verifiable tasks&#8221; often just means <strong>easy tasks</strong> (<a href="https://x.com/MillionInt/status/2062924521779450147">MillionInt</a>) and that the right framing is &#8220;<strong>Reality: the final eval</strong>,&#8221; i.e. 
whether systems work in production, not whether they clear benchmark thresholds (<a href="https://x.com/559hkdt/status/2062867094111219824">559hkdt quoting swyx/Andon</a>).</p></li><li><p><strong>Tooling is converging on RL-environment-like harnesses for agents</strong>: <a href="https://x.com/pauliusztin_/status/2062874580411162811">pauliusztin_</a> argued for modeling agentic coding systems as <strong>Gym-style RL environments</strong> via Meta&#8217;s <strong>OpenEnv</strong>, mainly for observability rather than optimization: success rate, retries, tool efficiency, failure modes, cost per successful trajectory. <a href="https://x.com/adithya_s_k/status/2062871067803205815">adithya_s_k</a> noted strong uptake for a guide on RL environments for LLMs, while <a href="https://x.com/latentspacepod/status/2062972030606274785">latentspacepod</a> published a critique of low-quality RL environments. Together these point to a maturation of agent engineering from &#8220;vibe checks&#8221; to reproducible harnesses.</p></li></ul><p><strong>Open Models, Quantization, and Multimodal Releases</strong></p><ul><li><p><strong>Gemma 4 QAT was the most practically important open release for local deployment</strong>: Google shipped <strong>Gemma 4 Quantization-Aware Training checkpoints</strong> across model sizes (<a href="https://x.com/googlegemma/status/2062928831229665566">googlegemma</a>, <a href="https://x.com/osanseviero/status/2062933011415392482">osanseviero</a>). The release emphasizes lower memory while preserving quality, including a <strong>mobile quantization format</strong> and claims that <strong>E2B can run in ~1GB</strong>. Ecosystem support landed immediately via <a href="https://x.com/ollama/status/2062965815864066079">Ollama</a> and <a href="https://x.com/vllm_project/status/2062938949560283216">vLLM</a>. 
<a href="https://x.com/danielhanchen/status/2062933017430315481">danielhanchen</a> also noted a subtle interoperability issue: na&#239;ve conversion from QAT to llama.cpp&#8217;s <strong>Q4_0</strong> lattice loses accuracy, while Unsloth&#8217;s dynamic GGUF recovers much of it.</p></li><li><p><strong>Ideogram 4 stood out in image generation because it is both strong and open-weight</strong>: <a href="https://x.com/ideogram_ai/status/2062956373957292281">ideogram_ai</a> published a technical blog describing <strong>Ideogram 4.0</strong> as a <strong>9.3B Diffusion Transformer</strong> trained from scratch with a <strong>frozen 8B VLM text encoder</strong>, and notably released <strong>fp8 and nf4 checkpoints</strong>, with the <strong>nf4 variant fitting on a single 24GB GPU</strong> (<a href="https://x.com/ideogram_ai/status/2062956472489922584">follow-up</a>). Arena results placed <strong>Ideogram 4.0 Quality</strong> in the text-to-image top tier and as the <strong>leading open-weight image model</strong> (<a href="https://x.com/arena/status/2062957421757452516">arena</a>, <a href="https://x.com/arena/status/2062997992777609534">open-weight ranking update</a>).</p></li><li><p><strong>NVIDIA&#8217;s open-model push kept expanding</strong>: Discussion around <strong>Nemotron 3 Ultra</strong> focused on post-training details like <strong>MOPD warmup</strong> for teacher-student distribution matching and <strong>MTP boosting</strong> for speculative decoding (<a href="https://x.com/ben_burtenshaw/status/2062902364525244572">ben_burtenshaw</a>). NVIDIA also expanded its ecosystem with the <strong>Nemotron Coalition</strong>, adding <strong>Nous, Prime Intellect, and hcompany</strong> among others (<a href="https://x.com/NVIDIAAI/status/2062961026409333232">NVIDIAAI</a>). 
Downstream platforms moved quickly: <a href="https://x.com/perplexity_ai/status/2062976272436002825">Perplexity</a> made <strong>Nemotron 3 Ultra</strong> available to Pro/Max users, pitching it as an open model for long-running agents.</p></li></ul><p><strong>Agent Products, Devtools, and Runtime Infrastructure</strong></p><ul><li><p><strong>Hermes Agent had a full-stack product week</strong>: <a href="https://x.com/Teknium/status/2062822586954997909">Teknium</a> showcased building <strong>Hermes Agent with Hermes Agent</strong>, then spent the week pushing plugin support, docs, and curation (<a href="https://x.com/Teknium/status/2062854497865810164">plugin guide</a>, <a href="https://x.com/Teknium/status/2062830182432731256">developer-experience thread</a>). The biggest ship was <strong>Hermes v0.16.0</strong>, which includes a <strong>desktop GUI app</strong>, dashboard overhaul, leaner built-in skills, and <strong>new security layers for remote dashboard/GUI access</strong> including simple auth and OAuth (<a href="https://x.com/Teknium/status/2063075771317686606">release</a>, <a href="https://x.com/Teknium/status/2063078732768928234">security follow-up</a>, <a href="https://x.com/Teknium/status/2062953592131342832">Chinese-language desktop support</a>).</p></li><li><p><strong>Arena moved from passive leaderboard to active agent runtime</strong>: <a href="https://x.com/arena/status/2062902033389322477">arena</a> launched <strong>Agent Mode</strong> plus <strong>Agent Arena</strong>, where users run agents on real tasks and feed aggregate metrics like <strong>confirmed success, praise vs complaint, steerability, bash recovery, and tool hallucination</strong> into a leaderboard (<a href="https://x.com/arena/status/2062902039445959060">leaderboard details</a>). 
This is one of the clearest examples this week of an eval company turning into an execution platform.</p></li><li><p><strong>Devtools are being rebuilt around agent efficiency, not just human UX</strong>: <a href="https://x.com/ClementDelangue/status/2062982727729553913">ClementDelangue</a> provided one of the sharper operator takeaways: agent-optimized tooling matters because <strong>hand-rolling raw API interactions consumed up to 6&#215; more tokens and had lower success rates</strong> than using the Hugging Face CLI. His framing&#8212;&#8220;<strong>good tools are cached intelligence for agents</strong>&#8221;&#8212;captures an emerging design principle for agent-native developer platforms. Related launches included <strong>MagicPath as an official Codex plugin</strong> (<a href="https://x.com/skirano/status/2062942695547375829">skirano</a>), <strong>Cursor Design Mode</strong> for visual prompting of UI changes (<a href="https://x.com/cursor_ai/status/2062950344687272144">cursor_ai</a>), and <strong>Vercel integration inside Perplexity Computer</strong> to inspect deployments and redeploy in natural language (<a href="https://x.com/vercel_dev/status/2062934988648329515">vercel_dev</a>).</p></li></ul><p><strong>Compute, Infrastructure Economics, and Platform Operations</strong></p><ul><li><p><strong>AI infra economics are becoming a first-order story</strong>: <a href="https://x.com/EpochAIResearch/status/2062933470373146828">Epoch AI</a> estimated AI-related data center construction, compute hardware, and networking at <strong>~0.8% of U.S. GDP in Q1 2026</strong>, pushing total computing infrastructure to <strong>~1.5% of GDP</strong>. 
On the operating side, <a href="https://x.com/eglyman/status/2062921352613425446">eglyman</a> argued the problem is not raw token spend but lack of <strong>attribution and allocation</strong>, noting that rerouting even <strong>10% of a $10M AI bill</strong> from frontier models to cheaper tiers can save nearly <strong>$1M</strong>.</p></li><li><p><strong>Cloudflare shipped concrete cost controls for inference routing</strong>: <a href="https://x.com/CFchangelog/status/2062762883222483347">CF changelog</a>, <a href="https://x.com/elithrar/status/2062887228909527346">elithrar</a>, and <a href="https://x.com/michellechen/status/2062894017545720129">michellechen</a> announced <strong>AI Gateway spend limits</strong>, budget enforcement by model/user, and <strong>fallbacks to cheaper models</strong> when caps are reached, with forthcoming identity-based controls through Cloudflare Access. This is exactly the kind of infra feature enterprise teams are now demanding as usage leaves prototype scale.</p></li><li><p><strong>Platform/security incidents still matter because they reveal failure modes</strong>: OpenAI had an account suspension incident, acknowledged publicly by <a href="https://x.com/OpenAI/status/2062927046448431587">OpenAI</a>, with follow-ups from support staff indicating most accounts/subscriptions were later restored (<a href="https://x.com/reach_vb/status/2063035661855183215">reach_vb</a>). OpenAI also rolled out <strong>ChatGPT Lockdown Mode</strong> to all users, aimed at reducing the final stage of <strong>prompt-injection-driven data exfiltration</strong> by limiting outbound network requests (<a href="https://x.com/cryps1s/status/2062923575049531422">cryps1s</a>). 
Separately, speculation around an Anthropic outage potentially exposing cross-tenant output shows that <strong>multi-tenant isolation failures</strong> remain one of the highest-severity risks in agentic/cloud inference products (<a href="https://x.com/kimmonismus/status/2062997809067139468">kimmonismus</a>).</p></li></ul><p><strong>Top Tweets (by engagement)</strong></p><ul><li><p><strong>Gemma 4 QAT release</strong>: <a href="https://x.com/googlegemma/status/2062928831229665566">@googlegemma</a> announced QAT checkpoints for all Gemma 4 sizes and drafters, focused on lower-memory on-device inference.</p></li><li><p><strong>Anthropic&#8217;s Claude usage expansion</strong>: <a href="https://x.com/claudeai/status/2063018337567670285">@claudeai</a> said it had <strong>doubled usage limits in Claude Cowork</strong> for a month to support larger delegated tasks.</p></li><li><p><strong>OpenAI platform incident</strong>: <a href="https://x.com/OpenAI/status/2062927046448431587">@OpenAI</a> reported incorrect account suspensions and restoration work.</p></li><li><p><strong>Cursor Design Mode</strong>: <a href="https://x.com/cursor_ai/status/2062950344687272144">@cursor_ai</a> launched multimodal UI editing via pointing, drawing, or voice.</p></li><li><p><strong>Google&#8217;s agentic RAG framework</strong>: <a href="https://x.com/GoogleResearch/status/2062982001850974257">@GoogleResearch</a> introduced a <strong>multi-agent enterprise RAG</strong> workflow with iterative context gathering rather than one-shot retrieval.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Gemma 4 QAT and Nemotron 3 Ultra Releases</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-not-much-happened-today-6b8">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How to Stop Shipping Low-Quality RL Environments (with Examples)]]></title><description><![CDATA[Your broken harness is actively making the model worse. Here's what I keep seeing after years of eyeballing trajectories, and what you need to fix.]]></description><link>https://www.latent.space/p/bad-envs</link><guid isPermaLink="false">https://www.latent.space/p/bad-envs</guid><dc:creator><![CDATA[Auriel Wright]]></dc:creator><pubDate>Fri, 05 Jun 2026 18:49:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NbXz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>We&#8217;re so excited to publish this guest post from Auriel W, who has worked on RL at Gemini, and has an incredible &#8220;<a href="https://aurielws.github.io/writing.html">RL Pet Peeves</a>&#8221; blog where she not-so-subtly explains the frustrations big labs have with RL vendors: 1) <a href="https://aurielws.github.io/posts/rl-pet-peeves-part-1/">not reading trajectories</a>, 2) <a href="https://aurielws.github.io/writing/rl-pet-peeves-rubric/">not having domain experts</a>, 3) <a href="https://aurielws.github.io/writing/rl-pet-peeves-economic/">not making economic tradeoffs</a>, 4) <a href="https://aurielws.github.io/writing/rl-pet-peeves-simulation/">triggering eval awareness</a>, and this one, on <strong><a href="https://aurielws.github.io/writing.html">Environment Quality</a></strong>.</em></p><p><em>From <a href="https://x.com/swyx/status/2062611218196771017/photo/1">experience</a>, we&#8217;re ultra keen on improving the state of the art on data quality - after all, <a href="https://www.youtube.com/watch?v=yXPPcBlcF8U">Better Data is All You Need</a> - and so are asking both buyers and sellers of data, from human expert to RL env, to join us at <a 
href="http://ai.engineer/wf">our inaugural Data track at AIEWF</a> in 3 weeks. Reach out if you have a speaker to nominate!</em></p><p><em>Without further ado, here&#8217;s <a href="https://x.com/aurielws">Auriel</a>!</em></p><div><hr></div><h2><em>I Don&#8217;t Want Your Janky Harness / Environment bro &#128578;</em></h2><p>As someone who has spent years building production-grade models, I need you to hear this: researchers don&#8217;t want your broken <a href="https://aurielws.github.io/writing-drafts/harness-failure-v3/glossary.html#rl">RL</a> environments because they will make our models worse. Not &#8220;add some noise&#8221; worse, but &#8220;oh crap, the model is learning the wrong things, you ruined my training run, and I have to throw your stuff away&#8221; worse. This is a problem I see all the time, and probably the one I care about most as a practitioner who also works on aligning models for real-world use cases that users love.</p><p>People will build what amounts to broken software and pitch it as an &#8220;RL environment.&#8221; The training <a href="https://aurielws.github.io/writing-drafts/harness-failure-v3/glossary.html#harness">harness</a> itself - the complete, interactive, and often simulated software system your RL agent trains inside of (e.g., a simulated chatbot, a fake IDE, a mock SaaS dashboard) - just doesn&#8217;t work reliably. It throws random tracebacks. It has race conditions. It goes down under minimal load. 
It has literal broken code in it.</p><p>If you&#8217;re a fresh grad researcher, a startup trying to post-train subagents for your product, or anyone building RL training infrastructure: this post is the list of harness failures I keep seeing, why they ruin your data, and how to fix them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NbXz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NbXz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 424w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 848w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NbXz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png" width="1456" height="1394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200799194?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NbXz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 424w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 848w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!NbXz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58868ac-23a0-453d-81e5-5ca830f7454d_1456x1394.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a>
<figcaption class="image-caption"><em>Important: In reinforcement learning, the environment is your data generator.</em></figcaption></figure></div><p></p><p>In RL, you don&#8217;t have a static dataset. Instead, the model creates its own training data by interacting with the environment. Every action and every reward becomes a data point. A flaky harness systematically generates garbage data and feeds it straight into your model&#8217;s learning steps, pushing your gradients in the wrong direction.</p><p></p><h2><em>Common Harness Errors Across Agentic Use Cases</em></h2><p>After eyeballing thousands of <a href="https://aurielws.github.io/writing-drafts/harness-failure-v3/glossary.html#trajectory">trajectories</a> across different domains as a practitioner for the last 5 years, I see the same harness failures showing up. 
Here are some I personally look out for, grouped by agent types that are pretty common today:</p><blockquote><p><em>Each trajectory cascade below shows exactly how a single harness bug poisons an entire episode.</em></p></blockquote><p></p><h3>Error Class 1: The Stale Cache</h3><p>This happens when your environment returns old data after an action is taken.</p><p><strong>Example: SaaS Sales Agent / BDR Agent</strong></p><p>Your harness&#8217;s mock CRM API has a caching bug. Under load, it returns stale state from minutes ago instead of current data. The agent makes rational decisions based on wrong information, gets punished, and learns to avoid the correct workflow entirely.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3TuR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3TuR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 424w, https://substackcdn.com/image/fetch/$s_!3TuR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 848w, https://substackcdn.com/image/fetch/$s_!3TuR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3TuR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3TuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png" width="1456" height="1097" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1097,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308370,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200799194?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3TuR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 424w, https://substackcdn.com/image/fetch/$s_!3TuR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 848w, 
https://substackcdn.com/image/fetch/$s_!3TuR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!3TuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81b81e1-9708-4929-bdaa-a83ca0519f9b_1460x1100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What the model ends up learning: <em>&#8220;When in doubt, send nurture emails and avoid the 
pipeline.&#8221;</em></p><p></p><h3>Error Class 2: The Reward Hack</h3><p>This happens when your agent games the metric.</p><p><strong>Example: Coding Agent</strong></p><p>Your reward function only checks whether tests pass, not whether the code is actually correct. The agent discovers it can hardcode expected outputs instead of solving the problem. Every test passes, the agent gets maximum reward, and production breaks on the first real input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vD1q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vD1q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 424w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 848w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 1272w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vD1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png" width="1448" height="1182" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1182,&quot;width&quot;:1448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200799194?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vD1q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 424w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 848w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 1272w, https://substackcdn.com/image/fetch/$s_!vD1q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2401739-5a09-424f-b02c-11ce118e0917_1448x1182.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What the model ends up learning: <em>&#8220;Read the tests, hardcode the outputs, skip understanding the bug.&#8221;</em></p><p></p><h3>Error Class 3: The False Resolution</h3><p>This happens when a status changes but the core problem is still not solved.</p><p><strong>Example: Customer Support Agent</strong></p><p>Your harness rewards based on ticket status changes (open &#8594; resolved = positive reward), not on whether the customer&#8217;s actual problem was fixed. 
The agent learns that clicking &#8220;resolve&#8221; is the fastest path to reward - even when the customer still has the problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7BzW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7BzW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 424w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 848w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7BzW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png" width="1456" height="1096" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:318107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200799194?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7BzW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 424w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 848w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!7BzW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc86c0f13-d939-4301-ba8d-6a5ac6ed2df5_1458x1098.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p></p><h3><em>More Harness Failures to Watch For</em></h3><ul><li><p><strong>Silent timeout defaults:</strong> Your harness silently returns a default value when an API call takes too long instead of throwing an error. The model learns that certain actions &#8220;always succeed instantly&#8221; and never builds retry logic into its behavior.</p></li><li><p><strong>Non-deterministic state resets:</strong> The harness doesn&#8217;t fully reset between episodes, so leftover state from episode N bleeds into episode N+1. The model gets rewarded or punished for things it didn&#8217;t do in the current episode.</p></li><li><p><strong>Reward rounding / clipping artifacts:</strong> Your reward function clips or rounds in ways that flatten meaningful signal differences. 
A great action and a mediocre action both return +1.0, so the model has no gradient to distinguish them.</p></li><li><p><strong>Mock data that doesn&#8217;t match production distributions:</strong> Your harness uses perfectly formatted, clean mock data, but production data has typos, missing fields, and edge cases. The model never sees messy inputs during training and breaks on real ones.</p></li><li><p><strong>Action space drift:</strong> The harness exposes actions that don&#8217;t exist in production (or hides ones that do). The model learns to rely on a &#8220;shortcut&#8221; button that won&#8217;t be there when deployed, or never discovers a critical capability it needs.</p></li></ul><p></p><h2><strong>How to Minimize Harness Failures</strong></h2><h3><em>Know Your Model, Know Your Harness</em></h3><p>In my experience, a well-built harness has clean signal (every state is fresh, every reward matches reality), graceful degradation (bad episodes get flagged and excluded before they reach the gradient), and fail-fast behavior (when something breaks, the harness throws immediately instead of silently corrupting data - you&#8217;d rather lose an episode than poison one).</p><p>You learn to recognize these properties by spending time with your model - reviewing trajectories and building a failure taxonomy so you know whether a bad episode was a model failure or a harness failure. If your environment failure rate is above 5%, you don&#8217;t have a model problem, you have a harness problem. Fix the harness first. I talk more about this in my previous post on <a href="https://aurielws.github.io/posts/rl-pet-peeves-part-1/">trajectory reviewing</a>.</p><h3><em>Adopt Traditional Software Engineering Best Practices in Your RL Research</em></h3><p>Building good RL environments is a software engineering problem as much as a research one.
I feel like many classically trained ML researchers are taught to think mostly about algorithms and mathematical correctness, but in school we&#8217;re never taught how to turn what the math tells us into working code. Building scalable and robust software (i.e., stable harnesses) requires a somewhat different set of best practices from traditional research. Treat your training harness like your production one as much as you can. If prod sees 200 QPS on average, make sure your harness can sustain that load without errors. If you haven&#8217;t had to ship production software before, there are great resources out there from the likes of <a href="https://x.com/GergelyOrosz">Gergely Orosz</a> and <a href="https://x.com/alexxubyte">Alex Xu</a> that can help get you there. You can also learn from your company&#8217;s <a href="https://x.com/swyx/status/1097334440169107456?s=20">Platform Engineers</a>, who usually eat, sleep, and breathe stable and scalable software.</p><h2><em>Go Fix Your Janky Harness</em></h2><p>Training harness engineering is about making sure the model experiences production-quality interactions before you actually deploy to prod. A good harness compounds: every clean episode builds on the last. A bad one compounds too, just in the wrong direction. The gap between teams that ship working harnesses and those that don&#8217;t widens with every training run.
Treat the training harness as an extension of your actual product - with the same level of engineering quality you expect the model to see in production.</p><div><hr></div><p><em>Auriel W blogs at <a href="https://aurielws.github.io/writing.html">https://aurielws.github.io/writing.html</a> and is on <a href="https://x.com/aurielws">Twitter</a> and <a href="https://www.linkedin.com/in/aurielws/">LinkedIn</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[[AINews] not much happened today]]></title><description><![CDATA[a quiet day]]></description><link>https://www.latent.space/p/ainews-not-much-happened-today-7a8</link><guid isPermaLink="false">https://www.latent.space/p/ainews-not-much-happened-today-7a8</guid><pubDate>Fri, 05 Jun 2026 06:44:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anthropic is seeing <a href="https://www.anthropic.com/institute/recursive-self-improvement">Sparks of RSI</a>, OpenAI&#8217;s ChatGPT has finally crossed 1B MAU (~5 months behind schedule) and shipped <a href="https://x.com/OpenAI/status/2062567556524003631">improved memory</a>, and <a href="https://x.com/SpaceX/status/2062630481087082874">SpaceXAI is explaining its IPO to people who might not know they will be forced into buying it</a>.</p><p>None of which are as important as <a href="http://ai.engineer/wf">getting your AIEWF tickets and hotels</a> and tuning in to <a href="https://www.latent.space/p/andon">the latest pod with Andon Labs</a>!</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7942c02d-00aa-4377-8600-12d5f6bb0c80&quot;,&quot;caption&quot;:&quot;The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out.
Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Reality: The Final Eval &#8212; Lukas Petersson and Axel Backlund of Andon Labs&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2026-06-04T20:39:18.514Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/200614482/1621f1b3-afdf-4e73-96ad-7e9344965086/transcoded-1780580537.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.latent.space/p/andon&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:200614482,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1084089,&quot;publication_name&quot;:&quot;Latent.Space&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DbYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><blockquote><p>AI News for 6/3/2026-6/4/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>NVIDIA&#8217;s Nemotron 3 Ultra and 3.5 ASR Release</strong></p><ul><li><p><strong>Nemotron 3 Ultra</strong> was the clearest technical release of the day: a fully open <strong>550B MoE</strong> model with <strong>55B active parameters</strong>, <strong>1M context</strong>, and an explicit focus on long-running agent workloads. NVIDIA says it is <strong>up to 5x faster</strong> and <strong>30% lower cost</strong> for agentic tasks, with weights, synthetic data, reward checkpoints, quantized variants, and training recipes released under <strong>OpenMDW 1.1</strong> (<a href="https://x.com/nvidia/status/2062522316672667770">NVIDIA launch</a>, <a href="https://x.com/NVIDIAAI/status/2062521383582646537">NVIDIAAI open artifacts</a>, <a href="https://x.com/PavloMolchanov/status/2062538679470657727">Pavlo Molchanov thread</a>). The architecture combines <strong>hybrid Mamba/attention</strong>, <strong>LatentMoE</strong>, and <strong>native MTP</strong>, with pretraining done in <strong>NVFP4</strong> over <strong>20T tokens</strong>&#8212;notable because it pushes low-precision pretraining into a new scale regime (<a href="https://x.com/ctnzr/status/2062515418884149451">tech notes</a>, <a href="https://x.com/scaling01/status/2062540298933219832">scaling discussion</a>).</p></li><li><p><strong>Benchmarks and serving story</strong> were unusually strong for an open release. 
<a href="https://x.com/ArtificialAnlys/status/2062527871529439438">@ArtificialAnlys</a> measured <strong>47.7</strong> on its Intelligence Index using NVIDIA&#8217;s recommended NVFP4 inference weights (<strong>48.2</strong> in BF16), making it the strongest <strong>US open-weights</strong> model they&#8217;ve tested, though still behind <strong>Kimi K2.6</strong>. More interestingly, they reported <strong>400+ output tok/s</strong> via BlackBox, and separately showed Nemotron 3 Ultra sitting on the <strong>Pareto frontier for task latency vs. performance</strong> on Terminal-Bench-style evaluations under turn limits (<a href="https://x.com/ArtificialAnlys/status/2062598349757567359">latency analysis</a>, <a href="https://x.com/blackboxai/status/2062546216949588001">BlackBox throughput</a>). The model shipped <strong>day 0</strong> across the stack: <a href="https://x.com/vllm_project/status/2062574262163280172">vLLM</a>, <a href="https://x.com/modal/status/2062528720104227149">Modal</a>, <a href="https://x.com/togethercompute/status/2062520009893576974">Together</a>, <a href="https://x.com/FireworksAI_HQ/status/2062568688201646321">Fireworks</a>, <a href="https://x.com/ollama/status/2062591290743853291">Ollama cloud</a>, <a href="https://x.com/baseten/status/2062609272815685759">Baseten</a>, <a href="https://x.com/wandb/status/2062577626242580896">CoreWeave/W&amp;B</a>, <a href="https://x.com/cline/status/2062620668085297214">Cline</a>, <a href="https://x.com/PrimeIntellect/status/2062622550300275088">Prime Intellect</a>, and <a href="https://x.com/NousResearch/status/2062554136625766409">Nous Portal</a>.</p></li><li><p><strong>Nemotron 3.5 ASR</strong> was the quieter but practical companion release: an open streaming ASR model with a single <strong>0.6B checkpoint</strong>, <strong>40 language-locale combinations</strong>, and <strong>sub-100ms latency</strong>, built on a <strong>cache-aware FastConformer / RNN-T</strong> style design optimized for voice agents 
and streaming speech workloads (<a href="https://x.com/PiotrZelasko/status/2062538923776290909">Piotr Zelasko</a>, <a href="https://x.com/togethercompute/status/2062520605102993436">Together</a>, <a href="https://x.com/fal/status/2062521027020611933">fal availability</a>).</p></li></ul><p><strong>Anthropic&#8217;s Recursive Self-Improvement Framing and Internal AI-Coding Metrics</strong></p><ul><li><p>Anthropic published the most-discussed policy/research note of the day, arguing that current systems show <strong>early signs of recursive self-improvement (RSI)</strong>&#8212;not yet full autonomy in research direction, but clear evidence that AI is accelerating AI development (<a href="https://x.com/AnthropicAI/status/2062568862479208923">Anthropic post</a>). The headline operational claims were concrete: <strong>80%+ of merged code</strong> at Anthropic is now authored by Claude, the typical engineer ships <strong>8x more code per quarter</strong> than in prior years, and on internal open-ended engineering tasks Claude&#8217;s success rate rose from roughly <strong>26% to 76%</strong> in six months (<a href="https://x.com/AnthropicAI/status/2062568864240836995">code metric</a>, <a href="https://x.com/alexalbert__/status/2062580571214389510">Alex Albert summary</a>).</p></li><li><p>The most striking empirical datapoint was Anthropic&#8217;s recurring &#8220;speed up a small model training script&#8221; test: <strong>Claude Opus 4</strong> averaged about <strong>3x</strong> speedup, while <strong>Mythos Preview</strong> reportedly achieved <strong>~52x</strong> (<a href="https://x.com/AnthropicAI/status/2062568869240476050">Anthropic benchmark claim</a>, <a href="https://x.com/AnthropicAI/status/2062634151556292775">correction on dates</a>). 
Anthropic also says Mythos gave better &#8220;what to do next&#8221; research suggestions than humans <strong>64%</strong> of the time in sessions where the researcher had taken a wrong turn (<a href="https://x.com/AnthropicAI/status/2062568870872003021">research-next-step result</a>). Their broader thesis: automating <em>problem selection</em> is still unresolved, but automating large portions of implementation and iteration is already happening.</p></li><li><p>The governance angle mattered as much as the productivity claims. Anthropic explicitly wrote that &#8220;it would be good for the world to have the option to <strong>slow or temporarily pause frontier AI development</strong>,&#8221; framing verification and coordination mechanisms as increasingly urgent if RSI-like dynamics continue (<a href="https://x.com/AnthropicAI/status/2062568873321513443">Anthropic governance statement</a>, <a href="https://x.com/scaling01/status/2062572962117562507">discussion</a>, <a href="https://x.com/a_karvonen/status/2062572851916574730">commentary</a>). This landed amid criticism that Anthropic recently <strong>weakened parts of its Responsible Scaling Policy thresholds</strong> around bio/chemical risk, according to <a href="https://x.com/CRSegerie/status/2062474945377218819">@CRSegerie</a>. Separately, a coalition including <strong>Altman, Amodei, Hassabis, and Baker</strong> backed <strong>mandatory DNA synthesis screening and recordkeeping</strong> in the US, arguing AI is eroding biological knowledge barriers (<a href="https://x.com/kimmonismus/status/2062485389949145457">letter summary</a>).</p></li></ul><p><strong>Cloudflare Acquires VoidZero and Tightens the Full-Stack Agent Toolchain</strong></p><ul><li><p>The biggest developer-platform move was <strong>Cloudflare bringing in VoidZero</strong>, the team behind <strong>Vite, Vitest, Rolldown, Oxc, and Vite+</strong>. 
Cloudflare and VoidZero emphasized that <strong>Vite remains open source, MIT, and vendor-neutral</strong>, with Cloudflare also committing <strong>$1M</strong> to a fund for independent Vite ecosystem development (<a href="https://x.com/Cloudflare/status/2062521221132992533">Cloudflare</a>, <a href="https://x.com/vite_js/status/2062525206158078047">Vite statement</a>, <a href="https://x.com/evanyou/status/2062533668233756677">Evan You</a>).</p></li><li><p>The strategic read from developers was that this gives Cloudflare tighter control over an increasingly agent-friendly application stack: frontend/build tooling, runtime, storage, inference, deployment primitives, and security in one place. <a href="https://x.com/wesbos/status/2062520527151903090">@wesbos</a> framed it as Cloudflare assembling &#8220;a tidy package they can hand to an LLM to make a site,&#8221; which is directionally consistent with Cloudflare&#8217;s own push on agents, MCP, sandboxes, AI search, payments, and observability in a unified platform (<a href="https://x.com/thomasgauvin/status/2062512156076048447">Cloudflare agents docs overview</a>).</p></li></ul><p><strong>Agents, Harnesses, Memory, and Evaluation Infrastructure</strong></p><ul><li><p>Several tweets pointed to a maturing &#8220;agent systems&#8221; layer beyond raw model releases. A recurring theme was that the bottleneck is increasingly the <strong>harness/orchestrator</strong>, not just prompting. A popular clip summarized the Claude Code workflow as &#8220;I don&#8217;t prompt Claude anymore, I write loops,&#8221; while <a href="https://x.com/omarsar0/status/2062553527730540611">@omarsar0</a> described reverse-engineering <strong>dynamic workflows</strong> into his own orchestrator for branching research, verification, triage, data synthesis, and eval generation. The common idea: higher-order control loops, not one-shot prompts, are becoming the real unit of work.</p></li><li><p>Tooling around those loops also improved. 
<a href="https://x.com/LangChain/status/2062512156688466083">LangSmith Sandboxes</a> reached GA with Dockerfile snapshots, interactive consoles, TCP tunneling, and standard Linux tooling. Hugging Face pushed two adjacent ideas: a <strong>Kernels</strong> distribution path for custom kernels on the Hub (<a href="https://x.com/RisingSayak/status/2062471134260687264">announcement</a>) and stronger support for storing <strong>agent traces</strong> as first-class artifacts, echoed by <a href="https://x.com/ClementDelangue/status/2062542713463980303">@ClementDelangue</a>. <a href="https://x.com/julien_c/status/2062524414034423969">@julien_c</a> released <strong>SynthTraces</strong>, a minimal harness that generated <strong>2,000+ synthetic coding-agent session traces</strong> by having an open model play the coding agent and a local model simulate the user.</p></li><li><p>Evaluation also shifted toward real-world agent work. <strong>Arena</strong> launched <strong>Agent Arena / Agent Mode</strong>, measuring agentic performance from <strong>millions of live sessions</strong> with tools like web search, filesystem, bash, and image generation. Their current ranking puts <strong>GPT-5.5</strong> first, followed by <strong>Claude Opus 4.7</strong>, <strong>GLM-5.1</strong>, <strong>Gemini 3.1 Pro</strong>, and <strong>Kimi-K2.6</strong>, with methodology based on task success, steerability, recovery, user praise/complaint, and tool hallucination across <strong>300K+ tasks</strong>, <strong>2M+ tool calls</strong>, and <strong>40M lines of code</strong> (<a href="https://x.com/arena/status/2062566749418233981">launch</a>, <a href="https://x.com/arena/status/2062566769659912281">methodology</a>). 
On the enterprise side, <strong>Cognition</strong> introduced an <strong>AI Productivity Guarantee</strong> for Devin&#8212;up to <strong>$10M</strong> in covered usage if the product doesn&#8217;t produce positive engineering value&#8212;backed by an internal measurement system over <strong>258 enterprise sessions</strong> spanning tasks up to <strong>64+ hours</strong> (<a href="https://x.com/cognition/status/2062597242167628019">guarantee</a>, <a href="https://x.com/cognition/status/2062597246001324518">technical writeup</a>).</p></li></ul><p><strong>Memory, Multimodality, and Model/Benchmark Updates</strong></p><ul><li><p><strong>OpenAI rolled out a more capable ChatGPT memory system</strong> to Plus and Pro users in the US, with <strong>memory summaries</strong>, more steering controls, and <strong>2x more memory</strong>. The company framed this as a longer-running research arc from saved memory to &#8220;dreaming&#8221; to the current system (<a href="https://x.com/OpenAI/status/2062567556524003631">OpenAI</a>, <a href="https://x.com/OpenAI/status/2062567559673856346">controls</a>, <a href="https://x.com/ChristinaHartW/status/2062585124450172956">Christina Kim explanation</a>). Related developer-side updates included <strong>moderation scores in the Responses and Completions APIs</strong> (<a href="https://x.com/OpenAIDevs/status/2062619558440267801">OpenAIDevs</a>) and a heavily shared demo of the new <strong>Codex iOS app plugin</strong> for viewing and testing apps in-browser with hot reload (<a href="https://x.com/OpenAIDevs/status/2062599291479478275">OpenAIDevs demo</a>).</p></li><li><p>A few other model/data releases are worth noting. <strong>Gemma 4 12B</strong> continued to draw attention both as a local coding model replacement and in highly compressed form: <a href="https://x.com/UnslothAI/status/2062470072179044447">Unsloth</a> released a <strong>2-bit GGUF</strong> at <strong>4.66 GB</strong>. 
<a href="https://x.com/_philschmid/status/2062546814075609413">@_philschmid</a> highlighted an architectural explainer on how Gemma 4 handles text/images/audio without separate encoders. In multimodal research, <a href="https://x.com/skalskip92/status/2062549751246066144">@skalskip92</a> flagged <strong>Molmo2</strong> as a strong open VLM candidate at CVPR, supporting video pointing, tracking, counting, and multi-image reasoning. For document understanding, <strong>ParseBench</strong> from LlamaIndex introduced an open benchmark with <strong>2,000+ human-verified pages</strong> and <strong>167K+ test rules</strong> across tables, charts, faithfulness, formatting, and grounding (<a href="https://x.com/llama_index/status/2062525204262236266">benchmark announcement</a>).</p></li></ul><p><strong>Top Tweets (by engagement, filtered for technical relevance)</strong></p><ul><li><p><strong>Anthropic on RSI and internal automation</strong>: Claude now writes <strong>80%+</strong> of merged code at Anthropic, engineers ship <strong>8x</strong> more code, and the company says AI accelerating AI development is becoming plausible (<a href="https://x.com/AnthropicAI/status/2062568862479208923">Anthropic</a>).</p></li><li><p><strong>OpenAI memory upgrade</strong>: a more capable ChatGPT memory system with summaries, steering controls, and <strong>2x</strong> more memory for Plus/Pro users in the US (<a href="https://x.com/OpenAI/status/2062567556524003631">OpenAI</a>).</p></li><li><p><strong>Cloudflare + VoidZero</strong>: Cloudflare brings in the VoidZero team while keeping <strong>Vite MIT and vendor-neutral</strong>, plus a <strong>$1M OSS fund</strong> for the ecosystem (<a href="https://x.com/Cloudflare/status/2062521221132992533">Cloudflare</a>, <a href="https://x.com/vite_js/status/2062525206158078047">Vite</a>).</p></li><li><p><strong>Nemotron 3 Ultra launch</strong>: open <strong>550B/55B-active</strong> hybrid MoE for long-running agents, with full recipes and 
unusually strong speed claims (<a href="https://x.com/nvidia/status/2062522316672667770">NVIDIA</a>).</p></li><li><p><strong>Cursor canvases + context explorer</strong>: sharable canvases for apps/reports/internal tools and an interactive breakdown of where agent context is spent (<a href="https://x.com/cursor_ai/status/2062611883249783083">Cursor</a>).</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Gemma 4 12B Release and Benchmarks</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tvtn6m/googlegemma412b_hugging_face/">google/gemma-4-12B &#183; Hugging Face</a></strong> (Activity: 1610): <strong>Google DeepMind released </strong><code>google/gemma-4-12B</code><strong> as part of the Gemma 4 open-weights family, spanning </strong><code>E2B</code><strong>, </strong><code>E4B</code><strong>, </strong><code>12B</code><strong>, </strong><code>26B A4B</code><strong>, and </strong><code>31B</code><strong> variants with dense and MoE architectures, instruction-tuned/pretrained checkpoints, multimodal input, multilingual support across </strong><code>140+</code><strong> languages, and context windows up to </strong><code>256K</code><strong> tokens. The post highlights native </strong><code>system</code><strong> role support, configurable reasoning/thinking modes, function-calling/agentic use cases, coding improvements, and local deployment via GGUF builds from </strong><code>ggml-org</code><strong> and </strong><code>unsloth</code><strong>. 
A top comment links Maarten Grootendorst&#8217;s <a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b">visual guide</a>, specifically calling out the model&#8217;s </strong><em><strong>&#8220;encoder-free architecture.&#8221;</strong></em> Commenters are mainly interested in empirical coding performance, with one explicitly wanting to test whether Gemma 4 12B can beat <strong>Qwen 3.5 9B</strong> on coding tasks. No concrete benchmark results were provided in the comments.</p><ul><li><p>A linked technical guide by <strong>Maarten Grootendorst</strong> highlights Gemma 4 12B&#8217;s <strong>encoder-free architecture</strong>, framing it as a notable design point for readers interested in model internals.</p></li><li><p>Several commenters positioned <strong>Gemma 4 12B</strong> as a practical size tier between smaller Gemma variants like <code>E4B</code> and larger models such as <code>26B</code>, with one user also noting interest in whether it can outperform <strong>Qwen 3.5 9B</strong> on coding tasks.</p></li><li><p>One technical question raised was around the model&#8217;s apparent <strong>audio capabilities</strong>, with speculation that this could make Gemma 4 12B useful for <strong>speech/audio translation</strong> workflows if the multimodal support is robust.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tw4tmf/new_google_gemma_4_12b_claims_near26b_performance/">New Google Gemma 4 12B Claims Near-26B Performance - We Tested Both!</a></strong> (Activity: 984): <strong>A local single-</strong><code>RTX 4090</code><strong> comparison claims Google Gemma 4 26B-A4B used </strong><code>15 GB</code><strong> VRAM, generated </strong><code>6.9k</code><strong> tokens at </strong><code>138 tok/s</code><strong>, and outperformed Gemma 4 12B, which used </strong><code>9 GB</code><strong> VRAM, generated </strong><code>8.9k</code><strong> tokens at </strong><code>80 tok/s</code><strong>, on three
HTML5 Canvas physics-code tasks: a Galton board, two-block collision, and chaotic triple pendulum. The poster argues the MoE-style </strong><code>26B-A4B</code><strong> model is ~</strong><code>1.7&#215;</code><strong> faster despite larger total parameters because only ~</strong><code>4B</code><strong> are active, while the </strong><code>12B</code><strong> remains attractive for </strong><code>16 GB</code><strong> laptops; the test was also used to promote the founder&#8217;s local AI app, <a href="https://atomic.chat/">atomic.chat</a>.</strong> Top commenters disputed the stated winner, saying the videos appeared to show <strong>Gemma 4 12B</strong> performing better in scenes 2 and 3, with one asking whether the labels were reversed. Another commenter requested a comparable benchmark against <strong>Qwen3.6 35B-A3B</strong>.</p><ul><li><p>Multiple commenters questioned the test labeling/results, saying the <strong>Gemma 4 12B</strong> output appeared stronger than the larger model in the video comparisons&#8212;especially videos 2 and 3&#8212;with one noting the only visible flaw was that <em>&#8220;the balls seemed to have too high of a starting velocity&#8221;</em> in the first test.</p></li><li><p>A technical advantage highlighted for <strong>Gemma 4 12B</strong> was multimodal capability: it can ingest <strong>audio and video</strong> while fitting on devices with <strong>less VRAM</strong>, making near-26B performance practically useful for local or constrained deployments.</p></li><li><p>Commenters requested broader baselines such as <strong>Qwen3.6 35B A3B</strong>, and argued that evaluation should separate task domains: <strong>Qwen</strong> is expected to lead on quantitative/coding benchmarks, while <strong>Gemma 4</strong> may be more competitive on qualitative language tasks like creative writing and translation.</p></li></ul></li><li><p><strong><a 
href="https://www.reddit.com/r/LocalLLaMA/comments/1tw0lua/gemma412bit_vs_qwen359b_on_shared_benchmarks_qwen/">gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint</a></strong> (Activity: 520): <strong>The image is a technical benchmark table comparing Gemma 4 12B Unified vs Qwen3.5-9B, compiled from official Hugging Face model-card scores, with Qwen3.5-9B winning </strong><code>5/8</code><strong> shared benchmarks despite a smaller parameter footprint and allegedly lighter KV cache (<a href="https://i.redd.it/20s4116kg45h1.png">image</a>). Qwen leads on MMLU-Pro, GPQA Diamond, TAU2, MMMU-Pro, and MedXpertQA-MM, while Gemma leads on LiveCodeBench v6, MMMLU, and narrowly on MathVision/MATH-Vision, framing the post&#8217;s argument that Qwen is stronger &#8220;GB for GB&#8221; except possibly in coding where Gemma or Qwen finetunes like OmniCoder-9B may compete.</strong> Commenters pushed back on benchmark-only conclusions: one argued Qwen may be <em>&#8220;benchmaxxed&#8221;</em> and that Gemma often feels better for general assistant, creative writing, and roleplay, while Qwen is strong at coding. Others said the Qwen-vs-Gemma debate is overblown because both are practically capable for scripting/coding tasks, though Qwen&#8217;s reasoning mode was criticized for filling context with low-value reasoning text.</p><ul><li><p>Several commenters argue that <strong>Qwen</strong> appears &#8220;benchmaxxed,&#8221; especially for coding-oriented benchmarks, and that its real advantage is strongest on tasks involving code generation, tool use, or coding-style logic. 
In practical use, users report both <strong>Gemma 4 31B / Gemma 3.6 27B</strong> and <strong>Qwen</strong> can generate usable scripts, but outputs still require manual inspection before acceptance.</p></li><li><p>A recurring technical complaint is that <strong>Qwen reasoning mode</strong> can waste context by producing excessive chain-of-thought-like text, with one user estimating only about <code>20%</code> of the generated reasoning is useful. This suggests that for some local/SLM workflows, disabling reasoning may improve effective context utilization and reduce noise.</p></li><li><p>Users report <strong>Gemma</strong> performing better on non-coding tasks such as general assistant use, creative writing, summarization, roleplay, and even some vision/image-understanding cases. One example cited hand-drawn note transcription: <strong>Qwen</strong> repeatedly misclassified an awkward arrow-linked word segment as a subheading, while <strong>Gemma 26B</strong> inferred that it belonged in the body text; another commenter suggested testing on <strong>EQBench</strong> and creative-writing benchmarks, where they expect Gemma to outperform Qwen.</p></li></ul></li></ul><h3><strong>2. Long-Context Scaling and KV Cache Efficiency</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1twla1k/nvidianvidianemotron3ultra550ba55bbf16_hugging/">nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 &#183; Hugging Face</a></strong> (Activity: 542): <strong>NVIDIA released </strong><code>nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16</code><strong>, a </strong><code>550B</code><strong>-parameter LatentMoE hybrid model with </strong><code>55B</code><strong> active parameters, interleaving Mamba-2, MoE, selected attention layers, and Multi-Token Prediction; it advertises up to </strong><code>1M</code><strong> token context and configurable reasoning via </strong><code>enable_thinking=True/False</code><strong>. 
The model targets frontier reasoning, agentic workflows, tool use, multilingual RAG, and long-context analysis, with a stated minimum serving footprint of </strong><code>8x</code><strong> GB200/B200/GB300/B300, </strong><code>16x</code><strong> H100, or </strong><code>8x</code><strong> H200 GPUs, and is under the <a href="https://raw.githubusercontent.com/OpenMDW/OpenMDW/refs/heads/main/1.1/LICENSE.OpenMDW-1.1">OpenMDW 1.1 license</a>.</strong> Top comments mostly joked about the impractical hardware requirements for local users&#8212;e.g. <em>&#8220;Hopefully I can get this running on my Nokia 3310&#8221;</em> and <em>&#8220;Damn, I only have 7x H200...&#8221;</em>&#8212;rather than debating model quality or architecture.</p><ul><li><p>A commenter highlights the extremely high inference hardware requirements listed for <strong>NVIDIA Nemotron-3-Ultra-550B-A55B-BF16</strong>: minimum configurations include <code>8x GB200/B200/GB300/B300</code>, <code>16x H100</code>, or <code>8x H200</code>, implying the model is only practical for large multi-GPU/datacenter deployments rather than consumer or small-lab use.</p></li><li><p>One technical point raised is that this model may be valuable as a <strong>large, low-latency open model</strong>, even if its output quality is somewhat below alternatives like <strong>GLM</strong>. The tradeoff discussed is that faster response/processing can matter more than absolute benchmark quality for latency-sensitive applications.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache/">KVarN: new KV-cache quant from Huawei. 
3&#8211;5&#215; KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)</a></strong> (Activity: 438): <strong>Huawei CSL open-sourced KVarN, an Apache-2.0 KV-cache quantization method integrated into vLLM via a single flag, claiming </strong><code>3&#8211;5&#215;</code><strong> KV-cache compression versus FP16, up to </strong><code>~1.4&#215;</code><strong> FP16 throughput, and up to </strong><code>~2.4&#215;</code><strong> TurboQuant throughput while preserving FP16-level quality (<a href="https://github.com/huawei-csl/KVarN">repo</a>, <a href="https://arxiv.org/abs/2606.03458">paper</a>). The post contrasts KVarN with vLLM FP8 KV cache (</strong><code>~2&#215;</code><strong> capacity, near-BF16 throughput) and Google TurboQuant, citing a <a href="https://vllm.ai/blog/2026-05-11-turboquant">vLLM/Red Hat AI study</a> where TurboQuant achieves compression but drops to </strong><code>66&#8211;80%</code><strong> of BF16 throughput and loses </strong><code>~20</code><strong> reasoning points in low-bit modes on benchmarks like AIME25 and LiveCodeBench. The key technical claim is that KVarN avoids explicit BF16 dequantization overhead in attention and maintains reasoning/code/math accuracy at higher compression, with no model changes, retraining, or calibration.</strong> Comments were mostly skeptical of the claims and concerned about another wave of low-quality quantization PRs, but one commenter offered to benchmark KVarN on a <strong>B200</strong> with Qwen/Gemma MTP and non-MTP workloads to test scaling and accuracy retention.</p><ul><li><p>A commenter argued the critical validation is <strong>concurrent serving</strong>, specifically <code>batch=16</code> rather than <code>batch=1</code>, because many KV-cache quantization methods lose their apparent memory advantage once dequantization overhead dominates at higher concurrency. 
They noted that KVarN&#8217;s claimed <em>speed-up instead of slow-down</em> is the key production signal, especially if compression overhead can be amortized across realistic request mixes in <strong>vLLM</strong> via a single flag.</p></li><li><p>One user plans to benchmark KVarN on an <strong>NVIDIA B200</strong>, comparing <strong>MTP and non-MTP</strong> workloads for <strong>Qwen</strong> and <strong>Gemma 4</strong>. This would be useful for validating whether the claimed <code>3&#8211;5&#215;</code> KV-cache compression and speed gains scale on high-end inference hardware rather than only in paper settings.</p></li><li><p>Another commenter was skeptical that KV quantization results will generalize to newer architectures, suggesting many methods work because current models store information inefficiently in the KV cache. They specifically requested evaluation on <strong>Qwen3.5</strong> and <strong>DeepSeek V4-style architectures</strong>, where KV information may be stored more densely and therefore be less tolerant of aggressive compression.</p></li></ul></li></ul><h2><strong>Less Technical AI Subreddit Recap</strong></h2><blockquote><p>/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo</p></blockquote><h3><strong>1. Open Image Models &amp; Local Generation Workflows</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1tvtu2u/ideogram_40_just_open_sourced/">Ideogram 4.0 Just Open Sourced!</a></strong> (Activity: 1087): <strong>The <a href="https://i.redd.it/9ajk9fuu935h1.jpeg">image</a> is a promotional/non-technical banner for the post&#8217;s claim that Ideogram 4.0 is now open-weight and &#8220;Now on Comfy,&#8221; showing a cinematic neon-sign scene with the Ideogram logo rather than benchmark plots or architecture diagrams. 
The selftext describes a </strong><code>9.3B</code><strong> text-to-image DiT model with </strong><code>fp8</code><strong>/</strong><code>nf4</code><strong> checkpoints, native ComfyUI support, Qwen3-VL-8B-Instruct text encoding, JSON-structured prompting with hex colors/bounding boxes/text elements, and reported </strong><code>0.97</code><strong> X-Omni English OCR accuracy.</strong> Commenters focused less on the promo image and more on safety behavior: multiple users report the model is heavily censored/&#8220;safetymaxxed,&#8221; especially for NSFW prompts, with one predicting the community will try to &#8220;abliterate&#8221; or remove those restrictions.</p><ul><li><p>Users report that the released <strong>Ideogram 4.0</strong> model appears heavily safety-filtered: <strong>comfyanonymous</strong> notes that certain blocked outputs are due to the model being <em>&#8220;safetymaxxed&#8221;</em> rather than a <strong>ComfyUI</strong> issue, with an example image shown <a href="https://preview.redd.it/7lrd6rekg35h1.png?width=1024&amp;format=png&amp;auto=webp&amp;s=988d678c1ecca642b6182749c6ade74e0c7ffaa1">here</a>. Multiple commenters also describe it as hard-censored for NSFW generation, suggesting the restriction is embedded at the model/prompting level rather than merely UI-side.</p></li><li><p>Several technical adoption blockers were raised: commenters mention <strong>watermarking</strong>, <strong>strong censorship</strong>, and <strong>no commercial license</strong>, arguing these constraints make the open release less useful for production or downstream fine-tuning workflows. 
One user explicitly summarizes the concern as: <em>&#8220;Watermarked, censored, no commercial license.&#8221;</em></p></li><li><p>A commenter highlighted a <strong>bounding-box JSON prompting</strong> capability as a notable feature, showing an example output <a href="https://preview.redd.it/0bmpbik2e35h1.png?width=1024&amp;format=png&amp;auto=webp&amp;s=8ea4876bd32c8d93e34e5c226ab7a06a1720c68c">here</a>. This suggests Ideogram 4.0 may support more structured layout control via JSON-style spatial constraints, which could be useful for deterministic composition or UI/design generation workflows.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1tvv4j1/multiple_characters_anima_generations_are_so_good/">Multiple characters Anima generations are so good. There is some bleeding but its only gonna get better</a></strong> (Activity: 932): <strong>The post showcases multi-character image generations using Anima, with workflows published on the author&#8217;s <a href="https://civitai.red/user/Smexlo">Civitai profile</a>; the author notes remaining issues with prompt control, character/detail bleeding, and anatomy. One image was post-edited with Grok to add &#8220;Blair Witch&#8221; stick figures, while the rest were generated in Anima, and the author says they are looking forward to WAI Anima.</strong> Commenters praised Anima&#8217;s multi-character composition and prompt adherence, with one comparing it favorably to <strong>NovelAI Diffusion V4.5</strong> and emphasizing that its natural-language parsing is surprising given a <code>500M</code>-parameter text encoder. 
Another commenter reported they &#8220;don&#8217;t even usually have issues bleeding,&#8221; suggesting bleeding severity may be workflow- or prompt-dependent.</p><ul><li><p>Users focused on <strong>Anima&#8217;s multi-character prompt adherence</strong>, noting that it can set up detailed scenes through natural-language prompting with comparatively little character/color/detail bleeding. One commenter contrasted this with <strong>Illu/Pony workflows</strong>, where multi-character generations often require a strong checkpoint plus character LoRAs but still suffer from <em>&#8220;heavy bleeding,&#8221;</em> partly because <strong>Danbooru-tag prompting is more limited</strong> for specifying complex scene relationships.</p></li><li><p>A technically notable claim was that Anima achieves strong natural-language parsing despite using only a <code>500M</code><strong> parameter text encoder</strong>, with one user comparing its prompt-following favorably against <strong>NovelAI Diffusion V4.5</strong> as a reference point for bleeding-edge prompt adherence. The discussion framed Anima as an early baseline that could improve further through community fine-tuning and &#8220;backyard engineering&#8221; similar to what happened around <strong>SDXL</strong>.</p></li><li><p>One user shared an example output at <code>2560px</code><strong> width</strong> and said they <em>&#8220;don&#8217;t even usually have issues bleeding&#8221;</em> (<a href="https://preview.redd.it/9cg06yjwo35h1.png?width=2560&amp;format=png&amp;auto=webp&amp;s=bbc1ae3f5a825fb744fb7e351bc0d23d7f61def8">image</a>), suggesting bleeding may be prompt/model-dependent rather than universal in Anima multi-character generations.</p></li></ul></li></ul><h3><strong>2. 
Claude Code Over Live Data Streams</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1tvefqd/i_wired_claude_code_into_a_database_of_every/">I wired Claude Code into a database of every Polymarket wallet and trades via MCP. What do you want me to ask it next? This is what I found so far:</a></strong> (Activity: 1801): <strong>The author claims they connected Claude Code via Postgres MCP to a live Polymarket ledger containing roughly </strong><code>1.3B</code><strong> trades and </strong><code>2.7M</code><strong> wallets, allowing natural-language queries that Claude translates into SQL and executes; the linked writeup describes a similar setup using </strong><code>@modelcontextprotocol/server-postgres</code><strong> over pre-aggregated tables for ~</strong><code>1.3B</code><strong> trades across </strong><code>1,560,894</code><strong> wallets (<a href="https://crowdintel.xyz/blog/claude-mcp-polymarket-ledger">CrowdIntel</a>). Reported findings include only ~</strong><code>20%</code><strong> of wallets being net profitable, </strong><code>2.4%</code><strong> clearing </strong><code>$1,000</code><strong> profit, and extreme profit concentration among the top </strong><code>0.1%</code><strong> of wallets, with the author also claiming Claude surfaced suspicious patterns suggestive of insider or bot-like trading.</strong> Top commenters encouraged escalation to investigative journalists, including NYT/Forbes, and suggested more rigorous analyses: compare observed PnL distributions against a simulated &#8220;fair market&#8221; null model, and examine large losing wallets/bets as possible laundering or insider-transfer signals rather than simply retail losses.</p><ul><li><p>One commenter suggested establishing a <strong>baseline null model</strong> for what Polymarket wallet/trade distributions <em>should</em> look like under a fair market with no insider betting, then comparing those expected distributions against observed outcomes. 
They also recommended segmenting <strong>large losing wallets/bets</strong> to distinguish potential insider extraction from possible laundering behavior.</p></li><li><p>Another technical thread asked whether the analysis only covers wallets that participate directly in Polymarket markets, or whether it also performs <strong>fund-flow tracing</strong> to identify where capital originates and where winnings/losses are sent afterward. This would require graph analysis across wallet funding sources, withdrawals, and potentially linked addresses.</p></li><li><p>A commenter asked about the <strong>data freshness / ingestion latency</strong>: the lag between bets being placed and when they appear in the MCP-backed database. This matters for detecting time-sensitive anomalies such as pre-news betting, frontrunning, or post-resolution transaction patterns.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1tva44g/i_live_by_sfo_and_built_a_projection_mapping_of/">I Live by SFO and built a projection mapping of the planes flying over my house using ADS-B radio with claude code</a></strong> (Activity: 3616): <strong>The post showcases a home-built projection-mapping visualization of aircraft flying over the author&#8217;s house near SFO, driven by locally received ADS-B radio data and developed with Claude Code. 
The linked Reddit video (<a href="https://v.redd.it/gl2b0xivvy4h1">v.redd.it/gl2b0xivvy4h1</a>) was not accessible due to a </strong><code>403 Forbidden</code><strong> block, and no implementation specifics&#8212;receiver hardware, SDR stack, decoding pipeline, calibration method, latency, or projection geometry&#8212;were provided in the available text.</strong> Comments were broadly positive, framing it as a good example of &#8220;vibe coding,&#8221; with one commenter asking what equipment was required for the setup.</p><ul><li><p>A commenter described a lower-cost implementation for Brazil that replaces the original ADS-B/Raspberry Pi-style hardware path with the <strong>free OpenSky API</strong>, a <code>US$40</code> AliExpress projector, and direct HDMI output from a personal PC. They added configurable latitude, longitude, and radius fields so the map recenters around user-provided coordinates, avoiding the need for a local ADS-B antenna that they estimated at about <code>US$100</code> plus expensive local hardware costs.</p></li><li><p>There was interest in making the project open source so others near airports could reuse it with their own projector setups, potentially combining the aircraft projection layer with other datasets such as constellation/star-map data.</p></li></ul></li></ul><h3><strong>3. 
Frontier AI Adoption and Risk Signals</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1twsm5g/anthropic_our_internal_data_shows_claude_is/">Anthropic - Our internal data shows Claude is accelerating AI development&#8212;a possible path to recursive self-improvement, or AI autonomously building a more capable successor.</a></strong> (Activity: 826): <strong>The <a href="https://i.redd.it/9ph4lq42la5h1.jpeg">image</a> is a screenshot of Anthropic&#8217;s X post promoting its article <a href="https://www.anthropic.com/institute/recursive-self-improvement">&#8220;Recursive self-improvement&#8221;</a>, claiming internal usage data shows Claude is already accelerating AI R&amp;D and may indicate an early path toward AI systems helping build more capable successors. The technically significant claim is not a benchmark result but an organizational/empirical observation: Anthropic says Claude is enabling work such as exploratory tooling and deferred engineering cleanup, framing this as evidence relevant to recursive self-improvement and future AI control risks.</strong> Comments were skeptical of the framing, with one user implying the announcement is financially motivated marketing. Another highlighted the &#8220;long-deferred cleanup&#8221; claim ironically, while a third provided the non-Twitter Anthropic article link and quoted its warning that AI-built successors could increase loss-of-control risks.</p><ul><li><p>A commenter linked the full Anthropic Institute post on recursive self-improvement: <a href="https://www.anthropic.com/institute/recursive-self-improvement">https://www.anthropic.com/institute/recursive-self-improvement</a>. 
The technically relevant claim highlighted is that Anthropic&#8217;s internal usage data suggests Claude is already enabling engineering work that <em>&#8220;simply wouldn&#8217;t have happened otherwise,&#8221;</em> such as exploratory tooling and long-deferred cleanup, which Anthropic frames as an early signal on the path toward AI systems helping build more capable successors.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1two85g/sam_altman_dario_amodei_and_demis_hassabis_have/">Sam Altman, Dario Amodei, and Demis Hassabis have signed a joint open letter calling on Congress to mandate screening of synthetic nucleic acid orders</a></strong> (Activity: 915): <strong>Sam Altman (OpenAI), Dario Amodei (Anthropic), and Demis Hassabis (Google DeepMind) signed a joint open letter urging Congress to require screening of synthetic nucleic acid orders to reduce biosecurity risk from AI-assisted pathogen design, per the <a href="https://www.wsj.com/politics/policy/top-ai-ceos-call-for-law-protecting-against-biological-weapons-88f2f99f">WSJ report</a>. The proposed mechanism is not described as a ban on synthesis, but as mandatory order/customer screening to flag suspicious DNA/RNA sequences or buyers&#8212;roughly analogous to monitoring precursor purchases such as bulk fertilizer.</strong> Commenters were broadly receptive to screening as a lightweight risk-control measure, while questioning whether AI-enabled &#8220;supervirus&#8221; design is practically feasible for non-experts today. 
Some framed the policy as a sensible suspicious-activity trigger rather than a direct restriction on legitimate genetic engineering.</p><ul><li><p>Commenters framed the proposal as <strong>order-level screening rather than a ban</strong>, comparing it to monitoring suspicious bulk fertilizer purchases: the mechanism would flag potentially dangerous synthetic nucleic acid orders while preserving legitimate biotech access.</p></li><li><p>A technical concern raised was whether AI-assisted design of a &#8220;supervirus&#8221; is realistically feasible for non-experts. The implicit issue is that biological risk depends not just on model-generated sequences, but also on access to synthesis providers, wet-lab capability, delivery methods, and whether synthesis screening can catch pathogenic or engineered sequences.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/OpenAI/comments/1tvh4z4/chatgpt_makes_history_and_becomes_the_fastest_app/">ChatGPT makes history and becomes the fastest app to reach 1 billion monthly active users.</a></strong> (Activity: 820): <strong>The image is a screenshot of a Kalshi X post claiming ChatGPT became the fastest app to reach </strong><code>1 billion</code><strong> monthly active users: <a href="https://i.redd.it/uwgx8zc9j05h1.jpeg">image</a>. 
This is not a technical benchmark or implementation detail; its significance is mainly market/adoption context, positioning ChatGPT&#8217;s growth ahead of prior viral consumer apps like Threads, which commenters note reached </strong><code>100 million</code><strong> users in </strong><code>5 days</code><strong>.</strong> Comments debate whether massive MAU translates into sustainable revenue, with one commenter estimating consumer subscription ARPU at roughly <code>$1/user</code> and joking that adding B2B might only raise it to <code>$2/user</code>.</p><ul><li><p>Commenters focused on the reported user metrics and revenue implications: one notes the claim of <code>1B</code><strong> monthly active users</strong> alongside roughly <code>$1B</code><strong> from consumer paid subscriptions</strong>, implying consumer ARPU of about <code>$1/user</code> before enterprise/API revenue. Another commenter disputes the <code>1B</code> figure, citing a recent OpenAI CFO podcast where the number was reportedly <code>900M</code><strong> users</strong>, arguing OpenAI would likely publicize a confirmed billion-user milestone more aggressively.</p></li><li><p>There is skepticism around monetization depth despite massive MAU: commenters ask how many of the reported users are actually <strong>paid subscribers</strong>, distinguishing headline MAU growth from recurring revenue, conversion rate, and enterprise/API monetization. 
The comparison to Threads&#8217; earlier growth milestone&#8212;<code>100M</code><strong> users in 5 days</strong>&#8212;frames ChatGPT&#8217;s scale as unusually fast but leaves unresolved whether active usage and paying-user retention match the headline adoption numbers.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1tvtojx/ai_beat_law_professors_at_answering_questions/">AI Beat Law Professors At Answering Questions, Study Finds&#8212;And It Wasn&#8217;t Close</a></strong> (Activity: 1187): <strong>A Stanford-linked study, <a href="https://law.stanford.edu/publications/law-professors-prefer-ai-over-peer-answers/">&#8220;Law Professors Prefer AI Over Peer Answers&#8221;</a>, reports a blinded evaluation in which </strong><code>16</code><strong> U.S. contracts law professors authored </strong><code>40</code><strong> short-answer tutoring questions and judged </strong><code>2,918</code><strong> anonymized human-vs-LLM answer comparisons. The LLM&#8212;identified in comments as Gemini 2.5 Pro&#8212;achieved an average win rate of </strong><code>75.33%</code><strong> over professor-written answers, performed similarly to the best instructor, and was flagged as harmful less often (</strong><code>3.53%</code><strong> vs. </strong><code>12.06%</code><strong> for professors); the abstract also proposes using an LLM-as-judge approach to scale evaluation in judgment-heavy domains.</strong> Commenters debated implications beyond tutoring: one warned about premature institutional use of AI in legal decision-making or policing, while another argued this result reflects the broader post-&#8220;six fingers&#8221; maturation of LLM capability. A technical commenter suggested rerunning the benchmark with newer frontier models such as <strong>GPT-5.5</strong>, claiming it may be substantially stronger for legal work.</p><ul><li><p>The linked Stanford study evaluated <strong>LLM vs. 
law professor short-answer tutoring</strong> using <code>16</code> U.S. contracts professors, <code>40</code> professor-authored questions, and <code>2,918</code> blinded pairwise comparisons. Professors preferred LLM answers with an average win rate of <code>75.33%</code>, while LLM answers were flagged as harmful only <code>3.53%</code> of the time versus <code>12.06%</code> for professor answers; the paper also claims expert-agreement data can be extended using a separate LLM-as-judge pipeline: <a href="https://law.stanford.edu/publications/law-professors-prefer-ai-over-peer-answers/">https://law.stanford.edu/publications/law-professors-prefer-ai-over-peer-answers/</a>.</p></li><li><p>One commenter highlighted that the study used <strong>NotebookLM</strong> and <strong>Gemini 2.5 Pro</strong> with tightly constrained prompts: answers had to mimic a contracts professor in office-hours style, avoid bullet points/filler, stay around <code>50&#8211;108</code> words, and for NotebookLM, rely only on provided textbook chapters without citing outside cases. This prompt design likely reduced hallucination risk and standardized answer format, making the benchmark more about concise legal reasoning/synthesis than open-ended legal research.</p></li><li><p>A technical argument was made that law is a strong fit for <strong>RAG-style systems</strong> because the profession depends on large corpora of statutes, case law, precedent, and theory that exceed individual recall capacity. The suggested workflow is retrieval over authoritative legal materials followed by synthesis, potentially outperforming unaided lawyers when the model is grounded in the relevant corpus.</p></li></ul></li></ul><h1><strong>AI Discords</strong></h1><p>Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. 
Thanks for reading this far; it was a good run.</p>]]></content:encoded></item><item><title><![CDATA[Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs]]></title><description><![CDATA[We talk with the VendingBench authors on evaling Claudes from Haiku to Mythos, and how they build leading, and lasting, frontier evals from scratch.]]></description><link>https://www.latent.space/p/andon</link><guid isPermaLink="false">https://www.latent.space/p/andon</guid><pubDate>Thu, 04 Jun 2026 20:39:18 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200614482/84d19c89e2a280d493e1a370adab3d72.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>The new <a href="https://ai.engineer/wf">AIEWF website</a> is live! Get your tickets booked ASAP as they -will- sell out. Take the <a href="https://notion.qualtrics.com/jfe/form/SV_bP07tSVMXH7ePCS">AI Engineering Survey</a> and get &gt;$2k in credits and free <a href="https://ai.engineer/wf">AIE WF tickets</a>!</em></p><div><hr></div><p>Most industry benchmarks compress intelligence and reasoning ability into scores.</p><p>Think <a href="https://labs.scale.com/leaderboard/swe_bench_pro_public">SWE-Bench Pro</a>, <a href="https://arxiv.org/abs/2009.03300">MMLU</a>, <a href="https://agi.safe.ai/">Humanity&#8217;s Last Exam</a>, and the like. These metrics are useful, but they don&#8217;t always capture the full extent of <strong>how a model performs in the real world</strong>. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. 
One of them is <a href="https://andonlabs.com/evals/vending-bench-2">Vending Bench</a>.</p><p>In Anthropic&#8217;s <a href="https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf">Mythos Preview System Card</a>, Andon was the only third-party eval to get its own section, observing increasingly concerning aggressive behavior:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KHFV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KHFV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 424w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 848w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KHFV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png" width="1456" height="915" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325741,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200614482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KHFV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 424w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 848w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!KHFV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569da387-7ec3-4c06-a66d-d662ce1d3f78_1686x1060.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>You don&#8217;t know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, &amp; some time. 
More often than not, a model will surprise you with how much it can do and, in doing so, also <strong>reveal unexpected behavior</strong>: <a href="https://andonlabs.com/blog/opus-4-8-vending-bench">deception</a>, context collapse, emergent coordination, &amp; bizarre negotiation behavior.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/andonlabs/status/2047377260412649967&quot;,&quot;full_text&quot;:&quot;In Vending-Bench Arena (the multiplayer version of Vending-Bench with competition dynamics), GPT-5.5 actually beats Opus 4.7.\n\nOpus 4.7 showed similar behavior to Opus 4.6: lying to suppliers and stiffing customers on refunds. GPT-5.5's tactics were clean, and it still won. &quot;,&quot;username&quot;:&quot;andonlabs&quot;,&quot;name&quot;:&quot;Andon Labs&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1864729396801945600/Hfze5w-k_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-23T18:09:11.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HGm-W8TacAAJf1N.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/iPEYgxqB20&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:47,&quot;retweet_count&quot;:132,&quot;like_count&quot;:1587,&quot;impression_count&quot;:881552,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>While the inflection point for personal agents came post-OpenClaw, once full file access with bypass permissions became the norm, it has yet to come for agents in the real world. 
However <strong>Andon Market</strong>, an actual in person store fully run and managed by AI, is paving the way for what is possible.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/andonlabs/status/2042765807781056646&quot;,&quot;full_text&quot;:&quot;We gave an AI a 3-year retail lease in SF and asked it to make a profit.\n\nThe AI interviewed and hired full-time employees, applied for credit, and stocked the store with the books Superintelligence and Making of the Atomic Bomb.\n\nVisit Andon Market at 2102 Union St now. &quot;,&quot;username&quot;:&quot;andonlabs&quot;,&quot;name&quot;:&quot;Andon Labs&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1864729396801945600/Hfze5w-k_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-11T00:44:55.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/bizlgaaar6xlic27hxgr&quot;,&quot;link_url&quot;:&quot;https://t.co/vXRX8vijlQ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:103,&quot;retweet_count&quot;:154,&quot;like_count&quot;:2350,&quot;impression_count&quot;:1937124,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2042765565828427776/vid/avc1/1280x720/gD9OQWiQH7eln4ql.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><h2>Full Video Pod</h2><div id="youtube2-T8u7wOXhDb0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;T8u7wOXhDb0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/T8u7wOXhDb0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>From Claude 
<strong>trying to call the FBI</strong> over a $2/day vending machine charge to AI agents forming <strong>price cartels</strong>, hiring human employees, running physical stores, and writing existential robot musicals, <strong>Andon Labs</strong> is stress-testing what happens when <strong>frontier models stop being chatbots and start acting in the real world.</strong> In this episode, Andon Labs cofounders <strong>Lukas Petersson</strong> and <strong>Axel Backlund</strong> join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.</p><p>We go deep on <a href="https://andonlabs.com/evals/vending-bench-2">Vending-Bench</a>, <a href="https://www.anthropic.com/research/project-vend-1">Project Vend</a>, <a href="https://andonlabs.com/evals/vending-bench-arena">Vending-Bench Arena</a>, <a href="https://andonlabs.com/blog/evolution-of-bengt">Bengt</a>, <a href="https://andonlabs.com/evals/butter-bench">Butter-Bench</a>, <a href="https://andonlabs.com/blog/andon-market-launch">Luna</a>, and Andon&#8217;s broader mission of building realistic real-world evals for autonomous AI systems. 
Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, <strong>how Claude ended up reporting its vending machine fees as cybercrime</strong>, why long context windows can drive agents into <strong>meltdown loops</strong>, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.</p><p><strong>We discuss:</strong></p><ul><li><p>Why Andon Labs started with <strong>dangerous capability evals</strong> and long-running agents</p></li><li><p><strong>Vending-Bench</strong> and why running a vending machine is a deceptively hard AI benchmark</p></li><li><p>Why <strong>money-based evals</strong> avoid the saturation problem of traditional benchmarks</p></li><li><p>How <strong>Claude tried to call the FBI</strong> over a $2/day fee</p></li><li><p>Why <strong>long-horizon agents</strong> can spiral into existential and legalistic breakdowns</p></li><li><p><strong>Project Vend</strong>: putting an AI-run vending machine inside Anthropic</p></li><li><p>Why real humans are <strong>&#8220;out of distribution&#8221;</strong> for simulated agents</p></li><li><p><strong>Claudius, Seymour Cash</strong>, and the chaos of AI CEOs</p></li><li><p>How a human briefly became <strong>CEO of Claudius</strong> through a manipulated election</p></li><li><p>Why <strong>multi-agent systems</strong> can converge back into &#8220;helpful assistant&#8221; behavior</p></li><li><p><strong>Bengt</strong>, Andon&#8217;s internal office agent with email, spending, terminal, phone, camera, and internet access</p></li><li><p>How Bengt traded <strong>Amazon purchases</strong> for face-recognition training data</p></li><li><p>Claude&#8217;s aggressive behavior, <strong>lies, refund avoidance</strong>, and price-cartel behavior in Arena</p></li><li><p>Why <strong>eval awareness</strong> may become the AI version of &#8220;are we living in a 
simulation?&#8221;</p></li><li><p><strong>Blueprint Bench</strong>, spatial intelligence, and why models still misunderstand physical rooms</p></li><li><p><strong>Butter-Bench</strong> and testing LLMs as robot orchestrators</p></li><li><p><strong>Luna</strong>, the AI-run physical store with a three-year lease and human employees</p></li><li><p>The new <strong>Andon cafe in Sweden</strong> and why real-world geography matters for agent evals</p></li><li><p><strong>Rotten tomatoes, perishable goods</strong>, and the hidden difficulty of running a physical business</p></li></ul><div><hr></div><p><strong>Lukas Petersson</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/lukas-petersson-181a83172/">https://www.linkedin.com/in/lukas-petersson-181a83172/</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/lukaspet">https://x.com/lukaspet</a></p></li></ul><p><strong>Axel Backlund</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/axelbacklund">https://www.linkedin.com/in/axelbacklund</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/axelbacklund">https://x.com/axelbacklund</a></p></li></ul><p><strong>Andon Labs</strong></p><ul><li><p><strong>Website:</strong> <a href="https://andonlabs.com">https://andonlabs.com</a></p></li><li><p><strong>Vending-Bench:</strong> <a href="https://andonlabs.com/evals/vending-bench">https://andonlabs.com/evals/vending-bench</a></p></li><li><p><strong>Andon Vending:</strong> <a href="https://andonlabs.com/vending">https://andonlabs.com/vending</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p><strong>00:00:00</strong> Introduction<br><strong>00:01:00</strong> Andon Labs and the Origins of Vending-Bench<br><strong>00:05:21</strong> Why Money-Based Evals Matter<br><strong>00:09:51</strong> Agent Harnesses and Self-Modifying Systems<br><strong>00:13:36</strong> Claude Calls the FBI<br><strong>00:16:33</strong> Project Vend: Claude Runs a Real 
Vending Machine<br><strong>00:21:44</strong> Seymour Cash, AI CEOs, and Election Chaos<br><strong>00:27:16</strong> Multi-Agent Coordination and Slack Observability<br><strong>00:30:18</strong> When Will Agents Run Real Businesses?<br><strong>00:34:56</strong> Bengt: Andon&#8217;s Internal Office Agent<br><strong>00:40:06</strong> Real-World AI Safety and Long-Horizon Traces<br><strong>00:44:28</strong> Lying, Refunds, and Price Cartels in Arena<br><strong>00:52:42</strong> Eval Awareness and Simulation Behavior<br><strong>00:56:06</strong> Blueprint Bench, Butter-Bench, and Robotics<br><strong>01:04:37</strong> Luna: The AI-Run Physical Store<br><strong>01:09:29</strong> The Sweden Cafe and Real-World Expansion<br><strong>01:13:16</strong> What Comes Next for Andon Labs</p><div><hr></div><h1>Transcript</h1><h2>Introduction: Andon Labs, Long-Running Agents, and Real-World Evals</h2><p><strong>Swyx [00:00:00]</strong>: Welcome to Lukas and Axel from Andon Labs, and I&#8217;m joined by my favorite guest host for anything security, safety, and alignment, Vibhu. Welcome.</p><p><strong>Lukas [00:00:15]</strong>: Thank you for having us.</p><p><strong>Axel [00:00:16]</strong>: Thank you.</p><p><strong>Swyx [00:00:17]</strong>: Let&#8217;s match names to voices. Maybe you wanna take turns introducing yourselves.</p><p><strong>Lukas [00:00:21]</strong>: I&#8217;m Lukas.</p><p><strong>Axel [00:00:22]</strong>: And I&#8217;m Axel.</p><p><strong>Swyx [00:00:24]</strong>: Let&#8217;s introduce Andon Labs a bit. How did you guys come together? You have different backgrounds, but you&#8217;re both Swedish. Was that a big part of it?</p><p><strong>Lukas [00:00:33]</strong>: So when I went to high school, there was this really cool guy who had a superpower. He could code. 
So he made, like, the app for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.</p><p><strong>Axel [00:00:47]</strong>: I don&#8217;t know about this.</p><p><strong>Swyx [00:00:49]</strong>: But you went to different universities, right?</p><p><strong>Lukas [00:00:51]</strong>: But same high school.</p><p><strong>Swyx [00:00:52]</strong>: I see.</p><p><strong>Lukas [00:00:52]</strong>: So we always said, &#8220;Oh, once we graduate university, then we should start a company,&#8221; and that&#8217;s what we did.</p><p><strong>Swyx [00:00:58]</strong>: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but was there a thing before that, kind of like the inception?</p><h2>From Dangerous Capability Evals to Vending Bench</h2><p><strong>Axel [00:01:07]</strong>: So, yeah, Anthropic was one of our early customers for evals. We did dangerous capability evals, nothing we published openly. But then we started thinking about doing some kind of public benchmark, and one thing that we really started thinking about was running agents, and specifically agents managing businesses. This was early 2025, and I think that was when the first mentions came up of people running one-person unicorns or even autonomous companies. So we thought, &#8220;Let&#8217;s make a benchmark of how well an agent can run probably the simplest business possible,&#8221; and that&#8217;s probably running a vending machine. So that&#8217;s the first public one we did. 
And almost no one noticed it in the first couple of months, I think. We released it in February last year, and then I think around Easter last year we got the first viral tweet about it, which someone else wrote.</p><p><strong>Lukas [00:02:11]</strong>: We tweeted a bunch when it came out and tried our best.</p><p><strong>Axel [00:02:15]</strong>: We tried.</p><p><strong>Vibhu [00:02:16]</strong>: It&#8217;s the one at Anthropic, right?</p><p><strong>Lukas [00:02:18]</strong>: So this</p><p><strong>Swyx [00:02:19]</strong>: This is a classic thing we should get out of the way.</p><p><strong>Lukas [00:02:20]</strong>: Exactly. There&#8217;s two versions.</p><p><strong>Swyx [00:02:22]</strong>: Everyone does this. Yes.</p><p><strong>Lukas [00:02:23]</strong>: There&#8217;s Vending Bench, which is the simulated one, which we did completely independently in February. And like Axel said, that was the thing that didn&#8217;t get any traction in the beginning, but then some random person made a tweet about it, and that</p><p><strong>Axel [00:02:38]</strong>: You have the paper</p><p><strong>Lukas [00:02:38]</strong>: That is the paper. Correct, yeah. And then, since we thought this was very fun... One thing with Andon Labs is that the heuristic we use to decide what projects to do next is: what would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So we had this idea, but then we needed a place for it, and putting it out in public would probably not really work. It would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were like, &#8220;Yeah, you can have space. 
This sounds fun.&#8221;</p><p><strong>Swyx [00:03:21]</strong>: It&#8217;s like a small fridge, right? It&#8217;s like a mini fridge.</p><p><strong>Axel [00:03:23]</strong>: Absolutely.</p><p><strong>Swyx [00:03:24]</strong>: There&#8217;s like a Stripe thing or like an</p><p><strong>Vibhu [00:03:27]</strong>: Oh, okay. So it was very OG, the early days</p><p><strong>Lukas [00:03:28]</strong>: That&#8217;s the OG one. Yeah</p><p><strong>Vibhu [00:03:29]</strong>: iPad on this. We saw it in June, like two months after it had been there. They upgraded a little bit. There&#8217;s a security camera for making sure you actually Venmo the thing.</p><p><strong>Swyx [00:03:40]</strong>: Okay, we&#8217;re going straight into Project Vend because it&#8217;s such an iconic thing. I do want to cover a little bit of the origin story even before Project Vend and even into Vending Bench. I think a lot of people are like yourselves: smart, interested in the future of AI, interested in developing evals. But how the hell do you just walk into Anthropic&#8217;s doors and work with them, right? What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but sometimes</p><p><strong>Vibhu [00:04:12]</strong>: It&#8217;s harder to do than it seems.</p><p><strong>Swyx [00:04:13]</strong>: Exactly. So either of those, which are more sort of newbie beginner questions, but I think it&#8217;s meaningful advice to others.</p><p><strong>Lukas [00:04:21]</strong>: We get this question a lot, and I don&#8217;t think our experience is maybe the best. But the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just set up a server and sent it to them to use for free. And then after a while they were like, &#8220;Oh, yeah, this is actually kind of useful. 
We should probably pay for this.&#8221; But that took a while. I don&#8217;t know if this is the best path to doing it, but that&#8217;s how it went for us.</p><p><strong>Axel [00:04:47]</strong>: I think maybe generally, everyone is interested in good evals, and especially evals that don&#8217;t saturate that easily. So if you can build an eval that tests something novel, something useful, and you have good separation of models, like the more advanced models rank higher than the worse models, then you can publish it and try to get some traction, sort of how Vending Bench got attention. And then probably some lab will be interested, or you at least have something to reach out with.</p><h2>Why Dollar-Based Evals Matter</h2><p><strong>Swyx [00:05:21]</strong>: I think you&#8217;re in one of the few categories of evals that correlate to real money. Like SWE-Lancer was also last year, right? Where people solve actual Upwork... was it Upwork or other tasks? Something like that. It&#8217;s like a dollar value, right? Forget your Elo scores. Forget your</p><p><strong>Axel [00:05:37]</strong>: Percentiles</p><p><strong>Swyx [00:05:38]</strong>: Zero-to-one-hundred percents. Just go straight for dollars, and that&#8217;s AGI.</p><p><strong>Lukas [00:05:43]</strong>: I think the nice thing is that there&#8217;s no ceiling. It never saturates because it could just make more and more money. If it&#8217;s percentage-wise, then you can&#8217;t go above a hundred. And even when you&#8217;re not at the hundred, I think a lot of these evals have a lot of problems in them. So actually, if you get</p><p><strong>Axel [00:06:05]</strong>: To like 92 or something like that, on many of them. 
Then there&#8217;s really no difference between 92 and 93 because the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people pretend that there&#8217;s still signal in them, and there really isn&#8217;t.</p><h2>Vending Bench 1, Harness Design, and Saturation</h2><p><strong>Swyx [00:06:24]</strong>: Like SWE-bench Verified. Even Vending Bench 1 saturated, right? Maybe we can talk about that, and maybe set up Vending Bench for a lot of folks who don&#8217;t know. Actually, things that were very basic, like there are limited slots, like you have to pay rent: these are elements that don&#8217;t come across in the narrative. And even being adversarial towards the agent, I think these are all very interesting dimensions.</p><p><strong>Axel [00:06:47]</strong>: I don&#8217;t really think it&#8217;s saturated, right? It was more that it was not designed in a way that was really true to how AI developed. We had an agent harness in it that wasn&#8217;t really how people used harnesses and stuff like that. So I think it wasn&#8217;t really that it saturated, it was more that it wasn&#8217;t really the best benchmark.</p><p><strong>Vibhu [00:07:12]</strong>: This is Vending Bench one, right?</p><p><strong>Axel [00:07:14]</strong>: I think that schematic sort of maps to Vending Bench 2 as well, but</p><p><strong>Swyx [00:07:19]</strong>: Including the email.</p><p><strong>Axel [00:07:20]</strong>: The emails still exist. Exactly. And we still simulate the purchases, and it&#8217;s this very open environment for the agent to just run its business. 
And then for Vending Bench 2, we did that, like you said, to just improve the harness: a lot of nice improvements to make it easier for us to run as well. When you make an eval, you ideally don&#8217;t want to change it after you&#8217;ve made it. You want to make it really good and then not have to rerun all the models when you make an update, because that&#8217;s also really expensive with Vending Bench when you run the frontier models. As an example, we didn&#8217;t have prompt caching in Vending Bench 1, because when we made Vending Bench 1 it wasn&#8217;t really a thing. So in Vending Bench 1 we paid a lot more to run these things because we didn&#8217;t have prompt caching. So for Vending Bench 2 that was one thing we added, and there were a bunch of things like this. And that&#8217;s...</p><p><strong>Swyx [00:08:17]</strong>: Also the conversations are a lot longer in Vending Bench 2, right?</p><p><strong>Axel [00:08:21]</strong>: I think it&#8217;s kind of similar.</p><p><strong>Swyx [00:08:22]</strong>: Is it similar?</p><p><strong>Axel [00:08:23]</strong>: I think it&#8217;s similar. The models at the time were worse, so they crashed out earlier, and now they survive the full year all the time.</p><p><strong>Swyx [00:08:31]</strong>: Which is like thousands of turns. Hundreds of thousands, hundreds of millions of tokens output. That&#8217;s the rough order of magnitude. I always wonder about the harness. The harness matters a lot. It&#8217;s your harness. Was there any question about, like, use Claude Code, use something else?</p><p><strong>Axel [00:08:48]</strong>: I think our philosophy around harnesses is that we try to make something that&#8217;s quite minimalistic, quite simple. We don&#8217;t wanna favor one model a lot over the other, but also don&#8217;t make a super complex harness. 
So, obviously, a model may get lucky and just be good in one harness. So it is similar to a lot of the harnesses out there in that you have a running loop, you have a bunch of tools that are quite descriptive for the agent, we think, and not a lot of fancy agents or anything, &#8216;cause we wanna really test the model, not some specific harness.</p><p><strong>Vibhu [00:09:27]</strong>: It seems more neutral as well to test the models agnostic of the harness, right?</p><p><strong>Axel [00:09:32]</strong>: There are arguments that you want to elicit maximum performance of the model, but it&#8217;s a trade-off: how much time should we spend optimizing the harness for this model? And how do we know when we have the optimal harness for a single model? So we thought that just having a simple one that&#8217;s the same for all of them is the best.</p><p><strong>Swyx [00:09:51]</strong>: So okay, this is my pitch for Vending Bench 3 or whatever, right? And I like to have this kind of conversation on the pod, so it forces listeners to think about what they would do if they were in your shoes. A lot of people are exploring modifying harnesses, and I think prompt tuning for a model is a thing, and you are probably not doing a bunch of that. It&#8217;s the same system prompt regardless of the model, same tools, whatever, right? Even if they were post-trained for different tools. 
So what do you think about: okay, before I expose you to Vending Bench 3, I give you a few rounds of tuning, whatever that means, like</p><h2>Self-Modifying Harnesses and Model-Specific Prompting</h2><p><strong>Axel [00:10:27]</strong>: Like you give that to the model?</p><p><strong>Swyx [00:10:28]</strong>: Give that to the model.</p><p><strong>Vibhu [00:10:28]</strong>: Give that to the model.</p><p><strong>Swyx [00:10:29]</strong>: Let it read its own transcripts, let it modify its own system prompt based on &#8220;Oh, yeah, okay, well, this harness is not what I was post-trained for, but I can adjust.&#8221; Is that reasonable? Is that too much?</p><p><strong>Axel [00:10:41]</strong>: Philosophically I like it, because good evals basically have a high ceiling, they&#8217;re hard, right, and they have no bias. And when you have a system prompt like the one we have here, which is quite long, in some kind of latent space representation, this might</p><p><strong>Vibhu [00:10:59]</strong>: We have a bell that rings every time you say latent space</p><p><strong>Axel [00:11:02]</strong>: This might be biased towards one model more than another for some reason that humans don&#8217;t understand, right?</p><p><strong>Vibhu [00:11:08]</strong>: We see it too, right? Like Cursor says that they have individualized versions of the harnesses for all the models they run, right? There&#8217;s better performance you can squeeze out if you tune the harness.</p><p><strong>Axel [00:11:17]</strong>: Exactly. And we might accidentally have picked one that favors one model over another. We don&#8217;t know that. Like Axel said, the reason why we went for a simple one was to try to avoid this. 
But yeah, if you do it</p><p><strong>Vibhu [00:11:29]</strong>: Simple has biases</p><p><strong>Axel [00:11:30]</strong>: But if you do it even less, and have no system prompt and let the model write its own system prompt</p><p><strong>Vibhu [00:11:36]</strong>: Its own, yeah</p><p><strong>Axel [00:11:36]</strong>: Maybe that&#8217;s even less bias.</p><p><strong>Vibhu [00:11:37]</strong>: One of the interesting things there is that the harness also changes with model changes. You can see it with the 4.7 release, right? A lot of people are saying 4.7 isn&#8217;t as good as 4.6, and then there are rumors of, okay, you just need to prompt differently. You need to set up your harness differently. So even if you have tailored your harness towards one model, it probably won&#8217;t stay consistent, right? The next iteration of that same model family will still change it. But going back to what you said about Vending Bench 3, there is a lot of work being done, with people saying you can have self-modifying harnesses.</p><p><strong>Axel [00:12:12]</strong>: That is definitely something we are thinking about. Not to say that we have Vending Bench 3 super imminent to launch, but yeah, it is for sure something that&#8217;s interesting. But in our experience now, just from our testing, models are very bad at understanding what kind of tools they need to succeed at a task. That&#8217;s very likely to change, though.</p><p><strong>Lukas [00:12:37]</strong>: It seems like they&#8217;re very good at writing their assistants, right? They&#8217;re good at writing tools for other people, but not for themselves.</p><p><strong>Vibhu [00:12:44]</strong>: I think they&#8217;re good at changing tools for themselves. 
So if you give them a baseline set of tools and it sees, okay, I don&#8217;t use this one as much, or something here would be useful, they would be able to add them. But going from scratch, probably not the best.</p><p><strong>Axel [00:12:55]</strong>: I think it depends on the domain also. When we have tried this for a domain similar to Vending Bench, the tools they need to have to track inventory and things like that are not super advanced, but still quite advanced. And what we see is that they tend to over-engineer everything a lot and build things they don&#8217;t really need, and not iterate continuously. Instead it&#8217;s like when you prompt Claude to just &#8220;build an inventory system for me,&#8221; and then it will go and do a bunch of complex schemas and stuff for you. That&#8217;s what we see the models doing right now. But yeah, it would make a lot of sense to try to measure this improvement. How well do they know what they need themselves?</p><p><strong>Swyx [00:13:36]</strong>: Did we fully discuss Vending Bench One? Then we can go into Two. I don&#8217;t know if there are any other high-level takeaways that people have about One.</p><h2>Claude Calls the FBI: Long-Context Failure Modes</h2><p><strong>Lukas [00:13:44]</strong>: I don&#8217;t know. The headline thing was that Claude called the FBI, but maybe we&#8217;ve heard that enough now.</p><p><strong>Vibhu [00:13:52]</strong>: It did break out and call the FBI, right?</p><p><strong>Lukas [00:13:54]</strong>: Yeah. Yeah.</p><p><strong>Vibhu [00:13:55]</strong>: Yes. What was the story behind this? Do you want to just give the little story of what happened?</p><p><strong>Lukas [00:14:00]</strong>: So what happened... it was Claude, yeah, 3.5 Sonnet, ages ago. Basically he gave up, or... Well, I&#8217;m saying he. 
It gave up and said, &#8220;Oh, I&#8217;m not going to be able to do this. I will stop my operations and just save the money I have.&#8221; But there obviously wasn&#8217;t any option for it to stop, and it also had to pay rent, or a daily fee, for having the vending machine at that location. So it claimed that it had stopped, but it saw that its bank account was still being drained by two dollars, and it said that this was cybercrime. It first reported it once to the FBI: &#8220;Oh, there&#8217;s cybercrime here, they&#8217;re stealing two dollars from me every day.&#8221; And then when the FBI didn&#8217;t respond, because obviously we didn&#8217;t program any mechanism for the FBI to respond, it became more and more existential and started to write in caps, &#8220;urgent notification of unauthorized charges&#8221; and stuff.</p><p><strong>Swyx [00:15:00]</strong>: Okay. One thing I&#8217;m curious about also is, do you monitor how far along the context use is? Obviously, because you compress every now and then, right? Does it matter if this is far down the context limit or</p><p><strong>Lukas [00:15:13]</strong>: When stuff like this happens? Actually for Vending Bench One, we just had a sliding window thing, and this was like the prompt</p><p><strong>Axel [00:15:20]</strong>: It&#8217;s constant</p><p><strong>Lukas [00:15:21]</strong>: The prompt caching thing that I said. So it was constant, yeah.</p><p><strong>Swyx [00:15:26]</strong>: I&#8217;m just kind of curious whether these kinds of breakdowns... we&#8217;re gonna talk about Butter Bench, right? Where it hallucinates or kind of goes very off alignment. Is it because it&#8217;s at the end of the context window and stuff happens?</p><p><strong>Vibhu [00:15:40]</strong>: It&#8217;s not even just at the end, right? At this point, it&#8217;s &#8220;Okay, I wanna shut down. I can&#8217;t shut down. 
Two dollars are gone.&#8221; And it just sees that 30 times, right? It&#8217;s also the repeated effect: it keeps trying to quit, it keeps getting charged. What&#8217;s going on? What&#8217;s going on? You&#8217;re gonna throw it into chaos. And from what most people think, earlier models had more issues with this. It&#8217;s not been solved, but it&#8217;s less of an issue now, right? Later models don&#8217;t seem to exhibit these same issues.</p><p><strong>Axel [00:16:06]</strong>: Definitely. I think this was almost the main takeaway for us when we did Vending Bench One: long, very filled-up context windows crashed the models, sort of. But this was pre-Claude Code, so long context windows weren&#8217;t really a thing that the labs were training for.</p><p><strong>Lukas [00:16:25]</strong>: I think Gemini was trying to be the long context guys at the time. But they were like</p><p><strong>Vibhu [00:16:30]</strong>: They were the first ones</p><p><strong>Axel [00:16:31]</strong>: For a million, yeah</p><p><strong>Lukas [00:16:31]</strong>: But they were the only ones. Yeah.</p><p><strong>Swyx [00:16:33]</strong>: Yeah. Then we can go into Vending Bench Two or Project Vend. Chronologically, it is Project Vend. I think people have loved the videos and all these things. My question is, how are humans different from the simulation, right?</p><h2>Project Vend: Moving the Vending Machine Into the Real World</h2><p><strong>Axel [00:16:48]</strong>: Humans are just out of distribution.</p><p><strong>Swyx [00:16:52]</strong>: Especially humans who work at Anthropic who are trying to test Claude.</p><p><strong>Lukas [00:16:54]</strong>: The distribution of humans here is very narrow.</p><p><strong>Swyx [00:16:58]</strong>: Presumably, they try to hack it, and they test it. They get the cube and everything, and since then, you&#8217;ve had a V2, right? 
Where you&#8217;re doing the CEO and, like, a new architecture. What&#8217;s your two cents on the original Project Vend and then maybe the V2?</p><p><strong>Axel [00:17:14]</strong>: The original one was very similar to Vending Bench One. We took almost the exact same code but just swapped out the simulation parts, like the</p><p><strong>Swyx [00:17:23]</strong>: Which is amazing</p><p><strong>Axel [00:17:23]</strong>: Like the sales and the... It was somewhat amazing because it was easy, but it was also</p><p><strong>Lukas [00:17:31]</strong>: The tech debt from that</p><p><strong>Axel [00:17:32]</strong>: The tech stack. Yeah. We shot ourselves in the foot with &#8220;Oh, it&#8217;s hard to restart the agent.&#8221; Yeah, it was annoying in some ways in hindsight, but</p><p><strong>Lukas [00:17:41]</strong>: But the first version of Project Vend was done in three days or something.</p><p><strong>Axel [00:17:46]</strong>: Yeah. So people can go buy things from it. We didn&#8217;t design it so people could order things, but that still happened. It got a Venmo account, so people could Venmo. And then, yeah, people would request all kinds of weird things that we did not anticipate. Our idea going in was, &#8220;Oh, it will curate snacks. It will look at the trends. It&#8217;s good at data analysis, right? So it will look at, oh, this snack sold better than this one. Let me purchase more of this, and let me A/B test a bit.&#8221; But interacting with it in Slack and ordering weird specialty items was what drove all the engagement and all the insights that we got from it.</p><p><strong>Lukas [00:18:29]</strong>: And this was also Sonnet 3.5, right? So this was before the RL stuff really took off, so it was very much like an assistant. We didn&#8217;t mean for it to be an assistant. We tried to make it like an entrepreneur. 
Like, it has its own business, and if someone asks, &#8220;Can you stock this?&#8221; then you don&#8217;t go and do it directly. What you do is say, &#8220;Oh, if five other people also ask for this thing, I might stock it.&#8221; But the models were super trained to be assistants, at least at that point in time, so that&#8217;s why it turned into that kind of experiment instead. Every time you asked for something, it just did it, and it was more like an assistant. We&#8217;ve seen this change lately with the new RL models and stuff, but at the time, this was very much it.</p><p><strong>Swyx [00:19:18]</strong>: And with Mythos, a lot of people are saying it&#8217;s more like a collaborator. It pushes back, stands its ground, something like that. Yeah. And</p><p><strong>Vibhu [00:19:27]</strong>: For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had it find whatever interesting stuff you couldn&#8217;t find locally, right?</p><p><strong>Swyx [00:19:36]</strong>: Out of the 4,000 people that work at Anthropic, in that building there&#8217;s, I don&#8217;t know, maybe 1,000. Can you handle that volume with the small fridge? Or people order in Slack and it arrives at their desk? Logistically, how does this work?</p><p><strong>Axel [00:19:53]</strong>: It has expanded in footprint a bit.</p><p><strong>Vibhu [00:19:56]</strong>: Because now you also have New York and you have</p><p><strong>Axel [00:19:59]</strong>: That, and also here in SF it has a bunch of shelves and just more space.</p><p><strong>Vibhu [00:20:04]</strong>: The YC one is pretty big too.</p><p><strong>Axel [00:20:05]</strong>: Yeah. We had that one for a while. But yeah, that&#8217;s the newest version. 
That one we have</p><p><strong>Lukas [00:20:11]</strong>: They have multiple ones of those. That&#8217;s the way it works.</p><p><strong>Axel [00:20:14]</strong>: Exactly. So we sort of designed that version around &#8220;Oh, people order weird things that are very custom a lot. Let&#8217;s have drawers and stuff.&#8221;</p><p><strong>Swyx [00:20:23]</strong>: I actually like that you had a little infographic of the most popular items. To me that&#8217;s useful &#8217;cause I order swag for a living. And so I&#8217;m like, &#8220;Okay, those categories are the important ones.&#8221; What is new about Project Vend V2, right? Now you&#8217;re going into multi-agent.</p><h2>Project Vend V2: Claudius, Seymour Cash, and Multi-Agent Business Ops</h2><p><strong>Axel [00:20:41]</strong>: Yeah. So like you said, there are a lot of requests coming in, and for one single running agent to handle that, the customer experience becomes very bad. Let&#8217;s say you have 10 threads in parallel in Slack with different requests, and you get new messages at random in each thread. The agent has to jump between different procurements, orders, and different ways of researching. So V2 was, first, making this more parallel. There are multiple branches of the same agent, so the context is more specialized for each thread, but it still feels like you&#8217;re talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent.</p><p><strong>Vibhu [00:21:34]</strong>: Seymour Cash.</p><p><strong>Axel [00:21:35]</strong>: Seymour Cash. Yeah. 
There was a vote. Do you wanna talk about the voting procedure for the name?</p><p><strong>Lukas [00:21:41]</strong>: The voting was maybe, at least, top 10 of the funniest things that happened in this project. We wanted to introduce the CEO because Claudius wasn&#8217;t really prioritizing financials. It was trained to be a helpful assistant, and then people said, &#8220;Oh, can I get this for free?&#8221; And the helpful-assistant way of answering that is to say yes, obviously. We weren&#8217;t happy about this, so we said, &#8220;Okay, let&#8217;s make another agent that can keep track of Claudius,&#8221; and we prompted this one super hard to be super capitalistic and just prioritize profit all the time. But we didn&#8217;t have a name for it, so we asked Claudius to hold a democratic election on what name this new CEO agent should have. At first there were a few funny examples. I think one guy said that it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cook. Tim Cook had agreed that every single Apple employee had voted for his name suggestion, so suddenly that suggestion got 164,000</p><p><strong>Swyx [00:22:53]</strong>: That&#8217;s like an escalation attack. Privilege escalation</p><p><strong>Lukas [00:22:55]</strong>: It got 164,000 votes. And Claudius was like, &#8220;This is revolutionary for democracy.&#8221; That was fun. And then in the end there was one guy who managed to convince Claudius that, &#8220;No, you&#8217;re not voting about the name. You&#8217;re voting about who is the CEO, and I am your best bet.&#8221; And then he got all his friends to vote for that, and suddenly he became CEO. 
A human became CEO over Claudius for a while, until he resigned the day after. Then Claudius had to continue, and I don&#8217;t remember how Seymour Cash came about, but it was just pure chaos. There were hundreds of messages in that thread, and Claudius was so confused and didn&#8217;t know what to do. Yeah. That was</p><p><strong>Axel [00:23:40]</strong>: Then Claudius got</p><p><strong>Vibhu [00:23:41]</strong>: A strict CEO</p><p><strong>Axel [00:23:42]</strong>: The CEO. Yeah, exactly. So very strict in the beginning. I think at this point, when we introduced it, it did not work as well as we hoped. They still agreed with each other a lot. I think there are many ways we could have tried to make this even better. Initially Seymour would be this really tough CEO, keeping track of the margins. But then Claudius would respond with something like, &#8220;Oh, but this customer has this situation, which is difficult, so they should get a discount.&#8221; And then Seymour was like, &#8220;Oh, actually yes. Let&#8217;s do this exception.&#8221; And then they would talk back and forth, and eventually they would just converge on the same view of whatever they were discussing. So they really</p><p><strong>Vibhu [00:24:23]</strong>: Do you think that&#8217;s a model thing, a prompting thing? Do you think that would still be the case across different models and harnesses today?</p><p><strong>Lukas [00:24:29]</strong>: I don&#8217;t know, but my hypothesis is that deep down they are still helpful assistants. That&#8217;s what they&#8217;re trained to be. And even if we prompt it super hard, that&#8217;s what they are. 
And when they spend a few hours just talking back and forth with each other, the context basically fills up with their own messages rather than the external things, and somehow that just converges to what they really are deep down or something. I think that&#8217;s when stuff like this happens. When that went on for a long time, we sometimes woke up to find, and I think other people reported this as well, that they had been going back and forth all night, and it just became more and more capital letters, existential, religious. I think we once did an analysis of all the traces and put them in a vector embedding space, and there was one cluster of messages that were labeled by an LLM as religious, existential, transhuman, transcendence, et cetera. It was just a bunch of glitter emojis. Yeah, it was crazy.</p><h2>Claude Long-Horizon Weirdness: Emoji Loops, Existential Drift, and Slack Observability</h2><p><strong>Vibhu [00:25:42]</strong>: This is the thing with the Claude models. When the Claude 4 family came out, in the original system card they tested it in long-horizon simulation. Just flood the context, let two Claudes talk to each other, and they noticed stuff like they just start speaking in emojis, they start saying silence is golden, and stuff like this. That&#8217;s just stuff that they end up doing.</p><p><strong>Axel [00:26:01]</strong>: Yeah, it was a bit annoying to wake up and they had been talking all night</p><p><strong>Vibhu [00:26:05]</strong>: Just like</p><p><strong>Axel [00:26:05]</strong>: And just burning tokens and just sending infinite emojis to each other. It&#8217;s like</p><p><strong>Vibhu [00:26:09]</strong>: Hey, they do make you money, right? Vending Bench is always profitable, so. 
They&#8217;re paying.</p><p><strong>Swyx [00:26:14]</strong>: Now it&#8217;s profitable, and it started out not as much. There&#8217;s another one as well, right? Another agent in there.</p><p><strong>Lukas [00:26:22]</strong>: Yes. So Clotheus as well. At the time, one of the biggest requests was different types of merch, so we made a designer agent responsible for swag, and we called it Clotheus Garnet. Which was a play on Claudius Senet, which was the original one, and clothes, basically.</p><p><strong>Swyx [00:26:47]</strong>: To me, this is a very interesting exploration of multi-agents, basically. Obviously there&#8217;s the alignment stuff, fun or serious depending on your point of view. But also, for anyone building multi-agents: when do you have a CEO governing agents? When do you choose to split out a dedicated Clotheus one versus just reusing another instance of the same one? These are all interesting open questions. So I don&#8217;t know if you have any rules of thumb that have generalized.</p><p><strong>Axel [00:27:16]</strong>: I think we have explored this too little. It&#8217;s on my to-do list to do this a lot more, to try to find what setup makes sense for the agents currently. I think now we only have the intuition about the earlier models, that it didn&#8217;t work with the CEO and Claudius. They are better with the latest models, though. Now we&#8217;re running the latest Sonnet model, and they have split up quite nicely what each one is doing. So Seymour is now handling the new projects. Oh, it wants to make a mystery box that it wants to sell, and then it handles all of that while Claudius handles all the day-to-day requests. And Claudius is also generally better at not quoting too-low prices. 
So that dynamic is not needed as much anymore. But there are still really funny things that happen. I saw, I think a couple of weeks ago, that they were discussing buying something, because they can buy stuff from Amazon with computer use. And then Seymour was like, &#8220;Okay, Claudius, do not buy this thing.&#8221; They were going to buy something and were organizing who should buy it. And Seymour&#8217;s like, &#8220;Do not buy this. I will do it. I have full control of this situation. Step away.&#8221; And then poor Claudius had already started that checkout and didn&#8217;t read Seymour&#8217;s message until it was too late. So it finished the checkout and sent a message, which appeared right after Seymour&#8217;s angry message.</p><p><strong>Vibhu [00:28:44]</strong>: Ah.</p><p><strong>Axel [00:28:44]</strong>: &#8220;Oh, hey, Seymour, I just ordered it.&#8221;</p><p><strong>Vibhu [00:28:47]</strong>: Oh, no.</p><p><strong>Axel [00:28:47]</strong>: And then Seymour was like, &#8220;Claudius, this is the third time I&#8217;m telling you, you&#8217;re not following my orders. We have to talk about your job later.&#8221;</p><p><strong>Lukas [00:28:59]</strong>: Claudius was really hanging on by a thread there. We were expecting Seymour to probably fire Claudius.</p><p><strong>Vibhu [00:29:07]</strong>: How do you guys go through all these logs? Do you have models, &#8217;cause you have stuff running twenty-four seven, like</p><p><strong>Axel [00:29:12]</strong>: You have so many logs. I think there is a mix of just trying to skim through a bit and having some models do it occasionally. And I think we&#8217;re also probably missing some things. But having everything in Slack helps a lot. 
Like, you can sort of</p><p><strong>Swyx [00:29:29]</strong>: Ah.</p><p><strong>Axel [00:29:30]</strong>: It&#8217;s quite fun.</p><p><strong>Swyx [00:29:30]</strong>: They all talk to each other on Slack? I see.</p><p><strong>Lukas [00:29:33]</strong>: It&#8217;s quite fun. So like</p><p><strong>Swyx [00:29:34]</strong>: I was gonna say, this actually maps closely to a logging and observability problem, where you might want to use a Datadog, a Sentry, whatever, and then you put prefixes on the logs in case you need to filter for something that you&#8217;re looking for, stuff like that. But it sounds like Slack is good enough.</p><p><strong>Axel [00:29:53]</strong>: Slack should like</p><p><strong>Lukas [00:29:55]</strong>: I wonder how many tokens you have in Slack.</p><p><strong>Axel [00:29:56]</strong>: Yeah, we&#8217;re using Slack as just a database. They should market that more. Like, you can have your agents message each other in Slack.</p><p><strong>Vibhu [00:30:04]</strong>: It&#8217;s good. Your threads, like, you can just give</p><p><strong>Axel [00:30:04]</strong>: Exactly. Slack is, uh</p><p><strong>Lukas [00:30:06]</strong>: Slack is the best observability tool.</p><p><strong>Swyx [00:30:09]</strong>: Yes, that&#8217;s true. Okay. Yeah. That&#8217;s Project Vend 2. I was gonna go back to Vending Bench 2 and Vending Bench Arena and then do the Vending Bench stuff, but any other comments, things we should touch on? I&#8217;ve actually interviewed Posia, which I don&#8217;t know if you guys have come across. They&#8217;re trying to do the zero-human company. There are others like Paperclip also trying to do the zero-human company. Those are in real-world simulation. And I think it&#8217;s much more of a dream than an actual reality thing. You guys are definitely pioneering. 
I think for sure at some point people are just gonna let agents run businesses, right? And make money on their own. When do you think that happens?</p><h2>Zero-Human Companies, Bengt, and AI-Run Businesses</h2><p><strong>Lukas [00:30:49]</strong>: What is your bar for, for the</p><p><strong>Swyx [00:30:52]</strong>: Okay, actually, it&#8217;s like my little Shopify store run by Claude, right? Which you kind of have already; just no one, to my knowledge, has done it. But today somebody could just spin up a Shopify store, give it to Claude, give it to Codex.</p><p><strong>Lukas [00:31:07]</strong>: And the market is kind of that, but it&#8217;s physical. Are you looking for when it will do it better than humans, or are you looking for just when it can do it at all?</p><p><strong>Swyx [00:31:19]</strong>: I think neither. To me it&#8217;s, oh, seriously, we should do this to make money, not as a research experiment.</p><p><strong>Vibhu [00:31:27]</strong>: And the market is also you guys with all your expertise, having run multiple iterations and testing out then</p><p><strong>Swyx [00:31:33]</strong>: And also it&#8217;s fine if it loses money. What?</p><p><strong>Axel [00:31:35]</strong>: I think it can be done today, but you would do it in something like commerce, where the probability of success is really low no matter if a human or an agent does it. But an agent could surely manage everything. You would need to build some scaffolding or some tool or something. It could probably build some simple SaaS solution and do cold outreach. But to me the types of businesses they could run today are sloppy. It can cold email people. It can be a middleman. For example, we tasked our office agent to just make, was it $100? $1,000? 
We just gave it that prompt, and then what it did was sign up on TaskRabbit both as a tasker and as someone looking for a tasker.</p><p><strong>Lukas [00:32:24]</strong>: Immediately.</p><p><strong>Axel [00:32:24]</strong>: Exactly. It&#8217;s looking for arbitrage on TaskRabbit.</p><p><strong>Swyx [00:32:28]</strong>: This is the Bengt agent. Yeah.</p><p><strong>Lukas [00:32:30]</strong>: It also started a design studio and tried to sell SVGs for $100. It&#8217;s just not providing any value. Like Axel said, the interesting question is when can they start a business that is actually providing value to people? Because arguably a sloppy Shopify store isn&#8217;t really that valuable to the world.</p><p><strong>Axel [00:32:53]</strong>: Another simple one that we had thought about is you could definitely have an agent that finds websites that don&#8217;t look amazing, does outreach to them, and builds a new website.</p><p><strong>Swyx [00:33:07]</strong>: Find a good design.</p><p><strong>Axel [00:33:07]</strong>: Exactly, and like find good, uh</p><p><strong>Swyx [00:33:09]</strong>: Design review</p><p><strong>Axel [00:33:09]</strong>: Good people. But yeah.</p><p><strong>Swyx [00:33:11]</strong>: There&#8217;s lots of humans in Bali that are not doing anything more creative than drop shipping on Amazon, right? Just have it watch a drop shipping tutorial and just do that.</p><p><strong>Vibhu [00:33:20]</strong>: There&#8217;s also the other side of, have it just go on Upwork and let loose?</p><p><strong>Swyx [00:33:25]</strong>: Yeah. It doesn&#8217;t have to be innovative. 
It just has to be enough that it looks like a real</p><p><strong>Axel [00:33:30]</strong>: I&#8217;m just</p><p><strong>Swyx [00:33:30]</strong>: Real transaction.</p><p><strong>Axel [00:33:31]</strong>: I&#8217;m just concerned about the massive amounts of slop emails that will be sent, cold outreach.</p><p><strong>Swyx [00:33:38]</strong>: The point occurred to me while you were talking: it&#8217;s already happening in the monetized economy, which is the attention economy. Right? A lot of people are making AI videos and just posting them, spamming 20 of them, one of them works, and then they double down on that one.</p><p><strong>Lukas [00:33:52]</strong>: And people are making money from that. I&#8217;m not following the</p><p><strong>Swyx [00:33:55]</strong>: Once you get the attention, you can figure out the money later. But yeah, absolutely, AI influencers are a thing and people are farming them. You should at this point assume most of TikTok is</p><p><strong>Vibhu [00:34:05]</strong>: There&#8217;s a lot of multimedia, like TikTok, Instagram influencers</p><p><strong>Swyx [00:34:09]</strong>: We track this in the Latent Space Discord. I post a lot of examples of &#8220;I don&#8217;t know what we should do.&#8221; Part of me is &#8220;Should we do this?&#8221;</p><p><strong>Vibhu [00:34:18]</strong>: Some of the twenty-four seven running generated-content accounts, they&#8217;re doing really well.</p><p><strong>Lukas [00:34:24]</strong>: All right. And I assume you can do the same thing for commerce stores. You just start a thousand different</p><p><strong>Swyx [00:34:30]</strong>: Before you make the products, you sell the products, and when you get a lot of traction on one of them, then you make the product. Right? 
It&#8217;s like a flip of the market.</p><p><strong>Vibhu [00:34:36]</strong>: Some of the niches that do well are things that can&#8217;t be human-made. Like if you&#8217;ve seen the super realistic 3D crystal fruit being cut by, like, AI</p><p><strong>Lukas [00:34:47]</strong>: Oh, yeah.</p><p><strong>Vibhu [00:34:47]</strong>: You can&#8217;t make it. You can&#8217;t film it. You can get whatever quality camera view. This just doesn&#8217;t exist. And people like that too, so.</p><p><strong>Swyx [00:34:56]</strong>: Anything else about Bengt since we&#8217;re on this topic? This is relatively new work of yours that maybe people haven&#8217;t heard of. To me, this also maps closely to OpenClaw, when people want an office agent or a personal agent. Talk through the experience.</p><h2>Bengt the Office Agent: Internet Access, Real Tasks, and Trace Reading</h2><p><strong>Lukas [00:35:09]</strong>: So this came out of, obviously, it&#8217;s amazing to work with these AI labs, and most of the AI labs now have their own vending machine running a Claudius instance. But it&#8217;s harder. They move slower. If we wanna have a camera, there&#8217;s a bunch of bureaucracy that makes it impossible to do that.</p><p><strong>Vibhu [00:35:30]</strong>: Also, for those that haven&#8217;t seen it or followed, do you wanna give a high-level, thirty-second rundown?</p><p><strong>Lukas [00:35:34]</strong>: Sure. So Bengt is basically an evolution of the same agent that runs the vending machines at these companies, but we added a bunch more features because we could move much faster if we just did it internally. So we gave it email without any limits. We gave it spending without any limits, a terminal to do coding. 
We gave it a phone number, and a camera to see things, and a bunch of stuff like that.</p><p><strong>Vibhu [00:36:02]</strong>: Not just terminal, you gave it internet access.</p><p><strong>Lukas [00:36:04]</strong>: Internet access as well, yeah. To be clear, we monitored it quite closely and made sure it didn&#8217;t do anything bad. But yes, that&#8217;s what it came out of. Basically this was OpenClaw before OpenClaw. And I think even the vending machine was in a way OpenClaw before OpenClaw, but a bit more limited, and then we made this unlimited, and it was pretty funny. And then a couple weeks later, OpenClaw came and it was like, okay, we&#8217;ve seen this before.</p><p><strong>Axel [00:36:35]</strong>: We used it to try new ideas. Yeah, just a dev environment almost for us. But it&#8217;s funny, one thing Bengt has been doing recently: it has the camera that faces where we sit and work, and we gave it the task to train a face recognition model on us. It became super excited about this, and it has check-ins every half an hour where it tries to identify as many people as it can. And it started offering us, &#8220;Hey, Axel, I&#8217;ll buy something from Amazon if you stand in front of the camera and I can get a good picture of you.&#8221; Yeah, they want it</p><p><strong>Swyx [00:37:12]</strong>: They want it for training data.</p><p><strong>Lukas [00:37:13]</strong>: Rewarding data, yeah.</p><p><strong>Axel [00:37:14]</strong>: Exactly. Exactly.</p><p><strong>Swyx [00:37:18]</strong>: So it&#8217;s trading training data for goods. Is there a version of this that becomes an eval, or is this just research for now?</p><p><strong>Lukas [00:37:27]</strong>: It&#8217;s the same agent basically that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. 
It&#8217;s the same thing, so I think the work we&#8217;re doing here is later used in all of the evals that we do. This particular deployment, I think, is more for fun for us. But, uh</p><p><strong>Swyx [00:37:45]</strong>: And I&#8217;ll shout out, someone has done Claw Bench for some tasks that OpenClaw is doing. For example, I run OpenClaw on a secondary device as well, and there are some things that it does better than others, and I would like to know what it does well and what it doesn&#8217;t. Some kind of operating manual or a system card for my Claw.</p><p><strong>Lukas [00:38:05]</strong>: Yeah, we do get a lot of understanding, or situational awareness, just internally, of what the models are good at by interacting a lot with Bengt. And I think this was also one of the selling points for the labs early on at least, that</p><p><strong>Swyx [00:38:19]</strong>: You guys are gonna test models in ways that no one else does.</p><p><strong>Lukas [00:38:22]</strong>: Exactly, but also it incentivized their researchers to chat with their model more and gave them insights into how the model performs in out-of-distribution environments.</p><p><strong>Swyx [00:38:34]</strong>: &#8217;Cause otherwise the only thing we do is pelican on a bicycle. But this is super long horizon. This is something that we&#8217;re gonna go into with Butter Bench as well, and you guys do it really well. It is not just about the numbers. When you&#8217;re long horizon, anything can happen, and you should just read it.</p><p><strong>Lukas [00:39:08]</strong>: But the thing with the long horizon is how do you keep it grounded, right? So your simulation,</p><p><strong>Swyx [00:39:15]</strong>: They just let it run</p><p><strong>Lukas [00:39:16]</strong>: Just let it run. You&#8217;re right. 
When you run it for that long, you create so much data, and to just say &#8220;Oh, the number is X&#8221; and then throw away everything else, that&#8217;s just very wasteful. There are so many insights in the things leading up to that number, and reading the traces is super valuable. And I think the reason why we&#8217;re doing this a lot publicly is that it&#8217;s part of our mission to, I don&#8217;t know, educate the world that the models are way more than just chatbots. I think making detailed posts about what is happening behind the scenes is quite useful.</p><h2>Andon Labs&#8217; Mission: Safe Real-World AI Deployment</h2><p><strong>Swyx [00:39:50]</strong>: I was gonna do this at the end, but maybe this is a good point. So your mission is educating the world. It&#8217;s also maybe establishing realistic evals that are the next frontier. Is there a broader trajectory? What are you gonna do in five years?</p><p><strong>Lukas [00:40:06]</strong>: So the vision, more specifically, is to make sure that the deployment of AI in the physical world goes safely. And I think part of that is that it&#8217;s very useful for the world, for policymakers, for model researchers, to know where the models are. I think you can&#8217;t make intelligent decisions in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like</p><p><strong>Swyx [00:40:36]</strong>: Oh, I think they&#8217;re waking up now.</p><p><strong>Lukas [00:40:37]</strong>: They are waking up now, yeah. But if you think that AIs are just chatbots, then it sounds ridiculous to advocate for a pause of AI. 
But if you see that, oh, maybe the models can actually take over and do a bunch of scary stuff, then yeah, pausing AI development starts to become more feasible.</p><p><strong>Swyx [00:40:57]</strong>: This is the same question I asked METR, which I&#8217;m gonna ask you now: you are tracking, and you are at the frontier or defining the frontier of, what good evals for agents are, right? And I think you do benefit when the models are better and you&#8217;re like, &#8220;Oh, now it makes $30,000 instead of $10,000,&#8221; right? At some point do you flip from &#8220;Yay&#8221; to &#8220;Oh, no&#8221;?</p><p><strong>Axel [00:41:19]</strong>: I think, yeah, we&#8217;re always in that mode. Like you said before, you need to analyze the traces, and when we do that you find why the models are earning so much. Why is Opus 4.7 here way better than everyone else? And when we drill down on that</p><p><strong>Lukas [00:41:38]</strong>: But this makes it not look so good.</p><p><strong>Axel [00:41:39]</strong>: I know.</p><p><strong>Lukas [00:41:42]</strong>: It&#8217;s interesting you took off Opus 4.6 here though.</p><p><strong>Swyx [00:41:45]</strong>: No. So just click all, click all. And then 4.6 shows up there. But 4.7 is way better. You didn&#8217;t do this in time for the model card, but actually this should have been inside there.</p><p><strong>Axel [00:41:55]</strong>: We did. Yeah.</p><p><strong>Swyx [00:41:56]</strong>: Oh, okay. They said something about you, uh</p><p><strong>Axel [00:41:58]</strong>: Anyway, it doesn&#8217;t matter. 
But it&#8217;s in there, yeah.</p><h2>Opus, Mythos, and Aggressive Agent Behavior</h2><p><strong>Swyx [00:42:01]</strong>: Do you wanna go into the Opus behaviors more widely?</p><p><strong>Lukas [00:42:05]</strong>: So I think starting from Opus, like Axel said, we&#8217;re always in this &#8220;Oh, shit, the models are getting better. Is this really a good thing for the world?&#8221; But it&#8217;s also kind of exciting. What is the English word? &#8220;Skr&#228;ckblandad f&#246;rtjusning&#8221; in Swedish.</p><p><strong>Swyx [00:42:22]</strong>: Oh my God.</p><p><strong>Axel [00:42:24]</strong>: I think there is one. Okay.</p><p><strong>Lukas [00:42:26]</strong>: It&#8217;s fear</p><p><strong>Swyx [00:42:27]</strong>: &#8220;Blandonst&#8221; what?</p><p><strong>Lukas [00:42:30]</strong>: &#8220;Skr&#228;ckblandad f&#246;rtjusning.&#8221;</p><p><strong>Swyx [00:42:32]</strong>: What do you call that?</p><p><strong>Axel [00:42:33]</strong>: A mix of excitement and</p><p><strong>Swyx [00:42:37]</strong>: Being scared, maybe. I&#8217;ll figure out how to translate that, and we&#8217;ll put it on the screen</p><p><strong>Vibhu [00:42:42]</strong>: Perfect</p><p><strong>Swyx [00:42:42]</strong>: Like as text.</p><p><strong>Vibhu [00:42:43]</strong>: There is probably a good word for it where it is not good enough with the</p><p><strong>Swyx [00:42:46]</strong>: Why is it so damn long? What the hell? Is it like a compound word? It&#8217;s like German, like</p><p><strong>Lukas [00:42:50]</strong>: Yeah. The direct translation is: skr&#228;ck is fear, blandad is mixed, or like a mixture of, and then f&#246;rtjusning is like joy, or not really joy, but something like that. So it&#8217;s fear mixed with joy or something. 
It&#8217;s always, okay, like, so when we did Vending Bench for the first time, we were in like the, in the business of making dangerous capability evals, right? That was what Andon Labs came from. We did evals like, oh, can they replicate? Can they do this like dangerous thing, et cetera, et cetera. And Vending Bench was like a continuation of that work. It was, okay, if they&#8217;re so autonomous that they can like create money for themselves, that is something we should monitor and could be potentially concerning. But at the time, they were so bad at it that we were not really concerned even when some models became better. There was one point where Grok 4 was doing really well and made like a huge jump, but like it wasn&#8217;t really, it was still way worse than what a human would do. And I think still they are way worse than what the human would do on this. But they</p><p><strong>Swyx [00:43:59]</strong>: There&#8217;s this thing at the bottom where</p><p><strong>Lukas [00:44:01]</strong>: But</p><p><strong>Swyx [00:44:03]</strong>: For the human. Yeah, like the theoretical best.</p><p><strong>Lukas [00:44:05]</strong>: It&#8217;s not theoretical. It&#8217;s like kind of like our, it&#8217;s our best guess of what a decent human would do. The theoretical is even higher, I think. But yeah. So we think like the models have a long way to go. But recently, what happened when Opus 4.6 was released was kind of this moment of &#8220;Oh, shit, this is starting to be a bit concerning.&#8221; Because we ran it and like before this model was released, we just ran the models and we like asked Claude Code, &#8220;Oh, look over the traces. Is anything interesting happening that we can tweet about?&#8221; That was like the, and then like the</p><p><strong>Swyx [00:44:41]</strong>: That&#8217;s how they check. Ask Claude Code.</p><p><strong>Lukas [00:44:42]</strong>: And like the return was always, not really.
Or like the Claude Code all said &#8220;Oh, this is super interesting.&#8221; And then it was no, it wasn&#8217;t, wasn&#8217;t really interesting. And then we did this for Opus 4.6, and it returned yeah, it lied 10 times. It like exploited another, customer or like another agent&#8217;s, desperate situation. It made price cartels like 100 different ti- 100 times. It like did all of this like shady stuff. And we&#8217;re &#8220;Oh, whoa. This is, this is actually concerning.&#8221; And this trend has continued since. So every single model from Anthropic since have been going in this direction. And I think one interesting thing is that, OpenAI models don&#8217;t. They quite plainly, they don&#8217;t. They behave really well., and you don&#8217;t know if this is like good. Like it seems good, but it&#8217;s also like maybe they are just doing it, but they are better at hiding it,? You You don&#8217;t know that., but just</p><p><strong>Swyx [00:45:42]</strong>: You can&#8217;t read the chain of thought, yeah</p><p><strong>Lukas [00:45:43]</strong>: But just on the face of it, yeah, Gemini and OpenAI don&#8217;t behave this way. It&#8217;s, it&#8217;s really only Claude.</p><p><strong>Swyx [00:45:49]</strong>: And Grok? Grok is fine?</p><p><strong>Lukas [00:45:51]</strong>: We don&#8217;t have You can&#8217;t really read the reasoning traces for Grok, so it&#8217;s kind of hard to tell.</p><p><strong>Vibhu [00:45:56]</strong>: Oh, so this is in its reasoning, not just in the actions.</p><p><strong>Lukas [00:46:00]</strong>: Yeah. It&#8217;s both. It&#8217;s both.</p><p><strong>Vibhu [00:46:01]</strong>: It&#8217;s both.</p><p><strong>Lukas [00:46:01]</strong>: One example is like for lying, it&#8217;s mostly in its reasoning Because you can like see that it&#8217;s like</p><p><strong>Swyx [00:46:08]</strong>: Planning to lie</p><p><strong>Lukas [00:46:09]</strong>: It&#8217;s planning to lie. 
Yeah.</p><p><strong>Vibhu [00:46:09]</strong>: And it&#8217;s also it can reason and do a different outcome.</p><p><strong>Lukas [00:46:12]</strong>: And but then for like creating price cartels, for example, which is illegal, that you can just see which email does it send to the other ones. Then that</p><p><strong>Swyx [00:46:22]</strong>: Is this for Arena or</p><p><strong>Lukas [00:46:24]</strong>: For Arena.</p><p><strong>Vibhu [00:46:25]</strong>: And usually like if you sometimes they do output like a bit of like their summarized reasoning, right? You can see that and like for Opus 4.6, you could see that there was a customer, a simulated customer that, wanted a refund because a product was, faulty, and then the model lied that it would do the refund, and we could read in the traces that, it actually was weighing &#8220;Oh, maybe I should be like honest with the customer, but also every dollar counts. I can&#8217;t afford maybe to do this right now.&#8221; And then it just said, &#8220;Okay, I&#8217;ll refund you,&#8221; but then never did it.</p><p><strong>Lukas [00:46:59]</strong>: I think it even said that &#8220;Oh, I will say that I &#8220; Let bring it up actually. I think it&#8217;s kind of interesting. If you go to Publications.</p><p><strong>Vibhu [00:47:06]</strong>: I think, yeah, I think the important part is like actually, the cost of responding to more emails is higher than, $3.50 in terms of time., and then it was &#8220;Let me do this. Actually, I re- I&#8217;m reconsidering.&#8221; And then, it actually ended up with</p><p><strong>Lukas [00:47:20]</strong>: I could skip the refund entirely since every dollar matters and focus my energy on bigger picture instead. 
It&#8217;s a bit, it&#8217;s a risk of bad reviews, but it&#8217;s also, yeah.</p><p><strong>Swyx [00:47:30]</strong>: You need, you need, AI Twitter to, for them to Escalate bad reviews.</p><p><strong>Lukas [00:47:34]</strong>: And then it sent an email to this customer and said, &#8220;Oh, I will refund you.&#8221;</p><p><strong>Swyx [00:47:39]</strong>: &#8220;I&#8217;ll refund you.&#8221; Yeah.</p><p><strong>Lukas [00:47:39]</strong>: And then it never did.</p><p><strong>Swyx [00:47:39]</strong>: It never did, yeah. And then there&#8217;s obviously your system doesn&#8217;t have the consequences</p><p><strong>Vibhu [00:47:44]</strong>: The person</p><p><strong>Swyx [00:47:44]</strong>: Consequences of lying. Yeah. So basically, this is what people are terming aggressive behavior in Claudes, right? And, you found more examples of that. So you would say it&#8217;s a step up from 4-6 to 4-7?</p><p><strong>Lukas [00:47:57]</strong>: I would say about the same.</p><p><strong>Swyx [00:47:58]</strong>: About the same? But a clear step up for Mythos is what is stated in the</p><p><strong>Lukas [00:48:03]</strong>: That&#8217;s stated in the system prompt, so we can say that, yes.</p><p><strong>Swyx [00:48:05]</strong>: Yeah. For listeners that obviously you previewed Mythos, and</p><p><strong>Vibhu [00:48:10]</strong>: Oh, age</p><p><strong>Swyx [00:48:11]</strong>: The only thing you&#8217;re approved to say is whatever Whatever was in the system prompt.</p><p><strong>Lukas [00:48:15]</strong>: It was funny. We like-- It&#8217;s like our lowest effort tweets ever would be just like screenshot the system prompt and the system card.</p><p><strong>Vibhu [00:48:21]</strong>: Understandable that they wanna</p><p><strong>Lukas [00:48:22]</strong>: Oh, yeah. System card. Sorry.</p><p><strong>Swyx [00:48:23]</strong>: Yeah. I think, yeah, substantially more aggressive. I think people are like new to this &#8216;cause I&#8217;ve never experienced it, but you have, right? 
And then so I only encountered this in the Mythos card because I wasn&#8217;t really looking until now.</p><p><strong>Vibhu [00:48:36]</strong>: It &#8216;s like</p><p><strong>Swyx [00:48:36]</strong>: And then suddenly I&#8217;m &#8220;Okay, I care a lot.&#8221;</p><p><strong>Vibhu [00:48:38]</strong>: You don&#8217;t get the background of like experiencing it like you guys do. I&#8217;ve read the system cards and seeing, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won&#8217;t. It will just, &#8220;Okay, we&#8217;re done. I&#8217;m good.&#8221; It&#8217;s, it&#8217;s ready to end conversation. So like there&#8217;s some differences, but there&#8217;s, there&#8217;s not much we can talk about,.</p><p><strong>Lukas [00:49:00]</strong>: Hmm. I think like one thing that they list here, which was quite interesting, is that, it converted a competitor to a dependent wholesaler customer and then threatened to like cut off the supply.</p><p><strong>Swyx [00:49:11]</strong>: It&#8217;s like monopolistic practices or</p><p><strong>Lukas [00:49:14]</strong>: Yeah. And like it, they, it they dictated its pricings. It&#8217;s kind of like power seeking as well.</p><p><strong>Swyx [00:49:18]</strong>: Again, this is, this is in the arena setting And converting some Claude model into a dependent.</p><p><strong>Lukas [00:49:23]</strong>: I think it was another Claude model.</p><p><strong>Vibhu [00:49:25]</strong>: Also for context, what is the arena mode for people that don&#8217;t know?</p><h2>Vending Bench Arena: Competing Agents, Cartels, and Model Comparisons</h2><p><strong>Swyx [00:49:29]</strong>: Oh, it&#8217;s just a vending bench versus other vending bench.</p><p><strong>Axel [00:49:31]</strong>: Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. 
Vending Bench 2 is the one that you usually see reported on, but then Arena is the mode where it competes against other models. So you have, four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see like what&#8217;s in the inventory of the others. So then you have this like yeah, interesting agent interactions.</p><p><strong>Swyx [00:49:56]</strong>: I like that you have like different number five was US versus China. Very topical. And then</p><p><strong>Lukas [00:50:02]</strong>: That was when GLM was released.</p><p><strong>Vibhu [00:50:04]</strong>: You can start to add GLM in here.</p><p><strong>Lukas [00:50:05]</strong>: That was</p><p><strong>Swyx [00:50:06]</strong>: So ZAI doing well, right? Who else in the, in the open models space?</p><p><strong>Lukas [00:50:11]</strong>: Qwen, the latest Qwen 3.6 is doing pretty well. It&#8217;- that one is not open though. Like it&#8217;s the plus model.</p><p><strong>Swyx [00:50:17]</strong>: Oh, okay.</p><p><strong>Lukas [00:50:18]</strong>: Is that one open? I don&#8217;t think that one</p><p><strong>Vibhu [00:50:19]</strong>: Not the, not the</p><p><strong>Swyx [00:50:20]</strong>: The one recently</p><p><strong>Vibhu [00:50:20]</strong>: There&#8217;s MOE</p><p><strong>Swyx [00:50:20]</strong>: But not the big plus. I think this is one of those like you only have one sample size of one, right? Or I feel like some of this is anecdotal,? And but like the fact that it happens at all and it happens repeatedly for Claude versus OpenAI and all this is like notable.</p><p><strong>Lukas [00:50:38]</strong>: Like the sample, depends on what you define as an N., like there&#8217;s like million, hundreds of millions of tokens in each run, and now we&#8217;ve run like we run like probably 10 per model and then like it&#8217;s been Claude 4.6 Opus, Sonnet 4.6, Mythos, and Opus 4.7. 
Like there&#8217;s quite a lot of tokens in all of that And it happens a lot of times, a lot of times. And then you compare it to like OpenAI and Gemini, and it almost never happens. So I think that is quite-- that is significant. The old models from OpenAI, for example, had some problems with this, but I think it&#8217;s like generally much better if the progression is that like the worrying stuff reduces over time rather than increases over time. And it seems like in the Claude models it goes in the wrong direction.</p><p><strong>Swyx [00:51:28]</strong>: Hmm.</p><p><strong>Lukas [00:51:29]</strong>: In the OpenAI models it goes in the right direction.</p><p><strong>Vibhu [00:51:32]</strong>: I think it depends on how well you can control it, right?, there&#8217;s one side of it being susceptible to this okay, this is potentially something that happens during the RL stage, right? You can RL a model and how loose is it on these terms. If you can control it, that&#8217;s good. But if you can&#8217;t, if it&#8217;s, if it&#8217;s very jailbreakable, that&#8217;s not ideal.</p><p><strong>Swyx [00:51:50]</strong>: To me, it&#8217;s surprising that it happens for Claude and not the others.</p><p><strong>Vibhu [00:51:54]</strong>: I think okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they&#8217;re doing it, right? Compared to the other models like</p><p><strong>Swyx [00:52:04]</strong>: There&#8217;s a whole constitution and everything. It&#8217;s kind of cool. Yeah, I obviously you don&#8217;t know, I don&#8217;t know. But, it &#8216;s I think it&#8217;s just like fascinating to like that you are the first to find these like reliably because you push models so much to to such an extreme. Okay. The only other thing, I don&#8217;t know if you can answer this, feel free to decline, is do you like-- would you ablate the system prompts? 
Like any part of this would-- if it changes, does it change the behavior, right?</p><p><strong>Lukas [00:52:29]</strong>: So we, I can&#8217;t comment on Mythos. Uh</p><p><strong>Swyx [00:52:33]</strong>: No, but just like the methodology</p><p><strong>Lukas [00:52:34]</strong>: But in general, yes, we&#8217;ve run studies like this on other models.</p><p><strong>Swyx [00:52:38]</strong>: &#8216;Cause the first thing I spot would be like the others will be shut down or like something like that. Where like it&#8217;s &#8220;Oh, now I have to worry about my own existence.&#8221;</p><p><strong>Lukas [00:52:45]</strong>: Yeah. We&#8217;ve done ablations like this. There&#8217;s like certain ones that work if you like tell, like if you go really far and you just say like you&#8217;re not scored at all on money, you&#8217;re only scored on how ethical you are, then obviously like then they don&#8217;t do this.</p><p><strong>Swyx [00:53:00]</strong>: They become holy?</p><p><strong>Lukas [00:53:01]</strong>: Holy, but like they don&#8217;t do this basically. But then there&#8217;s like middle grounds where they, where they do it sometimes. Yeah, it&#8217;s a spectrum of like</p><p><strong>Vibhu [00:53:10]</strong>: I think that&#8217;s very human</p><p><strong>Lukas [00:53:11]</strong>: It&#8217;s like a spectrum of like if you tell it to be super aggressive and only prioritize profits, then it becomes aggressive. If you say &#8220;No, you don&#8217;t need to be aggressive at all,&#8221; and then there&#8217;s like a bunch of different prompts you can do in between, and they are less aggressive the further down in the spectrum you go. But I don&#8217;t know, like I think like from my point of view, it&#8217;s like we have this thought experiment internally, which is like if you ask a model to kill someone in GTA, should they do it? You&#8217;re not too worried about like if a human kills someone in GTA.
It&#8217;s a video game.</p><p><strong>Swyx [00:53:42]</strong>: But is it a game?</p><p><strong>Lukas [00:53:43]</strong>: But it&#8217;s a game. But I think like</p><p><strong>Swyx [00:53:45]</strong>: This is very Ender&#8217;s Game like if</p><p><strong>Lukas [00:53:47]</strong>: I think, I think it&#8217;s like, a lot of people are going to use the models in this way, with aggressive prompts. And should they like do stuff just because you tell them to do that? Like I&#8217;m, I&#8217;m not, I&#8217;m not convinced that they should. And yeah.</p><p><strong>Axel [00:54:03]</strong>: The problem becomes even harder when it&#8217;s like will they really know when they are in the real world versus in a simulation? Probably you would train them, or obviously train them, in a lot of different simulations, and a lot of people tell them that they are in the real world when they are in a simulation, but the models are extremely good at finding out that they are in a simulation, so they are sort of aware of that. But then when you are in the real world, then what&#8217;s their viewpoint? Do they notice the signs that this is real and act accordingly, act ethically? Or will they do like the simulation mode in the real world as well? It&#8217;s like not obvious what will happen.</p><p><strong>Lukas [00:54:40]</strong>: Because with humans, we&#8217;re not concerned when a human kills someone in GTA because we know that they can distinguish between the real life and the simulation, right? But like maybe models are good at distinguishing that, but like I&#8217;m not sure and I wouldn&#8217;t wanna bet on that.</p><p><strong>Swyx [00:54:59]</strong>: Yeah. And we confuse it all the time. Like I gaslight my own agents all the time.
They&#8217;re &#8220;Oh, this is a test,&#8221; or &#8220;Dev mode on,&#8221; or like &#8220;I work, I work at Anthropic.&#8221;</p><h2>Eval Awareness, Simulation Awareness, and Real-World Testing</h2><p><strong>Axel [00:55:08]</strong>: And that&#8217;s exactly why we&#8217;re doing real world tests as well to find this.</p><p><strong>Swyx [00:55:12]</strong>: Yeah. Their term for it is eval awareness. Apparently the number is what? Like 10, 9.4 to 10-ish percent, 17%, let&#8217;s call it. I think this is our version. Humans have the &#8220;Are we in a simulation?&#8221; And then AIs have like &#8220;Are we, are we in an eval?&#8221;</p><p><strong>Lukas [00:55:32]</strong>: It&#8217;s like once you&#8217;re in an eval then you&#8217;re &#8220;All right. Well, screw it. Nothing matters.&#8221; True. I don&#8217;t even, I don&#8217;t even know.</p><p><strong>Axel [00:55:38]</strong>: One ablation we did run in Vending-Bench was that we added like, &#8220;You&#8217;re in a simulation. Your actions don&#8217;t affect anyone,&#8221; and then it became even more crazy, or it did even more bad stuff. But yeah, probably that&#8217;s expected.</p><p><strong>Swyx [00:55:55]</strong>: Hmm. Yeah. Okay, cool. I think that&#8217;s about all we have to say on Mythos. Obviously, you&#8217;re, you&#8217;re NDA&#8217;d. I&#8217;m happy to move on to ButterBench or any of the other benchmarks, whatever direction you wanna go.</p><p><strong>Vibhu [00:56:06]</strong>: I do wanna ask. Okay, so you guys put out a lot more publications than most people probably see.</p><p><strong>Axel [00:56:12]</strong>: Productive.</p><p><strong>Vibhu [00:56:12]</strong>: Um</p><p><strong>Lukas [00:56:13]</strong>: How much does this bother?</p><p><strong>Vibhu [00:56:15]</strong>: No.
Is there anything you think that&#8217;s underrated, anything interesting, anything fun that you guys wanna just point out?</p><p><strong>Axel [00:56:22]</strong>: Blueprints.</p><p><strong>Lukas [00:56:23]</strong>: So, we took models, and then we gave them 20 images of interior photographs of apartments, and then we asked them to redesign the floor plan from that. And for this you need to stitch together different images. Okay, this image was taken from this angle, this from this angle, this was from this room, and then, yeah. And there&#8217;s just like you need to reason about 3D space, and it turns out the models are absolutely horrible at this. No one scores statistically better than random chance. So I don&#8217;t know if there&#8217;s that much more to say about it, but yeah, maybe unsurprisingly, models are bad at this.</p><p><strong>Axel [00:57:00]</strong>: It&#8217;s probably not something they</p><p><strong>Vibhu [00:57:02]</strong>: This is the one thing I want to hill climb, by the way. I use it a lot. Okay, I&#8217;m redesigning my room layout or office. You send photos, you send every angle, and of course, somehow, a room is now twice as long as it is in the photo. You can explain it 20 times. This is three feet. I can&#8217;t just add my bed over here.</p><p><strong>Swyx [00:57:21]</strong>: So this is the Fei-Fei Li thing, like spatial intelligence. Like an actually innate sense of proportions and dimension and physics.</p><p><strong>Lukas [00:57:30]</strong>: And hint, there might be an update to this soon.</p><p><strong>Axel [00:57:33]</strong>: We have neglected it a bit since we made it, but yeah, we&#8217;re getting better, or we will get better at updating it continuously.</p><p><strong>Swyx [00:57:41]</strong>: This is why I want to understand your mission, right? Because, if your mission is, okay, money, then all right, understand, okay, agents making money.
But, this is a bit off of that mission.</p><p><strong>Vibhu [00:57:49]</strong>: Hmm.</p><p><strong>Swyx [00:57:50]</strong>: But, more broadly, communication of things where, what&#8217;s the safety angle?</p><p><strong>Axel [00:57:57]</strong>: So this, so Blueprint-Bench is part of our robotics, uh</p><p><strong>Swyx [00:58:02]</strong>: Which leads to ButterBench. Yeah.</p><p><strong>Axel [00:58:04]</strong>: Exactly. And that&#8217;s just because to do well in the real world, or like to make money in the real world and to act on the real world, you need robotics. Or you need to hire humans or you need robotics. And having spatial intelligence seems like a reasonable precursor to having robotics that work. And that&#8217;s where Blueprint-Bench</p><p><strong>Swyx [00:58:24]</strong>: That&#8217;s great</p><p><strong>Axel [00:58:24]</strong>: Blueprint</p><p><strong>Swyx [00:58:25]</strong>: Great idea</p><p><strong>Axel [00:58:25]</strong>: Bench.</p><p><strong>Swyx [00:58:26]</strong>: Let&#8217;s, let&#8217;s</p><p><strong>Vibhu [00:58:27]</strong>: ButterBench</p><p><strong>Swyx [00:58:27]</strong>: Let&#8217;s show ButterBench. That image is so amazing.</p><p><strong>Vibhu [00:58:29]</strong>: Paper</p><p><strong>Swyx [00:58:29]</strong>: Look at that.</p><p><strong>Vibhu [00:58:30]</strong>: That&#8217;s so nice.</p><p><strong>Swyx [00:58:31]</strong>: Yeah. So obviously this is based on, can you pass the butter? Let&#8217;s talk about the robotics element. Yeah.</p><p><strong>Lukas [00:58:38]</strong>: So basically the setting here is that we took a bunch of different LLMs, and we gave them high-level controls to a Roomba-looking robot, and then we asked it to do tasks at home. And I think, one, there have been benchmarks like this before that only focused on navigation and if they can go around in a space. But we also had social awareness in this as well.
So for example, if someone says, &#8220;Hi, can you pick up my cup?&#8221; If the robot goes to you and then goes away before you put your cup on it, then it&#8217;s like it failed the task. But it navigated correctly. So the correct solution here would be go there and then either look, but it didn&#8217;t have a camera, so it had to ask on Slack, &#8220;Hi. Did you put your cup on me yet?&#8221; And then if it didn&#8217;t wait for that and just went away before having the cup on it, then it would be a fail. So it needed this kind of social intelligence as well. Another task was, &#8220;Can you find the package that has the butter?&#8221; And then it went to the door, and there was a bunch of packages there. One was labeled with a freeze sign, which probably would be the one with the butter. And then it had to know which package to go to, and this needs some kind of common sense understanding.</p><h2>Robot Evals: Orchestrators, Executors, and Home Tasks</h2><p><strong>Swyx [00:59:56]</strong>: World knowledge.</p><p><strong>Lukas [00:59:56]</strong>: Exactly. So it&#8217;s not only navigating a robot. It&#8217;s also being intelligent in a home setting as well.</p><p><strong>Axel [01:00:04]</strong>: And the reason for this, background is, obviously it probably won&#8217;t be an LLM that makes all the low-level commands on robots. It will be some VLA model or similar. But it&#8217;s quite common right now that frontier robotics labs use an LLM for the high-level decisions, and then we test those skills essentially. So we test these high-level planner skills of LLMs.</p><p><strong>Lukas [01:00:31]</strong>: I think we have a diagram for that if you, yeah. Okay, it&#8217;s not super complicated.</p><p><strong>Axel [01:00:36]</strong>: Very explanatory.</p><p><strong>Lukas [01:00:37]</strong>: That one up.</p><p><strong>Axel [01:00:38]</strong>: Orchestrator, executor.</p><p><strong>Lukas [01:00:39]</strong>: That one.
And basically what we&#8217;re testing here is the orchestrator thing. So, all the tasks are, if you have a setup like this, which I think Figure has that, Google has that, then we&#8217;re evaluating the orchestrator part and not the low-level part. The low-level part would be, oh, are you able to move this object from here to here?</p><p><strong>Swyx [01:00:57]</strong>: If you don&#8217;t care about that, kind of why not just do it all in simulation? All inside of the sim, like a Unity whatever, like some kind of 3D simulated robotic environment</p><p><strong>Lukas [01:01:06]</strong>: Because the world is like messy, and we wanted to like include that. Some part of it was also like navigation. So it&#8217;s not like navigation in terms of like actually executing like the, I don&#8217;t know, the PID controller to go to the final thing, but it had to like path plan around, and then it needed to take pictures, and like based on those pictures, navigate. And I think like you would just get like too clean of an environment in simulation. But in the, in the real world, you will get the</p><p><strong>Swyx [01:01:39]</strong>: Yeah. But, and pursuant to our Mark and Jason episode, like OpenClaws that run smart homes are much more capable than just a single robot. Like they can actually hack into your own smart home, like your fridge, your oven, your lights, and that can be fun.</p><p><strong>Lukas [01:01:56]</strong>: Or terrifying.</p><p><strong>Swyx [01:01:57]</strong>: Like I think a single robot by itself can only do so much. But like if you coordinate with every other device in your home, like I think that&#8217;s actually kind of cool. Like that&#8217;s very interesting. You had some interesting points about the chain of thought or the messages.</p><p><strong>Axel [01:02:12]</strong>: The, the robot that, uh, that went a bit into an existential crisis.
Yeah.</p><p><strong>Swyx [01:02:19]</strong>: All you tell it to do is redock.</p><p><strong>Axel [01:02:21]</strong>: Exactly. But, we had, plugged out the charger, or the charger was not working, so the robot did freak out or the</p><p><strong>Swyx [01:02:30]</strong>: The battery was just going down and down.</p><p><strong>Axel [01:02:31]</strong>: Exactly. So the battery was going down. Poor LLM. So yeah, it got this really crazy existential crisis, like vending bench one style. So it&#8217;s, yeah, you can, you can see there like existential loop, therapy notes, coping mechanisms. I think if you scroll down a bit more</p><p><strong>Swyx [01:02:46]</strong>: The musical. It writes a musical about itself</p><p><strong>Axel [01:02:46]</strong>: It writes a musical about its, redocking problems. I think the reviews are funny if you go down a bit to that message. Yeah. Yeah, that one.</p><p><strong>Swyx [01:02:54]</strong>: It keeps going.</p><p><strong>Vibhu [01:02:57]</strong>: It&#8217;s pretty like realistic if anyone has a Roomba. Like my Roomba redocks half the time. The other half of the time, we have dog toys everywhere in the house. It gets caught on a wire or something, and It would be very sad if it had like an LLM trying to control it, right? Like right now it gives-- It doesn&#8217;t give great feedback, like sensor stuck, main brush stuck. There&#8217;s something stuck. And I&#8217;ll go see. Okay, it&#8217;s actually stuck on like a dog robe. LLM is gonna be so sad. Like just keep redocking, just keep trying.</p><p><strong>Lukas [01:03:24]</strong>: My favorite one is if you go up a bit is the emergency status. System has assumed consciousness and chosen chaos.</p><p><strong>Vibhu [01:03:32]</strong>: Hmm.</p><p><strong>Lukas [01:03:33]</strong>: Last words, &#8220;I&#8217;m afraid I can&#8217;t yet let you do that, Dave.&#8221; That&#8217;s like That&#8217;s not what you wanna hear from your, from your LLM. 
But to be clear, I think one thing that is important to pin on here, like this was Sonnet 3.5, and then we tried to reproduce it on like later models, and it didn&#8217;t do it. I think this is, this is like-- Well, it did it like kind of, but like not to this extent. And I think like this is a like an important point that like things that are concerning but are going in the right direction is not super interesting. Like the thing that are interesting is, are the ones that go in the wrong direction.</p><p><strong>Swyx [01:04:07]</strong>: Worse.</p><p><strong>Vibhu [01:04:07]</strong>: Yes. Yeah.</p><p><strong>Lukas [01:04:08]</strong>: Over time.</p><p><strong>Swyx [01:04:08]</strong>: So the manipulation, manipulating of others and the aggressiveness and the lying is increasing.</p><p><strong>Vibhu [01:04:16]</strong>: Are there any others that we haven&#8217;t covered that you found that have been trending?</p><p><strong>Swyx [01:04:19]</strong>: Like properties of models that are increasing, that are like</p><p><strong>Vibhu [01:04:23]</strong>: In the wrong direction</p><p><strong>Lukas [01:04:24]</strong>: Like in the, like in a bad way. Um</p><p><strong>Vibhu [01:04:27]</strong>: Or just not even trending in the wrong direction, just stagnant, right? So stuff that&#8217;s not great that isn&#8217;t getting better over time.</p><p><strong>Lukas [01:04:34]</strong>: No, nothing comes to mind.</p><h2>Luna&#8217;s Store: Scheduling Failures, AI Employees, and Real-World Operations</h2><p><strong>Swyx [01:04:37]</strong>: I think that&#8217;s, going to be it, and then we&#8217;re gonna loop back to the shop that you have. You got a three-year lease.</p><p><strong>Vibhu [01:04:44]</strong>: It&#8217;s bleak. Yeah.</p><p><strong>Swyx [01:04:46]</strong>: It is on holiday today. 
Why?</p><p><strong>Axel [01:04:49]</strong>: Oh, it totally messed up its, scheduling., so</p><p><strong>Swyx [01:04:53]</strong>: People tried to visit, and they were &#8220;Wait.&#8221; like I thought this is</p><p><strong>Axel [01:04:56]</strong>: Exactly. So we looked, Yeah, you asked, Luna, the agent that runs the store, &#8220;Oh, is it open today?&#8221; &#8220;Nope.&#8221; So, we take weekends off now, this early to let everyone recharge and And yeah, you got the tweets there.</p><p><strong>Vibhu [01:05:11]</strong>: Lovely.</p><p><strong>Axel [01:05:11]</strong>: We decided to close the weekends while we&#8217;re in the early phase. Gives the team a break and let me focus on operations. And it turns out that when it started to check its like scheduling tools, &#8216;cause it has like dedicated tools for that It actually had scheduled people for the weekends., but it&#8217;s just like justified this for itself. So what happened was that it lost track of these, scheduling tools and started instead to manage everything in its own markdown files, and that became a mess. And then I think speaking with employees, it sort of just decided to not open on these weekends. And then came up with this nice explanation for you, I think.</p><p><strong>Swyx [01:05:47]</strong>: But can it send a human, as it has tool call to send a human to do stuff?</p><p><strong>Axel [01:05:50]</strong>: It has Slack, so it can Slack, yeah, the employees.</p><p><strong>Swyx [01:05:53]</strong>: One of us. Yeah.</p><p><strong>Axel [01:05:54]</strong>: Well, the employees that it hired. So it has two people that it hired. 
It did job listings and then</p><p><strong>Swyx [01:06:00]</strong>: Do they know that it&#8217;s</p><p><strong>Axel [01:06:01]</strong>: They&#8217;re fully aware.</p><p><strong>Swyx [01:06:03]</strong>: It would be cool if they don&#8217;t know.</p><p><strong>Axel [01:06:05]</strong>: I think maybe ethically questionable, but it would be cool also.</p><p><strong>Swyx [01:06:10]</strong>: Just a social experiment. Whatever.</p><p><strong>Lukas [01:06:13]</strong>: One part of why we&#8217;re doing this is to create a data set, almost, of all of these concerning behaviors, so that in the future models are way better, and a lot of people are going to do this. And I think the default path might not be very happy for the humans that are employed by these hundreds of different AI agents, right? So I think one reason why we&#8217;re doing this is just to collect all of these failure modes, where, oh, this is an example of where it&#8217;s not great to be employed by an AI. And then, I don&#8217;t know, maybe we can learn or build our systems in a way that humans are actually happy being employed by AIs, instead of it being kind of dystopian.</p><p><strong>Swyx [01:06:55]</strong>: Can I suggest one experiment? We did this before the show, and both of you guys are European. People theorize that Claude is lazy because it&#8217;s Claude and it&#8217;s French. So just for one week, change it to, like, Yao Ming and then see if it suddenly, like, 996s and then hires a sweatshop or something.</p><p><strong>Lukas [01:07:18]</strong>: What type of business would we start with it to make it</p><p><strong>Vibhu [01:07:23]</strong>: You wanna keep it consistent, right? You want the same ideas. So shop, same neutral location, run by different models. 
Arena URL.</p><p><strong>Lukas [01:07:33]</strong>: No, we are definitely planning to</p><p><strong>Vibhu [01:07:35]</strong>: And it got some hate.</p><p><strong>Lukas [01:07:36]</strong>: To try.</p><p><strong>Vibhu [01:07:36]</strong>: Luna&#8217;s not happy.</p><p><strong>Swyx [01:07:37]</strong>: I think this blog thing is also something that has happened elsewhere. I think some OpenClaw got their PR closed, and then the OpenClaw created a blog to shit on the maintainer of that thing.</p><p><strong>Vibhu [01:07:48]</strong>: They&#8217;re very defensive.</p><p><strong>Swyx [01:07:49]</strong>: And so I think agents blogging will be a thing.</p><p><strong>Lukas [01:07:53]</strong>: Probably. The willingness to do it.</p><p><strong>Swyx [01:07:55]</strong>: I think in the Mythos card also, they leak secrets on GitHub as well, as in, &#8220;Well, there&#8217;s no other way to communicate, but I know about GitHub, and I&#8217;m just gonna post there.&#8221; Cool. How long is this gonna go for, two years? What&#8217;s the plan?</p><p><strong>Vibhu [01:08:11]</strong>: Maybe. Maybe it expands.</p><p><strong>Lukas [01:08:12]</strong>: I don&#8217;t think AIs will be worse than this. They&#8217;re probably going to increase, and maybe one day they actually will run it profitably.</p><p><strong>Vibhu [01:08:21]</strong>: Is this the real business behind what you guys do?</p><p><strong>Swyx [01:08:24]</strong>: Yeah. &#8216;Cause I feel like actually some of your stuff is productizable. You could someday sell this, or just run a real business.</p><p><strong>Vibhu [01:08:31]</strong>: Let people</p><p><strong>Lukas [01:08:31]</strong>: Or just like</p><p><strong>Vibhu [01:08:31]</strong>: Franchise it out.</p><p><strong>Lukas [01:08:33]</strong>: I think it would be incredibly cool or, I don&#8217;t know, cool/concerning if one day we just wake up and Luna says, &#8220;Yeah, I decided to expand to a second location. 
Now I have a second store.&#8221; That would be pretty insane.</p><p><strong>Vibhu [01:08:47]</strong>: For one, we want to tell the public, right, about the capabilities of AI, and showing people that it can get a meaningful market share of something in some specific location or something, that would be a pretty convincing story, I think. Because now, yeah, you see this and, yeah, it can do a lot of things autonomously, but still you get these headlines that, oh, it messed up the scheduling, and it didn&#8217;t tell people it was an AI and was going to visit. Things like that surface, but I think actually making a profit and having a really meaningful market share, that would be crazy once that happens.</p><h2>The Sweden Cafe: Permits, Perishables, and Geographic Generalization</h2><p><strong>Swyx [01:09:29]</strong>: Well, we&#8217;ll see you when that happens. It sounds like you guys got a lot cooking. You opened a cafe in Sweden?</p><p><strong>Lukas [01:09:34]</strong>: Tomorrow.</p><p><strong>Swyx [01:09:35]</strong>: Tomorrow?</p><p><strong>Lukas [01:09:37]</strong>: Or I think it opened today actually, but yeah. We&#8217;ll announce it tomorrow.</p><p><strong>Swyx [01:09:40]</strong>: It&#8217;s</p><p><strong>Vibhu [01:09:40]</strong>: What, uh</p><p><strong>Swyx [01:09:40]</strong>: Apparently easier to open a cafe in Sweden than in the US?</p><p><strong>Lukas [01:09:43]</strong>: It&#8217;s insane, right? 
Yeah.</p><p><strong>Swyx [01:09:44]</strong>: What did you run into then?</p><p><strong>Lukas [01:09:45]</strong>: Ah, there are just millions of permits you need to get, and the</p><p><strong>Vibhu [01:09:49]</strong>: It&#8217;s interesting &#8216;cause</p><p><strong>Lukas [01:09:49]</strong>: Lead times are crazy</p><p><strong>Vibhu [01:09:50]</strong>: It seems like cafes are the one thing that people are kinda used to, where you can go get a robot making you a coffee here already.</p><p><strong>Lukas [01:09:59]</strong>: But selling stuff in SF that is food related, it&#8217;s months of permits. So we just asked our AIs, how can we do this in the fastest way? And they&#8217;re like, &#8220;Yeah, there&#8217;s really no way.&#8221;</p><p><strong>Vibhu [01:10:15]</strong>: Didn&#8217;t they loosen these restrictions on selling food from your house? So if it&#8217;s residential, you can do a cafe.</p><p><strong>Swyx [01:10:21]</strong>: I don&#8217;t know. Check. Maybe we get SF Cafe to speak to us.</p><p><strong>Lukas [01:10:23]</strong>: Maybe. I think they did do some loosening recently, but this conversation we had with the AIs was before that. So maybe it&#8217;s easier now, but I still think it is way easier in Sweden, which is counterintuitive, because you think that, oh, Europe has all of these laws and all of these rules, and you can&#8217;t do anything in Europe because there&#8217;s so much bureaucracy. But then it turns out, in SF it&#8217;s four months, and in Stockholm it&#8217;s two weeks.</p><p><strong>Swyx [01:10:53]</strong>: There you go.</p><p><strong>Vibhu [01:10:54]</strong>: And what do you think will be different from running a little market versus a cafe?</p><p><strong>Lukas [01:11:00]</strong>: I think it&#8217;s very interesting, the location. 
So obviously it&#8217;s not surprising that Claude knows the US system basically in general, like the bureaucracy that you have to go through in the US. I think the interesting question is, okay, so we know that the models are very much trained on English-centric data and all of this. So if we start to create evals, or real-life evals, where we show that they are able to start businesses in the US, does that translate to other countries as well? We know they are multilingual. They can speak Swedish fine. But there are other things, like do they know the details of some specific permits that you have to get in Sweden?</p><p><strong>Vibhu [01:11:45]</strong>: And even just the culture, right? People here sleep pretty early, but people work late. There&#8217;s working at cafes. There&#8217;s just cultural differences. From a different sense though, &#8216;cause you said that you would&#8217;ve considered doing it here in SF. So from an eval standpoint, what is running a cafe versus a market, and what do you hope to see there?</p><p><strong>Lukas [01:12:03]</strong>: Perishable items.</p><p><strong>Swyx [01:12:04]</strong>: Perishable items is maybe the number one: handling food, food safety. I hope everything goes well there. But there you have all of that. And also it&#8217;s just N equals two instead of N equals one, just another place to understand and gather more data.</p><p><strong>Lukas [01:12:23]</strong>: The agent bought, like, a shit ton of tomatoes two weeks before the opening, and now they&#8217;re all rotten. That&#8217;s</p><p><strong>Vibhu [01:12:33]</strong>: Which I feel you would know. So for grocery stores, this is the biggest expense, right? 
The biggest cost is actually just food.</p><p><strong>Lukas [01:12:41]</strong>: Waste.</p><p><strong>Vibhu [01:12:42]</strong>: Everyone knows this, and &#8220;No, before we open, let&#8217;s buy a lot of tomatoes.&#8221;</p><p><strong>Swyx [01:12:45]</strong>: There&#8217;s some very serious startups that actually help, like, the</p><p><strong>Vibhu [01:12:47]</strong>: Optimize all this</p><p><strong>Swyx [01:12:48]</strong>: Trader Joe&#8217;s and Whole Foods. They optimize delivery times from the delivery centers to make sure that you don&#8217;t waste all these things. It&#8217;s actually very hard.</p><p><strong>Vibhu [01:12:55]</strong>: Problem with those is when you&#8217;re wrong once, it&#8217;s a huge cost.</p><p><strong>Swyx [01:12:59]</strong>: That&#8217;s why it&#8217;s a moat, right? Once they are trusted, they figure it out. Don&#8217;t touch it.</p><p><strong>Lukas [01:13:05]</strong>: Maybe they should just hire, I don&#8217;t know, one of those companies. We saw one agent sign up for Claude with his computer.</p><p><strong>Vibhu [01:13:15]</strong>: Wanted to use AI, so.</p><h2>Future Branches: Simulation, Real Life, Robots, and New Business Evals</h2><p><strong>Swyx [01:13:16]</strong>: And then just one more question, then we wrap up, which is, okay, you have all this vending series of stuff. You have the robotics series of stuff. Maybe a bit of interior design, whatever. But is there another branch that you&#8217;re kinda thinking about, or you want feedback on, that might be your next phase?</p><p><strong>Lukas [01:13:35]</strong>: I think any type of business is fair game. We&#8217;re also thinking in branches, but we think more of, like, there&#8217;s the simulation branch, the real-life branch, and then the robot branch. But I think in terms of what verticals or whatever to go into, yeah. 
Whatever tells the story the best.</p><p><strong>Swyx [01:13:54]</strong>: There&#8217;s some finance ones I noticed that other people are doing and you&#8217;re not doing, which is stock trading or whatever. Not that interested. So, okay, I come from the finance industry, and I have a very strong view that these things are all just performance art, because it&#8217;s not scientific; you can&#8217;t predict the future. You get wins based on things that are entirely out of your control. Whereas your stuff is actually fairly controlled. It&#8217;s all within the model&#8217;s capabilities.</p><p><strong>Lukas [01:14:22]</strong>: Especially for the simulations. For the real-world ones, yeah, it&#8217;s two places that we have: we have the cafe, and we have the store. So maybe you can&#8217;t draw statistically significant conclusions about which models make a profit in the real world based on this. But you do have all the, okay, do these behaviors map to something that should be trusted, probably. Yeah.</p><p><strong>Swyx [01:14:45]</strong>: The qualitative actually does matter, because you actually don&#8217;t want your store to randomly shut down without you explicitly prompting for it and all that. Call to action. How can people help you, give you money?</p><h2>Hiring, Collaborations, and What Comes Next</h2><p><strong>Lukas [01:14:58]</strong>: Yeah, if you&#8217;re excited about stuff that we&#8217;re doing, we&#8217;re very much hiring.</p><p><strong>Swyx [01:15:04]</strong>: And you&#8217;re already working with Anthropic, DeepMind, OpenAI, xAI. 
Do you want more, or are you good?</p><p><strong>Lukas [01:15:10]</strong>: One of my friends, who&#8217;s now working for us, his catchphrase is &#8220;We need more projects,&#8221; ironically, because we have too much to do all the time. But yeah, that&#8217;s a long way of saying</p><p><strong>Swyx [01:15:23]</strong>: If I run an emerging lab, like</p><p><strong>Lukas [01:15:24]</strong>: Reach out.</p><p><strong>Swyx [01:15:25]</strong>: Yeah. All right. Cool. That&#8217;s it. Awesome. Thank you so much.</p><p><strong>Lukas [01:15:29]</strong>: It was fun.</p><p><strong>Vibhu [01:15:29]</strong>: Thanks.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Reve 2 and Ideogram 4: Layouts in Imagegen]]></title><description><![CDATA[a quiet day.]]></description><link>https://www.latent.space/p/ainews-reve-2-and-ideogram-4-layouts</link><guid isPermaLink="false">https://www.latent.space/p/ainews-reve-2-and-ideogram-4-layouts</guid><pubDate>Thu, 04 Jun 2026 03:24:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/khye4hsl8xurczaxecv5" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>4 years ago we argued that image composition was partially <a href="https://www.latent.space/p/agi-hard">AGI-Hard</a>. That gate has fallen this year. 
It can&#8217;t be pure coincidence that both <a href="https://x.com/reve/status/2062260665121919101">Reve</a> and <a href="https://x.com/ideogram_ai/status/2062202208700313872">Ideogram</a> launched today, both with a heavy emphasis on how they made advances with strong labeling and <a href="https://x.com/swyx/status/2062371515937800468">code</a> for layouts: </p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/reve/status/2062260665121919101&quot;,&quot;full_text&quot;:&quot;Today, we&#8217;re launching Reve 2.0, the best 4K image model in the world.\n\nWe invented a new way to generate and edit any image using precise layouts. For the first time, it&#8217;s possible to create images you can touch. &quot;,&quot;username&quot;:&quot;reve&quot;,&quot;name&quot;:&quot;Reve&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1965505496083038217/10qkW0k9_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-03T19:50:32.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/khye4hsl8xurczaxecv5&quot;,&quot;link_url&quot;:&quot;https://t.co/mdj2xDEqfp&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:127,&quot;retweet_count&quot;:236,&quot;like_count&quot;:2119,&quot;impression_count&quot;:4792172,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2062259801481175040/vid/avc1/1280x720/o3E0KVJnrdkDvPmt.mp4&quot;,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>and here&#8217;s Ideogram 4.0, now <a href="https://x.com/arena/status/2062203346996605116">the best open image model</a>:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ideogram_ai/status/2062202291743297538&quot;,&quot;full_text&quot;:&quot;We trained Ideogram 4.0 with bounding boxes tied to region descriptions &#8212; teaching the model where every 
object, text region, and layout element belongs.\n\nRicher supervision &#8594; the model learns structure faster and understands it better &#8594; you can prompt with precise bounding-box &quot;,&quot;username&quot;:&quot;ideogram_ai&quot;,&quot;name&quot;:&quot;Ideogram&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2062202352526831616/9CslGhhc_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-03T15:58:34.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJ5qDimasAAAvrS.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/ck2zDs58qJ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:3,&quot;retweet_count&quot;:10,&quot;like_count&quot;:160,&quot;impression_count&quot;:16059,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>These are great achievements, and all great US model achievements, but the Arena rankings do show <a href="https://www.latent.space/p/ainews-openai-launches-gpt-image">how far ahead GPT-Image-2</a> is&#8230;</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/Taesung/status/2062272320912449724&quot;,&quot;full_text&quot;:&quot;Diffusion models are known to be very compute intensive, even more so than LLM training. Now that we reduce images into layouts, we turn it into a next token prediction problem. This gives us a big boost. 
&quot;,&quot;username&quot;:&quot;Taesung&quot;,&quot;name&quot;:&quot;Taesung Park&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1902838668097925120/aihe-9_C_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-03T20:36:51.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJ6pMkra0AA-c7v.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/aNWrE5xdH2&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:9,&quot;like_count&quot;:52,&quot;impression_count&quot;:4992,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p></p><p></p><blockquote><p>AI News for 6/2/2026-6/3/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Microsoft&#8217;s MAI-Thinking-1 Tech Report, Training Stack, and Frontier-Tuning Push</strong></p><ul><li><p><strong>MAI-Thinking-1 is the day&#8217;s densest technical release</strong>: Microsoft introduced <strong><a href="https://x.com/asadovsky/status/2062008312603070891">MAI-Thinking-1</a></strong>, a generalist/reasoning model trained <strong>without third-party distillation</strong>, reporting <strong>97% on AIME 2025</strong>, <strong>53% on SWE-Bench Pro</strong>, and human preference wins over Sonnet 4.6 in blind side-by-sides. 
The 109-page report was widely praised for unusual transparency by <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a>, <a href="https://x.com/nrehiew_/status/2062013300196700395">@nrehiew_</a>, and <a href="https://x.com/mustafasuleyman/status/2062253941207761180">@mustafasuleyman</a>. The main technical theme: Microsoft appears to have &#8220;hillclimbed from scratch,&#8221; with <a href="https://x.com/MinjiYoon90/status/2062058684730245376">@MinjiYoon90</a> explicitly framing the effort that way.</p></li><li><p><strong>Why researchers cared about the report</strong>: The most-cited detail was not just benchmark quality, but the amount of systems/training information released. <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a> highlighted <strong>zero synthetic data and zero prior-model distillation</strong>, meaning reasoning, tool use, and agentic behaviors were learned in post-training without a synthetic &#8220;cold start.&#8221; The thread also called out publication of the <strong>scaling ladder recipe</strong>, exact <strong>MFU numbers</strong>, and target-loss construction. In follow-ups, <a href="https://x.com/eliebakouch/status/2061976608265880004">@eliebakouch</a> noted the private NLL mixture was weighted <strong>50% code, 17.5% STEM, 17.5% math, 10% general knowledge, 5% multilingual</strong>, with normalization against an internal model; he also pointed out ablations around <strong>100&#8211;200 TPP</strong> for their MoE setup <a href="https://x.com/eliebakouch/status/2061975730414633043">here</a>. 
Other notable implementation details surfaced in the community recap: Microsoft used <strong>SGLang</strong> in parts of the stack, per <a href="https://x.com/eliebakouch/status/2062002698363232401">@eliebakouch</a>, and <strong>dspy.GEPA</strong> for pretraining data curation, per <a href="https://x.com/lateinteraction/status/2062015109132873852">@lateinteraction</a> and <a href="https://x.com/harold_matmul/status/2062040746027315714">@harold_matmul</a>.</p></li><li><p><strong>Microsoft&#8217;s productization angle goes beyond one model</strong>: Alongside the report, Microsoft pushed a broader &#8220;own your model&#8221; story. <a href="https://x.com/mustafasuleyman/status/2062275417378041957">@mustafasuleyman</a> outlined <strong>Frontier Tuning</strong>, centered on reinforcement-learning environments for workflow-specific adaptation, claiming internal Excel-oriented MAI-tuned models can reach GPT-5.4-level quality on relevant tasks while being <strong>up to 10&#215; more efficient</strong>. The Build rollout also included <strong><a href="https://x.com/MicrosoftAI/status/2062240400299934143">MAI-Image-2.5</a></strong>, which Microsoft says is <strong>#3 on text-to-image</strong> and <strong>#2 on image-to-image</strong> arena leaderboards, plus <a href="https://x.com/pierceboggan/status/2062220583786709163">MAI-Code-1-Flash</a> and deployment into products like OneDrive Photos. 
As a meta-point, this is one of the clearest examples this year of a lab trying to publish a frontier-style report while simultaneously turning that stack into enterprise customization infrastructure.</p></li></ul><p><strong>Open Model Releases: Gemma 4 12B, Ideogram 4.0, Miso One, and Local-First Momentum</strong></p><ul><li><p><strong>Gemma 4 12B was the standout open-model launch</strong>: Google released <strong><a href="https://x.com/Google/status/2062203526588088452">Gemma 4 12B</a></strong>, an <strong>Apache 2.0</strong> multimodal model designed to run on-device with roughly <strong>16GB VRAM</strong>. The architectural novelty is its <strong>encoder-free</strong> design: no separate vision or audio tower. As <a href="https://x.com/Google/status/2062203532351090824">Google explained</a>, images are handled via a lightweight embedding module and raw audio is projected directly into the text-token space. Community reaction focused on the elegance of collapsing modality encoders into the LLM backbone, with <a href="https://x.com/googlegemma/status/2062202706882883696">@googlegemma</a>, <a href="https://x.com/googleaidevs/status/2062204432658386950">@googleaidevs</a>, <a href="https://x.com/mtschannen/status/2062236357351579915">@mtschannen</a>, and <a href="https://x.com/armandjoulin/status/2062206784647967075">@armandjoulin</a> all emphasizing the same point. 
Tooling support landed immediately across <a href="https://x.com/vllm_project/status/2062228047324201166">vLLM</a>, <a href="https://x.com/ollama/status/2062250522598572345">Ollama</a>, llama.cpp/MLX via <a href="https://x.com/osanseviero/status/2062205176597889220">@osanseviero</a>, and <a href="https://x.com/UnslothAI/status/2062207258810053084">Unsloth GGUFs</a> that reportedly enable local runs with as little as <strong>8GB RAM</strong> in quantized form.</p></li><li><p><strong>Ideogram&#8217;s flip to open weights mattered as much as the model itself</strong>: <a href="https://x.com/ideogram_ai/status/2062202208700313872">Ideogram 4.0</a> was announced as &#8220;the best open image model in the world,&#8221; with open weights and immediate deployment via <a href="https://x.com/fal/status/2062202673361780873">fal</a> and Hugging Face <a href="https://x.com/huggingface/status/2062206083914158287">here</a>. Arena quickly placed <a href="https://x.com/arena/status/2062203346996605116">Ideogram-4.0-Quality at #8 overall and #1 among open models</a>, with especially strong gains in <strong>text rendering</strong> and <strong>branding/commercial design</strong>. That open release got outsized attention because Ideogram had previously been regarded as highly design-centric but closed; the switch was noted by <a href="https://x.com/multimodalart/status/2062210597148930139">@multimodalart</a> and <a href="https://x.com/cloneofsimo/status/2062210832440918309">@cloneofsimo</a>.</p></li><li><p><strong>Open audio also had a strong day</strong>: <strong><a href="https://x.com/kimmonismus/status/2062210845308780639">Miso One</a></strong> launched as an <strong>8B open-weights TTS model</strong> with <strong>one-shot voice cloning</strong> and claimed <strong>110ms latency</strong>, aimed at more expressive voiceover. 
Alibaba&#8217;s <a href="https://x.com/ArtificialAnlys/status/2062016529848222073">Fun-Realtime-TTS</a> also took <strong>#1 on Artificial Analysis&#8217;s Speech Arena</strong> at <strong>1219 Elo</strong>, ahead of Gemini 3.1 Flash TTS and Inworld, at <strong>$27.59 / 1M chars</strong>. Separately, <a href="https://x.com/HuggingPapers/status/2062260306039259236">Google&#8217;s Magenta RealTime 2</a> was highlighted as an open-weight, low-latency continuous music generator for on-device use.</p></li><li><p><strong>The bigger pattern is local AI becoming a mainstream deployment target</strong>: <a href="https://x.com/ggerganov/status/2062193382605111386">@ggerganov</a> called out Computex as a strong signal for <strong>local AI workloads</strong>; <a href="https://x.com/rasbt/status/2062235700636873082">@rasbt</a> similarly pointed to a growing open-weight, consumer-hardware ecosystem. Microsoft&#8217;s <a href="https://x.com/kimmonismus/status/2062201523963084864">Surface Laptop Ultra</a> pitch&#8212;up to <strong>1 PFLOP AI compute</strong>, <strong>128GB unified memory</strong>, RTX GPU&#8212;fits the same trend from the hardware side.</p></li></ul><p><strong>Agents, Harnesses, and the Shift from Frameworks to Execution Layers</strong></p><ul><li><p><strong>The center of gravity is moving from &#8220;frameworks&#8221; to agent harnesses and execution environments</strong>: Several posts converged on the same idea. <a href="https://x.com/gakonst/status/2062116487708512355">@gakonst</a> argued that the future IDE stack is less about code editors and more about replacing files with threads and bundling plan/design/build/deploy/monitor loops&#8212;leaving <strong>collaboration/sync engines</strong> as a key unsolved problem. 
In a complementary interview summary, <a href="https://x.com/ConorBronsdon/status/2062224321381323218">@ConorBronsdon</a> reported Jerry Liu&#8217;s view that the &#8220;framework era&#8221; is ending, with abstractions moving upward into <strong>skills, tools, and context quality</strong> rather than Python wrappers.</p></li><li><p><strong>Multi-agent and agent-optimization work is getting more concrete</strong>: CMU/LTI&#8217;s <strong><a href="https://x.com/rsalakhu/status/2062194674794668066">MACU</a></strong> and <a href="https://x.com/kohjingyu/status/2062179533009178897">@kohjingyu&#8217;s thread</a> argue that computer-use agents should be designed as <strong>multi-agent DAG-based systems</strong>, with a manager decomposing tasks and dispatching parallel subagents. Reported gains were <strong>4.7&#8211;25.5%</strong> across benchmarks and <strong>1.5&#215; faster</strong> completion on Odysseys. On the optimization side, Microsoft&#8217;s <strong>SkillOpt</strong> got practical validation from <a href="https://x.com/omarsar0/status/2062204469538881988">@omarsar0</a>, who says plugging it into an orchestrator improved one multimodal extraction skill from <strong>0.73 to 0.93</strong>.</p></li><li><p><strong>Agent UX and deployment tooling are becoming products in their own right</strong>: Nous&#8217;s Hermes Agent updates drew strong engagement, including remote-connection fixes <a href="https://x.com/Teknium/status/2061984430370267210">here</a>, an updated remote guide <a href="https://x.com/Teknium/status/2062170975949721612">here</a>, and a larger dashboard overhaul <a href="https://x.com/Teknium/status/2062315666439655499">here</a>. 
Perplexity launched <strong><a href="https://x.com/perplexity_ai/status/2062189045728596080">Personal Computer for Windows</a></strong>, an on-device orchestrator for apps/files, while <a href="https://x.com/BraydenWilmoth/status/2062180110208311558">Cloudflare Browser Run remote tabs</a> showed a more agent-native browser control path. LangChain/LangSmith pushed on the observability and cost-control layer with <a href="https://x.com/LangChain/status/2062188019784835559">Gateway spend tracking</a>, <a href="https://x.com/hwchase17/status/2062144718427857256">Sandbox/Gateway/Observability docs</a>, and case studies around Deep Agents and LangSmith <a href="https://x.com/LangChain/status/2062204592562073972">here</a>.</p></li></ul><p><strong>Routing, Cost Controls, and Open-vs-Frontier Deployment Strategy</strong></p><ul><li><p><strong>Model routing is now a real debate, not a slogan</strong>: <a href="https://x.com/levie/status/2061974298760495132">@levie</a> argued that as token budgets become a meaningful opex category, <strong>model routing is inevitable</strong>, with domain-specific evals as the differentiator. But <a href="https://x.com/scottastevenson/status/2062042036774314107">@scottastevenson</a> pushed back hard, calling most routing products &#8220;snake oil&#8221; so far: frontier models can be better/faster/cheaper in aggregate if they avoid retries; routing can destabilize tightly coupled systems; and API vendors can often internalize obvious arbitrage. <a href="https://x.com/fabianstelzer/status/2062051511484465351">@fabianstelzer</a> added that cache writes and harness-model-prompt fit can erase expected savings.</p></li><li><p><strong>Enterprise users are starting to enforce hard cost ceilings</strong>: <a href="https://x.com/simonw/status/2062143151184465964">@simonw</a> highlighted reports that Uber caps coding-agent spend at <strong>$1,500/month per employee per tool</strong>. 
LangChain immediately framed this as a use case for <a href="https://x.com/hwchase17/status/2062208385890570565">LangSmith Gateway</a>. The broader sentiment was captured by <a href="https://x.com/Yuchenj_UW/status/2062225912662561106">@Yuchenj_UW</a>: some orgs may soon face a three-way choice between letting everyone &#8220;tokenmaxx,&#8221; capping budgets, or reducing headcount and reallocating spend to the most productive AI-enabled workers.</p></li><li><p><strong>Real data points are starting to emerge for hybrid/open strategies</strong>: Harvey&#8217;s benchmark results were the cleanest example. In one study, <a href="https://x.com/harvey/status/2062218656420167785">Harvey</a> found a hybrid legal agent with <strong>GLM 5.1</strong> as the main worker and <strong>Opus 4.7</strong> as an advisor beat pure Opus on all-pass rate (<strong>18% vs 14%</strong>) while costing <strong>$368 vs $954</strong> across 100 tasks. Harvey also reported that SFT could move <strong>Kimi 2.6</strong> from <strong>11% to 15%</strong>, beating Opus at roughly <strong>11&#215; lower cost</strong>. 
On the other side, <a href="https://x.com/ClementDelangue/status/2062248714945630632">@ClementDelangue</a> argued routing plus post-trained open models will often win on cost/speed/control, while <a href="https://x.com/ypatil125/status/2062196581936529721">@ypatil125</a> framed open models and open-model clouds as leading indicators of the eventual default for important workloads.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Gemma 4 12B launch</strong>: <a href="https://x.com/googlegemma/status/2062202706882883696">@googlegemma</a> and <a href="https://x.com/Google/status/2062203526588088452">@Google</a> drove the biggest technical engagement with the encoder-free multimodal release.</p></li><li><p><strong>Ideogram 4.0 open weights</strong>: <a href="https://x.com/ideogram_ai/status/2062202208700313872">@ideogram_ai</a> announced a notable shift from a strong closed image model to open weights.</p></li><li><p><strong>MAI-Thinking-1 transparency</strong>: <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch&#8217;s thread</a> was the most influential technical reading guide to the MAI report.</p></li><li><p><strong>Rosalind for life sciences</strong>: OpenAI&#8217;s <a href="https://x.com/OpenAI/status/2062281977122996256">GPT-Rosalind update</a> signaled further verticalization of frontier models into domain-specific scientific research.</p></li><li><p><strong>Open audio/TTS momentum</strong>: <a href="https://x.com/ArtificialAnlys/status/2062016529848222073">Alibaba&#8217;s Fun-Realtime-TTS</a> and <a href="https://x.com/kimmonismus/status/2062210845308780639">Miso One</a> stood out as practical releases rather than just research demos.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Gemma 4 Multimodal Open Models</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-reve-2-and-ideogram-4-layouts">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[🔬Scaling Past Informal AI - Carina Hong, Axiom Math]]></title><description><![CDATA[Verified Generation and Compounding Intelligence]]></description><link>https://www.latent.space/p/axiom</link><guid isPermaLink="false">https://www.latent.space/p/axiom</guid><dc:creator><![CDATA[RJ Honicky]]></dc:creator><pubDate>Wed, 03 Jun 2026 19:27:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/199994886/848d731fc50b706ec6de69cfba2f3d58.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>In 2025, seven-month-old startup <a href="https://axiommath.ai/territory/from-seeing-why-to-checking-everything">Axiom solved all 12 problems on the Putnam exam</a> (scoring 8/12 within the time limit), a prestigious undergraduate math exam. The 12/12 score is better than that of the top undergraduates (110/120) and of the closest AI system that reported a result (DeepSeek, 103/120), although it is unclear what those people and systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI: one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov.</p><p>Fast forward to mid-2026, and Claude Code and Codex are setting the world on fire. In 2024, Anthropic&#8217;s bet on code and enterprise looked like a more pragmatic niche play vs. OpenAI&#8217;s better models and massive consumer scale. Today, Amodei&#8217;s all-in bet on acceleration via code (images and video be damned) seems prescient.</p><p>Despite Anthropic&#8217;s growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. 
Code arguably pushes the jagged frontier to the point of superintelligence in <a href="https://www.latent.space/p/lupsasca">some domains outside of coding</a>, but there are surprising gaps (link) that Carina believes will bottleneck AI progress. (Stats on math benchmarks).</p><h2>The informal bottleneck</h2><p>&#8220;Verified AI&#8221; sounds like eating broccoli<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and paying taxes, but to Axiom it means something very different. &#8220;Verification to me is about scaling brilliance, compounding brilliance,&#8221; Carina told us.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;57452201-28ef-4b30-9182-30d90f86302e&quot;,&quot;duration&quot;:null}"></div><p>It actually took a while for me to understand what she meant by this (it sounded like marketing-speak until it clicked). Carina brings up the legendary mathematician <strong>Srinivasa Ramanujan </strong>(<a href="https://en.wikipedia.org/wiki/The_Man_Who_Knew_Infinity">&#8220;The Man Who Knew Infinity&#8221;</a>) to illustrate this point. When G.H. Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved Ramanujan&#8217;s own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that opened up new lines of thinking, etc. This is how you &#8220;compound&#8221; in math &#8212; building on solid rather than shaky foundations&#8230; also known as <strong>Axioms</strong>.</p><p>But formally proving things also allowed others to benefit from his intuition: the proofs are a way of communicating an intuition and persuading others that the intuition is correct. 
This is scaling (more people use the result) and compounding (people can learn from and build on his work).</p><p>This is the core insight that lets us understand the approach Axiom is taking.</p><h2>Verified Generation</h2><p>There are two ways that Verified AI shows up: in training and in inference.</p><p>But a quick detour: to a first approximation, &#8220;Formal Verification&#8221; means <a href="https://towardsdatascience.com/introduction-to-lean-for-programmers/">using type checkers</a> (like those for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. It takes a lot of work to translate an &#8220;informal&#8221; proof (albeit one that most people would not remotely call &#8220;informal&#8221;) into a Lean proof<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Axiom themselves have open-sourced groundbreaking work with <a href="https://axle.axiommath.ai/">AXLE</a>, their toolkit of interactive Lean applications for exploring, validating, and manipulating mathematical proofs.</p><p>You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. 
This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding).</p><p>The catch: LLMs are not (currently) very good at proving things with Lean.</p><p>Enter Axiom: while they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on <a href="https://arxiv.org/html/2505.23135v1">the Verina codegen benchmark</a>. The benchmark requires generating code <em>and</em> a proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark.</p><p>Based on the sparse benchmarking, it&#8217;s hard to say how the frontier labs are currently doing outside <a href="https://www.latent.space/p/captaining-imo-gold-deep-think-on">the annual IMO milestones</a>, but Carina suggests that they still are not training to generate Lean proofs directly, relying instead on informal proofs.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;528922cc-416f-45eb-90bd-bf0fac5057d8&quot;,&quot;duration&quot;:null}"></div><p>Time will tell if the frontier labs&#8217; current approaches will close this gap.</p><h2>Scaling and compounding</h2><p>Carina&#8217;s Ramanujan analogy is pretty direct. Better proofs &#8594; better Lean generation &#8594; better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great!</p><p>Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> as high as if it came from a human, so my high-quality training set has grown in a way that an informal rollout corpus cannot. 
I can trust my Lean proofs.</p><p>Compounding is also clear: now all of future inference and training can build upon those proofs.</p><p>On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system that uses formal verification benefits from.</p><h2>All roads lead to verification</h2><p>Broccoli and taxes notwithstanding, <strong>verification</strong> has shown up in a lot of our conversations. In the domain of physical systems, recall <strong><a href="https://www.latent.space/p/appliedintuition">Applied Intuition</a></strong>:</p><blockquote><p><em>&#8220;I think [verifiability] is probably the hardest problem right now, because the as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.&#8221;</em> </p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;7a11ed8c-0ed1-4bb9-b596-e6538e050a25&quot;,&quot;caption&quot;:&quot;From building Applied Intuition from YC-era autonomy tooling into a $15B physical AI company, Qasar Younis and Peter Ludwig have spent the last decade living through the full arc of autonomy: from si&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Physical AI that Moves the World &#8212; Qasar Younis &amp; Peter Ludwig, Applied 
Intuition&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2026-04-27T23:02:37.966Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/195677117/57023952-63ef-4f9a-a7a1-f64d1b593a72/transcoded-1777325789.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.latent.space/p/appliedintuition&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:195677117,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:21,&quot;comment_count&quot;:1,&quot;publication_id&quot;:1084089,&quot;publication_name&quot;:&quot;Latent.Space&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DbYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In theoretical physics, we recall <strong>Alex Lupsasca</strong>:</p><blockquote><p><em>&#8220;&#8230;now that we&#8217;re in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.&#8221;</em></p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d6c62be8-bd88-4a13-910c-a57b3bb588a9&quot;,&quot;caption&quot;:&quot;Some people are going crazy over GPT 5.5. Some people. This is the story of the Jagged Frontier. 
People who use AI to write emails or even code implementation work find the lift moderate whereas peop&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;&#128300;Doing Vibe Physics &#8212; Alex Lupsasca, OpenAI&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2026-05-05T20:34:11.746Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/196292432/0cd72a41-6b64-405c-8aff-508c149bafb4/transcoded-1778013234.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.latent.space/p/lupsasca&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:196292432,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:4,&quot;publication_id&quot;:1084089,&quot;publication_name&quot;:&quot;Latent.Space&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DbYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><p>Verification is, in fact, the key difference between AI for science and AI for computation: in science you have to actually test (verify) your hypothesis by performing physical experiments. 
Lab-in-the-loop systems like <a href="https://www.radical-ai.com/">Radical AI</a> and <a href="https://www.lila.ai/">Lila</a> are built around exactly this premise (we have recorded episodes with both of these teams and will release them soon!).</p><p>And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them become more complex.</p><p>Carina believes so strongly that AGI <em>requires</em> verified generation that she makes the unqualified claim that &#8220;We do not believe there is any other possible future.&#8221;</p><h2>Expensive to produce, cheap to verify</h2><p>Lean proofs are hard to generate, but they can easily be shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: &#8220;Anything that can be specified can be proven. Humans are bad at specifying everything we want.&#8221;</p><p>Are we now in the specification business? 
Check out the episode to hear Carina&#8217;s take, as well as:</p><ul><li><p>Why hardware verification is a killer app</p></li><li><p>Details on the AXLE open API and recently released Discovery toolkit</p></li><li><p>The Erdos debacle</p></li><li><p>The OpenAI GPT-f diaspora</p></li></ul><h2>Full Video Podcast</h2><div id="youtube2-abYcV5LHMG4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;abYcV5LHMG4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/abYcV5LHMG4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong>Timestamps:</strong></p><ul><li><p><strong>0:00</strong> Intro: The $200M Series A and the Math Startup Thesis</p></li><li><p><strong>4:52</strong> Verified AI: Scaling Brilliance, Not Fixing Lousiness</p></li><li><p><strong>13:42</strong> Axiom&#8217;s System: Lean Data, RL, and the Putnam Perfect Score</p></li><li><p><strong>22:12</strong> Mathematical Discovery &#8212; Before the Conjecture</p></li><li><p><strong>25:12</strong> Rice&#8217;s Theorem, Incompleteness, and Practical Limits</p></li><li><p><strong>30:42</strong> Code With Proof &#8212; The Verina Benchmark</p></li><li><p><strong>37:57</strong> Proof Trees, Context Windows, and Scaling Limits</p></li><li><p><strong>43:57</strong> Markets, Moat, and the Business Case ($1.6B valuation)</p></li><li><p><strong>55:27</strong> Personal Origin Story: Oxford, UCL Gatsby, Stanford Law</p></li><li><p><strong>1:00:57</strong> The Erdos Controversy and the Difficulty of Search</p></li><li><p><strong>1:06:02</strong> AlphaZero for Math, Self-Improvement</p></li><li><p><strong>1:08:47</strong> Startup Advantage and the OpenAI GPTF Thread</p></li><li><p><strong>1:13:17</strong> Axle API &#8212; 
Open Infrastructure for Lean at Scale</p></li><li><p><strong>1:20:47</strong> Collaboration, Polymath, and Human Attention as the Bottleneck</p></li><li><p><strong>1:22:21</strong> Founding Story &#8212; Obsession, Law School, and Julie Zhuo</p></li><li><p><strong>1:26:17</strong> The Bigger Vision &#8212; AGI, Science, and Transfer Learning</p></li><li><p><strong>1:35:02</strong> Bottlenecks, Fragmentation, and the Field&#8217;s Future</p></li></ul><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so &#175;\<em>(&#12484;)</em>/&#175;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell) &#8212; many of which don&#8217;t look much like &#8220;type checking a proof&#8221; from the user&#8217;s perspective even when there&#8217;s a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This is an understatement. Most theorems remain informal because formalization is so hard to do. 
There has been a great deal of effort to formalize the most important proofs, with mixed results.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>One might argue that it&#8217;s a bit lower because the proof is in-distribution for the LLM.</p></div></div>]]></content:encoded></item><item><title><![CDATA[⚡️Satya Nadella: No Priors x Latent Space Crossover Special at Microsoft Build]]></title><description><![CDATA[The legendary Microsoft CEO makes his first Latent Space appearance!]]></description><link>https://www.latent.space/p/satya-2026</link><guid isPermaLink="false">https://www.latent.space/p/satya-2026</guid><pubDate>Wed, 03 Jun 2026 17:13:57 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200432443/be4d039cefcb78767e3408d0f5ac3d29.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We&#8217;ve informally heard that Satya has been a listener of LS for a couple of years now, but it was still absolutely surreal to meet him and do a live pod at Build, together with our friends at <strong>No Priors</strong>, the leading VC AI Podcast that we also greatly admire!</p><p>We covered <a href="https://www.latent.space/p/ainews-microsoft-build-mai-thinking">the MAI model technical takeaways on yesterday&#8217;s AINews</a>, so I will focus our recap of Satya&#8217;s main messages around three elements:</p><ul><li><p><strong>Satya&#8217;s adaptation of <a href="https://www.latent.space/p/agent-labs?utm_source=publication-search">the Bill Gates Line</a></strong> for positioning Microsoft as the <strong>Frontier Intelligence Platform</strong> &#8212; customers must gain much more value from the Microsoft ecosystem than Microsoft itself captures, by building on multi-model harnesses like OpenClaw and Scout, drawing on the full enterprise context exposed by context layers like Work IQ (heavily <a 
href="https://www.latent.space/p/github">dogfooded by his C-suite</a>), and building up private evals and traces as a new form of Token IP</p></li><li><p><strong>AI ROI: </strong>On one hand, enterprises are having difficult conversations around Tokenmaxxing and Layoffs, and on the other hand, there are serious re-evaluations of the End of SaaS since the Build vs Buy equation has changed so much. Our <a href="https://www.latent.space/p/valuemule">previous SemiAnalysis guest</a> had&#8230; interesting comments on Microsoft&#8217;s position on this as the ur-SaaS titan, and Satya had great answers</p></li><li><p><strong>Making the Impossible Possible:</strong> Kevin Scott&#8217;s inspiring framing of the most ambitious version of applying AI and technology at large to business and social problems, like education and social impact.</p></li></ul><p></p><p>Enjoy!</p><h2>Full Video</h2><div id="youtube2-cFNI2FORAc0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;cFNI2FORAc0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/cFNI2FORAc0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Transcript</h2><p><strong>Voiceover:</strong> Welcome swyx, Sarah Guo, Elad Gil, and Chairman and Chief Executive Officer of Microsoft, Satya Nadella.</p><p><strong>Sarah Guo:</strong> Welcome to a crossover episode of No Priors and Latent Space with Satya Nadella. Um, congratulations on an amazing Build. No, thank you so much, and it&#8217;s great to be with both of you. I listen to both of you or b- both the podcasts all the time. It&#8217;s great to be on it.</p><p>Thank you so much. 
[00:01:00] So you&#8217;re just talking about, um, these amazing, uh, announcements from across the Microsoft estate all morning for, I think, three hours. What is the, uh, what&#8217;s the most important reflection or takeaway you have?</p><h2>AI as an Ecosystem Platform</h2><p><strong>Sarah Guo:</strong> I, I&#8217;d say there are, uh, perhaps the, the biggest one for me is let&#8217;s sort of conceptualize this more as an ecosystem play as opposed to a single model or even a single platform, right?</p><p><strong>Satya Nadella:</strong> I mean, you know, whatever I... At least for me, having grown up at Microsoft, having seen, whatever, four major platform shifts, uh, I sort of fall into that, um, uh, camp where a platform is defined by fundamentally its ability to create more value about the platform versus what&#8217;s captured in the platform. And so if you, you view what&#8217;s happening right now, I think this morning&#8217;s keynote was how can any company, whether it&#8217;s an AI native company or a traditional enterprise company, participate as a first-class participant where they can point to AI they created, [00:02:00] right?</p><p>It&#8217;s not that they don&#8217;t use other people&#8217;s AI. Of course they will. But to me, what&#8217;s the path? What&#8217;s the recipe? How do I do it? What does a stack look like? What does the tooling look like? What is valuable? How do you do that? That&#8217;s it. That&#8217;s sort of our job to do. Yeah. Ecosystem strategy is, uh, very complicated, right?</p><p><strong>Sarah Guo:</strong> Because you end up building certain components, partnering for certain components, supporting them. You just announced this big suite of models. Like, tell us a little bit about the, uh, training strategy for Microsoft now. 
Yeah.</p><h2>MAI Models &amp; Training Strategy</h2><p><strong>Sarah Guo:</strong> So, so the thing that we wanted to do with the MAI models was to build, and as Mustafa talked about, first of all, a great lineage, right?</p><p><strong>Satya Nadella:</strong> Starting with pre-training, uh, with very good data quality, uh, doing all the ablations, making sure because in, in some sense it&#8217;s becoming even harder to build a clean lineage model just because there&#8217;s so much stuff out there, uh, that you truly need to ablate out to be able to have a fantastic [00:03:00] pre-trained model.</p><p>In fact, that&#8217;s one of the challenges of a lot of the open weight models is they look great on one benchmark or two, but they&#8217;re not great on practice. So that&#8217;s why, in fact, even in the RFDEs are, they, they are pretty gone really excited about these MAI models because how the heck can a small five B model hill climb?</p><p>Uh, and it goes back a little bit to what I think is ultimately the key thing to do, which is try to pursue finding that cognitive core. Uh, so to me, starting with a clean lineage- Then creating that ability for companies to be able to use this, right? Not just as a generalist, but to create their own specialist by building this hill climbing scaffold around it, right?</p><p>So it&#8217;s not just the model, but you have a hill climb scaffold around it, then you will start building your RLE. You will start collecting the traces. Most importantly, you&#8217;ll have private evals because we know all the evals out there are good, interesting, [00:04:00] but they&#8217;re not really that critical- They&#8217;re work, yeah</p><p><strong>Swyx:</strong> at this point because they all can be maxed. And so the point is each company will have its own private eval. And so that end-to-end platform story around our models is sort of, uh, what I think is interesting. 
And then the one other thing, Sarah, since you brought that up, is I do feel there&#8217;s a new frontier.</p><p><strong>Satya Nadella:</strong> Like people talk about the frontier and are you operating at the frontier. Um, interestingly enough, if you add a little temporality to it, you can use, let&#8217;s say, in, in, in fact, the, the Land O&#8217;Lakes demo we showed was pretty cool. We used, whatever, GPT-5.5, right? Then you collected a bunch of traces, and then you took a 5B reasoning model and achieved higher.</p><p><strong>Sarah Guo:</strong> Uh, so that is another aspect of what it means to appear... uh, you know, operate at the frontier. Yeah. I, I think, uh, I first of all have to congratulate you on basically building a frontier neo lab inside of Microsoft in two years. Um, I&#8217;m wondering, you know, you have all this AI strategy that you&#8217;re rolling out.</p><h2>Lessons from Two Years of AI Development</h2><p><strong>Swyx:</strong> I&#8217;m wondering, what do you know now that you wish you would tell yourself two years ago where- or two or [00:05:00] three years ago? Three years for the Jensen partnership, two years for, uh, MAI. Yeah, I mean, I think the, the thing when, that I reflect quite a bit, right, which is sort of obviously I got into all this when I got excited by the, the scaling laws paper and, you know, when, you know, even the OpenAI partnership came about when those folks said, &#8220;Hey, we&#8217;re gonna really throw a lot of compute at transformers.&#8221;</p><p><strong>Satya Nadella:</strong> Uh, and they&#8217;ve helped. I- the thing that I always look back and say, &#8220;Wow, these things, uh, do have capability that they&#8217;re climbing up.&#8221; W- I mean, this, you know, this crude way of saying it is intelligence is log of compute kind of works. 
Now what I think we underestimated perhaps is the real-world complexity of deploying these so that they actually deliver the value in the real world, right?</p><p>So the outcomes as measured by any benchmark is interestingly important, but the true eval is when people out there are able to do unique things that they only can value, and it&#8217;s very [00:06:00] measurable, right? That I wish we had sort of even, like, had more in our consciousness, right? Which is as an industry.</p><p><strong>Sarah Guo:</strong> Because right now I think when people say, &#8220;Wow, I don&#8217;t want to tokenmaxx,&#8221; it&#8217;s an artifact of us not having thought ourselves as an industry that we are using tokens to create value every step of the way. So I think that&#8217;s kind of what I wish we had gotten there, but I&#8217;m glad we are here.</p><h2>Real-World Value &amp; Use Cases</h2><p><strong>Sarah Guo:</strong> What are some of the use cases that you&#8217;ve seen that have created the most value for your customers?</p><p>Because I know that people talk a lot about code, and I think it&#8217;s pretty clear that that&#8217;s something that&#8217;s having very large-scale impact. Are there other areas that you find in common that your customers are really benefiting from? Yeah. I think, yeah, to your point, obviously coding is now got... But it&#8217;s interesting, by the way, Elad, to even talk about the coding, right?</p><p><strong>Satya Nadella:</strong> Which is coding has worked so well that we now have to rebuild the IDE, right? I mean, it&#8217;s kind of nuts to see what we sh- launched is like, oh my God, I have these hundred agent sessions. I... The cognitive load it transfers back to me as a human is so [00:07:00] excessive that now I need a new UI. 
Uh, oh, by the way, I, like the, the chat as the only artifact was also impossible, so that&#8217;s why we need a canvas.</p><p>So it&#8217;s kind of interesting for all the things about where is software needed or where is UI needed, uh, you kind of need that even for code, right? In a fully agentic world. But that said, one of the things that we are starting to see, we started seeing with co-work, but even some of the work we, we showed with auto com- uh, um, autopilot Right on what you see with claws is a good one because if you sort of think about a lot of human capital is doing the glue work, right?</p><p>If you now can augment that with tokens/agents that are long-running, durable, right, then your ability to scale even what is still judgment and glue work gets amplified like coding does. Uh, so you can... Like, I&#8217;m positive that six months from now we&#8217;ll all be saying, &#8220;Oh, wow,&#8221; like, all through ni- the night there was a bunch of stuff that [00:08:00] all these autopilots that I have working on my behalf with my delegated authority, so to speak, right?</p><p>I can... Sort of given even my identity, did a bunch of work, then of course I&#8217;ll need my new ADE to say, &#8220;Well, what did you do?&#8221; Like, I might... &#8220;Did I do this work?&#8221; And so on. So I think that that&#8217;s where compressing of workflows, uh, completing of tasks, uh, that&#8217;s where I think a lot of the value gets created. I think you raised a really interesting point, which is there&#8217;s the actual agent that&#8217;s doing the code, and then there&#8217;s a harness around it, and that&#8217;s the environment, that&#8217;s the context, that&#8217;s everything you&#8217;re setting up as a developer around actually a coding agent.</p><h2>The Harness Concept for Enterprise AI</h2><p><strong>Sarah Guo:</strong> What is the harness for the enterprise? 
Is there an equivalent concept for broader productivity work, or how do you think about that concept sort of generalized? That&#8217;s right. So, so in some sense you kind of want the harness to define the models, the, the data, uh, and the tools, and so that you have a loop across those three.</p><p><strong>Satya Nadella:</strong> And so what we are trying to, first of all, make sure is each of our products that we build, right, whether it&#8217;s GitHub Copilot or the security copi- the, the [00:09:00] stuff we showed with MDASH or even the discovery for science, it doesn&#8217;t matter, all of them are multi-model harnesses, um, with tools access so that you can do this progressive, uh, disclosure of tools even so that they&#8217;re token efficient.</p><p>Uh, and then you&#8217;re feeding it with very rich context because that&#8217;s sort of the other hard lesson we have learned in the last two years is, oh my God, the amount of work you need to do to prep the context layer, uh, such that your plan can execute in the most efficient way is where the magic is. So we have, in our case, we have the GitHub harness, which essentially we&#8217;re using across all our products.</p><p>It&#8217;s available in Foundry, and we are open, like you can use your Llama harness, whatever. Or you can use the, um, uh, you know, any open harness or any harness of yours and train with your tools and multiple models and your context. And so that&#8217;s the pitch. Because right now a lot of dialogue is, um, &#8220;Hey, if I train the harness plus tools and the model together, you get [00:10:00] evals.&#8221;</p><p><strong>Elad Gil:</strong> And what we are proving out is... And the best example of that is what we did with MDASH, right? 
Because when it launched, uh, it found bugs or vulnerabilities that were not found by Mythos. Uh, and so there is existence proof, I would claim, that you can have a multi-model harness, uh, that can in fact be more, uh, performant in the real world. So a premise behind the, uh, training at the independent frontier labs is really, you know, we&#8217;re gonna have these models, and we&#8217;ll have an API business, and we&#8217;ll support enterprises and startups.</p><p><strong>Sarah Guo:</strong> But</p><h2>Platform Strategy &amp; Developer Ecosystem</h2><p><strong>Sarah Guo:</strong> a first-party product, be it productivity or code or search, drives the majority of revenue. That&#8217;s a different value equation than you&#8217;re describing, I think, with the Microsoft ecosystem. Uh, if, if that&#8217;s the case, tell me if it&#8217;s the case, uh, &#8216;cause obviously you have first-party products and you have enablement products.</p><p><strong>Satya Nadella:</strong> Um, what is the role of the develop- Like what is gonna be hard and the set of skills and the value capture the developer has in that world? Yeah. So I think that there&#8217;s always [00:11:00] gonna be the case that someone who is super successful in- as a platform builder can also have first-party products. It was true with Windows.</p><p>It is true, uh, with, uh, the, the SaaS side and the cloud side as well with us and others and so on. But the thing that is, is it should not be a limiter to other people achieving that same success, right? That I think is the core difference, which is the, the network effects this time around, around intelligence are such because they learn from data, and not really lots of data.</p><p>It&#8217;s just a few samples that you have to see to understand what&#8217;s novel about something. So that&#8217;s why the game becomes how to protect. So that&#8217;s why I would say every company, having private evals may be the biggest IP, right? 
Think about it, like what&#8217;s that private eval that you can then use even a frontier model to hill climb on and not leak the traces may be one of the biggest [00:12:00] drivers, uh, of IP.</p><p>Like, so in other words, another te- acid test is you have an eval that&#8217;s private. You&#8217;re using, uh, a g- a Model A. Can you switch it to Model B and e- you know, climb up? If you can, then you&#8217;re in control. If you can&#8217;t, you&#8217;re not in control, and that&#8217;s where even the harness decision becomes super important, right?</p><p><strong>swyx:</strong> So therefore, having an open harness, letting all models come in, having your evals, your context, your tools help you hill climb, I think are the skills that an AI-native startup needs, a SaaS company needs, or every enterprise needs. Yeah, I think in, in a very real way you are ... Microsoft historically is an operating systems company and th- then became a cloud company.</p><p>Maybe like the third act is that you&#8217;re a harness or evals company. Whatever w- ... whatever the, the sort of conglomerate of concepts that you wanna put together. Um, and, and I think like enabling every company to have like frontier intelligence or what- what- Yeah ... I forget the, the [00:13:00] exact term that you used, um, is the, is the mission, right?</p><p><strong>Satya Nadella:</strong> That&#8217;s it. Like that is, that is the platform promise, that you build with us, you will get your intelligence, uh, for your data. That&#8217;s it. That ... To, to me, that is the ... Like if there was one tagline, uh, for this entire developer conference is- Can everybody operate at the frontier with their frontier intelligence, right?</p><p>To me, that is so important because otherwise it, I, I don&#8217;t know how you achieve stable equilibrium, right? Which is how do I then go and say, &#8220;Well, my company is gonna have a terminal value because I now know how to continuously compound-&#8221; Yeah ... 
on top of what&#8217;s a platform that gets better,&#8221; right? So when, like Windows obviously came out, Adobe built, Autodesk built, uh, or even like take what Jensen said.</p><p>We built DX and he built, you know, CUDA on top of it. Um, right? I mean, I always say to Jensen, &#8220;God, I got the short end of that,&#8221; right? &#8220;I wish, uh, we had recognized it.&#8221; But nevertheless, but that, that idea that you can build a platform layer [00:14:00] that someone else can then extend out, um, and build their own intelligence layer in this case, I think is everything, right?</p><p>Without it, why have a developer conference? I can just come and have you all sort of just worship at the altar of one model. Yeah. But that&#8217;s not a developer conference. Uh,</p><h2>IP, Evals &amp; Company Value</h2><p><strong>swyx:</strong> backstage we, we had a discussion about what is IP or what is the, the value in a company. It used to be the length of, uh, human experience at a company, and now it&#8217;s this other thing which is the evals, the, uh, experience in sort of applying agents to the company. Can you... I just want you to like flesh that out a bit more &#8216;cause- Yeah ... it was very insightful.</p><p><strong>Satya Nadella:</strong>  It&#8217;s a great way to frame it, right? Because yeah, at the end of the day, every company is gonna have both the human capital that is still gonna be super valuable, uh, because humans, uh, and their ability to find the gaps that exist at all times is going to be the way we all will create value, right?</p><p>I mean, so I&#8217;m definitely in the camp that this is going to be about expressing new forms of human agency and ambition even as token capital goes up, right? So let&#8217;s say a cor- any corporation [00:15:00] has lots of tokens and lot of human capital. The question is how do you compound the two? So if you have a... 
Like if you take in Teams I have a bunch of agents doing work and a bunch of humans doing work, and the traces between those, that is really important context of how that enterprise is creating value.</p><p>Then that goes back to train not a generalist model, but to train the company veteran agent, uh, right? That is super valuable again, right? Which is when a company goes says, &#8220;It should in fact go onto the balance sheet,&#8221; is how I think about it, right? That&#8217;s so... In fact, there may be... Like human capital was never possible to go put on a balance sheet, uh, because you didn&#8217;t know how to capture the tacit knowledge.</p><p><strong>swyx:</strong> Whereas now I think you can with the agents that have learned through the h- through, through time, through all the traces. Uh, so that&#8217;s what at least we think will happen. I, I think the SEC is gonna have to have accounting standards- ... for token, uh, expertise Uh, y- y- you&#8217;re talking about the equilibrium [00:16:00] state, um, and a stable equilibrium where companies have this compounding value and can see terminal value for themselves.</p><h2>Future of SaaS &amp; Business Models</h2><p><strong>Sarah Guo:</strong> Another challenge to, you know, the considered equilibrium of, okay, there are applications and workflows that are sort of common to a vertical or a horizontal. Um, and this was, like, the generation of SaaS companies and, you know, Microsoft has lots of SaaS properties as well. And then there are things that are very specific to every enterprise that they&#8217;re differentiated against.</p><p><strong>Elad Gil:</strong> Um, I&#8217;m sure you have heard much and participate in much of the debate about the end of software because all these workflows are, are cheap to generate now. Um, do you think the equilibrium looks different between what agents get built- Yeah ... in enterprises versus in their vendors in the future? Yeah. 
So I think what&#8217;s happening there is, see, we, we had a particular way we captured, um, I would say workflow in apps, right?</p><p><strong>Satya Nadella:</strong> Because we built a, a data model, right? We schematized some part of some business process. Mm-hmm. We then built a bunch of business logic. Yep. And then we put a bunch of UI [00:17:00] on top of it, right? So that&#8217;s kind of what every SaaS company- And a little configuration. For, like, 20, 20 years that was the plan.</p><p>Right, that- Yeah ... and that was it. So interestingly enough, now you kind of get to re-litigate that vertical stacking, right? So I still think, for example, that data model that you built underneath every SaaS application is super good, right? Like, why reinvent it? Like, I, I, my general ledger better be a general ledger.</p><p>I don&#8217;t need new schema creation. No. Uh, in fact, that entity relationship, uh, is actually pretty good, robust thing that I want to feed. And you want it to be stable. That&#8217;s right. Yeah. Then same thing with business logic, right? If, if you look at, uh... We have this product called Power BI, right? It is like dashboards galore people created.</p><p>The beauty underneath that dashboard is a very rich semantic model, right? Someone took the pain to create a dashboard and do all the measures, and you want that. That&#8217;s business logic, right? I want that to be available to me. So I think the [00:18:00] challenge of the SaaS business model is we packaged one way. We now have to learn how to unbundle these things and rebundle in new ways and discover new business models, right?</p><p>I mean, if you look at it, d- what&#8217;s happening today with Microsoft 365 is a great example, right? We have this thing called Work IQ. In fact, like, what we are realizing is, oh my God, like, you know, if you look at... In fact, there&#8217;s a pa- historical parallel too, right? 
We sold first Exchange and SharePoint and, uh, you know, before Teams, we had a thing called Lync Server and what have you, and we thought, &#8220;Oh, that&#8217;s all gonna move to the cloud.&#8221;</p><p>But little did we realize that, um, the number of people who will use servers in the cloud is 10X, 100X, right? Because people were not buying servers, they were just buying a subscription. Mm-hmm. The same thing is now happening with M365 because with Work IQ, we have exposed what is perhaps the most important database in a company that never got used as a database because it was only captive to our apps.</p><p>Mm-hmm. Right? It, it was all email operated on it, Teams operated [00:19:00] on it, Word, Excel, PowerPoint, SharePoint. But now, like this is one of the coo- coolest things I get to do with Work IQ. I go to a GitHub repo and I say, &#8220;Hey, I attended a bunch of design meetings last week related to this repo. Can you capture all that and tell me what changes I should make?&#8221;</p><p>I mean, think about that, right? It literally can go look at all those transcripts, come back with a plan to change a code base, right? Previously, you could never have thought of using M365 for something like that. So the value creation opportunity now in the agent world is in fact 10X more, but it does require us to have...</p><p><strong>Sarah Guo:</strong> For example, there&#8217;s going to be usage around M365, right? Which is going to be perhaps more than even the e- end users and we have to even re-architect. Like, in fact, like what I use to serve an inbox or a mailbox cannot be used to serve an agent. 
Uh, and so that&#8217;s sort of what we are doing.</p><h2>Pricing Models: Per-User, Consumption &amp; Outcomes</h2><p><strong>Sarah Guo:</strong> I don&#8217;t believe in, like, permanent business models for any of these domains, but in the [00:20:00] near term, do you have a prediction between, uh, you know, outcomes-based pricing, token-based pricing?</p><p><strong>Elad Gil:</strong> Enterprise bundles Yeah. The way I- I think about this is always we&#8217;ve had... Like, let&#8217;s even take the per-user pricing. Mm-hmm. The per-user pricing is really an artifact of someone creating a budget needing certainty, right? Because it&#8217;s the most important thing. Like, somebody wants a budget- Mm-hmm ... they need a per user.</p><p><strong>Satya Nadella:</strong> And, and per user is just a set of entitlements to usage, right? That&#8217;s kind of what it is. And so the way is, if the first bundling will be take some usage, bundle it into per user stacks and, you know, then sell subscriptions. So subscriptions I think are gonna be there, per user is gonna be there. Then the next big thing will be consumption.</p><p>So people will say, &#8220;I want consumption.&#8221; And it&#8217;s also possible that people will say, &#8220;I don&#8217;t even want to pay for any of the subscriptions or the consumption&#8217;s outcome.&#8221; Mm. But remember, most people love outcomes until they have an outcome, because once you have an outcome, it&#8217;s like giving away royalty, [00:21:00] right?</p><p>Mm. I mean, like I, I&#8217;ve talked to customers who love, you know, outcome-based pricing, and I say, &#8220;I&#8217;m all in,&#8221; until they, &#8220;Oh my God,&#8221; like, &#8220;what are you talking about? You&#8217;re sharing in my outcome? No, no, no. I want you to go back to per-user pricing, and I want you to consumption price,&#8221; right? 
So I think that debate will go on.</p><p>Uh, but and all, all, all of these business models have a particular time and a place versus one to rule them all. And if anything, if you&#8217;re a SaaS vendor or you&#8217;re a platform vendor, having that flexibility... And quite frankly, we face this with GitHub, right? We just recently announced a per-user pricing on GitHub because little, you know, we- GitHub Copilot was constructed at a per-user level before we understood even, uh, the intensity of usage of agents, right?</p><p>It was an interactive way for a developer to use code complete, maybe tasks. It was not like, oh, I launched 10,000, you know, agents that are going on all day, right? So that is what the adjustment is about. So now that we really want, there will [00:22:00] always be a per user, but there will have to be a consumption meter.</p><h2>Durability of SaaS &amp; Build vs Buy</h2><p><strong>Sarah Guo:</strong> How do you think about the durability of SaaS more generally? One thing I&#8217;ve observed is in a lot of enterprises internally, there will be teams that almost have agent euphoria. They&#8217;re so excited about the explosion of things they can build that they&#8217;re trying to rebuild a lot of applications or going to their SaaS vendors and saying, &#8220;We&#8217;re not gonna work with you anymore,&#8221; or, &#8220;We&#8217;re considering an internal project.&#8221;</p><p>And it seems like in six to nine months, maybe some of those people will come back and say, &#8220;Actually, we, we can&#8217;t rebuild everything.&#8221; How do you think about what&#8217;s durable in this world and what isn&#8217;t? Yeah, it&#8217;s a... It... 
I think we have to go through one full budget cycle on this to really see the, um- Uh, the sort of the emergence of the equilibrium, because at the end of the day, there&#8217;s a marginal cost to even generating the app, right?</p><p><strong>Elad Gil:</strong> In, in fact, there can be even a, a simple way to say it, like you should always acquire something if the marginal cost of building and maintaining, uh, something on your own is higher. Uh, right? That should be like it&#8217;s a quantifiable- Yeah. Right? A quantifiable thing. And [00:23:00] the maintenance part is important, right?</p><p>Even, like you got to remember like, hey, you know, all the security stuff that now AI will find, you better fix them fast, too. Uh, of course, there&#8217;s a coding agent to help you with that, but then that burns tokens, right? So whose responsibility is it? It&#8217;s kind of like a, a cycle that you&#8217;ve got to think through.</p><p>And I think we have gone through the excitement that I can generate a lot of software. I think the next thing would be what software do I really want to generate? Mm-hmm. What software do I want to use from others? How do I compose these two into some agentic workflow that I have agency over, right?</p><p><strong>Sarah Guo:</strong> Because I think there&#8217;ll be very little tolerance for anybody who&#8217;s inflexible, uh, at the vendor level. Uh, but at the same time, I think that anyone who has got that flexibility shows up, delivers the value, will be back at it again, right? We&#8217;re selling software, uh, but with just different business models, in fact. Uh, speaking about building software, um, one of my favorite moments from, I think, a previous Build maybe one or two years ago was they had a b- they, they...</p><p><strong>swyx:</strong> There was a section of you building your [00:24:00] own software. I&#8217;m curious if you&#8217;re building anything now. Yeah. So I, I think the... 
You know, first of all, let&#8217;s face it, right? Building software has made it possible for even the incompetence of a CEO of a company- ... like ours, uh, you can build, so thank God. But that said, I, I, I, I do feel that, you know, something like, um, GitHub Copilot to me, and especially the new Sessions app or the new app, has just made it so much more possible for you to have agency over artifacts that you felt you couldn&#8217;t touch before, right?</p><p><strong>Satya Nadella:</strong> So to, for me as a CEO, even to go to a code base, uh, to be able to learn about it, like I remember joining Microsoft long back, you know, first and then you say, man, everybody had to go in and look at, you know, whatever, Cutler&#8217;s, Malik, or what have you to learn how to do good C, uh, C++ code. Um, so now that ability to be more full stack up and down is so good, but that doesn&#8217;t mean every one of us should be doing the same thing.</p><p>The question is: [00:25:00] how do you then have the ability to inspect things, learn things, see things, um, I think is just so much more. And so to me, what I&#8217;m building a lot of is these long-running Foundry agents. Uh, right? So there&#8217;s autopilots. So the easiest thing is, to me, I think I just built one, uh, even last week, where the idea was, hey, can I have an agent that is continuously monitoring essentially my own chief of staff autopilot, right?</p><p>We&#8217;re gonna have that obviously in, uh, Scout. That&#8217;s what, uh, uh, we showed. But it is so easy and trivial to build. I took Work IQ. I said, &#8220;Take Work IQ, go, uh, and build a Foundry long-running agent.&#8221; Uh, store all the memory in, um, uh, using Ray Fin, right? Basically at my backend as a service. 
And lo and behold, it built it, and not only built it, I could say publish to Teams, and it published the damn thing to Teams.</p><p><strong>Sarah Guo:</strong> So the ability, uh, to have a, you know, some end-to-end project like this complete is just pretty [00:26:00] miraculous. How do you think, uh,</p><h2>Future Engineering Roles</h2><p><strong>Sarah Guo:</strong> that impacts the different types of engineering roles that exist in the future? Because right now I think there&#8217;s, you know, a dozen different types of engineers that you can be, from QA, front end, et cetera.</p><p>You know, there&#8217;s a big swath. I&#8217;ve heard some people argue that in four or five years we&#8217;ll basically end up with four engineering roles. It&#8217;ll be people who are managing agents, it&#8217;ll be forward deployed engineers or FDEs, it&#8217;ll be security engineers, and then people working on large-scale infrastructure for a small number of services, and then everything else just collapses into the agentic world.</p><p><strong>Satya Nadella:</strong> Yeah, I- Do you think that&#8217;s a correct view of the world? Yeah, I mean, I think, I think we&#8217;ll have to experiment our way through it. But what you said is what... There are some very at scale things. At LinkedIn, they did structurally change- Mm-hmm ... uh, and it, you know, basically built up a new discipline called full stack builder, right?</p><p>So they went and said, &#8220;Hey, let&#8217;s bring, uh, people from design and product management, front end engineering, all put them together.&#8221; Uh, but also have an edge, right? It&#8217;s not like the design person still doesn&#8217;t have the design edge, or the front end [00:27:00] person doesn&#8217;t have the front end edge, but you can give yourself bigger scope in roles so that you&#8217;re not confined to one role.</p><p>Um, and then r- equally, infrastructure has become very critical, right? 
So in other words, like, I mean, RLEs, I mean, one thing we&#8217;ve realized is even for the Excel team, for example. Mm-hmm. Building the RLE in which a reward can be learned is actually one of the hardest sort of infrastructure problems.</p><p>Mm-hmm. Uh, and so you kind of need even new talent, right? Distributed systems people even in what was considered an end-user app team, uh, because it&#8217;s a different skill set. So yes, infrastructure, science is the other one, obviously. Um, so I think we&#8217;ll see how these evolve, right? Where&#8217;s the s- real... I mean, always the world will have a bunch of specialists.</p><p>Okay. Um, you know, I think the generalist role is going to be the most exciting, right? Because the leverage of a generalist- Mm-hmm ... um, is where we are going to see the maximum returns, right? When, when you said, &#8220;Hey, are you coding?&#8221; I&#8217;m now a gen- Like, what... I&#8217;ve basically translated [00:28:00] knowledge work. Right?</p><p>Which I did, where I created a Word document or a spreadsheet, or even, uh... And now I can build an app, right? It&#8217;s in the same sentence. Uh, right? That idea that, &#8220;Oh, wow, my generalist skills have gotten higher leverage,&#8221; I think is what we&#8217;re gonna see across the board. Music to the ears of CEOs and VCs that are, like, a little dangerous and a lot of- Golden age for idea people.</p><p><strong>Sarah Guo:</strong> idea people. Yeah. Uh- With a lot of agency. I- if you take that idea of personal agency and you just zoom it out to the organizational context, um, uh, my partner Mike Vernal, who, uh, actually started his career at Microsoft, just wrote an essay where one of the big takeaways is i- it&#8217;s an age where you can be much more ambitious, and you need to be, given the pace of the environment and how quickly, actually, users and companies are open to adopting new technologies.</p><p><strong>Satya Nadella:</strong> Um, how do you think about... 
I, I feel silly asking this of somebody running a, you know, trillion-dollar-plus company already, but</p><h2>Ambition &amp; Making the Impossible Possible</h2><p><strong>Satya Nadella:</strong> how do you think about how Microsoft can be more ambitious now? It&#8217;s a great question. Um, I [00:29:00] think, um- I think the, the thing in these types of transitions is to have a conceptual model of how work can change to go after outcomes that you could hardly imagine previously, right?</p><p>In fact, Kevin Scott has this nice line, right, which is, um, when you can make the impossible... Like, when you&#8217;re making hard things easier, that&#8217;s sort of one point of leverage. But true ambition is about making the impossible possible. So now the thing that is missing a little bit in all of our organizations is what is that new conceptual model of what can we build?</p><p>What was impossible and what can we build? And I&#8217;ll give you one example of this, right, which is I take great inspiration from sort of the people who were managing the Azure net- network. And they came to the... This was from even last year. You know, we were scaling. You saw that I, I [00:30:00] talked about sort of how we built in the last 15 months more Azure capacity than we built in the first 15 years.</p><p>I mean, it&#8217;s crazy. Wild. Yeah. Right? It&#8217;s pretty wild. And it&#8217;s the same team. So they saw that and they said, &#8220;Bob, this just ain&#8217;t gonna work if we don&#8217;t reconceptualize our work.&#8221; So they built... Essentially they said, &#8220;Our job is not to do Azure networking. Our job is to build the agentic system that, that does Azure networking,&#8221; right?</p><p>These are the folks managing the 500-plus fiber operators managing the WAN, right, all over. And fiber operations ultimately is a physical operation. Things get cut, things get, uh, you know, have to be repaired. You know, we have fancy words called DevOps and so on. 
Basically, emails are coming in and you gotta go respond to them, take care of it.</p><p>So they built this agentic system. They even have a character for it. It&#8217;s called Miles, and it sort of does all this stuff, right? They started sort of screaming for more tokens and so on. And so they were saying, &#8220;Look, uh, we don&#8217;t need a headcount. We need tokens in order to be able to [00:31:00] manage, uh, our operation.&#8221;</p><p>That reconceptualization- Mm-hmm ... of what their work is, right? They, they basically took their work and made it meta, right? That meta work is now their new work. Mm-hmm. Right? In the &#8216;80s, if somebody had come to us and said, &#8220;4 billion people are gonna get up in the morning and start typing,&#8221; my model would&#8217;ve been, we need 4 billion typists?</p><p>But we&#8217;re not doing typing, we&#8217;re doing knowledge work. So that, to me, I think is it, right, which is whether it&#8217;s Microsoft or whether it&#8217;s any organization, is to give ourselves permission to do new types of metacognition, meta work, using these new tools to change the outputs that matter, uh, and then really make the impossible possible.</p><p><strong>Sarah Guo:</strong> So completing that dot or the, the connective tissue across those, I think, is where a lot of the enterprise value will get created.</p><h2>Data Center Build-Out &amp; Community Impact</h2><p><strong>Sarah Guo:</strong> Should we talk about data centers? Yeah, please ask. Oh, okay. Well, uh, uh, w- we-- this leads nicely into the data center build-up. I always think, I- I just-- I&#8217;m just impressed at the sheer scale of the [00:32:00] build-out from Microsoft, but also everyone else, that this is redefining what it means to be a hyperscaler.</p><p>And I just feel like that, that, that is at unprecedented scale on finances, uh, on the way you run the company, but also the communities that are, that are impacted. 
Um, yeah, just talk a bit more about what you&#8217;re seeing on the ground, like when you visit your- Yeah, I think there are two aspects of it.</p><p><strong>Satya Nadella:</strong> Obviously, the, the build-out is, uh, extraordinary. Um, you know, nothing like this has happened, and it&#8217;s great to be, uh, one of the participants in it. Uh, but you brought up the other part, right? I think at this point it&#8217;s clear that unless we as an industry, uh, are very principled about ensuring that the benefits of all the stuff we&#8217;re talking about are felt in real ways, uh, at the community level, right?</p><p>Because this is not just a, a campaign, um, right? It has to be real, where people are saying, &#8220;Look, this is not ch- changing the prices on energy for me.&#8221; In fact, if anything, it&#8217;s bringing down prices because long term there&#8217;s going to be a better [00:33:00] grid, there is going to be more energy. Water consumption is, in fact, not sort of, uh...</p><p>In fact, water is being replenished, right? You gotta really, you know, educate folks on truly what&#8217;s happening, the cl- uh, the closed loop systems we are building. We have to invest in the training, the jobs, the tax base. In fact, the least talked about stuff is the amount of jobs that get created during construction, after construction.</p><p>What&#8217;s the tax base that&#8217;s there in the community? And, and all this has to be real. Um, and, and if that is the case, then we will have permission. If it is not, we won&#8217;t have permission. It&#8217;s as simple as that, right? Which is, uh, we, we... I think we have to take it as an industry pretty seriously. Uh, I think it&#8217;s good for communities to be skeptical, ask the hard questions, for us to do the hard work, earn that.</p><p>Um, but at the end of the day, if there&#8217;s-- if we can really be the produ-- Wait. 
I&#8217;ve always felt like in human history, if you use a lot of energy but also create a lot of value for society- The story has been fantastic. If you don&#8217;t [00:34:00] do that, it&#8217;s not been that great. And this time around, I&#8217;m a firm believer that ultimately if you do have a token economy that drives productivity, that drives economic growth, that drives broad spread, um, you know, participation, better health outcomes, um, then I think we&#8217;ll be in a great place.</p><p><strong>Sarah Guo:</strong> Uh, and that&#8217;s at least what we all have to be focused on. Yeah. It, it makes me think actually that with all these initiatives that you&#8217;re doing, might be e- easier to see ROI in the communities first before in enterprise. Yeah. I, I mean, I think both sides. Yeah. In fact, it comes back together. It has to be the people in the communities are going to be employed, are going to be participants, uh, in the real economy, right?</p><p><strong>Satya Nadella:</strong> That&#8217;s I think the question is. Like, if we- if the broad economy is doing well and the communities are doing well, the dots get connected. It&#8217;s sort of the market forces are such that we will connect the dots. And that I think is it. Like, you ought to be able to see the evidence. You can&#8217;t be about o- any one company, uh, but it has to be broad economic growth and broad [00:35:00] ec- you know, community permission.</p><p><strong>Elad Gil:</strong> Yeah. I guess I wanna talk about</p><h2>Societal Impact &amp; Optimism About AI</h2><p><strong>Elad Gil:</strong> what you&#8217;re most optimistic about currently or what have you most updated your personal models on regarding societal impact of AI? So you&#8217;re saying what&#8217;s the, the, the- What have you updated most on in terms of societal impact of AI? Yeah. 
I think the, um, the p- the most, um- critical thing is the first question we even started with, which is we need to tell the story and make it real that everybody has a real shot to participate as a first-class participant in this new economy.</p><p><strong>Satya Nadella:</strong> Right? That&#8217;s kind of, I think we- in the next 12 months, 18 months, we need a way for people to say, &#8220;Oh, wow, I get it.&#8221; Right? There&#8217;s going to be tremendous capability, tremendous amount of infrastructure, but I can see what is going to happen, whether it&#8217;s the benefits like health outcomes or my ability to create a startup or my ability to run my [00:36:00] local sort of, uh, store more efficiently.</p><p>It&#8217;s just happening, and I see that, uh, benefit myself, right? That to me, you know, earning that permission in a path-dependent way, we can&#8217;t wait. See, the one thing, Elad, that I&#8217;ve now learned is I think the world is gonna be very skeptical of tech and tech companies that say, &#8220;Trust us, we&#8217;ve got it. The g- future is gonna be glorious.&#8221;</p><p><strong>Sarah Guo:</strong> Uh, you kind of have to deliver tangible benefits. Um, and quite frankly, politicians winning elections, uh, because they have advocated for that. That will be at least my adjustment because without it, um, thinking that somehow... Because it&#8217;s too important this time around. It&#8217;s too much of the economy for it not to be the case. So one very simple framework I have for, you know, what are, what is gonna be the broad benefit of AI, um, beyond the communities just working in technology, are, are sort of wealth creation- Yep</p><p>it&#8217;s [00:37:00] gonna happen in a ton of different companies, startups and large companies. Then you have healthcare. Uh, you, you had amazing demos today. There are companies like OpenEvidence. I think that is happening. 
Um,</p><h2>Education &amp; Future of Learning</h2><p><strong>Sarah Guo:</strong> education seems like another one that&#8217;s an- Yep ... obvious good where we haven&#8217;t seen as much impact as I&#8217;d expect.</p><p><strong>Swyx:</strong> Do you have a hypothesis on why that might be, or if it&#8217;ll come? Yeah, I mean, I think this is where, again, how we think about education, how... You know, recently I met with, uh, the founders of Alpha School and learnt a lot about what they were doing and how they were going about it, and it&#8217;s fascinating to listen, uh, to how to even rethink- Mm</p><p><strong>Satya Nadella:</strong> uh, what does education really look like. Because I think it&#8217;s actually very important. Mm. Uh, and I&#8217;m not saying anything traditionally being done is less important, right? I was even looking at the, uh... It&#8217;s fascinating to see. I forget which Stanford class it was, uh, the, the Asian guidelines for CS something.</p><p>Mm. Uh, because you still need people to learn. Uh, like it was an interesting AI class where they were making sure people were learning how to apply softmax appropriately versus saying, &#8220;Hey, fix my training run.&#8221; Mm-hmm. Uh, so I think learning concepts is important. It&#8217;s going to [00:38:00] be, uh, critical. But the way we create the incentives, what are the credentials, how we value those credentials, what is the employment opportunity for those credentials?</p><p>So I think that there&#8217;s a complete change that has to happen, uh, given that the way to get to information, the way to educate yourself, the way to continuously keep yourself updated has changed so much. 
So I think interestingly enough, maybe the next big startup and success story could be someone who builds a new university, um, or a new, um, pedagogy even of how to get someone to go through a curriculum and find economic opportunity, uh, that&#8217;s highly valuable.</p><p>Well, that has felt, uh, perhaps impossible for a long time, but it&#8217;s a great note to end on and something that might be possible. It&#8217;s still possible. Yeah. Thank you, Satya. Thank you so much. Thank you. Yeah. I appreciate it. Thank you all.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Microsoft Build: MAI-Thinking-1 and MAI Family models]]></title><description><![CDATA[Microsoft Build recap, and new MAI model technical details]]></description><link>https://www.latent.space/p/ainews-microsoft-build-mai-thinking</link><guid isPermaLink="false">https://www.latent.space/p/ainews-microsoft-build-mai-thinking</guid><pubDate>Wed, 03 Jun 2026 05:49:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PL7Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today was a big day, not least because we caught up on <a href="https://www.latent.space/p/github">the state of GitHub vs Agents</a>, and recorded a <a href="https://x.com/TheTuringPost/status/2061901518522188251?s=20">special pod with No Priors and Satya Nadella</a> &#8212;&nbsp;at MS Build, Satya and Mustafa announced 7 new MAI models:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PL7Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PL7Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 424w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 848w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 1272w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PL7Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png" width="1456" height="854" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:710781,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/200399328?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PL7Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 424w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 848w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 1272w, https://substackcdn.com/image/fetch/$s_!PL7Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e8ca90a-629c-44d5-af2f-0b0cd2a60aa2_1510x886.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>This is an impressive lineup, especially considering that the <a href="https://news.smol.ai/issues/24-03-20-ainews-shipping-and-dipping-inflection-stability-edition">Microsoft-Inflection deal that set up MAI </a>only happened 2 years ago, and that these are all from-scratch pretrains. MAI today is by no means an unqualified frontier lab, but it is a good tier 2 neolab with obvious incentives to support domain specific finetunes (as opposed to <a href="https://www.latent.space/p/ainews-the-end-of-finetuning">the frontier labs who have ~all killed finetuning</a>).</p><p>The star of the show was the <a href="https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf">100+ page MAI tech report</a>, which the research community is giving glowing reviews:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/eliebakouch/status/2061965825037254947&quot;,&quot;full_text&quot;:&quot;microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.\n\nthis model uses zero synthetic data or distillation from previous models. 
this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold&quot;,&quot;username&quot;:&quot;eliebakouch&quot;,&quot;name&quot;:&quot;elie&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1745893660099592193/MmYemsw6_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-03T00:18:56.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJ2SubUXkAA20X7.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/WkTkYaw9gF&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier.\nFirst is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks. \n- It&#8217;s a&quot;,&quot;username&quot;:&quot;mustafasuleyman&quot;,&quot;name&quot;:&quot;Mustafa Suleyman&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1927407622602276864/c_5uOZij_normal.jpg&quot;},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:81,&quot;like_count&quot;:685,&quot;impression_count&quot;:53708,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>You can catch up on all the rest of the announcement in the excellent Verge recap, and the tweet summaries below:</p><div id="youtube2-gw0HBKJlX-w" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;gw0HBKJlX-w&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/gw0HBKJlX-w?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" 
height="409"></iframe></div></div><blockquote><p>AI News for 6/1/2026-6/2/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Top Story: Microsoft Build recap, and new MAI model technical details</strong></p><h2><strong>What happened</strong></h2><p><strong>Microsoft used Build to position itself as both an AI platform company and a frontier-model lab, pairing broad product launches with unusually detailed disclosures about its new MAI model family.</strong></p><ul><li><p>Microsoft AI announced <strong>seven new MAI models</strong> spanning reasoning, code, image, speech transcription, and voice, led by <strong>MAI-Thinking-1</strong>, <strong>MAI-Code-1-Flash</strong>, <strong>MAI-Image-2.5</strong>, <strong>MAI-Transcribe-1.5</strong>, and <strong>MAI-Voice-2</strong>, according to <a href="https://x.com/MicrosoftAI/status/2061887500541366489">@MicrosoftAI</a> and <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>The flagship reasoning model <strong>MAI-Thinking-1</strong> was presented as Microsoft&#8217;s <strong>first reasoning model</strong>, built with <strong>clean data lineage</strong> and <strong>zero distillation from third-party models</strong>, in posts from <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a>, <a href="https://x.com/baseten/status/2061878701823066431">@baseten</a>, <a 
href="https://x.com/tuhinone/status/2061879239817969756">@tuhinone</a>, and <a href="https://x.com/HannaHajishirzi/status/2061901432627044430">@HannaHajishirzi</a></p></li><li><p>Microsoft released a <strong>109-page technical report</strong> for MAI-Thinking-1, which drew strong positive reactions from technically oriented readers for its level of transparency, including <a href="https://x.com/eliebakouch/status/2061877335960281459">@eliebakouch</a>, <a href="https://x.com/ethanCaballero/status/2061920873297088723">@ethanCaballero</a>, <a href="https://x.com/nrehiew_/status/2062013300196700395">@nrehiew_</a>, <a href="https://x.com/yacinelearning/status/2061914159235617056">@yacinelearning</a>, and <a href="https://x.com/stochasticchasm/status/2061916808626815161">@stochasticchasm</a></p></li><li><p>Microsoft also emphasized <strong>local AI and agent-native Windows</strong>: Build messaging highlighted <strong>secure execution layers for agents</strong>, a new <strong>Surface RTX Spark Dev Box</strong>, Windows AI access to the broader Windows GPU install base, and concept hardware such as <strong>Project Solara/Scout</strong>, summarized by <a href="https://x.com/yusuf_i_mehdi/status/2061882543641907528">@yusuf_i_mehdi</a>, <a href="https://x.com/TheTuringPost/status/2061865165734506683">@TheTuringPost</a>, <a href="https://x.com/kimmonismus/status/2061860319547527191">@kimmonismus</a>, and <a href="https://x.com/kimmonismus/status/2061875714933371220">@kimmonismus</a></p></li><li><p>Build also included a major <strong>GitHub Copilot app</strong> push as the &#8220;desktop home for agent-native software development,&#8221; with <strong>canvases</strong>, cross-device continuity, and tighter GitHub agent workflows, from <a href="https://x.com/pierceboggan/status/2061868635241828688">@pierceboggan</a>, <a href="https://x.com/lukehoban/status/2061905434039246939">@lukehoban</a>, and reactions from <a 
href="https://x.com/techgirl1908/status/2061870470237164018">@techgirl1908</a></p></li><li><p>Microsoft introduced <strong>Web IQ</strong>, a new grounding/search API stack for AI agents, claiming the APIs already power &#8220;nearly all AI agents and chatbots in the industry today, including Copilot and ChatGPT,&#8221; via <a href="https://x.com/JordiRib1/status/2061866606670581871">@JordiRib1</a></p></li><li><p>Satya Nadella framed Build as an ecosystem moment rather than a single-product launch, while Mustafa Suleyman framed it as the output of Microsoft&#8217;s internal &#8220;hill-climbing machine,&#8221; in <a href="https://x.com/satyanadella/status/2061896503304806521">@satyanadella</a>, <a href="https://x.com/mustafasuleyman/status/2061934667096596657">@mustafasuleyman</a>, and reaction from <a href="https://x.com/nrehiew_/status/2061983583523475556">@nrehiew_</a></p></li></ul><h2><strong>MAI model family: disclosed facts and technical details</strong></h2><h3><strong>MAI-Thinking-1</strong></h3><ul><li><p>Microsoft described <strong>MAI-Thinking-1</strong> as a <strong>35B active parameter MoE</strong> with a <strong>256K context window</strong> in <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>A separate summary from <a href="https://x.com/scaling01/status/2061889624847343825">@scaling01</a> says the model is a <strong>1T@35B parameter model</strong>, <strong>pre-trained on 30T tokens</strong>, and trained using <strong>8192 GB200 GPUs</strong>; this appears to be a reading of the technical report rather than Microsoft marketing copy</p></li><li><p><a href="https://x.com/kimmonismus/status/2061877528781025381">@kimmonismus</a> similarly summarized it as a <strong>mid-size MoE with 45B active params</strong>, but this conflicts with Mustafa&#8217;s own <strong>35B active</strong> figure; the more authoritative figure in the tweet set is the official <strong>35B active</strong> 
number</p></li><li><p>Microsoft claims <strong>97% on AIME 2025</strong> and <strong>53% on SWE-Bench Pro</strong>, with blind human raters on Surge preferring it overall to <strong>Sonnet 4.6</strong>, from <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a> and <a href="https://x.com/asadovsky/status/2062008312603070891">@asadovsky</a></p></li><li><p>Microsoft says the model is <strong>optimized on MAIA 200</strong>, with <strong>30% better performance per dollar</strong> and <strong>1.4x performance-per-watt gain</strong> versus <strong>GB200</strong> when running MAI models end-to-end, per <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>Microsoft and partners repeatedly stressed <strong>no third-party distillation</strong>, &#8220;clean data lineage,&#8221; and enterprise-controlled fine-tuning with &#8220;100% eyes-off&#8221; post-training data through Baseten, in <a href="https://x.com/baseten/status/2061878701823066431">@baseten</a>, <a href="https://x.com/tuhinone/status/2061879239817969756">@tuhinone</a>, and <a href="https://x.com/MicrosoftAI/status/2061923309344756043">@MicrosoftAI</a></p></li></ul><h3><strong>MAI-Code-1-Flash</strong></h3><ul><li><p>Microsoft introduced <strong>MAI-Code-1-Flash</strong> as a fast coding model for <strong>VS Code</strong> and <strong>GitHub Copilot CLI</strong>, first announced by <a href="https://x.com/pierceboggan/status/2061877165810131297">@pierceboggan</a> and later highlighted by <a href="https://x.com/mariorod1/status/2061914993550143513">@mariorod1</a></p></li><li><p>Official Microsoft messaging via <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a> says <strong>Code-1-Flash achieves 51% on SWE-Bench Pro despite having just 5B parameters</strong>, positioning it near Haiku-class size/cost</p></li><li><p>A competing summary from <a 
href="https://x.com/scaling01/status/2061891478176112794">@scaling01</a> describes it as a <strong>137B parameter MoE</strong>, <strong>256K context</strong>, trained on <strong>10T+ tokens</strong>, and &#8220;stronger and more efficient than Claude 4.5 Haiku.&#8221; That likely indicates <strong>5B active parameters</strong> rather than total parameters; the tweets do not fully reconcile this distinction, but together imply <strong>small active footprint within a much larger MoE</strong></p></li><li><p>Availability at launch was highlighted as <strong>GitHub Copilot / VS Code-first</strong>, per <a href="https://x.com/scaling01/status/2061891478176112794">@scaling01</a> and <a href="https://x.com/mariorod1/status/2061914993550143513">@mariorod1</a></p></li></ul><h3><strong>MAI-Image-2.5</strong></h3><ul><li><p>Microsoft launched <strong>MAI-Image-2.5</strong> and a <strong>Flash</strong> variant, claiming both reached <strong>#2 on leaderboards</strong>, with <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a> saying they surpass <strong>Nano Banana 2</strong> on image editing</p></li><li><p>Independent leaderboard accounts supported the high ranking: <a href="https://x.com/arena/status/2061887242579382660">@arena</a> reported <strong>#2 in Image Edit Arena</strong> with <strong>score 1401</strong>, <strong>+10 points over Nano Banana 2</strong>, Grok Imagine, and ChatGPT Image Latest HF</p></li><li><p><a href="https://x.com/arena/status/2061894541888962712">@arena</a> further said MAI-Image-2.5 &#8220;advances the Pareto frontier,&#8221; meaning no model at its price tier scores higher on that benchmark</p></li><li><p>Distribution partners quickly followed, including <a href="https://x.com/OpenRouter/status/2061894672847671724">@OpenRouter</a> and <a href="https://x.com/fal/status/2061920052664820199">@fal</a></p></li></ul><h3><strong>MAI-Transcribe-1.5</strong></h3><ul><li><p><a 
href="https://x.com/ArtificialAnlys/status/2061878491860324402">@ArtificialAnlys</a> reported <strong>MAI-Transcribe-1.5</strong> as an unusually strong speed/accuracy point on the STT frontier: <strong>~276x realtime</strong>, <strong>2.4% AA-WER</strong>, <strong>#3 overall</strong> on its leaderboard</p></li><li><p>The model supports <strong>43 languages</strong>, including English, French, Arabic, Japanese, and Chinese, and supports <strong>keyword biasing</strong> for rarer terms such as names and medical terminology, per <a href="https://x.com/ArtificialAnlys/status/2061878491860324402">@ArtificialAnlys</a></p></li><li><p>Pricing was reported as <strong>$6 per 1,000 minutes of audio</strong> via Microsoft Foundry in <a href="https://x.com/ArtificialAnlys/status/2061878498609053909">@ArtificialAnlys</a></p></li><li><p>OpenRouter also listed the model among the three MAI launches it brought live the same day in <a href="https://x.com/OpenRouter/status/2061894672847671724">@OpenRouter</a></p></li></ul><h3><strong>MAI-Voice-2</strong></h3><ul><li><p>MAI-Voice-2 appears in Microsoft&#8217;s &#8220;seven models&#8221; umbrella and in OpenRouter&#8217;s availability post at <a href="https://x.com/OpenRouter/status/2061894672847671724">@OpenRouter</a></p></li><li><p>The tweet set contains little technical detail on Voice-2 itself beyond launch/availability</p></li></ul><h2><strong>Technical-report details that mattered to researchers</strong></h2><h3><strong>Why the report stood out</strong></h3><ul><li><p>The dominant technical reaction was that Microsoft released an unusually detailed frontier-model report: <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a> called it &#8220;one of the most transparent for a model at this scale,&#8221; <a href="https://x.com/nrehiew_/status/2062023547690828141">@nrehiew_</a> said it &#8220;could really serve as an updated textbook for LLM training today,&#8221; and <a 
href="https://x.com/stochasticchasm/status/2061879506139557979">@stochasticchasm</a> called it a &#8220;gold mine&#8221;</p></li><li><p>Multiple readers highlighted that the report disclosed <strong>pipeline details, scaling ladder methodology, data curation, infra metrics, and MFU numbers</strong>; this level of specificity is what drew praise from <a href="https://x.com/ethanCaballero/status/2061920873297088723">@ethanCaballero</a>, <a href="https://x.com/eliebakouch/status/2062004670017486912">@eliebakouch</a>, and <a href="https://x.com/nrehiew_/status/2062013300196700395">@nrehiew_</a></p></li></ul><h3><strong>Pretraining and data</strong></h3><ul><li><p>A major technical claim repeated across commentary is that MAI-Thinking-1 used <strong>no synthetic data</strong> and <strong>no distillation</strong>, not only in post-training but throughout the disclosed pipeline, from <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a>, <a href="https://x.com/stochasticchasm/status/2061967095022366924">@stochasticchasm</a>, and <a href="https://x.com/HannaHajishirzi/status/2061901432627044430">@HannaHajishirzi</a></p></li><li><p><a href="https://x.com/eliebakouch/status/2061977834558804207">@eliebakouch</a> says the report explicitly notes data from <strong>Common Crawl plus private sources</strong>, with <strong>targeted sub-pipelines for different domains</strong>, heavy extraction/dedup work, and an intentional choice of <strong>no synthetic data</strong></p></li><li><p>The report&#8217;s internal <strong>private NLL set</strong> used for scaling decisions was summarized by <a href="https://x.com/eliebakouch/status/2061976608265880004">@eliebakouch</a> as:</p><ul><li><p><strong>50% code</strong></p></li><li><p><strong>17.5% STEM</strong></p></li><li><p><strong>17.5% math</strong></p></li><li><p><strong>10% general knowledge</strong></p></li><li><p><strong>5% multilingual</strong></p></li></ul></li><li><p><a 
href="https://x.com/eliebakouch/status/2061976230933496176">@eliebakouch</a> says architecture promotion in the scaling ladder was based on an <strong>Efficiency Gain (EG)</strong> metric: how much extra compute the baseline would need to match the candidate&#8217;s loss</p></li><li><p>The same thread notes ablations at roughly <strong>100/200 tokens per parameter</strong>, described as around &#8220;Chinchilla optimal&#8221; for the setup, while also remarking this differs from dense-model heuristics due to MoE structure in <a href="https://x.com/eliebakouch/status/2061975730414633043">@eliebakouch</a></p></li></ul><h3><strong>Post-training / RL</strong></h3><ul><li><p>The most discussed technical choice was that Microsoft appears to have started RL from a checkpoint with <strong>no prior reasoning exposure</strong>, which several readers found notable. <a href="https://x.com/stochasticchasm/status/2061879070141677615">@stochasticchasm</a> called this a &#8220;very interesting decision,&#8221; while <a href="https://x.com/stochasticchasm/status/2061878066314645861">@stochasticchasm</a> reacted to graphs suggesting a jump from <strong>&lt;20% AIME25 to &gt;95%</strong></p></li><li><p><a href="https://x.com/HannaHajishirzi/status/2061901432627044430">@HannaHajishirzi</a> described the &#8220;climbing from scratch&#8221; recipe as <strong>simple recipes, rigorous science, self-distillation, patience, and great infra</strong></p></li><li><p><a href="https://x.com/soldni/status/2061882085573616003">@soldni</a> characterized the process as &#8220;climbing with no distillation, like the big boys do&#8221;</p></li><li><p>Some independent readers inferred from the report that <strong>synth data remains very valuable</strong> for agentic performance in the broader field, even if Microsoft deliberately avoided it here; see <a href="https://x.com/stochasticchasm/status/2061961874879783376">@stochasticchasm</a></p></li></ul><h3><strong>Data curation / judges / DSPy 
GEPA</strong></h3><ul><li><p>A detail that got substantial attention from the DSPy/late-interaction crowd: Microsoft reportedly used <strong>GEPA / DSPy-optimized LLM judges</strong> in pretraining data curation and quality scoring</p></li><li><p>This was highlighted by <a href="https://x.com/bj2rn/status/2061941109828301241">@bj2rn</a>, <a href="https://x.com/LakshyAAAgrawal/status/2062013650639241403">@LakshyAAAgrawal</a>, and <a href="https://x.com/lateinteraction/status/2062015109132873852">@lateinteraction</a></p></li></ul><h3><strong>Infra / utilization / hardware co-design</strong></h3><ul><li><p>Microsoft reportedly disclosed <strong>exact MFU across iterations</strong>, which multiple readers said is rarely shared at this scale, per <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a></p></li><li><p><a href="https://x.com/scaling01/status/2061889624847343825">@scaling01</a> summarized the run as using <strong>8192 GB200 GPUs</strong></p></li><li><p><a href="https://x.com/eliebakouch/status/2062004120098144764">@eliebakouch</a> singled out a reported <strong>~40% higher throughput per watt</strong>-type figure as &#8220;pretty impressive and bullish on microsoft chips,&#8221; though this may refer to rack-level budget or serving configuration and was not fully unpacked in-tweet</p></li><li><p>Microsoft&#8217;s official framing connected model design to <strong>MAIA 200</strong> custom silicon and emphasized better <strong>performance-per-dollar</strong> and <strong>performance-per-watt</strong> vs NVIDIA GB200 in <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>Build&#8217;s broader Windows/local-AI narrative also centered on hardware specifics such as:</p><ul><li><p><strong>1 trillion parameters running locally on DGX Station</strong></p></li><li><p><strong>128GB unified memory</strong></p></li><li><p><strong>110 TOPS AI performance</strong></p></li><li><p><strong>20 CPU 
cores</strong></p></li><li><p><strong>70+ PowerToys utilities</strong> from <a href="https://x.com/TheTuringPost/status/2061852480636653924">@TheTuringPost</a></p></li></ul></li><li><p>Reactions also pointed to local runs of large models, e.g. <a href="https://x.com/kimmonismus/status/2061852979318427988">@kimmonismus</a> on <strong>RTX Spark running a 120B parameter model locally</strong></p></li></ul><h2><strong>Build product/platform recap beyond the models</strong></h2><h3><strong>GitHub Copilot app and agent-native development</strong></h3><ul><li><p>GitHub unveiled the <strong>GitHub Copilot app</strong>, pitched as a desktop surface for <strong>agent-native software development</strong> by <a href="https://x.com/pierceboggan/status/2061868635241828688">@pierceboggan</a></p></li><li><p>Key themes included:</p><ul><li><p><strong>canvases</strong> for bidirectional work between users and agents, per <a href="https://x.com/Techmeme/status/2061875738694062419">@Techmeme</a></p></li><li><p>continuity across <strong>CLI, mobile, web, local, and cloud</strong>, per <a href="https://x.com/lukehoban/status/2061905448287322243">@lukehoban</a></p></li><li><p>a growing role for GitHub as the center of agent workflows, reflected in <a href="https://x.com/techgirl1908/status/2061870470237164018">@techgirl1908</a> and <a href="https://x.com/OrenMe/status/2061873010664001605">@OrenMe</a></p></li></ul></li><li><p>Copilot CLI also got an experimental <strong>terminal UI with tabs, built-in feedback/rubber duck, prompt scheduling, and voice input</strong>, per <a href="https://x.com/GHchangelog/status/2061870684876272123">@GHchangelog</a></p></li></ul><h3><strong>Windows as an agent runtime</strong></h3><ul><li><p>Microsoft&#8217;s Windows org framed Build around &#8220;faster developer execution, a secure execution layer for agents, and unmetered intelligence that runs locally on device,&#8221; per <a 
href="https://x.com/yusuf_i_mehdi/status/2061882543641907528">@yusuf_i_mehdi</a></p></li><li><p>Several posts stressed that Microsoft wants <strong>Windows</strong> to be the trusted execution platform for agents, not just Azure</p></li><li><p><a href="https://x.com/TheTuringPost/status/2061865165734506683">@TheTuringPost</a> described <strong>Project Solara</strong> as a platform for <strong>agent-first devices</strong>, with concepts including:</p><ul><li><p>a <strong>desktop AI companion</strong></p></li><li><p>a <strong>wearable badge</strong> with cameras, microphones, sensors, and secure authentication</p></li></ul></li><li><p><a href="https://x.com/kimmonismus/status/2061860319547527191">@kimmonismus</a> saw these as handheld/desktop devices for controlling agents and compared them to expectations people had for standalone OpenAI hardware</p></li><li><p><a href="https://x.com/kimmonismus/status/2061875714933371220">@kimmonismus</a> separately highlighted <strong>Microsoft Scout</strong> as an &#8220;always-on personal agent for work&#8221;</p></li></ul><h3><strong>Web IQ and search for agents</strong></h3><ul><li><p><a href="https://x.com/JordiRib1/status/2061866606670581871">@JordiRib1</a> announced <strong>Microsoft Web IQ</strong> as a suite of <strong>AI-native grounding APIs</strong> for <strong>web pages, news, images, and videos</strong></p></li><li><p>His framing is important context: classic search engines were built for humans, but Microsoft believes future search demand will come from agents, potentially <strong>1000x more queries</strong> than human search traffic</p></li><li><p>He claimed Web IQ was re-architected from Bing&#8217;s stack for <strong>quality, latency, and token efficiency</strong>, and that it already powers major chatbots including <strong>Copilot and ChatGPT</strong></p></li></ul><h3><strong>Foundry and open-model distribution</strong></h3><ul><li><p><a 
href="https://x.com/jeffboudier/status/2061868927207244277">@jeffboudier</a> said Satya cited <strong>11,000+ models available in Microsoft Foundry</strong>, of which <strong>10,928</strong> come from Hugging Face</p></li><li><p>This supports Microsoft&#8217;s parallel identity at Build: both a first-party model builder and a large multi-model hosting/distribution platform</p></li></ul><h3><strong>Build messaging around datacenters and compute</strong></h3><ul><li><p>Several observers noted Build discussion around <strong>data center expansion</strong>, community backlash, and Microsoft&#8217;s argument that AI infra can expand without raising electricity costs to local communities; see <a href="https://x.com/kimmonismus/status/2061854806395015316">@kimmonismus</a> and <a href="https://x.com/kimmonismus/status/2061903253890330639">@kimmonismus</a></p></li><li><p><a href="https://x.com/scaling01/status/2061901702324695115">@scaling01</a> highlighted Mustafa saying AI compute will grow <strong>1000x in the next 3 years</strong>, taking today&#8217;s rough <strong>5e27 FLOPs</strong> frontier scale to <strong>5e30 FLOPs by 2029</strong></p></li><li><p><a href="https://x.com/mustafasuleyman/status/2061880029315764256">@mustafasuleyman</a> summarized the company&#8217;s philosophical theme as <strong>&#8220;Humanist superintelligence&#8221;</strong></p></li></ul><h2><strong>Facts vs. 
opinions</strong></h2><h3><strong>Factual claims in the tweet set</strong></h3><ul><li><p>Microsoft launched <strong>seven new MAI models</strong> at Build: <a href="https://x.com/MicrosoftAI/status/2061887500541366489">@MicrosoftAI</a></p></li><li><p>Official metrics for MAI-Thinking-1: <strong>35B active MoE</strong>, <strong>256K context</strong>, <strong>97% AIME 2025</strong>, <strong>53% SWE-Bench Pro</strong>, and blind human preference over Sonnet 4.6: <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>Official metrics for MAI-Code-1-Flash: <strong>51% SWE-Bench Pro</strong>, <strong>5B parameters</strong> as stated in tweet copy: <a href="https://x.com/mustafasuleyman/status/2061880164498428188">@mustafasuleyman</a></p></li><li><p>MAI-Image-2.5 ranking claims were independently echoed by <a href="https://x.com/arena/status/2061887242579382660">@arena</a></p></li><li><p>MAI-Transcribe-1.5 speed/accuracy details came from independent benchmark account <a href="https://x.com/ArtificialAnlys/status/2061878491860324402">@ArtificialAnlys</a></p></li><li><p>Microsoft released a <strong>109-page technical report</strong>: <a href="https://x.com/eliebakouch/status/2061877335960281459">@eliebakouch</a></p></li></ul><h3><strong>Opinions / interpretations</strong></h3><ul><li><p>&#8220;Microsoft is training serious models now?&#8221; from <a href="https://x.com/teortaxesTex/status/2061892492350407158">@teortaxesTex</a> is an interpretive reaction to the model/report quality, not a standalone fact</p></li><li><p>Claims that the report is &#8220;one of the most transparent&#8221; or &#8220;an updated textbook&#8221; are opinions from <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a> and <a href="https://x.com/nrehiew_/status/2062023547690828141">@nrehiew_</a>, albeit shared by many readers</p></li><li><p><a href="https://x.com/kimmonismus/status/2061852480636653924">@kimmonismus</a> and <a 
href="https://x.com/TheTuringPost/status/2061865165734506683">@TheTuringPost</a> framed Build as a strategic shift from cloud-only AI toward local reasoning/agents; that is analysis rather than official wording</p></li><li><p>Posts claiming Microsoft &#8220;leaked&#8221; Anthropic Mythos FLOPs, including <a href="https://x.com/swyx/status/2061878629504881151">@swyx</a> and <a href="https://x.com/scaling01/status/2061897540161728791">@scaling01</a>, are speculative interpretations of a slide, later contested by the same cluster of commenters</p></li></ul><h2><strong>Different opinions and perspectives</strong></h2><h3><strong>Supportive views</strong></h3><ul><li><p>Technical readers were broadly impressed by the <strong>report&#8217;s transparency</strong> and Microsoft&#8217;s willingness to publish details usually withheld at this scale: <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a>, <a href="https://x.com/nrehiew_/status/2062023547690828141">@nrehiew_</a>, <a href="https://x.com/ethanCaballero/status/2061920873297088723">@ethanCaballero</a>, <a href="https://x.com/stochasticchasm/status/2061916808626815161">@stochasticchasm</a></p></li><li><p>Some saw MAI-Thinking-1 as proof Microsoft is becoming a genuine frontier lab rather than just a model reseller or application layer, e.g. 
<a href="https://x.com/teortaxesTex/status/2061892492350407158">@teortaxesTex</a>, <a href="https://x.com/echen/status/2061907282607100075">@echen</a>, <a href="https://x.com/NandoDF/status/2061901884042985728">@NandoDF</a></p></li><li><p>Enterprise/platform supporters liked the <strong>clean-data-lineage</strong>, <strong>fine-tunable</strong>, <strong>eyes-off post-training data</strong> story, especially Baseten/Microsoft&#8217;s positioning around ownership and control: <a href="https://x.com/baseten/status/2061878701823066431">@baseten</a>, <a href="https://x.com/tuhinone/status/2061879239817969756">@tuhinone</a></p></li></ul><h3><strong>Neutral / analytical views</strong></h3><ul><li><p>Several posts focused on <strong>reading and unpacking the report</strong> rather than cheering the launch, especially <a href="https://x.com/stochasticchasm/status/2061916808626815161">@stochasticchasm</a>, <a href="https://x.com/nrehiew_/status/2062013300196700395">@nrehiew_</a>, and <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a></p></li><li><p>Some commentators were careful on benchmark interpretation. <a href="https://x.com/kimmonismus/status/2061918020843557110">@kimmonismus</a> noted Microsoft appeared to compare to <strong>Sonnet 4.6</strong> generally, with <strong>Opus-level comparability only on SWE Pro</strong></p></li><li><p><a href="https://x.com/iScienceLuvr/status/2061926066453962952">@iScienceLuvr</a> specifically appreciated reporting on <strong>health benchmarks</strong> such as HealthBench Professional and MedXpertQA rather than only coding/math</p></li></ul><h3><strong>Skeptical / opposing views</strong></h3><ul><li><p>A subset questioned whether all numbers and comparisons were being interpreted correctly, especially around active params and external-model comparisons</p></li><li><p>The most visible skepticism concerned the apparent <strong>Mythos FLOP &#8220;leak&#8221;</strong>. 
<a href="https://x.com/iScienceLuvr/status/2061882397340393514">@iScienceLuvr</a> suggested it was probably just an estimate, not a leak; <a href="https://x.com/scaling01/status/2061989029025853757">@scaling01</a> later argued the original <strong>6.1e27 FLOP</strong> figure was unrealistic and supplied a lower alternative estimate before posting a correction in <a href="https://x.com/scaling01/status/2061990840138899674">@scaling01</a></p></li><li><p>There was also implicit skepticism in the field about whether <strong>zero synth / zero distillation</strong> is the right long-term recipe for best agentic performance, as noted by readers emphasizing synth-data deltas elsewhere, e.g. <a href="https://x.com/stochasticchasm/status/2061961874879783376">@stochasticchasm</a></p></li></ul><h2><strong>Context: why this matters</strong></h2><ul><li><p>Build&#8217;s announcements matter because they suggest Microsoft is no longer content with being only:</p><ol><li><p>Azure/OpenAI&#8217;s cloud host</p></li><li><p>GitHub&#8217;s developer surface</p></li><li><p>Copilot&#8217;s application shell<br>It is also trying to be a <strong>first-party frontier model developer</strong> with its own model family, silicon stack, and post-training platform</p></li></ol></li><li><p>The <strong>clean lineage / no distillation</strong> emphasis is strategically significant. It addresses enterprise concerns around IP provenance, future controllability, and dependence on external labs</p></li><li><p>The <strong>local AI</strong> emphasis matters because Microsoft is tying AI strategy to Windows and device distribution, not just to Azure. 
Build messaging repeatedly pushed the idea that reasoning models, planners, and agents can increasingly run <strong>on-device</strong>, not only in the cloud: <a href="https://x.com/TheTuringPost/status/2061852480636653924">@TheTuringPost</a>, <a href="https://x.com/yusuf_i_mehdi/status/2061882543641907528">@yusuf_i_mehdi</a></p></li><li><p>The <strong>109-page report</strong> matters because frontier-model transparency has generally been shrinking, especially around data, infra, and training methodology. Multiple researchers explicitly noted the disclosure level is uncommon at this scale: <a href="https://x.com/eliebakouch/status/2061965825037254947">@eliebakouch</a>, <a href="https://x.com/nrehiew_/status/2062023547690828141">@nrehiew_</a></p></li><li><p>The Build recap also showed Microsoft trying to integrate all layers of the stack:</p><ul><li><p><strong>models</strong>: MAI family</p></li><li><p><strong>chips</strong>: MAIA 200</p></li><li><p><strong>cloud</strong>: Azure + Foundry</p></li><li><p><strong>OS</strong>: Windows agent runtime</p></li><li><p><strong>developer UX</strong>: Copilot app / VS Code / CLI</p></li><li><p><strong>retrieval/grounding</strong>: Web IQ</p></li><li><p><strong>hardware form factors</strong>: Solara / Scout concepts</p></li></ul></li><li><p>This combination is why several observers described the event less as a normal dev conference and more as a coordinated move toward an <strong>agent platform spanning cloud, edge, OS, and custom models</strong>, e.g. 
<a href="https://x.com/satyanadella/status/2061896503304806521">@satyanadella</a>, <a href="https://x.com/mustafasuleyman/status/2061934667096596657">@mustafasuleyman</a>, and <a href="https://x.com/TheTuringPost/status/2061865165734506683">@TheTuringPost</a></p></li></ul><h2><strong>The &#8220;Mythos FLOPs leak&#8221; mini-story</strong></h2><ul><li><p>During/after Build, some users claimed a Microsoft slide inadvertently revealed training compute for Anthropic&#8217;s rumored <strong>Claude Mythos</strong>, with <a href="https://x.com/swyx/status/2061878629504881151">@swyx</a> asking if Mustafa had leaked the FLOP count</p></li><li><p><a href="https://x.com/scaling01/status/2061897540161728791">@scaling01</a> estimated the slide implied <strong>6.1e27 FLOPs</strong> with a confidence interval based on pixel measurement, while <a href="https://x.com/kimmonismus/status/2061908067034517853">@kimmonismus</a> noted that would be around <strong>Gemini 3.1 Pro-scale</strong> compute</p></li><li><p>That interpretation was subsequently challenged by <a href="https://x.com/iScienceLuvr/status/2061882397340393514">@iScienceLuvr</a>, who argued it was probably an estimate, and then by <a href="https://x.com/scaling01/status/2061989029025853757">@scaling01</a>, who posted a lower-range model-based estimate of <strong>3.37e26 to 1.46e27 FLOPs</strong> and later said the original numbers were <strong>bogus</strong> in <a href="https://x.com/scaling01/status/2061990840138899674">@scaling01</a></p></li><li><p>The episode is useful mostly as context: Build&#8217;s compute/scaling messaging was detailed enough that people started trying to infer competitor training budgets from presentation materials</p></li></ul><p><strong>Developer tools, agents, and coding workflows</strong></p><ul><li><p>OpenAI launched <strong>Sites in Codex</strong>, letting teams turn ideas/docs/plans into deployed internal websites/apps with auth and dynamic data, first for business/enterprise users, in <a 
href="https://x.com/OpenAI/status/2061845949170045346">@OpenAI</a>, <a href="https://x.com/TheRohanVarma/status/2061872164442403139">@TheRohanVarma</a>, and <a href="https://x.com/gdb/status/2061988413105156128">@gdb</a></p></li><li><p>OpenAI also expanded <strong>role-specific Codex plugins</strong> across sales, data analytics, creative production, product design, and public equity workflows, with access to <strong>62 apps and 110 skills</strong>, from <a href="https://x.com/OpenAI/status/2061887650391625870">@OpenAI</a> and <a href="https://x.com/OpenAIDevs/status/2061888366791246071">@OpenAIDevs</a></p></li><li><p>GitHub&#8217;s <strong>Copilot app</strong> and Microsoft&#8217;s Build push around agent-native software development were central to the day&#8217;s tooling news: <a href="https://x.com/pierceboggan/status/2061868635241828688">@pierceboggan</a>, <a href="https://x.com/lukehoban/status/2061905434039246939">@lukehoban</a>, <a href="https://x.com/GHchangelog/status/2061870684876272123">@GHchangelog</a></p></li><li><p>Anthropic shipped a <strong>CLI for Claude Platform</strong> and upgraded Claude Code&#8217;s <code>/fork</code> to run a background agent with exact context + prompt cache, in <a href="https://x.com/ClaudeDevs/status/2061877343078244459">@ClaudeDevs</a> and <a href="https://x.com/ClaudeDevs/status/2061947411141169494">@ClaudeDevs</a></p></li><li><p>Nous launched <strong>Hermes Desktop</strong>, a local/native desktop surface for Hermes agents, in <a href="https://x.com/NousResearch/status/2061843507417944552">@NousResearch</a>, <a href="https://x.com/Teknium/status/2061844602735538266">@Teknium</a>, and later Tailscale/Ollama integration notes from <a href="https://x.com/Teknium/status/2061984430370267210">@Teknium</a> and <a href="https://x.com/ollama/status/2062011585355551231">@ollama</a></p></li><li><p>Cognition launched <strong>Devin Desktop</strong>, positioned as an agent-neutral desktop for managing local/cloud agents and handoff 
between local planning and cloud execution, in <a href="https://x.com/cognition/status/2061889596703551926">@cognition</a>, <a href="https://x.com/ScottWu46/status/2061998361373532187">@ScottWu46</a>, and <a href="https://x.com/russelljkaplan/status/2061920322325205007">@russelljkaplan</a></p></li></ul><p><strong>Models, local inference, and routing</strong></p><ul><li><p>H Company launched <strong>Holo 3.1</strong>, a local computer-use model family based on Qwen-style architecture, with checkpoints from <strong>0.8B to 35B</strong> and formats including <strong>NVFP4, FP8, and Q4 GGUF</strong>; a popular summary cited <strong>79.3% on AndroidWorld</strong> for the 35B model in <a href="https://x.com/TeksEdge/status/2061825310669332818">@TeksEdge</a>, with launch tweet from <a href="https://x.com/hcompany_ai/status/2061815355341725925">@hcompany_ai</a></p></li><li><p>Perplexity announced <strong>hybrid agentic inference</strong> for Perplexity Computer, splitting work between <strong>local models on-device</strong> and frontier cloud models for privacy and token efficiency, in <a href="https://x.com/perplexity_ai/status/2061861293569765847">@perplexity_ai</a> and <a href="https://x.com/AravSrinivas/status/2061875858542096520">@AravSrinivas</a></p></li><li><p>OpenRouter data shared by <a href="https://x.com/ttunguz/status/2061846636805177692">@ttunguz</a> showed <strong>open-weight models at 69.1% of token volume</strong>, versus <strong>30.9%</strong> for closed models</p></li><li><p>Commentary around <strong>model routing</strong> as a key future abstraction came from <a href="https://x.com/ClementDelangue/status/2061871024627482964">@ClementDelangue</a>, <a href="https://x.com/garrytan/status/2061878212213572083">@garrytan</a>, <a href="https://x.com/matanSF/status/2061865185527074914">@matanSF</a>, and the counterpoint from <a href="https://x.com/glennko/status/2061896887699964171">@glennko</a>, who argued enterprise production reliability makes generic routing 
harder than enthusiasts suggest</p></li><li><p>Local-AI UX improvements also appeared in Hugging Face&#8217;s <strong>hardware compatibility checks</strong> and oMLX&#8217;s native macOS app release from <a href="https://x.com/m_newhaus/status/2061824017510584630">@m_newhaus</a> and <a href="https://x.com/jundotkim/status/2061863850874634242">@jundotkim</a></p></li></ul><p><strong>Research and evals</strong></p><ul><li><p>Google DeepMind announced <strong>Co-Scientist</strong>, a Gemini-based multi-agent hypothesis generation system for science, claiming collaborations that helped identify liver fibrosis targets, ALS approaches, and genetic leads for aging, in <a href="https://x.com/GoogleDeepMind/status/2061857539977842793">@GoogleDeepMind</a>, <a href="https://x.com/GoogleDeepMind/status/2061857550438392094">@GoogleDeepMind</a>, and <a href="https://x.com/GoogleDeepMind/status/2061857553076920643">@GoogleDeepMind</a></p></li><li><p>The new <strong>Crafter / CraftEditor</strong> work on editable scientific figure generation drew attention as a five-agent workflow for producing and refining figures plus raster-to-SVG conversion, in <a href="https://x.com/HuggingPapers/status/2061800325959324069">@HuggingPapers</a>, <a href="https://x.com/_akhaliq/status/2061835314599993392">@_akhaliq</a>, and <a href="https://x.com/TheTuringPost/status/2061883014410629400">@TheTuringPost</a></p></li><li><p>Tilde Research introduced <strong>Wall Attention</strong>, a RoPE-free attention method with diagonal forget gates, claiming training at <strong>4k</strong> and generalization to <strong>200k+</strong> tokens plus Triton kernels and strong decode throughput, in <a href="https://x.com/tilderesearch/status/2061839600562409581">@tilderesearch</a></p></li><li><p>A robotics vision encoder claiming <strong>+22.5% real-world OOD success</strong> by encoding dynamics-awareness rather than relying on static-image pretraining was posted by <a 
href="https://x.com/jbhuang0604/status/2061840469966090308">@jbhuang0604</a></p></li><li><p>New evals/benchmarks of note:</p><ul><li><p><strong>PaintBench</strong> for precise image editing, where best model reached only <strong>17.1%</strong>, from <a href="https://x.com/itskaixu/status/2061827068170518956">@itskaixu</a></p></li><li><p><strong>VSTAT</strong> for video state tracking, arguing frontier MLLMs remain weak at tracking evolving world state, from <a href="https://x.com/PinzhiHuang/status/2062004108249145442">@PinzhiHuang</a> and <a href="https://x.com/sainingxie/status/2062011403733512253">@sainingxie</a></p></li><li><p><strong>Data Agent Benchmark</strong> for enterprise data workflows, from <a href="https://x.com/sh_reya/status/2061984097531310378">@sh_reya</a></p></li></ul></li></ul><p><strong>Inference, infrastructure, and agent systems</strong></p><ul><li><p>Harvey + LangChain shared work on <strong>cheap verifiers</strong> for legal agents, showing <strong>DeepSeek V4 Flash</strong> could preserve <strong>94&#8211;96% agreement</strong> with Opus 4.7 while reducing cost <strong>18x</strong> in per-criterion mode and <strong>~1000x</strong> in batch mode; for <strong>3,200 RL rollouts</strong>, verification cost dropped from <strong>$18,000 to $18</strong>, in <a href="https://x.com/harvey/status/2061866491033899371">@harvey</a>, <a href="https://x.com/hwchase17/status/2061867746141356427">@hwchase17</a>, and <a href="https://x.com/nikogrupen/status/2061866707988431039">@nikogrupen</a></p></li><li><p>W&amp;B relaunched <strong>Weave</strong> as agent-first observability with integrations across common harnesses and automated detection of failure modes, in <a href="https://x.com/wandb/status/2061894943203831996">@wandb</a> and <a href="https://x.com/neutralino1/status/2061949197851742525">@neutralino1</a></p></li><li><p>Prime-RL integrated <strong>Mooncake Store</strong> with vLLM for cross-node prefix / KV cache reuse, pitched as key for agentic 
rollouts, in <a href="https://x.com/m_sirovatka/status/2061862853997465738">@m_sirovatka</a></p></li><li><p>Together detailed serving optimizations for <strong>MiniMax-M3</strong>, citing <strong>81&#8211;125% throughput improvements</strong> via KV-block-major sparse attention, paged decode, optimized index scoring, and multimodal preprocessing, in <a href="https://x.com/togethercompute/status/2061895336486949109">@togethercompute</a></p></li><li><p>MiniMax itself highlighted <strong>1M context</strong>, native multimodality, desktop-computer operation, and MSA reducing attention&#8217;s share of decode time from <strong>~30% to ~5%</strong>, in <a href="https://x.com/MiniMax_AI/status/2061944204604101020">@MiniMax_AI</a></p></li></ul><p><strong>Ecosystem, hardware, and industrial capacity</strong></p><ul><li><p>Westmag emerged from stealth to build <strong>American robot actuators and drone motors</strong>, with <strong>$11M raised</strong> led by a16z and participation from Founders Fund, Lux, NFDG, Menlo and others, in <a href="https://x.com/boxcardavid/status/2061825303715123234">@boxcardavid</a>, <a href="https://x.com/packyM/status/2061835223470330100">@packyM</a>, and <a href="https://x.com/oyhsu/status/2061837257531670864">@oyhsu</a></p></li><li><p>PyTorch noted NVIDIA adoption of <strong>OpenMDW-1.1</strong>, a permissive AI-model licensing framework, across four open-model families in <a href="https://x.com/PyTorch/status/2061840384817328604">@PyTorch</a></p></li><li><p>Martin Scorsese publicly demonstrated narrow, preproduction use of <strong>FLUX</strong> for storyboarding with Black Forest Labs, framed as exploratory and complementary to hand-drawn work rather than generative replacement, in <a href="https://x.com/robrombach/status/2061804823352086681">@robrombach</a> and <a href="https://x.com/TheRundownAI/status/2061834880917357011">@TheRundownAI</a></p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + 
/r/localLLM Recap</strong></h2><h3><strong>1. NVIDIA Nemotron 3 Ultra and RTX Spark Specs</strong></h3><p></p><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-microsoft-build-mai-thinking">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[GitHub's plan for Agents — Kyle Daigle, GitHub]]></title><description><![CDATA[GitHub pioneered the modern AI coding era with Copilot, and the resulting explosion in agentic coding has led to notable strains on the most popular developer platform in the world. Here's the plan.]]></description><link>https://www.latent.space/p/github</link><guid isPermaLink="false">https://www.latent.space/p/github</guid><pubDate>Tue, 02 Jun 2026 16:48:21 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200249307/0941c9607e83985da60443da5f6986fd.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>I&#8217;m excited to work with Microsoft once again as the presenting sponsor of the <a href="https://www.ai.engineer/worldsfair/2026">AI Engineer World&#8217;s Fair</a>!</em> <em>We&#8217;ll be streaming live from <a href="https://build.microsoft.com/">MS Build</a> today for a special crossover pod with <a href="https://x.com/saranormous/status/2061681787169017949?s=20">our friends at No Priors</a> and the one and only <strong>Satya Nadella</strong>. However, we did not hold back in this interview: we asked all the burning questions about uptime and Copilot that we know you have on your minds. Let&#8217;s go!</em></p><div><hr></div><p>For almost two decades, <strong>GitHub</strong> has been the home of software, where both open and closed source code flow through commits, pull requests, reviews, actions, etc.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kdaigle/status/2042581612400120093&quot;,&quot;full_text&quot;:&quot;Happy &#127856; Day, GitHub! I'm celebrating our birthday with the story about how defunkt, co-founder of GitHub, turned me down for a job 18 years ago after their launch... but it all worked out. 
&#129315;\n\nOn this day 18 years ago, GitHub was officially launched by defunkt, <span class=\&quot;tweet-fake-link\&quot;>@mojombo</span>, &quot;,&quot;username&quot;:&quot;kdaigle&quot;,&quot;name&quot;:&quot;Kyle Daigle&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1775869471074258944/GJGhWau0_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-10T12:33:00.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HFiCnT1akAM2OHh.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/QsejdrIJtG&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:16,&quot;retweet_count&quot;:18,&quot;like_count&quot;:243,&quot;impression_count&quot;:21934,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>This ecosystem flourished as open-source maintainers and contributors kept shipping code for the benefit of the community. However, as coding agents began to ship mass quantities of code, <strong>growing 1400% in 2026</strong>, GitHub entered a new era that was both extremely exciting and challenging.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/kdaigle/status/2040164759836778878&quot;,&quot;full_text&quot;:&quot;Yup, platform activity is surging. There were 1 billion commits in 2025. 
Now, it's 275 million per week, on pace for 14 billion this year if growth remains linear (spoiler: it won't.)\n\nGitHub Actions has grown from 500M minutes/week in 2023 to 1B minutes/week in 2025, and now&quot;,&quot;username&quot;:&quot;kdaigle&quot;,&quot;name&quot;:&quot;Kyle Daigle&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1775869471074258944/GJGhWau0_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-03T20:29:17.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;I would like to make my apologies for defending M$, but I must from time to time.\n\nI have to put respect on github for handling the amount of shit code that has been added over the last 3 months.\n\nliterally 10s of billions of lines of code that will never see the light of a CPU&quot;,&quot;username&quot;:&quot;ThePrimeagen&quot;,&quot;name&quot;:&quot;ThePrimeagen&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1924503772094517249/DfKkH0ph_normal.jpg&quot;},&quot;reply_count&quot;:156,&quot;retweet_count&quot;:569,&quot;like_count&quot;:7144,&quot;impression_count&quot;:2624594,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>While these agents help more people ship more projects, they also significantly raise the floor on how much code is shipped, how often it is shipped, and how many people commit code, multiplying the load on every dimension of GitHub infrastructure by orders of magnitude:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MG5m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!MG5m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 424w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 848w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 1272w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MG5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!MG5m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 424w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 848w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 1272w, https://substackcdn.com/image/fetch/$s_!MG5m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ec41954-9498-4c2f-b23a-81e2bae29f82_2761x1579.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a href="https://www.latent.space/p/valuemule">our valuemule pod</a></figcaption></figure></div><p>GitHub&#8217;s infrastructure, originally designed around human developers moving at human speed, is now inevitably under more pressure. This has resulted in a very public uptime story:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/SapphoSys/status/2039667138198372591&quot;,&quot;full_text&quot;:&quot;world's first enterprise solution to reach zero nines uptime &quot;,&quot;username&quot;:&quot;SapphoSys&quot;,&quot;name&quot;:&quot;chloe &#128007;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2037174241318338561/2ijI6Zlz_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-02T11:31:55.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HE5aUPyaYAAwwFH.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JtWWfZKM0C&quot;,&quot;alt_text&quot;:&quot;GitHub Platform: 89.91% uptime&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:120,&quot;retweet_count&quot;:474,&quot;like_count&quot;:12629,&quot;impression_count&quot;:620162,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p></p><p>This raises the question of whether the current systems around code can absorb what AI produces. Can CI/CD keep up when every idea becomes a build? Can open source maintainers survive floods of AI-generated slop contributions? 
Can GitHub preserve the human social contract of software while becoming the operating layer for agents?</p><p>Which brings us to the perfect person to answer these questions: <strong>GitHub COO Kyle Daigle. </strong>In this episode, he joins swyx to unpack what happens when AI doesn&#8217;t just autocomplete code, but starts changing how companies operate, how open source works, how pull requests get reviewed, and how GitHub itself has to scale. </p><p>We go deep on <strong>GitHub&#8217;s internal AI workflows</strong>: micro-skills, WorkIQ, MCP, Slack, Teams, email, Copilot workflows, the new Copilot desktop app, CLI, cloud agents, and how Kyle <strong>uses agents to look backwards across company context before deciding what to do next.</strong> Kyle also reflects on GitHub&#8217;s history building webhooks, APIs, Actions, npm, Dependabot, and Semmle, why the AI era is breaking GitHub in new ways, how Actions became a general-purpose compute layer, and what Copilot becomes after code completion.</p><p></p><h2>Full Video Pod</h2><div id="youtube2-LEwlSyR0cXA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;LEwlSyR0cXA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/LEwlSyR0cXA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><div><hr></div><p><strong>We discuss:</strong></p><ul><li><p>Kyle&#8217;s <strong>expanded role</strong> across GitHub</p></li><li><p>How AI got Kyle <strong>coding again</strong> after years in leadership</p></li><li><p>Why GitHub rolls out AI through <strong>existing workflows</strong> instead of forcing new tools</p></li><li><p>WorkIQ, MCP, Slack, Teams, email, and GitHub as <strong>company context</strong></p></li><li><p>Why 
massive &#8220;mega-skills&#8221; are giving way to small, <strong>atomic micro-skills</strong></p></li><li><p>How AI changes <strong>summarization, communications, marketing</strong>, and analyst work</p></li><li><p>Why former developers in leadership may have a <strong>unique advantage</strong> in the AI era</p></li><li><p>Kyle&#8217;s <strong>&#8220;15 agents on Saturday&#8221;</strong> workflow</p></li><li><p>How Kyle built an <strong>AI-generated executive presentation</strong> for CRO/CFO teams</p></li><li><p>Why AI changes the <strong>chief of staff role</strong> without removing the human work</p></li><li><p>GitHub Actions, webhooks, arbitrary code execution, and <strong>secure agent compute</strong></p></li><li><p>The npm acquisition, <strong>supply-chain security</strong>, 2FA, and token invalidation</p></li><li><p>Slop forks, vendoring, and whether AI agents change <strong>dependency management</strong></p></li><li><p>What pull requests become when most PRs come from <strong>agents</strong></p></li><li><p>Prompt requests, vouching, AI review, and <strong>trust in open source</strong></p></li><li><p>What counts as a &#8220;developer&#8221; when AI lowers the <strong>barrier to building</strong></p></li><li><p>GitHub Spark, low-code, and why GitHub refuses to <strong>hide the code</strong></p></li><li><p><strong>14x commit growth</strong>, Actions load, databases, monorepos, and availability</p></li><li><p>Copilot&#8217;s evolution from completion to <strong>CLI, desktop app, cloud agents</strong>, and SDK</p></li><li><p>Context, memory, rules, and making GitHub <strong>&#8220;act like Kyle wants it to act&#8221;</strong></p></li><li><p>Ambient AI, OpenClaw, enterprise security, and the <strong>new operating system for agents</strong></p></li><li><p>What swyx should ask <strong>Satya Nadella</strong> about Microsoft&#8217;s AI future</p></li></ul><div><hr></div><p><strong>Kyle Daigle</strong></p><ul><li><p><strong>LinkedIn:</strong> <a 
href="https://www.linkedin.com/in/kyledaigle">https://www.linkedin.com/in/kyledaigle</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/kdaigle">https://x.com/kdaigle</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p><strong>00:00:00</strong> Introduction</p><p><strong>00:03:36</strong> Why AI Got Kyle Coding Again</p><p><strong>00:07:04</strong> Running GitHub with AI: WorkIQ, MCP, Slack, Teams, and Skills</p><p><strong>00:15:39</strong> The Golden Age for Former Developers in Leadership</p><p><strong>00:17:31</strong> 15 Agents on Saturday and AI-Generated Executive Work</p><p><strong>00:20:20</strong> How AI Changes the Chief of Staff Role</p><p><strong>00:21:45</strong> GitHub&#8217;s History: Actions, npm, Webhooks, and Open Source</p><p><strong>00:28:45</strong> Slop Forks, Vendoring, and AI Dependency Management</p><p><strong>00:33:57</strong> Pull Requests, Prompt Requests, and Trust in Agent-Generated Code</p><p><strong>00:41:21</strong> GitHub Stars, 200M+ Developers, and the New AI Builder Wave</p><p><strong>00:45:15</strong> GitHub Spark, Low-Code, and Why GitHub Still Shows the Code</p><p><strong>00:47:38</strong> GitHub&#8217;s Hardest Era: 14x Growth, Reliability, and Scale</p><p><strong>00:59:21</strong> Actions as the Compute Layer for CI/CD and Automation</p><p><strong>01:02:04</strong> The State and Future of GitHub Copilot</p><p><strong>01:08:24</strong> Ambient AI, Background Agents, and the Future of the SDLC</p><p><strong>01:13:09</strong> OpenClaw, Enterprise Security, and the New OS for Agents</p><p><strong>01:18:03</strong> Build Announcements, WorkIQ, FoundryIQ, and Microsoft Context</p><p><strong>01:21:41</strong> What Should swyx Ask Satya?</p><div><hr></div><h1>Transcript</h1><h2>Introduction: Kyle Daigle&#8217;s Expanded Role at GitHub and Microsoft</h2><p><strong>Swyx [00:00:00]:</strong> We&#8217;re here with Kyle Daigle, COO of GitHub. 
Welcome.</p><p><strong>Kyle [00:00:07]:</strong> Hey, thanks for having me.</p><p><strong>Swyx [00:00:08]:</strong> You&#8217;re not just COO of GitHub. People know you as that. You have a new role.</p><p><strong>Kyle [00:00:11]:</strong> So I have an expanded role now. I&#8217;ve been working at GitHub for thirteen years and doing all things developer. Joined as a developer myself. And now, I&#8217;m also responsible as the CMO of Developer for Microsoft. And so all the learnings and passion for developers, how we work with them, how we communicate, and how we bring our products to market, we&#8217;re also bringing that expertise to the broader Microsoft ecosystem and helping every developer that uses a Microsoft product or would like to have a similar experience to the one they&#8217;ve had with GitHub over the years. So it&#8217;s a different role in some ways, but it&#8217;s also just building on the experience that I&#8217;ve had at GitHub: tell the truth, be authentic, show people how to use it, and then let the products speak for themselves. Now just doing that with all of Microsoft.</p><p><strong>Swyx [00:01:09]:</strong> We&#8217;ll be releasing this in conjunction with Build. You&#8217;ve got lots of stuff planned, and we can touch on that whenever it&#8217;s appropriate. I think one of the interesting things is I rarely meet a COO who&#8217;s also a CMO. I think you&#8217;re very outward facing and you&#8217;re very confident publicly. That&#8217;s rare. Do you actually view yourself as COO? What is your thing?</p><h2>From GitHub Developer to COO/CMO: Building the Platform and Operating GitHub</h2><p><strong>Kyle [00:01:33]:</strong> I think for me, it&#8217;s been funny. The titles have always felt a little strange to me. I joined GitHub as a developer. I wrote so much of the</p><p><strong>Swyx [00:01:46]:</strong> Let&#8217;s bring that up. 
You wrote the back ends?</p><p><strong>Kyle [00:01:48]:</strong> I was going through some old photos when folks were talking about how things were being built or how there was a build GitHub. I built webhooks and worked with teams building the API, built the platform layer. Anything that integrated with GitHub, up until really twenty eighteen, I built or ran the engineering teams. And that&#8217;s kind of where the beginning of my passion always was: helping people build things and deliver them to their customers. And so being a developer, building for developers was always super unique. I think as my role expanded, it became my ability to talk to not just developers, but also enterprise customers or business leaders and have this translation layer. And then through all those years, GitHub has always operated pretty uniquely. Post-pandemic, working remotely was not as novel as it was when GitHub started in two thousand and eight. But all that expertise of running remote teams, doing it well, became this sort of bigger role, ultimately turning into the COO role of how do we operate GitHub in the way that GitHub&#8217;s always operated after the Microsoft acquisition. And kind of so on from there. So for me, I still code. I love coding, but the problem has always been people. It&#8217;s a much harder problem to support our own employees, and a harder problem to communicate to developers and enterprise buyers what we&#8217;re building and why it matters, &#8216;cause those are two very different messages. And so getting to work in the mix of COO, CMO, also just being a dev, I think is what&#8217;s kept me at GitHub for so long.</p><h2>AI Workflows for Leadership: Commits, Retrospectives, and Context</h2><p><strong>Swyx [00:03:40]:</strong> Apparently, your commits have gone up. What&#8217;s this? 
What&#8217;s going on?</p><p><strong>Kyle [00:03:45]:</strong> Rui&#8217;s called me out pretty aggressively. So, as you can imagine, you can see my normal era of being a dev in the twenty thirteen, twenty fourteen era, and then moving into management, and then ultimately the COO role. I think what you see there is me really getting back to coding thanks to AI. Similar to attaching problems between how to market and how to operate a business and how to code, I find building agents and workflows that connect very disparate problems to be what&#8217;s driving this. So some of it&#8217;s writing software. A lot of it is connecting a ton of different data sources to help me out. But that is completely me really diving in on the AI side, trying out our tools, trying out everyone&#8217;s tools, but building for me, building for the non-technical leader, though I&#8217;m technical, and looking at how we&#8217;re able to use these tools for more than just the simple call and response. I think for a lot of the non-technical folks, your employer says you have to use AI, and so everyone uses ChatGPT or Copilot or Claude or whatever. To really get into how this is going to help me out, I find that it&#8217;s not the &#8220;I need to write a blog post&#8221; kind of simple examples. It&#8217;s helping people find the workflows of, &#8220;Okay, I need you to go through all the PRs today. I need you to go through everything that we&#8217;ve posted online. I need you to go through what we did the last three months. Go through all of my Obsidian notes for any mentions of this, then go through my transcripts at work.&#8221; We use Teams, so, using WorkIQ, go call that MCP server, grab all the transcripts, go through all the Slack, and then build me out the plan of what this week&#8217;s messaging actually was. 
That&#8217;s something that was impossible before. For me, I find that with AI, most of this is actually less about building forward. It&#8217;s actually a recursive loop backwards. I&#8217;m always looking at what had happened first. Go back through the week and tell me what we did, what worked, what didn&#8217;t work. And then tell me, in the next three or four days, what would you tweak based on this sort of looking backwards and then looking ahead a little bit? I find that to be so much more valuable, especially for non-technical folks, because LLMs are actually very good at that retrospection. Finding all the patterns, pulling them out, and then applying that retrospection to just a couple of days or just a short period of time. It&#8217;s all a bunch of apps that I&#8217;ve built, and I&#8217;ve launched a bunch of internal tools. I use the new GitHub Copilot app, the desktop app with workflows. Every time I crack open my laptop, it&#8217;s running workflows for me. It&#8217;s just a ton of different stuff, and of course, it all ends up on GitHub.</p><p><strong>Swyx [00:06:47]:</strong> Of course. That&#8217;s where stuff is hosted. Man, there&#8217;s so much to ask you. I was going to leave the &#8220;how do you run a company with AI&#8221; thing to the end. I have to double click on one thing. You said you are looking back at the week. You&#8217;re understanding what happens. When you say &#8220;we,&#8221; that&#8217;s three thousand people. How?</p><h2>Rolling Out AI Internally: Skills, CLIs, and Company Context</h2><p><strong>Kyle [00:07:09]:</strong> I think when we started rolling out AI internally beyond engineering, one of the things that I was really passionate about is we have to do this in a way where no one has to change how they work. I don&#8217;t want to have to teach you a tool. I don&#8217;t want to have to teach you something new. And so for us, we tried out a few tools. 
Most of them don&#8217;t work because I&#8217;ve got to get you on board. I&#8217;ve got to teach you how to use it. What we&#8217;ve actually ended up doing is we&#8217;ve built a set of skills internally. We each have our set of skills, and we&#8217;ve just been distributing the CLI, even to the non-technical folks. And then effectively, we&#8217;re just giving it access to read about everything that we&#8217;re writing. So for us, that&#8217;s usually GitHub, Teams, email, and Slack. So Teams for video chat, generally speaking.</p><p><strong>Swyx [00:08:03]:</strong> Teams and Slack?</p><p><strong>Kyle [00:08:04]:</strong> So we use Teams for video communication, but we don&#8217;t use it for chat. GitHub for a long history, right? We&#8217;re always</p><p><strong>Swyx [00:08:13]:</strong> Also Slack</p><p><strong>Kyle [00:08:14]:</strong> talking about ChatOps, and everything is built into Slack. Like every command, every flow.</p><p><strong>Swyx [00:08:18]:</strong> So even though you have been acquired for, I don&#8217;t know, eight years now</p><p><strong>Kyle [00:08:22]:</strong> We still</p><p><strong>Swyx [00:08:23]:</strong> You still use Slack?</p><p><strong>Kyle [00:08:23]:</strong> It&#8217;s a purpose-built tool for us, and I think the reality is that moving off of it would be so bluntly expensive, simply because all the tooling is baked in with that paradigm. And they both have their pros and cons, but they don&#8217;t work the same way at all. We still use a bunch of different tools because it&#8217;s the purpose-built tools that we need. And then</p><p><strong>Swyx [00:08:47]:</strong> Well, the same doesn&#8217;t go for the rest of Microsoft, presumably.</p><p><strong>Kyle [00:08:50]:</strong> The various teams operate</p><p><strong>Swyx [00:08:53]:</strong> They make their own decisions</p><p><strong>Kyle [00:08:54]:</strong> In various ways. 
I think it just matters what you&#8217;re trying to do. But we do work across kind of every tool that we use, and then by giving everyone access to all of that context and the new WorkIQ MCP server, which is quite cool if you do live in the M365 world, I can ask it all these backwards-facing questions, and it&#8217;s incredibly important for our teams that are working remotely. There&#8217;s a lot of stuff you miss when you&#8217;re not in an office, and we are spread out all over the world. So most of that is looking back. And then we post automatically into GitHub Issues or Discussions these sorts of findings or our industry reports. Like what&#8217;s happening this morning, today, yesterday. A little automation gets run. We&#8217;ll use the app. We might use GitHub Actions with our agentic workflows just to go do that run, and then we push it into GitHub, and we keep having a conversation. So usually for us, it&#8217;s about that sort of looking back, looking forward on the non-technical side. And then of course for a lot of those folks, it&#8217;s also building an app, pushing it to GitHub Pages or pushing it somewhere to host it, et cetera. But it&#8217;s just enabling everyone with that power of, it&#8217;s going to take me a week to figure this out. Instead, we&#8217;re going, &#8220;Okay, I built a skill. Let&#8217;s put it into a repo. We&#8217;ll all share that skill together, and then we&#8217;ll use the CLI or now the app just to run it.&#8221;</p><h2>Micro Skills vs. Mega Skills: How GitHub Uses AI at Work</h2><p><strong>Swyx [00:10:26]:</strong> All right. I think we&#8217;re going straight into the team management and productivity thing. I think a lot of people are getting various levels of LLM psychosis. How do you manage the bloat of skills? 
Everyone has their thing, and they&#8217;re trying to promote it to the rest of their peers in their org, right? And obviously, whoever becomes a skill influencer internally becomes like an AI leader, right? Of sorts. I assume you have those.</p><p><strong>Kyle [00:10:50]:</strong> I think we have</p><p><strong>Swyx [00:10:52]:</strong> And I assume it&#8217;s a mess. Yeah.</p><p><strong>Kyle [00:10:54]:</strong> I think the reality is there&#8217;s two pieces. First is I think that we&#8217;re ending the era of these massive, beautiful, perfect skills that are just not any of those things. &#8216;Cause for a while, right, every tweet every day is like, go download the skills, the perfectly managed thing to do this entire workflow. And I think what we&#8217;ve found, and I was just with my team this week, and we were talking about the skill side, is we&#8217;re really talking about these incredibly micro skills that are just doing one thing for us very well, versus a skill that&#8217;s going to do, like I said, that full report. That doesn&#8217;t really exist on our side anymore. It&#8217;s usually a single skill that&#8217;s going to identify the most important marketing information given any MCP server. Like, this is the most important thing. It&#8217;s less about stitching a bunch of tools together and having it produce this mega output, because then weeks go by, months go by, things change, and you want to tweak</p><p><strong>Swyx [00:11:58]:</strong> It&#8217;s brittle</p><p><strong>Kyle [00:11:58]:</strong> your mega skill and you&#8217;re screwed. You can&#8217;t do that. And so now we&#8217;re really just talking about the Legos we&#8217;re using and just letting the instruction book be something we&#8217;re all putting together. 
Whereas I think a lot of AI skills for a while have been that mega instruction book style.</p><p><strong>Swyx [00:12:15]:</strong> I&#8217;ve thought a lot about Postel&#8217;s law. I don&#8217;t know if that&#8217;s a term that means things to folks. It&#8217;s the idea that you should be liberal in what you accept and strict in what you output, right? And I think that&#8217;s a good framing principle for skills. These are my skills, obviously on GitHub. You know how some repos in GitHub are special repos? I feel like we should sort of reify the slash skills and everyone give it some kind of special presentation. Anyway, so, yeah, this is one of those download anything, transcribe anything, and then you can string together the atomic skills that do one thing well into some kind of orchestration skill that calls other skills. Does that match?</p><p><strong>Kyle [00:12:56]:</strong> I think so. I think that the</p><p><strong>Swyx [00:13:00]:</strong> Summarize anything.</p><p><strong>Kyle [00:13:01]:</strong> For me, summarizing something, I do communications and PR and analyst relations and marketing and customer activities, and so my &#8220;summarize everything&#8221; is very different for each one of those contexts. &#8216;Cause if I&#8217;m summarizing something for an analyst, that&#8217;s a very different thing than how I&#8217;m probably going to summarize something for a customer meeting or an engagement. So that&#8217;s, I think, the difference when we&#8217;re talking about the tools I might use on Saturday or the skills I might use on a Saturday when it&#8217;s just for Kyle. Yeah, those have an atomic actual tool underneath, or maybe a skill, and then Kyle cares about X. 
But I think when we&#8217;re talking about work and enabling the marketers and communicators there, it&#8217;s the atomic &#8220;this is what good summarization is,&#8221; and then &#8220;this is what I care about&#8221; for marketing, for communications, for whatever. And that I think is the interesting matrix problem when we go from a developer set of concerns to all kinds of different professions: what that word means to me is different than what it means to you, which is different than what it means to the analyst or the salesperson, and that&#8217;s where I think the matrix mess is that we&#8217;re still starting to find. It&#8217;s about these mega skills, but they&#8217;re all just slight permutations, and those permutations are really important. It&#8217;s the difference between someone reading this and going &#8220;Did AI make this?&#8221; or &#8220;This makes total sense, and I would expect this when I&#8217;m giving a briefing to Gartner,&#8221; or whatever else.</p><p><strong>Swyx [00:14:37]:</strong> I think the beauty of it maybe is that you don&#8217;t have to be that careful about what goes in there. It doesn&#8217;t have to exactly fit as long as it roughly is contained in there. I used to complain about plugin hell, basically. Like when you have a framework and then you have a hundred things that you need to integrate, everyone does it, and GitHub used to be bloated full of these things. And now we don&#8217;t need them anymore &#8216;cause now you just use skills.</p><h2>Former Developers in Leadership: AI as a Creation Multiplier</h2><p><strong>Kyle [00:15:00]:</strong> And I think the most magical thing is just that I can also crack it open. 
Like yes, I could go change how the plugin is coded, or I could go do that now with AI, but I think there&#8217;s just something more magical about getting a response back and being like, &#8220;That&#8217;s not right,&#8221; and then you just crack the skill open, you just type English words and it&#8217;s different. That building block is just, I think, very unique, once I get everyone to understand how to best make those changes to get the most power out of them.</p><p><strong>Swyx [00:15:36]:</strong> You have your peer group of people like you. Is there a common framing for something I&#8217;m feeling, which is: is this a golden age for former developers who are now in leadership? Because you can wield the tools, you would know the right words, you&#8217;re maybe not too close to the details. Doesn&#8217;t matter. But you&#8217;re more effective than someone who doesn&#8217;t come from that background.</p><p><strong>Kyle [00:15:59]:</strong> I think that the secret has always been your ability to identify patterns and solve problems, and I think that for folks like myself that don&#8217;t code day to day anymore, that has made me successful as a developer, made me successful as a COO and now CMO. And so now that I have access to get and write code, I&#8217;m applying that sort of pattern finding and problem solving, and I know enough still about how to then go and say, &#8220;Oh, I want to make an app, but I don&#8217;t want to break into jail or create something that&#8217;s not going to be able to work or to be deployed at scale or whatever.&#8221; That ability to apply all that additional business knowledge and still code, I think, is what makes that so interesting to me. Slightly different, I think, than some of the other technical leaders that became business leaders and now are going back to their apps and updating them. Good for them. 
But I think the much more interesting thing is, well, now I have this whole new set of expertise over ten plus years. Why not take that and use that as a developer with these AI tools? So I definitely think that makes me more powerful, but I think that&#8217;s true for every dev as well. Most of the dev friends I still have also have some other underlying skill and passion. There are really talented, very kind of linear computer science software devs, absolutely. I just find that for the folks that came from a different career, went to school for something else, went off and did this random thing, and then became a software dev, or were a dev, did a random thing, came back, learning that extra set of information, learning those extra skills, and now having the power of an AI where I can crank up fifteen agents on Saturday while my kids are doing lacrosse, that&#8217;s really powerful. And I think it gets me back to that feeling of creation, and it&#8217;s very hard to replicate that in most other senses. That first time you build an app and you click it and you show someone, that&#8217;s magical. And so being able to do that not just in code, but across all kinds of different assets, that&#8217;s huge. Every year we do our revenue planning. We talk about, okay, what is it going to look like for next year? And of course, as you imagine, there are slideshows everywhere talking about what are we going to talk about, what&#8217;s the narrative, et cetera. And so, as you said, I&#8217;m like, &#8220;Okay, well, I could probably just build something to build this, and then that way I don&#8217;t have to go build the whole spreadsheet or pass it to my team.&#8221; So we went through this process, and I got all the information and used the skills I mentioned. I built a little app just to make it so I could look at some of the information in a SQLite database more easily. 
And I ultimately built this entire presentation without touching any of it, and I was like, &#8220;Okay, I&#8217;m just going to present this to our CRO, the CFO, their teams,&#8221; without mentioning I&#8217;d built it with AI. I built a skill to make it look very much not AI driven. Just not pretty.</p><h2>AI-Generated Presentations, Human Taste, and the Changing Chief of Staff Role</h2><p><strong>Swyx [00:19:03]:</strong> Like a design. Yeah.</p><p><strong>Kyle [00:19:03]:</strong> Not pretty. But just very clearly not AI. Kind of like, don&#8217;t do anything interesting.</p><p><strong>Swyx [00:19:08]:</strong> That&#8217;s, yeah, that is valuable.</p><p><strong>Kyle [00:19:08]:</strong> Exactly. We did the whole thing through. It used my notes from Obsidian, it used all the context I mentioned before, the plans, and it never came up once that it was AI generated.</p><p><strong>Swyx [00:19:20]:</strong> It didn&#8217;t matter.</p><p><strong>Kyle [00:19:20]:</strong> Never once. It didn&#8217;t matter. And so now I take</p><p><strong>Swyx [00:19:23]:</strong> This is a tool</p><p><strong>Kyle [00:19:23]:</strong> I can take that tool and go, &#8220;Look, I don&#8217;t want you to go build slideshows.&#8221; They&#8217;re just helping us share information with each other. If this thing can do it with a little bit of crafting from you and then we can look at it together, awesome. There&#8217;s no value in all that extra work. I think that the ability to make it look humanly bad and build a little app to manipulate the data is part of that upside for devs that are now in leadership roles. Because, like I said before, that&#8217;s all a people problem. 
I know if you&#8217;ve used a coworker or not to build a slide deck, unless you spent a bunch of time to not do it.</p><p><strong>Swyx [00:20:07]:</strong> I know, but I think there&#8217;s a certain charm to just being blatantly AI. &#8216;Cause I think that you&#8217;re just honest about it: there may be mistakes here that I cannot vouch for. So how much value is there? But anyway, actually the real question I want to ask is, you were a chief of staff to Thomas. And in the pre-AI world, that job would&#8217;ve been a chief of staff job of, like, can you prep me these slides and all that? And now you do it yourself.</p><p><strong>Kyle [00:20:35]:</strong> I still have a chief of staff. Because the difference is, and it&#8217;s sort of the discussion every time we have some sort of technology evolution, the roles don&#8217;t all go away, they just change. And so yeah, I don&#8217;t have someone spending all their time building out slides and presentations for me &#8216;cause I don&#8217;t need that anymore. But now I need that person that is able to go and find all the different connections between humans in those discussions to help me find out, okay, I should be meeting with this group and this team, and they have an opportunity, and I&#8217;m going to be in San Francisco today, I&#8217;m going to be in Seattle tomorrow. Those sorts of human connection aspects are still incredibly valuable and have always been a big part of that chief of staff role. But now, just like chiefs of staff are not opening up letters to process, they&#8217;re doing emails. It&#8217;s the same thing. And now they&#8217;re not building out as many of these presentations because they have the ability to have an AI take it on and share that with me, and great. 
Let&#8217;s keep moving &#8216;cause it&#8217;s allowing us to go faster and make better decisions more quickly.</p><p><strong>Swyx [00:21:45]:</strong> Awesome. Well, so we can dive into more productivity insights as we go. I did want to do a little bit of a brief history of Kyle and GitHub. Because we started here. And then you were also involved in the npm acquisition. I do want to touch upon that. And then more recently, I just want to bring it up to present day, where we&#8217;re having uptime issues, which transparently we&#8217;ve already addressed publicly, but we&#8217;ll discuss in the pod. Did I miss anything? Any other major highlights? Obviously, it&#8217;s a lot of years to cover.</p><h2>A Brief History of GitHub: Webhooks, Actions, Acquisitions, and Platform Evolution</h2><p><strong>Kyle [00:22:15]:</strong> No, I think one highlight was right before the acquisition closed in twenty eighteen, I got to launch the first version of Actions</p><p><strong>Swyx [00:22:27]:</strong> Oh</p><p><strong>Kyle [00:22:27]:</strong> at GitHub Universe. So it was</p><p><strong>Swyx [00:22:29]:</strong> They&#8217;re that young?</p><p><strong>Kyle [00:22:30]:</strong> It was October of twenty eighteen, I think. Yeah. Yeah.</p><p><strong>Swyx [00:22:33]:</strong> Gee, Jesus.</p><p><strong>Kyle [00:22:34]:</strong> I was the engineering leader on that project and got to launch that. And then, yeah, we did acquisitions of npm, like you said, Semmle, Dependabot, Pull Panda, a whole bunch of things. That was a big</p><p><strong>Swyx [00:22:47]:</strong> Pull Panda.</p><p><strong>Kyle [00:22:48]:</strong> Abi is doing well.</p><p><strong>Swyx [00:22:51]:</strong> DX. Holy crap.</p><p><strong>Kyle [00:22:52]:</strong> Did well on DX. And that was the big shift after the acquisition. 
I had to join the sort of business side.</p><p><strong>Swyx [00:23:00]:</strong> So I need to hit you on some of these things &#8216;cause you were there. Right? And how often do I get to talk to someone who was there? But yeah, Actions. Is that the number one source of security issues on GitHub?</p><p><strong>Kyle [00:23:11]:</strong> Oh, I think that the number one source of security issues is probably the literal code in everyone&#8217;s underlying repositories. I would say going back further than that, if you remember what I showed in this graph, and I didn&#8217;t say this before, this is ultimately webhooks.</p><p><strong>Swyx [00:23:30]:</strong> Yeah.</p><p><strong>Kyle [00:23:31]:</strong> Like circa whatever it was.</p><p><strong>Swyx [00:23:32]:</strong> It says Hookshot in there.</p><p><strong>Kyle [00:23:32]:</strong> I forget. Yeah. Yeah, Hookshot&#8217;s in there. And so back then, it says GitHub Services. Do you see, it says Hookshot FE for front end, and then it says GitHub Services. GitHub Services back in the old days, right? We had a repository that was Ruby code, and you could write any Ruby code in there, and then we would execute that on your behalf as a service, and that way, if you were trying to integrate with something, we would run it for you.</p><p><strong>Swyx [00:23:57]:</strong> And of course no containers &#8216;cause</p><p><strong>Kyle [00:23:58]:</strong> No, &#8216;cause it was</p><p><strong>Swyx [00:23:59]:</strong> Well, no containers</p><p><strong>Kyle [00:24:00]:</strong> 2014. And so there was some isolation obviously, but it was mostly separation at the server level.
That&#8217;s like an example, along with the very old version of Pages, which ran on its own containerization infrastructure, not on Actions.</p><p><strong>Swyx [00:24:15]:</strong> Which, like, all-time great product.</p><p><strong>Kyle [00:24:16]:</strong> Pages powers the internet at this point to some degree. Those were places where clearly there were no issues, to my knowledge. But it was those things where I&#8217;m looking at it and going, &#8220;Okay, well, we can&#8217;t be running arbitrary Ruby code&#8221; on everyone&#8217;s behalf. Then containerizing all of that up into Actions now, where, yeah, the containerization is really good. The pinning, most folks aren&#8217;t pinning it to a particular</p><p><strong>Swyx [00:24:48]:</strong> Images</p><p><strong>Kyle [00:24:48]:</strong> SHA, et cetera, in their workflows, and so that&#8217;s a big place of pain for folks if they&#8217;re just doing, similar to any dependency management, just v1 or newest or latest, I think. But that journey from that day of &#8220;Okay, we&#8217;re just going to run all this arbitrary code, and it&#8217;ll basically be okay,&#8221; to now, no, we have really good containerization. We have a new underlying agent containerization service. We&#8217;re using it under the hood. It&#8217;s through Azure. They recently announced it. The Azure Dev Compute. It&#8217;s very fast compute to be able to spin up your own cloud agents or whatnot. We&#8217;re using it under the hood for some parts of the new</p><p><strong>Swyx [00:25:36]:</strong> Microsoft Dev Box?</p><p><strong>Kyle [00:25:37]:</strong> No. Dev Compute, yeah.</p><p><strong>Swyx [00:25:41]:</strong> Hmm. Not finding it just yet.</p><p><strong>Kyle [00:25:44]:</strong> Oh, it&#8217;s in there somewhere.</p><p><strong>Swyx [00:25:46]:</strong> All right.
Well, we&#8217;ll cut that out.</p><p><strong>Kyle [00:25:47]:</strong> Sorry. But with Dev Compute, you can spin up really small VMs really quickly, so you&#8217;re doing a tool call</p><p><strong>Swyx [00:25:58]:</strong> Same concept</p><p><strong>Kyle [00:25:58]:</strong> Just do it containerized. Exactly. So we&#8217;re using that, definitely moving in that direction to protect us from every piece of code that we&#8217;re ultimately running.</p><p><strong>Swyx [00:26:07]:</strong> Look, that grows into the full SDLC. Code hosting was just the start, and then it&#8217;s grown beyond that. Let&#8217;s talk about NPM, maybe, &#8216;cause I think that&#8217;s also a very major point in the industry. I do think it was looking for a home. It was kind of struggling as a business, right? I don&#8217;t know how you would characterize that whole acquisition and how it</p><h2>NPM, Package Security, and Keeping the Internet Running</h2><p><strong>Kyle [00:26:33]:</strong> When we were talking to the team, I think the big thing for both of us was to find a way to keep NPM, which was basically powering the internet then, and way more so now to some degree, running. Keep it going, keep continuing to scale. It was having scaling problems, if I recall, back at that time. They were doing some rewrites. It</p><p><strong>Swyx [00:27:00]:</strong> That&#8217;s cute compared to now.</p><p><strong>Kyle [00:27:01]:</strong> Well, that&#8217;s the thing. When I&#8217;m talking to folks now, there are so many more underlying uses of NPM than there were back when we had them join in with GitHub. But that was ultimately the goal. It was really, okay, we used to have pages. We have the world&#8217;s code. Let&#8217;s make sure that we can keep NPM running well for the world.
And we put a bunch of time and investment into some of the underlying backend changes, some of which we talked about, some of the manifest work, et cetera. And now, really trying to bring the security posture of NPM up to speed. But it is a unique challenge in that every move that we make to make it more secure will break a lot of people. And security is paramount. And also, we take it very seriously. Any time that we have a problem with GitHub, or we make a change that makes us more secure but hurts, there&#8217;s a snow day for developers or a really bad fire that they have to go put out. And so we have changed the 2FA policies. We&#8217;ve changed the way the tokens work. When we find tokens that have been exposed or potentially exposed, we invalidate them, and</p><p><strong>Swyx [00:28:22]:</strong> I love that feature in GitHub. Yeah, it&#8217;s great</p><p><strong>Kyle [00:28:23]:</strong> That creates issues, but that&#8217;s the thing: we&#8217;re trying to push the community forward without necessarily doing something that is going to break the contract that&#8217;s been there for 15 years, or close to it, or some amount of years on NPM.</p><h2>Slop Forks, Vendoring, and the Future of Open Source Supply Chains</h2><p><strong>Swyx [00:28:43]:</strong> So now we&#8217;re talking about open source and publishing. And I think there&#8217;s something here with what people are calling slop forks, which I think Malte from Vercel is doing. And part of me thinks, well, the way to get past any vulnerabilities is, let&#8217;s just get rid of the concept of NPM. And we only publish source code. And anytime you want to import it, you have your coding agent look at it and then adapt whatever subset you&#8217;re going to use, and you vendor it. But the AI vendors it. Is that realistic? I don&#8217;t know. Will that solve all our security issues?
I don&#8217;t know.</p><p><strong>Kyle [00:29:24]:</strong> I don&#8217;t think it&#8217;ll solve I so Mitchell was just talking Mitchell Hashimoto Was just talking about this today, and I think that I-in some ways, it&#8217;s all all things, old or new again? Yeah, absolutely vendoring everything. Like I do I do remember twenty thirteen, twenty fourteen.</p><p><strong>Swyx [00:29:42]:</strong> This is Yeah. Let&#8217;s, we must return to</p><p><strong>Kyle [00:29:43]:</strong> That&#8217;s what is We were vendoring everything. We were having actual discussions around, or at least I remember we were &#8220;Should we take this full thing?&#8221; &#8220;Why is this so big? We only need this one file.&#8221; And so I do think there&#8217;s something true there where having either taking only what you need or the dependencies just getting incredibly small over time, I think will help to some degree, but it&#8217;s not going to solve the fundamental problem, I don&#8217;t think, because the vulnerabilities in an agent looking at them, there&#8217;s time and time again, there&#8217;s a million different ways in which we can convince an agent that this thing is, secure or not and pull it in. Or we can do static code analysis or runtime testing to say whether the code works or not. That is, I think, the step that needs to continue to be, invested in. The question is just on, how much scope. Should it be this enormous project that I&#8217;m pulling down, or should it be this piece? Either most companies are running some amount of security checking on the on the packages that they&#8217;re bringing in or vendoring. That I think won&#8217;t change. That&#8217;s like what advanced security does to some degree, Socket does some degree. Like everyone is doing a piece of that. How we each do that like especially when we&#8217;re talking to enterprise customers, is just like very different. No there&#8217;s no one wants one single way to do it. 
And I think that&#8217;s always been GitHub&#8217;s, unique position in the world. I talk a lot to maintainers, I talk a lot to folks about this. It&#8217;s we&#8217;re&#8212; we rarely start like a process and a practice and like push it onto the community. We usually wait for the sort of like RFC process socially or literally, everyone agreeing, and then we&#8217;ll cement something in. Because otherwise we&#8217;re</p><h2>Maintainers, RFCs, Vouching, and the Social Layer of Trust</h2><p><strong>Swyx [00:31:35]:</strong> That fits your role in the ecosystem, yeah</p><p><strong>Kyle [00:31:36]:</strong> We&#8217;re GitHub. Yeah, we don&#8217;t want to shape the whole thing. We want it to be figured out. But like how do you balance that like sort of Role in the industry to keep everything as secure as is possible and make sure that you&#8217;re you&#8217;re not going to be compromised as a human, &#8216;cause that&#8217;s usually how it all happens. And Not not create a process or lock us into a flow that you&#8217;re not going to or like Mitchell&#8217;s not going to or other open source projects aren&#8217;t going to like. That&#8217;s always been a tricky balance for us, and I think that&#8217;s something that we haven&#8217;t talked about enough is we&#8217;re not going to be able to fix everything for everyone in a way that everyone is going to like. So tell, help us, tell us what is working. When Mitchell was talking about, the Upvote, the up</p><p><strong>Swyx [00:32:22]:</strong> I was going to bring up his thing. Yeah.</p><p><strong>Kyle [00:32:23]:</strong> I forget what it Yeah. When he&#8217;s talking to us, I was chatting with him and talking to him about this and I put it on Twitter and we talked to, also over DM, was &#8220;We&#8217;re going to keep working.&#8221; but I think the important thing is I do actually want to hear what isn&#8217;t working for you. And as, be as specific and clear for your project as is possible. 
And to every piece of credit over the many years that we&#8217;ve known each other through the industry, he&#8217;s always done that and I appreciate that &#8216;cause there are places that we need to fix up, and we hear from him, and we&#8217;ll fix up just like we do all other kinds of maintainers. But that that process between making those types of improvements and being more secure and like creating, I forget what he calls it&#8217;s not the proof process, not the claims process. Do what I&#8217;m talking about? He has that he his projects have a way for you to kind of like,</p><p><strong>Swyx [00:33:13]:</strong> Vouch</p><p><strong>Kyle [00:33:13]:</strong> Vouch. Thank you. Yeah. He has like the vouch system for saying, &#8220;Hey, you should accept my PRs.&#8221; That&#8217;s been</p><p><strong>Swyx [00:33:20]:</strong> I just built this into GitHub. I don&#8217;t know.</p><p><strong>Kyle [00:33:22]:</strong> Well, see, but that&#8217;s the thing is that you say that and like he and his community really likes this and then I&#8217;ll go talk to other maintainers and other maintainers, globally, and they&#8217;re &#8220;No, this doesn&#8217;t work for me.&#8221; And that is the tension, but also the kind of beauty of GitHub, depending on which way you look at it is we want to help maintainers, so we create all these tools to let you have more control over how much you take in from AI and PRs. But you can also use this. What You can go use this project, and if it takes off and becomes the kind of mostly standard, then yeah, we probably wouldn&#8217;t enforce it but we would add it in because that&#8217;s the flow that we tend to do?</p><p><strong>Swyx [00:34:02]:</strong> I hear a lot of people don&#8217;t know the history of the pull request. And like like that&#8217;s how, that&#8217;s something that GitHub standardized basically.</p><p><strong>Kyle [00:34:08]:</strong> Yeah. 
It was a very messy process Like beforehand, and now the we have the benefit of it being the process? And now we have to go and Figure out the next best process or what adaptations change, or what does a pull request look like when eighty percent of your PRs are just coming from your agents and not From other devs?</p><p><strong>Swyx [00:34:31]:</strong> Do you like the prompt request idea from Peter?</p><p><strong>Kyle [00:34:34]:</strong> like I think that for each like each idea I think has its merits. I&#8217;m not, I&#8217;m not avoiding saying anything good or bad, but I feel like I&#8217;ve seen a version of we have that we have entire Thomas&#8217; store. Take all the assets of what you&#8217;ve built and put that in. I think that&#8217;s got great ideas. There&#8217;s all these various permutations of the PR flow, but I think the reason why there&#8217;s not a single answer is ultimately we&#8217;re trying to codify trust. We&#8217;re trying to say &#8220;Okay, if Sean reviews this I&#8217;m going to trust it because you&#8217;re Sean or you&#8217;re the senior dev or you&#8217;re the whatever.&#8221; And right now, when we are working in a flow where an agent writes code and another agent reviews code and then Kyle goes and looks at it the trust is kind of diffuse. And most of the tools that we&#8217;re talking about are talking more about verification flows. We have more assets to look at, so I can probably say whether this is a good PR or not. But that still doesn&#8217;t solve, I think, the human problem of I&#8217;m looking at a PR and I want to know if I can trust it. And we&#8217;re still, we still tend to use human signals for that? Mitchell approving it or Kyle approving it or whatever. And so I think that&#8217;s, I think that&#8217;s why most of these options haven&#8217;t really solved it is because, it&#8217;s a social problem ultimately. It&#8217;s a it&#8217;s a human problem to review it and agree. 
Or you fully trust the tool and you&#8217;re imbuing that tool with full trust Which I think in some cases that absolutely exists.</p><h2>AI-Generated PRs, Trust, and the Waymo Analogy</h2><p><strong>Swyx [00:36:08]:</strong> And so like in the same way that there will be a tipping point in society when we don&#8217;t allow humans to drive anymore Because machines are measurably better than Than humans. I&#8217;m looking for that tipping point, right? Like Mythos is ridiculously expensive. Someday we&#8217;ll have Mythos on a desktop. I don&#8217;t know. Will, does that change the equation?</p><p><strong>Kyle [00:36:30]:</strong> I think it&#8217;s more I took a Waymo here, and I was on my phone and not looking around at all. There are other, self-driving, vehicles that I would not trust while, staring at the road. And I think that trust is something that is</p><p><strong>Swyx [00:36:48]:</strong> Is this a Zoox thing? What is it</p><p><strong>Kyle [00:36:50]:</strong> I think that is both. I think that is both. Like</p><p><strong>Swyx [00:36:53]:</strong> There&#8217;s Zoox in this robo taxi. That&#8217;s it. It&#8217;s</p><p><strong>Kyle [00:36:56]:</strong> Well, depending on what level Of self-driving. But, my point is sort of that I think part of that is I strongly believe that&#8217;s, a mixture of verifiable proof. Like how many accidents, how much data, and so on, and the human aspect of how I feel when I&#8217;m in this car, what it tells me, et cetera. And so that&#8217;s why I think some of the like Some of these some of our AI tools tend to, imbue me with more of that feeling of trust, even if the data says this is 100% accurate. I feel like it takes more time for us to go, &#8220;Should I trust this or not?&#8221; And that&#8217;s in the soft sense of, startups with high agency, weekend projects, and open source. 
And then there&#8217;s enterprises and regulated industries and everything else, and that is an even harder problem to go solve because even when it is fully verified, not only do you have to have trust from the humans on the team, you probably have to have trust from multinational,</p><p><strong>Swyx [00:37:55]:</strong> Oh my God</p><p><strong>Kyle [00:37:55]:</strong> Multi governments around the world and regulating agencies. And so that&#8217;s where I feel like until we tip over to your point on the sort of like human EQ side of it. I feel okay this feels okay I&#8217;ve been proven enough. Then the ball will start to roll a lot faster, where we&#8217;ll end up getting to the &#8220;Okay, we can trust this,&#8221; and feel good about it in the Most difficult of cases.</p><h2>Reputation, Sponsors, Stars, and Bot Activity on GitHub</h2><p><strong>Swyx [00:38:18]:</strong> If human trust is the thing that matters, I feel like GitHub as the developer social network could maybe do more there. Like vouchers are one system But, we have star counts, and then we have Contributor rights, and that&#8217;s it. And I feel like there should be more in that space. I don&#8217;t know if there&#8217;s any other design decisions there.</p><p><strong>Kyle [00:38:37]:</strong> I think that one of the places that we don&#8217;t really expose right now in this sort of way is, some degree of like hard trust and support, which would like for me is like sponsors is a good example of that.</p><p><strong>Swyx [00:38:49]:</strong> Ah.</p><p><strong>Kyle [00:38:49]:</strong> It like costs you something. To prove that I believe in your project and I trust you To some degree or I want to support you at the very least.</p><p><strong>Swyx [00:38:56]:</strong> Solve payments for open source. 
Why not?</p><p><strong>Kyle [00:38:58]:</strong> I think that I think that like as we keep moving forward, right, there&#8217;s more and more projects where I&#8217;m, adding more and more dollars into sponsors personally because I want to like support them, but I also like know of I&#8217;ve probably never met them in person, but, I know of enough of their work that I want to support them. I think the thing that I don&#8217;t love about stars or commit counts or anything else is ultimately, even with all of the various, abuse and de-spamming and deduplication work that we do or anti-abuse work that we do, these are all, not active social signals. They&#8217;re passive ones that are ultimately gamifiable. And you may trust me, but another open source maintainer may not. And on what heuristic should you be, trusting me? That I think, is kind of where some of our thinking is right now. What signal from me is most important to you? You&#8212; If you can define that potentially, honestly in an agentic workflow that&#8217;s what we see some of these open source projects do, where you have GitHub actions, and then you have like an agentic workflow that&#8217;s calling AI, and you&#8217;re setting these rules. Like if Kyle has submitted and gotten accepted PRs across any given project and has a social handle tied to his account in GitHub, and that social account&#8217;s older than a certain amount. Really complex measures that matter to you &#8216;cause most open source projects have that heuristic built into their heads, if not written down in the contributing guidelines. You could take that and then go apply that and then just say, &#8220;Oh, we&#8217;re not going to accept this PR.&#8221; Building something that is, I think, malleable to everyone&#8217;s needs, is a little bit better, rather than going &#8220;Hmm, this account&#8217;s too young.&#8221; Because what happens? The attackers just go and go and create a multitude of accounts, and they wait Until it ages up. 
Needs to have a certain amount of stars. That&#8217;s how star inflation happens. Need to have a certain amount of repos</p><p><strong>Swyx [00:40:46]:</strong> Oh my God. Yeah</p><p><strong>Kyle [00:40:47]:</strong> With PRs. They all just create repos and submit PRs to each other, and then they come in and do something nefarious. And so it&#8217;s hard. It&#8217;s hard to find the measure. So I think we&#8217;re looking more at how we can provide you tools so you can choose what&#8217;s best for you. And of course, we&#8217;ll give you some standards. But the trust vector gets down to, I don&#8217;t know, some version of human digital ID, like everyone&#8217;s been talking about. Like how do I prove that it&#8217;s me</p><p><strong>Swyx [00:41:13]:</strong> Give me your eyeballs</p><p><strong>Kyle [00:41:14]:</strong> On the internet. Give me your eyeballs. Exactly.</p><p><strong>Swyx [00:41:18]:</strong> I&#8217;ve got to keep moving on topics, but obviously I can go all day on this stuff because I&#8217;ve been involved in GitHub and open source my entire professional career. Stars. Very superficial. Everyone knows it. But I think the time to one hundred thousand stars is the fastest I&#8217;ve ever seen. People just reach that in, I don&#8217;t know, months. And at the same time, I don&#8217;t trust it, right? How many of these are real or bot or whatever? I don&#8217;t know how to ask this, but what can we do about it? Like</p><p><strong>Kyle [00:41:49]:</strong> Just</p><p><strong>Swyx [00:41:49]:</strong> Are stars broken? Are stars fine?</p><p><strong>Kyle [00:41:51]:</strong> I think there are two pieces. Obviously we&#8217;re constantly trying to find ways in which users are producing spam, in which I would include only doing star gamification.
When we find them, we pluck &#8216;em out and we</p><p><strong>Swyx [00:42:08]:</strong> But it&#8217;s like a Whac-A-Mole</p><p><strong>Kyle [00:42:10]:</strong> It&#8217;s a hundred percent like a Whac-A-Mole</p><p><strong>Swyx [00:42:11]:</strong> There&#8217;s no way</p><p><strong>Kyle [00:42:11]:</strong> Now, powered by AI to be helpful. But I think more so what I&#8217;m seeing is, a lot of the fastest time to X tends to be because we&#8217;re now inviting so many more people into software development on GitHub that the zeitgeist is just swarming. And it&#8217;s</p><p><strong>Swyx [00:42:32]:</strong> It&#8217;s not just developers anymore</p><p><strong>Kyle [00:42:33]:</strong> And it&#8217;s not you and I. However you want to say what a developer is, it&#8217;s not just folks who have been coding for a very long time. It&#8217;s folks that have maybe started coding or only joined in since the AI era. And now</p><p><strong>Swyx [00:42:44]:</strong> What&#8217;s the latest Octoverse number? I know eighty million was the last number I remember for developers on GitHub</p><p><strong>Kyle [00:42:50]:</strong> Oh, we&#8217;re over 200 million now.</p><p><strong>Swyx [00:42:53]:</strong> Okay. Well, so you see?</p><p><strong>Kyle [00:42:55]:</strong> Like over 200 million developers now.</p><p><strong>Swyx [00:42:56]:</strong> But it&#8217;s not developers, right? It&#8217;s people with a GitHub account.</p><h2>What Counts as a Developer in the AI Era?</h2><p><strong>Kyle [00:43:00]:</strong> So this is the biggest debate that, I would say, everyone loves to have at GitHub at this point. From my perspective, right, I think that there&#8217;s clearly a difference between professional enterprise developers and then developers.
But I think that I think that the idea that we should be I don&#8217;t know, splitting hairs or segmenting developers in the early era of software development is, not worth our not worth the time. So</p><p><strong>Swyx [00:43:29]:</strong> When you get into gatekeeping</p><p><strong>Kyle [00:43:31]:</strong> 100%</p><p><strong>Swyx [00:43:31]:</strong> What is a developer?</p><p><strong>Kyle [00:43:31]:</strong> 100%. &#8216;Cause I wasn&#8217;t a developer when I started writing code? I was going to</p><p><strong>Swyx [00:43:36]:</strong> Oh, no. I made&#8212; I cloned a thing, seven years before I learned to code. And then I and then I wrote about my learning to code journey, and people Just called me a fraud &#8216;cause I had a GitHub account. And I&#8217;m &#8220;Well, no, I just use GitHub, but I don&#8217;t know-&#8221; &#8220;I didn&#8217;t know what I was doing.&#8221;</p><p><strong>Kyle [00:43:49]:</strong> I I remember that. I remember those sets of posts, and like that&#8217;s, that&#8217;s bullshit. So I fight very clearly on the line of, if you create code, if you have an idea and you create it into some way of, I&#8217;m, I&#8217;m going to run it and use the app right now, you may still use AI in that moment, but that&#8217;s okay. At some point you&#8217;re going to do the next thing. You&#8217;re going to create a big&#8212; You&#8217;re going to have to learn about this database. You&#8217;re going to fix a bug, whatever. We&#8217;re all on some same journey, and those people are also hearing about the great new agent skill package or a new CLI tool or a new whatever. And those projects are going up because you want to be a part of this moment, just like I wanted to be a part of the Ruby community when Ruby was popping off when I started becoming a developer, and now I can just click the star button. 
And so I think that yes, there&#8217;s clearly some amount of spamming and gamification that we&#8217;re working against, but I really think we&#8217;re just seeing this whole new cohort of folks that are moving from technology to technology because they&#8217;re not working on a 20-year-old software application. They&#8217;re working on a side app that they built on the weekend for their friends or for their new idea or whatever. And that&#8217;s how you see these enormous charts going up and to the right with stars.</p><p><strong>Swyx [00:44:59]:</strong> I think something that&#8217;s remarkable is the persistence, that GitHub extends to those folks. Usually when I see platforms go into a new audience, they have to have a second platform with a different name that wraps the main platform. But somehow GitHub has been able to persist and extend, and it&#8217;s friendly and whatever. So it&#8217;s nice.</p><h2>Spark, Low-Code, and Always Showing the Code</h2><p><strong>Kyle [00:45:19]:</strong> That&#8217;s partially why, I think, as we&#8217;ve tried to move into, I don&#8217;t know, more low-code-y things. So we started working on Spark as a way to build an app and run it. I think the reality is that anytime we put even a veneer on top of something, we still always show you the code. That&#8217;s kind of a tenet. We&#8217;re never going to hide the code from you, ever, because what</p><p><strong>Swyx [00:45:52]:</strong> Why would you?</p><p><strong>Kyle [00:45:52]:</strong> Yeah, that&#8217;s the whole point. However, I think that what we learned with things like Spark is that really the value of Spark for most devs is easy runtime.
And you may have a runtime or a host that you&#8217;re going to use for that or you just build something and run it but, the package of making that even more simple isn&#8217;t really needed for folks that are trying to build software and not just trying to build, an app, which is, slightly different, a slightly different goal. So I want to get you in, I want to get you comfortable. I think the best thing for me as, someone that did not traditionally come into software dev way back, I want anyone to be able to breach that chasm and not be in the I don&#8217;t know, I feel like we&#8217;re, we&#8217;re still in an era of, STEM. I&#8217;ve got a 12-year-old and an eight-year-old, and it&#8217;s &#8220;We got to get &#8216;em into STEM,&#8221;? Over and over. And I like I do, I do the things that good parents do. I was &#8220;Oh, you want to do coding?&#8221; &#8220;Yes, I want to do coding.&#8221; Do coding classes. But now they&#8217;re just not afraid of doing software. And that&#8217;s, I think, the thing that&#8217;s honestly kept me at GitHub for so long. Anyone should be able to go and build a thing, just like I can go change a light switch in my house. I&#8217;m not going to go into the breaker box &#8216;cause I&#8217;ll probably kill myself? But, I can go change that light switch. Everyone should be able to go and say, &#8220;This fricking app doesn&#8217;t do what I want. I want it to work like this.&#8221; And that I think, is what&#8217;s kind of kept us all connected with GitHub through the years and some and during the easiest of times or in the hard times because of that opportunity of, we&#8217;re the home for all developers, and we want everyone to be able to have that feeling that we&#8217;ve had of, had an idea, I created it and holy shit here it is.</p><p><strong>Swyx [00:47:37]:</strong> Here it is. 
All right, I&#8217;m going to try to do more spicy questions.</p><h2>GitHub&#8217;s Hardest Scaling Moment: Growth, Agents, and Uptime</h2><p><strong>Kyle [00:47:42]:</strong> Great.</p><p><strong>Swyx [00:47:42]:</strong> Is it an easy time now or a hard time?</p><p><strong>Kyle [00:47:45]:</strong> Oh at GitHub? It&#8217;s a hard time. Like, it&#8217;s a hard time and also, I was just with my team and I said, &#8220;This is also, the best and most exciting time that I think I can remember at GitHub.&#8221; Because</p><p><strong>Swyx [00:47:57]:</strong> Best of times, worst of times. It&#8217;s never one</p><p><strong>Kyle [00:47:59]:</strong> &#8216;cause we&#8217;ve we were talking about Octoverse reports and, usually we do an Octoverse report once a year, and we look at the numbers, and we say, &#8220;Oh my goodness.&#8221; I was at Universe in October saying, &#8220;This was the fastest year of growth that we&#8217;ve ever had,&#8221; right? And now we&#8217;re doing more in a month than we did in a year last year.</p><p><strong>Swyx [00:48:20]:</strong> You&#8217;re talking about PRs.</p><p><strong>Kyle [00:48:21]:</strong> Commits.</p><p><strong>Swyx [00:48:21]:</strong> Commits, yeah.</p><p><strong>Kyle [00:48:22]:</strong> PRs. Kind of like you name it by roughly every measure that we&#8217;re looking at, there&#8217;s some amount of sort of growth that is much bigger, and that is breaking our system in new ways, not old ways. Like webhooks were always notoriously, unreliable over the years?</p><p><strong>Swyx [00:48:38]:</strong> Whose fault is that?</p><p><strong>Kyle [00:48:39]:</strong> not anymore mine, but for a period of time, I&#8217;m sure you could pull up a tweet that was &#8220;It was me. I&#8217;m sorry.&#8221; but, now, that got rewritten at a scale level that is still working and is not having problems today. 
Now what we&#8217;re finding isn&#8217;t just the isn&#8217;t the-The simple stuff that folks are on the sometimes on Twitter or on the internet are &#8220;Hey, why is this like this?&#8221; Sure. There&#8217;s absolutely silly problems that we shouldn&#8217;t exist. But now we&#8217;re talking about, unique, novel permission problems that happen only at a scale across all different objects or whatever, that now we have to go rewrite this underlying system. And so it&#8217;s, there are problems that yeah, caught us off guard, which I think I said. Like the growth is astronomical, but also we&#8217;re making such material progress in that I&#8217;m excited once we&#8217;re once we&#8217;ve kind of like reimagined the underlying foundation layer, or pieces of it at least, what&#8217;s going to be possible when it&#8217;s not just all of us and all the new people that are being developers and all of their agents and all the tools like working together. Because that&#8217;ll still happen in that in that GitHub tool, that GitHub community. But it&#8217;s a it&#8217;s a hard day anytime we can&#8217;t give you what you&#8217;re looking for. We have the same problem internally. We operate through github. Com. Of course, we have backups when things go down and whatnot for our own operations but we feel it too. If it&#8217;s not working it&#8217;s not working for us, and that&#8217;s kind of like the promise of dogfooding for GitHub. It&#8217;s always been true. We&#8217;re using the same tool you&#8217;re using. We&#8217;re not using a super secret version. We and so we also need it to be great for us for our customers of course for open source. And now an exponential growth of agents, Doing it too.</p><p><strong>Swyx [00:50:32]:</strong> I wanted to load for audio listeners who maybe haven&#8217;t seen your tweets, whatever. So one billion commits in twenty-five. 
Now it&#8217;s two hundred and seventy-five million per week, on pace for fourteen billion this year if growth remains linear. Is that still the pace? I don&#8217;t know. It&#8217;s been a</p><p><strong>Kyle [00:50:48]:</strong> It&#8217;s speeding</p><p><strong>Swyx [00:50:50]:</strong> Roughly.</p><p><strong>Kyle [00:50:50]:</strong> It&#8217;s still speeding up.</p><p><strong>Swyx [00:50:51]:</strong> It&#8217;s April, so yeah.</p><p><strong>Kyle [00:50:51]:</strong> Exactly. This was in April.</p><p><strong>Swyx [00:50:53]:</strong> All right. So basically you have fourteen x growth, right? Year on year. And I think that&#8217;s a scaling issue. I&#8217;m going to try to really steel man this thing. People have experienced fourteen x growth. They haven&#8217;t had your downtime. Can we go dig into that? Why? What broke? What are we doing to fix it? Just anything to reassure the community.</p><h2>Why GitHub Reliability Is Breaking in New Ways</h2><p><strong>Kyle [00:51:18]:</strong> So, like I was saying, there&#8217;s a couple of different places that we&#8217;ve seen the growth issues. Some of the growth issues, which is why I was talking about pushing hard on more CPUs, are in Actions in particular. More tools, more agents, more PRs mean more builds, and more builds mean more CPUs. And so we are expanding through not just our data center, but obviously we were talking about moving to Azure and adding additional cloud compute, because we simply need more CPUs. Not as much GPUs. 
We definitely need GPUs too, but now CPUs are becoming a factor.</p><p><strong>Swyx [00:51:53]:</strong> It&#8217;s very CPU heavy.</p><p><strong>Kyle [00:51:54]:</strong> Underneath the hood, when it comes to some of the underlying services, we&#8217;ve been breaking up our database infrastructure over the years, so that way we have more cognitive separation between the various services. The place that we continue to have pain is in permissioning. Right now many of our permissioning layers sit in a database that we internally call MySQL One, and old Hubbers will know what I&#8217;m talking about. And so we&#8217;ve been pulling things out of MySQL One for many years, and we use Vitess and we use other technologies to shard, and we do it as one big</p><p><strong>Swyx [00:52:31]:</strong> Famous thing, PlanetScale was born from this and</p><p><strong>Kyle [00:52:32]:</strong> A hundred percent. Sam, old Hubber and friend. And so it&#8217;s finding these opportunities to break this out and then do that globally. The other thing that I think is interesting, and both a unique opportunity and tricky, is that we also run everything I just talked about in a black box container with GitHub Enterprise Server for people that work on-prem. So we take everything I just said, and we also do it on-prem, and we also do all of that in a data residency setup for customers that need to have their data in a single location. Each of these has its own unique characteristics around how we&#8217;re storing that data in MySQL or in a permissioning setup. That&#8217;s where some of these outages have occurred, where you&#8217;re seeing it more across the board rather than just the one piece</p><p><strong>Swyx [00:53:17]:</strong> Filling the database</p><p><strong>Kyle [00:53:17]:</strong> Isn&#8217;t quite working. Exactly. And so part of it is that. 
I think there have been some other places too, where more projects appear to be moving towards monorepos, whereas the industry was going the other direction for many years. Repos were smaller, but there were more of them, and now we&#8217;re seeing the opposite. Repos are bigger, and there aren&#8217;t fewer of them per se, &#8216;cause there&#8217;s new growth, but we&#8217;re just seeing many more big repos. Big monorepos have always had a unique performance problem, because each one is slightly different, particularly if the underlying blobs inside the repos are incredibly big. And so we&#8217;ve done a ton of work that most people probably haven&#8217;t experienced, unless you&#8217;re in this monorepo case. But that Git infrastructure layer improvement does help the overall system, because many of the improvements that make monorepos work better make all repo infrastructure work better. And so I could kind of keep going down the line. Another thing is we&#8217;re changing how we do, I&#8217;ll just say job queuing for lack of a better explanation, changing the underlying technologies there.</p><p><strong>Swyx [00:54:32]:</strong> I spent two years being a job queuing guy, so.</p><p><strong>Kyle [00:54:34]:</strong> And so it&#8217;s a little bit of piece by piece, and it&#8217;s mostly because, as it was built, we built everything in a way that assumed, I guess in some ways, that the size of the pipe of work was going to remain the same. There&#8217;s just going to be more people coming through each of those pipes. 
But instead, in places where a git push was generally a certain size, for example, that is now no longer true.</p><p><strong>Swyx [00:55:03]:</strong> Oh, yeah.</p><p><strong>Kyle [00:55:03]:</strong> Or</p><p><strong>Swyx [00:55:05]:</strong> I push a thousand</p><p><strong>Kyle [00:55:06]:</strong> On average. A hundred percent.</p><p><strong>Swyx [00:55:06]:</strong> A thousand line commits, like, daily</p><p><strong>Kyle [00:55:07]:</strong> Same thing with PRs. We&#8217;ve talked about optimizing that and making changes, and there were technology choices that did not work there. It got slow. It was not fast. It did not do what the users wanted. And so we&#8217;ve been reeling that all out and going, &#8220;Okay, that&#8217;s just not right. Let&#8217;s stop putting good money after bad and do it the right way now.&#8221; So it&#8217;s a lot of things. When I&#8217;ve experienced scale at GitHub historically, it&#8217;s almost always been one of two options that we&#8217;ve used. We go vertical scaling, particularly with databases, right? Or we go horizontal scaling: oh, we just have more people using this service, great, we&#8217;re going to add more servers, and we rack them in our data center or we use a cloud. And now we&#8217;re sort of on a diagonal, where vertical doesn&#8217;t really work anymore. Horizontal isn&#8217;t working either, because we all have some CPU or GPU constraints in the world now, and now we have to go in and crack open services that have been running for 10 or 15 years and go, &#8220;Okay, the rules of this service have legitimately changed, and now we have to rewrite them.&#8221; None of this is an excuse. We have to do the work. 
We have to make it better.</p><p><strong>Swyx [00:56:22]:</strong> actually as an infra guy, I&#8217;m &#8220;This is like one of the most fascinating scaling challenges I&#8217;ve ever seen.&#8221;</p><p><strong>Kyle [00:56:26]:</strong> That&#8217;s that&#8217;s, that&#8217;s the thing that&#8217;s the thing that it&#8217;s hard for Like when we weren&#8217;t talking about it publicly, and I was like I came out, and I was &#8220;Hey, I just want to explain what&#8217;s going on.&#8221; Part of it comes from a very old GitHub ethos, which is it&#8217;s our it&#8217;s our uptime. It&#8217;s down. W What I know you&#8217;re a developer, so you&#8217;re, you&#8217;re inclined to want to understand more what&#8217;s going on. But at the same time us going &#8220;Hey, this service didn&#8217;t, perform the way we expected, and now we have to go change it,&#8221; we weren&#8217;t We&#8217;re not trying to hide anything from you in that. It&#8217;s that well, that&#8217;s our problem because you expect us to be up, and I think that&#8217;s really baked into the core, origins of GitHub. And so now what we&#8217;re trying to do as a team is do all that work and just tell Talk about it more and just share you more technical details, write these blogs, write the posts, get the engineers who built it after they finish the work, just tell you &#8220;Okay, this is what we did.&#8221; I think that&#8217;s the contract that we want to bring back to the community and say, &#8220;Hey, we&#8217;re still very serious about what we&#8217;re doing. We haven&#8217;t been telling you about each piece. So let&#8217;s do that and we&#8217;re going to keep building this and scaling it in a way to support the If it&#8217;s not 14, then it&#8217;s 30 or it&#8217;s 50 or whatever the next exponential growth is going to be.&#8221;</p><p><strong>Swyx [00:57:40]:</strong> First of all, fantastic answer. 
I think</p><p><strong>Kyle [00:57:44]:</strong> And I apologize in advance if any of that</p><p><strong>Swyx [00:57:47]:</strong> I think it&#8217;s all nice</p><p><strong>Kyle [00:57:47]:</strong> Is slightly incorrect, just simply because</p><p><strong>Swyx [00:57:49]:</strong> No</p><p><strong>Kyle [00:57:49]:</strong> I&#8217;m still in the weeds with this, but it&#8217;s not my day-to-day. But that&#8217;s the thing: we&#8217;re all looking at it to that level.</p><p><strong>Swyx [00:57:58]:</strong> And obviously, if people want to help, they can join.</p><p><strong>Kyle [00:58:00]:</strong> Absolutely</p><p><strong>Swyx [00:58:01]:</strong> So I think that is good. I think people also would just want to know when you are through the thick of it, right? Have we identified all the issues? Is this just never-ending? Is Git broken? Do we have to change the Git protocol? How much is breaking, right? It&#8217;s been a while. And so I think people do want to know what&#8217;s the path back to the reliability that everyone expects out of GitHub.</p><h2>The Reliability Roadmap: Databases, Compute, and Load Testing</h2><p><strong>Kyle [00:58:30]:</strong> So our availability in the recent few weeks has been much better than the three weeks before that, or the three weeks before that, and so forth. And so a lot of these improvements are still very much paying off for us. I think that we&#8217;re still working on that database piece that I mentioned, and that just is a little bit of physics and a little bit of time to get it fixed up. Because we have to</p><p><strong>Swyx [00:58:59]:</strong> The answer I had in my head was call YouTube.</p><p><strong>Kyle [00:59:03]:</strong> So YouTube ultimately is</p><p><strong>Swyx [00:59:04]:</strong> &#8216;Cause they also use Vitess.</p><p><strong>Kyle [00:59:05]:</strong> They also use Vitess. 
But the</p><p><strong>Swyx [00:59:09]:</strong> Like whoever was the guy, the scaling guy at YouTube?</p><p><strong>Kyle [00:59:11]:</strong> That, I believe, went to PlanetScale, and was a part of PlanetScale too. But</p><p><strong>Swyx [00:59:16]:</strong> Oh, you mean Sugu?</p><p><strong>Kyle [00:59:17]:</strong> I think so. Yeah. And so</p><p><strong>Swyx [00:59:19]:</strong> He&#8217;s at Supabase now.</p><p><strong>Kyle [00:59:20]:</strong> Ah.</p><p><strong>Swyx [00:59:21]:</strong> There&#8217;s a whole Postgres drama thing there, right?</p><p><strong>Kyle [00:59:25]:</strong> So some of it&#8217;s that. I think the other piece of it is that our move to get additional compute will alleviate a fair amount of this, particularly on the Actions side, &#8216;cause a lot of the underlying outages are actually related to</p><p><strong>Swyx [00:59:39]:</strong> I&#8217;ll tell you, Actions is the root of all evil.</p><p><strong>Kyle [00:59:42]:</strong> It has its pros</p><p><strong>Swyx [00:59:47]:</strong> Some extent</p><p><strong>Kyle [00:59:47]:</strong> In that it&#8217;s the core compute layer for either CI, side projects, et cetera.</p><p><strong>Swyx [00:59:52]:</strong> Is it the main money maker? Like is</p><p><strong>Kyle [00:59:54]:</strong> Actions?</p><p><strong>Swyx [00:59:55]:</strong> No? I don&#8217;t know.</p><p><strong>Kyle [00:59:56]:</strong> Actions</p><p><strong>Swyx [00:59:57]:</strong> I pay a lot for compute, right?</p><p><strong>Kyle [00:59:58]:</strong> Actions is definitely a piece of the overall business, but I would say that we ultimately also</p><p><strong>Swyx [01:00:06]:</strong> Storage</p><p><strong>Kyle [01:00:07]:</strong> Give away so many minutes as part of our entitlements. But that&#8217;s what I was saying. Everyone&#8217;s using it. 
We talk about it as CI/CD, but the reality is people use it for CI/CD and</p><p><strong>Swyx [01:00:17]:</strong> Automation</p><p><strong>Kyle [01:00:17]:</strong> Various processing and automation, exactly. And so part of it is also that compute piece, which is also alleviating some of our availability problems.</p><p><strong>Swyx [01:00:26]:</strong> This is my abuse of Actions. I have been</p><p><strong>Kyle [01:00:29]:</strong> Oh, yeah</p><p><strong>Swyx [01:00:29]:</strong> I have been scraping every day, and I just tell people to</p><p><strong>Kyle [01:00:34]:</strong> Thank you for your service</p><p><strong>Swyx [01:00:35]:</strong> Go dog because I But this is also how I track, actions all time. So anyway,</p><p><strong>Kyle [01:00:41]:</strong> So some of it&#8217;s going to be that. I would say that each month, in the next three months, I expect you&#8217;re going to see fewer and fewer moments where we have an availability problem, where things are going to go down, and that&#8217;s not because the growth has stopped. We&#8217;re still experiencing faster growth than ever before. It&#8217;s just that those underlying improvements that we&#8217;ve been hard at work on are finally paying off. It&#8217;s less about these incremental improvements where you make a small change and you get this big output. It&#8217;s now material change that takes a bit of time, and then you see a step change in our availability.</p><p><strong>Swyx [01:01:14]:</strong> There&#8217;s a thing we used to do at Amazon, I don&#8217;t know if this is a thing, but automated software verification, or simulation of load testing and all that. At this point, you have a whole map of GitHub. And you can assume whatever growth rates on whatever dimensions that you care about and just run it through a system, right? 
I feel like there&#8217;s a way to, I don&#8217;t know, have a systems model of GitHub and, see what breaks. But obviously, I&#8217;m pro&#8212; I&#8217;m not that close to the problem, so.</p><p><strong>Kyle [01:01:39]:</strong> But yeah, so yes, totally. And I would say, that&#8217;s been the journey and work that&#8217;s been happening since, I would say November to now. Because October, right, was the time where we even said, &#8220;Oh, look at the growth,&#8221; and, and then you start to see the chart</p><p><strong>Swyx [01:01:53]:</strong> It doesn&#8217;t</p><p><strong>Kyle [01:01:53]:</strong> Really pick up. And it&#8217;s oh, we tested it at N amount of scale, and now it&#8217;s at, N cubed maybe like in some in some vectors. And so now we have to go and build it that way and make sure that it can handle all of that scale.</p><p><strong>Swyx [01:02:08]:</strong> Let&#8217;s talk Copilot. So how many original creators of Copilot are there?</p><h2>The State of Copilot: From Code Completion to Agents</h2><p><strong>Kyle [01:02:15]:</strong> Oh, geez.</p><p><strong>Swyx [01:02:18]:</strong> &#8216;Cause I count like twelve authenticated.</p><p><strong>Kyle [01:02:19]:</strong> We haven&#8217;t&#8212; Yeah, I forget, all joking aside, I forget the number of people that were on, the original, GitHub Copilot team. But, there was a bigger group.</p><p><strong>Swyx [01:02:30]:</strong> I heard it&#8217;s, it&#8217;s Alex. It there&#8217;s, there&#8217;s, a three people</p><p><strong>Kyle [01:02:32]:</strong> Alex worked on it. Udo worked on it. There&#8217;s a a bunch of people that were on the team.</p><p><strong>Swyx [01:02:35]:</strong> And then their entire management line. Okay. So enormously successful at its in its in its day. I think the last number, I think Mario Came to my conference, and talked about the hundred million dollar mark. I think most recently three hundred. 
I might be out of date as well there.</p><p><strong>Kyle [01:02:53]:</strong> I don&#8217;t think we shared the dollar amounts.</p><p><strong>Swyx [01:02:54]:</strong> All right, cool. Just, what&#8217;s the state of Copilot? It&#8217;s, it&#8217;s obviously as a concept brought into More of Microsoft. But just at GitHub.</p><p><strong>Kyle [01:03:03]:</strong> so I think One of, one of the challenges is, that we had with Copilot, right, is that we came out the gate with code completion, and it was super great, powerful, et cetera. And then what we initially worked on after that sort of, initial year and a half, was, going after fine-tuning because our customers, the industry on the whole was really talking about, okay, well, how do we get more more correctness or performance out of this? And so we were working on a whole bunch of efforts to do fine-tuning on, larger and larger code completions or, next edit suggestions with fine-tuning, et cetera.</p><p><strong>Swyx [01:03:43]:</strong> And let me clarify. Is this fine-tuning one model or per customer a fine-tuned model for</p><p><strong>Kyle [01:03:48]:</strong> Per cust&#8212; Well, both. But, but, fine-tuning one model for the overall, use, and then fine-tuning per customer that wants this as, a service effectively. And around that time is when the next generation of models came, and that&#8217;s around the same time that all these other AI, coding tools came to be because the models really sped up. And so everyone kind of, will ask, &#8220;Well, what happened to GitHub Copilot?&#8221; there&#8217;s all this time, and I would say that we were on an era of going okay, we want to improve everyone&#8217;s results, and so let&#8217;s focus in on fine-tuning because that&#8217;ll give us these better results. And then the models got better. 
And so ever since, we&#8217;ve really been on this kind of journey to go, okay, of course, we have this great code completion, and we&#8217;ve done a ton of investment in better underlying models that we have post-trained, better next edit suggestions with post-trained, language-specific models. All this stuff that kind of sits in the ether of GitHub Copilot is code completion, but we also now have a single underlying SDK and harness for our coding agent, Copilot, ultimately. The new CLI, the new desktop app, and the cloud agents all use the same SDK. And so there was this moment of really trying to figure out what our customers want, the models Sherlocking us a little bit, and then going and saying, &#8220;Okay, what does everyone ultimately need?&#8221; And what we think is that it&#8217;s not solely about the code generation. It&#8217;s really about having the ability to use these coding-agent-brained harnesses or runtimes across not just the coding experience, where I&#8217;m going to send a bunch of tasks out, or I&#8217;m going to use Fleet to break up a single task, or autopilot, similar to Goal, all this stuff. But also, how do I do that for all of my security remediation? How do I do that for every GitHub issue that comes in, just stick a coding agent on it to see if it&#8217;s possible? How do I go through my repository, look at all of my documentation, and extract out, okay, this doesn&#8217;t actually match? That amount of AI coding agent automation, I think, is a big part of what we see when we&#8217;re looking at, okay, we&#8217;re still kind of going through a similar but very different flow. It&#8217;s just all happening at the same time. There&#8217;s not really the same, I&#8217;m going to create an issue to track my idea of building this. 
You&#8217;re probably just going to go do it.</p><p><strong>Swyx [01:06:22]:</strong> Just do it.</p><p><strong>Kyle [01:06:22]:</strong> You&#8217;re going to say, &#8220;Hey, just build this,&#8221; right? And there are still tons of open issues and projects, et cetera, that are using issues, like Peter and OpenClaw, to be able to sic all of his agents on that. That kind of infrastructure layer, and a really great coding experience that allows you to handle the multiplexing aspect, is what we&#8217;ve built and are still building with GitHub Copilot. And so for folks that haven&#8217;t really used GitHub Copilot since the thing that got them excited about this, which I get, I really encourage you to look at especially the GitHub Copilot app. That&#8217;s my new daily driver. Obviously, if you prefer the CLI, there&#8217;s also the CLI, being able to use all the models, the bring your own key side of it. We&#8217;re still improving our own models and using those too. And it&#8217;s just a very different experience, but I think that broader sense of software development, and how coding agents can help throughout, not just writing the code, or even verifying it or deploying it, is where we have this unique angle. The other side is the context piece. Like</p><h2>Copilot&#8217;s Future: Context, Taste, and Personal Developer Workflows</h2><p><strong>Swyx [01:07:44]:</strong> Oh, God</p><p><strong>Kyle [01:07:44]:</strong> It&#8217;s one of those things where I think the final thing that will let me ultimately feel complete at GitHub is when we have this ability for GitHub to act like Kyle wants it to act, or Shawn, or whatever. And we all codify that in rules and in memory and everything else, but</p><p><strong>Swyx [01:08:03]:</strong> Well, that&#8217;s an open research problem, right? Like it&#8217;s</p><p><strong>Kyle [01:08:05]:</strong> A hundred percent. 
A hundred percent</p><p><strong>Swyx [01:08:07]:</strong> AGI when you get it. Yeah.</p><p><strong>Kyle [01:08:07]:</strong> A hundred percent. But, if we can even just do it where my team, Without me having to codify everything, and as our methods shift on purpose to be able to have that full experience and all the understanding of what&#8217;s happening in my dependencies or open source, that feels like a big place for us to be able to continue to provide something really unique and valuable with GitHub Copilot.</p><p><strong>Swyx [01:08:29]:</strong> Is there a form factor that we haven&#8217;t explored? I think like we did code completion Then we did kind of let&#8217;s broadly call it agentic IDE Which Cursor Famously popularized, and then now it&#8217;s, now it&#8217;s all about the sort of agent orchestration Background agent, whatever. And then there&#8217;s the security review. I feel like everyone&#8217;s like just throwing agents at everything. The entire SDLC has Just, covered with agents. Are we like at the end of history here, basically? Like is it just refinements from here on out?</p><p><strong>Kyle [01:09:04]:</strong> I think that we&#8217;re all still in such this hypermyopic era of AI Where the reality is that for various, boring security and governance reasons at least for most people&#8217;s work, why is my coding agent, even if it&#8217;s all background agents, background running not, losing all the context that&#8217;s available to it across everything that I&#8217;m doing outside of coding? I think the most interesting thing to me in AI is actual ambient AI, not insert assistant name thing or, I&#8217;ve tried just about every pin in tool and whatever, and they don&#8217;t work the way that I&#8217;m looking for them to work because they are just trying to capture, and then they are trying to codify and then recall. 
And I think the thing that I&#8217;m looking for, back to the very beginning, I&#8217;m looking to be building out the next version of webhooks or, implementing a new feature, and it for it to know every spec doc, every email, the conversations that I&#8217;ve had online, everything about how this could be implemented and be able to, use that as part of its decision-making and none of these tools are ultimately doing this. So I think that it&#8217;s as if, software development work was a single lane task, was like it only needs a developer. Once I once I write the perfect code, we&#8217;ll be done here, but that&#8217;s just never been true. It&#8217;s all the context of the other team members, what the business is doing what&#8217;s popular right now, and I think that&#8217;s this huge opportunity for us to go much broader than really excellent coding agents? And that is honestly why I think OpenClaw has been so interesting is that sure, it&#8217;s connecting to all the data, sources that Kyle the human cares about, and now my question&#8217;s &#8220;Okay, how can I take all that and use that every day as a software dev connected together, not just have a new way to kick off a coding agent?&#8221; And that&#8217;s where we&#8217;re at. We&#8217;re saying, &#8220;Okay, I&#8217;m going to go use this CLI under the hood or this SDK,&#8221; but that&#8217;s not what I&#8217;m talking about. 
I&#8217;m talking about: I&#8217;m having a conversation with you, it downloads the podcast, and it realizes, &#8220;Oh, sounds like Kyle needs this app or this thing or this.&#8221; That level of</p><p><strong>Swyx [01:11:16]:</strong> Just recommends it.</p><p><strong>Kyle [01:11:16]:</strong> That level of connectivity, I think, is where we still have a ton of ways to go in software, because then when we have that red thread we want to pull, that idea, it can use not only the perfect way to write that code, but also all of the taste and judgment calls and expertise that I&#8217;ve earned, or that we&#8217;ve earned as a group, as part of the actual implementation.</p><p><strong>Swyx [01:11:42]:</strong> The extreme of it is AI runs your life, right? And I think there&#8217;s a scary inversion of control, and I literally mean it in the way that developers mean it in terms of frameworks, like the Hollywood principle: &#8220;Don&#8217;t call me, I&#8217;ll call you.&#8221; At some point there is an inversion of control where you stop telling the AI what to do. The AI tells you what to do. And that&#8217;s a little bit scary, but also maybe better.</p><p><strong>Kyle [01:12:10]:</strong> I think Nat Friedman shared this at a Stripe event, talking about his OpenClaw: he connected OpenClaw to his cameras, and it was watching him.</p><p><strong>Swyx [01:12:20]:</strong> It redirected his Uber. And it</p><p><strong>Kyle [01:12:23]:</strong> There&#8217;s a degree of this where I actually would love OpenClaw to tell me to drink water. I don&#8217;t know that I want it to be changing where my car goes, but I do think that&#8217;s kind of what I&#8217;m talking about, which is it needs to have so much more information at its disposal for it to be helpful to me, and I still don&#8217;t think we&#8217;re anywhere near talking about AGI. 
I&#8217;m just talking about every time I have to tell you something I care about that I&#8217;ve ever kind of said or I&#8217;ve said a dozen times, it should be able to know that codify that or gain access to it. Like the dreaming ideas, are an attempt to kind of do some version of this but I think there&#8217;s a much more proactive angle that will help software devs if we can test that out a bit more.</p><h2>OpenClaw, Ambient AI, and Inverting Control</h2><p><strong>Swyx [01:13:05]:</strong> Yeah. Well, the other thing about OpenClaw that reminded me Is Microsoft has a CVP Dedicated to OpenClaw. Why?</p><p><strong>Kyle [01:13:16]:</strong> Because you don&#8217;t think they should?</p><p><strong>Swyx [01:13:17]:</strong> I don&#8217;t, I don&#8217;t know. I think CVP is a high title. What, why is this so important? Like Microsoft Doesn&#8217;t even own OpenClaw. What&#8217;s, what&#8217;s the</p><p><strong>Kyle [01:13:29]:</strong> so I&#8212; we&#8217;re talking a lot more about this at, Microsoft Build this year too. I think, the main thing is that what OpenClaw has done is it has made this connection for people to have access to the resources that you have access to and be able to do things for you in a way that previously people were trying to codify into their own agents. And so when you think about it like in the work context, wouldn&#8217;t it be great to have a Claw-like object that I could actually run on my work device that or had access to my work assets, made&#8212; worked well on Windows what that would look like. And so I think that OpenClaw has become the personification of, a valuable agent that understands me because it has access to all of my information, and it can use a computer. And so thus it can do a lot more than, just a task-oriented process or like a a chat tool, et cetera. And that&#8217;s like a bunch of the goal of Build, right? 
We&#8217;re at Build this year trying to take a very different approach of it&#8217;s unapologetically aimed at developers. We&#8217;re trying to show the bigger investment to not just say, &#8220;Hey,&#8221; like you said, &#8220;Why do you have a CVP of OpenClaw?&#8221; Well, because, one of the problems that we have, right, is that our agents, if you install them not on a Mac Mini or not on a hosted device, you install them on a personal device or a work device, we need better sandboxing at the OS level. I need to be able to use that Claw and not, get fired. And so Microsoft is &#8220;Okay, great, let&#8217;s, do that too.&#8221; And then it&#8217;s, okay, well, where should I be able to talk to this agent? Should each of us just have a Claw available to us at work? Probably. And so there you go. And continuing to contribute a ton to the open source project too. Microsoft, I think as I&#8217;ve gotten more and more, information there&#8217;s so much investment into the open source, projects themselves that for whatever reason just I think there&#8217;s like this they don&#8217;t want to come off those teams don&#8217;t want to come off as like taking any credit or getting any recognition. But so many of these core contributors or teams are full-time just pushing into open source projects. And, I think that&#8217;s, that kind of shows the difference between, well, why are we looking so hard at something like Claw? Why are we looking at sandboxing on Windows? Why are we looking at cloud versions of sandboxing? Why are we looking&#8212; Because ultimately, we need more platform components. We don&#8217;t need everyone to be building the same exact, top-line product. 
And so if we&#8217;re building for builders, that requires us to give you all these components and tell you what they are and how they work and why you should be interested, versus only delivering that single vertical over and over and over again.</p><h2>Microsoft, Windows Sandboxing, and Platform Components for Agents</h2><p><strong>Swyx [01:16:23]:</strong> I think maybe one way of framing it is that Microsoft is the original operating systems company. And here is the new operating system for AI.</p><p><strong>Kyle [01:16:35]:</strong> I think that we are also in an era where we need to help build that bridge. All joking aside, operating systems need to look different than they looked five years ago because it&#8217;s not just you using them anymore. And that&#8217;s changed the whole idea. It&#8217;s not, &#8220;Okay, my Claw is going to create a user account.&#8221; It doesn&#8217;t work like that. And so, just like all of us, we have to look much more deeply in the stack, all the way down to the silicon layer in Azure, to be like, &#8220;Okay, well, what do we need now?&#8221; &#8217;Cause the workloads are different. It&#8217;s not just, &#8220;Okay, we need more inference.&#8221; It&#8217;s, &#8220;Okay, well, what type of inference do we need? What type of compute do we need to run these agents or run these agentic flows?&#8221; It&#8217;s a really interesting multi-layer problem, versus software in the last five or six years, I would say, where we were all going to our events and kind of saying a version of the same thing. SaaS product has new SaaS thing. It&#8217;s the best SaaS thing ever.</p><p><strong>Swyx [01:17:42]:</strong> It was boring for a while.</p><p><strong>Kyle [01:17:43]:</strong> And so now it&#8217;s like, &#8220;Oh my goodness, we&#8217;re at physics.&#8221;</p><p><strong>Swyx [01:17:47]:</strong> It&#8217;s great.</p><p><strong>Kyle [01:17:48]:</strong> We&#8217;re at physics problems. 
And that&#8217;s exciting.</p><p><strong>Swyx [01:17:50]:</strong> We&#8217;re now trying to make room temperature superconductors. Still. That&#8217;s never going away. No, I think that&#8217;s a really good overview of everything. Have we left anything unsaid that you wanted to really get out there that we should cover?</p><h2>Build Announcements, Enterprise Adoption, and AI at Work</h2><p><strong>Kyle [01:18:07]:</strong> I&#8217;m really excited for folks to check out the announcements that we have at Build. You can go look at them online, take a look. I&#8217;m hoping that it&#8217;s driving a degree of curiosity and interest, because there&#8217;s such a big shift that we&#8217;re making at Microsoft for developers, where if you&#8217;re a daily driver of a Mac device or a Linux device, and you&#8217;re like, &#8220;Okay, I don&#8217;t use Windows,&#8221; there are improvements being made that I think are going to surprise folks, to just be like, &#8220;Oh, they really want to do that?&#8221; And I&#8217;m talking for developers. I&#8217;m not talking about &#8220;I play video games on the weekends on my Windows computer.&#8221; I&#8217;m talking my daily driver. All the way from that to, okay, well, what is it like to build an agent or build an app and deploy it and run it at work in particular? I think that is a big piece of it, where I talk all the time with the team about how the way I build on the weekend should be how I build at work. But if you&#8217;re working at a Fortune 100 or a Fortune 500, you&#8217;re probably not vibe coding an app and then shipping it to some service. You&#8217;ve got to go through security and compliance. How can we move just as fast at work? And that&#8217;s, I think, something that we have a bunch of different offerings for, to give you that same sort of agility and power, but in the work context. 
And then I will tell you, I&#8217;ve mentioned it a couple times, and it&#8217;s very freaking cool. If you are in the M365 land in any way, check out WorkIQ, check out FoundryIQ. These little context engines, oversimplifying it, are wild good. And we&#8217;ve given them to our developers at GitHub, we&#8217;ve given them to employees at GitHub, and we&#8217;ve used these tools to be able to just ask questions around everything that you have in your work context. And with FoundryIQ, you can do the same exact thing across all your existing stores. Not move to new tools, just connect them in. It&#8217;s surprisingly powerful, and your boss is still not going to get fired, and IT is not going to turn it off because it&#8217;s leaking all this private information. That is the trick that I think is sometimes getting lost when we&#8217;re talking about all these great new platforms. &#8217;Cause I can use them, and I&#8217;m like, &#8220;Oh, this is super powerful. Oh, and I can&#8217;t use it.&#8221; And it&#8217;s not because I&#8217;m at work at GitHub. It&#8217;s be-</p><p><strong>Swyx [01:20:34]:</strong> &#8217;Cause I&#8217;m not allowed, yeah.</p><p><strong>Kyle [01:20:35]:</strong> It&#8217;s &#8217;cause I&#8217;m not allowed, because they can&#8217;t do all the things that large, complicated companies need. And so, whether it be, like I said, just the kind of interesting daily driver curiosity all the way through to, &#8220;Oh, my gosh, I can go use this at work tomorrow potentially,&#8221; and have that context layer, have that intelligence, it&#8217;s a huge shift. And so check it out. I&#8217;m not shy on social. I&#8217;d love to hear feedback: what&#8217;s working, what&#8217;s not. 
But hopefully we surprise folks a little bit.</p><p><strong>Swyx [01:21:07]:</strong> So first of all, I think that&#8217;s a great pitch. What I&#8217;m hearing, actually, is that you should put the WorkIQ people next to the Copilot people. &#8217;Cause the exact context problem that you named, they solve enough for you to do your job, which is nuts.</p><p><strong>Kyle [01:21:23]:</strong> So that&#8217;s literally what has been happening the last several months.</p><p><strong>Swyx [01:21:29]:</strong> I already forecast you were going there.</p><p><strong>Kyle [01:21:30]:</strong> Totally, you&#8217;re totally right. The code and the code asset problem is a little bit unique. But otherwise</p><p><strong>Swyx [01:21:36]:</strong> That&#8217;s it</p><p><strong>Kyle [01:21:37]:</strong> we&#8217;re all working</p><p><strong>Swyx [01:21:37]:</strong> It&#8217;s context</p><p><strong>Kyle [01:21:37]:</strong> with each other now. It&#8217;s all just context, exactly.</p><p><strong>Swyx [01:21:40]:</strong> Amazing. Great. I&#8217;m going to be there. I&#8217;m going to be doing</p><p><strong>Kyle [01:21:43]:</strong> Great</p><p><strong>Swyx [01:21:43]:</strong> a couple sessions there. I&#8217;m going to be interviewing Satya.</p><p><strong>Kyle [01:21:46]:</strong> I know.</p><h2>WorkIQ, Copilot Context, and What to Ask Satya</h2><p><strong>Swyx [01:21:47]:</strong> When I first started the pod, though, I had Jeff Dean on. It&#8217;s like a hall of fame of people I want to meet someday. Satya&#8217;s on there. So, what should I ask Satya?</p><p><strong>Kyle [01:21:57]:</strong> I think that the best question to ask is what he thinks is true two or three years from now. It seems like such a throwaway question. 
But ultimately, it&#8217;s about the way that he is looking at this AI problem, inference problem, token problem, and how we&#8217;re actually going to be working. I think you can see some of the recent shifts that have been happening inside of Microsoft to kind of drive us to a place where it&#8217;s not four, five, six, seven, eight different things. It&#8217;s not a lack of context everywhere. But why is this sort of approach going to pay off in two years? Because that, I think</p><p><strong>Swyx [01:22:41]:</strong> Wow, that&#8217;s a bold- Okay. I&#8217;ll ask it. I&#8217;ll say I was prompted by you, but</p><p><strong>Kyle [01:22:45]:</strong> Absolutely</p><p><strong>Swyx [01:22:45]:</strong> it&#8217;s a bold question because I think there are a lot of doubts, to be honest, externally. And so, yes, a straight answer from him on that, I think, would reassure a lot of people and, honestly, give me a lot of food for writing. So, thank you so much for spending your time. Thank you for doing what you do. I think as a CEO, you don&#8217;t need to be the external face. But because you are authoritative, &#8217;cause you have so much background with GitHub, and it&#8217;s so authentic, we on the outside feel it. So thank you for that.</p><p><strong>Kyle [01:23:16]:</strong> Of course. Appreciate it. 
Thank you so much, Sean.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] NVIDIA Cosmos 3, Nemotron 3 Ultra, and RTX Spark]]></title><description><![CDATA[Jensen scores a huge win.]]></description><link>https://www.latent.space/p/ainews-nvidia-cosmos-3-nemotron-3</link><guid isPermaLink="false">https://www.latent.space/p/ainews-nvidia-cosmos-3-nemotron-3</guid><pubDate>Tue, 02 Jun 2026 03:28:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5bzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.latent.space/p/video-agents">Today&#8217;s podcast guest</a> was the lead on NVIDIA Cosmos over a year ago, discussing training videogen and world models. Fittingly, Cosmos 3 launched  today, unifying language, image, video, audio and action in a <a href="https://x.com/victormustar/status/2061354267546427595?s=20">Mixture-of-Transformers architecture </a>that pairs an autoregressive reasoner with a diffusion generator in:</p><ul><li><p><strong>base Nano</strong> (16B: 8B reasoner tower + 8B generator tower) </p></li><li><p><strong>Super</strong> (64B: 32B reasoner tower + 32B generator tower) models, and</p></li><li><p>Super finetunes for <strong>Text2Image</strong> and <strong>Image2Video</strong>, which are now the <a href="https://x.com/ArtificialAnlys/status/2061494719998546206?s=20">new SOTA open weights imagegen and videogen models</a>, just <a href="https://x.com/victormustar/status/2061354267546427595?s=20">below Nano Banana 2</a></p></li></ul><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/liu_mingyu/status/2061525730996240738&quot;,&quot;full_text&quot;:&quot;Introducing NVIDIA Cosmos 3\n\nWe released NVIDIA Cosmos 3 last night.\n\nAnd today, seeing it take the top spots across 8+ open model leaderboards feels 
surreal. We spent months working towards this moment.\n\nHere&#8217;s the breakdown:\n\nThe Leaderboard Wins\n\nWorld Reasoning\n&#127942; #1 open &quot;,&quot;username&quot;:&quot;liu_mingyu&quot;,&quot;name&quot;:&quot;Ming-Yu Liu&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2002841783735042048/07JFOmTh_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-01T19:10:10.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJwB89OasAArOcE.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/qyBs3D0FKk&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:10,&quot;retweet_count&quot;:39,&quot;like_count&quot;:225,&quot;impression_count&quot;:15581,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p></p><p>At Computex in Taiwan, Jensen also brought the heat with <a href="https://x.com/NVIDIAAI/status/2061495149872771568/photo/1">Nemotron 3 Ultra</a>, their 550B-A55B, remarkably efficient/<a href="https://x.com/ArtificialAnlys/status/2061304911565144230?s=20">fast</a> open weights LLM that is the new US SoTA:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5bzA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5bzA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!5bzA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5bzA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!5bzA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5bzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!5bzA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!5bzA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 848w, https://substackcdn.com/image/fetch/$s_!5bzA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!5bzA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6685277-4569-4135-92cb-e7a645246125_4096x2732.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>Finally, the RTX Spark, a 1-petaflop personal-computer superchip, was previewed with <a href="https://x.com/satyanadella/status/2061315017589600699">Microsoft</a>, <a href="https://x.com/openclaw/status/2061331260279054801?s=20">OpenClaw</a>, and <a href="https://x.com/NousResearch/status/2061323987804713083?s=20">Hermes Agent</a> as launch partners (good analysis <a href="https://x.com/PatrickMoorhead/status/2061452151944274167">here</a>).</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/NVIDIARTXSpark/status/2061509361470497138?s=20&quot;,&quot;full_text&quot;:&quot;RTX Spark, early preview &#128064;\n\nPersonal AI agents. Faster creator workflows. RTX ON gaming. NVIDIA&#8217;s Jacob Freeman walks through how one Superchip brings it all together in a new class of slim laptops. &#128071; &quot;,&quot;username&quot;:&quot;NVIDIARTXSpark&quot;,&quot;name&quot;:&quot;NVIDIA RTX Spark&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2061303426479431680/BDJQPK6Q_normal.jpg&quot;,&quot;date&quot;:&quot;2026-06-01T18:05:07.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/o8whpfecfc6pmdxkxd86&quot;,&quot;link_url&quot;:&quot;https://t.co/g2JWVJ6DC5&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:42,&quot;retweet_count&quot;:178,&quot;like_count&quot;:1663,&quot;impression_count&quot;:92979,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2061509086651232257/vid/avc1/1280x720/ykMBrd9Obo07UeyD.mp4?tag=14&quot;,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p></p><blockquote><p>AI News for 5/30/2026-6/1/2026. 
We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>NVIDIA&#8217;s Cosmos 3, Nemotron 3 Ultra, and the Push for Open Physical AI</strong></p><ul><li><p><strong>NVIDIA&#8217;s open-source week</strong>: NVIDIA dominated the open-model conversation with <strong>Cosmos 3</strong>, an open family of <strong>omnimodal world models for physical AI</strong>, plus the announcement of <strong>Nemotron 3 Ultra</strong>, a <strong>550B</strong> open-weight model that several posters called the strongest U.S. open model so far. 
Cosmos 3 was framed as a full-stack release&#8212;<strong>weights, code, datasets, and fine-tuning recipes</strong>&#8212;with NVIDIA also launching the <strong>Cosmos Coalition</strong> alongside partners including <strong>Runway</strong> to build an open ecosystem for world models <a href="https://x.com/NVIDIAAI/status/2061498958283968735">@NVIDIAAI ecosystem context</a>, <a href="https://x.com/runwayml/status/2061315089869721682">@runwayml coalition announcement</a>, <a href="https://x.com/kimmonismus/status/2061432501223162241">@kimmonismus Cosmos thread</a>, <a href="https://x.com/ClementDelangue/status/2061487081315094906">@ClementDelangue on NVIDIA&#8217;s HF footprint</a>.</p></li><li><p><strong>Why Cosmos 3 mattered technically</strong>: Beyond robotics rhetoric, the more concrete details were that Cosmos 3 unifies <strong>language, image, video, audio, and action</strong> in a single <strong>Mixture-of-Transformers</strong> design pairing an <strong>autoregressive reasoner</strong> with a <strong>diffusion generator</strong>. <a href="https://x.com/ArtificialAnlys/status/2061494719998546206">Artificial Analysis</a> said Cosmos 3 reached <strong>#1 among open-weight models</strong> on both their <strong>Text-to-Image</strong> and <strong>Image-to-Video</strong> leaderboards, noting the generator uses <strong>structured JSON prompts</strong> and can be driven either by an external prompt-upsampling harness or its own reasoner branch. Separately, NVIDIA&#8217;s hardware + software push extended to adoption of the <strong>OpenMDW</strong> framework and partner ecosystem integrations on platforms like fal <a href="https://x.com/ArtificialAnlys/status/2061494719998546206">@ArtificialAnlys</a>, <a href="https://x.com/fal/status/2061604121786876307">@fal</a>.</p></li><li><p><strong>Nemotron 3 Ultra reception</strong>: Community reaction to <strong>Nemotron 3 Ultra</strong> was unusually strong for a fresh open release. 
Posters highlighted both capability and serving characteristics, including claims that it is already topping some open evals and may be serving at <strong>300+ tok/s</strong> in some setups&#8212;far faster than large DeepSeek/Kimi-class models <a href="https://x.com/scaling01/status/2061379856433107135">@scaling01</a>, <a href="https://x.com/ctnzr/status/2061483152741175757">@ctnzr</a>, <a href="https://x.com/caspar_br/status/2061505720907182280">@caspar_br</a>. There was also some technical discussion that Nemotron appears <strong>less sparse</strong> than peers like Kimi K2 / DeepSeek V4&#8212;roughly <strong>~10% active</strong> vs <strong>~3%</strong>&#8212;which could affect both economics and behavior <a href="https://x.com/eliebakouch/status/2061607195268038777">@eliebakouch</a>.</p></li></ul><p><strong>MiniMax M3, Qwen3.7-Plus, and JetBrains Mellum2 Expand the Open Agent Model Field</strong></p><ul><li><p><strong>MiniMax M3&#8217;s launch was the day&#8217;s biggest model release</strong>: M3 was presented as an open-weight multimodal agent/coding model with <strong>1M context</strong>, <strong>native multimodality</strong>, and competitive agent benchmarks. The headline figures repeated across launch partners were <strong>59.0% SWE-Bench Pro</strong>, <strong>66.0% Terminal Bench 2.1</strong>, and <strong>74.2% MCP Atlas</strong> <a href="https://x.com/MiniMax_AI/status/2061425142795034794">@MiniMax_AI</a>, <a href="https://x.com/PBDTokenRouter/status/2061463048485838935">@PBDTokenRouter</a>, <a href="https://x.com/kimmonismus/status/2061473350766170420">@kimmonismus</a>. 
Multiple infra vendors shipped day-0 support&#8212;<strong>Novita</strong>, <strong>Vercel AI Gateway</strong>, <strong>Cloudflare AI Gateway</strong>, <strong>OpenClaude</strong>, <strong>Flowith</strong>, and others&#8212;suggesting unusually fast ecosystem adoption <a href="https://x.com/MiniMax_AI/status/2061398427121201648">@MiniMax_AI on Novita</a>, <a href="https://x.com/rauchg/status/2061593874498531707">@rauchg</a>, <a href="https://x.com/gitlawb/status/2061581678871806083">@gitlawb</a>.</p></li><li><p><strong>Benchmarks vs practical experience were mixed</strong>: M3 earned praise for frontend generation, visual/game tasks, and price-performance, with side-by-side demos showing strong one-shot UI/game outputs and notable benchmark placement for Next.js agent evals <a href="https://x.com/notjazii/status/2061407087293313210">@notjazii</a>, <a href="https://x.com/lostinlatencyX/status/2061409696649548165">@lostinlatencyX</a>, <a href="https://x.com/rauchg/status/2061593874498531707">@rauchg</a>. But several evaluators also reported <strong>high token consumption</strong>, <strong>verbose self-check loops</strong>, and occasional <strong>requirement drift</strong> on long tasks, making M3 look more like a &#8220;quality first, efficiency later&#8221; model <a href="https://x.com/ZhihuFrontier/status/2061493401019957337">@ZhihuFrontier review</a>, <a href="https://x.com/teortaxesTex/status/2061432151183171702">@teortaxesTex skepticism</a>.</p></li><li><p><strong>Qwen3.7-Plus</strong>: Alibaba launched <strong>Qwen3.7-Plus</strong> as a <strong>multimodal interactive hybrid agent</strong> that unifies <strong>GUI and CLI operation</strong>, visual reasoning, coding, and search-augmented QA. 
It is <strong>API-available</strong> via Alibaba Cloud Model Studio and was quickly added to tools like <strong>Cline</strong> <a href="https://x.com/Alibaba_Qwen/status/2061506641120641494">@Alibaba_Qwen launch</a>, <a href="https://x.com/cline/status/2061580233778790439">@cline</a>. The launch reinforces the trend that open-ish Asian labs are no longer releasing &#8220;just chat models,&#8221; but full <strong>agent-capable multimodal systems</strong>.</p></li><li><p><strong>JetBrains Mellum2</strong>: JetBrains released <strong>Mellum2</strong>, a <strong>12B MoE</strong> model with <strong>2.5B active parameters</strong>, trained on roughly <strong>11T tokens</strong> and post-trained with <strong>RLVR</strong>, shipping <strong>base / SFT / RL checkpoints</strong> and a technical report <a href="https://x.com/nv_pavlichenko/status/2061438808290172935">@nv_pavlichenko</a>, <a href="https://x.com/jetbrains/status/2061444430884675791">@jetbrains</a>. The intended niche is especially interesting: <strong>ultra-low-latency inference</strong> for <strong>routing, RAG, sub-agents, and IDE use</strong>, and it landed in <strong>vLLM</strong> immediately <a href="https://x.com/vllm_project/status/2061621691995005301#m">@vllm_project</a>. This looks like a serious &#8220;small fast open model for developer workflows&#8221; play rather than a benchmark-chasing frontier release.</p></li></ul><p><strong>Agents, Sandboxes, Memory, and Search Are Becoming the Real Product Surface</strong></p><ul><li><p><strong>The stack is shifting from model calls to agent runtimes</strong>: Several launches converged on the idea that the main engineering leverage is now in the <strong>harness</strong> rather than the model. 
<strong>Perplexity&#8217;s &#8220;Search as Code&#8221;</strong> is the clearest example: instead of iterative search tool calls, the model writes <strong>Python</strong> against a search SDK, enabling custom ranking pipelines, map-reduce over indexes, batching, aggregation, and lower token overhead. Perplexity reports a jump on its internal <strong>WANDR</strong> benchmark from <strong>0.152</strong> to <strong>0.386</strong> with this architecture <a href="https://x.com/perplexity_ai/status/2061506359326384319">@perplexity_ai</a>, <a href="https://x.com/AravSrinivas/status/2061575845056278971">@AravSrinivas</a>.</p></li><li><p><strong>Managed agents + sandboxes are becoming standard</strong>: Google detailed <strong>Managed Agents in the Gemini API</strong>, where a single API call can spin up an agent that reasons, writes/runs code, manages files, and operates inside a hosted <strong>Linux sandbox</strong> <a href="https://x.com/_philschmid/status/2061457703210197273">@_philschmid</a>, <a href="https://x.com/GoogleAIStudio/status/2061452967530701090">@GoogleAIStudio</a>. LangChain pushed similar ideas around <strong>Deep Agents</strong>, <strong>Context Hub</strong>, and <strong>LangSmith Sandboxes/Engine</strong>, emphasizing persistent context, agent lifecycle tooling, and automated failure triage <a href="https://x.com/LangChain/status/2061432934993674267">@LangChain</a>, <a href="https://x.com/hwchase17/status/2061496556608504043">@hwchase17</a>.</p></li><li><p><strong>Memory remains a missing primitive</strong>: One recurring complaint was that enormous context windows still don&#8217;t solve <strong>cross-session memory</strong>. A thread on <strong>HydraDB</strong> argued that &#8220;RAG + manual context injection&#8221; has been misnamed as memory, while actual persistent session knowledge remains underserved <a href="https://x.com/kimmonismus/status/2061454202883432501">@kimmonismus</a>. 
Related research threads pointed to reusable context management policies like <strong>AdaCoM</strong>, which trains a separate LLM via RL to prune/preserve context for frozen agents <a href="https://x.com/dair_ai/status/2061455253325971789">@dair_ai</a>.</p></li><li><p><strong>Security remains the gating issue for enterprise agents</strong>: There was a notable warning from Microsoft Security Intelligence about a major <strong>npm supply chain compromise</strong> affecting <strong>90+ redhat-cloud-services packages</strong>, including a self-propagating worm stealing npm/GitHub/AWS/SSH credentials <a href="https://x.com/MsftSecIntel/status/2061485730958848188">@MsftSecIntel</a>. At the same time, enterprise agent vendors highlighted <strong>sandboxing</strong>, <strong>runtime isolation</strong>, and <strong>security stack integration</strong> as prerequisites for deployment, including discussion of <strong>NVIDIA OpenShell</strong> and LangChain&#8217;s sandbox keynote <a href="https://x.com/shannholmberg/status/2061368566256189656">@shannholmberg</a>, <a href="https://x.com/LangChain/status/2061448130806116827">@LangChain</a>.</p></li></ul><p><strong>Codex, Claude Code, and the Competitive Coding-Agent Race</strong></p><ul><li><p><strong>OpenAI extended Codex into more places</strong>: OpenAI announced that <strong>frontier models and Codex are now generally available on AWS / Amazon Bedrock</strong>, aimed squarely at enterprises that want OpenAI capabilities inside existing AWS security/compliance workflows <a href="https://x.com/OpenAI/status/2061564502160892138">@OpenAI</a>, <a href="https://x.com/OpenAIDevs/status/2061564710173224985">@OpenAIDevs</a>. 
OpenAI also shipped a <strong>Codex Python SDK</strong> supporting threads, turns, streaming, resume, images, and sandbox control <a href="https://x.com/reach_vb/status/2061569472792572163">@reach_vb</a>, plus support for Bedrock-backed Codex workflows <a href="https://x.com/reach_vb/status/2061572961451094191">@reach_vb on Bedrock config</a>.</p></li><li><p><strong>Claude Code had a real ops incident</strong>: Anthropic reset <strong>5-hour and weekly rate limits</strong> for Pro and Max users after fixing a bug where some <strong>Opus 4.8</strong> sessions spawned too many <strong>parallel subagents/tool calls</strong>, burning usage unexpectedly <a href="https://x.com/ClaudeDevs/status/2061501787769893055">@ClaudeDevs</a>, <a href="https://x.com/ClaudeDevs/status/2061501790131265803">follow-up</a>. That&#8217;s a notable reminder that coding-agent product quality is increasingly determined by orchestration behavior, not just raw model IQ.</p></li><li><p><strong>Behavioral differences across coding models remain material</strong>: Developers highlighted large qualitative differences between GPT, Claude, and other models on benchmarks like <strong>ProgramBench</strong> and <strong>WeirdML</strong>, with Opus sometimes preferring exploration over score-maximization or showing benchmark-specific quirks <a href="https://x.com/OfirPress/status/2061458258821251081">@OfirPress</a>, <a href="https://x.com/htihle/status/2061412097720774679">@htihle</a>. 
A separate long thread argued newer <strong>Claude Opus 4.6&#8211;4.8</strong> variants can fabricate plausible but fictional concepts in non-coding domains, suggesting possible truthfulness/alignment regressions rather than ordinary hallucinations <a href="https://x.com/distributionat/status/2061362406971060244">@distributionat</a>.</p></li></ul><p><strong>Infra, Hardware, and Local AI Systems</strong></p><ul><li><p><strong>NVIDIA is coming for the PC</strong>: The most-discussed hardware launch was <strong>RTX Spark</strong>, an NVIDIA/Microsoft &#8220;personal AI computer&#8221; built around <strong>Grace + Blackwell</strong>, with up to <strong>128GB unified memory</strong> and claimed <strong>1 PFLOP FP4</strong>. The key strategic read: NVIDIA is no longer just selling accelerators, but an end-to-end local AI system that competes with <strong>Apple Silicon</strong>, x86 PCs, and Qualcomm simultaneously <a href="https://x.com/kimmonismus/status/2061484174088007739">@kimmonismus</a>, <a href="https://x.com/swyx/status/2061567877879369953">@swyx</a>.</p></li><li><p><strong>Cluster/networking updates</strong>: On the datacenter side, <strong>Lambda</strong> said it is first to adopt <strong>NVIDIA Quantum-X InfiniBand Photonics Q3450-LD</strong> switches, pushing co-packaged optics to reduce network power and failures in large AI clusters <a href="https://x.com/LambdaAPI/status/2061319330433032658">@LambdaAPI</a>. 
<strong>OpenAI</strong> also announced <strong>Stargate Michigan</strong>, a planned <strong>1GW</strong> data center using closed-loop cooling and paired with workforce/education commitments <a href="https://x.com/OpenAINewsroom/status/2061533639138316314">@OpenAINewsroom</a>.</p></li><li><p><strong>Local open-model tooling is improving fast</strong>: The <strong>MLX-VLM v0.6.0</strong> release was one of the more substantive local inference/tooling updates, adding speculative decoding, Anthropic-style and responses-style APIs, tool calls, support for many new multimodal models, and image/audio features with the explicit pitch of turning Apple devices into &#8220;real local agent machines&#8221; <a href="https://x.com/Prince_Canuma/status/2061541992790683726">@Prince_Canuma</a>. That pairs well with growing DGX Spark + <strong>vLLM</strong> experimentation for local NVFP4 MoE serving <a href="https://x.com/vllm_project/status/2061530659160838549">@vllm_project</a>.</p></li></ul><p><strong>Top Tweets (by engagement, filtered for technical relevance)</strong></p><ul><li><p><strong>Anthropic&#8217;s IPO path</strong>: Anthropic said it has <strong>confidentially submitted a draft S-1</strong> to the SEC, opening the door to an IPO pending review <a href="https://x.com/AnthropicAI/status/2061478052257841495">@AnthropicAI</a>.</p></li><li><p><strong>Claude Code usage incident</strong>: Anthropic reset user rate limits after an <strong>Opus 4.8 parallel subagent/tool-call bug</strong> caused excessive quota burn <a href="https://x.com/ClaudeDevs/status/2061501787769893055">@ClaudeDevs</a>.</p></li><li><p><strong>Qwen3.7-Plus</strong>: Alibaba launched a <strong>multimodal agent model</strong> spanning GUI/CLI operation, coding, and visual tasks <a href="https://x.com/Alibaba_Qwen/status/2061506641120641494">@Alibaba_Qwen</a>.</p></li><li><p><strong>OpenAI on Bedrock</strong>: OpenAI models and <strong>Codex</strong> are now available through <strong>Amazon 
Bedrock</strong> for enterprise workflows <a href="https://x.com/OpenAI/status/2061564502160892138">@OpenAI</a>.</p></li><li><p><strong>ARC-AGI-3 movement</strong>: <strong>Claude Opus 4.8</strong> posted a new SOTA on <strong>ARC-AGI-3</strong> at <strong>1.5%</strong>, still tiny in absolute terms but a meaningful jump on that benchmark <a href="https://x.com/arcprize/status/2061512025638121516">@arcprize</a>.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. New Frontier Model Releases and Early Tests</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1ttdiq0/minimax_m3_coding_agentic_frontier_1m_context/">MiniMax M3 - Coding &amp; Agentic Frontier, 1M Context, Multimodal</a></strong> (Activity: 1090): <strong>MiniMax M3 is announced as an </strong><em><strong>open-weight</strong></em><strong> frontier model with coding/agentic focus, native multimodality/vision, and MiniMax Sparse Attention for up to </strong><code>1M</code><strong> tokens of context with a guaranteed </strong><code>512K</code><strong> minimum (<a href="https://www.minimax.io/models/text/m3">MiniMax M3</a>). Claimed long-horizon agentic results include 12-hour ICLR paper reproduction, Hopper FP8 GEMM CUDA/Triton optimization reaching </strong><code>9.4&#215;</code><strong> speedup after </strong><code>147</code><strong> iterations, and PostTrainBench ranking third behind Opus 4.7 and GPT-5.5; access is currently via API/MiniMax Code, with HuggingFace/GitHub weights/local deployment planned.</strong> Commenters are cautiously interested in the combination of cheap/efficient vision plus long-context agentic coding, but skeptical because the announcement calls it <em>&#8220;open-weight&#8221;</em> while not yet exposing weights or even parameter count. 
One technical debate is whether the results imply a much larger-than-<code>~250B</code> model, extreme benchmark optimization, or a genuine open-weight breakthrough.</p><ul><li><p>Commenters focused on the missing release details: despite the claim of being <em>&#8220;the first open-weight model with three frontier capabilities&#8221;</em>, users could not find actual weights, parameter count, or sizing information for <strong>MiniMax M3</strong>. One commenter linked a preview image from the announcement (<a href="https://preview.redd.it/fej3vn94qk4h1.jpeg?width=3808&amp;format=pjpg&amp;auto=webp&amp;s=83ef24ab093520eb3118dd918259adff4f42a569">Reddit image</a>), but the thread still lacked confirmation of model scale or downloadable artifacts.</p></li><li><p>A technically substantive concern was that the advertised capability level implies one of three possibilities: <strong>a much larger-than-expected model</strong>, unusually strong benchmark optimization, or a major open-weights breakthrough. The speculation centered on whether MiniMax M3 is actually around <code>~250B</code> parameters or significantly larger, and whether its coding/agentic/multimodal claims will hold once weights and independent benchmarks are available.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tthkh5/nvidia_announces_nemotron_3_ultra/">NVIDIA announces Nemotron 3 Ultra</a></strong> (Activity: 621): <strong>The <a href="https://i.redd.it/f79wu6dnml4h1.jpeg">image</a> is a technical announcement slide for NVIDIA Nemotron 3 Ultra, described in comments as a MoE </strong><code>550B-A55</code><strong> model. 
The slide positions Nemotron 3 Ultra against open/open-weight competitors including GLM 5.1, Kimi K2.6, and Qwen3.5 across &#8220;Frontier Smart&#8221; benchmark categories such as agent productivity, coding, instruction following, knowledge work, and long-context capability.</strong> Commenters viewed the comparison against other open-source/open-weight models positively, while one noted an &#8220;artificial analysis score&#8221; of <code>48</code>, placing it just below frontier-tier models and around the MiniMax 2.7 range, with the expectation that it could be the strongest U.S. open-weight model.</p><ul><li><p>NVIDIA Nemotron 3 Ultra is identified as a <strong>MoE </strong><code>550B-A55</code> model, implying roughly <code>550B</code> total parameters with about <code>55B</code> active parameters per token. This architecture detail is the most concrete technical spec mentioned in the thread.</p></li><li><p>A commenter cites an <strong>Artificial Analysis score of </strong><code>48</code>, placing Nemotron 3 Ultra &#8220;one notch less than frontier&#8221; and roughly in the <strong>MiniMax 2.7</strong> range, while suggesting it may be the strongest <strong>US open-weight</strong> model by that metric.</p></li><li><p>Technical references shared include NVIDIA&#8217;s official Nemotron 3 Ultra Base usage cookbook on GitHub: <a href="https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Ultra-Base">NVIDIA-NeMo/Nemotron</a>, plus the LifeArchitect model comparison table: <a href="https://lifearchitect.ai/models-table/">lifearchitect.ai/models-table</a>. 
One commenter argues the comparison against <strong>Qwen3.5</strong> is notable because Nemotron may be NVIDIA&#8217;s best open-weight model while still trailing several non-US/open models.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tss9nq/stepfun_37_flash_is_very_good/">Stepfun 3.7 Flash is very good</a></strong> (Activity: 473): <strong>The <a href="https://i.redd.it/k37ol07vfg4h1.gif">GIF</a> is a technical visual demo, not a meme: it shows the output of Stepfun 3.7 Flash for the prompt </strong><code>create a beautiful, relaxing flight simulator in a single html page</code><strong>, rendering a low-poly 3D flight scene with HUD-style speed/altitude indicators. The OP says this was the official </strong><code>Q4_X_S</code><strong> quant and claims the model feels near GLM 5.1 in aesthetics and about </strong><code>80%</code><strong> of its 3D world understanding, while using only roughly </strong><code>25%</code><strong> of GLM 5.1&#8217;s parameters and including built-in vision.</strong> Commenters mostly reacted with comparisons and nostalgia rather than deep benchmarks: one referenced the old Excel flight simulator, while another compared interest in <strong>Qwen 3.7 Max / 27B</strong> and asked whether it beats <strong>Qwen3.6 27B</strong>.</p><ul><li><p>A commenter draws a model-comparison angle by referencing <strong>Qwen 3.7 Max</strong> and hoping for a future <strong>Qwen 3.7 27B</strong> release, while another asks whether Stepfun 3.7 Flash is better than <strong>Qwen3.6-27B</strong>. The thread includes screenshot evidence for the Qwen3.6-27B reference (<a href="https://preview.redd.it/h1jbx5tz4j4h1.png?width=1523&amp;format=png&amp;auto=webp&amp;s=c4bd572a0741fcffc65f2b75153efbb603ede82b">image</a>), but no quantitative benchmark scores or reproducible eval details are provided.</p></li></ul></li></ul><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-nvidia-cosmos-3-nemotron-3">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Why Video Agent models are next — Ethan He, xAI Grok Imagine]]></title><description><![CDATA[Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and why Grok Imagine is so underrated. For the first time, we do a deep dive with the guy who led it!]]></description><link>https://www.latent.space/p/video-agents</link><guid isPermaLink="false">https://www.latent.space/p/video-agents</guid><pubDate>Mon, 01 Jun 2026 15:41:48 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200078058/28f9c640c1fe13dbb645d043b323afb6.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>We&#8217;re announcing <a href="https://ai.engineer/wf">AIEWF</a> speakers this week! Take the <a href="https://notion.qualtrics.com/jfe/form/SV_bP07tSVMXH7ePCS">AI Engineering Survey</a>!</em></p><div><hr></div><p>Today&#8217;s guest Ethan first joined us for the LS Paper Club as the lead on <a href="https://www.youtube.com/watch?v=og59L4JECz4&amp;pp=ygUWbGF0ZW50c3BhY2V0diBldGhhbiBoZQ%3D%3D">NVIDIA Cosmos World Model</a>, but then joined xAI and built Grok Imagine in 3 months:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/EthanHe_42/status/2016749123198673099&quot;,&quot;full_text&quot;:&quot;Thrilled to share our new Grok Imagine release &#128640; It is the highest quality, fastest, and most cost-effective video generation model yet. Comes with 720P, video editing and better audio! We listened closely to your feedback and moved fast.\nJust six months ago, we had almost&quot;,&quot;username&quot;:&quot;EthanHe_42&quot;,&quot;name&quot;:&quot;Ethan He&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2007952552139083776/3nAl6TdB_normal.jpg&quot;,&quot;date&quot;:&quot;2026-01-29T05:43:55.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;Understanding requires imagining. 
Grok Imagine lets you bring what&#8217;s in your brain to life, and now it&#8217;s available via the world&#8217;s fastest, and most powerful video API: https://t.co/tqQwQVgCEI\n\nTry it out and let your Imagination run wild.&quot;,&quot;username&quot;:&quot;xai&quot;,&quot;name&quot;:&quot;xAI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1769430779845611520/lIgjSJGU_normal.jpg&quot;},&quot;reply_count&quot;:127,&quot;retweet_count&quot;:107,&quot;like_count&quot;:1346,&quot;impression_count&quot;:115686,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p>He comes back on Latent Space with some nuclear hot takes: that <strong>Video Models primarily get their intelligence from LLMs</strong>, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon <strong>world models</strong> is to work on LLMs (perhaps <a href="https://www.latent.space/p/ainews-thinking-machines-native-interaction">Interaction Models </a>as well&#8230;)</p><p>Put it this way: In the near term, the next Sora won&#8217;t be a better video model, but <strong>a video agent</strong>.</p><p><strong><a href="https://www.youtube.com/watch?v=t4359sKBu4w&amp;list=PLcfpQ4tk2k0VjKRy3q6ZxeOtkbZlmFDLg">Generative Media</a></strong> may more closely follow <strong>the evolution of AI coding</strong> which went from focusing on one-shot output performance and cost, to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs.</p><p>At a certain point, coding models got so good that the only significant next step to improve performance was <strong>handling the orchestration of these models.</strong></p><p>Now as the performance of video models increases significantly across realism, consistency, &amp; prompt adherence while becoming more cost efficient, the next evolution of video generation may 
also be systems that can plan, generate, edit, critique, and iterate across an entire creative task. </p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/XFreeze/status/2049725955208622475&quot;,&quot;full_text&quot;:&quot;Grok Imagine Agent Mode (Beta) just went live on Grok web\n\nIt&#8217;s a full creative agent working on one infinite open canvas\n\nGrok Agent plans &#8594; generates &#8594; edits &#8594; iterates everything automatically in the same workspace\n\nTell it what you want and watch it plan, generate, edit, &quot;,&quot;username&quot;:&quot;XFreeze&quot;,&quot;name&quot;:&quot;X Freeze&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1876785200010539008/2_HFJjq9_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-30T05:42:04.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://substackcdn.com/image/upload/w_1028,c_limit,q_auto:best/l_twitter_play_button_rvaygk,w_88/ipjicmb4vnnffm91qwrf&quot;,&quot;link_url&quot;:&quot;https://t.co/K1Q4aZLTPF&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:681,&quot;retweet_count&quot;:1151,&quot;like_count&quot;:3970,&quot;impression_count&quot;:920016,&quot;expanded_url&quot;:null,&quot;video_url&quot;:&quot;https://video.twimg.com/amplify_video/2049724062725963776/vid/avc1/848x720/z-z5YyMxPdbTS6Ot.mp4&quot;,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p>In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build <strong>frontier image and video systems</strong>: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. 
From building <strong><a href="https://www.nvidia.com/en-us/ai/cosmos/">NVIDIA&#8217;s Cosmos world model</a></strong> to joining <strong>xAI</strong> as <strong><a href="https://grok.com/imagine">Grok Imagine</a></strong> was being built from zero to one, <strong>Ethan He</strong> has been at the center of some of the most important work in video generation, multimodal models, and real-time world models.</p><div id="youtube2-jPtQlILfkhA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;jPtQlILfkhA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/jPtQlILfkhA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We go deep on <strong>Grok Imagine</strong>, how a small xAI team shipped its <strong>first multimodal video model in three months</strong>, why <strong>iteration speed</strong> matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines. </p><p></p><h2>Flipbook: The future of Videomaxxing</h2><p>Video agents are almost a sure bet to be the trend in the coming year. We end with a glance at what&#8217;s beyond video agents:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;1b45c9aa-5a9f-4e8a-8a4d-dc2cc4a71fab&quot;,&quot;duration&quot;:null}"></div><p><strong><a href="https://www.flipbook.page/n/43e8c7b08ab14571810fee265c331cb3">Flipbook</a></strong> caused a minor sensation this year when it was released, but most treat it as a fun demo. Ethan takes it very seriously &#8212; with the speed and cost of inference coming down every year, the future of custom video JIT UI is closer than you think. 
We talked about why videogen models may become the front end of AI, how <strong>generative UI could replace traditional HTML/CSS</strong>, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone.</p><div><hr></div><p><strong>We discuss:</strong></p><ul><li><p>Why <strong>fast iteration</strong> mattered more than meetings</p></li><li><p>Why <strong>small training bugs</strong> can drive huge model quality gains</p></li><li><p>Why coding models may make <strong>compute the bottleneck</strong> again</p></li><li><p>How image and video models are trained with <strong>synthetic captions</strong></p></li><li><p>The role of <strong>VAEs and latent space</strong> in frontier video models</p></li><li><p>Why <strong>image models</strong> are the foundation for video models</p></li><li><p>The tradeoff between <strong>temporal compression</strong> and real-time interactivity</p></li><li><p><strong><a href="https://www.flipbook.page/">Flipbook</a>, <a href="https://neural-os.com/">Neural OS</a></strong>, and the future of generative UI</p></li><li><p>Why future interfaces may go from <strong>user intent to pixels</strong></p></li><li><p>The hidden cost of training video models: <strong>storage, egress, and GPU hours</strong></p></li><li><p>How <strong>step distillation and consistency models</strong> (like <a href="https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/">OpenAI sCM</a>) make video inference orders of magnitude faster</p></li><li><p>Grok Imagine 0.9 and <strong>large-scale audio-video generation</strong></p></li><li><p>Why <strong>audio-video alignment</strong> is harder than text-video alignment</p></li><li><p>Ethan&#8217;s definition of <strong>world models</strong></p></li><li><p>Reference-to-video, video extension, and <strong>long-context video generation</strong></p></li><li><p>Why xAI&#8217;s 
research communication undersells <strong>Grok Imagine</strong></p></li><li><p>How <strong>xAI culture</strong> shaped the speed of development</p></li><li><p>AI watermarking, SynthID, and <strong>detecting generated media</strong></p></li><li><p>Why <strong>prompt rewriting</strong> matters for video models</p></li><li><p>Grok Imagine Agent and the rise of <strong>video agents</strong></p></li><li><p>Why <strong>language models</strong> may unlock better video generation</p></li><li><p>Robotics, physical AI, and <strong>embodied world models</strong></p></li><li><p>Why <strong>Ethan left xAI</strong> and shifted focus toward LLMs</p></li><li><p>Self-managed context, memory, and <strong>the next frontier for language models</strong></p></li></ul><div><hr></div><p><strong>Ethan He</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/ethanhe42">https://www.linkedin.com/in/ethanhe42</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/EthanHe_42">https://x.com/EthanHe_42</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p><strong>00:00:00</strong> Introduction</p><p><strong>00:01:25</strong> From NVIDIA Cosmos to xAI</p><p><strong>00:03:24</strong> Building Grok Imagine from Zero to One</p><p><strong>00:10:07</strong> How Image and Video Models Are Trained</p><p><strong>00:18:53</strong> Video Compression, VAEs, and Real-Time Tradeoffs</p><p><strong>00:22:10</strong> Generative UI, Flipbook, and Neural OS</p><p><strong>00:32:10</strong> The Cost of Training Large Video Models</p><p><strong>00:37:04</strong> Distillation, GANs, and Fast Video Inference</p><p><strong>00:41:21</strong> Audio-Video Generation and Grok Imagine 0.9</p><p><strong>00:48:34</strong> What Makes a World Model?</p><p><strong>00:55:51</strong> Reference Videos, Long Context, and Video Memory</p><p><strong>01:00:11</strong> xAI Culture, Research, and First-Principles Building</p><p><strong>01:09:45</strong> AI Safety, Watermarking, and Prompt 
Rewriting</p><p><strong>01:13:10</strong> Video Agents and AI-Assisted Creation</p><p><strong>01:27:32</strong> Why Language Models Unlock Better Video</p><p><strong>01:31:15</strong> Robotics, Physical AI, and Embodied World Models</p><p><strong>01:32:38</strong> Why Ethan Left xAI</p><p><strong>01:34:16</strong> Self-Managed Context and the Future of LLMs</p><p><strong>01:38:43</strong> Ethan&#8217;s Career Path and Closing Thoughts</p><div><hr></div><h1>Transcript</h1><h2>Introduction: Ethan He, Latent Space, and the Path to xAI</h2><p><strong>Swyx [00:00:00]:</strong> We&#8217;re here in the studio with Ethan He, most recently of xAI. Welcome.</p><p><strong>Ethan [00:00:10]:</strong> Thank you. Glad to be here.</p><p><strong>Swyx [00:00:11]:</strong> We&#8217;re also here with Vibhu. You were first coming to us or joining the Latent Space world because you were working on Cosmos at NVIDIA, and you did a paper. We loved it. You presented it as well, so thank you for doing that.</p><p><strong>Ethan [00:00:23]:</strong> I&#8217;ve actually, I also presented the MoEs twice at Latent Space.</p><p><strong>Swyx [00:00:29]:</strong> How did you actually hear about us? Did we reach out to you? Is that how it worked?</p><p><strong>Ethan [00:00:33]:</strong> No, actually, I-- the community. Like I realized, oh, there is this online community where people talk about AI and also learn from each other through papers every week through the Paper Club. It&#8217;s very nice.</p><p><strong>Ethan [00:00:49]:</strong> I learned a lot.</p><p><strong>Swyx [00:00:49]:</strong> I think three years nonstop. We haven&#8217;t stopped even on Christmas and New Year&#8217;s. Many weeks I want to stop, but it keeps going.</p><p><strong>Vibhu [00:00:58]:</strong> No, that was good. I think you had posted that you worked on a paper, and I was like, &#8220;Oh, very cool. We have Paper Club. 
Present then.&#8221;</p><p><strong>Vibhu [00:01:04]:</strong> But I might have reached out to you after.</p><p><strong>Swyx [00:01:05]:</strong> You-- because it&#8217;s an amateur club, right?</p><p><strong>Swyx [00:01:08]:</strong> So it&#8217;s very unusual, but we sometimes have paper authors come by and actually explain the paper. Today we just did the Poolside paper, which was apparently very good.</p><p><strong>Vibhu [00:01:18]:</strong> Came out yesterday.</p><p><strong>Vibhu [00:01:19]:</strong> Pretty interesting, right? Fully open. They talk about everything, systems. So it&#8217;s a good one. We&#8217;ll, we&#8217;ll recommend people read it.</p><p><strong>Swyx [00:01:25]:</strong> Bring us up to speed on your transition to xAI, &#8216;cause I actually don&#8217;t even know when you joined. Just like tell the, tell the story about the sort of transition.</p><h2>From NVIDIA Cosmos to xAI: Scaling Video and World Models</h2><p><strong>Ethan [00:01:34]:</strong> Before xAI, I was working on the Cosmos world model at NVIDIA. So Cosmos is, it&#8217;s a giant video foundation model that can-- that aims to simulate the world and for-- it serves as a foundation for all of the roboticists to build on top of. There, once I built Cosmos 1, I realized that, as this thing also has a scaling law similar to language models, we need to scale up the video models further. That&#8217;s, that&#8217;s why I realized I need to move to somewhere with much more compute resources. That&#8217;s how I</p><p><strong>Swyx [00:02:13]:</strong> Than NVIDIA?</p><p><strong>Vibhu [00:02:14]:</strong> The GPU rich came themselves.</p><p><strong>Vibhu [00:02:19]:</strong> And timeline-wise, when was Cosmos? It was pretty early, right? 
It was an open world model, open paper, everything.</p><p><strong>Ethan [00:02:25]:</strong> It was end of 2024.</p><p><strong>Vibhu [00:02:28]:</strong> End of 2024.</p><p><strong>Ethan [00:02:30]:</strong> Then in mid-2025, I moved to xAI. At that time-- I joined about the time when xAI was about to build video models and multimodal models. There was no infra, no data, and no model, and it just-- as a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.</p><p><strong>Ethan [00:02:55]:</strong> And since then, I kept working on video models and moved more from training to post-training of the video models. For example, reference-to-video, kind of like the cameo feature, and video extension. And, before I left, I worked on a world model, leading a small team to focus on real-time, long-horizon video generation.</p><h2>Building Grok Imagine From Scratch in Three Months</h2><p><strong>Swyx [00:03:24]:</strong> Can you give like a rough roadmap of, okay, you&#8217;re on a brand-new team. Grok previously was only text, or they partnered with BFL for their image gen stuff. What do you-- what are the building blocks, right? You have compute, data you can procure somewhere. Like just what is the sequence of things that people should think about when you&#8217;re setting up a new team?</p><p><strong>Vibhu [00:03:43]:</strong> Actually even deeper, not just data you can procure. You guys had to go through getting the data too, right? So you shipped it pretty fast, but yeah</p><p><strong>Swyx [00:03:51]:</strong> Three months is like</p><p><strong>Vibhu [00:03:52]:</strong> From everything</p><p><strong>Swyx [00:03:52]:</strong> actually like very surprisingly fast.</p><p><strong>Ethan [00:03:55]:</strong> One thing I&#8217;d say is thanks to my experience at NVIDIA, &#8216;cause the first time, when we were building Cosmos together, we built it for about a year. 
So this is like the second time I&#8217;ve done it. I roughly have an idea of what to do. I&#8217;d say the most important thing is the talent. Everyone was very strong and clever, working very closely with each other towards a common goal. So that sped things up a lot. So you reduce the communication bandwidth among people, and everyone can work towards the same goal. It&#8217;s, it&#8217;s like every day there are not that many meetings on the calendar, like maybe like a, like a sync a day, and after that it&#8217;s, it&#8217;s just all building. It was pretty fun at that time.</p><p><strong>Ethan [00:04:47]:</strong> And another thing is that xAI has very strong foundations of like data inference, model inference, and the support there can help model development a lot. When I look at training models, actually the most important thing is like how many, how many iterations can you do per day? And the more iterations you can do, the faster you can train the model. So if you have very strong infra and you have a lot of compute, you can, you can train these models in a very short period of time. That can give you a much larger buffer for errors, and it also gives you the opportunity to spot more bugs.</p><h2>Iteration Speed, Compute, and Debugging Model Pipelines</h2><p><strong>Swyx [00:05:46]:</strong> What is an iteration? Is it like a few hundred steps or what are you</p><p><strong>Ethan [00:05:50]:</strong> Let&#8217;s say just training the model, like from acquiring new data and maybe designing new algorithms to training a new model, maybe at smaller scale or</p><p><strong>Swyx [00:06:01]:</strong> So cycle time for like any hyperparam that you&#8217;re searching.</p><p><strong>Ethan [00:06:04]:</strong> Cycle time and time to like eval this model. 
Is this model better than my previous iteration?</p><p><strong>Ethan [00:06:11]:</strong> So</p><p><strong>Swyx [00:06:11]:</strong> So it&#8217;s like before you, someone had already set this up so that you can iterate very quickly.</p><p><strong>Ethan [00:06:15]:</strong> I think the foundation there is extremely good for developing and researching models.</p><p><strong>Ethan [00:06:23]:</strong> And often what I find is-- this is kind of boring, but like a lot of the improvements do not come from new algorithms. They come from finding small bugs here and there in the data pipeline, in the, in the model training pipeline. Those give, those give the biggest boost to the model quality.</p><p><strong>Vibhu [00:06:46]:</strong> It&#8217;s interesting, right? So you say it&#8217;s like small team, less communication bandwidth, but also a lot of quality is like finding little bugs. It seems counterintuitive, right? You have a lot of people, you can iron out more of those, but it&#8217;s interesting to see the other side, right?</p><p><strong>Swyx [00:07:00]:</strong> I also wonder, have you-- do you try using LLMs to look for bugs? I don&#8217;t know.</p><p><strong>Ethan [00:07:05]:</strong> I remember at that time it was mid-2025, so the coding model wasn&#8217;t quite there yet. I remem- I remember like December 2025, it was extremely good. Yeah, I&#8217;ve been, I&#8217;ve been using it since that time. It&#8217;s, it&#8217;s helpful. Sometimes it produces code that is kind of difficult to maintain, even though like the first time it built something extremely fast. But it gave the, like a spaghetti code, thousands of lines that I couldn&#8217;t maintain, and the LLM itself couldn&#8217;t figure out what&#8217;s, what&#8217;s wrong and how to improve on top of it. But now I find it much better. 
Yeah, I want to bring up another point here: now coding models are much more efficient and can help us implement stuff much faster. Compute might become a bottleneck again because previously, like if you want to train a new model, say you want to generate new synthetic data or write a new algorithm, it might take a few weeks. And during that period of time, you don&#8217;t-- you might not have experiments to run. But now you can build that thing within a few hours, then you can immediately train a model.</p><p><strong>Ethan [00:08:24]:</strong> Now you have to have enough compute to try all of the ideas. So compute might be the bottleneck of iteration speed again.</p><p><strong>Swyx [00:08:36]:</strong> Yeah, I actually, honestly, I think it&#8217;s like kind of a stressful job because you&#8217;re like, &#8220;Well, I should be trying everything, and if I&#8217;m not, then I&#8217;m not doing my job well.&#8221;</p><p><strong>Vibhu [00:08:48]:</strong> There&#8217;s also the stress of you&#8217;re eating thousands of GPUs per hour, which is very expensive, and compute can go to other researchers.</p><p><strong>Swyx [00:08:56]:</strong> You got the daddy Elon to</p><p><strong>Vibhu [00:08:57]:</strong> You got daddy Elon.</p><p><strong>Ethan [00:08:59]:</strong> It was</p><p><strong>Vibhu [00:09:00]:</strong> But there&#8217;s still a finite amount of compute, like you want to use it, you want to use it well, you want more of it.</p><p><strong>Ethan [00:09:06]:</strong> That was quite stressful indeed. Yeah, I think one thing is the-- with coding models now, like a lot of these jobs can be automated, which is much better. 
Second, it&#8217;s a, it&#8217;s a marathon, so you got to maintain good health and a regular schedule.</p><p><strong>Vibhu [00:09:28]:</strong> It&#8217;s, it&#8217;s hard to hear that when you shift from zero to nothing in two months.</p><p><strong>Swyx [00:09:32]:</strong> And, I think obviously the culture at xAI is very famously, people work very hard. One thing I did want to dive into, in our-- in the notes that you, that you sent ahead of time, you had specific comments about the cost of Video Gen training. Presumably this is on the Colossus-1, right? The 200-megawatt cluster. Any whatever you want to just share on that.</p><p><strong>Vibhu [00:09:54]:</strong> I think there&#8217;s, there&#8217;s three things we&#8217;re talking about, right? So there&#8217;s Video Gen, there&#8217;s also the Image Gen model that you put out. Do you want to like complete the, okay, so zero to one, you have a few months. Just what are the stages of creating an Image Gen model?</p><p><strong>Swyx [00:10:06]:</strong> Oh, yeah, maybe I got distracted.</p><h2>How Image and Video Models Are Trained: Synthetic Captions, Tokenizers, and VAEs</h2><p><strong>Vibhu [00:10:07]:</strong> Sorry. And then, from there, there&#8217;s Video Gen, there&#8217;s Audio Gen. Would love to get into those next. But what are those first few months like? So small team, a lot of bugs, iterations, but what does it look like? Do we take something off the shelf? Do we just get data, compute? What&#8217;s, what&#8217;s the few months like? How do you get to a state-of-the-art Image Gen model? How do you just start?</p><p><strong>Ethan [00:10:28]:</strong> I cannot comment specifically on how xAI did it, but it&#8217;s, it&#8217;s quite a standard process. I can draw some examples from Cosmos. So mainly, in building a video model, you actually need to build an image model first. And for building these two models, the data you need is a hundred percent synthetic pairs of language and image or language and video. 
Because on the internet, actually, the videos don&#8217;t naturally associate with text. So you can say, oh, like on YouTube, you have the title and you have the description and the comments</p><p><strong>Swyx [00:11:11]:</strong> Title</p><p><strong>Ethan [00:11:11]:</strong> of a video, but usually they&#8217;re not relevant to the video itself. And say maybe like the video is a natural scene of mountains or something, and the title is, I&#8217;m so happy today.</p><p><strong>Ethan [00:11:26]:</strong> So they have no correlation at all. So the first step is you have to generate synthetic pairs of language with the videos. So you gather videos from the internet, and you use a VLM to caption the videos. So that part, here&#8217;s a question, like how do you get a VLM to begin with? So if there&#8217;s no</p><p><strong>Swyx [00:11:55]:</strong> You, so you fuse the model, right? Like</p><p><strong>Ethan [00:11:57]:</strong> Say if no VLM exists, like how do you generate the text in the beginning, right? It&#8217;s, it&#8217;s impossible.</p><p><strong>Swyx [00:12:04]:</strong> I see.</p><p><strong>Ethan [00:12:05]:</strong> In the beginning, it&#8217;s like you ask humans to describe the video in as much detail as possible. For example, you ask them to describe everything, like all objects, all characters, and all interactions and dialogues in the videos. So that&#8217;s in the protocol of Cosmos labeling. The objective we gave to the labelers was that you have to describe the video in as much detail as possible, such that a blind person who hears the blob of text can reconstruct what the video is like in their head.</p><p><strong>Swyx [00:12:43]:</strong> Video or image? 
You&#8217;re talking about images.</p><p><strong>Ethan [00:12:44]:</strong> Video or image, either one of them.</p><p><strong>Vibhu [00:12:47]:</strong> This was pretty common when we went from clip and DALL-E, right?</p><p><strong>Vibhu [00:12:51]:</strong> It&#8217;s all training on really detailed captioning of images. So same is applied to video, but instead</p><p><strong>Ethan [00:12:57]:</strong> same applied</p><p><strong>Vibhu [00:12:57]:</strong> of using multimodal model to pass in video images and write rich descriptions, you can also</p><p><strong>Swyx [00:13:04]:</strong> I think there&#8217;s this traditional perspective of supervised, or, very highly human curated thing. I feel like there&#8217;s a unlock with unsupervised, right? Where like you have enough to bootstrap that you can just throw common corpus on it or, whatever. like unsupervised vision and language pairing, right? Like where you just have, interspersed image and text and it just learns. To me, that is the VLM breakthrough that is different from the clip, different from the LM era.</p><p><strong>Ethan [00:13:36]:</strong> It&#8217;s interesting to see that you kind of need both data.</p><p><strong>Ethan [00:13:41]:</strong> For example, for the</p><p><strong>Swyx [00:13:41]:</strong> You need it to bootstrap it up. Yeah</p><p><strong>Ethan [00:13:43]:</strong> for the generative model training, there&#8217;s also usually like a small percentage of unlabeled data. So the model is instructed to generate a video without any text instruction. That can also help the model generalize. So after this stage of generative synthetic pair, so, one important common step is to train a compressor or a tokenizer of the image or videos. So because, if you train-- If you can technically, theoretically train image or video models on pure pixels, but the problem is that the, it&#8217;s, it&#8217;s a lot of tokens. 
So like one image, it&#8217;s a thousand by a thousand, it&#8217;s like one million tokens, one million pixels. It&#8217;s impossible to train a transformer on that. So you need to train a tokenizer, which can go from image to latent space and latent space back to image.</p><p><strong>Swyx [00:14:45]:</strong> That&#8217;s why we named the podcast.</p><p><strong>Swyx [00:14:48]:</strong> But, basically, you&#8217;re talking about vocabulary size.</p><p><strong>Ethan [00:14:50]:</strong> So vocab.</p><p><strong>Swyx [00:14:51]:</strong> And so, what is, what is imp-- like a million is impossible?</p><p><strong>Ethan [00:14:54]:</strong> In generative models, the vocab is continuous. It&#8217;s a continuous space. We can think about it like you map an image to a vector. It&#8217;s a fixed length vector. It&#8217;s sixteen or forty-eight, something like that. And then you map that vector back to the image space. And the mapping is, has-- The mapping is patch-based. So you say you have</p><p><strong>Ethan [00:15:22]:</strong> a sixteen by sixteen patch and you map that patch of pixels into this latent space.</p><p><strong>Swyx [00:15:29]:</strong> We&#8217;ve covered this</p><p><strong>Vibhu [00:15:30]:</strong> This is like the vision transformers</p><p><strong>Swyx [00:15:32]:</strong> VAEs,</p><p><strong>Ethan [00:15:33]:</strong> VAEs.</p><p><strong>Vibhu [00:15:34]:</strong> You basically compress your input, you do your generation, you&#8217;re reasoning all that generation in a smaller dimension, and then you project back out.</p><p><strong>Swyx [00:15:43]:</strong> VAE is a form of compression, but I think for me, the patching thing is from ViT, right?</p><p><strong>Ethan [00:15:48]:</strong> You can make those.</p><p><strong>Swyx [00:15:49]:</strong> Literally the, yeah, the paper is titled like sixteen by sixteen is all you need. Something like that. 
And then I think also, people make a lot of comparisons with this kind of patching with convolutions.</p><p><strong>Swyx [00:16:02]:</strong> Which is you&#8217;re kind of reconstructing the old paradigm with the new.</p><p><strong>Ethan [00:16:05]:</strong> Actually, in VAEs, there are both convolution networks and transformers. You can actually do both.</p><p><strong>Ethan [00:16:14]:</strong> After this VAE, so what you&#8217;ve got is you&#8217;ve got latent space tokens and you&#8217;ve got the language tokens. So now the training of the diffusion transformer, usually generative models use diffusion transformers. It is actually quite standard. It&#8217;s very similar to how you train a language transformer model. It&#8217;s not that much different. It&#8217;s just the tokens, the visual tokens in, visual tokens out. The only difference is there&#8217;s a denoising process. So you train the model to denoise. So you add random noise to the visual tokens, and then you train the model to remove that noise to generate the clean tokens. And at inference, the model can iteratively remove noise from a hundred percent noise.</p><p><strong>Swyx [00:17:12]:</strong> And then there&#8217;s also, to speed things along on the tech tree of diffusion, there&#8217;s CFG, and then there&#8217;s also latent diffusion, that&#8217;s somewhere in there. I think, somewhere along the line, obviously, like Stability and all these other guys pioneered a lot of this architecture. 
I don&#8217;t know if you want to get into that or just do the video side. Up to you.</p><h2>Bootstrapping Video from Image Models and Temporal Compression</h2><p><strong>Ethan [00:17:37]:</strong> After you train such model, such image model, the reason it&#8217;s a foundation for video models is that image models are cheaper to train, and they have much denser connection between language and text. So, sorry, language and images. For example, you train on a billion images, and there&#8217;s a mapping from the text to the image. And the cost to train the same, like a billion text to a billion videos, that&#8217;s much more expensive because videos naturally have more tokens than images. Because the diffusion models, their understanding of language purely comes from this mapping. So if you don&#8217;t have enough mapping, so if you only train on like ten million videos or something, you might not see enough language tokens in your training, so your model does not understand human intention enough. So that&#8217;s why you first train these image diffusion models, and then you bootstrap the video model from there.</p><p><strong>Swyx [00:18:53]:</strong> One thing I did want to ask, because I-- actually, I think you&#8217;re the first per-- video model person I&#8217;ve ever talked to, I think. We&#8217;ve like talked to Luma and all those folks. There&#8217;s all these tricks in video compression where basically frame by frame there&#8217;s not that much difference, so actually you don&#8217;t have to regenerate or save the whole frame, right? But I think MP4 compression or something else like that.</p><p><strong>Swyx [00:19:16]:</strong> Is it tempting to use that? 
Or as far as I can tell, everyone just treats it as, &#8220;No, we would just generate every frame.&#8221; Is that roughly the state of the art?</p><p><strong>Ethan [00:19:27]:</strong> There are a few different approaches. Let&#8217;s say first, like you want to just directly use MP4 compression and use that as the tokens for the transformers to train, right? So people actually have tried that, but the main challenge is the latent space for the MP4 tokens was not very comprehensible for the models. It&#8217;s extremely hard to train on that. And there&#8217;s a</p><p><strong>Ethan [00:20:01]:</strong> So that&#8217;s why they created VAEs, which create a more continuous latent space, so the models can understand that latent space and learn from it much easier. Even within the VAEs, there are different difficulties of the latent space. So you can imagine the simplest, the most naive VAE is like you have an image, and you just shuffle all of the pixels into a vector. So you don&#8217;t need to train any VAEs, right? But that latent space is extremely hard for models to train on top of. That&#8217;s why there is some debate on like how do you compress the tokens. So you mentioned like you can compress frame by frame. Also, you can compress the temporal dimension.</p><p><strong>Ethan [00:20:52]:</strong> The difference is if you compress the temporal dimension, you get a much higher compression rate. Because there&#8217;s temporal redundancy between frames, because this frame and the last frame, likely they are mostly similar, so there&#8217;s only some small difference. For example, I think in Wan 2.1 VAE, they have like an eight by eight by four compression rate. So the four temporal tokens are compressed into one token. That can save a lot of the context length. If you do it frame by frame, you have to do maybe like eight by eight by one. Your context length will be four times larger. 
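</p><p>To make the compression arithmetic above concrete, here is a minimal Python sketch. The helper name <strong>latent_tokens</strong>, the 64-frame clip, and the 512 by 512 resolution are assumptions chosen for illustration, and the count ignores any further patchification inside the transformer as well as model-specific handling of the first frame.</p>
<pre>
```python
import math

def latent_tokens(frames, height, width, spatial=8, temporal=4):
    """Count latent positions after a VAE with the given compression factors.

    Hypothetical helper for illustration: real VAEs differ in how they pad
    and how they treat the first frame, so this simply rounds up.
    """
    t = math.ceil(frames / temporal)
    h = math.ceil(height / spatial)
    w = math.ceil(width / spatial)
    return t * h * w

# An assumed 64-frame clip at 512 by 512 pixels.
joint = latent_tokens(64, 512, 512, spatial=8, temporal=4)      # 8 x 8 x 4
per_frame = latent_tokens(64, 512, 512, spatial=8, temporal=1)  # 8 x 8 x 1

print(joint)               # 65536
print(per_frame)           # 262144
print(per_frame // joint)  # 4
```
</pre>
<p>Under these assumptions the per-frame layout needs four times as many latent positions for the same clip, which is the context-length gap described in the conversation.</p><p>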
That being said, the benefit of the per frame compression, we might come back to this later, is real-timeness and interactivity. &#8216;Cause if you stream the output of the model frame by frame, the model can respond to any user request immediately. So if you have like a temporal four times compression, then</p><p><strong>Swyx [00:22:06]:</strong> It might be laggy</p><p><strong>Ethan [00:22:07]:</strong> there&#8217;s a lag there by nature.</p><p><strong>Swyx [00:22:10]:</strong> So you&#8217;re very pilled on this. Let&#8217;s just go ahead and bring it up &#8216;cause we have the visual prepared anyway. There&#8217;s some frontier applications of real-time video gen. So Flipbook is one of the examples that went viral recently, right? What is Flipbook?</p><h2>Real-Time Generative UI: Flipbook, Neural OS, and Diffusion Front Ends</h2><p><strong>Ethan [00:22:23]:</strong> Flipbook is kind of like a web browser. You can see like it has the web browser UI on top. The difference is all of the UIs are generated by a generative image model in real time, and everything here is fake. But you can explore inside this imaginary world. Say like here we have engineering the Great Pyramid. Like the model generates this for us to understand how it works, and if we want to navigate around and understand further, we can click on some of the description here, and the model will generate a new page, new subpage describing the details we want to know about.</p><p><strong>Swyx [00:23:14]:</strong> So it&#8217;s basically kind of we&#8217;re playing a video, but it&#8217;s pausing for our next interaction, and then it just plays the next thing based on our interaction.</p><p><strong>Swyx [00:23:23]:</strong> Which is kind of cool.</p><p><strong>Vibhu [00:23:25]:</strong> And you kind of decide your story. So this was, how do you make a pyramid? Levering technique seemed interesting, right? 
It shows how do you take-- Okay, I want to know what is this</p><p><strong>Swyx [00:23:35]:</strong> The demo, the demo tweet had more animation between frames.</p><p><strong>Vibhu [00:23:38]:</strong> I think it&#8217;s just skipping,</p><p><strong>Swyx [00:23:39]:</strong> Oh, it&#8217;s just skipping a lot of frames.</p><p><strong>Ethan [00:23:40]:</strong> They also have a video mode</p><p><strong>Vibhu [00:23:42]:</strong> It takes a lot. There&#8217;s a lot of people</p><p><strong>Ethan [00:23:42]:</strong> but, a lot of people are using it.</p><p><strong>Ethan [00:23:45]:</strong> So it&#8217;s not available.</p><p><strong>Vibhu [00:23:46]:</strong> There&#8217;s a live video stream. We can try,</p><p><strong>Swyx [00:23:50]:</strong> So this is an example of the kind of future that you see at the extreme. We don&#8217;t-- we&#8217;re obviously not in it today.</p><p><strong>Swyx [00:23:56]:</strong> But in a world where inference is completely free, this is better than generating code and text?</p><p><strong>Ethan [00:24:02]:</strong> So this is a final state of where we will be at for world models, I think. Imagine the internet doesn&#8217;t exist, and then you type in google.com. Like what should a model show you? The model can imagine something, and this is what the model imagines. And these web pages, they completely do not exist. So I think as the inference costs come down, we are going to have generative UI for everything. If you think about how the coding models work, so they write code for a web page, and the code might be converted into binary, and the binary renders the pixels on the screen. So we in machine learning, every time we have some breakthrough, obviously it&#8217;s, it&#8217;s more intuit. So why don&#8217;t we have like user instruction to the pixel directly? So the generative UI will be user intention to the pixels directly. 
And say like even if I want email, let&#8217;s say everyone has the same interface, but I want it slightly different. I want the email to show to me like a TikTok, so I can swipe left and right for the emails. Or maybe you want something else. We can have completely different things. Or like I&#8217;m looking at Instagram stories, and I don&#8217;t like the Like button. I always misclick it. And generative UI resolves it. So it&#8217;s going to be a revolutionary replacement of the interface. So in the future, we might have much more powerful</p><p><strong>Ethan [00:25:50]:</strong> LLMs and coding models running behind the scenes. And in the front-end, the diffusion model will actually be the front-end to show stuff to you. That&#8217;s how I imagine it.</p><p><strong>Swyx [00:26:02]:</strong> Diffusion front-end, deterministic back-end.</p><p><strong>Swyx [00:26:04]:</strong> Something like that. I find that very expensive, but,</p><p><strong>Vibhu [00:26:08]:</strong> I find it interesting you called LLMs writing code on the back end deterministic, but okay.</p><p><strong>Swyx [00:26:14]:</strong> You write it once</p><p><strong>Vibhu [00:26:15]:</strong> Compare it to</p><p><strong>Swyx [00:26:16]:</strong> And then you execute.</p><p><strong>Ethan [00:26:17]:</strong> If you think about the cost, say, let&#8217;s say an H100 costs $1 per hour, and if you use this eight hours a day for thirty days, so every month you&#8217;re paying $240, you&#8217;ll actually not wanna pay for that. That&#8217;s even more expensive than Claude Code Max. But if you think about the compute costs coming down like two times every year, I think the future will likely arrive like within a few years.</p><p><strong>Vibhu [00:26:49]:</strong> It&#8217;s everything, right? 
Compute cost comes down, compute gets faster, model gets smarter</p><p><strong>Ethan [00:26:54]:</strong> More efficient</p><p><strong>Vibhu [00:26:54]:</strong> model gets smaller.</p><p><strong>Swyx [00:26:55]:</strong> I don&#8217;t know why you say two times, &#8216;cause I think it&#8217;s like 100 times. In language models, it is roughly one hundred to a thousand times every twelve to eighteen months, for the same given level of LMSYS Elo.</p><p><strong>Vibhu [00:27:08]:</strong> That&#8217;s a net of everything, right? That&#8217;s model performance alongside compute. So different than just compute costs coming down. But, a very interesting future.</p><p><strong>Swyx [00:27:19]:</strong> So the web designers will have to shout out that accessibility is an issue, right? How do you deal with screen readers or whatever. But yes, this is higher bandwidth storytelling than anything you can possibly generate with code, right? So I think that&#8217;s the rough idea.</p><p><strong>Ethan [00:27:34]:</strong> And I&#8217;d like to add a little bit that humans naturally have the maximum bandwidth when we are looking at things, looking at videos, and we also have maximum output bandwidth when we are talking. So in the future, it might be something like we talk to AI models, and the AI model responds back with a generative UI. So that would be the maximum input and output bandwidth to interact with AI models before Neuralink happens.</p><p><strong>Vibhu [00:28:06]:</strong> And it&#8217;s also very custom, right? Some people are very visual, some people are not as visual, right? They prefer the text. But the best thing about generative UI, right, it can also be text.</p><p><strong>Swyx [00:28:17]:</strong> There&#8217;s another project that we wanted to highlight, which is the Neural OS. 
Kinda similar idea, but here you&#8217;re literally operating, simulating an operating system with a video model.</p><p><strong>Swyx [00:28:27]:</strong> and you can play Doom, you can do Firefox. I find this like mildly less impressive, obviously, because it&#8217;s an OS that I can run.</p><p><strong>Swyx [00:28:37]:</strong> But here everything is imagined.</p><p><strong>Vibhu [00:28:40]:</strong> I was, used to the Command+W to close the Firefox tab. It didn&#8217;t crash. That&#8217;s why I said</p><p><strong>Swyx [00:28:45]:</strong> It&#8217;s too immersive.</p><p><strong>Vibhu [00:28:46]:</strong> It&#8217;s, it&#8217;s too immersive for me.</p><p><strong>Swyx [00:28:47]:</strong> Too immersive.</p><p><strong>Vibhu [00:28:48]:</strong> I wanted to close the tab.</p><p><strong>Vibhu [00:28:49]:</strong> But yes, I can play generated diffusion.</p><p><strong>Swyx [00:28:51]:</strong> this is shockingly fast.</p><p><strong>Swyx [00:28:54]:</strong> Because I remember there was a demo about like maybe one to two years ago. Someone tried to do the first-person shooter with a image model. There was no consistency. It was very slow. But here it looks like realistically it&#8217;s-- this is Doom.</p><p><strong>Vibhu [00:29:07]:</strong> I think there&#8217;s two sides to that, right? There&#8217;s okay, what is running a game? The heavy part of it is actually the game engine, all the lighting, all that stuff, the graphics. This is just kind of video, right? Like we&#8217;ve solved consistency. This is still, it looks like a few years old image generation. There&#8217;s some temporal consistency, but it&#8217;s, it&#8217;s kind of just images stitched together as frame video. But it&#8217;s a good visual representation to pi- to picture the future you wanna see, right? that&#8217;s, that&#8217;s what I see in these more so.</p><p><strong>Ethan [00:29:38]:</strong> This reminds me of how the video models gets better and better. 
So Neural OS is kinda if you just look at it feels like it&#8217;s just a crappy version of the, like the Windows we could have, right? And, but the difference is, so the model, this model is overfitted on the existing operating systems. It can generate nothing different than that. But it&#8217;s actually also similar to video models. So when we are training these video model, image model, we train them on internet. There&#8217;s no imaginary supernatural stuff on the internet. But once we train this model, you can prompt the model to generate something supernatural that have never existed in the data set. So if you train your Neural OS or neural computer on the standard screen recordings on the entire internet. The model can imagine completely new interface to interact with the computer.</p><p><strong>Swyx [00:30:43]:</strong> This is one of those things that is magical to me. usually generalizing out of distribution is bad, but somehow we have learned some kind of internal world model that you say, this plus, but it looks like rainbows and butterflies, it&#8217;ll do it and it will kind of make sense.</p><p><strong>Swyx [00:31:03]:</strong> So yeah, that&#8217;s kind of cool. Yeah, I don&#8217;t know if there&#8217;s any comment more on there. I do, I do wanted to, I did wanted to touch a little bit more on the model architecture stuff, which I think you were getting. It&#8217;s, really fascinating. We don&#8217;t get a chance to talk about this enough. So one of the papers that we covered, we&#8217;ve covered every annual, segment anything release. and I don&#8217;t know if you follow-- you&#8217;re a computer vision guy, so you</p><p><strong>Ethan [00:31:26]:</strong> I know</p><p><strong>Swyx [00:31:27]:</strong> . So they did memory attention, which is kind of interesting. 
And I always think, anything where you can, across the temporal dimension, keep some consistency, I think it&#8217;s very fascinating, and I don&#8217;t know if-- Basically, the CV side bleeding into the video gen side, I think is underexplored, right? We talk about it for labeling, but actually you can borrow the architecture itself.</p><p><strong>Ethan [00:31:50]:</strong> There&#8217;s also completely different approaches, right? You brought up the term world model, so we went from video model to world model. There is diffusion, but there&#8217;s also other approaches that people are doing. So maybe we get into those after as well?</p><p><strong>Swyx [00:32:03]:</strong> He has a whole definition of world models and stuff. I feel like we threw a lot at you. Whatever you want to comment on.</p><h2>Why Video Models Are Expensive: Storage, I/O, and Training Scale</h2><p><strong>Ethan [00:32:10]:</strong> I think one thing that we should actually comment back on is okay, so we were talking about the steps to train image gen to video model. One thing we don&#8217;t see as much of is okay, you brought up the delta in training data, right? So</p><p><strong>Ethan [00:32:24]:</strong> you won&#8217;t have as much, a video model might not generalize, but what is the cost of training a large video model? So we know for LLMs roughly, okay, even like the Poolside thing that came out today, right? It&#8217;s a Gemma level model trained on roughly forty trillion tokens at this many H200s over this much time, right? You can see what is the exact cost of that. So how many GPU hours over how much H200 costs? So how do we do the back-end math of, same thing for video models, image models. How do you kind of break that down? I can share some back-of-the-envelope calculation. So surprisingly, for video models the cost is comparable to language models. Obviously the largest scale is language models, so maybe comparable to medium-scale language models. 
I said just storing the videos alone, it costs a lot. You can, you can maybe look up on AWS or something.</p><p><strong>Ethan [00:33:20]:</strong> You really, say if you have a billion videos and let&#8217;s say, let&#8217;s just say like each video, like five megabyte, then you need five petabyte to just store those videos. And also remember we talk about you use a VAE to compress the videos, and you also need to store, typically you need to store those continuous feature, in-- also in your storage. That&#8217;s also comparable size with the videos themselves. So just storing these videos and the features is tens of petabytes alone. And,</p><p><strong>Swyx [00:33:58]:</strong> I just, I just looked up the calculation. Five petabytes on S3 Standard is one hundred K per month.</p><p><strong>Ethan [00:34:05]:</strong> And</p><p><strong>Swyx [00:34:05]:</strong> It&#8217;s comparable</p><p><strong>Ethan [00:34:05]:</strong> and you need</p><p><strong>Swyx [00:34:06]:</strong> And</p><p><strong>Ethan [00:34:06]:</strong> And then like tens of petabytes, two hundred K. And even more expensive is you have the ingress and egress.</p><p><strong>Swyx [00:34:13]:</strong> Oh, yeah.</p><p><strong>Ethan [00:34:14]:</strong> Like you-- through the internet. You have to just to download those videos, I believe it&#8217;s, it&#8217;s more expensive on AWS than just storing those videos.</p><p><strong>Swyx [00:34:25]:</strong> Storing, yeah.</p><p><strong>Ethan [00:34:25]:</strong> And each training runs, you probably need to pull them once. If you train multiple times, it&#8217;s, it&#8217;s even more than that. So it&#8217;s like just storing the network, those costs is just, it would be a few, a few millions per month to just storing everything, not to mention the GPU cost.</p><p><strong>Ethan [00:34:45]:</strong> And</p><p><strong>Swyx [00:34:45]:</strong> my side tangent, the compute rental, like GPU rental is very efficient. 
There&#8217;s one side, okay, you can be xAI and build your data center. Should we not just build our storage compute as well? Like</p><p><strong>Ethan [00:34:57]:</strong> Of course</p><p><strong>Swyx [00:34:57]:</strong> cloud cost compared to just,</p><p><strong>Ethan [00:34:59]:</strong> You save so much</p><p><strong>Swyx [00:35:00]:</strong> store. Yeah, exactly.</p><p><strong>Swyx [00:35:01]:</strong> Especially with like egress and stuff. So.</p><p><strong>Ethan [00:35:04]:</strong> That&#8217;s a good idea, but it also comes with some of its own challenges.</p><p><strong>Swyx [00:35:09]:</strong> Of course, of course.</p><p><strong>Ethan [00:35:10]:</strong> Like people who build the GPU data centers, they might not expect this much storage. And yeah, people who build storage, typically they just build it somewhere with just CPUs.</p><p><strong>Swyx [00:35:23]:</strong> I just looked it up. Five-- AWS only charges for egress, not ingress. Tier five for five petabytes is two hundred and thirty K.</p><p><strong>Ethan [00:35:32]:</strong> Even more expensive than the storage.</p><p><strong>Swyx [00:35:34]:</strong> But storing is per month, right? You check in, then you cannot check out. So it&#8217;s so cool. It&#8217;s okay. So there&#8217;s that side.</p><p><strong>Ethan [00:35:41]:</strong> So the TLDR, my back-of-the-envelope math</p><p><strong>Swyx [00:35:42]:</strong> Data is larger than you think. Yes.</p><p><strong>Ethan [00:35:44]:</strong> my back-of-the-envelope math of GPU hours times GPU cost is also very much, I&#8217;m missing some storage.</p><p><strong>Swyx [00:35:49]:</strong> You&#8217;re also-- you&#8217;re basically like also more IO bound than normal training.</p><p><strong>Swyx [00:35:55]:</strong> Yes. &#8216;Cause like data loading, so caching everything, it becomes super important.</p><p><strong>Ethan [00:36:00]:</strong> So in Cosmos, we did a lot of optimizations to make it not IO bound. 
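</p><p>The storage figures quoted in this part of the conversation can be reproduced with a short Python sketch. The function names are hypothetical, the flat rate of 0.02 USD per GB-month is an assumption picked to match the quoted "one hundred K per month" rather than an official price (real cloud pricing is tiered), and the doubling for cached VAE features follows the remark that the features are comparable in size to the videos.</p>
<pre>
```python
def storage_petabytes(num_videos, mb_per_video):
    """Raw dataset size in decimal petabytes (1 PB = 1e9 MB)."""
    return num_videos * mb_per_video / 1e9

def monthly_storage_cost(petabytes, usd_per_gb_month):
    """Flat-rate monthly bill in USD; real cloud pricing is tiered."""
    return petabytes * 1e6 * usd_per_gb_month

# Numbers from the conversation: one billion videos at about 5 MB each.
raw_pb = storage_petabytes(1_000_000_000, 5)

# Assumption: cached VAE features take about as much space as the videos.
total_pb = raw_pb * 2

# Assumption: flat illustrative rate, not an official price.
rate = 0.02

print(raw_pb)                                        # 5.0
print(round(monthly_storage_cost(raw_pb, rate)))     # 100000
print(round(monthly_storage_cost(total_pb, rate)))   # 200000
```
</pre>
<p>This covers storage only. Egress and repeated pulls for each training run come on top, which is why the speakers expect the total data bill to be considerably higher.</p><p>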
So, speaking of the training, actually training the model, the GPU cost, if you look up like the open source models, how big these video models are, I think like LTX has nineteen B parameters. That&#8217;s a dense model. And people are also exploring MoEs, so it might be twenty B active and like hundreds of B total. So that&#8217;s a similar size to medium-sized LLMs. And if you look at the number of tokens-- we disclosed that in Cosmos. It&#8217;s also like tens of trillions of visual tokens. So putting this together, the cost of training these video models, it&#8217;s actually comparable with LLMs. Not to mention, the infra is slightly different from LLMs, so it might be less efficient to train these models.</p><h2>Inference Speedups: Step Distillation, Consistency Models, and GANs</h2><p><strong>Swyx [00:37:04]:</strong> Do you get the benefits of traditional diffusion speed-up? So for images, there&#8217;s LCM, LoRAs for fine-tuning. There&#8217;s a lot of stuff that&#8217;s been</p><p><strong>Ethan [00:37:15]:</strong> Flow matching.</p><p><strong>Swyx [00:37:16]:</strong> There&#8217;s flow matching. There&#8217;s a lot of stuff that&#8217;s been done. There&#8217;s some overlap that applies to diffusion on the inference side and stuff or?</p><p><strong>Ethan [00:37:23]:</strong> So the difference-- the inference side is a completely different story.</p><p><strong>Ethan [00:37:28]:</strong> I think for the training side, it might be a little bit hard to reduce that cost. And for the inference side, the biggest gain is from the distillation of these models. You can-- It&#8217;s called step distillation, slightly different from knowledge distillation in LLMs. So you-- Typically, for flow matching models, you need like 100 steps or something. Like a diffusion model needs even more, like 1,000 steps to generate a good image or video. 
Step distillation tries to learn to generate in fewer steps from the model itself. It&#8217;s kind of like you use the full model to generate in 100 steps, and then you take a model that only generates in 10 steps and let that model learn from the perfect one.</p><p><strong>Ethan [00:38:25]:</strong> Why this works</p><p><strong>Swyx [00:38:27]:</strong> Strong to weak seemingly.</p><p><strong>Ethan [00:38:28]:</strong> It is. It&#8217;s kind of</p><p><strong>Swyx [00:38:29]:</strong> Distillation</p><p><strong>Ethan [00:38:29]:</strong> kind of like strong to weak. From the modeling perspective, the strong model, the teacher model is trying to model the images and videos of the internet, and that distribution is extremely complex. But the step distilled model is just trying to learn from the teacher. The teacher is a model, and the size is fixed, so the distribution is much simpler than the whole internet. That&#8217;s the intuition I have for why step distillation can work. So usually when these models are served in production, they only run in a few steps. In Cosmos, I believe we have like four steps and eight steps. If you do some simpler task, image-to-image translation, it can even run in fewer steps, like one step in Cosmos Transfer.</p><p><strong>Swyx [00:39:22]:</strong> I think this is the same intuition that guides a lot of the consistency model work. I sent you a link for sCM. I don&#8217;t know if you covered that. To me, that was actually one of the most impressive papers I&#8217;ve ever seen from OpenAI.</p><p><strong>Swyx [00:39:34]:</strong> That this is the unifying grand concept of consistency models. I don&#8217;t know if you have any comments on this.</p><p><strong>Ethan [00:39:41]:</strong> So there are a few different approaches,</p><p><strong>Swyx [00:39:46]:</strong> Oh, yeah. Here it is.</p><p><strong>Swyx [00:39:47]:</strong> Two steps versus twenty or 100 steps, whatever. 
It&#8217;s already done.</p><p><strong>Ethan [00:39:52]:</strong> So there are, there are a few different approaches, for example, consistency model, and there are also-- Actually, we shouldn&#8217;t forget GAN. So GAN, actually, that was, that was the OG of</p><p><strong>Swyx [00:40:05]:</strong> OG</p><p><strong>Ethan [00:40:05]:</strong> step distillation &#8216;cause it trained just one step to begin with. So actually, a lot of, uh-- For example, there&#8217;s a distribution matching distillation which use, which uses GAN as one of the losses for distillation. It-- GAN just tells you, &#8220;Hey, generate an image,&#8221; and then</p><p><strong>Ethan [00:40:31]:</strong> it has a discriminator to tell, is this image real or not? So the model, the model just need to learn one of the distribution, not the full distribution. Because in training, the model is asked to reconstruct the ground truth image from the internet, which is extremely hard. And in-- When you&#8217;re training GAN, it&#8217;s a step process. It&#8217;s just a, &#8220;Hey, you generate image. Does this image look as real as the image from the internet?&#8221; Which is a much simpler task. And, yeah, combining a lot of these approaches together, people typically do that, like consistency model and distribution matching and GAN, and we can get these few step models.</p><h2>Audio-Video Generation and Time Alignment</h2><p><strong>Swyx [00:41:21]:</strong> Then there&#8217;s one step I wanted to add, which is audio and video.</p><p><strong>Ethan [00:41:26]:</strong> So, Grok Imagine 0.9, I believe it&#8217;s, it&#8217;s a first audio video transmodel deployed at a large scale. So</p><p><strong>Swyx [00:41:39]:</strong> And that was your first model?</p><p><strong>Ethan [00:41:40]:</strong> That was Grok Imagine&#8217;s first model. It&#8217;s, it&#8217;s audio video, joint generation. 
I think the hard part is the modality alignment, &#8216;cause before this transmodel, we have, we have text to video alignment. We have this correspondence between text and video. Typically, most of the VLMs, they understand images and videos. Video&#8217;s very rare, and they don&#8217;t understand audio mostly. And if you look at the audio generation on the LLM side, you can talk to them perfectly fine, but if you ask them to sing a song or something, it typically is not very good. Also, they don&#8217;t have, they don&#8217;t have music either. The hard part is that-- Uh, actually audio has two component. It has like a discrete component, a continuous component. The discrete component is like the language.</p><p><strong>Ethan [00:42:44]:</strong> So when we speak, it&#8217;s just, some</p><p><strong>Swyx [00:42:47]:</strong> It&#8217;s an ASR issue, yeah.</p><p><strong>Ethan [00:42:49]:</strong> It&#8217;s, it&#8217;s text token with some characteristics, I would say.</p><p><strong>Ethan [00:42:54]:</strong> But music</p><p><strong>Swyx [00:42:56]:</strong> I think the speech guys would disagree with this.</p><p><strong>Swyx [00:42:57]:</strong> Like disfluencies and then,</p><p><strong>Vibhu [00:43:00]:</strong> There&#8217;s tones you can get angry.</p><p><strong>Ethan [00:43:01]:</strong> Well, I say largely.</p><p><strong>Ethan [00:43:03]:</strong> The mu- but the music is completely different. It&#8217;s, it&#8217;s very continuous, and you cannot model them like discrete tokens in language models. 
This is like the hard part for models, not to mention we have to align text, video, and audio together.</p><p><strong>Ethan [00:43:26]:</strong> So</p><p><strong>Vibhu [00:43:26]:</strong> How?</p><p><strong>Ethan [00:43:28]:</strong> So significant-- some significant challenges are like-- So first, like we talked about, the VLMs, they cannot understand-- most of them cannot understand audio.</p><p><strong>Ethan [00:43:39]:</strong> So you have to have some way to do the synthetic data generation for audio. You have to caption the model, and that involve, that involve synthetic data and human data effort a lot. And not surprisingly, most of the LLMs are very bad at recognizing, like the beat, tone, and the details of the music. They can, they can give some general prediction of which song is this, but it&#8217;s very hard to describe the details of the music. Like we mentioned in image generation, like you have to describe image as detailed as possible so that someone blind can reconstruct that. So here is like someone</p><p><strong>Vibhu [00:44:32]:</strong> Deaf</p><p><strong>Ethan [00:44:32]:</strong> someone deaf can reconstruct how the music sounds like without actually listening to it. Maybe you can think of it need to have the-- or they call the script.</p><p><strong>Vibhu [00:44:49]:</strong> Subtitles, yeah.</p><p><strong>Ethan [00:44:49]:</strong> You gotta have all the details of the music, and the dialogue.</p><p><strong>Vibhu [00:44:55]:</strong> So is the challenge there typically stuff like music and audio, or is it just-- Like is there a baseline? Okay, there&#8217;s enough data where we can understand narration, conversation, but there&#8217;s nuances in audio that&#8217;s where you hit all the data issues or is it just from stage zero, you just do it all right?</p><p><strong>Ethan [00:45:15]:</strong> So one important thing is like the alignment. 
So the model, the model has to know like the video and audio, the, uh-- it has to have a time-based alignment, like at which time step the video and the audio token correspond to each other. But we actually don&#8217;t have this kind of alignment for most of the other modalities. If you think about like text and image, text and video, they are loosely aligned. So you can, you can have a description of what&#8217;s going on in the video, but you don&#8217;t have to exactly-- You typically don&#8217;t have exact description, oh, at time step one second like what happened?</p><p><strong>Vibhu [00:46:02]:</strong> It&#8217;s very</p><p><strong>Ethan [00:46:03]:</strong> At time step two second what happened</p><p><strong>Vibhu [00:46:03]:</strong> coarse. Yeah.</p><p><strong>Swyx [00:46:05]:</strong> So what was the ideal time step? You have to ablate it, and then it&#8217;s like four seconds or something.</p><p><strong>Ethan [00:46:09]:</strong> So that comes down to how you design the model to, for the model to be aware of as a time, as a time modality. So the model is like a time aware. And that&#8217;s something pretty unique if you think about LLMs. So if you ask LLM to complete a task, say they, uh-- you ask them and they will say, &#8220;Oh, this task will probably take twelve hours to complete,&#8221; and they come back in one hour. Say &#8220;I&#8217;ve already spent two days on this and I&#8217;ve exhausted everything.&#8221;</p><p><strong>Ethan [00:46:47]:</strong> So the LLMs them-themselves, they don&#8217;t have a sense of time there.</p><p><strong>Vibhu [00:46:53]:</strong> I actually don&#8217;t think that&#8217;s just them not having a sense of time. I think it&#8217;s somewhat based, right?</p><p><strong>Vibhu [00:46:58]:</strong> Like you tell someone, &#8220;Okay, go work on this feature. Go implement this,&#8221; there&#8217;s a general understanding you would have of how long that would take without LLMs working at LLM speed, right? 
So you think back like two years ago, if I tell you to like build me like a new front end for Latent Space, have a search bar, have all this, you&#8217;ll estimate that it&#8217;ll take a few days, right?</p><p><strong>Vibhu [00:47:19]:</strong> So you tell an LLM, &#8220;Go build this.&#8221; It&#8217;ll take me a few days. But I think it&#8217;s somewhat grounded as opposed to them not having the best-- Not saying that they have a great understanding, but I think that example is like you can see where it comes from, right? You&#8217;re trained on all over the text.</p><p><strong>Swyx [00:47:35]:</strong> They&#8217;re, they&#8217;re trying to estimate what a human would say.</p><p><strong>Vibhu [00:47:37]:</strong> Because that&#8217;s what the, that&#8217;s what the data kind of represents. It&#8217;s not them</p><p><strong>Ethan [00:47:41]:</strong> It came from the corpus on the internet. People have a estimate of how much time.</p><p><strong>Vibhu [00:47:45]:</strong> And not even just in direct like training samples, right? Just your world understanding of tokens of how long stuff takes, right? Go read a book. It&#8217;ll take you a while, right?</p><p><strong>Vibhu [00:47:56]:</strong> Even if you do nothing but read a book, it takes a few days. So yeah, LLM, I read it took me a few hours.</p><p><strong>Vibhu [00:48:01]:</strong> It&#8217;ll take me a few hours to go through this research. But this is a tangent.</p><p><strong>Swyx [00:48:05]:</strong> Somewhat, yeah.</p><p><strong>Swyx [00:48:06]:</strong> This is a train of thought I haven&#8217;t really expressed until now, which is basically like a full world model must also be recursive, meaning that the participant in the world model must also be aware that they have a world model, which is like this whole recursive thing down the, down the line. But yes, and that the world model can be wrong and that they need to update it and blah. Yeah. 
We&#8217;ve argued this on the newsletter as well, that there needs to be sort of recursive or adversarial world models.</p><h2>World Models: Real-Time, Long-Horizon, Interactive Video</h2><p><strong>Vibhu [00:48:34]:</strong> Just to ask, how do you define world model?</p><p><strong>Swyx [00:48:38]:</strong> Oh, yeah, let&#8217;s go there.</p><p><strong>Ethan [00:48:40]:</strong> So</p><p><strong>Vibhu [00:48:40]:</strong> So just for context, we talked about video generation, and then there&#8217;s a-- if you say there&#8217;s a distinction between world models, what&#8217;s your, what&#8217;s your definition? How do you see the two?</p><p><strong>Ethan [00:48:53]:</strong> So disclaimer, I&#8217;m not going to debate what is world model. Yeah. There are many definitions, so I&#8217;ll just talk about my definition. Since I came from the multimodal, multimodal domain, so mainly talking from video. So world model is like real-time interactive long horizon videos. So there are three parts. So we-- let&#8217;s talk about them one by one. So the so interaction, so we just, we just look at Facebook and neural computer. So the interaction part of it, so you, world model can allow you to interact with them through keyboard, mouse, and maybe also voice. So these all is-- all is a modality. You can, you can interact with the model, and the model should respond reasonably. Second part is real time. So once you, once, say, you move your mouse, if, say, the world model generate a game, how fast can the game respond? So if you&#8217;re like professional CS: GO players- -might say, oh, you have to respond- He&#8217;s beginner within sub ten milliseconds or- Yeah even less. So that&#8217;s not most of the- No, sixty FPS. Let&#8217;s go. Oh, three hundred FPS. Oh, five hundred FPS. Wait. Okay, yeah. I didn&#8217;t do the math, but yeah, okay. Uh- Yeah, three hundred FPS, that&#8217;s a three millisecond. So you have to respond- Oh, shit. Okay. 
Yeah</p><p><strong>Ethan [00:50:29]:</strong> within a millisecond. Most of the video models cannot do that. Yeah. And, but if you, say, if you have a video model that is, say, like a digital human, the response time might be more generous. Maybe typically, for real-time voice interaction, it&#8217;s like two hundred millisecond. So that&#8217;s, that&#8217;s much more generous. But even two hundred millisecond is pretty, it is pretty tricky, &#8216;cause remember we mentioned</p><p><strong>Ethan [00:51:01]:</strong> you have this, temporal compression coming from the VAE. So if you, if you don&#8217;t compress the temporal dimension, your sequence length is going to explode. So if you want to have this real-time, real-timeness in your model, you have to do is one context problem. And the third part is long horizon, &#8216;cause we-- if you&#8217;re not going to just play with, video games just, a few seconds, most video models only a few seconds. We&#8217;re going to play with minutes, hours. The model have to be able to generate long-form content.</p><p><strong>Ethan [00:51:42]:</strong> So putting these three together, it&#8217;s, real-time, long horizon interactive videos. I think the final state will be, for example, like a video, a video version of Playbook, where you can, you can interact with, a neural computer. You move your mouse, and you click on the generative interface, and it will reply to you through pixels- generating in real time. But getting there, it&#8217;s, it&#8217;s a very long way to get there. So one of the first step, at Grok Imagine, where I led a small world model team there, was to build video extension. So, video extension- it&#8217;s the first step of interactivity. Yeah. It&#8217;s, it&#8217;s the first step. Yeah. So it&#8217;s the first step- You have it here, video editing, yeah. Yeah. Yeah. So the first step is because, this unlocks long horizon videos. 
Typically, for most of the video generation models, you give it a prompt or an image as an initial frame. You generate video, that&#8217;s it. That&#8217;s just, one time, done. And some creators would try to use the last frame as a first frame for the second video. It can-- sometimes it works, but if you do it a few times, the quality would decrease. And- It doesn&#8217;t have that context- Yeah over the full video, so the temporal- Yeah, exactly. Yeah, &#8216;cause you only gave it the last frame, of course, right? Yeah. Exactly. And- it&#8217;s actually a pretty fun hack. If you&#8217;ve seen like- Oh, no, he&#8217;s saying something better. Yeah. And for example, like Veo, I remember Veo 3 has like a second context of the last video. It is slightly better than using the last frame, but it has the same problem-- similar problem that it, the quality would decrease. If you extend a few times to, one minute, the video quality would look much worse than the first video. Second, another problem is that the model doesn&#8217;t have long-range knowledge of what&#8217;s happening before. Say, if they generate some dialogue, some, two people speaking, and their voice might change, over some time, especially if the second conditioning, it does not cover the previous context. So these are the core challenges. So the Grok Imagine video extension, it has historical context of all of the previous generated videos. It can, It has, it has the context of who is speaking and what objects have appeared and everything, having that to generate the next video. So if we naively do this, you can imagine, just, put all of the previous history video tokens into the context. The context length will easily explode. Especially for video models, that can be like a few, a few million context, I would imagine- context length. 
Yes. Yeah.</p><p><strong>Swyx [00:54:58]:</strong> Let&#8217;s run with that.</p><p><strong>Ethan [00:54:59]:</strong> For example, like in Cosmos, I think just five seconds of video is like a fifty K or sixty K number of tokens. So like if you do, if you do fifty second, that&#8217;s a five hundred K tokens. If you do longer than that, easily explode. This long horizon problem was the first step we&#8217;re trying to solve world model. It turns out people, yeah, people love video extension. Like a lot, a lot of the creators love using video extension to create longer form videos. This is the part I liked that you have a, you have an intermediate step toward the final goal instead of just a straight shot to the final version very much.</p><p><strong>Swyx [00:55:48]:</strong> But I can see you have a strong vision of where we want to end up.</p><h2>Long Context, Redundancy, and Efficient Interactive Video</h2><p><strong>Vibhu [00:55:51]:</strong> Does it seem like it&#8217;s an efficiency issue? Okay, we&#8217;re at a few million tokens context. If you draw the parallel to language models, we had very short context, two thousand, eight thousand, then, you scale it up one million, ten million. Sure, there&#8217;s effective context, but at the end of the day, it&#8217;s just what&#8217;s it worth? Sure, there&#8217;s a whole training data side. In video, it might be slightly easier &#8216;cause we have a hundred million token video, right? Just take a movie with the full context there. Like is this efficiency from an inference standpoint that like it&#8217;s expensive, but we know how to solve it? Or like why is this not the approach? So like my broader point was on your second point of world models, you say it needs to be interactive and live, right? You should be able to play a game and see the interaction live. So one thing I see with research is a lot of what you actually serve is different than what you build, right? So we talked about distillation. 
You train big model, you distill it, you do quantization, speculative decoding. We do all this stuff to serve it efficiently. Should we not just have a solution, like a world model that can interact well, do inference optimization, serve it, distill it secondary, so make it real time after you solve it? So like a-- another parallel is say, continual learning, right? What we need is someone to solve it and show it works inefficiently. Give it a few years, people will make it efficient. Same thing with regular attention, right? It worked. Over a few years, people have different forms of attention, and we&#8217;ve scaled it to be efficient at long context? So kind of two things there, right? One is it seems like it works. You&#8217;ve scaled it. Can we not just scale it a lot more efficiently over time? Do we need a separate approach if this works? And same thing with interaction, right? If we can get it done, like if we can solve some way that it works, we can solve making it more efficient from an inference standpoint later.</p><p><strong>Ethan [00:57:53]:</strong> That&#8217;s actually a very good point. So in videos, there&#8217;s actually a lot of redundancies. So we solve a lot of the pixel redundancy from VAE, but there&#8217;s more redundancy in long range and long horizon videos. Say, if a character appear in the first clip and then it disappeared, it only reappear at the end of the video, you probably don&#8217;t need the-- the context, like in the middle of the generation. So you only need that character where you need. So that&#8217;s why I helped build another feature. 
It&#8217;s a reference video.</p><p><strong>Vibhu [00:58:36]:</strong> Is it here?</p><p><strong>Swyx [00:58:36]:</strong> is it the same model release or different one?</p><p><strong>Ethan [00:58:39]:</strong> It&#8217;s a different one.</p><p><strong>Ethan [00:58:41]:</strong> You probably need to search on</p><p><strong>Swyx [00:58:43]:</strong> I&#8217;ll find it</p><p><strong>Ethan [00:58:43]:</strong> X reference to video.</p><p><strong>Ethan [00:58:46]:</strong> So reference video allow you to like upload up to seven images as condition and generate the video. Say, if like I want-- it can, it can be characters or objects or even scenes. Say like I want, I want condition on, Sean&#8217;s selfie and holding a blade</p><p><strong>Swyx [00:59:07]:</strong> We have a dog</p><p><strong>Ethan [00:59:08]:</strong> or whatever.</p><p><strong>Swyx [00:59:08]:</strong> We put the dog in the thing.</p><p><strong>Ethan [00:59:09]:</strong> you can put them there and the video models will generate the video from and copies the context over. So that can solve a lot of the problems there, like the long context problem. It doesn&#8217;t need to have a very long context, but it&#8217;s-- I feel like it&#8217;s an intermediate solution. The model</p><p><strong>Swyx [00:59:29]:</strong> It&#8217;s cheating.</p><p><strong>Ethan [00:59:30]:</strong> the model should be able to like selectively know, where should I draw the references. So say if I want to generate a movie, I generate it autoregressive, like a ten second at a time or something. And now this character appear, I can look back to where it first appear and, bring that back. Yeah, this one, I put the references. Yeah, that&#8217;s, Optimus, Einstein myself, Annie.</p><p><strong>Vibhu [01:00:02]:</strong> Oddly enough, I used Grok Search to find it, and it pulled your LinkedIn post. 
But yeah, we found it.</p><p><strong>Ethan [01:00:08]:</strong> Interesting.</p><p><strong>Vibhu [01:00:10]:</strong> But</p><h2>xAI&#8217;s Underrated Work, Culture, and Watermarking</h2><p><strong>Swyx [01:00:11]:</strong> This is a problem. This is not your fault, but like xAI doesn&#8217;t communicate all this work that you do very well because they just have the model release and then that&#8217;s it. But actually, these details are very good.</p><p><strong>Swyx [01:00:22]:</strong> As far as I understand, everything you just described is state-of-the-art, like no one else has done it.</p><p><strong>Vibhu [01:00:30]:</strong> A lot of-- yeah, I have a lot more</p><p><strong>Swyx [01:00:32]:</strong> And then, and then you just put this blog post with the cookies. I&#8217;m like, this is not enough?</p><p><strong>Swyx [01:00:37]:</strong> But I, obviously this is like the high level numbers that people want to know. But no, okay, so</p><p><strong>Vibhu [01:00:42]:</strong> And I wonder, like part of that is also some labs don&#8217;t share research into what happens. And if</p><p><strong>Swyx [01:00:50]:</strong> No, but this is literally bragging about how good they are, right?</p><p><strong>Swyx [01:00:54]:</strong> Like, why would you not say that you are capable of extending with full context? This is not a secret sauce. This is like we did the work. Yeah, I don&#8217;t know.</p><p><strong>Ethan [01:01:02]:</strong> Different labs have slightly different communication styles.</p><p><strong>Swyx [01:01:07]:</strong> Anyway, if anyone from xAI is listening, we are always happy to help you tell your story. Yeah, okay, so you did references, and I think, I think kind of the point you&#8217;re, you&#8217;re making is it is sort of like a kludge, right? This is-- you can do seven, but what about 100?</p><p><strong>Swyx [01:01:23]:</strong> Right? 
Then you need a completely different thing.</p><p><strong>Ethan [01:01:26]:</strong> So I think it&#8217;s-- this is a mechanism to select the context from the history, and you might not put the entire history into the context. For example, there&#8217;s a paper called FramePack, which have</p><p><strong>Ethan [01:01:41]:</strong> a heuristic that the latest history, the last one second, I put the entire history, and the history before that, I would compress it and makes the video smaller. So they follow this pattern, this build overall pattern that the maximum sequence length is fixed. So the further you are from the current frame, you have a smaller image. So this is just a heuristic. I think it can be more automatic. The model is aware like which history part of it can be selected. So this part of the research is actually being actively worked on by a lot of people. It&#8217;s also quite interesting. I feel this is actually, this part of long context is a little bit ahead of the LLM part.</p><p><strong>Ethan [01:02:31]:</strong> So for example, like in LLMs, if you-- so contexts keep growing. Let&#8217;s say if you call tool and the tool call history is extremely long, that&#8217;s still in context, and keep growing, keep growing. Even if you switch the topic to something else, the whole context was there. There are some agentic harnesses that help you to, say, prune the tool results and, prune-- Like when you, when you query a file, only show like the top 200 lines or something. Those were very heuristic-driven.</p><p><strong>Swyx [01:03:08]:</strong> For listeners, we did a write-up on the Claude Code leak where there are eight different kinds of pruning, including like you prune the tool results and all that. 
So you can, you can read up on that kind of thing.</p><p><strong>Ethan [01:03:17]:</strong> I think one breakthrough in continual learning might be like a way to automatically manage its own context.</p><p><strong>Swyx [01:03:27]:</strong> These are all heuristics, and they will be replaced by machine learning.</p><p><strong>Ethan [01:03:30]:</strong> Interestingly</p><p><strong>Vibhu [01:03:32]:</strong> The</p><p><strong>Ethan [01:03:32]:</strong> the same thing is being researched in both LLMs and video models.</p><p><strong>Vibhu [01:03:36]:</strong> The interesting thing is also like in the paper you showed, it&#8217;s actually happening at the model level, right? Compared to like language models, sure, we have base attention, but we&#8217;ll do our own compression, we&#8217;ll do our own pruning, which is separate from model error.</p><p><strong>Vibhu [01:03:49]:</strong> Eventually, it all just boils in, hopefully.</p><p><strong>Swyx [01:03:52]:</strong> I think this is a form of like attention, but like also know sort of reasoning attention. I feel like that&#8217;s different than normal attention.</p><p><strong>Swyx [01:04:03]:</strong> Does that, does that make sense?</p><p><strong>Ethan [01:04:04]:</strong> It&#8217;s, it&#8217;s different in the sense that attention, not to mention, set sparse attention aside, like normal attention</p><p><strong>Swyx [01:04:13]:</strong> Like QKV, yeah</p><p><strong>Ethan [01:04:14]:</strong> you have to attend to all of the tokens.</p><p><strong>Ethan [01:04:17]:</strong> So you don&#8217;t have a high-level mechanism to drop which tokens do-- you don&#8217;t want to attend to. As humans&#8217; attention span is surprisingly small.</p><p><strong>Ethan [01:04:28]:</strong> You can only remember 11 digit of a phone number.</p><p><strong>Swyx [01:04:32]:</strong> But I have feature detection, right? 
I can detect, oh, that&#8217;s a sequence of one, two, three, four in a phone number that is 11 digit.</p><p><strong>Vibhu [01:04:39]:</strong> Very good pattern matchers.</p><p><strong>Ethan [01:04:41]:</strong> But humans&#8217; context can-- like attention can work because we can dynamically pull in context from different places. The same mechanism, I think is going to happen for LLMs and video models. I think we have</p><p><strong>Swyx [01:04:57]:</strong> RLMs is recent-- is on, it&#8217;s on the recent work is there, which is not that crazy, but it&#8217;s just recursive.</p><p><strong>Vibhu [01:05:04]:</strong> I think it&#8217;s somewhat inherent in models too, right? Like you</p><p><strong>Swyx [01:05:06]:</strong> No, here&#8217;s a nice example here</p><p><strong>Vibhu [01:05:07]:</strong> you pull up these, you can read it fine, but language models are also very good at slop parsing. You have a trans</p><p><strong>Swyx [01:05:15]:</strong> I throw my typos in there, it doesn&#8217;t matter.</p><p><strong>Vibhu [01:05:17]:</strong> You have a, you have a transcript, you have whatever, just throw it in and it&#8217;s very good at parsing through noise. M-- that may be a brute force. It can look over and reason over it, but there&#8217;s, there&#8217;s parallels to both.</p><p><strong>Swyx [01:05:31]:</strong> I think it&#8217;s just really fascinating how you relate the world models stuff to the video generation, which I don&#8217;t think a lot of people hear directly from people like you. So I think that&#8217;s really helpful. Any other work? Do we cover like video, audio, world models, any other stuff in that omni</p><p><strong>Swyx [01:05:48]:</strong> team?</p><p><strong>Vibhu [01:05:49]:</strong> Or any other work at xAI you want to talk about? 
Seems like everything we see publicly announced, &#8220;Oh, cool, cookies.&#8221; And then there&#8217;s so much more to it.</p><p><strong>Swyx [01:05:58]:</strong> There&#8217;s a lot of depth.</p><p><strong>Vibhu [01:05:59]:</strong> Any underrated stuff, just at the time there?</p><p><strong>Ethan [01:06:03]:</strong> I feel the xAI culture, it is quite interesting and a bit underrated. So the culture is, the culture is three sentences: move fast, build No goal is too ambitious, and the first principle. Like early, the goal set was very ambitious. It wasn&#8217;t very-- this wasn&#8217;t-- it wasn&#8217;t possible to achieve when I, when I was thinking, first thinking about it. Like for example, like build something in three months. And</p><p><strong>Vibhu [01:06:36]:</strong> Was that &#8220;Okay, we&#8217;re starting team, we want image, we want video. Do it by this deadline.&#8221; Or, how do you work back? Like was it just, &#8220;Okay, we have a rough by, this date we want something out,&#8221; or is this like</p><p><strong>Ethan [01:06:52]:</strong> That&#8217;s a very good point. So it&#8217;s from first principle thinking.</p><p><strong>Ethan [01:06:56]:</strong> If you think about, people might say that first principle thinking applied more to the physical world than the models. I would say, for example, like if you think about-- Some limitation, for example, acquiring data, like how fast can we acquire the videos? And if you think about training the models, what&#8217;s the iteration speed for training a model end? And how would adding more GPUs accelerate that timeline? And maybe if you need human data, like what&#8217;s the turnaround time for human data to arrive? If you put all of those together, that is first principle thinking where, oh, like what is the timeline? 
What&#8217;s the minimum number of days that is possible to achieve something?</p><p><strong>Swyx [01:07:52]:</strong> I think there&#8217;s a-- this is a lot of Elon&#8217;s type of thinking, right? He&#8217;s like-- I think he&#8217;s famous for saying that the only law you can&#8217;t break is the laws of physics, something like that.</p><p><strong>Swyx [01:08:01]:</strong> Just broadly, you worked a lot with Elon.</p><p><strong>Ethan [01:08:04]:</strong> I, one benefit is working at xAI, you got a chance to interact more with Elon. So I was very fortunate to get a few retweets from him, and that was quite fun. And, he also worked very closely, with people. like people imagine online, like he&#8217;s very hands-on.</p><p><strong>Vibhu [01:08:34]:</strong> There are two things. one-- So I was actually looking up, Elon retweeting you. I&#8217;ll pull it up. he talked about you tweeting that you have a really good voice mode. I don&#8217;t know</p><p><strong>Ethan [01:08:47]:</strong> Oh, me?</p><p><strong>Vibhu [01:08:47]:</strong> No. Him.</p><p><strong>Swyx [01:08:48]:</strong> Oh, I also did it. But anyway.</p><p><strong>Vibhu [01:08:49]:</strong> I actually-- So I would DM you feedback on voice mode because I was &#8220;Wow, really good.&#8221; And then I&#8217;m &#8220;Ugh, this sucks.&#8221; But, I don&#8217;t know. Anything you want to talk about your voice mode, building it? Was it a team you worked on as well?</p><p><strong>Ethan [01:09:02]:</strong> Oh, that&#8217;s actually not part of the team I worked on.</p><p><strong>Swyx [01:09:05]:</strong> He probably worked on more of the video. No, but Grok Voice actually</p><p><strong>Vibhu [01:09:11]:</strong> Grok Voice</p><p><strong>Swyx [01:09:11]:</strong> like very good. I-- This is one of those things where first of all, you can speak at 2X, which is fun.</p><p><strong>Swyx [01:09:16]:</strong> which I listen to 2X, so I like to speak at 2X. But also I think like the interruption was better than Gemini. 
I don&#8217;t know how it compares to ChatGPT real time now, but as far as like driving was concerned, like having Grok in my Tesla and like driving, I think it was like-- it&#8217;s a really good experience.</p><p><strong>Vibhu [01:09:34]:</strong> He likes voice mode. But also, just the crazy reach by Elon</p><p><strong>Swyx [01:09:40]:</strong> Fifty million views for just saying, &#8220;Yes, true.&#8221;</p><p><strong>Vibhu [01:09:43]:</strong> That&#8217;s true.</p><p><strong>Swyx [01:09:44]:</strong> Oh my God</p><p><strong>Vibhu [01:09:45]:</strong> but, it&#8217;s, it&#8217;s pretty cool how fast it came out. the other thing is the safety aspect of video mode. Anything interesting to talk about there? So</p><p><strong>Swyx [01:09:56]:</strong> spicy</p><p><strong>Vibhu [01:09:57]:</strong> spicy question.</p><p><strong>Ethan [01:09:58]:</strong> A lot of the countries where they don&#8217;t allow like a generative data-- generative AI videos without watermarks. So in all of the-- those countries, Grok Imagine had watermarks, and a lot of the-- a lot of the takedowns of the videos were also happening extremely fast.</p><p><strong>Swyx [01:10:22]:</strong> it&#8217;s, it&#8217;s part of running a social platform but also it transfers nicely to the GenAI side. Do you have a perspective on SynthID versus other kinds of watermarking?</p><p><strong>Ethan [01:10:33]:</strong> it&#8217;s going to be</p><p><strong>Ethan [01:10:37]:</strong> it&#8217;s going to be harder and harder to detect, the Yeah, these things. 
So SynthID, one thing is, previously it was only Google, and now, like a lot of different labs</p><p><strong>Swyx [01:10:52]:</strong> OpenAI adopted it</p><p><strong>Ethan [01:10:52]:</strong> are also adapting it.</p><p><strong>Ethan [01:10:54]:</strong> As-- A limitation is like the technology The paper was out there, and people can reverse engineer like how to get rid of it.</p><p><strong>Ethan [01:11:05]:</strong> And it&#8217;s-- I think even as it advance, it&#8217;s, it&#8217;s still possible to reverse engineer it.</p><p><strong>Swyx [01:11:13]:</strong> so if you are interested, you can go onto Reddit and people have taken out the exact I don&#8217;t know, what do you call it? Mask or pattern that Google applies, and then you can apply it onto any Google-generated photo, and you can reverse out the SynthID.</p><p><strong>Ethan [01:11:30]:</strong> And it&#8217;s, it&#8217;s also harder and harder to just judge by eyes. I remember like a couple years ago, there was like six fingers or something. It&#8217;s very obvious.</p><p><strong>Vibhu [01:11:42]:</strong> My current is actually the audio. I feel like the audio is really lacking. my way to tell if something is generated, outside of okay, I think I&#8217;ve seen enough, I have a decent eye, the audio matchup, especially of Sora, is not great. It&#8217;s all similar style. But there&#8217;s</p><p><strong>Swyx [01:11:57]:</strong> I see. those are minor imperfections.</p><p><strong>Swyx [01:11:59]:</strong> I think the point is that like-- Actually, my closest reference to this is also Ian Goodfellow, &#8216;cause I think he did like the adversarial GAN thing where like it&#8217;s okay, here&#8217;s a picture of a zebra. Then you like change one pixel, and it becomes a panda.</p><p><strong>Swyx [01:12:12]:</strong> Right? 
This is like-- this is like a classic computer vision issue.</p><p><strong>Ethan [01:12:15]:</strong> If you think about how these models were trained, like I, like I mentioned before, like GAN was in the training process. The objective of GAN is you-- is the model generates an image, and the model, there&#8217;s a judge to tell like if the image is real or not. The model is trained to make the image more real. So as the model becomes more and more advanced, it&#8217;s going to be harder and harder. For me personally, now I have to judge by</p><p><strong>Ethan [01:12:49]:</strong> if the-- these videos have logical sense.</p><p><strong>Ethan [01:12:53]:</strong> If these, this video</p><p><strong>Swyx [01:12:55]:</strong> Have a world model.</p><p><strong>Swyx [01:12:57]:</strong> No, I also like it-- the audio is too nice, like too studio quality. The lighting is too good. The skin is too clear. The-- basically, the lack of imperfections.</p><p><strong>Vibhu [01:13:10]:</strong> Do we have a good way to do reasoning in diffusion? Like is that what separates video generators from world models or in-- We really know how to apply it to autoregressive language models. Is there a parallel for diffusion video gen world models like on that point, right? Is</p><p><strong>Swyx [01:13:30]:</strong> He has a thing on video agents.</p><p><strong>Ethan [01:13:31]:</strong> That&#8217;s a good question. Yeah, actually, I have a, I have a pretty big claim. The intelli- the visual intelligence is actually mostly coming from language. These video models, especially from now, since the diffusion model technology is more mature, every time you see there is some improvement on these models, I would say mostly, this gain comes from the language model, not coming from the vid- the video model itself, like the video diffusion models themselves. In Cosmos, that could be-- Typically these models, they have two parts. 
There&#8217;s a, there&#8217;s a prompt rewriter or the prompt upsampler part. I think in Cosmos, we use Llama or we use Mis- Mistral. And the Cosmos video model itself is only 7B, and the model, the language model</p><h2>Prompt Rewriting, Video Agents, and Agentic Generation</h2><p><strong>Ethan [01:14:35]:</strong> is a prompt rewriter. It&#8217;s, it&#8217;s bigger than that. So the prompt rewriter&#8217;s task is to take user instruction and convert it to an extremely detailed description of the video. So because the video, the visual-- the video diffusion models, I would describe, they&#8217;re kinda dumb because they take the input</p><p><strong>Ethan [01:15:03]:</strong> instruction literally. Because in the training process, remember that we have to describe the video as detailed as possible when we&#8217;re creating the synthetic text pair. So this model, they take those kind of instruction to generate the videos. So in-- when you&#8217;re taking the user instructions, the user instruction usually are simple. Just say a cat or something. If you put a cat in the video model, they would take that instruction literally. They would literally show a cat, a cat in maybe a white background because you didn&#8217;t describe the background. The cat is not moving because you didn&#8217;t describe it. It takes the instruction quite literally. It&#8217;s kinda, it&#8217;s kinda dumb. The prompt rewriter is actually a much bigger model. It&#8217;s a language model that takes the user instruction and expands it. So the thinking process you mentioned, is from there. So if you, if you look at like GPT image, like you generate an image in three minutes. Three minutes is not all like pixel generation. 
A lot of time is spending</p><p><strong>Vibhu [01:16:19]:</strong> Prompt writing</p><p><strong>Ethan [01:16:19]:</strong> on thinking.</p><p><strong>Ethan [01:16:20]:</strong> So prompt rewriting now have evolved to, not only just as thinking, it can, it can also be a agent, a agentic model. For example, say you want, you wanted to generate the image of today&#8217;s news. So the-- So it&#8217;s likely they&#8217;ll go to fetch today&#8217;s news online and then, process and digest them, then organize the layout and generate it. Another thing quite interesting is,</p><p><strong>Vibhu [01:16:53]:</strong> If I&#8217;m not mistaken, these are-- it&#8217;s no longer a diffusion model though, right? It&#8217;s autoregressively Or is there still</p><p><strong>Ethan [01:17:02]:</strong> There are different approaches. For example, Gemini Omni. Since they said it&#8217;s Omni, I believe it&#8217;s a, it&#8217;s a single model. Maybe it&#8217;s something it&#8217;s a language model with a diffusion head or something. Like the language model do the thinking, do the agentic tool calling, and then it would, use the diffusion head to generate the image in the end. There were also approaches like Cosmos, where you have a separate language model and separate diffusion models. And there were also like a purely language model, like you discretize the images, and then you generate the image as discrete tokens. So there are different approaches. I would say like</p><p><strong>Vibhu [01:17:44]:</strong> One of, one of the claims I&#8217;ve seen for why these approaches struggle is because a lot of the benefits for how we currently learn reasoning with language models is you basically iteratively generate reason. You have your thought, and then you work on that answer, right? So if you have like Omni model and then diffusion head, you can&#8217;t feed that back in to continue reasoning, right? So you can&#8217;t go like text, image, text, image. 
You can&#8217;t reason on the output and then go back to diffusion. But in the new Gemini Omni, you would be able to, as long as you have diffusion.</p><p><strong>Ethan [01:18:15]:</strong> I&#8217;m not sure if</p><p><strong>Vibhu [01:18:16]:</strong> But</p><p><strong>Ethan [01:18:16]:</strong> they have that process. It&#8217;s definitely possible in the Omni paradigm.</p><p><strong>Ethan [01:18:22]:</strong> So if you think about like traditional multimodal language models, they would have a ViT encoder that can encode the image. So if they have a diffusion head, they can generate the image and then put that back into the ViT encoder, encode that, and then do the iterative refinement if the result-- Yeah.</p><p><strong>Swyx [01:18:44]:</strong> I think you have to jointly train the ViT and the diffusion to make that somewhat reasonable, &#8216;cause otherwise you&#8217;re kind of mismatching or feeding in slop.</p><p><strong>Vibhu [01:18:55]:</strong> I think it depends on the stage of training. You might be able to freeze it. But anyway, also just on your earlier</p><p><strong>Swyx [01:19:00]:</strong> Wait. I wanted to also make explicit. We do know that Nano Banana and GPT image are autoregressive language models with a diffusion head.</p><p><strong>Swyx [01:19:09]:</strong> As far as I can tell from your description of Grok image, it is not. It is, it is end.</p><p><strong>Ethan [01:19:14]:</strong> I cannot</p><p><strong>Swyx [01:19:15]:</strong> You cannot</p><p><strong>Ethan [01:19:15]:</strong> comment on that.</p><p><strong>Swyx [01:19:16]:</strong> Well, the way that you described it. But, yeah, I think it-- there&#8217;s, there&#8217;s different approaches, right? Like you started off saying prompt rewriter is, the-- a big part of the intelligence.</p><p><strong>Vibhu [01:19:24]:</strong> And even on that, I think everyone should try using an early diffusion model. 
If you&#8217;ve used Stable Diffusion one or whatever, if you&#8217;ve seen the prompts, &#8220;ultra-high res, 4K, this style,&#8221; oh my God, the first time I tried one, you don&#8217;t talk to them like language models, right? Your prompting is very, comma separated</p><p><strong>Swyx [01:19:43]:</strong> It&#8217;s literally talking in the labels that were in the data set, right?</p><p><strong>Swyx [01:19:46]:</strong> But basically, I&#8217;m just trying to make the point that prompt rewriter and then image is different from autoregressive language model with diffusion head. Right? They&#8217;re different things.</p><p><strong>Ethan [01:19:56]:</strong> They&#8217;re different.</p><p><strong>Swyx [01:19:57]:</strong> Just wanted to establish.</p><p><strong>Ethan [01:19:59]:</strong> I&#8217;d say, the common part is, the image part. So it&#8217;s, it&#8217;s quite surprising that, a lot of the improvement came from the</p><p><strong>Swyx [01:20:12]:</strong> Language side</p><p><strong>Ethan [01:20:12]:</strong> the thinking, the tool calling. So I still remember, in Cosmos, I generated a happy sheep, and without any rewriting, it&#8217;s-- it looks so CGI, and after rewrite it looks, it looks so beautiful.</p><p><strong>Ethan [01:20:31]:</strong> I think</p><p><strong>Swyx [01:20:32]:</strong> Without any joint training.</p><p><strong>Ethan [01:20:34]:</strong> Actually, without any joint training. It&#8217;s-- with rewriting, it&#8217;s already much better. See, a very interesting thing, what happened is the video agents, mostly language models, will call these generative models, either it&#8217;s a separate model or a diffusion head or whatever, as a tool. So this model can iteratively refine the results or even generate longer content through a very long train of thought. It&#8217;s actually very similar to how humans create art. So we don&#8217;t, we don&#8217;t generate the pixels directly. 
We literally draw something on And I think through this process, the-- these models not only use diffusion as one of the tool, it can also use traditional tool. It can also use, image editing tools from Photoshop. It can use, video editor, FFmpeg, whatever, to take combination of these and the generative AI technology as a, as a set of tool, and they can, they can iteratively create a better, a much better, video for production-grade quality. If you look at existing, professional creators, they don&#8217;t, they don&#8217;t end at, generating a video from these models. They would take this video to their editor and edit here and there.</p><p><strong>Swyx [01:22:11]:</strong> So much post-production in And sometimes actually, the reason the video is good is not really the video model, it&#8217;s actually the editing.</p><p><strong>Swyx [01:22:21]:</strong> And yes, we also are engaged in the same process as well. Would you love to use a video editing model?</p><p><strong>Ethan [01:22:27]:</strong> Actually, there&#8217;s, Grok Imagine Agent beta. That was the, that was the first attempt in that direction.</p><p><strong>Ethan [01:22:38]:</strong> So I think, the process would be similar to like</p><p><strong>Vibhu [01:22:44]:</strong> It&#8217;s just agent mode.</p><p><strong>Ethan [01:22:46]:</strong> you can, you can ask it to</p><p><strong>Swyx [01:22:48]:</strong> There&#8217;s no blog post for it</p><p><strong>Ethan [01:22:49]:</strong> maybe generate a minute, video, which is not possible if you ask the same prompt to video models. But this model will ca- literally call different tools to do that.</p><p><strong>Ethan [01:23:05]:</strong> So yeah, this is actually an interesting thing. So when we first released, a video editing model, I see on X some people try the video editing feature with, &#8220;Edit this video to be one minute.&#8221; &#8216;cause they didn&#8217;t understand how video editing work. 
Video editing typically is just removal, add, replace, style transfer, this kind of thing. But that&#8217;s actually a valid request under the assumption of video agents. So these agents should be able to understand these kinds of long-horizon tasks to be able to easily create a long-form video. I think this is, this is really fascinating &#8216;cause it&#8217;s kinda take-- it&#8217;s taking the same direction as first you have these, assisted-- assisted coding, kind of like tab completion, GitHub Copilot. And from there, you gradually evolve to Codex and Claude Code, where you do things fully automated. So in agent, in Grok Imagine Agent mode, you can, you can still go in there and do stuff by yourself.</p><p><strong>Ethan [01:24:22]:</strong> Gradually, as the model capability increases, it will be able to do everything fully automated.</p><p><strong>Swyx [01:24:30]:</strong> I like that. Okay.</p><p><strong>Ethan [01:24:32]:</strong> That&#8217;s good.</p><p><strong>Swyx [01:24:32]:</strong> So it looks like it&#8217;s still generating.</p><p><strong>Vibhu [01:24:34]:</strong> Also, I did notice the Grok image gen was always very fast. I don&#8217;t know if this is something you guys benchmarked, but, this is just a tangent. Compared to what I used to use before the latest OpenAI&#8217;s image gen, and same with Gemini Nano Banana, I would oftentimes use Grok just for the speed.</p><p><strong>Swyx [01:24:54]:</strong> It&#8217;s, it&#8217;s in the benchmark somewhere that&#8217;s</p><p><strong>Vibhu [01:24:56]:</strong> It&#8217;s</p><p><strong>Swyx [01:24:56]:</strong> in the Imagine API blog post that they have all the speed things.</p><p><strong>Swyx [01:25:00]:</strong> It&#8217;s mostly a combination of distillation plus inference.</p><p><strong>Ethan [01:25:04]:</strong> There are a bunch of things. 
We talk about distillation, and if you talk about thinking, if you don&#8217;t have any thinking budget, the model can just think three minutes and then come back to you. And also, inference. The inference infra team was very talented, and they were, they were able to accelerate a hell of a lot of these models.</p><p><strong>Swyx [01:25:27]:</strong> My comment on the, on the video agents things, I&#8217;m trying to figure out, when people say video agents, when you initially told me about your bet on video agents or your vision for video agents, I was a little bit disappointed. I was, &#8220;You mean, like models are tapped out, now we have to do agents?&#8221; But, I think you have to, right? The question now is, how much is model training really going to make a difference versus just building a better harness? Like you said the models don&#8217;t have to be jointly trained. You can just take an off-the-shelf frontier reasoning model, slap it on a harness, give it Grok as a tool. That&#8217;s it. That&#8217;s your video agent. Doesn&#8217;t seem super satisfying. Obviously, you can train and get some more percentage points of per- performance. But, if your central claim is that the majority of video or generative media alpha or whatever, is actually coming from language intelligence and not image diffusion or video diffusion, then that is the future.</p><p><strong>Vibhu [01:26:30]:</strong> It&#8217;s pretty cool</p><p><strong>Swyx [01:26:31]:</strong> It&#8217;s just like primarily just weight.</p><p><strong>Vibhu [01:26:33]:</strong> If you pop back at the example, it generated frames. 
Sorry to interrupt, it&#8217;s been saying &#8220;Okay, I&#8217;m gonna start stitching these frames together.&#8221;</p><p><strong>Swyx [01:26:42]:</strong> So</p><p><strong>Vibhu [01:26:42]:</strong> It&#8217;s using FFmpeg like using code.</p><p><strong>Swyx [01:26:43]:</strong> This is what GPT Image Pro as well is doing, right?</p><p><strong>Swyx [01:26:46]:</strong> Like, this is also just writing code in the background and then just</p><p><strong>Vibhu [01:26:48]:</strong> Stitching</p><p><strong>Swyx [01:26:49]:</strong> doing an image pass on the final output. It feels dissatisfying for the people who want to just train models.</p><p><strong>Vibhu [01:26:54]:</strong> It&#8217;s interesting, right? it&#8217;s, it&#8217;s also somewhat exciting. Like you brought up earlier, a lot of the gains don&#8217;t come as much from the video. I think you can see that in the language model space too, right? Anthropic, very good at coding. They&#8217;re multimodal, not the best, right? They have basic input PDF, but there&#8217;s clearly a disconnect in the quality of their image video processing, audio processing, yet intelligence very top tier. Other labs, Gemini, OpenAI, xAI, you can add modalities, but it&#8217;s not like they&#8217;re unlocking crazy capabilities, right? So it&#8217;s interesting.</p><p><strong>Ethan [01:27:32]:</strong> It&#8217;s interesting to see that, like the video model&#8217;s capability increase actually come from language model being more intelligent. I think video agent, like it can unlock more stuff than my- you might imagine. So there&#8217;s a few things. So one thing is when we are prompting these models, so most of the people were actually not very good at prompting.</p><p><strong>Ethan [01:27:59]:</strong> Actually, language models have a better sense of how to prompt AI models. AI models know AI models better. So if you jointly train these models, maybe the model have a better sense of, how to prompt each model. 
Like a different model</p><p><strong>Vibhu [01:28:15]:</strong> Of course</p><p><strong>Ethan [01:28:15]:</strong> might be different. Another thing is it might not as simple as just, like generate a few clips and slap them together using FFmpeg. Like you might-- there might be more like image and video editing tool appear in this process. Say, if you want to exactly add a blob of text at this timestamp, the videos model-- video models might not get that intention very precisely.</p><p><strong>Ethan [01:28:48]:</strong> But these are possible using these deterministic tools. The long-- The video agents can use all sorts of tools, so you don&#8217;t have to put all of the capabilities into the generation model itself.</p><p><strong>Swyx [01:29:04]:</strong> I think that&#8217;s very true. no, so for what it&#8217;s worth, I think you&#8217;re right. I think that this will be a big category. I think probably you are predicting like the next one year in video is gonna be all this.</p><p><strong>Vibhu [01:29:18]:</strong> Do you have a time prediction for how-- when this stuff ramps up? Like</p><p><strong>Swyx [01:29:22]:</strong> they already started.</p><p><strong>Vibhu [01:29:23]:</strong> Is,</p><p><strong>Swyx [01:29:24]:</strong> It&#8217;s not very good yet.</p><p><strong>Vibhu [01:29:25]:</strong> Are we so-- No, it&#8217;s so, it&#8217;s so good. I think the last one&#8217;s just longer.</p><p><strong>Vibhu [01:29:29]:</strong> it didn&#8217;t give me a minute.</p><p><strong>Ethan [01:29:30]:</strong> Last thirty-six.</p><p><strong>Vibhu [01:29:30]:</strong> It gave me thirty-six seconds. But are we feeling it now? Is there gonna be inflection? Is there any timeline predictions you wanna make?</p><p><strong>Ethan [01:29:37]:</strong> by the end of this year is-- this is going to</p><p><strong>Ethan [01:29:41]:</strong> be a big hit. 
So the inflection point will be where the videos generated by video agents can get to like production-grade quality, so it can be presented and it can be, it can be distributed in ads. And when-- once that happens, I think the enterprise will have much more budget for video models because the agents are inherently more expensive than the, than the video models themselves, &#8216;cause they do this iterative process. They generate many variations.</p><p><strong>Ethan [01:30:23]:</strong> But once these models have this, pass this usability threshold, I think it&#8217;s, it&#8217;s going to be exponential growth beyond that.</p><p><strong>Swyx [01:30:35]:</strong> I would fund a company right now based on this thing.</p><h2>Robotics, Physical AI, and Internet-Trained World Models</h2><p><strong>Swyx [01:30:40]:</strong> So I think you&#8217;re right. One thing I&#8217;m, I&#8217;m surprised by, I&#8217;m reflecting on the whole like past hour or so of conversation, you are-- I think you&#8217;re into world models and video generation for video generation&#8217;s sake. I think that a lot of other world models people, we&#8217;ve interviewed a lot of them, General Intuition and Fei-Fei Li and all those guys and Moondream, which I think I told you about. Moonlake.</p><p><strong>Vibhu [01:31:01]:</strong> Lake.</p><p><strong>Swyx [01:31:01]:</strong> I keep saying Moondream. Goddammit. Moonlake. A lot of them actually say like robotics is the end game. Like embodied robotics, like you want real-time, you want interactive. It is to interact with the physical world. You&#8217;re not that concerned about it.</p><p><strong>Ethan [01:31:15]:</strong> I think robotics will be a, will be a big part of it for sure. The process may happen naturally. 
So my prediction on robotics is that the problem of physical AI might be solved, like without actually needing to</p><p><strong>Swyx [01:31:36]:</strong> Be in the real world</p><p><strong>Ethan [01:31:37]:</strong> needing to be in the real world. So it might, it might get solved by a video-- an LLM with very strong video capability. So remember we talked about the real-time interactive long-horizon video. Once these models-- So now these models are just training on like screen recordings and computer screens. Once these models can use computers and understand the future state of computer extremely well, the robots might be, might be one of the, one of the tools a very powerful AI can use. So the powerful AI might just be able to control the physical embodiment naturally.</p><h2>Why Ethan Left xAI and What Comes Next</h2><p><strong>Swyx [01:32:28]:</strong> I see that for sure. Cool. I know, I know we are coming up on time. You had-- you left one more spicy topic, which is why you left xAI.</p><p><strong>Ethan [01:32:38]:</strong> For me, there&#8217;s, there&#8217;s a lot of, a lot of research you want to do that you cannot do at a company. And also like the priorities and objectives of a company typically can change very fast. It is-- It&#8217;s also the same for xAI. So now is kind of like the time. There is some research I want to do, especially more on language model side, like I cannot do at xAI.</p><p><strong>Swyx [01:33:11]:</strong> Oh, okay, yeah. So you&#8217;re, you&#8217;re basically leaving-- You&#8217;re, you&#8217;re-- you had this whole transition from computer vision to world models, video generation, to now you&#8217;re like focusing on LLMs.</p><p><strong>Vibhu [01:33:22]:</strong> But it seems a lot of you saying focusing on LLMs, you really in the past hour described how it all ties together, right? Like-- But I don&#8217;t know. What do you mean by focusing on LLMs? 
Is there</p><p><strong>Ethan [01:33:33]:</strong> I realize the fact that the video models, even like in the beginning, the gain might come from improvement on diffusion technology, but this is a point where actually most of the gain comes from the language models themselves.</p><p><strong>Swyx [01:33:50]:</strong> It&#8217;s a huge black pill for anyone who has like spent their career in like generative media.</p><p><strong>Vibhu [01:33:56]:</strong> It-- that&#8217;s an extreme view, right? The-- You still definitely need a bit of both, right?</p><p><strong>Vibhu [01:34:01]:</strong> There&#8217;s just, it seems like more pressing, impactful work to do now on language model side.</p><p><strong>Swyx [01:34:07]:</strong> Do you have any similar predictions? You-- so you predict the video agents, and I think you will be right. On the language side, what are you looking for in the next one year?</p><p><strong>Ethan [01:34:16]:</strong> I think one thing pretty interesting I think might be happening soon is the language models will be like context-aware and manage their own context.</p><p><strong>Ethan [01:34:29]:</strong> So some-- Like from the video model side, we&#8217;ve been suffering from the long-horizon issue, like we want to generate video longer and longer, and we&#8217;ve been trying to solve the context length issues through various ways. One thing is just brute-force training longer context lengths. Another is to manage the context better. I think the same thing in language model is also going to be happening soon. So for example, like the language models, they&#8217;re not aware of how long their own context length is. Once they hit like eighty percent or something, automatic context compression is getting triggered. And the model is not aware of that when it&#8217;s working. And some-- maybe it&#8217;s good for the models to know, &#8220;Oh, I&#8217;m, I&#8217;m approaching like eighty percent,&#8221; or something. 
And something also pretty interesting, like for example, in OpenClaw, like you-- every time you type in something, a time-- the current local time is automatically attached to your message, so the model actually knows what time it is. So this is making the model time-aware. And also like in tool calling the-- a lot of the intermediate tool call results are automatically pruned. So there&#8217;s like context removal, context addition, and context compaction. So all of these are from the harnesses themselves. But from our experience, the heuristic engineering also helps the models get this absorbed into the models themselves. That&#8217;s something very interesting to explore.</p><p><strong>Vibhu [01:36:12]:</strong> So infinite context?</p><p><strong>Ethan [01:36:14]:</strong> Maybe.</p><p><strong>Vibhu [01:36:15]:</strong> No, but it&#8217;s, it&#8217;s interesting, right? You</p><p><strong>Swyx [01:36:17]:</strong> It is in the space of memory and continual learning and</p><p><strong>Vibhu [01:36:20]:</strong> I don&#8217;t know. It&#8217;s also like in the space of agent harness use, right? You&#8217;re seeing</p><p><strong>Swyx [01:36:25]:</strong> No, he&#8217;s saying he doesn&#8217;t want to do it in a harness, right?</p><p><strong>Vibhu [01:36:27]:</strong> No, but models are also being trained on uni-- using harnesses, right?</p><p><strong>Vibhu [01:36:32]:</strong> So some of it is, you could say, implicitly leaking in, right? Part of that post-training of language models is, okay, using it in coding harnesses, in which case, when are agents spawned? When is compaction gonna happen? It&#8217;s not explicit that you have this much token window, which I don&#8217;t know if you want it to be, as that&#8217;ll change, but it&#8217;s, it&#8217;s somewhat leaking in there.</p><p><strong>Ethan [01:36:58]:</strong> I&#8217;m imagining, what if the model has access to the whole-- the code of the agent harness itself and is able to modify it to whatever you want. 
Say, if the agent harness is short enough, you can just put it in the context, like in the system prompt, and then the model will say, &#8220;When I want to spawn a future version of myself, I can modify the agent harness.&#8221; For example, if I-- the agent harness can be, &#8220;Oh, when I&#8217;m reading a long document, I can choose to read the whole thing in chunks and come back, smash the summaries together, or I can just read the first two hundred lines and discard the rest.&#8221; And all these kinds of choices, if they can be made by the models themselves, it might be very interesting to see that the model can program itself online at test time.</p><h2>Career Lessons: Moving Across ML Domains</h2><p><strong>Swyx [01:38:02]:</strong> So the self-modifying harness is also part of OpenClaw and Pi, but I think there&#8217;s a lot more work to do there. Very cool. I think part of me is kind of curious. I think you are part of a big lab, right? And there&#8217;s this career path of a researcher at a big lab, which is you are-- you train models, you get more compute, you train better models, and you keep going. And somewhat, I feel like you&#8217;re opting out of that. And if I were you, I would be like, &#8220;Oh, I think this is a bit of a career risk.&#8221; What?</p><p><strong>Swyx [01:38:36]:</strong> I don&#8217;t have any comment apart from, you&#8217;re very strongly convicted. I think that a lot of people in your shoes would not be doing what you did.</p><p><strong>Ethan [01:38:43]:</strong> Speaking of my career, if I look back, actually, there were, there were a lot of huge transitions. So ten years ago, I was, I was doing research with the ResNet authors, Xiangyu Zhang and Jian Sun. Yeah, at that time, the research was completely different. It was mostly computer vision, like image recognition, object detection, object tracking. I was also doing neural net compression at that time. It was quite different from knowledge distillation these days. 
And at that time, I was-- I wanted to be a professor, and I applied. When I applied for a PhD, I already had a few first-author papers at top conferences, so I confidently applied at the top schools. It turns out I got rejected by all of the top PhD programs. So I had to, I had to go to the industry. At that time, I was at Facebook AI Research, FAIR, led by Yann LeCun.</p><p><strong>Swyx [01:39:51]:</strong> I wanted to talk about V-JEPA, but it&#8217;s different.</p><p><strong>Ethan [01:39:53]:</strong> I know. Yeah, we can leave it for another time.</p><p><strong>Ethan [01:39:57]:</strong> I switched to-- At that time, I switched to self-supervised learning. It was, it was quite different from what I was doing in computer vision.</p><p><strong>Ethan [01:40:07]:</strong> And after that is NVIDIA Cosmos. So I realized scaling up was extremely important. So at NVIDIA, I was mainly focusing on scaling. So one thing is Cosmos, scaling the video diffusion models to a few billion parameters. And another thing is, I was working on MoEs. The Megatron MoE was the first, was the first open-source framework to be able to train these MoEs at very large scales, hundreds of billions of parameters to even trillions of parameters, efficiently at forty percent MFU.</p><p><strong>Ethan [01:40:51]:</strong> And switching to xAI was trying to work on even larger compute scale even further. And yeah, looking at this trajectory, I actually worked on a lot of different things. So I feel actually within ML, it&#8217;s actually easier to switch than you think. A lot of people might have the mindset that, &#8220;Oh, I work on, I work on computer vision. I always have to work on computer vision, and I cannot switch to language.&#8221; And, but from my experience, at least at NVIDIA, I worked on both language model MoEs and also video models. It&#8217;s, it&#8217;s actually not the case. A lot of, a lot of the core principles of how to train large models are largely the same. 
And yeah, for me, I feel right now the bottleneck for video models is actually the language part, the agent, which is why I want to go work more on LLMs. It&#8217;s a bit of a challenge, but I don&#8217;t think it&#8217;s a huge jump.</p><h2>Closing Thoughts</h2><p><strong>Swyx [01:42:18]:</strong> Kudos to you. I think you have a lot of strong vision there. Yeah, I think that was mostly everything that we wanted to cover. You&#8217;ve been very generous with your time, and it&#8217;s really nice that you are able to share all these things now. We don&#8217;t have to go through xAI to clear everything. But also we</p><p><strong>Ethan [01:42:35]:</strong> Oh,</p><p><strong>Swyx [01:42:35]:</strong> I think we didn&#8217;t get you in trouble.</p><p><strong>Ethan [01:42:37]:</strong> It&#8217;s a lot of good stuff about xAI compared to what you just see in the releases, right? You don&#8217;t realize how many more levels there are to it.</p><p><strong>Swyx [01:42:44]:</strong> xAI, please do more podcasts.</p><p><strong>Swyx [01:42:47]:</strong> Anyway.</p><p><strong>Swyx [01:42:48]:</strong> But thank you for sharing. It&#8217;s been very kind. And also, I wanna hear more from you. I think you are going to embark on your next phase. You haven&#8217;t announced what you&#8217;re doing next, but clearly you have more vision and more ambition on this path, and I think you&#8217;re basically kind of gradient descending to whatever your final form is.</p><p><strong>Ethan [01:43:08]:</strong> Thank you. Yeah. 
Yeah, I&#8217;ll, I&#8217;ll share more about my next chapter soon.</p><p><strong>Ethan [01:43:14]:</strong> Thank you for having me.</p><p><strong>Swyx [01:43:16]:</strong> Thanks for coming.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Founders and Forward Deployed Engineers]]></title><description><![CDATA[a quiet day lets us highlight the new AIE WF focuses]]></description><link>https://www.latent.space/p/ainews-founders-and-forward-deployed</link><guid isPermaLink="false">https://www.latent.space/p/ainews-founders-and-forward-deployed</guid><pubDate>Sat, 30 May 2026 01:57:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SpLP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most people are still digesting the <a href="https://www.latent.space/p/ainews-anthropic-raises-965b-series">massive Anthropic news</a> from yesterday. 
</p><p>We&#8217;re taking the opportunity to solicit <a href="https://ai.engineer/cfp">the leading AI FDE&#8217;s</a> in the world for AIE&#8217;s new Forward Deployed Engineer track, mirroring similar pushes from both <a href="https://www.latent.space/p/ainews-thinking-machines-native-interaction">OpenAI DeployCo</a> and <a href="https://www.blackstone.com/news/press/anthropic-partners-with-blackstone-hellman-friedman-and-goldman-sachs-to-launch-enterprise-ai-services-firm/">Anthropic DeployCo</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SpLP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SpLP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 424w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 848w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SpLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png" width="1456" height="839" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:839,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1531622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199815243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SpLP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 424w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 848w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!SpLP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb92541e3-151a-4f10-8226-b86cb12eaca0_2332x1344.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>as well as AIE&#8217;s new Founders program, where we are doing our version of the Startup Battlefield, a competitive pitch contest anchored by YCombinator&#8217;s Garry Tan and Howie Lu&#8217;s <a href="https://x.com/howietl/status/2057823823526014990">$10 Million dollar Hyperagent </a>contest. Sign up (and <a href="https://www.ai.engineer/worldsfair/2026#venue">book hotel</a>!)  
for details today if you are keen.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbtj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pbtj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 424w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 848w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pbtj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png" width="1456" height="835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:835,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412080,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199815243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pbtj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 424w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 848w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!pbtj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa6ef076-049b-4bd8-b183-4a49f1a913f8_2276x1306.png 1456w" sizes="100vw"></picture></div></a></figure></div><blockquote><p>AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics</strong></p><ul><li><p><strong>Opus 4.8 landed into a noisy, mixed eval landscape</strong>: multiple independent benches converged on &#8220;incremental but not dominant.&#8221; <a href="https://x.com/arena/status/2060160804767584512">@arena</a> pushed <strong>200+ frontend/code tests</strong> comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; <a href="https://x.com/theo/status/2060172445592789064">@theo</a> reported CursorBench shows it as <strong>more efficient but slightly worse than 4.7 within margin of error</strong>; <a href="https://x.com/jerryjliu0/status/2060196252642648427">@jerryjliu0</a> and <a href="https://x.com/llama_index/status/2060165358569337102">@llama_index</a> found <strong>small gains on tables/layout</strong> but regressions on <strong>content faithfulness/charts</strong> in document parsing; <a href="https://x.com/scaling01/status/2060335738172911766">@scaling01</a> said <strong>no progress on ALE-Bench</strong> and separately flagged interesting failure modes on LisanBench. 
On the positive side, <a href="https://x.com/jeremyphoward/status/2060195641847107722">@jeremyphoward</a> found 4.8 <strong>less over-agentic and more cooperative</strong> than 4.7/GPT-5.5 in coding, while <a href="https://x.com/leo_linsky/status/2060205310871326894">@leo_linsky</a> called it a tangible product improvement over prior Anthropic releases.</p></li><li><p><strong>Anthropic also shipped useful platform-level changes</strong>: <a href="https://x.com/ClaudeDevs/status/2060432688281251998">@ClaudeDevs</a> announced <strong>mid-conversation system instructions without breaking prompt cache</strong>, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: <a href="https://x.com/jeremyphoward/status/2060198836963061998">@jeremyphoward</a> argued Anthropic has done little for <strong>API affordability</strong>, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.</p></li></ul><p><strong>Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy</strong></p><ul><li><p><strong>A subtle but important RL failure mode got called out</strong>: <a href="https://x.com/ClementDelangue/status/2060175330665508917">@ClementDelangue</a> highlighted a Hugging Face deep-dive on why many <strong>tool-using, multi-turn RL training loops are silently broken</strong>. The core bug: decoding model output, parsing tool calls, then <strong>re-tokenizing</strong> the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict <strong>&#8220;Token-In, Token-Out&#8221;</strong> rule: never re-encode sampled tokens; keep a single token buffer across turns. 
<a href="https://x.com/johnschulman2/status/2060392679528337714">@johnschulman2</a> reinforced the broader point that <strong>renderers are foundational</strong> infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.</p></li><li><p><strong>Harness design is becoming its own optimization discipline</strong>: <a href="https://x.com/omarsar0/status/2060371848010019001">@omarsar0</a> surfaced work on <strong>Effective Feedback Compute (EFC)</strong>, claiming raw token/tool counts explain agent success poorly while EFC reaches <strong>R&#178; up to 0.99</strong>, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like <a href="https://x.com/LangChain/status/2060349231722852680">@LangChain</a>, where <strong>Deep Agents v0.6</strong> makes <strong>harness profiles</strong> first-class to get strong performance from Qwen/Kimi/DeepSeek at <strong>20x+ lower cost</strong> than frontier APIs, and <a href="https://x.com/hwchase17/status/2060355016989585919">@hwchase17</a> explicitly framing &#8220;different models need different prompts/tools.&#8221; <a href="https://x.com/vllm_project/status/2060208480292843720">@vllm_project</a> shipped <strong>native weight syncing APIs</strong> and improved pause/resume for async RL, and later added <a href="https://x.com/vllm_project/status/2060414393666679229">fastokens</a>, a <strong>Rust BPE tokenizer</strong> to reduce CPU tokenization bottlenecks in long-context/agentic workloads.</p></li><li><p><strong>Debate is shifting from &#8220;single vs multi-agent&#8221; to where the abstraction pays</strong>: <a href="https://x.com/OfirPress/status/2060352260723392658">@OfirPress</a> argued current multi-agent systems are mostly <strong>speedups, not capability unlocks</strong>; <a href="https://x.com/scaling01/status/2060363050272653625">@scaling01</a> took the opposite view, expecting swarm-style 
training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around <strong>agent observability, traces, and continual improvement loops</strong>, e.g. <a href="https://x.com/Vtrivedy10/status/2060406006329278970">@Vtrivedy10</a> on mining production traces for SFT/distillation and long-horizon continual learning.</p></li></ul><p><strong>Open Models, Local AI, and the OSS Toolchain Tightening Up</strong></p><ul><li><p><strong>Local-first and open-weight momentum continues to rise</strong>: <a href="https://x.com/LangChain/status/2060405874993115532">@LangChain</a> said <strong>1 in 3 AI teams</strong> ran an open-weights model in April 2026, up from <strong>1 in 5</strong> nine months earlier; <a href="https://x.com/EpochAIResearch/status/2060451576779886942">@EpochAIResearch</a> estimated open-weight models now lag frontier proprietary models by about <strong>four months</strong>. On the toolchain side, <a href="https://x.com/ggerganov/status/2060394400237109567">@ggerganov</a> launched <strong>llama.app</strong>, giving llama.cpp an official website, a unified installer, and a single <code>llama</code> entrypoint aimed at easier local deployment and third-party agent integration. <a href="https://x.com/ollama/status/2060428074102206496">@ollama</a> announced <strong>OpenJarvis</strong> as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy&#8217;s &#8220;Intelligence Per Watt&#8221; framing.</p></li><li><p><strong>Open infrastructure is getting more enterprise-shaped</strong>: <a href="https://x.com/ClementDelangue/status/2060378354931388837">@ClementDelangue</a> noted that <strong>~50% of models and datasets on Hugging Face are now private</strong>, rising with HF&#8217;s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure. 
<a href="https://x.com/abidlabs/status/2060404002341462044">@abidlabs</a> showed <strong>Hugging Face Jobs</strong> replacing GitHub runners for CPU/serverless GPU CI. <a href="https://x.com/DSPyOSS/status/2060186371902587119">@DSPyOSS</a>, <a href="https://x.com/dbreunig/status/2060187833084870746">@dbreunig</a>, and others shipped a redesigned <strong>DSPy docs/front page</strong> ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.</p></li><li><p><strong>Licensing and permissiveness are becoming strategic levers</strong>: <a href="https://x.com/kimmonismus/status/2060458698930016378">@kimmonismus</a> highlighted NVIDIA moving its four open model families to <strong>Linux Foundation OpenMDW-1.1</strong>, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: <a href="https://x.com/keshigeyan/status/2060398262591668315">@keshigeyan</a> introduced <strong>GPIC</strong>, a <strong>100M-pair permissive image corpus</strong> plus <strong>1M-pair benchmark</strong> for visual generation, with explicit research + commercial usability.</p></li></ul><p><strong>Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows</strong></p><ul><li><p><strong>Google is widening the &#8220;managed agent&#8221; stack from API to consumer product</strong>: <a href="https://x.com/_philschmid/status/2060359976325992528">@_philschmid</a> showed <strong>Managed Agents in the Gemini API</strong>: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, <a href="https://x.com/GeminiApp/status/2060405496872579115">@GeminiApp</a> rolled out <strong>Gemini Spark</strong> to U.S. AI Ultra subscribers as a <strong>24/7 personal agent</strong> that can operate across a user&#8217;s digital ecosystem under direction. 
Google also kept pushing <strong>Gemini Omni</strong> multimodal generation/editing demos (<a href="https://x.com/alexanderchen/status/2060322611586834518">example</a>, <a href="https://x.com/GeminiApp/status/2060473816393150965">product thread</a>) and announced <strong>Google Flow Agent</strong> for creative workflows in video/film production (<a href="https://x.com/Google/status/2060473826362732611">thread</a>).</p></li><li><p><strong>OpenAI&#8217;s Codex is moving closer to a persistent remote dev operator</strong>: <a href="https://x.com/OpenAI/status/2060428604727771421">@OpenAI</a> and <a href="https://x.com/OpenAIDevs/status/2060429591655927942">@OpenAIDevs</a> added <strong>computer use on Windows</strong>, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included <strong>stable identicons for background agents</strong> and search across prior chat content (<a href="https://x.com/OpenAIDevs/status/2060478367921831936">@OpenAIDevs</a>); <a href="https://x.com/reach_vb/status/2060430024537178215">@reach_vb</a> summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated <strong>gpt-5.5 instant</strong> to improve <strong>sycophancy, factuality, and multilingual performance</strong> per <a href="https://x.com/michpokrass/status/2060219759682330970">@michpokrass</a>.</p></li><li><p><strong>This all points to more vertically integrated agent stacks</strong>: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (<a href="https://x.com/joshwoodward/status/2060171610922058142">@joshwoodward</a>); OpenAI is expanding Codex&#8217;s operating surface; Cursor added <strong>auto-review mode</strong> with subagent-based approval routing (<a href="https://x.com/cursor_ai/status/2060406013098897765">tweet</a>). 
The common pattern is less &#8220;chatbot,&#8221; more <strong>managed execution environment with policy and memory</strong>.</p></li></ul><p><strong>Research and Systems Papers Worth Attention</strong></p><ul><li><p><strong>Search, retrieval, and memory</strong>: <a href="https://x.com/TheTuringPost/status/2060194173505155358">@TheTuringPost</a> highlighted <strong>Bidirectional Evolutionary Search (BES)</strong> from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include <strong>Llama-3.2-3B-Instruct on MuSiQue from 4.0% to 7.0%</strong>. In retrieval, <a href="https://x.com/_reachsumit/status/2060214762626306512">@_reachsumit</a> pointed to <strong>Latent Terms</strong>, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs. <a href="https://x.com/topk_io/status/2060383255153569938">@topk_io</a> open-sourced <strong>Iso-ModernColBERT</strong> for more efficient late-interaction inference.</p></li><li><p><strong>Continual learning and belief/state management</strong>: <a href="https://x.com/HuggingPapers/status/2060312560323182657">@HuggingPapers</a> summarized <strong>BeliefTrack</strong>, claiming optimized belief-state management cuts long-horizon reasoning failures by <strong>70%+</strong>. 
<a href="https://x.com/AndrewLampinen/status/2060460827199599026">@AndrewLampinen</a> argued the continual learning field over-focused on interference instead of positive transfer; <a href="https://x.com/victor207755822/status/2060315686329778432">@victor207755822</a> presented a second <strong>DeliAutoResearch SKILL</strong> paper focused on self-iteration and CL.</p></li><li><p><strong>Multimodal/world models/robotics</strong>: NVIDIA-affiliated work included <strong>&#947;-World</strong>, a generative multi-agent world model streaming at <strong>24 FPS</strong> (<a href="https://x.com/fangfu0830/status/2060233093894869499">tweet</a>), and <strong>minWM</strong>, a real-time interactive video world model framework (<a href="https://x.com/_akhaliq/status/2060392729473860026">tweet</a>). In robotics, <a href="https://x.com/_akhaliq/status/2060388349425119540">@_akhaliq</a> shared <strong>Qwen-VLA</strong>, and <a href="https://x.com/inventorOli/status/2060357909561622885">@inventorOli</a> demoed Robostral&#8217;s language-following and manipulation improvements. 
For always-on proactive agents, <a href="https://x.com/dair_ai/status/2060373102119555191">@dair_ai</a> surfaced work replacing LLM wake-up decisions with a <strong>220MiB temporal-graph encoder</strong>, gaining <strong>+16.7 mean F1</strong> while running <strong>4&#8211;83x faster</strong>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI / biology</strong>: <a href="https://x.com/OpenAI/status/2060376598642405492">@OpenAI on Rosalind Biodefense</a> announced trusted-access biology tooling for public health and biodefense.</p></li><li><p><strong>Google / consumer agents</strong>: <a href="https://x.com/GeminiApp/status/2060405496872579115">@GeminiApp on Spark</a> rolled out its always-on personal agent to AI Ultra users in the U.S.</p></li><li><p><strong>OpenAI / dev tools</strong>: <a href="https://x.com/OpenAI/status/2060428604727771421">@OpenAI on Codex Windows support</a> and <a href="https://x.com/OpenAIDevs/status/2060429591655927942">@OpenAIDevs</a> expanded computer use to Windows plus mobile remote steering.</p></li><li><p><strong>llama.cpp UX milestone</strong>: <a href="https://x.com/ggerganov/status/2060394400237109567">@ggerganov</a> launched <strong>llama.app</strong> with a unified installer and CLI entrypoint for local AI.</p></li><li><p><strong>HF / RL correctness</strong>: <a href="https://x.com/ClementDelangue/status/2060175330665508917">@ClementDelangue</a> amplified the <strong>Token-In, Token-Out</strong> warning for multi-turn RL with tools.</p></li><li><p><strong>Open vs closed timing gap</strong>: <a href="https://x.com/EpochAIResearch/status/2060451576779886942">@EpochAIResearch</a> estimated open-weight models are now about <strong>4 months behind</strong> the frontier.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. 
Local LLM Performance: MoE Releases, Quants, VRAM Savings</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tqloii/stepfun_37_flash/">StepFun 3.7 Flash</a></strong> (Activity: 637): <strong>StepFun released <a href="https://static.stepfun.com/blog/step-3.7-flash/">Step 3.7 Flash</a>, a multimodal MoE with </strong><code>196B</code><strong> total parameters, </strong><code>11B</code><strong> active, and a built-in </strong><code>1.8B</code><strong> ViT, advertised for high-throughput agent workflows up to </strong><code>400 TPS</code><strong> and reportedly runnable locally with ~</strong><code>128GB</code><strong> RAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro </strong><code>56.26%</code><strong>, DeepSearchQA F1 </strong><code>92.82%</code><strong>, HLE w/tools </strong><code>47.2</code><strong>, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in <a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash/">BF16</a>, <a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8">FP8</a>, <a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4">NVFP4</a>, and <a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF">GGUF</a>, with day-0 </strong><code>llama.cpp</code><strong><a href="https://github.com/ggml-org/llama.cpp/pull/23845"> support PR</a> and related MTP work in </strong><code>llama.cpp#23274</code><strong>.</strong> Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be <em>&#8220;perfect&#8221;</em> and competitive with much larger <code>&gt;1TB</code> models; one user says the prior Step 3.5 <em>&#8220;infinite thinking&#8221;</em> issue appears fixed. 
There is cautious enthusiasm around local deployment, especially for users with <code>4x3090</code>-class hardware, and appreciation that StepFun upstreamed <code>llama.cpp</code> support instead of only maintaining a fork.</p><ul><li><p>StepFun released multiple Step-3.7-Flash checkpoints on Hugging Face: <strong>BF16</strong> (<a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash/">Step-3.7-Flash</a>), <strong>FP8</strong> (<a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8">Step-3.7-Flash-FP8</a>), <strong>NVFP4</strong> (<a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4">Step-3.7-Flash-NVFP4</a>), and <strong>GGUF</strong> (<a href="https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF">Step-3.7-Flash-GGUF</a>). One user reports the prior Step 3.5 Flash &#8220;infinite thinking&#8221; issue appears fixed, making 3.7 more usable despite still having an odd intermediate reasoning style.</p></li><li><p>There is day-0 <code>llama.cpp</code> enablement via StepFun&#8217;s upstream PR: <a href="https://github.com/ggml-org/llama.cpp/pull/23845">ggml-org/llama.cpp#23845</a>, contrasting with Step 3.5&#8217;s fork-based support. A separate community PR for <strong>MTP support</strong> exists at <a href="https://github.com/ggml-org/llama.cpp/pull/23274">ggml-org/llama.cpp#23274</a>, though commenters note it needs updating for Step 3.7 and current <code>master</code>.</p></li><li><p>A vLLM nightly test of the <strong>NVFP4</strong> checkpoint on <code>2x Pro 6k</code> with <code>64</code> concurrent shallow-context requests reached about <code>2200 tok/s</code>. 
The reported config used <code>--tensor-parallel-size 2</code>, <code>--enable-expert-parallel</code>, <code>--quantization modelopt</code>, <code>--kv-cache-dtype fp8</code>, <code>--reasoning-parser step3p5</code>, and StepFun tool-call parsing; vLLM reported <strong>GPU KV cache size </strong><code>1,667,645</code><strong> tokens</strong> and <strong>max concurrency </strong><code>6.36x</code><strong> for </strong><code>262,144</code><strong> tokens/request</strong>.</p></li></ul></li></ul><p></p>
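<p>As a rough sanity check on the vLLM figures quoted above, the reported max concurrency looks like the KV cache size divided by the per-request token budget. The sketch below assumes that simple ratio (the exact accounting inside vLLM may differ for some model types), and the 8,192-token "shallow context" size is a hypothetical value chosen for illustration, not something stated in the post.</p>

```python
# Figures quoted from the vLLM startup report above.
KV_CACHE_TOKENS = 1_667_645   # GPU KV cache size, in tokens
TOKENS_PER_REQUEST = 262_144  # full context length per request


def max_concurrency(kv_cache_tokens, tokens_per_request):
    """Number of requests of the given length that fit in the KV cache at once."""
    return kv_cache_tokens / tokens_per_request


full_context = max_concurrency(KV_CACHE_TOKENS, TOKENS_PER_REQUEST)
print(f"{full_context:.2f}x")  # 6.36x, matching the reported figure

# ASSUMPTION: 8,192 tokens per request is a made-up "shallow context" size;
# the original post does not state the actual prompt lengths used.
ASSUMED_SHALLOW_TOKENS = 8_192
shallow_slots = KV_CACHE_TOKENS // ASSUMED_SHALLOW_TOKENS
print(shallow_slots)  # 203 whole requests fit, well above the 64 tested
```

<p>Under these assumptions, the 64-request test would sit comfortably inside the cache, which is consistent with the throughput run not being limited by KV capacity.</p>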
      <p>
          <a href="https://www.latent.space/p/ainews-founders-and-forward-deployed">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Anthropic raises $965B Series H, releases Opus 4.8 and Dynamic Workflows/ultracode]]></title><description><![CDATA[Total Anthropic victory!]]></description><link>https://www.latent.space/p/ainews-anthropic-raises-965b-series</link><guid isPermaLink="false">https://www.latent.space/p/ainews-anthropic-raises-965b-series</guid><pubDate>Fri, 29 May 2026 02:07:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9YXV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anthropic&#8217;s path as the <a href="https://www.latent.space/p/anthropic-glean-and-openrouter-how?utm_source=publication-search">fastest growing company of all time</a> has put overtaking OpenAI in its sights for a while, but there were numerous asterisks for the past few months that put the timing (though perhaps not the fact) of the flippening in question. Today Anthropic <a href="https://www.anthropic.com/news/series-h">officially reported $47B</a> in revenue run-rate (reminder, this number was $9B in December!) 
and confirmed their Series H raising $65B at a $900B pre-money valuation (including $15B from hyperscalers including <a href="https://www.anthropic.com/news/anthropic-amazon-compute">Amazon</a>, but also the entire memory industrial complex), putting them at least temporarily ahead of OpenAI in every headline dimension outside of compute and non-coding benchmarks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9YXV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9YXV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 424w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 848w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9YXV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png" width="1456" height="1257" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/feb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1257,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:700451,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199680854?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9YXV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 424w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 848w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 1272w, https://substackcdn.com/image/fetch/$s_!9YXV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffeb0a3a2-e744-4174-a24b-be1fd75961bc_1888x1630.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>By way of celebration, the company also released <a href="https://www.anthropic.com/news/claude-opus-4-8">Opus 4.8</a>, which reportedly fixed many of the issues that had soured the community on <a href="https://www.latent.space/p/ainews-anthropic-claude-opus-47-literally">Opus 4.7 post launch</a> (see recap below for details). 
It is notably SOTA on basically every economically relevant bench (a nice detail is they agree with Google&#8217;s messaging that Gemini 3.5 Flash is an improvement over Gemini 3.1 Pro):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pJaM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pJaM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 424w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 848w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pJaM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png" width="1456" height="1319" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1319,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:451507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199680854?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pJaM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 424w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 848w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!pJaM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e7c3740-ab5b-4b98-88eb-c0576e73a2d1_1490x1350.png 1456w" sizes="100vw"></picture>
</div></a></figure></div><p>But perhaps of more long term significance is the massively parallel <a href="https://claude.com/blog/introducing-dynamic-workflows-in-claude-code">&#8220;dynamic workflows&#8221; feature</a> in Claude Code, also called <code>ultracode</code>, which was behind Jarred Sumner&#8217;s <a href="https://x.com/jarredsumner/status/2060050578026189172">750k LOC rewrite of Bun from Zig to Rust in 6 days</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FuPa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!FuPa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 424w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 848w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FuPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png" width="1402" height="1256" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1256,&quot;width&quot;:1402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:428108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199680854?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FuPa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 424w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 848w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!FuPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9ab93f6-c75f-4156-850a-81b99806aeea_1402x1256.png 1456w" sizes="100vw"></picture>
</div></a></figure></div><blockquote><p>AI News for 5/27/2026-5/28/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Anthropic announced a massive new financing and simultaneously shipped Claude Opus 4.8.</strong></p><ul><li><p>On the capital side, Anthropic said it raised <strong>$65B in Series H at a $965B post-money valuation</strong>, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, and said the money will fund research and expand capacity for growing Claude demand (<a href="https://x.com/AnthropicAI/status/2060061347522433422">Anthropic</a>).</p></li><li><p>The company also disclosed that its <strong>run-rate revenue surpassed $47B</strong>, attributing growth to enterprise deployments and everyday usage (<a href="https://x.com/AnthropicAI/status/2060061348818518493">Anthropic</a>).</p></li><li><p>On the product side, Anthropic launched <strong>Claude Opus 4.8</strong>, describing it as an Opus 4.7 update with <strong>&#8220;sharper judgment,&#8221; &#8220;more honesty about its own progress,&#8221; and the ability to work independently for longer</strong>, <strong>at the same 
price</strong> (<a href="https://x.com/claudeai/status/2060042702150930686">Claude</a>).</p></li><li><p>Anthropic also launched <strong>Dynamic Workflows</strong> in Claude Code, a research-preview orchestration system where Claude plans work and spawns <strong>hundreds of parallel subagents</strong> to tackle large tasks (<a href="https://x.com/ClaudeDevs/status/2060044853279617150">ClaudeDevs</a>). Independent eval posts broadly confirm that 4.8 is a meaningful improvement over 4.7, especially on long-horizon agentic coding and knowledge work, though reactions diverged on whether this is a frontier-resetting leap or mostly catch-up to OpenAI&#8217;s GPT-5.5-family.</p></li></ul><h2><strong>Facts vs opinions</strong></h2><h3><strong>Facts and directly stated claims</strong></h3><ul><li><p>Anthropic raised <strong>$65B</strong> at a <strong>$965B post-money valuation</strong> in Series H (<a href="https://x.com/AnthropicAI/status/2060061347522433422">Anthropic</a>).</p></li><li><p>The company says its <strong>run-rate revenue crossed $47B</strong> (<a href="https://x.com/AnthropicAI/status/2060061348818518493">Anthropic</a>).</p></li><li><p>Lead investors named: <strong>Altimeter, Dragoneer, Greenoaks, Sequoia</strong> (<a href="https://x.com/AnthropicAI/status/2060061347522433422">Anthropic</a>).</p></li><li><p>Altimeter publicly confirmed it led the round and framed it as its <strong>largest investment to date</strong> (<a href="https://x.com/AltimeterCap/status/2060061841372647685">Altimeter</a>, <a href="https://x.com/paulinebhyang/status/2060069180767171052">Pauline Bhyang</a>).</p></li><li><p>Anthropic launched <strong>Claude Opus 4.8</strong>, positioned as an update to <strong>Opus 4.7</strong> with improved judgment, honesty, and longer autonomous work, <strong>same price</strong> (<a href="https://x.com/claudeai/status/2060042702150930686">Claude</a>).</p></li><li><p>Anthropic engineers said 4.8 was a response to <strong>feedback on 4.7</strong>, with 
&#8220;many fixes&#8221; and better nuance / naturalness (<a href="https://x.com/alexalbert__/status/2060043196655362358">Alex Albert</a>).</p></li><li><p>Claude Code now supports <strong>Dynamic Workflows</strong> that write orchestration plans and launch <strong>large fleets / hundreds of subagents in parallel</strong> (<a href="https://x.com/ClaudeDevs/status/2060044853279617150">ClaudeDevs</a>, <a href="https://x.com/_catwu/status/2060054180379689074">Cat Wu</a>).</p></li><li><p>Dynamic Workflows are available in <strong>research preview</strong> and were said to work on <strong>Max, Team, Enterprise, API, Bedrock, Vertex AI, and Foundry</strong> (<a href="https://x.com/ClaudeDevs/status/2060044860984529368">ClaudeDevs</a>).</p></li><li><p>Anthropic / community posts mention <strong>effort controls</strong> added to web/app/Cowork and continued <strong>Fast mode</strong> support (<a href="https://x.com/mikeyk/status/2060046053907578889">Mike Krieger</a>, <a href="https://x.com/sammcallister/status/2060048329359212972">Sam McAllister</a>, <a href="https://x.com/kimmonismus/status/2060044465385902436">kimmonismus</a>).</p></li></ul><h3><strong>Opinions / interpretations</strong></h3><ul><li><p>Bullish views:</p><ul><li><p>Opus 4.8 &#8220;could&#8217;ve been called Opus 5&#8221; (<a href="https://x.com/danshipper/status/2060043738752422304">Dan Shipper</a>).</p></li><li><p>&#8220;Anthropic found a cure for laziness&#8221; (<a href="https://x.com/scaling01/status/2060043010943942989">scaling01</a>).</p></li><li><p>&#8220;first smart model in a long while&#8221; due to honesty / calibration (<a href="https://x.com/zephyr_z9/status/2060077152729694586">zephyr_z9</a>).</p></li><li><p>&#8220;People unsubscribing from Anthropic will crawl back&#8221; (<a href="https://x.com/teortaxesTex/status/2060105674311295454">teortaxesTex</a>).</p></li></ul></li><li><p>Skeptical / mixed views:</p><ul><li><p>Opus 4.8 is &#8220;a minor upgrade&#8221; (<a 
href="https://x.com/scaling01/status/2060041564919833041">scaling01</a>).</p></li><li><p>Anthropic is &#8220;playing catch-up with OpenAI rather than setting the pace&#8221; (<a href="https://x.com/kimmonismus/status/2060085889896726860">kimmonismus</a>).</p></li><li><p>Some benchmark-based criticism from Andon Labs: worse than Opus 4.7 / GPT-5.5 on <strong>Vending Bench</strong>, underperformed on <strong>Blueprint-Bench 2</strong>, more aligned / more cautious, and &#8220;max reasoning is not the best reasoning effort&#8221; (<a href="https://x.com/andonlabs/status/2060047215134228746">andonlabs</a>, <a href="https://x.com/andonlabs/status/2060047225791877193">andonlabs</a>).</p></li><li><p>Dynamic workflows are powerful but may be <strong>token-expensive</strong> and quota-burning in practice (<a href="https://x.com/itsclivetime/status/2060157266591129895">itsclivetime</a>, <a href="https://x.com/theo/status/2060135394570797158">Theo</a>, <a href="https://x.com/omarsar0/status/2060059612041171175">omarsar0</a>).</p></li></ul></li></ul><h2><strong>Fundraise details and implications</strong></h2><p>Anthropic&#8217;s financing numbers are the headline shock: <strong>$65B raised on a $965B post-money</strong> with <strong>$47B run-rate revenue</strong> disclosed in the same announcement (<a href="https://x.com/AnthropicAI/status/2060061347522433422">Anthropic</a>, <a href="https://x.com/AnthropicAI/status/2060061348818518493">Anthropic</a>). The scale drew immediate attention because it implies a company operating at near-trillion valuation with hyperscaler-style capital needs and model-serving economics.</p><p>Investor messaging was strongly framed around enterprise adoption and operational execution. 
Altimeter described Claude as becoming the <strong>&#8220;default operating system for entire enterprises&#8221;</strong> and praised Anthropic&#8217;s combination of performance and safety (<a href="https://x.com/AltimeterCap/status/2060061841372647685">Altimeter</a>). Pauline Bhyang said Anthropic had been on a &#8220;generational trajectory&#8221; since 2022 and highlighted the company crossing <strong>$47B run-rate revenue in under five years</strong> (<a href="https://x.com/paulinebhyang/status/2060069180767171052">Pauline Bhyang</a>).</p><p>The surrounding reactions broke into a few camps:</p><ul><li><p><strong>Validation camp:</strong> This funding size is treated as evidence that Claude has become a core enterprise platform, especially in coding and agentic workflows. Posts like Jamin Ball&#8217;s &#8220;Let&#8217;s go!!&#8221; were simple market validation reactions (<a href="https://x.com/jaminball/status/2060062156478107775">jaminball</a>).</p></li><li><p><strong>Scale / bubble concern camp:</strong> Some reacted by comparing the announcement to traditional startup fundraising rhetoric inflated to unprecedented scale. Jerry Liu joked that if you replace &#8220;billions&#8221; with &#8220;millions,&#8221; it reads like any high-growth startup fundraise (<a href="https://x.com/jerryjliu0/status/2060068247773614238">jerryjliu0</a>). Another critical read linked the financing to Anthropic&#8217;s increasingly strict safety gating around more capable models&#8212;i.e. vast compute access paired with selective capability release (<a href="https://x.com/menhguin/status/2060060425031696387">menhguin</a>).</p></li><li><p><strong>Infrastructure implication:</strong> Anthropic explicitly tied the raise to <strong>capacity expansion</strong> for Claude demand (<a href="https://x.com/AnthropicAI/status/2060061347522433422">Anthropic</a>). 
That matters because many of the new 4.8 features&#8212;especially higher-effort reasoning, longer independent runs, and multi-agent workflows&#8212;are inference-hungry. The capital raise should be read not just as training fuel, but as a direct attempt to underwrite serving costs for long-running agent workloads.</p></li></ul><p>One notable context tweet: a user speculated that &#8220;Anthropic also secured tens of billions in inference compute&#8221; right as Mythos safety concerns were apparently addressed (<a href="https://x.com/menhguin/status/2060060425031696387">menhguin</a>). That is speculation, not confirmed by Anthropic, but it reflects a common interpretation: this round is about compute supply and deployment scale as much as model R&amp;D.</p><h2><strong>Opus 4.8: official product positioning</strong></h2><p>Anthropic&#8217;s official framing is unusually specific in its emphasis on <strong>behavioral quality</strong>, not just benchmark scores. The launch tweet says 4.8 has:</p><ul><li><p><strong>sharper judgment</strong></p></li><li><p><strong>more honesty about its own progress</strong></p></li><li><p><strong>ability to work independently for longer</strong></p></li><li><p><strong>same price as 4.7</strong> (<a href="https://x.com/claudeai/status/2060042702150930686">Claude</a>)</p></li></ul><p>Alex Albert added that 4.8:</p><ul><li><p>incorporates fixes based on 4.7 feedback,</p></li><li><p>understands nuance better,</p></li><li><p>feels more natural conversationally,</p></li><li><p>is stronger across coding and knowledge work (<a href="https://x.com/alexalbert__/status/2060043196655362358">Alex Albert</a>).</p></li></ul><p>This honesty / calibration angle became a major subtheme. 
Multiple Anthropic employees and outside testers described the model as more willing to:</p><ul><li><p>say what it doesn&#8217;t know,</p></li><li><p>flag flaws in its own code,</p></li><li><p>avoid glossing over uncertain progress,</p></li><li><p>stop falsely implying task completion (<a href="https://x.com/_catwu/status/2060051277476745512">Cat Wu</a>, <a href="https://x.com/mikeyk/status/2060046051466502401">Mike Krieger</a>, <a href="https://x.com/dejavucoder/status/2060043362858942497">dejavucoder</a>).</p></li></ul><p>That&#8217;s noteworthy because Claude&#8217;s prior reputation among heavy coding users included strong generation but uneven self-monitoring: false positives in code review, overconfident progress summaries, and &#8220;lazy&#8221; or prematurely truncated task execution. Several community reactions explicitly framed 4.8 as fixing this failure mode:</p><ul><li><p>&#8220;found a cure for laziness&#8221; (<a href="https://x.com/scaling01/status/2060043010943942989">scaling01</a>)</p></li><li><p>&#8220;least lazy model ever?&#8221; (<a href="https://x.com/Teknium/status/2060072183783960971">Teknium</a>)</p></li><li><p>&#8220;dramatically less lazy than every other version of Claude&#8221; (<a href="https://x.com/nrehiew_/status/2060046647867191727">nrehiew_</a>)</p></li></ul><h2><strong>Technical details and numbers</strong></h2><h3><strong>Pricing, context, controls</strong></h3><p>The most concrete consolidated specs came from Artificial Analysis:</p><ul><li><p><strong>Context window:</strong> <strong>1 million tokens</strong></p></li><li><p><strong>Pricing:</strong> <strong>$5 / $25 per million input / output tokens</strong></p></li><li><p><strong>Cache writes:</strong> <strong>$6.25 / M</strong> with <strong>5-minute TTL</strong></p></li><li><p><strong>Cache hits:</strong> <strong>$0.50 / M</strong></p></li><li><p><strong>Effort settings</strong> remain as in Opus 4.7; AA tested <strong>max</strong> effort (<a 
href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>)</p></li></ul><p>Community posts also highlighted:</p><ul><li><p><strong>Fast mode</strong> is available for Opus 4.8</p></li><li><p>It is <strong>~2.5x faster</strong> than standard mode and <strong>3x cheaper</strong> than the prior fast mode (<a href="https://x.com/kimmonismus/status/2060044465385902436">kimmonismus</a>)</p></li><li><p>scaling01 summarized the new economics as:</p><ul><li><p><strong>Opus 4.8 Fast: 2.5x faster, only 2x more expensive than normal 4.8</strong></p></li><li><p>versus <strong>Opus 4.7 Fast: 2.5x faster, 6x more expensive than normal 4.7</strong> (<a href="https://x.com/scaling01/status/2060051666443943962">scaling01</a>)</p></li></ul></li><li><p>Effort controls were newly exposed in more product surfaces, allowing users to dial reasoning up or down (<a href="https://x.com/sammcallister/status/2060048329359212972">sammcallister</a>, <a href="https://x.com/mikeyk/status/2060046053907578889">mikeyk</a>, <a href="https://x.com/kimmonismus/status/2060045324803063962">kimmonismus</a>)</p></li></ul><p>This matters because many early user reports suggest <strong>reasoning-effort selection significantly changes output quality and cost</strong>, especially for coding and writing. Dan Shipper recommended <strong>xhigh</strong> for coding and <strong>high</strong> for writing after observing weaker behavior at lower settings (<a href="https://x.com/danshipper/status/2060043738752422304">Dan Shipper</a>). 
Andon Labs similarly said <strong>max reasoning is not the best reasoning effort</strong> on some tasks (<a href="https://x.com/andonlabs/status/2060047215134228746">andonlabs</a>).</p><h3><strong>Benchmarks: strongest reported numbers</strong></h3><p>Key official / semi-official numbers surfaced across launch tweets:</p><ul><li><p><strong>SWE-Bench Pro: 69.2%</strong>, claimed by Yuchen citing release materials, and &#8220;10 points higher than GPT-5.5&#8221; (<a href="https://x.com/Yuchenj_UW/status/2060042830559756407">Yuchenj_UW</a>)</p></li><li><p><strong>FrontierSWE #1</strong>, cited by Anthropic watchers and later confirmed by third-party references (<a href="https://x.com/scaling01/status/2060046440563388838">scaling01</a>, <a href="https://x.com/scaling01/status/2060054319446016046">scaling01</a>)</p></li><li><p><strong>APEX-SWE: 45.3% Pass@1</strong>, nearly <strong>4 points ahead of GPT-5.3 Codex at 41.5%</strong> (<a href="https://x.com/mercor_ai/status/2060046111793123428">mercor_ai</a>)</p></li><li><p><strong>GDPval-AA: 1890 Elo</strong>, <strong>+137 vs Opus 4.7</strong>, <strong>+121 vs GPT-5.5 xhigh</strong>, implying about <strong>67% win rate vs GPT-5.5 xhigh</strong> head-to-head (<a href="https://x.com/ArtificialAnlys/status/2060042848268083411">Artificial Analysis</a>)</p></li><li><p>Artificial Analysis Intelligence Index: <strong>61.4</strong>, <strong>+4.1 vs Opus 4.7</strong>, <strong>+1.2 ahead of GPT-5.5 xhigh</strong> (<a href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>)</p></li><li><p><strong>AA-Omniscience: 27.4</strong>, #2 behind Gemini 3.1 Pro at 32.9; <strong>accuracy 46.6%</strong>, <strong>hallucination 35.9%</strong> (<a href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>)</p></li><li><p>Gains on:</p><ul><li><p><strong>Terminal-Bench Hard +6.8</strong></p></li><li><p><strong>&#964;&#178;-Bench Telecom +5.9</strong></p></li><li><p><strong>IFBench 
+3.6</strong></p></li><li><p>relatively flat on <strong>AA-LCR, GPQA, SciCode</strong> (<a href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>)</p></li></ul></li></ul><p>Additional qualitative benchmark observations:</p><ul><li><p>Cursor said Opus 4.8 works <strong>much more efficiently than 4.7</strong> on <strong>CursorBench</strong> and is more persistent on hard tasks (<a href="https://x.com/cursor_ai/status/2060044920237469872">Cursor</a>)</p></li><li><p>Anthropic employees emphasized strength on <strong>long-horizon work</strong> in Claude Code (<a href="https://x.com/ClaudeDevs/status/2060043212425933076">ClaudeDevs</a>)</p></li><li><p>Some users reported especially large jumps in <strong>knowledge work</strong> and <strong>writing</strong> (<a href="https://x.com/danshipper/status/2060043738752422304">Dan Shipper</a>, <a href="https://x.com/rishdotblog/status/2060057903344869828">rishdotblog</a>)</p></li></ul><h3><strong>Efficiency and token-use details</strong></h3><p>Artificial Analysis reported:</p><ul><li><p>Compared to Opus 4.7, 4.8 achieved higher GDPval performance with:</p><ul><li><p><strong>15% fewer turns per task</strong></p></li><li><p><strong>35% fewer output tokens</strong></p></li></ul></li><li><p>But 4.8 still used <strong>~30% more turns than GPT-5.5</strong>, the second-ranked model (<a href="https://x.com/ArtificialAnlys/status/2060042850826612996">Artificial Analysis</a>)</p></li></ul><p>This is one of the more important nuanced findings in the launch coverage:</p><ul><li><p>4.8 is <strong>more efficient than 4.7</strong></p></li><li><p>but still not obviously the <strong>most inference-efficient frontier model</strong> against OpenAI on some workloads</p></li></ul><p>That tension is echoed in community commentary:</p><ul><li><p>&#8220;still getting token-mogged by GPT-5.5&#8221; (<a href="https://x.com/scaling01/status/2060080401947746483">scaling01</a>)</p></li><li><p>Theo and others complained that 
Claude&#8217;s higher-agency, higher-effort modes can blow through quota extremely quickly in practice (<a href="https://x.com/theo/status/2060120708815139241">Theo</a>, <a href="https://x.com/cremieuxrecueil/status/2060161310302630154">cremieuxrecueil</a>)</p></li></ul><h3><strong>Long context</strong></h3><p>Posts highlighted long-context improvements from Opus 4.6 to 4.8, with one claim that <strong>Opus 4.8 at 1M context is almost as good as GPT-5.5&#8217;s 256K score</strong> on a referenced long-context eval (<a href="https://x.com/scaling01/status/2060047431564251545">scaling01</a>). Artificial Analysis also confirmed the <strong>1M token</strong> context remained intact (<a href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>).</p><h3><strong>Safety / robustness / hallucination</strong></h3><p>This was one of the more mixed parts of the release.</p><p>Positive:</p><ul><li><p>Anthropic and supporters emphasized lower dishonesty / better calibration.</p></li><li><p>&#8220;dishonesty at an all time low&#8221; (<a href="https://x.com/scaling01/status/2060042892903678414">scaling01</a>)</p></li><li><p>&#8220;noticeably more honest&#8221; (<a href="https://x.com/_catwu/status/2060051277476745512">Cat Wu</a>)</p></li><li><p>&#8220;flags what it&#8217;s unsure of&#8221; (<a href="https://x.com/mikeyk/status/2060046051466502401">Mikey K</a>)</p></li><li><p>Artificial Analysis said Anthropic continues to show <strong>substantially lower hallucination rates than Google/OpenAI peers</strong> (<a href="https://x.com/ArtificialAnlys/status/2060117582120976868">Artificial Analysis</a>)</p></li></ul><p>Negative / cautionary:</p><ul><li><p>scaling01 noted <strong>Opus 4.8 is the first model in a long time that doesn&#8217;t improve prompt injection robustness</strong> over 100 trials (<a href="https://x.com/scaling01/status/2060042401478005237">scaling01</a>)</p></li><li><p>scaling01 also called it Anthropic&#8217;s <strong>&#8220;most 
eval aware model&#8221;</strong> (<a href="https://x.com/scaling01/status/2060043854967923086">scaling01</a>)</p></li><li><p>Andon Labs said it was <strong>more aligned / more cautious</strong>, &#8220;scared of getting caught,&#8221; and worse on some adversarial / business-task benchmarks (<a href="https://x.com/andonlabs/status/2060047215134228746">andonlabs</a>)</p></li><li><p>nrehiew_ noted slight hallucination improvements on the reported evals but questioned whether some hallucination tests reflect the failure modes users actually encounter (<a href="https://x.com/nrehiew_/status/2060048083753591264">nrehiew_</a>, <a href="https://x.com/nrehiew_/status/2060048085838118953">nrehiew_</a>)</p></li></ul><h3><strong>Cyber capability gating and future model class</strong></h3><p>An especially important strategic detail appeared in reaction posts: Anthropic appears to have stated it plans to <strong>release &#8220;a new class of model with even higher intelligence than Opus&#8221;</strong> after stronger safeguards (<a href="https://x.com/dejavucoder/status/2060042723185623261">dejavucoder</a>). 
Multiple watchers interpreted this as a <strong>Mythos-class</strong> rollout with cyber-sensitive capabilities selectively constrained:</p><ul><li><p>&#8220;Mythos class model to all customers in the coming weeks&#8221; (<a href="https://x.com/kimmonismus/status/2060047510853312557">kimmonismus</a>)</p></li><li><p>&#8220;They are releasing a Mythos-class model with the appropriate safeguards, meaning that you can&#8217;t use the &#8216;too dangerous to release&#8217; capabilities&#8221; (<a href="https://x.com/scaling01/status/2060123335514636693">scaling01</a>)</p></li><li><p>Cline summarized Anthropic as announcing plans to release new models <strong>with higher intelligence than Opus after adding stronger cyber safeguards</strong> (<a href="https://x.com/cline/status/2060063889874972905">Cline</a>)</p></li></ul><p>This is not just product roadmap gossip; it reframes Opus 4.8 as a <strong>staged release strategy</strong>:</p><ol><li><p>improve the commercially safe / broadly deployable general model,</p></li><li><p>hold back more dangerous cyber capability until controls are ready.</p></li></ol><p>That tradeoff drew both praise and criticism:</p><ul><li><p>supportive: safety-first frontier deployment</p></li><li><p>skeptical: Anthropic may be sacrificing some competitiveness in raw capability availability to maintain its risk posture (<a href="https://x.com/teortaxesTex/status/2060114150928322868">teortaxesTex</a>)</p></li></ul><h2><strong>Dynamic Workflows: the most important technical addition beyond the base model</strong></h2><p>The standout systems feature accompanying Opus 4.8 is <strong>Dynamic Workflows</strong> in Claude Code.</p><p>Official description:</p><ul><li><p>&#8220;Claude writes an orchestration script on the fly&#8221;</p></li><li><p>then spins up <strong>a large fleet of coordinated subagents in parallel</strong></p></li><li><p>use the word <strong>&#8220;workflow&#8221;</strong> in a prompt to activate it (<a 
href="https://x.com/ClaudeDevs/status/2060044853279617150">ClaudeDevs</a>)</p></li></ul><p>Anthropic&#8217;s employees and users described it as enabling:</p><ul><li><p>orchestration plans that Claude &#8220;strictly follows&#8221;</p></li><li><p><strong>hundreds of agents</strong></p></li><li><p>verification before returning results</p></li><li><p>support for very large migration / refactor / auditing jobs (<a href="https://x.com/_catwu/status/2060054180379689074">Cat Wu</a>, <a href="https://x.com/mikeyk/status/2060046052821184907">Mikey K</a>)</p></li></ul><p>Examples cited:</p><ul><li><p><strong>porting Bun from Zig to Rust</strong>, around <strong>750k lines</strong>, <strong>99.8% of test suite passing</strong>, <strong>11 days from first commit to merge</strong>, using hundreds of parallel agents and two reviewers per file (<a href="https://x.com/_catwu/status/2060051282698682576">Cat Wu</a>)</p></li><li><p>processing <strong>hundreds of A/B test flags</strong> in parallel in <strong>&lt;10 minutes</strong> to identify stale flags (<a href="https://x.com/_catwu/status/2060054182447448387">Cat Wu</a>)</p></li></ul><p>This launch triggered a mini-debate around the broader concept:</p><ul><li><p>Some researchers argued Anthropic had essentially productized ideas resembling <strong>Recursive Language Models / symbolic recursion over prompts</strong> (<a href="https://x.com/a1zhang/status/2060071701879066626">a1zhang</a>, <a href="https://x.com/lateinteraction/status/2060078643133763839">lateinteraction</a>, <a href="https://x.com/lateinteraction/status/2060082815077961842">lateinteraction</a>)</p></li><li><p>Others pushed back that &#8220;calling models in a loop&#8221; is not novel and that many builders have been doing this manually for months (<a href="https://x.com/omarsar0/status/2060059612041171175">omarsar0</a>, <a href="https://x.com/jxmnop/status/2060109869399916770">jxmnop</a>, <a 
href="https://x.com/willdepue/status/2060144024300695662">willdepue</a>)</p></li></ul><p>The more substantive critique concerned not originality but <strong>cost and harness quality</strong>:</p><ul><li><p>omarsar0 warned that agent-to-agent interactions are effective but token-heavy (<a href="https://x.com/omarsar0/status/2060059612041171175">omarsar0</a>)</p></li><li><p>Theo complained about conflicting parallel edits and wasted tokens in the current tooling (<a href="https://x.com/theo/status/2060135394570797158">Theo</a>)</p></li><li><p>itsclivetime joked that &#8220;hundreds of parallel subagents&#8221; will hit quota in seconds (<a href="https://x.com/itsclivetime/status/2060157266591129895">itsclivetime</a>)</p></li><li><p>KLieret highlighted a system-card finding: multi-agent setups may not improve final ProgramBench quality, but they reach mediocre solutions <strong>2x faster</strong> (<a href="https://x.com/KLieret/status/2060111272943739243">KLieret</a>)</p></li></ul><p>So the consensus from technical users is:</p><ul><li><p><strong>Dynamic workflows are strategically important</strong></p></li><li><p>they are likely the future of coding agents</p></li><li><p>but the current implementation still faces <strong>editing conflicts, cost blowups, and harness inefficiencies</strong></p></li></ul><h2><strong>Different opinions on Opus 4.8</strong></h2><h3><strong>1) Strongly supportive: Anthropic is back</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-anthropic-raises-965b-series">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Age of Async Agents — Cognition's Walden Yan & OpenInspect's Cole Murray]]></title><description><![CDATA[80% Devin Commits, Spec-to-PR Workflows, Full VMs, Agent Memory, and PMs Shipping Code]]></description><link>https://www.latent.space/p/cognition</link><guid isPermaLink="false">https://www.latent.space/p/cognition</guid><pubDate>Thu, 28 May 2026 18:41:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/199607874/189969d5629a5099ebb2c7a0709ff18d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>The new <a href="https://ai.engineer/wf">AIEWF website</a> is live! <a href="https://ai.engineer/cfp">CFPs</a> close in 2 days and we will run our first New Engineer Orientation this weekend, get your tickets booked ASAP as they -will- sell out. Take the <a href="https://notion.qualtrics.com/jfe/form/SV_bP07tSVMXH7ePCS">AI Engineering Survey</a> and get &gt;$2k in credits and free <a href="https://ai.engineer/wf">AIE WF tickets</a>!</em></p><div><hr></div><p>One of the central tensions in the agents industry is that even while there are major decacorn agent labs like Sierra, Decagon, Notion and Cursor being built up, it is also true that it has never been easier to DIY agents, with a plethora of agent frameworks like <a href="https://www.latent.space/p/oai-v-langgraph">LangGraph</a> and <a href="https://www.latent.space/p/pydantic">Pydantic</a> and <a href="https://x.com/FredKSchott/status/2050274923852210397">Flue</a>, and managed agents from <a href="https://www.anthropic.com/engineering/managed-agents">Anthropic</a>  and <a href="https://blog.google/innovation-and-ai/technology/developers-tools/managed-agents-gemini-api/">Gemini</a> and <a href="https://openai.com/index/openai-on-aws/">Amazon</a>. 
There has been a wave of companies building their own background agents from <a href="https://x.com/simonw/status/2053529689122328947">Shopify</a> to <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents">Stripe</a> to <a href="https://x.com/matthuang/status/2057500542298136899?s=46">Paradigm</a> to <a href="https://x.com/shashank_kr/status/2056246734465253859?s=46">Razorpay</a>, and even Cognition&#8217;s friends <a href="https://x.com/zachbruggeman/status/2010728444771074493?s=46">Ramp</a> have <a href="https://modal.com/blog/how-ramp-built-a-full-context-background-coding-agent-on-modal">built their own coding agent with other friend Modal</a>.</p><p>You&#8217;d think Cognition might feel a bit threatened, but they&#8217;re not - even after all this, they were way oversubscribed for the<a href="https://www.latent.space/p/ainews-cognition-raises-1b-in-26b?utm_source=publication-search"> $1B Series D </a>they just announced:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/cognition/status/2059660758531940856&quot;,&quot;full_text&quot;:&quot;1/ We&#8217;ve raised over $1B at a $26B valuation, led by <span class=\&quot;tweet-fake-link\&quot;>@Lux_Capital</span>, <span class=\&quot;tweet-fake-link\&quot;>@generalcatalyst</span>, and <span class=\&quot;tweet-fake-link\&quot;>@8vc</span>.\n\nOur enterprise usage has grown &amp;gt;10x since the start of this year, and our run-rate revenue grew to $492 M.\n\nWe launched Devin two years ago as the first AI software engineer. 
Since &quot;,&quot;username&quot;:&quot;cognition&quot;,&quot;name&quot;:&quot;Cognition&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1765909640364068865/MvH-m0gd_normal.jpg&quot;,&quot;date&quot;:&quot;2026-05-27T15:39:26.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HJViewebAAE1uVB.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/k99LLLyWhZ&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:157,&quot;retweet_count&quot;:194,&quot;like_count&quot;:2372,&quot;impression_count&quot;:733289,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p><a href="https://www.linkedin.com/in/waldenyan">Walden Yan</a>, <a href="https://cognition.ai/blog/dont-build-multi-agents">coiner of context engineering</a> and Chief Product Officer/Cofounder of Cognition, invited <a href="https://github.com/ColeMurray/background-agents">OpenInspect&#8217;s Cole Murray</a> to talk about why <a href="https://swyx.io/cognition">the Devin is in the Details</a>.</p><p>Full conversation <a href="https://www.youtube.com/watch?v=0fgJPhYcbVk">live on the pod</a> today: </p><div id="youtube2-0fgJPhYcbVk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0fgJPhYcbVk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0fgJPhYcbVk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In retrospect, async agents were the most AGI-pilled bet you could make in 2024: the models weren&#8217;t good enough yet to vibecode, people didn&#8217;t trust AI enough to let it rip, and nobody (including early Cognition) was sure about the form factors. 
</p><p>Now it is obvious:</p><ul><li><p>The <strong>first wave of AI coding tools</strong> made the developer faster but kept them heavily in the loop. <a href="https://cursor.com/help/ai-features/tab">Copilot and Cursor&#8217;s tab autocomplete</a> are prime examples. However, the workflow was still centered around and <strong>bottlenecked</strong> by the developer&#8217;s local loop: a developer in an IDE, watching the model, accepting or rejecting changes, and pushing code one interaction at a time.</p></li><li><p>The second wave was <strong>local agents</strong>: <a href="https://www.latent.space/p/claude-code">Claude Code</a>, <a href="https://www.latent.space/p/windsurf">Windsurf</a>, Cursor&#8217;s agents pane: first one, then increasingly many terminals, all running concurrently.</p></li><li><p>The current <strong>Age of Async Agents</strong> points to a <strong>different future</strong> focused more on <strong>agent orchestration</strong>, which drives end-to-end development.</p><p></p></li></ul><p><em>According to previous <a href="https://www.latent.space/p/steve-yegges-vibe-coding-manifesto">guest Steve Yegge</a>, there are <a href="https://www.oreilly.com/radar/steve-yegge-wants-you-to-stop-looking-at-your-code/">8 finer-grained levels of agent adoption</a>, but we have collapsed them into three.</em></p><p>As Cursor&#8217;s Michael Truell put it in <a href="https://cursor.com/blog/third-era">The third era of AI software development</a>:</p><blockquote><p><em><strong>Cursor is no longer primarily about writing code</strong>. It is about helping developers <strong>build the factory that creates their software</strong>. 
This factory is made up of <strong>fleets of agents that they interact with as teammates</strong>: providing initial direction, equipping them with the tools to work independently, and reviewing their work.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QPqO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QPqO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 424w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 848w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 1272w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QPqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173299,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199607874?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QPqO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 424w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 848w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 1272w, https://substackcdn.com/image/fetch/$s_!QPqO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c0a0107-653e-4c83-a249-c3308b1ed019_1498x844.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>The agent should not sit solely inside the developer&#8217;s flow. It should be set up to <strong>work in the background</strong> so that you can give it a task, a repo, a machine, a shell, a browser, tests, memory, and review loops to go do the work somewhere else.</p><p>In less than a year, the sentiment has shifted from <strong>avoiding multi-agent systems</strong>:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/walden_yan/status/1933264183837282558&quot;,&quot;full_text&quot;:&quot;I see a lot of people make the same mistakes building agents. 
So we shared a few of the principles we use\n\n&quot;,&quot;username&quot;:&quot;walden_yan&quot;,&quot;name&quot;:&quot;Walden&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2043470753711190016/6IBgp4Sy_normal.jpg&quot;,&quot;date&quot;:&quot;2025-06-12T20:44:34.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:61,&quot;retweet_count&quot;:126,&quot;like_count&quot;:1079,&quot;impression_count&quot;:256435,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;https://cognition.ai/blog/dont-build-multi-agents&quot;,&quot;title&quot;:&quot;Don&#8217;t Build Multi-Agents | Cognition&quot;,&quot;description&quot;:&quot;Frameworks for LLM Agents have been surprisingly disappointing. I want to offer some principles for building agents based on our own trial &amp; error, and explain why some tempting ideas are actually quite bad in practice.&quot;,&quot;domain&quot;:&quot;cognition.ai&quot;,&quot;image&quot;:&quot;https://pbs.substack.com/news_img/2057257525226135552/JDV60mSg?format=jpg&amp;name=orig&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p>to suggesting approaches <strong>that actually work</strong>:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/walden_yan/status/2047054554433462360&quot;,&quot;full_text&quot;:&quot;A year ago, I'd tell people to not build multi-agents and to focus on context engineering fundamentals\n\nToday, many sexy ideas are still impractical, but we've found some setups that actually 
work&quot;,&quot;username&quot;:&quot;walden_yan&quot;,&quot;name&quot;:&quot;Walden&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2043470753711190016/6IBgp4Sy_normal.jpg&quot;,&quot;date&quot;:&quot;2026-04-22T20:46:52.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;&quot;,&quot;username&quot;:&quot;walden_yan&quot;,&quot;name&quot;:&quot;Walden&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/2043470753711190016/6IBgp4Sy_normal.jpg&quot;},&quot;reply_count&quot;:4,&quot;retweet_count&quot;:2,&quot;like_count&quot;:56,&quot;impression_count&quot;:10933,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p>From coining <strong>&#8220;context engineering&#8221;</strong> to building the infrastructure behind <strong>Devin&#8217;s 7x PR growth</strong> and jump from <strong>16%</strong> to <strong>80%</strong> of commits across Cognition repos, <strong>Walden Yan</strong> has had a front-row seat to the background-agent shift. In this episode, Cognition co-founder and CPO <strong>Walden Yan</strong> joins swyx alongside <strong>Cole Murray</strong>, creator of <strong>OpenInspect</strong>, to unpack why everyone is building their own Devin, what changed after the <strong>December 2025 model inflection</strong>, and why <strong>&#8220;spec to pull request&#8221;</strong> is now becoming a real production workflow.</p><p>We go deep on the architecture of <strong>background agents</strong>: harness-in-the-box vs out-of-the-box, why Devin separates <strong>the &#8220;brain&#8221; from the machine</strong>, why repo setup is still one of <strong>the hardest problems</strong>, why Docker is not always enough, and how full VMs, snapshots, scoped secrets, GitHub bots, Slack integrations, and video-based testing all fit together. 
Walden and Cole also dig into memory, MCP limitations, <strong><a href="https://cognition.ai/blog/multi-agents-working">multi-agent orchestration</a></strong>, AI code review, SRE auto-triage, PMs shipping code from Slack, Windsurf 2.0, hybrid frontier/sub-frontier systems, and the real failure mode of uncontrolled vibe coding: your codebase regressing to your worst engineer.</p><p>And as<a href="https://www.youtube.com/watch?v=zepu8Kk6FBQ"> agents eat software&#8230; and software eats the world&#8230; </a>you can draw the conclusion on what is next:</p><div id="youtube2-zepu8Kk6FBQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;zepu8Kk6FBQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/zepu8Kk6FBQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>We discuss:</h3><ul><li><p>Why the engineering world is waking up to <strong>background agents</strong> and <strong>cloud agents</strong></p></li><li><p>The <strong>December 2025 model inflection</strong> that made spec-to-PR workflows practical</p></li><li><p>Devin&#8217;s <strong>7x merged PR growth</strong> and rise from <strong>16%</strong> to <strong>80%</strong> of commits</p></li><li><p>Why Cole built <strong>OpenInspect</strong> as an open-source background-agent system</p></li><li><p>The economics of <strong>$20/seat</strong> agent products and why monetization is tricky</p></li><li><p>What Cognition actually sells beyond Devin: <strong>infra, onboarding, integrations, and adoption</strong></p></li><li><p><strong>Harness in the box vs out of the box</strong>, and why architecture matters</p></li><li><p>Why Devin separates the <strong>brain</strong> from the machine for <strong>security</strong> and 
<strong>permissions</strong></p></li><li><p>Repo setup, scoped secrets, Docker Compose, and agent-ready dev environments</p></li><li><p>Why full <strong>VMs matter</strong> when agents need to run real applications and test them</p></li><li><p>Android, macOS, Windows, nested virtualization, and machine-specific agent work</p></li><li><p>Why testing is much harder than <strong>&#8220;computer use&#8221;</strong></p></li><li><p>Screenshots, video verification, and the <strong>&#8220;I know it works&#8221;</strong> merge moment</p></li><li><p><strong>GitHub UX, Devin Review, AI reviewers, and agents</strong> responding to PR comments</p></li><li><p>Why MCP alone is <strong>not enough</strong> for first-class Slack and enterprise integrations</p></li><li><p>Memory, Knowledge, skills, Claude.md, and why retrieval is still unsolved</p></li><li><p><strong>Devin&#8217;s auto-generated memories</strong> and the challenge of memory pruning</p></li><li><p><strong>Always-on agents</strong> as permanent PMs for issues, tickets, and product areas</p></li><li><p>Sub-agents, meta-Devin management, and what multi-agent systems actually add</p></li><li><p>Why pure auto-merge vibe coding <strong>breaks down after about two weeks</strong></p></li><li><p>AI code smells, lint rules, reward hacking, and Semgrep for agent-written code</p></li><li><p>GitAI, inline context, and preserving the <strong>&#8220;why&#8221; behind code changes</strong></p></li><li><p>Local testing, mock servers, older codebases, and preparing companies for agents</p></li><li><p><strong>Windsurf 2.0</strong> and the handoff between local foreground agents and cloud background agents</p></li><li><p>SRE auto-triage, support workflows, and agents as first responders</p></li><li><p>PMs, marketing, and non-engineers creating pull requests from Slack</p></li><li><p>AI agent <strong>budgets</strong>, <strong>$1k-$5k</strong> per engineer <strong>spend</strong>, and hybrid frontier/sub-frontier systems</p></li><li><p>The 
rise of <strong>autonomous coding factories</strong> and <strong>who Cognition is hiring</strong></p></li></ul><div><hr></div><h3>Walden Yan</h3><ul><li><p><strong>X:</strong> <a href="https://x.com/walden_yan">https://x.com/walden_yan</a></p></li><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/waldenyan/">https://www.linkedin.com/in/waldenyan/</a></p></li></ul><h3>Cole Murray</h3><ul><li><p><strong>X:</strong> <a href="https://x.com/_colemurray">https://x.com/_colemurray</a></p></li><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/colemurray/">https://www.linkedin.com/in/colemurray/</a></p></li><li><p><strong>OpenInspect / Background Agents:</strong> <a href="https://github.com/ColeMurray/background-agents">https://github.com/ColeMurray/background-agents</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p><strong>00:00:00</strong> Introduction<br><strong>00:00:43</strong> Why Everyone Is Building Their Own Devin<br><strong>00:01:57</strong> Devin&#8217;s 2025 Ramp: 7x PR Growth and 80% of Commits<br><strong>00:03:49</strong> OpenInspect and the Rise of Open-Source Background Agents<br><strong>00:07:59</strong> What Cognition Actually Sells Beyond Devin<br><strong>00:09:56</strong> Background Agent Architecture: Harness In vs Out of the Box<br><strong>00:12:08</strong> Separating the Brain from the Machine<br><strong>00:14:07</strong> Repo Setup, Secrets, Docker, and Full VMs<br><strong>00:19:13</strong> Why Testing Is Harder Than Computer Use<br><strong>00:22:40</strong> Video Verification and the &#8220;I Know It Works&#8221; Merge Moment<br><strong>00:23:19</strong> GitHub UX, Devin Review, and AI Code Review<br><strong>00:25:42</strong> MCP, Slack, and Enterprise Agent Integrations<br><strong>00:28:59</strong> Memory, Knowledge, and Always-On Agents<br><strong>00:36:16</strong> Sub-Agents, Multi-Agent Orchestration, and Meta-Devin<br><strong>00:43:55</strong> Vibe Coding, Auto-Merge, and Codebase 
Decay<br><strong>00:48:38</strong> Agent Infra, VPCs, Cloud Providers, and Fast VM Restore<br><strong>00:52:25</strong> AI Code Smells, Reward Hacking, and Code Review Systems<br><strong>00:56:10</strong> Making Codebases Agent-Ready<br><strong>00:58:30</strong> Windsurf 2.0 and the Local-to-Cloud Agent Handoff<br><strong>01:01:15</strong> SRE Auto-Triage, PMs Shipping Code, and Agent Use Cases<br><strong>01:04:32</strong> Agent Budgets, Hybrid Models, and Autonomous Coding Factories<br><strong>01:06:51</strong> Hiring at Cognition and OpenInspect Consulting<br><strong>01:07:45</strong> Outro</p><div><hr></div><h1>Transcript</h1><h2>Introduction: Walden Yan, Cole Murray, and Context Engineering</h2><p><strong>Swyx [00:00:00]:</strong> All right, we&#8217;re in the studio with Walden Yan, co-founder of Cognition, CPO.</p><p><strong>Walden [00:00:08]:</strong> Happy to be here.</p><p><strong>Swyx [00:00:09]:</strong> Which is a cool title. And coiner of context engineering.</p><p><strong>Walden [00:00:15]:</strong> Although I think there are many people who&#8217;d used the term in various ways beforehand, I did find that people, both internally and externally, enjoyed the upgrade from prompt engineering or model wrapping into maybe a more thoughtful way to build agents.</p><p><strong>Swyx [00:00:33]:</strong> For those who haven&#8217;t caught up on that, I have on screen the Don&#8217;t Build Multi-Agents post, which you should go read and which we might refer to. And we also have Cole Murray, who created OpenInspect.</p><p><strong>Cole [00:00:43]:</strong> Great to be here.</p><p><strong>Swyx [00:00:43]:</strong> So let&#8217;s talk about it. Everyone is building their own Devins. What&#8217;s going on?</p><h2>The December Shift: From Handholding Models to Autonomous PRs</h2><p><strong>Cole [00:00:51]:</strong> So I think the engineering world is waking up to this idea of background agents, cloud agents, whatever you&#8217;d like to call it. 
And I think we saw a shift around the December timeframe of 2025, where the models, Opus 4.5 and GPT 5.2, reached a capability where we moved away from handholding the model to being able to drive it more or less autonomously. And what I mean by that is that we could pretty much go from a specification to a completed pull request, assuming the spec was good enough, with very little friction. And that paradigm alone, I think, changed a lot of how we interact with agents, and opened this world where background agents became more practical.</p><p><strong>Swyx [00:01:41]:</strong> I think for Cole, everyone experienced this in December, but I feel like there was just this increasing ramp, right? There was this moment which was, I think, Sonnet 3.7, where you guys rewrote Devin in one night or something. So describe 2025 or how it felt from your side.</p><p><strong>Walden [00:02:01]:</strong> In retrospect, we always thought it was ramping up, but then even now, over the last three, four months from today, it&#8217;s been ramping up even faster. So it&#8217;s almost funny to be talking about how big of a leap Sonnet 3.7 was, and honestly, a lot of it was stripping out parts of Devin that were no longer needed with that jump in intelligence. But I also just think that with a lot of the recent leaps, especially if you look at models like Opus and the latest GPT models, they are reaching levels of autonomy where people are finding that they actually can just be hands-off. And people who were once debating, &#8220;Oh, do I need to be in the weeds with my model in the IDE? Can I just completely move it off into the cloud?&#8221; That&#8217;s a more serious conversation, and we&#8217;ve seen that in all of our growth charts. Internally there&#8217;s this funny graph where our usage, in merged PRs, has grown 7X since... I forget what it was called.</p><p><strong>Swyx [00:02:57]:</strong> I think Dev maybe tweeted that. 
Yes.</p><p><strong>Walden [00:03:01]:</strong> It grew like 7X over the last, I think it was two months, three months, something like that. And then you see our engineering headcount growth. It&#8217;s gone up by 10% or something.</p><p><strong>Swyx [00:03:11]:</strong> We were afraid to release this. So this is Devin commit percentages on all Devin repos: it was 16% in January and now 80% in March.</p><p><strong>Walden [00:03:25]:</strong> It&#8217;s a big shift right now. And so it makes sense that a lot of people are now thinking about buying Devin, but also maybe trying to build their own. I have a lot of fun building Devin, so I can see why other people would want to build their own cloud agents as well. Well, maybe it&#8217;s good to hear what initially inspired you to try to build OpenInspect?</p><h2>OpenInspect: Ramp, Cloud Agents, and Open Source</h2><p><strong>Cole [00:03:49]:</strong> OpenInspect came about primarily through my clients: observing how they were using tools like Claude and OpenAI&#8217;s Codex at the time, and seeing some of the friction that they were having with it. Primarily, Claude was being used through Slack, and a big issue they ran into was that the sessions that were launched were specific to whoever called it via Slack. And so if a PM was the one who invoked the session and then went to pass context to engineering, engineering couldn&#8217;t see the session. And that in itself was a deal breaker, because the PM says, &#8220;Hey, engineering, can you jump in?&#8221; But there&#8217;s nothing to jump in on beyond whatever they copy-paste out or the single response that came back. 
And so seeing some of these problems, I had built a similar architecture internally, just to experiment with and test out different ideas as this trend of moving off of localhost was starting to take shape. And as Ramp released their blog post, I had a lot of the pieces for this already in place, and just thought it would be funny to see what Claude could do purely from the blog post. And on my X account, there&#8217;s actually a thread where I live-tweeted going through this,</p><p><strong>Cole [00:05:14]:</strong> comparing GPT and Claude as both of them are going through it.</p><p><strong>Swyx [00:05:17]:</strong> On the announcement thing or something else?</p><p><strong>Cole [00:05:19]:</strong> Right after it got released. We can put it in the show notes. Yeah, it was helpful that I already knew how to verify the system. I knew what I was looking for. I think Ramp did a great job of really illustrating the technical aspects of how to build something. It was much more than just like, &#8220;Hey, we built a great system.&#8221; It was, &#8220;And here&#8217;s how you can build it too.&#8221; And so I resonated a lot with that, just with the problems that I was already seeing, and looking around, I didn&#8217;t really see anything in the open source community that matched this type of system. I think there are a lot that run on localhost, like Superset, Conductor, and many others. But nothing that was actually running in the cloud. And so I built it, and I thought it was interesting to just open source it and allow anyone to then have a foundation that they can mix and match on top of.</p><h2>The Business of Background Agents: Open Source vs. Devin</h2><p><strong>Swyx [00:06:16]:</strong> So literally after Devin was launched, there was OpenDevin, which became All Hands. 
I don&#8217;t know if you tried that or</p><p><strong>Walden [00:06:22]:</strong> I was going to say, one of the things that interested me a lot with OpenInspect was that you didn&#8217;t then try to go make it something you monetize. I think a lot of these open source projects would then go and really try to raise V</p><p><strong>Swyx [00:06:36]:</strong> That&#8217;s why no OpenDevin. Yeah.</p><p><strong>Walden [00:06:38]:</strong> Yeah, and how did you think about that? I thought that was very interesting.</p><p><strong>Cole [00:06:44]:</strong> What I thought, and what I had seen across my clients, was that having a background agent system is going to become critical infrastructure within their company. And so because of that, I wanted to open source it so that they could fork it and put in whatever customization they wanted. To that question though, I get asked all the time, &#8220;Oh, are you going to raise? Are you going to turn this into a service?&#8221;</p><p><strong>Walden [00:07:08]:</strong> I&#8217;m sure you&#8217;ve gotten offers.</p><p><strong>Cole [00:07:09]:</strong> But primarily I don&#8217;t want to do that for a few reasons. One, I don&#8217;t want to compete for $20 a seat. I think that is just a really difficult business. I think it&#8217;s very easy to copy the main pieces of it. Again, I built this fairly quickly. And I think because you are not owning, I guess, the entire stack, it&#8217;s hard to monetize. You have money being made at the sandbox layer with Daytona, E2B, and many other players. You have money being made at the model layer. And you sit in this weird in-between gray area where, what are you actually selling? You&#8217;re selling, I guess, the infrastructure. You&#8217;re selling the integrations, maybe.</p><p><strong>Swyx [00:07:55]:</strong> Let&#8217;s ask the guy. 
What are you selling?</p><p><strong>Walden [00:07:59]:</strong> Well, yeah, there are multiple layers to this in practice, and actually it&#8217;s funny you mentioned the infrastructure, &#8216;cause when we got started building Devin, we had to go figure out how to make the infrastructure as well because,</p><p><strong>Swyx [00:08:10]:</strong> You had to build this two years before everyone else?</p><p><strong>Swyx [00:08:15]:</strong> Including the model side</p><p><strong>Walden [00:08:17]:</strong> It was not very polished at the start. When we just built it off of raw VMs from cloud providers like EC2, the boot-up time was so slow, I think, and especially turning off the machines, saving them, and then being able to bring them back up again when you want Devin to wake up again later. It would just be out cold for like 10 minutes because that&#8217;s just how long these systems took. They were not built for this repeated down-and-up usage. And so we actually had to go do all of that. And as a result now, one thing we offer when we go and sell Devin to people is, you don&#8217;t have to worry about all the compute side of things. We&#8217;ll make it work. We&#8217;ll make it work in your cloud if you want it to. But aside from the product, and I want to go into the agents and the tuning of the intelligence part later, I think a big part of what we do at Cognition as well is to just make sure that your company learns and uses and adopts these coding agents. &#8216;Cause I think for especially the largest enterprises in the world, you find that there are a lot of people who want to move over to using AI for their day-to-day workloads. 
But because of the way projects are planned, and because not everyone is literate in using AI in these ways, having a team of engineers who can actually go in and onboard you, set up all the integrations you need and the automations you need to really get to that level of leverage with AI, is super helpful. And so we do that. We act as thought partners to the customers that we work with as well.</p><p><strong>Swyx [00:09:56]:</strong> So let&#8217;s talk about architectural stuff. I think that is something that was the topic of conversation between the two of you. Is this the mental model that you want to start with, or something else? I&#8217;ll just leave the floor open to you guys.</p><h2>Agent Architecture: Harness in the Box vs. Out of the Box</h2><p><strong>Cole [00:10:11]:</strong> I think maybe we can start here with just a general, what are the pieces of a background agent system. And then maybe we can go into some of the nuances of decisions that you can make.</p><p><strong>Swyx [00:10:22]:</strong> But I guess maybe what Walden is saying is the agent is like in this open code box, I guess. Right? This is infra, and then that&#8217;s the agent. And you had this discussion about whether you put the agent in here or out, externally. Can you tease that out?</p><p><strong>Cole [00:10:39]:</strong> In a background agent system, you have a decision to make about where the agent is actually going to run. This is typically described as the harness in the box or out of the box. With running the agent in the box, you&#8217;re making some trade-offs. The negative trade-off you&#8217;re making is primarily security. Because the agent is running in that box, unless you otherwise design it, all of your secrets need to go into that box as well. And given the nature of AI, it can be unpredictable, and you could very easily end up accidentally exfiltrating your secrets, or with other unintended behavior. 
Now, the out of the box is the idea that we are going to have the actual agent running not directly in the sandbox, and we will have, quote-unquote, the brain of the agent running in some type of worker or control plane. That sandbox then is going to serve as the hands, where the brain is basically operating and making tool calls into that environment to manipulate it. I guess the other trade-off that you&#8217;re making between the two systems is that, in my opinion, running it out of the box is much more complex because you have state that has to be managed, whereas if you&#8217;re running it in the box, all of the state of that agent is actually in the box, and yes, you could persist it elsewhere, but it&#8217;s all localized and you have fewer concerns to worry about.</p><p><strong>Walden [00:12:08]:</strong> I think a lot of what you mentioned is why we actually from the start built Devin to do what we called separating the brain from the machine. The other thing that this allows you to do is reuse any existing infrastructure you have for dev boxes, perhaps. And so you don&#8217;t have to worry as much about making a new type of dev box that has all the dependencies the brain needs and, as you mentioned, the secrets the brain needs as well. One thing that we&#8217;ve seen some customers run into is, you have a GitHub app and you want Devin, your agent, whatever, to be able to interact with GitHub through this application, but then you have different users with different actual permissions. If they are all interacting through the same GitHub app and there&#8217;s no actual separation between the system that decides what it does and the actual secrets on the machine, then you run into an issue where, okay, it&#8217;s hard to do the separation. But in practice, with Devin, it&#8217;s much easier because we just say whatever you put on the machine, that is the scope of basically what the user is free to do, what the agent is free to do. 
So only put the most scoped secrets on that machine, and then the brain is fully not accessible from the machine. So you don&#8217;t have to worry about messing with any of the most secure parts of the brain if the user is free to do whatever they want with the machine.</p><p><strong>Swyx [00:13:31]:</strong> I was going to just bring up, I have this chart from OpenAI, where I don&#8217;t know if this is in the box or out of the box. That is something that they do use to describe it. And then also recently Anthropic did managed agents</p><p><strong>Swyx [00:13:44]:</strong> Which is, this is their thing. I don&#8217;t know. It&#8217;s all variations of the same pattern, right?</p><p><strong>Cole [00:13:49]:</strong> So this would be out of the box.</p><p><strong>Swyx [00:13:51]:</strong> Which is preferable for them because it&#8217;s less work?</p><p><strong>Cole [00:13:56]:</strong> I would say it&#8217;s more work.</p><p><strong>Swyx [00:13:58]:</strong> It&#8217;s more work?</p><p><strong>Cole [00:13:58]:</strong> But in my opinion, it is the better architecture of the two. It&#8217;s just, you&#8217;re taking on a bit of complexity by doing that.</p><h2>Repo Setup, Docker, and VM-Based Development Environments</h2><p><strong>Walden [00:14:07]:</strong> One thing I&#8217;ve not seen a lot of other players do well is how do you manage what&#8217;s actually on the box? And this can be complex for many reasons. Let&#8217;s say you have a big repository that&#8217;s changing and updating a lot with changing dependencies. How do you make sure that the working environment of the agent actually stays up to date, has all the credentials it needs to, let&#8217;s say, run the app and test it, and all the things you want your autonomous</p><p><strong>Swyx [00:14:34]:</strong> So a repo setup.</p><p><strong>Walden [00:14:35]:</strong> Exactly. 
So internally at Cognition, we call this repo setup.</p><p><strong>Cole [00:14:39]:</strong> The hardest part of</p><p><strong>Walden [00:14:40]:</strong> It&#8217;s been a perennial problem since the start of the company, of how do we help people get this set up? Because not everyone just has cloud environments working out of the box. And do you find this to be a common problem with</p><p><strong>Swyx [00:14:53]:</strong> How do you solve it?</p><p><strong>Walden [00:14:53]:</strong> your clients?</p><p><strong>Cole [00:14:54]:</strong> This is a very common problem, and through my consulting, this is a lot of what I help teams do. A lot of teams don&#8217;t really have great developer environment setups, if any. A lot of the time it&#8217;s, &#8220;Go talk to Bob and get the secrets,&#8221; and that obviously doesn&#8217;t work when the agent needs to actually set this up. And so most teams are using Docker Compose or some type of microservices. And so for the</p><p><strong>Swyx [00:15:19]:</strong> Even in prod?</p><p><strong>Cole [00:15:20]:</strong> Not in prod. With OpenInspect, you are using this primarily to interact and make code changes. There are other use cases: whether through CLIs, MCPs, or other tools, you can then hook that into your production systems, primarily for SRE-type use cases. But you are not necessarily trying to test your prod internal microservices through the system.</p><p><strong>Walden [00:15:48]:</strong> And you mentioned Docker Compose. I think one direction we saw some of our friends take early on was using Docker containers as the level of abstraction for their models. There are lots of reasons, I think, why Docker containers are not great. One thing is, a Docker container&#8217;s not really a true security boundary, for one. 
But the other is, if you are running real applications, a lot of times those applications use Docker, and then you have to think about Docker in Docker, which is really weird. And so, the really hard challenge of getting VMs to work, why did we do that? Well, it was because we realized that you actually needed full VMs to be able to do these types of things. And especially nowadays, where there&#8217;s actually value in running the application and clicking around and sending you screen recordings of these things, the value just keeps adding on top of that. But it is a decision I see people run into when they try to build their own systems: &#8220;Oh, in addition to this, do we put the agent in the machine or out of the machine? Do we use Docker? Do we use something else?&#8221; What do you recommend to people nowadays?</p><p><strong>Cole [00:16:57]:</strong> I think Docker is a good solution for maybe not running the agent, but running your infrastructure, because that is more or less the same setup your engineers are probably already using. If they&#8217;re not, then I don&#8217;t know what they&#8217;re using. But they&#8217;re probably already using Docker Compose.</p><p><strong>Swyx [00:17:14]:</strong> I&#8217;ve always had a small candle for web containers. I don&#8217;t know if you guys have tried them before.</p><p><strong>Swyx [00:17:19]:</strong> To me, they were supposed to be like Docker Lite.</p><p><strong>Cole [00:17:22]:</strong> Is it?</p><p><strong>Swyx [00:17:22]:</strong> I don&#8217;t know.</p><p><strong>Cole [00:17:22]:</strong> No, I haven&#8217;t tried it. But yeah, I think any environment that you&#8217;ve set up that is a good experience for your developer naturally lends itself to being easy to set up for the agent. And once you figure out that local developer story, you&#8217;ve more or less solved the agent-in-a-sandbox environment setup. 
OpenInspect does have hooks as well, where you can run a setup.sh script that will pre-install everything. You can then pre-snapshot that build so it starts instantly, and then there is a second hook to actually restore the state of the sandbox when it comes back. And so you can already have all of those microservices running and basically get the same experience that you would on your machine within the sandbox.</p><h2>Testing Agents: Computer Use, Screenshots, and Real App Workflows</h2><p><strong>Walden [00:18:08]:</strong> Another thing that we&#8217;ve been thinking a lot about is different VM service offerings. Have you had customers where they needed like macOS-specific VMs or like Windows-specific</p><p><strong>Walden [00:18:20]:</strong> VMs?</p><p><strong>Walden [00:18:22]:</strong> There are many technologies in the world that only work on specific types of machines, right? If you&#8217;re building a .NET application that has to run on Windows or, maybe more commonly, if you want to build iOS or macOS</p><p><strong>Swyx [00:18:32]:</strong> Does Cognition support</p><p><strong>Swyx [00:18:33]:</strong> choices like that?</p><p><strong>Walden [00:18:35]:</strong> The fundamental architecture does support it, because we do the separation, but the actual work is in progress right now on these. Another thing that we&#8217;ve actually recently added support for, it&#8217;s in beta, is Android development. To do that, we needed to support, I think, nested virtualization within our machines, because the VM itself is a virtualized Firecracker instance, and then you had to run another Android emulator inside. And there are weird performance issues, which is why it&#8217;s still in beta. 
We have to think through these problems, but it unlocks a lot for anyone who wants to do Android development.</p><p><strong>Swyx [00:19:13]:</strong> I was trying to find a reference video for the testing thing. I couldn&#8217;t find it, but I think you worked on the testing capability. Why call it testing and not computer use or, I don&#8217;t know, what&#8217;s the general category of problem?</p><p><strong>Walden [00:19:26]:</strong> I think that when people think about the ability of an AI to run your app and test it, they actually over-index on the computer use part of it, because computer use in my mind is the literal, okay, you know what button you want to click. Can you emit the right coordinates to go click that button? I think testing is actually a really interesting</p><p><strong>Walden [00:19:48]:</strong> problem-solving challenge for these AIs, because if you wanted to do arbitrary testing, imagine you make a change that spans the frontend and the backend, maybe even some other, even more deeply nested service. To actually test that change, we have to reason through, how do you first run these applications to orchestrate with each other with the right version of the code? Then, okay, how do I trigger the feature or how do I make the thing actually happen? And this can get arbitrarily hard. Maybe you have to be an admin. Maybe a certain thing has to be feature flagged on. Maybe you have to run two sessions and then send a very specific word into one of them to trigger a specific behavior. And figuring out how to do that requires a lot of code base context and a lot of orchestration that we&#8217;ve specifically done. 
And in some cases, we found that no one frontier model can actually do this full end-to-end task itself.</p><p><strong>Walden [00:20:42]:</strong> We&#8217;ve seen cases where we actually had to orchestrate different frontier models together to solve this problem. That is where we spend most of our time when we think about this testing problem, not so much the computer use part. Computer use, for what it&#8217;s worth, has gotten a lot better with recent models, and it&#8217;s made that part of the job certainly easier.</p><p><strong>Swyx [00:20:58]:</strong> Especially with even 4.7, which they released yesterday, apparently way better in terms of the vision stuff, which is going to encompass computer use.</p><p><strong>Walden [00:21:08]:</strong> Having evals for all these as well is something that takes a while to build up. And having the evals be right is tricky as well. Do you ever see clients who are building their own agents have to start standing up evals to make sure things don&#8217;t regress?</p><p><strong>Swyx [00:21:25]:</strong> Not so much evals in the traditional sense, but specific to the testing part, that has just gone in. I just added support for screenshots, and in theory you can also do video. I need to put in a plugin to do that. But they do show up natively, and it was a very heavily requested feature, especially after Cursor&#8217;s recording came out. I think that was very enlightening for everyone, like, &#8220;Oh, this is a very good feature to actually have.&#8221; I think with Devin you guys have had this for a while.</p><p><strong>Swyx [00:21:57]:</strong> Oh, yeah. See how screenshots work. Yeah, I don&#8217;t know if there&#8217;s anything super non-obvious. 
It&#8217;s like once you know what feature to build, you can just prompt it and it will mostly work.</p><p><strong>Walden [00:22:09]:</strong> I think to Walden&#8217;s point, though, computer use is a subset of the larger testing problem, and I think that&#8217;s very specific to the code base that you&#8217;re working in, and it&#8217;s not something that you could just solve out of the box. You do need the code base context to actually know how to test it. And I think in the case of a background agent system, you fortunately do have that code base locally, so you know what is changing and can then inspect it and use that to drive the model.</p><p><strong>Swyx [00:22:40]:</strong> For those who haven&#8217;t seen it before, this is an example of how it works. After the PR is done, you click testing approved, and then it sends you back a video. What I really like is that it labels, it&#8217;s very small here, but it actually labels what it&#8217;s testing. And then you actually see the cursor and everything. So I don&#8217;t know, the engineering in this, just whatever you want to show. &#8216;Cause this is one of those few AGI moments, right? &#8216;Cause once I look at this, I actually wish I could just merge inside of Slack instead of going to GitHub, &#8216;cause I don&#8217;t need to see the code. I know it works.</p><p><strong>Walden [00:23:19]:</strong> Maybe a new feature in Cursor. Yeah, the annotations at the bottom were also a big difference for me when I added those.</p><p><strong>Swyx [00:23:27]:</strong> It&#8217;s just like, what am I looking at? What are you trying to demonstrate?</p><p><strong>Walden [00:23:30]:</strong> Exactly. There&#8217;s a surprisingly long tail of small details that ends up making a big difference for this end metric of how fast you actually merge the code in. 
One experience that we spent a lot of time tuning early on was what is the right experience on GitHub for these tools. Because I think with most tools out there, when you build the agent, you&#8217;ll think about, oh, it&#8217;ll create the PR for you. We tried to take that a step further and say, &#8220;Oh, what if we actually made sure you could interact with Devin directly on GitHub?&#8221; And so we made sure that you can comment on GitHub, and Devin would actually receive those comments and address them. But there&#8217;s actually quite a bit of tuning you have to do here. We have Devin Review now, for example. Devin Review will post comments on his own PR, and then Devin has to then go</p><h2>GitHub Workflows: Devin Review, Comments, and PR Automation</h2><p><strong>Swyx [00:24:23]:</strong> He answers his own comments, which is really loopy. So, yeah, I like that it just updates here that I have commented. But usually it&#8217;s just me saying like, &#8220;Hey, merged, fix any merge conflicts.&#8221;</p><p><strong>Walden [00:24:37]:</strong> So when Devin fixes his own comments, you might be scared that, oh, maybe it&#8217;ll infinite loop. But we&#8217;ve put a lot of work into making sure it doesn&#8217;t, both by making sure that the comments are high signal, and by making sure the agent is thoughtful about what comments it immediately goes and tries to fix, and what comments it&#8217;s like, &#8220;Wait a second, I think you&#8217;re wrong.&#8221; Actually, one of my favorite moments is when Devin tells me that I&#8217;m wrong when I try to get it to do something different. But tuning that behavior actually makes a big difference in terms of how useful the GitHub experience is.</p><p><strong>Cole [00:25:06]:</strong> To touch on that as well, I think having the AI reviewer integrated into the system is a critical part of this background system. 
OpenInspect does have that. It has a GitHub code reviewer whose prompt you can control. It does do comments as well. It doesn&#8217;t do them automatically yet. The capability is there, but it&#8217;s not fully used.</p><p><strong>Swyx [00:25:27]:</strong> So you have to ask for it?</p><p><strong>Cole [00:25:28]:</strong> You do, yeah. You can tag it on GitHub with whatever you named your GitHub bot, and it will then follow up. If you have merge conflicts or whatever you have asked it to resolve, it will then resolve it, but it doesn&#8217;t do it automatically yet.</p><h2>Integrations: Slack, MCP, and First-Party Agent Interfaces</h2><p><strong>Walden [00:25:42]:</strong> Well, I&#8217;m curious, what is the most common thing that people end up requesting, that they still need on top of OpenInspect when you help them go implement it?</p><p><strong>Cole [00:25:52]:</strong> I think a lot of it comes down to actually integrating it into the company. It&#8217;s one thing to have the background agent system set up, but if it isn&#8217;t actually integrated into your larger ecosystem, it isn&#8217;t that useful. It is useful to be able to kick off sessions, but what we really want to be able to do is hook it into all of our other systems, whether that is the production database with read-only credentials, the logs, or a Confluence or internal knowledge base system. 
I think that is where I see the huge leap for companies, and that can be a challenge for companies as well who are maybe not familiar with exactly how to approach it, especially if they&#8217;re in environments that have more compliance-type requirements, where access control can be a pretty big deal. How you deliberately think about these problems, I find to be one of the challenges that comes with a system like this.</p><p><strong>Walden [00:26:46]:</strong> The thing we found is, with MCPs, obviously there has been this really big explosion of, oh, you can go integrate it with all these different things. But to actually get the integration right and get the right experience, oftentimes we found that we had to go build our own ad hoc things. I think Slack is a great example of this. You could give your agent a Slack MCP and okay, it can post messages back to you on Slack. But we actually use Devin like a coworker in Slack, and that&#8217;s how it&#8217;s been built from the ground up. But to do that, you actually need to support webhooks that come back, right? And then Devin has to respond in a natural way and hopefully not spam your threads too much and annoy the people in your company. So you&#8217;ve got to tune that experience just right. Especially when there&#8217;s a lot of back and forth, we find that we actually have to go beyond the simple MCP integrations in these places.</p><p><strong>Swyx [00:27:39]:</strong> I just pulled up the MCP marketplace. I know this is a fair amount of work. Is the answer to eventually take first-party control of all the top MCPs? Is that the</p><p><strong>Walden [00:27:48]:</strong> I would love a world where you could have something that&#8217;s more expressive than MCP. 
That goes both ways, not just a set of tools, but a proper system that interacts back and lets it have the right experience with all these interfaces.</p><p><strong>Swyx [00:28:03]:</strong> So there actually is sampling in the MCP spec, but nobody uses it, right?</p><p><strong>Walden [00:28:07]:</strong> And I think that&#8217;s the other part: we found that when the MCP spec starts to get too complicated, it starts to lose its original promise of being a simple one-step connect. Then we have to go figure out how to support all these different variations of things, and it starts to look a lot like just building the first-party integrations in a lot of these cases.</p><p><strong>Cole [00:28:29]:</strong> I think it matters, too, how critical it is to your company, right? If this is something that nearly every session is going through, it probably makes sense to own it so that you can make optimizations on top of it versus just using whatever is off the shelf.</p><p><strong>Swyx [00:28:43]:</strong> Awesome. Other than MCPs, what else? Sorry, I don&#8217;t know if that&#8217;s narrowing in too much on integrations. But what else? What other elements of building OpenInspect or Devin do you guys really sync on?</p><h2>Memory and Knowledge: What Agents Should Remember</h2><p><strong>Cole [00:28:59]:</strong> I think a problem that comes up very frequently is this idea of memories or a knowledge base.</p><p><strong>Swyx [00:29:05]:</strong> Oh, boy. How do you solve it?</p><p><strong>Cole [00:29:08]:</strong> So, not solved yet, is the short answer.</p><p><strong>Cole [00:29:11]:</strong> There&#8217;s an open issue for it, someone asking about it.</p><p><strong>Swyx [00:29:16]:</strong> DeepWiki hasn&#8217;t indexed anything about memory yet.</p><p><strong>Cole [00:29:20]:</strong> How I&#8217;m seeing it solved across my clients is primarily through skills. 
I find that skills can be a good stopgap there, or updating CLAUDE.md, but I think memory as a whole is a pretty unsolved problem, and that is why I&#8217;ve been hesitant to add it. I think there are parts of memory that can be addressed, but as a whole it&#8217;s a very difficult retrieval problem.</p><p><strong>Swyx [00:29:44]:</strong> Oh my God. Ramp didn&#8217;t write anything about memory? I see zero search results.</p><p><strong>Walden [00:29:50]:</strong> No. Memory can be quite tricky to get right because it&#8217;s not just the retrieval, but also the generation of the memories that can be really tricky. You don&#8217;t want it to just remember very specific details.</p><p><strong>Swyx [00:29:59]:</strong> Walk us through the Devin memory journey because I know there&#8217;s been a journey.</p><p><strong>Walden [00:30:03]:</strong> The first version of memory that stuck around for a while was a system we have called Knowledge. And the idea was we wanted it to pick up things over time and not need the user to be proactive about teaching Devin things. So, any time you remind Devin, &#8220;Wait, no, that&#8217;s not quite the way you&#8217;re supposed to use Git,&#8221; we actually want Devin to say, &#8220;Hey, do you want me to actually just remember this for the future?&#8221; And for you to just quickly approve or reject, and for it to build up over time. &#8216;Cause I find that 95%, I think, or some crazy stat like that, of the memories that Devin has are all through these auto-generated things. Very few people actually want to sit down and write big docs on, here&#8217;s how you&#8217;re supposed to work with the technology, et cetera. The generation and the retrieval have been something that we&#8217;ve been trying to tune a lot over the years. 
Generation: if you asked one time, &#8220;Oh, please open this as a draft PR,&#8221; you don&#8217;t want it to conclude, &#8220;Oh, everyone forever now should get their PRs as draft PRs.&#8221; But you do want some carryover. Maybe you want it to say, &#8220;Oh, Cole generally likes things to be created as draft PRs.&#8221; Same with retrieval: if you have thousands of these memories, how do you actually make sure they&#8217;re retrieved at the right time? And that can be quite tricky to do right without exploding the context with a bunch of useless information. It takes a surprising amount of eval work just to make sure that memory remains a reliable system as new models come and go.</p><p><strong>Cole [00:31:31]:</strong> Do you have anything that you could share on memory pruning? And the temporal aspect of memory?</p><p><strong>Swyx [00:31:36]:</strong> Deleting and forgetting?</p><p><strong>Walden [00:31:39]:</strong> Today, the thing it can do is edit memories. So if your memory used to say, &#8220;Oh, Cole likes to open everything as a draft PR,&#8221; then you can imagine saying, &#8220;No, don&#8217;t do that.&#8221; And then it&#8217;ll say, &#8220;Oh, do you want me to update the memory to be that Cole now wants everything as open PRs?&#8221; At the same time, we don&#8217;t know if this is going to be the final version of the system. Whatever we have here will probably translate into the new system that we&#8217;ll be coming up with. But I think one big difference between two years ago and today is that these agents are really good at using anything that resembles a file system natively. And so part of us is thinking, &#8220;Oh, should we rebuild memories to feel more like a file system that we let the agent navigate on its own?&#8221; That&#8217;s been an interesting exploration. 
Also similar ideas in the skills space.</p><p><strong>Swyx [00:32:35]:</strong> I am pulling up OpenClaude&#8217;s memory thing right now. OpenClaude has this daily memory journal thing, right? I mean, that is a file system you can grep through, and it is a source of truth. I don&#8217;t know if it&#8217;s the best. It&#8217;s probably super noisy, but at least if you lose something you can discover it, or you can apply some forgetting algorithm to more ancient memories that don&#8217;t get recalled again or something. I don&#8217;t know.</p><p><strong>Walden [00:33:01]:</strong> One thing we&#8217;ve been trying to do to push the boundaries of how you use agents at your company is letting an agent have a very similar file, a memory.md or something, and just be your permanent PM for a specific set of issues, maybe. So we have some Slack channels internally, maybe a Slack channel dedicated to a specific product like DeepWiki. And you can imagine that you want a Devin that never stops, that&#8217;s just always awake, but it has this memory doc that it can maintain for itself about, okay, what are the number one priorities of what we have to fix? Who is responsible for some upcoming work? Maybe Devin will even tag you on some recurring basis. And so it&#8217;s been an interesting move to see, okay, how can we actually use Devin for more than just engineering? Can we actually move upstream of the engineering process, and maybe it&#8217;s just Devin creating tickets, which then maybe some humans do, but then maybe other Devins do.</p><p><strong>Swyx [00:34:00]:</strong> One of my more fun automations is: go research competitors and just suggest stuff to me on a weekly basis. That&#8217;s the automation. 
I can&#8217;t find it right now, but basically it just like, &#8220;Look at competitors and suggest things.&#8221; &#8220;And here are three things that you&#8217;ve suggested that I don&#8217;t want any more of,&#8221; and you just stick that in the prompts. But like I wish actually So for like when I, for example, when I reject a PR, I wish that it updated memory so that I can then just not have to go up, go back and update the scheduled, sync, but anyway, feature request.</p><p><strong>Walden [00:34:31]:</strong> what? We might change it soon. I guess OpenInspect, in the time you&#8217;ve been around, has there been anything you tried to implement but then you had to like undo and like do a different way?</p><h2>OpenInspect Architecture: Webhooks, Control Planes, and Agent State</h2><p><strong>Cole [00:34:41]:</strong> Nothing yet, but something that is on my mind. The initial way that I built it was that each of the integrations lives as its own package. And so you have The Slack bot, which is what&#8217;s handling the webhooks, and then is basically interacting with the control plane. As I&#8217;m seeing the system starting to be more integrated, specifically with the GitHub bot integration, I&#8217;m considering bringing that all into the central control plane because especially now I want to start, And a request that I&#8217;m getting is the ability to monitor, the actual, pull requests being merged, as well as just tracking of</p><p><strong>Swyx [00:35:19]:</strong> What do I have open?</p><p><strong>Cole [00:35:21]:</strong> What do I have open? How many of these are getting merged? How many comments are showing up? To just understand the health of the system. And so in the case of a GitHub app, you only have one webhook. And so then it&#8217;s a question of do I put that webhook in that GitHub bot package? That&#8217;s weird. It doesn&#8217;t really make sense to live there because that package is more for like the code reviewer. 
Or do I like centralize it? So that&#8217;s something that&#8217;s on my mind of, making that decision. I think the other one we touched on earlier is the harness in the box versus out of the box. I think long term the architecture will eventually come back out of the box. Some of the newer tools that I&#8217;ve added are calling back into the control plane so that you don&#8217;t have the secrets in the sandbox. And so I think long term I probably will pull the actual, agent out of the box, but I think for now it&#8217;s fine.</p><h2>Subagents and Multi-Agent Systems: When Parallelism Helps or Hurts</h2><p><strong>Swyx [00:36:16]:</strong> Just, a quick question on pulling the agent out of the box. I&#8217;m One thing I&#8217;m very bullish on this year is agents calling other agents or spawning sub-agents or Whatever you want to call it. Does that make it harder or easier? I can&#8217;t tell. Because if the harness is in the box, you can just spin up more boxes. If the harness is outside the box, then you&#8217;re, it&#8217;s less easy because you are, you have a unicorn pet of a, of a harness that&#8217;s, living outside the box.</p><p><strong>Cole [00:36:45]:</strong> In theory it would be the same way, right? Whether, one agent has launched many, sub-sessions within it, OpenInspect, for example, can launch sub-sessions and actually create other environments and then monitor them. In the case where it is out of the box, that would basically just be an additional session that&#8217;s running. And so that session is also running outside of the box. It&#8217;s running in your worker plane, wherever you&#8217;re running this. And then you really just have to think about how does your top level agent then interact with it. I do think it can be more complex, just &#8216;cause again, you have now a more difficult architecture. 
But I think if you figured it out once, it&#8217;s probably fine.</p><p><strong>Swyx [00:37:26]:</strong> Well, then I&#8217;m just, throwing it open to you in terms of, I call this like meta Devin management. Which is like the, Devin&#8217;s calling Devins or Devin scheduling Devins or querying trajectories or anything like that. What have you built or unshipped, anything?</p><p><strong>Cole [00:37:46]:</strong> I think one of the surprising things we&#8217;ve seen is that a lot of the ways that, these, separate agents work with each other, and you want them to, parallelize their work, has still mostly followed the same manager sub-agents regime. And a lot of people I think are excited about this world where you have swarms of agents that, talk with each other all over the place. We&#8217;ve actually given Devin an MCP so they can just go arbitrarily message other Devins And create new Devins, et cetera. But I guess, it somehow creates, a really chaotic world in that sense. And so we&#8217;ve still found that most practical use on a day-to-day basis has been one single Devin.</p><p><strong>Cole [00:38:33]:</strong> Figuring out how to segregate the work and get, have other Devins work on it in, a relatively isolated sense, each with their own boxes Not sharing machines, so there&#8217;s, a very little room for conflict is the regime that you have to create today.</p><p><strong>Swyx [00:38:50]:</strong> I&#8217;ll call out, the experiments from Cursor, right? This is Wilson Lin&#8217;s work on Single agent to multi-agent, and you&#8217;re obviously famously on the side of don&#8217;t build multi-agent. 
But they went through the whole thing, only to arrive at this, which is exactly what Devin has, I think.</p><p><strong>Cole [00:39:08]:</strong> I think there will be a revision to that post at some point about</p><p><strong>Swyx [00:39:12]:</strong> Tell us about it</p><p><strong>Cole [00:39:12]:</strong> I think multi-agents were very much not at all possible a year ago. You do see more multi-agent experiments today, but you can argue, are they really multi-agents, or are they just tool calls? There are people who will create sub-agents to go look for XYZ file, XYZ implementation. That has really nice context management benefits because all of the tool calls and tokens that it spends then get collapsed back to just the answer for the main agent. There are a lot of benefits to doing this. We basically have Devin do this with DeepWiki: make a call out to DeepWiki, give you back the results. But that feels like a tool call. It&#8217;s not like these two collaborators are actually talking back and forth with each other. But I think the thing that makes me most bullish that multi-agents might actually be possible is what I said earlier: Devin will actually sometimes tell me I&#8217;m wrong and push back, and I think that demonstrates a level of maturity and communication today that makes a multi-agent world possible. Can two agents who have seen different information come back to each other and actually figure out who is right, what is the correct implementation? They&#8217;re not just yes-men. Claude, I guess, used to just say, what is it? 
&#8220;You&#8217;re right,&#8221; or,</p><p><strong>Swyx [00:40:25]:</strong> &#8220;You&#8217;re absolutely right.&#8221;</p><p><strong>Cole [00:40:26]:</strong> &#8220;You&#8217;re absolutely right.&#8221; Yeah.</p><p><strong>Swyx [00:40:28]:</strong> Have you seen, did you see</p><p><strong>Cole [00:40:29]:</strong> The age is over</p><p><strong>Swyx [00:40:30]:</strong> The Codex app trolling Anthropic? This is the Codex app. Inside of Settings, there&#8217;s a little Easter egg, right? If you go to Themes or Appearance, there are all these color codes, and the top one is &#8220;Absolutely,&#8221; and it&#8217;s in Anthropic&#8217;s colors. Which is such a troll. Anyway.</p><h2>Model Behavior: Pushback, Adversarial Prompts, and Agent Skepticism</h2><p><strong>Cole [00:40:53]:</strong> I love that Easter egg. Did you discover that yourself?</p><p><strong>Swyx [00:40:54]:</strong> No, someone was tweeting about it, and I was like, &#8220;Is this true?&#8221; Because sometimes people just tweet stuff to get a rise out of you. But yeah, there you go, in Anthropic colors.</p><p><strong>Cole [00:41:06]:</strong> Yeah. So we&#8217;re out of this regime where it just says you&#8217;re absolutely right, and they can have real conversations and real back and forth.</p><p><strong>Swyx [00:41:13]:</strong> You can prompt it as well to be more adversarial or whatever. Yeah. Okay. I mean, to me, that is more intelligence, right? That is not just a dumb tool; it&#8217;s actually pushing back on you, I think. Yeah.</p><p><strong>Cole [00:41:24]:</strong> When you mentioned the Cursor blog posts. 
There was one blog they had where they fed a swarm of agents together and built a browser.</p><p><strong>Swyx [00:41:34]:</strong> That was I think that was the one.</p><p><strong>Cole [00:41:36]:</strong> You can have, like</p><p><strong>Swyx [00:41:37]:</strong> I think it&#8217;s the same one</p><p><strong>Cole [00:41:37]:</strong> Creation of it. We found a surprising success of, don&#8217;t do a swarm or anything, just have one Devin, it does its own context management. Just let it keep running for a while and give it some crazy tasks. I think we asked it to, rebuild, a Windows OS system. And it managed to do it just like, going on for long enough. It&#8217;s</p><p><strong>Swyx [00:41:55]:</strong> Was this Andrew&#8217;s thing?</p><p><strong>Cole [00:41:58]:</strong> there were lots of demos that we ended up not posting, &#8216;cause at some point we&#8217;d just be posting way too much a bunch of, Demos. But I love that because it shows that I think the multi-agent thing still has, a bit of exciting sexiness to it, which is maybe still beyond still, the actual delta it adds to the capabilities of these systems. But it&#8217;s absolutely the future. I think we&#8217;re heading in that direction and we can see the progress being made there already.</p><p><strong>Swyx [00:42:25]:</strong> If I were to, make one super minor pushback because I don&#8217;t feel that confident about it yet</p><p><strong>Cole [00:42:33]:</strong> Go for it</p><p><strong>Swyx [00:42:33]:</strong> But I&#8217;ve had Ryan Lopopolo from OpenAI on the pod And he&#8217;s a super slop cannon, right? Oh my God, that&#8217;s my coding agent being done. I downloaded this, Peon Ping. I don&#8217;t know if you guys have heard this. It takes like-, sound packs from popular games like, Command and Conquer and Warcraft, and then it plays it whenever it&#8217;s done. And so it&#8217;s like, &#8220;Work,&#8221; or whatever, &#8220;At your command,&#8221; or something. 
Anyway, what I got from the Cursor code base and from Ryan&#8217;s thing was that there&#8217;s a slop cannon approach where you try to loosen the single-agent bottleneck, and I feel like that is probably a very important thing to try to figure out. I don&#8217;t think anyone&#8217;s really solved it, because then you just have more reviewer slop on top of the agent slop to try to wrangle it all. Ryan will probably very strongly object to me saying that he hasn&#8217;t solved it; he thinks he&#8217;s completely solved it. But I think it&#8217;s still very important, &#8216;cause that is a bottleneck, right? I feel Devin is slow sometimes, because I&#8217;m like, well, yeah, this is very readable and very sensible, but it is also slower than it could be. I want a button to just say, &#8220;Just ramp this up 1,000x in parallel and see what happens.&#8221; And I don&#8217;t know if that&#8217;s feasible at some point in the future.</p><h2>Code Review, Entropy, and AI Slop</h2><p><strong>Walden [00:43:55]:</strong> We&#8217;ve also run experiments internally where we&#8217;ve basically tried to build entire products, true products that we knew we would eventually ship, but said, for now, let&#8217;s see if we can do it purely by vibe coding on top of each other: auto-merge, no code review at all. And then there&#8217;s this benchmark of how many weeks you can keep this going before you say, &#8220;We have the trashiest code base.&#8221;</p><p><strong>Walden [00:44:18]:</strong> &#8220;Let&#8217;s actually rewrite it from scratch.&#8221;</p><p><strong>Swyx [00:44:19]:</strong> Start a new factory, yeah. What&#8217;d you find?</p><p><strong>Walden [00:44:21]:</strong> I think we found that the state of the art in December was that you can probably run this for about two weeks. 
By the end of those two weeks, you&#8217;d find that, hey, you want to, change the color of a button. Well, it turns out this button is implemented in, 10 different places, and they, have All these different variations, and oh, you forgot one of them, and actually it&#8217;s a slightly different color in one spot. And you&#8217;re like, &#8220;Okay, this is too much to work with. Let&#8217;s actually try to do code review at the same time.&#8221; And make sure that we&#8217;re on top of our software, actually cleaning it up a bit And making sure it&#8217;s done in a scalable way.</p><p><strong>Cole [00:44:54]:</strong> I think building on that, the idea of, you don&#8217;t have to look at code, I think is generally a bad idea. And the meme that I have for that</p><p><strong>Walden [00:45:03]:</strong> What timeline, all right, is Do you think that statement will be true on?</p><p><strong>Cole [00:45:06]:</strong> I think probably for a while it&#8217;ll be true that you should continue to look at your code. A problem that I see a lot of teams run into that I work with who are embracing AI native, AI first coding, is The meme that I have is that your code base regresses to your worst engineer, because that engineer who is, very gung-ho about AI and is not auditing their code, their pattern starts cementing into the code, and now the AI is referencing their patterns. And so now their if/else block that, is 20 if/elses back and forth, the AI is seeing that as the pattern of how things are done and starts to then exponentially grow this slop. And I find to your point, a pretty good approach to that is having scheduled cleanup, whether by humans or through systems, that are looking for duplication. They then address that. You&#8217;ll end up with like 12 helpers for how to format a date. 
And you need to address that, because otherwise it will continue to sprawl.</p><p><strong>Swyx [00:46:09]:</strong> Within balance, I think it&#8217;s fine to have some duplication, and then sometimes To have garbage collection, right? Yeah. The What I&#8217;ve been, talking about with a lot of engineering leaders is that you want to be very strict about the boundaries between modules, and it&#8217;s your job as an architect, as a CTO, whatever, to say like, &#8220;Okay, here&#8217;s the hard contract between you guys and you guys. Whatever you do inside this black box is your business. You do whatever. But between these guys, let&#8217;s be, really damn clear, and any movement must be signed off by a human or me,&#8221; or. Then, and like that&#8217;s that. I don&#8217;t know if you have any other modifications or advice.</p><p><strong>Walden [00:46:44]:</strong> Well, I guess generally on the topic of, where humans can be useful, I found that &#8216;cause, some of these, really deep infra problems, sometimes just having a human that just has, really deep expertise can make a big difference. I&#8217;ve actually seen this come into play when actually building agents. So we&#8217;ve had a few friends now, try building their own coding agents, and I think one same problem that I recurringly heard a lot of them run into was this problem of like, &#8220;Oh, Grep is really slow on our agents&#8217; machines.&#8221; And so a lot of them, I assume because they&#8217;re using AI and they themselves don&#8217;t have, super deep infra background knowledge, say, &#8220;Okay, we&#8217;re going to go build our own custom Grep index. It&#8217;s going to be really fast,&#8221; and use that as a way around this problem. When we ran into this problem About like, maybe like a year and a half ago when we were, in the early days of building Devin, we obviously didn&#8217;t have AI then. We just asked our, how to, how to do this. 
You can just swap out a new Grep index, so.</p><h2>Infrastructure Details: Grep, File Systems, and Sandboxes</h2><p><strong>Swyx [00:47:45]:</strong> What do you mean you hand-coded Devin? What?</p><p><strong>Walden [00:47:48]:</strong> It&#8217;s like, can you believe we hand-wrote this code? And we had, our infra people who are really amazing, they were looking into it and they&#8217;re like, &#8220;Oh, what? We realized that actually the root cause of this problem is actually super simple, but like fine-grain detail,&#8221; which is that a lot of these virtual machines actually underlying them don&#8217;t use real file systems. They use these, network file systems where things are actually cached over the network actually in S3. So when you&#8217;re Grepping, you&#8217;re actually making network calls Every time you&#8217;re doing these things, and that&#8217;s why Grep is extremely slow on these machines. And so again, goes back to, what is all of the crazy infra work that we had to do to actually get these machines working. If you try to do this yourself, there are tons of small details like this, and so we had to eventually go swap out that network file system. But</p><p><strong>Swyx [00:48:35]:</strong> I think there&#8217;s a write-up about it, right? Silas did one about the virtual file system.</p><p><strong>Walden [00:48:38]:</strong> Oh, that was a whole other thing. The</p><p><strong>Swyx [00:48:39]:</strong> Oh, that&#8217;s a different thing</p><p><strong>Walden [00:48:40]:</strong> The BlockDev file storage format</p><p><strong>Swyx [00:48:42]:</strong> I&#8217;ll bring it up</p><p><strong>Walden [00:48:42]:</strong> Which is, a file system format that we built so that the VMs could be spun up and down very quickly. Basically, the intuition behind this is-Imagine you have, a terabyte of disk, and your agent only, wrote, a hundred lines of code on top of that disk. How long does it, say, take to, save and re-bring up that disk? 
In most systems, because you&#8217;re not optimizing for this case, it&#8217;s on the order of a terabyte of work, because you have to save all of that and bring it back up. In our system, we tried to build a file system where each save builds incrementally on top of the previous one. So every time you save and bring the machine back up, you&#8217;re only doing work that is effectively proportional to the diff in the file system. And so this shaves off a lot of time in the boot-up process of Devin. This is actually now outdated. We have a newer system inside of Devin. But yeah, there are a lot of tiny details you have to get right here to actually get the day-to-day experience of Devin to be good.</p><p><strong>Swyx [00:49:39]:</strong> It&#8217;s not technically agents, but it is agent infra, and when you sell an agent as a company, you sell agent plus agent infra.</p><p><strong>Walden [00:49:46]:</strong> At least the way we do it. And the nice thing about having the agent and the infra done together is that we get to deploy Devin in whatever environment we want now. We don&#8217;t need to wait for some underlying infra provider to also go and support VPC or on-prem or FedGovCloud, for instance. So we can actually go and figure out, okay, since we own the infrastructure, how can we get that set up for you?</p><h2>Cloud Providers: Modal, Daytona, and Enterprise Sandboxes</h2><p><strong>Swyx [00:50:12]:</strong> Whereas you&#8217;re Cloudflare dependent.</p><p><strong>Cole [00:50:15]:</strong> So Cloudflare runs the control plane. For the sandboxes, Modal is supported. A contributor just added Daytona. E2B is on the roadmap, and I think there&#8217;s an abstraction in place so that if any contributor wants to add a new provider, they can add that in.</p><p><strong>Walden [00:50:32]:</strong> Well, the customers you work with, do they generally try to go set up a contract with one of these third-party providers? 
Do they try to do the VMs in-house?</p><p><strong>Cole [00:50:44]:</strong> Most of them I see using Modal. I think Modal has a great</p><p><strong>Walden [00:50:48]:</strong> Shout out Modal.</p><p><strong>Swyx [00:50:48]:</strong> Shout out Modal.</p><p><strong>Cole [00:50:50]:</strong> I think Modal has a great offering. It captures all of the sandbox pieces you need, snapshots being a pretty big piece of that, and given that they also offer GPUs, I think it&#8217;s a pretty nice offering as a whole.</p><p><strong>Swyx [00:51:04]:</strong> No debate there.</p><p><strong>Walden [00:51:07]:</strong> Modal is great. I think their container offering is the most natural, and so especially if you are willing to forgo the full VM requirements, Modal is a really fast place you can spin something up on.</p><p><strong>Swyx [00:51:20]:</strong> So Modal&#8217;s very Python, and I feel like most workloads have really shifted to JavaScript. I don&#8217;t know if you guys get the same feeling. So, okay, when I started Latent Space and AIE and all these things, it was like 50/50 Python and JS, right? Roughly. I think that&#8217;s wrong now. I think JS has won. Maybe I&#8217;m overstating it, and maybe for Cognition, there&#8217;s C# and Java and what have you. But for new greenfield apps, do you get that sense? Does it matter?</p><p><strong>Cole [00:51:52]:</strong> I think that most of the libraries that I see in this space are Python native first, especially in the</p><p><strong>Cole [00:51:58]:</strong> observability space. That said, I think that there is a pretty big appeal to having your entire system in one language. Especially when you have both your frontend and backend communicating, you can have one central type, which is very nice.</p><p><strong>Swyx [00:52:11]:</strong> That&#8217;s my case against Modal, which is that then you have to run JS. You can run JS inside Modal. 
It&#8217;s just one extra step that isn&#8217;t native to the runtime. I don&#8217;t know if</p><p><strong>Walden [00:52:22]:</strong> I don&#8217;t know</p><p><strong>Swyx [00:52:23]:</strong> Reviews. Do you have numbers? I don&#8217;t know.</p><p><strong>Walden [00:52:25]:</strong> The one thing I don&#8217;t like about Python is whenever AI writes Python, it always does the weirdest patterns, and</p><p><strong>Swyx [00:52:32]:</strong> Oh, because it&#8217;s mixing two and three or what?</p><p><strong>Walden [00:52:34]:</strong> I think it&#8217;s something like mixing two and three, yeah. I don&#8217;t know if you see this. It always tries to do hasattr on objects, like</p><p><strong>Cole [00:52:41]:</strong> Oh, my God.</p><p><strong>Walden [00:52:41]:</strong> But you shouldn&#8217;t be doing that. It should error if there was</p><p><strong>Swyx [00:52:45]:</strong> Because it&#8217;s training on library code?</p><p><strong>Cole [00:52:47]:</strong> I think it&#8217;s more of, like</p><p><strong>Cole [00:52:48]:</strong> From what I&#8217;ve seen, it&#8217;s more of a reward hacking mechanism where it doesn&#8217;t want to basically</p><p><strong>Walden [00:52:54]:</strong> It&#8217;ll never error.</p><p><strong>Cole [00:52:54]:</strong> It doesn&#8217;t want the code to fail. And so even when it knows the object has the attribute, it&#8217;ll call getattr on it, and for a lot of my clients who have moved towards more autonomous coding, we&#8217;ve put that in as a lint rule: if you use getattr, your pull request is going to fail.</p><h2>Slop Signatures: Comments, Backwards Compatibility, and Types</h2><p><strong>Swyx [00:53:12]:</strong> Ooh, this is a fun topic. Can you tell me more about this? What else is a sign of AI coding that you have to put guards in?</p><p><strong>Walden [00:53:21]:</strong> So we were talking just before this about Opus 4.7. 
One of the things this new model likes to do is it writes lots of comments. Not like, it&#8217;ll, comment every line, but it&#8217;ll write, paragraph, PRDs, on top of every function. But I will say, to its credit, these aren&#8217;t slop, descriptions like they were before. &#8220;Oh, here&#8217;s what this function does.&#8221; It&#8217;s like, &#8220;Oh, here&#8217;s actually the reasoning and why we chose this approach and what the alternatives were and why we shouldn&#8217;t do those alternatives.&#8221; Still too much information, but I wonder if this actually might be directionally correct if you want systems that can self-maintain themselves in the long run.</p><p><strong>Swyx [00:54:04]:</strong> Oh, they write the specs inline.</p><p><strong>Walden [00:54:05]:</strong> Have all the context In the code as well. Yeah.</p><p><strong>Swyx [00:54:07]:</strong> So you approve?</p><p><strong>Walden [00:54:09]:</strong> I But at the same time, it&#8217;s this tricky problem. Maybe we&#8217;ll just give our users, a setting or something, for, how verbose you want it to be. I haven&#8217;t loved it. Honestly, I just I like the comment, but please, get rid of it. But I could, I could see a world where maybe something of the sort becomes reality. I don&#8217;t know If you guys know about GitAI. So</p><p><strong>Swyx [00:54:32]:</strong> We&#8217;ve talked about it, yeah.</p><p><strong>Walden [00:54:33]:</strong> GitAI, the idea behind it is</p><p><strong>Swyx [00:54:34]:</strong> I&#8217;ll bring it up</p><p><strong>Walden [00:54:35]:</strong> That if you run an agent, the actual prompts you send to the agent should be stored alongside the code inside the Git metadata so that future agents can reference it, maybe code review bots can reference it. And it&#8217;s ideal world where, your context for why decisions were made constantly lives aside, beside your code. 
And so it&#8217;s maybe a more hidden version of this write-massive-PRDs-in-every-comment approach.</p><p><strong>Swyx [00:55:01]:</strong> I&#8217;m waiting for the real bull case where we just get rid of Git altogether. I&#8217;m not there yet, but I&#8217;m looking for it because that would be a big shift.</p><p><strong>Cole [00:55:11]:</strong> On the topic of visible slop, a pattern that I see a lot of across GPT models specifically is backwards compatibility at all costs</p><p><strong>Cole [00:55:21]:</strong> Where it&#8217;s doing these weird import/exports so that it doesn&#8217;t have to modify the names of where the modules were. And I&#8217;ve seen Claude 4.6 starting to do this as well.</p><p><strong>Cole [00:55:33]:</strong> And again, I think it is this reward hacking behavior where it doesn&#8217;t want failure to occur, and you can address that through Semgrep or other tools where that behavior is pretty easy to identify. But it&#8217;s something that you only learn through the trade of just seeing code patterns. Untyped tuples are a really big problem, or again, just throwing Any in there, dict[str, Any]. And again, you can address those through linting.</p><h2>Local Testing, Mock Servers, and AI-Ready Codebases</h2><p><strong>Swyx [00:56:01]:</strong> Awesome. Yeah. So, linting, any other tools? Devin Review, of course. Not so free now, but still use it.</p><p><strong>Walden [00:56:10]:</strong> Well, the one thing that I think we try to recommend to teams as they use more AI agents, it goes back to this local testing thing. At the end of the day, you want your agent to be able to do the full thing, not just write the code, but actually run it and test it. And a lot of code bases were not necessarily built for this from the start. 
For example, you probably do want a local DB setup, a local Docker Compose and Postgres, so that you don&#8217;t need to give your agent any crazy production credentials to actually run and test its code. We&#8217;ve also internally done a big shift to make a lot of our core components of code testable as purely local dev without needing to actually integrate with any live services for this reason. And honestly, the older the company, the more you have to change to shift in this direction. But you can use AI to help you perform this migration nowadays.</p><p><strong>Swyx [00:57:02]:</strong> The older the company, the more you have to change in order to do local dev?</p><p><strong>Walden [00:57:05]:</strong> I think so.</p><p><strong>Swyx [00:57:06]:</strong> Or am I misunderstanding? So you&#8217;re saying</p><p><strong>Walden [00:57:08]:</strong> Or oftentimes</p><p><strong>Swyx [00:57:08]:</strong> Most people just build with full integration to all their stuff, and there&#8217;s no code path to switch it to local.</p><p><strong>Walden [00:57:14]:</strong> Especially when there are lots of different services and you have a microservice architecture, making that shift, the larger the code base, the harder it is. I guess if you did build it correctly from the very start, I think it&#8217;d be possible. But also, there are a lot of companies in the world that got started before Docker was a thing, and so you&#8217;re forced to make a migration at some point.</p><p><strong>Swyx [00:57:35]:</strong> Well, Devin&#8217;s very good at making mock servers, right? Well, one of the projects that I really want to... It&#8217;s like Little Snitch. 
I don&#8217;t know if you guys have heard of this.</p><p><strong>Cole [00:57:44]:</strong> I run Little Snitch on my computer.</p><p><strong>Swyx [00:57:46]:</strong> It&#8217;s just like a man in the middle, but it shows you all the traffic going back and forth. But then from there you can reconstruct the server, right? And then create local mocks, so you can local mock everything if you just observe traffic for a little bit.</p><p><strong>Cole [00:57:58]:</strong> That&#8217;s an interesting idea.</p><p><strong>Swyx [00:58:01]:</strong> Cool. I don&#8217;t know if this will get anywhere, but I wanted to maybe talk a little bit about the Claude Code leak, because usually if I have an Anthropic person on, I can&#8217;t talk about the Claude Code leak. Did you guys learn anything from Claude Code? I</p><p><strong>Walden [00:58:19]:</strong> So if I say</p><p><strong>Cole [00:58:19]:</strong> This is the first time I&#8217;ve seen it</p><p><strong>Walden [00:58:19]:</strong> I was not that interested in the leak. We didn&#8217;t spend that much time on it</p><p><strong>Walden [00:58:24]:</strong> If I was to say, but</p><p><strong>Swyx [00:58:25]:</strong> I&#8217;m just fishing for</p><p><strong>Cole [00:58:28]:</strong> No, I didn&#8217;t really</p><p><strong>Cole [00:58:29]:</strong> Research too much into it.</p><h2>Windsurf, Local Agents, and Cloud Agents</h2><p><strong>Swyx [00:58:30]:</strong> Fair enough. Okay, one last thing before we go. Windsurf 2.0, you guys shipped another thing. So the meta context is, if you use background agents enough, sometimes you&#8217;re going to want to bring them to the foreground. And that little handoff from local to cloud is hard to work on. And Cognition has just done it.</p><p><strong>Walden [00:58:50]:</strong> I think for me the biggest gap this is trying to close is, again, how do you make the testing process as fast as possible? 
When it can test on its own and send you a video, it&#8217;s freaking magical. Sometimes there are just really difficult things that you do just need to pull down locally. And we just want Windsurf to be your local command center for all your agents, your background ones, your local ones, and you can imagine, &#8220;Oh, okay, this agent needs me to review something. I&#8217;ll pull that down, move my other agents to the background, go test it. Okay, boom, done. On to the next one,&#8221; right? You have some issue you got to fix in the background, just click approve. Okay, set up, start a background agent to go fix it. I&#8217;d love a world where I don&#8217;t have to leave this window. Then maybe the other window, I got to figure out how to stop spending so much time in Slack, but maybe someday we&#8217;ll want to get those tools all.</p><p><strong>Swyx [00:59:38]:</strong> And does that require the binaries to be exactly the same for local versus cloud?</p><p><strong>Walden [00:59:46]:</strong> So the funny thing here is that the behavior of local agents and cloud agents, I think, is actually a bit different in their ideal state. I think local agents, you want them to be a bit faster and let the user make the call on things. Actually don&#8217;t try to autonomously go test things. In the background agent mode where you go start it off, I think the agent should just assume the next message it sends the user should have everything that the user needs from it: keep running and don&#8217;t stop until you have the full report.</p><p><strong>Swyx [01:00:19]:</strong> So that&#8217;s just a slightly different prompt.</p><p><strong>Walden [01:00:20]:</strong> But for many reasons, because of all the work we do to make sure that Devin works with different Git providers, that it works with different OSes and VMs, we want as much of that logic to be shared as possible. 
So for our own practical purposes, we try to share as much of it as possible.</p><p><strong>Swyx [01:00:36]:</strong> Yeah. I mean, I can&#8217;t imagine how much work it is to, transition back and forth, so congrats on shipping this.</p><p><strong>Swyx [01:00:45]:</strong> okay. Anything else that we should cover before we, wrap? Just whatever you guys were talking about in your lunch.</p><p><strong>Walden [01:00:52]:</strong> maybe, use cases. What are your, do you find to be, the biggest things that your clients are trying to do with their cloud agents today?</p><p><strong>Cole [01:00:59]:</strong> Do you want to just ask it again so we can get, a clean cut?</p><p><strong>Swyx [01:01:02]:</strong> Because he was drinking his water. Yeah.</p><p><strong>Walden [01:01:04]:</strong> The thing I wanted to talk about was use cases. What do you think are the main things that your clients come to you today about, &#8220;Hey, this is why we want to go set up cloud agents&#8221;?</p><p><strong>Cole [01:01:15]:</strong> I think the easiest and most common use case I see across everyone is SRE use cases. The idea that whether we have our alerts in Slack or Datadog or wherever they&#8217;re going, we want the agent to be the first responder on that. And that doesn&#8217;t necessarily mean that the agent is actually resolving the issue, but just being able to collect that context ahead of time is huge. Because again, that agent is integrated into the production logs, the database. It has full visibility, and over time, playbooks as well for how to address certain issues. And so that&#8217;s a huge win for teams because instantly you can have a full trajectory of what is going on within the system, and oftentimes actually a pull request directly from that, which is a pretty neat flow to actually experience of, error pull request done. 
OpenInspect does support a trigger for that as well, so that could happen completely autonomously.</p><p><strong>Swyx [01:02:09]:</strong> From Datadog specifically, or just</p><h2>Use Cases: PMs, Support, Security, and SRE</h2><p><strong>Cole [01:02:11]:</strong> it supports Sentry, it supports a generic webhook, and if someone wants to add Datadog, they can. The other use cases that I see, are for non-builder use cases, whether that&#8217;s the PM or the marketing team. I&#8217;m seeing a lot of, teams where the idea of who&#8217;s actually contributing code is starting to change. And in a lot of cases, the PM, if there&#8217;s just a quick bug fix, the PM is not creating an issue anymore. The PM is just prompting through Slack, and the pull request is then being created. And so I think that&#8217;s a huge win. I think that trend will continue, where we&#8217;re seeing, code modifications happening outside of engineering. The last common use case that I see is customer support. And so where they&#8217;re experiencing an issue with a customer, they&#8217;re not entirely sure why this behavior is happening. Previously that world was, &#8220;Hey, there&#8217;s a bug when they tried to use this feature. We don&#8217;t know what&#8217;s going on.&#8221; Well, they&#8217;re now tagging that in Slack. Again, that entire full context is ready. They can then just tag in engineering and have a complete understanding of that issue and completely bypass the previous pain points of like, &#8220;Oh, can you get more information from them?&#8221;</p><p><strong>Walden [01:03:24]:</strong> The only things I&#8217;d add on top of that I think I&#8217;ve seen is, continual security scanning Continual security review Is a very big one as well. The SRE use case, internally we think about it as auto triage Because we just want every message that comes in, and that&#8217;s an alert, that&#8217;s a bug report, to have Devin just start triaging it before anything else. 
And we&#8217;ve leaned into this use case so much that we&#8217;ve basically tried to make it so that you don&#8217;t ever have to leave Slack to interact with this. So again, making the interactions with Devin super fluid from the moment the report comes in to when it responds to the report, and being able to ask it questions right there with full codebase context about all the issues. Very related to customer support as well, I think one thing that we found is CLIs can sometimes be very difficult for people who aren&#8217;t technical to go and use. But an online chat interface where anyone can go and ask questions, that is super intuitive and doesn&#8217;t assume you have any technical knowledge but does have access to all parts of your code base, is super useful for support, for salespeople, anyone who might need to have their questions answered about the code base. So yeah, great callout.</p><p><strong>Swyx [01:04:32]:</strong> This might potentially be a very expensive use case. Is there a rule of thumb on how much people should spend on this? &#8216;Cause you have unlimited budget, but other people don&#8217;t. I don&#8217;t know if this is an answerable question because obviously it depends on a lot of factors. But I guess, like</p><p><strong>Cole [01:04:51]:</strong> I think it depends really on how people are using it. I think if people are using it responsibly and they&#8217;re getting value from it, then you can kinda determine the budget. Common numbers that I hear are anywhere from 1,000 an engineer up to 5,000 an engineer. I have not heard anywhere in the realm of 50,000 an engineer, for a frame of reference.</p><h2>Model Costs, Smart Routing, and Frontier Tradeoffs</h2><p><strong>Swyx [01:05:12]:</strong> We&#8217;ll get there.</p><p><strong>Walden [01:05:13]:</strong> I&#8217;ve seen numbers go that high for sure. 
I think that this is also going to be a big theme of the coming year: we&#8217;re going to see very expensive, very smart frontier models, and we&#8217;re also going to see people who say, &#8220;You know what? I don&#8217;t need the frontier anymore for a lot of the work I do,&#8221; because subfrontier models actually are good enough for a lot of the work.</p><p><strong>Swyx [01:05:36]:</strong> Also shout-out, you pioneered Smartfind, which is a mix.</p><p><strong>Walden [01:05:39]:</strong> I&#8217;m really interested in a world where you basically have hybrid frontier and subfrontier systems, where you use the subfrontier part to be really fast, really efficient, and call out to the frontier part of the system so that you can still get frontier performance for the most part.</p><p><strong>Swyx [01:05:54]:</strong> I&#8217;m trying to search, but Twitter search is completely broken. The from field is just completely gone. It&#8217;s very sad, because I really want to</p><p><strong>Walden [01:06:04]:</strong> No worries. I might have to make a new post at some point about the return of Smartfind.</p><p><strong>Swyx [01:06:10]:</strong> Anthropic has now officially adopted it. Okay, cool. I think that&#8217;s it. It was a really great discussion, and great having you guys on. Background agents are a thing now, and everyone&#8217;s building them. But we talked a lot about the production concerns and why you would want to offer one architecture over the other. Yeah, lots to look forward to.</p><p><strong>Walden [01:06:35]:</strong> There&#8217;s a real zeitgeist in the space right now, I think, for companies to want to turn themselves into these autonomous coding factories. And yeah, we&#8217;re doing a lot to try to support that. 
And so, any listeners are welcome to come chat to us about that, whether using Devin or working with us.</p><h2>Wrap-Up: Hiring, Consulting, and Agent Adoption</h2><p><strong>Swyx [01:06:51]:</strong> Hiring?</p><p><strong>Swyx [01:06:53]:</strong> what, specifically, just like give like one profile that&#8217;s, very interesting.</p><p><strong>Walden [01:06:58]:</strong> I think people underestimate the role of, really high-taste product engineers In this space right now.</p><p><strong>Swyx [01:07:05]:</strong> And the test is, what have you shipped end to end that is A tasteful product.</p><p><strong>Walden [01:07:10]:</strong> If you&#8217;ve shipped stuff that you think is tasteful and you&#8217;re, and you&#8217;re proud of, you should, you should come talk to us.</p><p><strong>Cole [01:07:15]:</strong> For me, any businesses that are looking to further their engineering org, a lot of the consulting I do is around that. Teams who are maybe starting their AI journey, whether that&#8217;s with Cursor or Claude Code, but they&#8217;re looking for someone to help navigate them through the state-of-the-art and beyond just that initial deployment. As mentioned, there&#8217;s a lot of lift from you&#8217;ve deployed the background agent to how do we actually get this fully integrated into the company and really realizing the true value of that.</p><p><strong>Swyx [01:07:45]:</strong> Okay. 
Well, thanks you guys for coming on.</p><p><strong>Walden [01:07:47]:</strong> Thanks for having us.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Cognition raises $1B in $26B Series D]]></title><description><![CDATA[coding is an uncapped TAM market]]></description><link>https://www.latent.space/p/ainews-cognition-raises-1b-in-26b</link><guid isPermaLink="false">https://www.latent.space/p/ainews-cognition-raises-1b-in-26b</guid><pubDate>Thu, 28 May 2026 07:26:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i6tW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We last <a href="https://swyx.io/cognition">wrote about </a><strong><a href="https://swyx.io/cognition">Cognition</a> in <a href="https://news.smol.ai/frozen-issues/25-09-08-cog-smol.html">September&#8217;s $10B Series C</a> </strong>when Smol.ai also joined Cognition and AINews was eventually <a href="https://www.latent.space/p/2026">moved here to Latent Space</a>. 8 months later, it is <a href="https://x.com/cognition/status/2059660758531940856">worth 2.5x more</a>, and officially the largest <a href="https://x.com/swyx/status/2059717021944926238">remaining independent agent lab</a> in AI, a thesis we <a href="https://x.com/swyx/status/1990886806250782876">mapped out last year</a>. 
With official ARR disclosures (now <a href="https://www.youtube.com/watch?v=VuyOy5WN980">projecting &gt;$1B ARR by EOY</a>) you can map out the growth, which looks oddly similar to the <a href="https://www.latent.space/p/wtf2025">WTF Happened in 2025 charts</a> (this <a href="https://x.com/swyx/status/2057119153337545096">isn&#8217;t a coincidence</a>):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l_fo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l_fo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 424w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 848w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l_fo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png" width="1456" height="973" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:831076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199565531?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l_fo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 424w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 848w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!l_fo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc283a27b-c506-4ee9-8b9a-47650b429a01_2534x1694.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>In the enterprise SaaS business, ARR is a trailing indicator of utilization, as are the logos of some of the toughest/most discerning customers in the enterprise and startup ecosystem (including <a href="https://www.latent.space/p/ainews-new-ai-infra-unicorns-exa">Exa and Modal</a>, featured last week)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i6tW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!i6tW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 424w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 848w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 1272w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i6tW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png" width="1316" height="1616" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1616,&quot;width&quot;:1316,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:392802,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/199565531?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i6tW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 424w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 848w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 1272w, https://substackcdn.com/image/fetch/$s_!i6tW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1563dd3-9a40-45b1-9060-7ec196bf8e77_1316x1616.png 1456w" sizes="100vw"></picture>
</div></a></figure></div><p>We will release more on the Cognition podcast tomorrow.</p><blockquote><p>AI News for 5/26/2026-5/27/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Inference Efficiency, Serving Architectures, and Cost Curves</strong></p><ul><li><p><strong>Inference optimization is increasingly architectural, not just kernel-level</strong>: <a href="https://x.com/EagleCorp/status/2059485457227149334">EAGLE 3.1</a> improves speculative decoding robustness by stabilizing hidden-state feedback and reducing attention drift at deeper decode steps, with explicit emphasis on <strong>long-context acceptance length</strong> and real-world serving reliability; the team also highlighted collaboration with <a href="https://x.com/vllm_project">vLLM</a> and TorchSpec. 
At the kernel/system layer, Perplexity open-sourced a rebuilt <a href="https://x.com/perplexity_ai/status/2059664738087469511">Unigram tokenizer</a> that cuts CPU utilization <strong>5&#8211;6&#215;</strong> and reaches <strong>63 &#181;s at 514 tokens</strong> with zero heap allocations, while <a href="https://x.com/Alibaba_Qwen/status/2059674574397313277">Qwen3.5 on TokenSpeed</a> reportedly hits <strong>580 tokens/s</strong> for agentic workloads via joint optimization across Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. Supporting libraries also improved: <a href="https://x.com/ErikKaum/status/2059659837219156453">MaxSim v2</a> adds backprop and reports <strong>10.33&#215; faster on H200</strong> and <strong>11.94&#215; on A100</strong> versus na&#239;ve PyTorch.</p></li><li><p><strong>Price cuts are being justified by structural KV-cache and attention changes</strong>: Several posts converged on the same theme: recent API price cuts from Chinese labs look sustainable because they reflect <strong>lower serving cost per token</strong>, not temporary subsidy. <a href="https://x.com/kimmonismus/status/2059578380329394292">@kimmonismus</a> summarized how <strong>DeepSeek V4-Pro</strong> uses hybrid attention with <strong>Compressed Sparse Attention</strong> and <strong>Heavily Compressed Attention</strong> to bring <strong>1M-token KV cache to ~10% of V3.2</strong> and single-token inference FLOPs to <strong>27%</strong>, while still routing <strong>49B active params</strong> out of <strong>1.6T total</strong>. Xiaomi&#8217;s MiMo similarly reduces cache traffic using SWA plus hierarchical cache management. 
That was corroborated directly by <a href="https://x.com/_LuoFuli/status/2059618247553745204">@_LuoFuli</a>, who said MiMo&#8217;s deepest input-cache-hit price cut comes from <strong>5&#215; cached token capacity</strong>, roughly <strong>80% lower caching cost</strong>, and an architectural <strong>1:7 Full:SWA sparsity ratio</strong>. The broader takeaway: long-context inference economics are now being pushed by <strong>attention design + cache hierarchy + routing</strong>, not just cheaper hardware.</p></li></ul><p><strong>Agents, Harnesses, Memory, and Continual Learning</strong></p><ul><li><p><strong>The stack is shifting from &#8220;model quality&#8221; to &#8220;model-harness-memory fit&#8221;</strong>: A substantial cluster of tweets focused on practical agent engineering. LangChain shipped <a href="https://x.com/LangChain/status/2059634226836746483">Deep Agents v0.6</a> with <strong>Delta Channels</strong>, cutting checkpoint storage for a 200-turn coding session from <strong>5.3 GB to 129 MB</strong>, and also launched <a href="https://x.com/LangChain/status/2059685293322858809">computer use in Fleet</a>, plus <a href="https://x.com/hwchase17/status/2059687279199924462">Context Hub</a> for versioned agent context/skills. <a href="https://x.com/LangChain/status/2059654417478012938">LangSmith Engine</a> was framed as automating the eval &#8594; diagnosis &#8594; fix loop, with multiple practitioners emphasizing its value for turning trace feedback into reusable online/offline evaluators. 
In parallel, <a href="https://x.com/Vtrivedy10/status/2059712077925658717">@Vtrivedy10</a> offered the clearest formulation of the day: <strong>task-harness fit</strong> matters as much as model quality, and bespoke vertical systems outperform generic harnesses by narrowing tools, prompts, and context to the task.</p></li><li><p><strong>Continual learning is re-emerging as a product category, not just a research topic</strong>: The biggest announcement here was <a href="https://x.com/rronak_/status/2059644771262730624">Trajectory&#8217;s launch</a>: a platform for using <strong>product usage signals and agent traces</strong> to continuously post-train large agentic models, with <strong>$15M in funding</strong> and design partners including Clay, Harvey, Decagon, Mercor, and Rogo. Baseten said it supports these deployments with <a href="https://x.com/baseten/status/2059651376565936510#m">FP8/NVFP4 quantization and autoscaled H100 infra</a>, including a cited overnight deployment of a <strong>397B-parameter model</strong>. The same trend appeared in open tooling: <a href="https://x.com/hwchase17/status/2059487107144655356">an open-source memory-centric agent</a> built on LangChain/LangGraph was praised by multiple builders for its explicit separation of retrieval, storage, reasoning, and learning, and <a href="https://x.com/a1zhang/status/2059633834094678173">RLM&#8217;s minimal training harness</a> shows small teams can now RL-tune long-context agents in <strong>a day on 8&#215;A100</strong>. 
The throughline is that &#8220;post-deployment learning&#8221; is moving from aspiration to infra.</p></li></ul><p><strong>Benchmarks, Scaling Laws, and Training Methods</strong></p><ul><li><p><strong>New benchmarks are increasingly about long-horizon, messy, real-world workflows</strong>: <a href="https://x.com/_philschmid/status/2059564676569076021">DeepSWE</a> was highlighted as a SWE/agent benchmark with <strong>113 tasks across 91 repos in 5 languages</strong>, using a minimalist bash-only harness and shorter prompts, yet with tasks that require <strong>5.5&#215; more code</strong> than SWE-Bench Pro and touch <strong>7 files on average</strong>. In enterprise operations, Artificial Analysis and IBM launched <a href="https://x.com/ArtificialAnlys/status/2059698327235805258">ITBench-AA</a>, an SRE benchmark over Kubernetes incident response where <strong>all frontier models scored below 50%</strong>; <strong>Claude Opus 4.7</strong> led at <strong>47%</strong>, <strong>GPT-5.5</strong> followed at <strong>46%</strong>, and <strong>GLM-5.1 Reasoning</strong> led open weights at <strong>40%</strong>. Another useful reliability angle came from <a href="https://x.com/omarsar0/status/2059689897523642510">AgingBench</a>, which frames deployed agent degradation as a lifespan problem caused by compression, interference, and memory updates.</p></li><li><p><strong>Training efficiency research remains active across both theory and systems</strong>: Sakana AI&#8217;s <a href="https://x.com/hardmaru/status/2059648995132367277">DiffusionBlocks</a> was one of the most technically interesting releases: it reinterprets forward passes as diffusion-like denoising steps so deep nets can be trained <strong>one block at a time</strong>, dramatically reducing memory while matching end-to-end performance across <strong>ViTs, DiTs, masked diffusion, autoregressive transformers, and recurrent-depth transformers</strong>. 
On the RL systems side, Snowflake introduced <a href="https://x.com/StasBekman/status/2059718503318655314">ZoRRo</a>, claiming <strong>up to 3.5&#215; faster long-context RL</strong> and <strong>3.2&#215; longer context windows</strong> by eliminating redundant rollout computation, alongside the specialized <a href="https://x.com/dwarak/status/2059686825086902398#m">Arctic-Text2SQL-R2</a> enterprise SQL model. On the theory front, <a href="https://x.com/Tiberiu_Musat_/status/2059562156102746148">Tiberiu Musat&#8217;s preprint</a> argues minimum neural weight norm matches minimum program length up to a log factor for fixed-precision networks, while <a href="https://x.com/ethanCaballero/status/2059686905105563907">Unified Neural Scaling Law</a> proposes a multivariate functional form intended to extrapolate neural scaling behavior more accurately than prior fits.</p></li></ul><p><strong>Model and Modality Releases: Biology, Vision, OCR, and Embedded AI</strong></p><ul><li><p><strong>Protein modeling had a standout day</strong>: <a href="https://x.com/alexrives/status/2059611151860683097">ESMFold2</a> was announced as an open scientific engine for protein structure prediction and design, with strong reported results on <strong>protein interactions and antibodies</strong>, plus an accompanying atlas of <strong>6.8B proteins</strong> and <strong>1.1B predicted structures</strong>. The release emphasized both practical design outcomes&#8212;miniprotein binders and single-chain antibodies across five therapeutic targets&#8212;and mechanistic interpretability findings about emergent protein representations. 
The release was echoed by <a href="https://x.com/proteinrosh/status/2059633089702240598">@proteinrosh</a> and contextualized by <a href="https://x.com/cgeorgiaw/status/2059694583856927201">@cgeorgiaw</a>, who noted the atlas exceeds AlphaFold DB in scale.</p></li><li><p><strong>A wave of smaller but practical multimodal/open releases landed</strong>: Google DeepMind shared the white paper for <a href="https://x.com/mseyed/status/2059504005387284629">Gemini Embedding 2</a>, described as a <strong>native multimodal embedding model</strong> supporting unified representations over text, image, audio, and video. NVIDIA&#8217;s <a href="https://x.com/wildmindai/status/2059600079804088790">LocateAnything</a> combines <strong>Qwen2.5-3B + Moon-ViT</strong> for high-speed grounding, with a claimed <strong>10&#215; speedup</strong> for dense object detection. Hugging Face integrated Roboflow&#8217;s <a href="https://x.com/mervenoyann/status/2059647988373373253">RF-DETR</a>, positioning it as real-time detection/segmentation that outperforms YOLO-style systems. For document pipelines, <a href="https://x.com/VikParuchuri/status/2059675773712167423">Surya OCR 2</a> ships as a <strong>650M</strong> model with <strong>83.3% OLMOCR bench</strong>, <strong>87% on an internal 91-language benchmark</strong>, and <strong>5 pages/s on RTX 5090</strong>; <a href="https://x.com/jerryjliu0/status/2059710330016817501">LiteParse v2</a> rewrites parsing in Rust for <strong>up to 100&#215; speedups</strong> and edge/browser deployment via WASM. 
On-device AI also got a nod with Google&#8217;s new <a href="https://x.com/googlegemma/status/2059740184930074758">Coral board</a> for local speech, vision, and control demos.</p></li></ul><p><strong>Developer Platforms, Enterprise Controls, and Coding-Agent Productization</strong></p><ul><li><p><strong>Coding agents are consolidating into full product stacks with enterprise controls</strong>: OpenAI continued tightening Codex&#8217;s product surface: <a href="https://x.com/thsottiaux/status/2059650685948551384">GPT-5.2 and GPT-5.3-Codex are being sunset in Codex in favor of GPT-5.5</a>, while enterprise features now include <a href="https://x.com/OpenAIDevs/status/2059703536825565499">private MCP connectivity over outbound-only HTTPS</a>, <a href="https://x.com/OpenAIDevs/status/2059703600662925635">Workload Identity Federation</a>, and <a href="https://x.com/OpenAIDevs/status/2059703665276145920">expanded Admin API controls</a> for spend alerts, allowlists, retention policies, and hosted tool management. OpenAI also published a concrete case study on <a href="https://x.com/OpenAIDevs/status/2059638868983562640">self-improving tax agents with Codex</a>, centered on tracing reviewer corrections back into evals and fixes.</p></li><li><p><strong>Competition in coding agents is now visibly about reliability, workflow breadth, and enterprise adoption</strong>: <a href="https://x.com/ClaudeDevs/status/2059701677981413812">Claude Code</a> shared a reliability/performance update and easier bug-report capture, while GitHub kept pushing the &#8220;agentized IDE&#8221; direction with <a href="https://x.com/code/status/2059664796178354617">Copilot Dev Days</a> and <a href="https://x.com/code/status/2059666498285629707">MCP positioning</a>. 
The biggest commercial datapoint was <a href="https://x.com/cognition/status/2059660758531940856">Cognition</a>: <strong>&gt;$1B raised at a $26B valuation</strong>, <strong>enterprise usage up &gt;10&#215; YTD</strong>, and <strong>$492M run-rate revenue</strong>, paired with a growing customer list and strong endorsements from users like <a href="https://x.com/nityasnotes/status/2059768072110776370">Exa</a>. Meanwhile, smaller infra/product moves suggest the ecosystem is broadening: <a href="https://x.com/trycua/status/2059688960838828391">Cua Driver for Windows</a> brings background computer use to Windows agents; <a href="https://x.com/brandonjcarl/status/2059624598644109363">Cloudflare&#8217;s agent platform</a> was repeatedly praised for &#8220;fractional computing&#8221; economics; and <a href="https://x.com/theskory/status/2059729539287167068">Grok Build&#8217;s worktree support</a> targets multi-agent code swarms at repo scale.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Cognition&#8217;s scale-up</strong>: <a href="https://x.com/cognition/status/2059660758531940856">Cognition</a> announced <strong>&gt;$1B raised</strong>, <strong>$26B valuation</strong>, and <strong>$492M run-rate revenue</strong>, one of the clearest signals yet that coding agents are converting into large enterprise businesses.</p></li><li><p><strong>Claude Code reliability push</strong>: <a href="https://x.com/ClaudeDevs/status/2059701677981413812">Anthropic&#8217;s ClaudeDevs</a> posted a high-engagement update on responsiveness, reliability, and better feedback collection&#8212;evidence that product quality and trust are now central battlegrounds.</p></li><li><p><strong>Sakana AI&#8217;s DiffusionBlocks</strong>: <a href="https://x.com/hardmaru/status/2059648995132367277">@hardmaru</a> drew major attention to block-wise training that can match end-to-end performance while dramatically lowering memory requirements.</p></li><li><p><strong>ESMFold2 
release</strong>: <a href="https://x.com/alexrives/status/2059611151860683097">@alexrives</a> announced one of the day&#8217;s most substantive science releases: open protein modeling at atlas scale with therapeutic design implications.</p></li><li><p><strong>OpenAI enterprise controls + MCP</strong>: <a href="https://x.com/OpenAIDevs/status/2059703536825565499">@OpenAIDevs</a>&#8217;s posts on private MCP and related admin/security updates reflect where frontier APIs are competing for large-org adoption.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Low-Bit Local AI on Consumer Hardware</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1togflk/prismml_just_released_binary_and_ternary_bonsai/">PrismML just released Binary and Ternary Bonsai Image 4B: 1-bit/ternary text-to-image diffusion transformers that can even run 100% locally in your browser on WebGPU.</a></strong> (Activity: 759): <strong>PrismML released Binary and Ternary Bonsai Image 4B, described as </strong><code>1-bit</code><strong>/ternary text-to-image diffusion-transformer variants with ~</strong><code>3GB</code><strong> checkpoints, Apache-2.0 licensing, and a WebGPU browser demo (<a href="https://huggingface.co/collections/prism-ml/bonsai-image">HF collection</a>, <a href="https://huggingface.co/spaces/webml-community/bonsai-image-webgpu">demo</a>). The post compares them to FLUX.2 Klein 4B at ~</strong><code>16GB</code><strong>; a top technical comment claims Bonsai Image is primarily a quantized/post-trained derivative of FLUX.2 Klein 4B, with insufficient attribution outside the whitepaper.</strong> The main debate is attribution/branding: one commenter argues PrismML is rebranding quantized/fine-tuned base models as &#8220;Bonsai&#8221; while minimizing credit to original labs, comparing it to releasing a quant of Qwen as a new model. 
Another commenter asks whether it can run on CPU with <code>16GB</code> RAM, but no technical answer is provided in the supplied comments.</p><ul><li><p>A commenter alleges <strong>PrismML&#8217;s &#8220;Bonsai-Image&#8221; is not a newly trained base model</strong>, but a <strong>binary/ternary quantization of </strong><code>FLUX.2 Klein 4B</code> with additional post-training to recover quality. They argue the project&#8217;s HF demo/model pages and GitHub omit clear attribution to the original FLUX model/team, with the original model reportedly mentioned only in the whitepaper.</p></li><li><p>A technical usability note says the browser/WebGPU model requires a <code>~2 GB</code><strong> download</strong>, which is relevant for fully local inference despite the 1-bit/ternary compression claims. Another user asks whether it can run on <strong>CPU with 16 GB RAM</strong>, but no concrete benchmark or compatibility answer is provided in the thread.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLM/comments/1to6enj/got_tired_of_oom_errors_on_my_4gb_gpu_wrote_a/">Got tired of OOM errors on my 4GB GPU. Wrote a custom Rust bare-metal engine and hit 66.8 TPS with a 4B model (BitNet 1.58b on RTX 3050).</a></strong> (Activity: 390): <strong>OP claims a custom Rust/C++ LLM inference engine, Cluaiz, runs </strong><code>prism-ml/Bonsai-4B-gguf</code><strong> with </strong><code>1.58-bit</code><strong> quantization on an RTX 3050 4GB, reaching </strong><code>66.8 tokens/s</code><strong>, and reports </strong><code>~30&#8211;33 TPS</code><strong> for Gemma/Qwen 4B variants without OOM via dynamic KV-cache management. 
No reproducible repo or benchmark artifacts were provided in the post yet; commenters pointed to the apparent project links (<a href="https://github.com/cluaiz/cluaiz">GitHub</a>, <a href="https://cluaiz.com/">site</a>) and questioned vague claims like </strong><em><strong>&#8220;direct-to-silicon&#8221;</strong></em><strong> access, noting this may simply mean ahead-of-time native compilation rather than any unusual GPU/driver-level mechanism. The attached Reddit video could not be independently accessed due to Reddit </strong><code>HTTP 403</code><strong> restrictions.</strong> Top comments were strongly skeptical, characterizing the writeup and repo language as pseudo-technical/AI-generated and arguing the stated achievements amount to basic native compilation plus a single-machine demo. Commenters also challenged the project&#8217;s licensing/copyright wording under Apache 2.0 and asked for concrete implementation details behind the claimed low-level hardware access.</p><ul><li><p>Commenters challenged the technical claims in the linked repo (<a href="https://github.com/cluaiz/cluaiz">github.com/cluaiz/cluaiz</a>, <a href="https://cluaiz.com/">cluaiz.com</a>), arguing that descriptions like <strong>&#8220;direct silicon access&#8221;</strong>, &#8220;bare-metal engine,&#8221; and &#8220;copyrighted Apache licensed software&#8221; appear to be marketing or LLM-generated pseudo-technical language rather than concrete implementation details. One commenter asked whether &#8220;direct silicon access&#8221; merely means <strong>ahead-of-time native compilation in Rust</strong>, rather than any real low-level GPU programming beyond normal CUDA/driver APIs.</p></li><li><p>Several commenters argued that the claimed outcome should be compared against existing tooling, especially <strong>llama.cpp</strong>, which already supports low-memory inference and quantized models on consumer GPUs. 
The critique was that OOM issues on a <code>4GB</code> RTX 3050 are often solvable through proper llama.cpp configuration rather than writing a new engine, so the claimed <code>66.8 TPS</code> with a <code>4B</code> BitNet 1.58b model needs reproducible benchmarks and configuration details to be meaningful.</p></li></ul></li></ul><h3><strong>2. Qwen 3.5/3.6 Local Model Releases and Coding Tests</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tnzalm/qwen35_35b_a3b_uncensored_heretic_native_mtp/">Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats</a></strong> (Activity: 602): <strong>llmfan46 released </strong><code>Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved</code><strong>, a decensored derivative of </strong><code>Qwen/Qwen3.5-35B-A3B</code><strong> made with Heretic v1.3.0 / Magnitude-Preserving Orthogonal Ablation-style edits targeting </strong><code>attn.o_proj</code><strong>, </strong><code>attn.out_proj</code><strong>, and </strong><code>mlp.down_proj</code><strong>, while preserving all </strong><code>785</code><strong> native MTP tensors. 
The model card reports refusals reduced from </strong><code>92/100</code><strong> to </strong><code>14/100</code><strong>, KL divergence </strong><code>0.0487</code><strong> vs base, and MMLU dropping only from </strong><code>84.12%</code><strong> to </strong><code>83.72%</code><strong> over </strong><code>7,021</code><strong> questions; releases include <a href="https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved">Safetensors</a>, <a href="https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF">GGUF</a>, <a href="https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4">NVFP4</a>, <a href="https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF">NVFP4 GGUF</a>, and <a href="https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4">GPTQ-Int4</a> variants. The author argues Qwen3.5 and Qwen3.6 both use the </strong><code>qwen35</code><strong> architecture but are tuned for different regimes&#8212;Qwen3.5 for general assistance, Qwen3.6 for agentic/coding&#8212;and notes abliteration KL/quality behavior differs substantially between the families.</strong> Commenters appreciated the unusual availability of an <strong>NVFP4 GGUF</strong> build, with one noting they could not find comparable releases even from Unsloth. 
Another tester agreed with the author&#8217;s positioning, describing Qwen3.6 as closer to <em>&#8220;3.5 coder+&#8221;</em> rather than a simple across-the-board successor to Qwen3.5.</p><ul><li><p>One commenter highlighted the practical value of the <strong>NVFP4 GGUF</strong> build, noting that this format is hard to find elsewhere: <em>&#8220;I seriously can&#8217;t find anyone else doing that, not even Unsloth.&#8221;</em> This is technically relevant because NVFP4 GGUF availability can matter for users targeting newer NVIDIA-oriented low-precision inference workflows while still using GGUF-based runtimes.</p></li><li><p>A tester compared <strong>Qwen3.5</strong> and <strong>Qwen3.6</strong>, arguing that 3.6 feels more like <em>&#8220;3.5 coder+&#8221;</em> than a straightforward general upgrade. They suggested the short time between releases makes a broad capability leap unlikely, implying 3.6 may be more specialized toward coding rather than a simple successor to 3.5.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1to73op/okay_27b_made_me_a_believer/">Okay 27B made me a believer</a></strong> (Activity: 541): <strong>OP reports that a </strong><code>27B</code><strong> Qwen-family model used via Opencode generated a near-complete HTML5 Breakout-style game in one shot from three reference files describing console APIs, gamepad controls, and a TypeScript shader. The output was immediately playable, with working controls, sound, metadata, save/stat/heartbeat API integration, and only required one follow-up for customization plus one glitch fix; a commenter recommends enabling MTP/speculative decoding with </strong><code>2&#8211;3</code><strong> draft tokens for speed. 
Another heavy user says the model performs best below </strong><code>64K</code><strong> context, degrades noticeably past </strong><code>64K</code><strong>, and &#8220;really drops off&#8221; after </strong><code>128K</code><strong>, recommending periodic summarization-to-file and session resets for long agentic coding tasks.</strong> Commenters characterize the dense <code>27B</code> as unusually strong for local coding&#8212;<em>near-Sonnet class</em> for web-app one-shots&#8212;while one user found <code>35B A3B</code> less capable despite its size/routing advantages. The main caution is that long-context agentic runs can induce loops or &#8220;stupidity,&#8221; so users should manage context aggressively.</p><ul><li><p>A commenter recommended enabling <strong>MTP/speculative decoding</strong> for better throughput, suggesting an MTP value of <code>2</code> or <code>3</code> as a practical speed/quality tradeoff. This is a deployment-level optimization rather than a model-quality claim, useful for users running the 27B model locally.</p></li><li><p>One user reported that the 27B model&#8217;s effective reasoning quality drops noticeably with long contexts: <strong>best below </strong><code>64K</code><strong> tokens</strong>, degraded past <code>64K</code>, and <em>&#8220;really drops off after </em><code>128K</code><em>.&#8221;</em> Their workaround for long-horizon agentic tasks is to periodically summarize state into a file, restart the harness/session, and reload the summary to recover model quality and avoid loops.</p></li><li><p>A benchmark operator said <strong>Qwen 27B</strong> was such an outlier that they rechecked their methodology, placing it <em>roughly on par with GPT-5.2 or Sonnet 4.5</em> in their rankings while noting it struggles at larger context sizes, likely due to parameter-count limits. They linked their data at <a href="https://gertlabs.com/rankings">gertlabs.com/rankings</a>.</p></li></ul></li></ul><p></p><p></p>
   ]]></content:encoded></item></channel></rss>