Latent.Space: AINews: Weekday Roundups

[AINews] ImageGen is on the Path to AGI

Tue, 28 Apr 2026 05:38:19 GMT

As every lab sprints toward being some form of Anthropic (aka having a coding and enterprise AI focus, producing ever better PDFs and PPTs and spreadsheets), it is still refreshing to see that GPT-Image-2 is continuing to drive more creative applications, for example this:

Considering the extremely high NPS score of the Lego Rocky Space Friend on date nights, you can imagine how good a low-hallucination, research-enabled, fully multimodal reasoning image model can be.

Of course it’s good for education:

or pop culture:

or precise, clean infographics:

And of course the GPT-Image-2 + Codex combo, which is available as a skill in Codex, which you can iteratively use to generate assets WHILE you code:

And just like that, Claude Design, the previous Current Thing, isn’t even in the conversation anymore. Quite simply, if you can “close” the loop, you win.

But that isn’t quite the argument we’re making here. What we’re focusing on is the very literal and serious question of whether or not models like Nano Banana or GPT-Image-2 or Grok Imagine are necessary uses of scarce GPU capacity if you are eschewing “side quests” and seriously pursuing AGI and trying to hit the revenue, efficiency, and funding goals necessary to not die along the way.

The answer is emergingly clear: yes. Not merely because of the “closing the loop”. But also because you can only do so much with text and code and structured output generation. When you have multimodal voice and visual generation (including transparency!), you truly flex the “G” part of “AGI” - after all, what good is AI if it only narrowly takes all programming jobs?

By the way, horse-riding astronauts used to be hard in imagegen, then it was astronaut-riding-horses, and now, well…

AI News for 4/26/2026-4/27/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI Distribution Shift, GPT-5.5 Benchmarks, and Codex/Copilot Pricing Signals

OpenAI loosens Azure exclusivity: @sama said OpenAI updated its Microsoft partnership so Microsoft remains the primary cloud, but OpenAI can now make products available across all clouds, with product/model commitments extending to 2032 and revenue share through 2030. The implication was quickly drawn by @scaling01 and @kimmonismus: OpenAI can now distribute via Google TPU / AWS Trainium / Bedrock, and Microsoft’s license to OpenAI IP becomes non-exclusive. @ajassy confirmed OpenAI models are coming to AWS Bedrock in the coming weeks. @simonw noted the new language likely means the old AGI clause is effectively gone.
GPT-5.5 is a broad upgrade, but not uniformly dominant: Community evals from @htihle put GPT-5.5 no-thinking at 67.1% on WeirdML, up from 57.4% for GPT-5.4, but still behind Opus 4.7 no-thinking at 76.4% while using fewer tokens. LMSYS Arena results from @arena placed GPT-5.5 at #9 in Code Arena, #6 Document, #7 Text, #3 Math, #2 Search, #5 Vision, with Expert Arena #5. Arena also clarified current evaluation covers medium/high reasoning, with xHigh still pending (1, 2). Practitioner feedback was positive for hard coding tasks such as GPU kernels from @gdb, but there were also reports of “compressed CoT leakage” / malformed outputs in no-thinking mode from @htihle.
Developer economics are becoming more explicit: GitHub announced Copilot moves to usage-based billing on June 1, a notable shift as agentic workflows consume much more runtime. Parallel to that, @Hangsiin documented Codex usage multipliers: GPT-5.4 fast = 2x, GPT-5.5 fast = 2.5x, with 5.4-mini and GPT-5.3-Codex materially cheaper. @sama argued Codex at $20 remains a strong value. OpenAI also open-sourced Symphony, an orchestration layer connecting issue trackers to Codex agents for “open issue → agent → PR → human review,” via @OpenAIDevs.

Xiaomi MiMo-V2.5, Kimi K2.6, and China’s Agent-Oriented Open-Weights Push

MiMo-V2.5 is one of the day’s biggest open releases: @XiaomiMiMo open-sourced MiMo‑V2.5-Pro and MiMo‑V2.5 under MIT, both with 1M-token context. The Pro model is framed as a complex agent/coding model and the smaller model as a native omni-modal agent. Community summaries from @eliebakouch add useful technical details: MiMo‑V2.5-Pro is roughly 1T total / 42B active, trained on 27T tokens in FP8, while MiMo‑V2.5 is about 310B total / 15B active, trained on 48T tokens, with aggressive interleaved SWA/global attention and no shared expert. Xiaomi also announced a 100T token grant for builders via @_LuoFuli. Day-0 inference support landed quickly in vLLM and SGLang/vLLM.
Kimi K2.6 continues to lead in mindshare and deployment: @Kimi_Moonshot said Kimi K2.6 is now #1 on OpenRouter’s weekly leaderboard. Secondary reporting described it as a model for coding and long-horizon agents, including scaling to 300 concurrent sub-agents across 4,000 coordinated steps (dl_weekly). Practitioners remain split on speed/quality tradeoffs: @teortaxesTex found Kimi in Hermes much slower than DeepSeek V4 but sometimes capable of fixing bugs V4 could not.
Broader China-model trend: Multiple posts framed Chinese labs as pushing aggressively on open-ish, agent-oriented, long-context systems: Qwen 3.6 Flash, DeepSeek V4/Flash, GLM-5.1 promotions (triple usage extension), and Xiaomi’s MIT release. A recurring theme was that smaller / cheaper variants are often outperforming their larger siblings on practical agent benchmarks.

Agent Runtimes, Orchestration, and Local-First Tooling

Sakana’s Conductor is a notable multi-agent result: @SakanaAILabs introduced a 7B Conductor trained with RL to orchestrate a pool of frontier models in natural language rather than solving tasks directly. It dynamically decides which agent to call, what subtask to assign, and which context to expose, and reportedly reached 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond, beating any single worker in its pool. @hardmaru highlighted “AI managing AI” and recursive self-selection as a new axis of test-time scaling.
Local and hybrid agents keep getting better: Several posts showed coding/assistant stacks running locally. @patloeber and @_philschmid documented running Pi agent + Gemma 4 26B A4B locally via LM Studio/Ollama/llama.cpp. @googlegemma demoed a fully local browser agent using Gemma 4 + WebGPU, with native tool calling for browsing history, tab management, and page summarization. @cognition shipped Devin for Terminal, a local shell agent that can later hand off to the cloud.
Agent ergonomics and framework evolution: Hermes had a strong day: @Teknium noted Hermes Agent’s repo surpassed Claude Code, while native vision became the default when supported. The broader ecosystem kept filling in missing pieces: Cline Kanban now supports different agents/models per task card; Future AGI open-sourced an eval/optimization stack for self-improving agents; and @_philschmid argued MCP works best either through explicit @mention loading or subagent-scoped tool assignment, not indiscriminate server attachment.

Inference Infrastructure, Attention/KV Engineering, and Systems Work

Google’s TPU split is a meaningful architecture signal: Several posts dissected Google’s Cloud Next announcement that TPU v8 is split into 8t for training and 8i for inference, with claims of roughly 2.8x faster training and 80% better inference performance/$ than prior generation. @kimmonismus emphasized this is the first time Google split custom silicon by workload and that OpenAI, Anthropic, and Meta are reportedly buying TPU capacity.
DeepSeek V4 support is maturing quickly in infra stacks: @vllm_project said support for DeepSeek V4 base models is coming, requiring an expert_dtype config field to distinguish FP4 instruct vs FP8 base. In the vLLM 0.20.0 release, highlights included DeepSeek V4 support, FA4 as default MLA prefill, TurboQuant 2-bit KV, and a DeepSeek-specific MegaMoE path on Blackwell.
KV cache optimization remains a hot battleground: There was dense discussion around long-context bottlenecks and KV strategies. @cHHillee summarized three main levers for long contexts: local/sliding attention, interleaved local-global attention, and smaller KV per global layer via GQA/MLA/KV tying/quantization. On the implementation side, @vllm_project and Red Hat/AWS published an FP8 KV-cache deep dive where a fix to FA3 two-level accumulation improved 128k needle-in-a-haystack from 13% to 89% while retaining FP8 decode speedups. Community critics also questioned DeepSeek V4’s specific KV tradeoffs relative to offloading-heavy approaches such as HiSparse (discussion).

Benchmarks, Evals, and Open Research Directions

Open-world evaluation is gaining momentum: @sarahookr argued that most agentic benchmarks are overfit to automatically verifiable tasks, while the important frontier is open-world, uncertain, non-fully-verifiable work. Related threads connected this to continual learning, memory stores, and adaptive data systems (1, 2).
Cost-aware agent evaluation is becoming first-class: @dair_ai highlighted a new study on coding-agent spend over SWE-bench Verified: agentic coding can consume ~1000x more tokens than chat/code reasoning, usage can vary 30x across runs on identical tasks, and more spending does not monotonically improve accuracy. This lines up with pricing-model changes from Copilot and growing concern over uncontrolled agent runtime economics.
New benchmarks and domain-specific evals: ParseBench from LlamaIndex adds 2k verified enterprise document pages for parsing agents. AgentIR reframes retrieval for research agents by embedding the reasoning trace alongside the query, with AgentIR-4B hitting 68% on BrowseComp-Plus vs 52% for larger conventional embedding models. There were also several benchmark snapshots for frontier models—e.g. Opus 4.7 leading GSO at 42.2% and WeirdML / ALE-Bench / PencilPuzzleBench chatter—but the stronger signal was methodological: more people are measuring runtime cost, retrieval quality, and open-world behavior, not just final answer accuracy.

Top tweets (by engagement)

OpenAI–Microsoft partnership reset: @sama on cross-cloud availability and continued Microsoft partnership.
OpenAI on AWS: @ajassy confirming OpenAI models are coming to Bedrock.
GitHub Copilot pricing change: @github announcing usage-based billing starting June 1.
Xiaomi MiMo-V2.5 open-source release: @XiaomiMiMo with MIT license and 1M context.
Open-source orchestration for Codex: @OpenAIDevs launching Symphony.
Gemma local browser agent: @googlegemma showing a 100% local browser-resident agent with WebGPU.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.6 Model Performance and Optimization

[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips

Sat, 25 Apr 2026 05:00:48 GMT

After a couple months’ delay and lots of speculation, DeepSeek finally released the heavily anticipated DSV4, the first major version model since DSV3 (Dec 2024) and DSR1 (Jan 2025). It brings the DeepSeek family up in line with Kimi K2.6, the current open model leader, and Xiaomi Mimo 2.5, a lesser known family released 2 days ago.

The DSV4 family is roughly a Gemini 3.1, GPT 5.4, Opus 4.6 level model, up to 1.6T MOE withtrained on 32T tokens with FP4, with 1M token context (supported by their new Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) techniques), and incredibly rarely, they released both the Base and Instruct versions - surely setting the stage for a possible “DeepSeek R2” in future, though this one already has reasoning effort.

The technical report is a typically dense 58 pages, demonstrating training and inference insights and improvements from the Manifold Constrained Hyper-Connections (mHC) paper they released in January, continued usage of Moonshot’s Muon, and CSA/HCA’s overall INCREDIBLE efficiency improvements on DeepSeek 3.2-Exp’s already impressive Sparse Attention - at 1M tokens, requiring only 27% of FLOPs and 10% of KV cache memory compared with DeepSeek-V3.2:

The geopolitical backdrop behind the Huawei CANN compatibility is DeepSeek weaning dependence off export-controlled NVIDIA/CUDA chips — Ascends are still a quarter the supply of H100s, but this is an important milestone for Chinese total independence.

AI News for 4/23/2026-4/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: DeepSeek V4

DeepSeek released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3 and first clear two-tier lineup, with 1M-token context, hybrid reasoning/non-reasoning modes, an MIT license, and a technical report detailed enough that multiple researchers called it one of the most important or best-written model papers of the year. Across the reactions, the factual consensus is that V4 materially advances open-weight long-context and agentic coding performance while remaining somewhat behind the top closed frontier models overall. Independent benchmarkers place V4 Pro around the #2 open-weights tier, roughly near Kimi K2.6 / GLM-5.1 / strong Claude Sonnet-class to Opus-ish depending on benchmark and mode, with especially strong long-context and agentic performance; opinions diverge on how close it is to GPT-5.x / Opus 4.7 and on whether this is “democratizing” progress or an architecture so complex that few open labs can realistically reproduce it. Key sources include deep-dive commentary from @ArtificialAnlys, @scaling01, @nrehiew_, @ben_burtenshaw, @TheZachMueller, @ZhihuFrontier, and infra/vendor posts from @vllm_project, @NVIDIAAI, and @Togethercompute.

Core facts and technical details

The most concrete technical claims repeated across the discussion:

Two models
- V4 Pro: 1.6T total parameters / 49B active
- V4 Flash: 284B total / 13B active
- Reported by @ArtificialAnlys, @teortaxesTex, @baseten, @NVIDIAAI
Context
- 1M tokens, up from 128K in V3.2 per @ArtificialAnlys
- Multiple posters frame this as the headline achievement: “solid ultra-long context” @teortaxesTex
Training scale
- 32T–33T tokens cited repeatedly
- @nrehiew_ notes 32T tokens over 1.6T parameters, i.e. roughly 20 tokens/parameter
- @teortaxesTex cites 33T
- @nrehiew_ estimates pretraining compute at ~1e25 FLOPs
Reasoning / modes
- DeepSeek exposes three reasoning modes per @Togethercompute
- Hybrid “thinking/non-thinking” positioning noted by @ArtificialAnlys
Long-context architecture
- Several threads summarize a new hybrid attention system:
  - shared KV vectors
  - compressed KV streams
  - sparse attention over compressed tokens
  - local/sliding-window attention for nearby context
- @ZhihuFrontier gives the most compact public summary:
  - 2× KV reduction via shared key-value vectors
  - c4a ≈ 4× compression
  - c128a ≈ 128× compression
  - top-k sparse attention on compressed tokens
  - 128-token sliding window
  - 1M context KV cache = 9.62 GiB/sequence (bf16)
  - 8.7× smaller than DeepSeek V3.2’s 83.9 GiB
  - FP4 index cache + FP8 attention cache gives another ~2× reduction
- @ben_burtenshaw condenses this to “10× smaller KV cache”
- @TheZachMueller and @TheZachMueller describe CSA + HCA layer patterns, with alternating layers and V4 Flash using sliding-window layers instead of HCA in some places
Quantization / checkpoint format
- @LambdaAPI: checkpoint is mixed FP4 + FP8
  - MoE expert weights in FP4
  - attention / norm / router in FP8
  - claim: the full model fits on a single 8×B200 node
Inference hardware / serving
- @NVIDIAAI: on Blackwell Ultra, V4 Pro can deliver 150+ TPS/user interactivity for agentic workflows
- @NVIDIAAI: published day-0 V4 Pro performance pareto using vLLM
- @SemiAnalysis_: day-0 support and benchmarking across H200, MI355, B200, B300, GB200/300
- @Prince_Canuma: DeepSeek4-Flash on 256GB Mac
- @Prince_Canuma: MLX quants published
- @simonw asks about smaller-RAM Mac viability, implying community interest but incomplete support story
- @QuixiAI reminds users that many local stacks still lack tensor parallel, relevant because V4-class models strongly stress inference infra
License / availability / pricing
- MIT license per @ArtificialAnlys
- first-party API plus rapid third-party availability via @Togethercompute, @baseten, @NousResearch, @Teknium
- V4 Pro pricing: $1.74 / $3.48 per 1M input/output tokens
- V4 Flash pricing: $0.14 / $0.28
- cache-hit pricing also given by @ArtificialAnlys
- @scaling01 views the pricing as a glimpse of future “Mythos-level” cheap coding models
- Reuters-via-posted quote from @scaling01: DeepSeek said Pro pricing could fall sharply once Huawei Ascend 950 supernodes are deployed at scale in H2

Independent evaluations and where V4 lands

The most useful independent benchmark synthesis came from @ArtificialAnlys:

V4 Pro Max: 52 on Artificial Analysis Intelligence Index
- up 10 points from V3.2 at 42
- becomes #2 open weights reasoning model, behind Kimi K2.6 (54)
V4 Flash Max: 47
- positioned around strong mid/high open models, “Claude Sonnet 4.6 max level intelligence”
GDPval-AA (agentic real-world work):
- V4 Pro: 1554, leading open-weight models
- ahead of Kimi K2.6 (1484), GLM-5.1 (1535), MiniMax-M2.7 (1514)
AA-Omniscience
- V4 Pro: -10, an 11-point improvement over V3.2
- but still paired with 94% hallucination rate
- V4 Flash: 96% hallucination rate
Cost to run AA Index
- V4 Pro: $1,071
- V4 Flash: $113
Output tokens used on AA Index
- V4 Pro: 190M
- V4 Flash: 240M
- This is a major caveat: cheap per-token pricing does not imply cheap total task cost if the model spills huge token volumes

Additional eval perspectives:

@arena:
- #2 open in Text Arena overall at debut
- category wins/placements:
  - #1 Medical & Healthcare
  - #15 Creative Writing
  - #18 Multi-Turn
- thinking variant:
  - #8 Math
  - #9 Life/Physical/Social Science
@arena emphasizes the Pro vs Flash tradeoff:
- Pro ranks ~30 places higher
- costs 12× more
- Flash is still competitive in Chinese, medicine, math
@scaling01:
- “~Opus 4.5 estimate holds for now, at least on SimpleBench”
@scaling01:
- V4 is “definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro”
@scaling01 lists what scores would confirm <6 month gap:
- ARC-AGI-1 ~75%
- ARC-AGI-2 ~35%
- GSO ~26%
- METR 4.5–5 hours
- WeirdML ~63%
@TheZachMueller:
- on his evals, Flash@max ≈ Pro@high on reasoning
- Pro focuses more on knowledge (SimpleQA)
@VictorTaelin:
- after fixing benchmark bugs and letting long-running models run longer, DeepSeek and Kimi improved materially
@mbusigin:
- a simple negative early impression with no detail
@petergostev:
- on BullshitBench, not about capability but refusal/pushback behavior, GPT-5.5 underperformed; included here because many readers compare V4 in an eval-skeptical environment

Facts vs opinions

Facts / relatively well-supported claims

V4 Pro / Flash were released with the specs above, MIT-licensed, 1M context, and open technical documentation: @ArtificialAnlys, @TheZachMueller
The architecture introduces a new long-context attention system with dramatic KV-cache reduction: @ZhihuFrontier, @ben_burtenshaw
Independent benchmarkers broadly place V4 Pro near the very top of open weights but below the best proprietary models overall: @ArtificialAnlys, @arena, @scaling01
DeepSeek V4 is heavily token-intensive in some evaluations: @ArtificialAnlys
The checkpoint uses FP4/FP8 mixed precision and can fit on an 8×B200 node: @LambdaAPI
Rapid ecosystem support arrived via vLLM and other providers day 0: @vllm_project, @SemiAnalysis_

Opinions / interpretation

“V4 is ~4–5 months behind the frontier” from @scaling01, @scaling01, @scaling01 is an informed estimate, not a measured fact
“Top three open” vs “only open model close to frontier” debate from @teortaxesTex is partly about benchmark trust and framing
“Strongest pretrained model we have” from @teortaxesTex is an opinion hinging on scale + architecture, not direct benchmark supremacy
“Most significant AI paper of the year” from @Dorialexander is enthusiasm, not consensus
“This is what research should look like” from @scaling01 speaks to transparency/style rather than only capability
“Not exactly a democratizing technology” from @teortaxesTex is a strong architectural/political interpretation

Different opinions and fault lines

1) Is V4 near frontier, or clearly behind?

More favorable

@scaling01: puts it at roughly GPT-5.2 / Opus 4.5+ tier
@scaling01: SimpleBench supports ~Opus 4.5
@teortaxesTex: argues it is the strongest pretraining base among opens and implies people are underestimating what post-training can do

More skeptical

@scaling01: below Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro
@scaling01: the gap may widen again because closed labs have bigger models, better science/law/medicine coverage, faster inference with GB200s
@mbusigin: early impressions “not great”
@teortaxesTex: says polished models like K2.6 and GLM 5.1 may still feel better in coding despite lower intrinsic capacity

2) Is V4’s real contribution model quality, or long-context systems design?

A big split in reactions is that many technical readers think the long-context architecture matters more than the raw benchmark position.

@teortaxesTex: “They’ve completed their quest: Solid Ultra-Long Context”
@ben_burtenshaw: first open model where long context and agentic post-training “meet”
@scaling01: expects other open labs to adopt pieces of the architecture
@Dorialexander: frames Huawei/sovereignty constraints as an opportunity to reshape hardware and memory/interconnect design
@jukan05: reads the paper as evidence that NVIDIA’s hardware roadmap is unusually well aligned to where MoE/long-context models are going

3) Is V4 “open democratization,” or too hard to copy?

This was one of the sharpest strategic disagreements.

@teortaxesTex: says V4 is “not exactly a democratizing technology” because the architecture is too difficult for most labs to replicate
@teortaxesTex: suggests even DeepSeek may not want to do this exact architecture again without refactoring
@stochasticchasm: notes the sheer hyperparameter complexity is daunting
Against that, @Prince_Canuma and @Prince_Canuma show that the ecosystem is already compressing and adapting Flash for localish Apple Silicon use, softening the “not democratizing” claim on the inference side if not the training side

4) Are people underrating Flash?

Several reactions suggest Flash may be more important than Pro for practical adoption.

@arena: Flash shifts the price/performance frontier
@TheZachMueller: Flash@max ≈ Pro@high on reasoning tasks
@teortaxesTex: benchmarks may underweight “legit 1M context for pennies”
@Prince_Canuma: Flash runs on 256GB Mac
@baseten and @Togethercompute emphasize long-document analysis and agentic use cases where Flash’s economics matter

China, chips, Huawei, and sovereignty context

DeepSeek V4 was not discussed as a pure model release; it was treated as evidence in the larger US–China compute and sovereignty debate.

@scaling01: Chinese labs are already in or near “takeoff” in the sense that their models help build better models, though still shifted 5+ months behind
@scaling01: thinks chip bans are likely to widen the gap in broad domains over time
@teortaxesTex, @teortaxesTex: disputes simplistic Huawei-dismissal and notes mixed Chinese sentiment toward Huawei
@ogawa_tter: points to analysis of Ascend 950 / A3 clusters and V4 deployment plans
@Dorialexander: argues the sovereignty play around Huawei may reshape hardware architecture
@scaling01: cites DeepSeek saying prices could drop sharply once Ascend 950 supernodes scale in H2
@jukan05: interprets V4 as validating NVIDIA’s Blackwell/Rubin/HBM/interconnect strategy
@NVIDIAAI, @NVIDIAAI: unsurprisingly highlight Blackwell day-0 performance, but this is vendor framing rather than independent proof of strategic superiority

There is also a more ideological thread:

@teortaxesTex, @teortaxesTex, @teortaxesTex argues that Western discourse often misreads Chinese labs as purely state proxies or distillation shops, and instead sees them as serious mission-driven actors. This is interpretive, but it helps explain why the release drew such emotionally charged geopolitical reactions.

Distillation, training data, and data quality

A recurring undercurrent: does V4 mainly reflect architectural innovation, or can critics dismiss it as “distillation”?

@yacineMTB speculates that some complaints about Chinese distillation may partly come from people discovering they’re outperformed
@cloneofsimo: “Very interesting... given they distilled claude 🤔🤔”
@kalomaze: jokes about DeepSeek training on DeepSeek reasoning traces
On the more substantive side, @teortaxesTex says DeepSeek’s writing quality, especially Chinese, reflects long-standing obsession with data cleanliness and cites job listings @teortaxesTex, @teortaxesTex
@nrehiew_ notes the report still lacks much detail on pretraining data beyond standard categories
Overall, factual public evidence in this tweet set supports “DeepSeek trains at large scale with strong data work,” but not any strong claim about the degree of external distillation beyond speculation

Architecture lineage and prior art

Several researchers pointed out that V4 did not emerge from nowhere.

@jaseweston: says DeepSeek uses hash routing from a 2021 ParlAI approach
@suchenzang: criticizes routing-induced outliers, with a jab at hashing
@teortaxesTex: notes Mixtral-style MoE was a reasonable earlier hack, but claims DSMoE changed things
@art_zucker broadly attacks MoEs as a dead end
@gabriberton counters that MoEs are provably effective despite inelegance
@stochasticchasm is even more positive: “MoEs are amazing”

This matters because V4 was read not just as a stronger checkpoint, but as a possible new design point for open long-context MoEs.

Why the technical report itself mattered

A striking amount of praise was directed not just at the model but at the paper/report quality.

@scaling01: “the technical paper is a big deal”
@Dorialexander: “most significant AI paper of the year”
@morqon: “one of the best I’ve ever read”
@scaling01: “this is what research should look like”
@TheZachMueller, @iamgrigorev, @nrehiew_: all signal unusually high effort to digest and test the report

For expert readers, this is important because many frontier releases now arrive with sparse technical disclosure. V4’s report appears to have reset expectations for what a serious open release can look like.

Practical limitations and caveats

Despite the enthusiasm, several caveats recur:

Still behind closed frontier in aggregate capability
- especially sciences/law/medicine and broad “general domains” per @scaling01
Reasoning RL may be undercooked
- @scaling01: reasoning efficiency not much changed vs V3.2 Speciale
Serving remains hard
- @scaling01: many labs serve at only 20–30 tok/s and limited concurrency; running evals can take a day
- @ClementDelangue: acknowledges concurrency bottlenecks on HF
High token usage
- major practical caveat from @ArtificialAnlys
API controls
- @stochasticchasm: notes DeepSeek API appears not to allow sampler control
Adoptability
- @teortaxesTex: too complex for many labs to copy cleanly

Broader implications

Three implications stand out.

Open-weight long-context is no longer just marketing.
V4’s strongest contribution may be proving that 1M context can be made operationally credible in an open-weight model, with concrete KV-cache engineering and open inference support. This is why multiple posters focused less on benchmark deltas and more on systems design: @ben_burtenshaw, @ZhihuFrontier, @scaling01.
China’s top labs remain competitive in open models, even if not fully closing the closed-model gap.
The benchmark picture across @ArtificialAnlys, @arena, and @scaling01 suggests Chinese labs now dominate much of the open-weight top tier: Kimi, GLM, DeepSeek, and soon MiMo.
The bar for “open” is rising from checkpoint release to full-stack co-design.
V4 was instantly discussed alongside vLLM, Blackwell, MLX quants, Mac viability, Ascend clusters, and cache/memory architectures. In other words, “the model” is increasingly inseparable from the inference substrate.

Infrastructure, inference, and local/open ecosystem

Hugging Face launched ML Intern, an open-source CLI “AI intern” for ML work that can research papers, write code, run experiments, use HF datasets/jobs, search GitHub, and iterate up to 300 steps, per @MillieMarconnni. Related sentiment: HF’s $9 Pro tier is unusually strong value per @getpy.
Meta said it will add tens of millions of AWS Graviton cores to its compute portfolio to scale Meta AI and agentic systems for billions of users, per @AIatMeta.
Local/open coding stack momentum stayed strong:
- @julien_c: Qwen3.6-27B via llama.cpp on a MacBook Pro feels close to latest Opus for many coding tasks
- @p0: free CLI agent built with Pi + Ollama + Gemma 4 + Parallel web search MCP
- @Prince_Canuma: DeepSeek V4 quants incoming
- @QuixiAI: reminder that llama.cpp / Ollama / LM Studio do not support tensor parallel, pushing serious multi-GPU serving users toward vLLM
Nous/Hermes shipped heavily:
- Hermes Agent v0.11.0 introduced a rewritten React TUI, dashboard plugin, theming, more inference providers, image backends, and QQBot support, per @WesRoth
- Hermes got broad praise and rapid support for both DeepSeek V4 and GPT-5.5, via @mr_r0b0t, @Teknium
- @JulianGoldieSEO and @LoicBerthelot compared Hermes favorably to OpenClaw on learning loops, memory, model support, deployment flexibility, and security
- A native Linux sandbox backend for Deep Agents using bubblewrap + cgroups v2 was released by @nu_b_kh

Research papers and benchmarks

On-policy distillation token selection:
- @TheTuringPost highlights a paper showing only some tokens carry most learning signal; using ~50% of tokens can match or beat full training and cut memory by ~47%, while even <10% focused on confident-wrong tokens nearly matches full training.
Google Research pushed several ICLR demos:
- MesaNet, a transformer alternative / linear sequence layer optimized for in-context learning under fixed memory, via @GoogleResearch
- robotics/3D reasoning and efficient transformer work via @GoogleResearch
- “reasoning can lead to honesty” demo via @GoogleResearch
MIT Hyperloop Transformers mix looped and normal transformer blocks, using ~50% fewer parameters while beating regular transformers at 240M / 1B / 2B, per @TheTuringPost.
“Learning mechanics” tries to synthesize a theory of deep learning dynamics, via @learning_mech.
Tool/agent systems papers:
- Tool Attention Is All You Need claims 95% tool-token reduction (47.3k → 2.4k/turn) with dynamic gating and lazy schema loading, per @omarsar0
- StructMem for long-horizon structured memory highlighted by @dair_ai
- HorizonBench targets long-horizon personalization with shifting user preferences, via @StellaLisy
Clarifying questions for software engineering:
- @gneubig shared work on a model trained specifically to ask clarifying questions, improving results with fewer questions.

GPT-5.5 rollout and coding agents

OpenAI rolled GPT-5.5 and GPT-5.5 Pro into API and ecosystem products with a 1M context window, per @OpenAI, @OpenAIDevs.
Distribution was immediate across Cursor, GitHub Copilot, Codex/OpenAI API, OpenRouter, Perplexity, Devin, Droid, Fleet, Deep Agents:
- @cursor_ai: GPT-5.5 is top on CursorBench at 72.8%
- @cline: #1 on Terminal-Bench at 82.7
- @OpenAIDevs: Perplexity Computer saw 56% fewer tokens on complex tasks
- @scaling01: GPT-5.5 medium became strongest non-thinking model on LisanBench with 45.6% fewer tokens than GPT-5.4 medium and higher scores
User feedback clustered around better coding quality and token efficiency, despite mixed feelings about some evals:
- @almmaasoglu: best code they’ve read from an LLM; less verbose, less defensive
- @KentonVarda: caught a deep Cap’n Proto RPC corner case from a 6-year-old comment
- @willdepue: underwhelmed by evals, impressed in Codex on complex technical projects
- @omarsar0: smooth switch from Claude Code to Codex/GPT-5.5 thanks to better “effort calibration”
Cursor also shipped /multitask async subagents and multi-root workspaces, via @cursor_ai.
There is growing market emphasis on limits and economics rather than tiny quality gaps:
- @nrehiew_ argues usage caps now matter more than small frontier deltas
- @HamelHusain says Codex’s subscription structure makes it hard not to use

Industry moves, funding, and policy

Google reportedly plans to invest up to $40B in Anthropic, reported by @FT and echoed by @zerohedge. Reactions centered on how large Anthropic’s compute commitment may now be.
Cohere and Aleph Alpha announced a Canada/Germany sovereign AI partnership, framed as enterprise-grade and privacy/security focused by @cohere, @aidangomez, @nickfrosst.
ComfyUI raised $30M at a $500M valuation, while keeping core/open-local positioning, via @yoland_yan.
Mechanize announced $9.1M raised at a $500M post-money valuation, via @MechanizeWork.
Arcee AI hired Cody Blakeney as Head of Research, emphasizing open-weight American frontier models, via @code_star.
Safety / governance:
- OpenAI announced a Bio Bug Bounty for GPT-5.5, per @OpenAINewsroom
- Anthropic launched Project Deal, a marketplace where Claude negotiated on behalf of employees, and highlighted model-quality asymmetry and policy challenges, via @AnthropicAI

Creative AI and multimodal

GPT Image 2 + Seedance 2 workflows kept drawing attention:
- @_OAK200 and @awesome_visuals showed high-fidelity image→video pipelines
- @BoyuanChen0 said 2K/4K images are already available via experimental API and active fixes are underway
Kling announced native 4K output and a $25k short film contest, via @Kling_ai.
Some evaluative nuance:
- @goodside noted GPT Images 2.0 could render a valid-looking Rubik’s Cube state, which is surprisingly hard
- @venturetwins framed recent image/video gains as a major step toward personalized game-like content generation

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Deepseek V4 and Related Releases

Deepseek V4 AGI comfirmed (Activity: 1138): The image is a meme and does not contain any technical content. The title “Deepseek V4 AGI confirmed” suggests a humorous or exaggerated claim about an AI model, possibly referencing advancements in artificial general intelligence (AGI). The comments further imply a satirical tone, mentioning uncensored datasets and military applications, which are likely not serious claims. The comments reflect a satirical take on AI capabilities, with mentions of uncensored datasets and military applications, indicating skepticism or humor rather than a serious technical discussion.
- UserXtheUnknown discusses a test scenario with Deepseek V4, highlighting its tendency to overthink problems. The model interprets constraints like ‘using only one knife’ as mandatory rather than optional, which affects its problem-solving approach. This reflects a nuanced understanding of task constraints, but also indicates potential areas for improvement in handling implicit instructions.
Deepseek V4 Flash and Non-Flash Out on HuggingFace (Activity: 1393): DeepSeek V4 has been released on HuggingFace, featuring two models: DeepSeek-V4-Pro with 1.6T parameters (of which 49B are activated) and DeepSeek-V4-Flash with 284B parameters (with 13B activated). Both models support a context length of one million tokens, which is significant for handling extensive sequences. The models are released under the MIT license, allowing for broad use and modification. A notable comment highlights the challenge of hardware limitations, particularly RAM, when working with such large models. Another comment suggests the potential benefit of a 0.01bit quantization to manage the model size more effectively.
- The DeepSeek-V4 models are notable for their massive parameter sizes, with the Pro version having 1.6 trillion parameters (49 billion activated) and the Flash version having 284 billion parameters (13 billion activated). Both models support an extensive context length of one million tokens, which is significant for handling large-scale data inputs and complex tasks.
- A user expressed interest in a 0.01-bit quantization of the DeepSeek-V4 models, which suggests a focus on reducing the model size and computational requirements while maintaining performance. Quantization is a common technique to optimize models for deployment on hardware with limited resources.
- The mention of the MIT license indicates that DeepSeek-V4 is open-source, allowing for broad use and modification by the community. This licensing choice can facilitate collaboration and innovation, as developers can freely integrate and adapt the models into their own projects.
Buried lede: Deepseek v4 Flash is incredibly inexpensive from the official API for its weight category (Activity: 404): The image provides a comparison between two models, “deepseek-v4-flash” and “deepseek-v4-pro,” highlighting that the “deepseek-v4-flash” model is significantly more affordable in terms of input and output token costs. Despite its affordability, the model supports advanced features like JSON output, tool calls, and chat prefix completion in both non-thinking and thinking modes. The discussion around the image suggests that while the “deepseek-v4-flash” is marketed as inexpensive, some users argue that it is actually overpriced compared to previous versions when considering parameter scaling, with the “V3.2” model being cheaper per parameter. Commenters discuss the impact of GPU shortages on current pricing, suggesting that prices may decrease as GPU production increases. There is also debate about the pricing strategy, with some users noting that the new model is more expensive per parameter compared to older versions.
- DistanceSolar1449 highlights a pricing comparison between DeepSeek V3.2 and V4 Flash, noting that V3.2 was priced at $0.26/0.38 for input/output at 671b, whereas V4 Flash is $0.14/$0.28 at 284b. This suggests that V4 Flash is actually more expensive if pricing were to scale linearly with parameters, challenging the notion of its cost-effectiveness.
- jwpbe provides a comparative analysis of DeepSeek V4 Flash’s API cost, stating that at 14 cents in / 28 cents out, it is significantly cheaper than competitors like Minimax 2.7, which is 3x the cost, and Qwen’s equivalent, which is even higher. They also mention that Trinity Thinking Large is twice as expensive, indicating that V4 Flash offers a competitive pricing advantage in the market.
- Worried-Squirrel2023 discusses the strategic implications of Huawei’s silicon developments, suggesting that DeepSeek’s pricing strategy involves trading NVIDIA margins for Ascend supply. They predict that once the 950 supernodes scale, DeepSeek could potentially undercut competitors in the open weights tier, leveraging Huawei’s advancements to optimize costs.
Deepseek has released DeepEP V2 and TileKernels. (Activity: 396): Deepseek has released DeepEP V2 and TileKernels, which are significant advancements in AI model optimization and parallelization. DeepEP V2 focuses on enhancing model efficiency and accuracy, while TileKernels introduces a novel parallelization technique that reportedly scales linearly, meaning that doubling computational capacity results in a doubling of processing speed. This release is open-sourced, fostering transparency and collaboration in AI research. For more details, see the DeepEP V2 pull request and the TileKernels repository. One commenter highlights that Deepseek is fulfilling a role that OpenAI was expected to play by advancing research and sharing findings openly, which builds goodwill despite proprietary technologies. Another commenter questions if the parallelization technique indeed scales linearly, suggesting a significant technical breakthrough if true.
- DeepEP V2 and TileKernels by DeepSeek are noted for their potential advancements in parallelization techniques. A user speculates that these techniques might achieve linear scaling, meaning that doubling computational capacity could directly double processing speed. This could represent a significant efficiency improvement in model training and inference.
- There is speculation about DeepSeek’s hardware usage, particularly regarding the SM100 and Blackwell GPUs. One commenter suggests that DeepSeek might be using Blackwell GPUs for training, possibly through rented B200 units on Vast.ai. This hardware choice could influence the performance and capabilities of their models.
- The potential innovations in DeepSeek’s next model, possibly named v4, are highlighted. The focus is on the integration of Engram and mHC technologies, which are expected to play a crucial role in the model’s performance. The success of these innovations will likely depend on the new dataset DeepSeek has developed.

2. Qwen 3.6 Model Performance and Benchmarks

This is where we are right now, LocalLLaMA (Activity: 1755): The image depicts a MacBook Pro running a Qwen3.6 27B model via Llama.cpp, showcasing the capability of executing complex AI models locally, even in airplane mode. This highlights the potential for local AI models to enhance efficiency, security, privacy, and sovereignty by operating independently of cloud services. The post underscores the technological advancement in making powerful AI models accessible on personal devices, emphasizing the importance of local execution for privacy and control. Commenters express skepticism about the overstatement of the Qwen3.6-27B model’s capabilities, suggesting that while it is impressive for its size, it does not match the performance of more advanced models like Sonnet or Opus. There is concern that exaggerated claims could lead to user disappointment and backlash against the broader LLM community.
- ttkciar highlights the potential for user disappointment with the Qwen3.6-27B model, noting that while it’s impressive for its size and suitable for agentic code generation, it doesn’t match the capabilities of more advanced models like Sonnet or Opus. The concern is that overhyping its abilities could lead to backlash against the broader LLM community, not just the individual making the claims.
- sooki10 agrees that while the model is impressive for local coding tasks, comparing it to more advanced models like Opus is misleading and could undermine the credibility of the claims being made. This suggests a need for more accurate benchmarking and communication about the model’s capabilities to manage user expectations effectively.
- Melodic_Reality_646 points out the disparity in resources, comparing the use of a high-end 128GB RAM m5max system to a more accessible setup. This highlights the importance of considering hardware limitations when evaluating model performance, as not all users have access to such powerful systems, which can skew perceptions of a model’s capabilities.
DS4-Flash vs Qwen3.6 (Activity: 470): The image presents a benchmark comparison between DS4-Flash Max and Qwen3.6 models, specifically the 35B-A3B and 27B versions. The chart highlights that DS4-Flash Max generally outperforms the Qwen models across various categories, particularly excelling in ‘LiveCodeBench’ and ‘HLE’ benchmarks. This suggests that DS4-Flash Max may have superior capabilities in coding and reasoning tasks. The discussion in the comments hints at the potential for larger models like a 122B version of Qwen3.6, and emphasizes the significance of the 1M token context feature, which could impact performance in other benchmarks like ‘omniscense’. Commenters note that despite DS4-Flash Max’s larger size, its performance is only slightly better than Qwen3.6, raising questions about efficiency versus scale. The 1M token context is highlighted as a significant feature that could influence future benchmark results.
- Rascazzione highlights the significant increase in context length with Qwen 3.6, noting its ability to handle a 1 million token context. This is a substantial improvement over previous models and could have significant implications for tasks requiring extensive context handling, such as document summarization or complex dialogue systems.
- LinkSea8324 points out the size difference between the models, with DS4-Flash at 284 billion parameters compared to Qwen 3.6’s 27 billion. This raises questions about the efficiency and performance trade-offs between model size and capability, especially in terms of computational resources and inference speed.
- madsheepPL discusses the non-linear nature of benchmark improvements, suggesting that even if a model appears only slightly better in benchmarks, the practical implications can be more significant. They emphasize that improvements in scores are not directly proportional and can have varying impacts on real-world applications.
Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 (Activity: 964): Qwen 3.6 27B has achieved parity with Sonnet 4.6 on the Agentic Index from Artificial Analysis, surpassing models like Gemini 3.1 Pro Preview, GPT 5.2 and 5.3, and MiniMax 2.7. The model shows improvements across all indices, although the gains in the Coding Index are less pronounced due to its reliance on benchmarks like Terminal Bench Hard and SciCode, which are considered unconventional. The focus of training appears to be on agentic applications for OpenClaw/Hermes, highlighting the potential of smaller models to approach frontier capabilities. Anticipation is building for the upcoming Qwen 3.6 122B model. Commenters express excitement about the potential of smaller models like Qwen 3.6 27B, noting the significant improvements and potential for future versions. However, there is skepticism about the extent of these gains, suggesting that some improvements might be due to ‘benchmaxxing’ rather than inherent model capabilities.
- Iory1998 highlights the impressive performance of the Qwen 3.6 27B model, noting that it surpasses a 670B model from the previous year. They mention running the Q8 version at 170K with KV cache at FP16 on an RTX 3090 and RTX 5070ti, utilizing 40GB of VRAM, which underscores the model’s efficiency and power.
- AngeloKappos discusses the narrowing benchmark gap, sharing their experience running the Qwen3-30b-a3b model on an M2 chip. They note its capability to handle multi-step tool calls effectively, suggesting that if the 27B dense model performs this well, the upcoming 122B model could pose challenges for API providers due to its potential performance.
- Velocita84 raises a point about potential “benchmaxxing” in the reported performance gains of the Qwen 3.6 27B model, implying that some of the improvements might be attributed to optimized benchmarking rather than inherent model capabilities. This suggests a need for scrutiny in evaluating model performance claims.
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (Activity: 491): The post compares two versions of the QWEN 3.6 model, specifically the 35B and 27B parameter versions, on a MacBook Pro M5 MAX with 64GB RAM. The 35B model achieves 72 TPS (tokens per second), while the 27B model achieves 18 TPS. Despite the slower speed, the 27B model produces more precise and correct results for coding tasks, whereas the 35B model is faster but less accurate. The test involved generating a single HTML file to simulate a moving car with a parallax effect, using no external libraries. The models were hosted using Atomic.Chat, with source code available on GitHub. One comment highlights the output of the Qwen 3.6 27B FP8 model using opencode, taking approximately 52 seconds. Another comment provides a visual comparison with the Qwen 3.5 27B Q3 model, suggesting differences in output quality.
- The user ‘sacrelege’ shared a performance result for the Qwen 3.6 27B model using FP8 precision, noting that it took approximately 52 seconds to complete a task with ‘opencode’. This suggests a focus on optimizing model performance through precision adjustments, which can significantly impact computational efficiency and speed.
- User ‘nikhilprasanth’ provided a visual comparison for the Qwen 3.5 27B Q3 model, indicating a potential interest in comparing different versions and quantization levels of the Qwen models. This highlights the importance of understanding how different model configurations can affect performance and output quality.
- ‘Technical-Earth-3254’ inquired about the quantization methods used in the tests, which is crucial for understanding the trade-offs between model size, speed, and accuracy. Quantization can greatly influence the efficiency of large models like Qwen, especially in resource-constrained environments.
Qwen 3.6 27B is a BEAST (Activity: 1239): The post discusses the performance of the Qwen 3.6 27B model on a high-end laptop with an RTX 5090 GPU and 24GB VRAM, highlighting its effectiveness for pyspark/python and data transformation debugging tasks. The user employs llama.cpp with q4_k_m at q4_0 and is exploring further optimizations with IQ4_XS at 200k q8_0. The user has not yet implemented speculative decoding. The setup includes an ASUS ROG Strix SCAR 18 with 64GB DDR5 RAM. Comments suggest avoiding kv cache as q4 for coding, recommending q8 for 130k context. Another comment anticipates performance improvements with upcoming releases from z-lab and a specific GitHub pull request that promises a 2x decode speed increase. There is also curiosity about the model’s performance on systems with 16GB VRAM and 32GB DDR5 RAM with offloading.
- sagiroth highlights a technical consideration when using Qwen 3.6 27B for coding tasks, advising against using the KV cache as q4 due to limitations, and instead suggests using q8 to achieve a 130k context window, which can significantly enhance performance for large context tasks.
- inkberk points out an upcoming improvement in decoding speed, referencing a pull request #22105 on the llama.cpp repository. This update, along with the anticipated release of the ‘dflash drafter’ by z-lab, promises a potential 2x increase in decode speed, which could greatly benefit users in terms of efficiency.
- Johnny_Rell inquires about the performance of Qwen 3.6 27B on a system with 16 GB VRAM and 32 GB DDR5, specifically regarding the effectiveness of offloading. This suggests a focus on optimizing resource allocation to handle the model’s demands, which is crucial for running large models efficiently on consumer-grade hardware.

[AINews] GPT 5.5 and OpenAI Codex Superapp

Fri, 24 Apr 2026 04:40:43 GMT

A week after Opus 4.7, it was OpenAI’s turn to fire back with very similar Pareto frontier improvement charts for GPT 5.5 (as Noam Brown prefers — raw 1 dimensional intelligence measures are giving way to 2D intelligence per dollar charts). In the 4.7 vs 5.5 bakeoff, you have to read between the lines to see what was NOT mentioned (coding), but in terms of overall intelligence, AA crowns this the top independently validated model in the world, AND…

AA chart

… intelligence per dollar (“GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800) - although Gemini 3.1 Pro Preview scores the same at a cost of ~$900.”

aa 2D

There are some training hardware tidbits and positive RSI vibes and cool alternative benchmarks.

But if you just treated today as a mere point update model launch (some would prefer to call it 5.9), you’d be mistaken - it’s also bundling a big Codex launch day:

twitter

With built in browser control and the other features in this mega-update, as well as folding in the now defunct Prism (RIP), OpenAI seems to have made the critical and retoractively obvious choice to turn Codex into the base of its superapp strategy.

AI News for 4/22/2026-4/23/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 launch: stronger agentic coding, broader computer use, and a push on token-efficiency

GPT-5.5 is the day’s dominant release: OpenAI launched GPT-5.5, positioned as “a new class of intelligence for real work,” with rollout across ChatGPT and Codex and API access delayed pending additional safeguards. OpenAI and community benchmark posts converged on a profile of better long-horizon execution, stronger computer-use behavior, and materially improved token efficiency rather than a pure across-the-board benchmark blowout. Reported numbers include 82.7% Terminal-Bench 2.0, 58.6% SWE-Bench Pro, 84.9% GDPval, 78.7% OSWorld-Verified, 81.8% CyberGym, 84.4% BrowseComp, and 51.7% FrontierMath Tier 1–3 via @reach_vb, with Artificial Analysis saying GPT-5.5 now leads or ties several headline evals and sits on a new cost/performance frontier despite higher per-token pricing @ArtificialAnlys, @scaling01. OpenAI also emphasized that in ChatGPT, stack-level inference gains made GPT-5.5 Pro more practical for demanding tasks @OpenAI.
Pricing, context, infra, and practical behavior: API pricing was reported at $5/$30 per 1M input/output tokens for GPT-5.5 and $30/$180 for Pro @scaling01, with Sam Altman noting a 1M context window in API and lower token use per task than 5.4. Multiple early users described the model as more “human,” less formal, and better suited to persistent agent workflows than prior GPTs, especially inside Codex @MatthewBerman, @danshipper, @omarsar0. OpenAI claimed the model was co-designed for NVIDIA GB200/300 systems and that the model itself helped improve its own inference stack @scaling01, while @sama framed the company increasingly as an AI inference company. A recurrent theme from users: GPT-5.5 often feels like a step-function upgrade in autonomy, but can also be exploratory and require tighter instruction to stay on track @theo.
Codex becomes a fuller agent workspace: In parallel, OpenAI shipped substantial Codex upgrades: browser control, Sheets/Slides, Docs/PDFs, OS-wide dictation, and auto-review mode @ajambrosino. OpenAI says Codex can now interact with web apps, click through flows, capture screenshots, and iterate until task completion @OpenAIDevs, while Auto-review uses a secondary “guardian” agent to reduce approvals on longer runs @OpenAIDevs, @gdb. User reports suggest this is expanding Codex from a coding tool into a broader computer-work agent, spanning QA, spreadsheets, presentations, app building, research loops, and overnight experimental runs @gdb, @tszzl, @aidan_mclau.

DeepSeek-V4 Preview: 1.6T MIT-licensed open model, 1M context, and aggressive pricing

DeepSeek answered GPT-5.5 within hours: DeepSeek released DeepSeek-V4 Preview, open-sourcing V4-Pro and V4-Flash under an MIT license. The headline specs are unusually aggressive: V4-Pro: 1.6T total params / 49B active, V4-Flash: 284B / 13B active, both with 1M token context and support for thinking/non-thinking modes @deepseek_ai, @Yuchenj_UW. Community reactions quickly framed it as the new open-model flagship, competitive with top closed models from the prior generation and a major leap over DeepSeek V3.x @arena, @scaling01, @kimmonismus.
Technical report highlights: long-context efficiency, hybrid attention, and Muon: The launch was notable not just for weights but for a same-day tech report @scaling01. Community summaries point to two new compressed/hybrid attention mechanisms, mHC, Muon-based training, FP4 quantization-aware training, and pretraining on roughly 32T tokens @scaling01, @iScienceLuvr, @eliebakouch. The strongest technical discussion centered on making 1M context practical, with reported ~4x compute efficiency improvements and order-of-magnitude KV-cache reductions relative to earlier DeepSeek-style stacks @Hangsiin. The rapid infra response was also notable: vLLM announced day-0 support and detailed how it implemented the new attention stack; SGLang shipped day-0 optimizations and RL pipeline support.
Pricing may be as important as the model: DeepSeek’s posted pricing is exceptionally aggressive: V4-Flash at $0.14/$0.28 and V4-Pro at $1.74/$3.48 per 1M input/output tokens @scaling01, @teortaxesTex. Several commenters highlighted Flash as potentially the more disruptive SKU if serving quality holds, given the combination of very low cost, 1M context, and open weights @Hangsiin, @arena. The main caveat from DeepSeek: V4-Pro throughput is currently limited by high-end compute constraints, with the company explicitly pointing to future Ascend 950 availability for price drops @teortaxesTex.

Agent infrastructure and tooling: memory, orchestration, browsers, and enterprise plumbing

Agents are becoming systems problems, not just model problems: Several posts emphasized that production agent work is increasingly about harnesses, evals, memory, and orchestration. A useful example was the writeup on stateless decision memory for enterprise agents, which replaces mutable per-agent state with immutable decision logs/event sourcing to improve horizontal scalability, auditability, and fault tolerance @omarsar0. In a similar vein, @Vtrivedy10 argued that trace data → evals/environments → harness engineering/SFT-RL is the core flywheel for improving production agents, and later used Anthropic’s Claude Code regression as a case study for why open harnesses and open evals matter @Vtrivedy10.
New tooling around control surfaces: Cua open-sourced Cua Driver, a macOS driver for letting agents control arbitrary apps in the background with multi-player/multi-cursor support. Cognition published a post on what it takes to build cloud agent infrastructure, naming the practical stack: VM isolation, session persistence, environment provisioning, orchestration, and integrations. LangChain continued expanding LangSmith Fleet with file editing, webpage/presentation generation, and slash-command skills @LangChain, while multiple users highlighted Fleet’s presentation renderer/viewer as a surprisingly useful agent-native artifact format @BraceSproul.
Multi-agent orchestration is moving into products: Sakana AI launched the beta of Fugu, a multi-agent orchestration API that dynamically selects and coordinates frontier models, with claims of SOTA on SWE-Pro, GPQA-D, and ALE-Bench and even recursive test-time scaling via self-invocation @SakanaAILabs, @hardmaru. Hermes Agent shipped v0.11.0 with a large contributor release, expanded providers, image generation support, and effectively immediate GPT-5.5 support @Teknium. The direction is consistent: agents are becoming orchestration layers over heterogeneous tools and models, not single-model loops.

Vision, video, and multimodal systems: Vision Banana, Sapiens2, HDR video, and omni models

Google DeepMind’s Vision Banana reframes CV as generation: One of the more technically interesting research launches was Vision Banana, a unified vision model that treats 2D/3D vision tasks as image generation, reportedly outperforming specialist SOTA systems across multiple vision tasks. The reaction from computer-vision researchers was that it signals a broader shift in how segmentation, depth, normals, and related tasks may be approached going forward @sainingxie. On the open side, Meta also released Sapiens2, a set of high-resolution vision transformers trained on 1B human images for human-centric perception tasks @HuggingPapers.
Video stack updates are moving past raw resolution into production formats: Kling’s “native 4K” rollout spread across multiple platforms, but the technically more novel launch may be LTX HDR beta, which argues the real bottleneck for AI video in production has been dynamic range, not just resolution, by moving beyond 8-bit SDR toward footage that can survive grading and compositing @ltx_model. That’s a more substantive improvement than the usual “4K” marketing alone. Separately, World Labs launched World Jam around Marble 1.1 + Spark LoD for interactive 3D creation @theworldlabs.
Broader multimodal trend: unified models with explicit cross-modal reasoning: The newly shared Context Unrolling in Omni Models proposes a unified model trained across text, images, video, 3D geometry, and hidden representations, explicitly unrolling reasoning across modalities before producing outputs @arankomatsuzaki. Together with Vision Banana, this points to a recurring motif: fold disparate perception/generation tasks into fewer general multimodal backbones, then let inference-time reasoning bridge modalities.

Training, scaling, and research methods: globally distributed pretraining, self-play, and long-context internals

Google’s Decoupled DiLoCo tackles resilient global pretraining: Google DeepMind and Google Research introduced Decoupled DiLoCo, which decouples distributed low-communication training to enable worldwide datacenter training, heterogeneous hardware, and tolerance to hardware failures without halting the job. This is a meaningful systems result because it targets a real frontier training bottleneck: keeping giant training runs alive and efficient across faulty, geographically distributed infrastructure, rather than assuming clean homogeneous clusters.
Algorithmic scaling beyond brute-force sampling: A self-play paper highlighted by @LukeBailey181 studies why long-run self-play plateaus for LLMs and proposes an algorithm that lets a 7B model solve as many problems as pass@4 of a model 100x larger. Another recurring theme was token/computation efficiency as the real frontier metric; several posts argued that single-number intelligence comparisons are increasingly obsolete in a world where effort level and inference budget materially reshape capability @polynoamial. Relatedly, a thread on Neural Garbage Collection described training models to manage their own KV cache via RL rather than fixed heuristics, a potentially important direction for long-horizon agents @cwolferesearch.
Infra adoption signals: Together AI reported growth from 30B to 300T tokens/month YoY @vipulved, a large-scale indicator of inference demand expansion. Epoch AI, meanwhile, revised down estimates for operational power at Stargate Abilene to ~0.3 GW currently and pushed the full 1.2 GW milestone to Q4 2026, underscoring continued uncertainty in tracking frontier compute deployment @EpochAIResearch.

Top tweets (by engagement)

OpenAI GPT-5.5 launch: The highest-engagement technical post was OpenAI’s GPT-5.5 announcement, followed by @sama’s launch post and OpenAI DevRel’s framing of GPT-5.5 as its smartest frontier model yet @OpenAIDevs.
Claude Code regression post-mortem: Anthropic’s acknowledgment that Claude Code quality had slipped due to three issues and was fixed in v2.1.116+ was one of the most engaged engineering-product posts of the day, and sparked substantial discussion about harness sensitivity and regression testing.
DeepSeek-V4 Preview release: DeepSeek’s official V4 Preview launch quickly became the other major high-engagement technical event, especially given the combination of MIT license, 1M context, and aggressive pricing.
Vision Banana: Google DeepMind’s Vision Banana announcement was the standout pure-research vision post.
ML-Intern and autonomous research workflows: The Hugging Face-adjacent ml-intern passing an internship-style test in 15 minutes and subsequent reports of very high token consumption suggest strong interest in autonomous coding/research harnesses as distinct products, not just demos.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] Tasteful Tokenmaxxing

Thu, 23 Apr 2026 02:45:37 GMT

It is Cloud Next today and Google TPUv8’s (training and inference iterations) were announced as expected, though the numbers are mindboggling, they mostly serve to reinforce the sheer hardware advantage that a decade of investment has given to GDM and any models they train and serve.

Over the last 2 days with AIE Miami concluding (Singapore is next!) the top conversations we have been hearing from AI leadership (CTOs, VPs, Founders) have all centered around the concept of “Tokenmaxxing” and how leaders want to get their teams using more AI, WITHOUT the downside of incentivizing the kinds of horrendous waste our friend described at his AIE keynote.

Dex Horthy, coiner of Context Engineering and “the Dumb Zone”, publicly retracted his extremely vibe-coding-pilled call 6 months ago and encouraged people to please read the code, citing ’s Z/L continuum from AIE Europe:

timestamp

Off the record, many senior leaders I talk to are more on the Zechner side than the Lopopolo side of the Z/L spectrum — this does not mean that one side is true for every one in every situation, nor does it mean it will continue to be true with advancing model progress! To point out the most obvious, engineers and engineering leaders are the ones most setup to make a big deal out of minor architectural quality issues that sheer quantity of cheap code generation and code review might overcome.

Today’s LS guest, Mikhail Parakhin, CTO of Shopify, had another take on the “tasteful tokenmaxxing” - you want to go for depth (e.g. do more serial autoresearch loops) than go for breadth (e.g. solve a problem by kicking off 5, 10, 50, 500 parallel runs of the LLM slot machine). Worth thinking through.

AI News for 4/21/2026-4/22/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models: Qwen3.6-27B, OpenAI Privacy Filter, and Xiaomi MiMo-V2.5

Qwen3.6-27B lands as a serious local/open coding model: @Alibaba_Qwen released Qwen3.6-27B, a dense, Apache 2.0 model with thinking + non-thinking modes and a unified multimodal checkpoint. Alibaba claims it beats the much larger Qwen3.5-397B-A17B on major coding evals, including SWE-bench Verified 77.2 vs 76.2, SWE-bench Pro 53.5 vs 50.9, Terminal-Bench 2.0 59.3 vs 52.5, and SkillsBench 48.2 vs 30.0. It also supports native vision-language reasoning over images and video. The ecosystem moved immediately: vLLM shipped day-0 support, Unsloth published 18GB-RAM local GGUFs, ggml added llama.cpp usage, and Ollama added a packaged release. Early user reports from @KyleHessling1 and @simonw were notably strong for local frontend/design and image tasks.
OpenAI quietly open-sources a practical privacy model: Multiple observers flagged OpenAI’s new Privacy Filter, a lightweight Apache 2.0 open model for PII detection and masking. According to @altryne, @eliebakouch, and @mervenoyann, it is a 1.5B total / 50M active MoE token-classification model with a 128k context window, intended for cheap redaction over very large corpora and logs. This is a more operationally interesting release than a generic “small open model”: it targets a concrete infra problem in enterprise/agent pipelines where on-device or low-cost preprocessing matters.
Xiaomi pushes agentic open models upward: @XiaomiMiMo announced MiMo-V2.5-Pro and MiMo-V2.5. Xiaomi positions V2.5-Pro as a major jump in software engineering and long-horizon agents, citing SWE-bench Pro 57.2, Claw-Eval 63.8, and τ3-Bench 72.9, with claims of 1,000+ autonomous tool calls. The non-Pro model adds native omnimodality and a 1M-token context window. Arena quickly listed MiMo-V2.5 in Text/Vision/Code evaluation, and Hermes/Nous integration followed via @Teknium.

Google Cloud Next: TPU v8, Gemini Enterprise Agent Platform, and Workspace Intelligence

Google’s infra announcements were substantial, not cosmetic: @Google and @sundarpichai introduced 8th-gen TPUs with a split design: TPU 8t for training and TPU 8i for inference. Google says 8t delivers nearly 3x compute per pod vs Ironwood, while 8i connects 1,152 TPUs per pod for low-latency inference and high-throughput multi-agent workloads. Commentary from @scaling01 highlighted an additional claim: Google can now scale to a million TPUs in a single cluster with TPU8t. The productization signal matters as much as the raw hardware: Google is clearly aligning chips, models, agent tooling, and enterprise control planes into one vertically integrated offering.
Enterprise agents became a first-class Google product surface: @GoogleDeepMind and @Google launched Gemini Enterprise Agent Platform, framed as the evolution of Vertex AI into a platform for building, governing, and optimizing agents at scale. It includes Agent Studio, access to 200+ models via Model Garden, and support for Google’s current stack including Gemini 3.1 Pro, Gemini 3.1 Flash Image, Lyria 3, and Gemma 4. Related launches included Workspace Intelligence GA as a semantic layer over docs/sheets/meetings/mail, Gemini Enterprise inbox/canvas/reusable skills, Agentic Data Cloud, security agents with Wiz integration, and Gemini Embedding 2 GA, a unified embedding model across text, image, video, audio, and documents.

Agents, Harnesses, Traces, and Team Workflows

The “agent harness” abstraction is hardening across vendors: OpenAI introduced workspace agents in ChatGPT, shared Codex-powered agents for teams that can operate across docs, email, chat, code, and external systems, including Slack-based workflows and scheduled/background tasks. Google made a parallel enterprise move with Gemini Enterprise Agent Platform, while Cursor added Slack invocation for task kick-off and streaming updates. The pattern is converging: cloud-hosted agents, shared team context, approvals, and long-running execution rather than single-user chat.
Developer ergonomics around harness/model independence improved: VS Code/Copilot rolled out bring-your-own-key/model support across plans and business/enterprise, enabling providers like Anthropic, Gemini, OpenAI, OpenRouter, Azure, Ollama, and local backends. This is strategically important because, as @omarsar0 noted, most models still seem overfit to their own agent harnesses. Cognition’s Russell Kaplan made the complementary business case: enterprise buyers want model flexibility and infrastructure that spans the full SDLC, not attachment to one lab.
Traces/evals/self-improvement are becoming the core agent data primitive: The strongest thread here came from LangChain-adjacent discussion. @Vtrivedy10 argued that traces capture agent errors and inefficiencies, and that compute should be pointed at understanding traces to generate better evals, skills, and environments; a longer follow-up expanded this into a concrete loop involving trace mining, skills, context engineering, subagents, and online evals. @ClementDelangue pushed for open traces as the missing data substrate for open agent training, while @gneubig promoted ADP / Agent Data Protocol standardization. LangChain also teased a stronger testing/evaluation product direction via @hwchase17.

Post-Training, RL, and Inference Systems

Perplexity and others shared more of the post-training playbook: @perplexity_ai published details on a search-augmented SFT + RL pipeline that improves factuality, citation quality, instruction following, and efficiency; they say Qwen-based systems can match or beat GPT-family models on factuality at lower cost. @AravSrinivas added that Perplexity now runs a post-trained Qwen-derived model in production that unifies tool routing and summarization and is already serving a significant share of traffic. On the research side, @michaelyli__ introduced Neural Garbage Collection, using RL to jointly learn reasoning and KV-cache retention/eviction without proxy objectives; @sirbayes reported a Bayesian linguistic-belief forecasting agent matching human superforecasters on ForecastBench.
The “minimal editing” problem in coding models got a useful benchmark treatment: @nrehiew_ presented work on Over-Editing, where coding models fix bugs by rewriting too much code. The study constructs minimally corrupted problems and measures excess edits with patch-distance and added Cognitive Complexity; it finds GPT-5.4 over-edits the most while Opus 4.6 over-edits the least, and that RL outperforms SFT, DPO, and rejection sampling for learning a generalizable minimal-editing style without catastrophic forgetting. This is one of the more practical post-training/eval contributions in the set because it targets a failure mode engineers actually complain about in production code review.
Inference efficiency work remained highly active: @cohere integrated production W4A8 inference into vLLM, reporting up to 58% faster TTFT and 45% faster TPOT vs W4A16 on Hopper; the details include per-channel FP8 scale quantization and CUTLASS LUT dequantization. @WentaoGuo7 reported SonicMoE throughput gains on Blackwell—54% / 35% higher fwd/bwd TFLOPS than DeepGEMM baseline—while maintaining dense-equivalent activation memory for equal active params. @baseten introduced RadixMLP for shared-prefix elimination in reranking, with 1.4–1.6x realistic speedups.

Top tweets (by engagement)

OpenAI workspace agents: @OpenAI launched shared, Codex-powered workspace agents for Business/Enterprise/Edu/Teachers.
Qwen3.6-27B release: @Alibaba_Qwen announced the new open 27B dense model with strong coding claims and Apache 2.0 licensing.
Google TPU v8: @sundarpichai previewed TPU 8t / 8i, with training/inference specialization.
Flipbook / model-streamed UI: @zan2434 showed a prototype where the screen is rendered as pixels directly from a model rather than traditional UI stacks.
OpenAI Privacy Filter: @scaling01 and others highlighted OpenAI’s new open-source PII detection/redaction model on Hugging Face.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Model Releases and Benchmarks

Qwen 3.6 27B is out (Activity: 2576): Qwen 3.6 27B, a new language model, has been released on Hugging Face. This model features 27 billion parameters and is designed to improve upon previous iterations with enhanced performance benchmarks. A quantized version is also available, Qwen3.6-27B-FP8, which allows for more efficient deployment in environments with limited computational resources. The release includes detailed benchmark results, showcasing its capabilities across various tasks. The community is expressing excitement about the release, with some users highlighting the significance of the model’s performance improvements and the availability of a quantized version for broader accessibility.
- Namra_7 shared a benchmark image for Qwen 3.6 27B, which likely includes performance metrics such as inference speed, accuracy, or other relevant statistics. However, the specific details of the benchmarks are not described in the comment itself.
- challis88ocarina mentioned a quantized version of Qwen 3.6 27B available on Hugging Face, specifically in FP8 format. Quantization can significantly reduce the model size and improve inference speed, making it more efficient for deployment without a substantial loss in accuracy. The link provided leads to the Hugging Face model repository for further exploration.
- Eyelbee posted another image link, which might contain additional visual data or performance metrics related to Qwen 3.6 27B. However, the comment does not provide specific insights or details about the content of the image.
Qwen3.6-27B released! (Activity: 895): Qwen3.6-27B is a newly released dense, open-source model that excels in coding tasks, outperforming its predecessor, Qwen3.5-397B-A17B, on major coding benchmarks. It features strong reasoning capabilities across both text and multimodal tasks and offers flexibility with ‘thinking’ and ‘non-thinking’ modes. The model is released under the Apache 2.0 license, making it fully open-source and accessible for community use. More details can be found on their blog, GitHub, and Hugging Face. The comments reflect excitement and admiration for the Qwen team, with users expressing eagerness to utilize the model on their hardware and suggesting the team’s contributions are monument-worthy.
- ResearchCrafty1804 highlights the impressive performance of Qwen3.6-27B, noting that despite having only 27 billion parameters, it surpasses the much larger Qwen3.5-397B-A17B model on several coding benchmarks. Specifically, it achieves scores of 77.2 on SWE-bench Verified, 53.5 on SWE-bench Pro, 59.3 on Terminal-Bench 2.0, and 48.2 on SkillsBench, outperforming the larger model by significant margins in each case.
- bwjxjelsbd comments on the competitive landscape, expressing satisfaction that Alibaba is advancing with Qwen models after META’s perceived setbacks. The commenter hopes for continued competition and transparency, suggesting that META should open-source their Muse family models to maintain a healthy competitive environment.
Qwen3.6-35B becomes competitive with cloud models when paired with the right agent (Activity: 848): The post discusses the significant improvement in benchmark performance of the Qwen3.6-35B model when paired with the little-coder agent, achieving a 78.7% success rate on the Polyglot benchmark, placing it in the top 10. This improvement highlights the impact of using appropriate scaffolds, suggesting that local models may underperform due to harness mismatches. The author plans to test further on Terminal Bench and GAIA for research capabilities. Full details and benchmarks are available on GitHub and Substack. Commenters express surprise at the performance gains from scaffold changes, questioning the validity of benchmarks that don’t control for such factors. There’s also interest in using pi.dev for its extensibility in harnessing models.
- DependentBat5432 highlights a significant performance improvement in Qwen3.6-35B when changing the scaffold, noting a jump from 19% to 78%. This raises concerns about the validity of benchmark comparisons that do not control for such variables, suggesting that scaffold choice can dramatically affect model performance.
- Willing-Toe1942 reports that Qwen3.6, when used with pi-coding agents, performs almost twice as well as opencode. This comparison involved tasks like modifying HTML code and searching online resources for documentation, indicating that the choice of agent can significantly enhance the model’s effectiveness in practical coding scenarios.
- kaeptnphlop mentions the strong performance of Qwen-Coder-Next when paired with GitHub Copilot in VS Code, suggesting potential for further exploration with other tools like little-coder. This implies that integrating Qwen models with popular coding environments can leverage their strengths effectively.

[AINews] OpenAI launches GPT-Image-2

Wed, 22 Apr 2026 00:23:52 GMT

Cursor’s $60B deal with Xai today nearly took headline story, but given that it is a purely financial story (some plausible analysis here on motivations), we are giving title story to OpenAI’s big launch today of GPT-Image-2.

After weeks of speculation as a stealth model on Arena (confirmed), GPT-Image-2 is live on API and ChatGPT and looks to leapfrog Nano Banana 2 in the Imagegen space, with both Thinking and nonthinking variants. This comes after a rumored “focus” sprint that involved the shutdown and departure of the Sora team, so it is both heartening and somewhat surprising that Imagegen is still a priority for OpenAI. Thankfully, the model is very, very, very good. By nature, you should check out the 8 videos that the team has prepared, as well as the blogpost and the livestream and the tweet/blogpost.

If we were to pick a single most impressive demonstration, it’d be the level of text detail and consistency in the matrix example.

or custom Where’s Waldo:

AI News for 4/20/2026-4/21/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-Image-2 Launch and the Return of Image Generation as a Serious Product Surface

GPT-Image-2 is the day’s clearest product launch: OpenAI rolled out ChatGPT Images 2.0 and the underlying gpt-image-2 model across ChatGPT, Codex, and API, emphasizing stronger text rendering, layout fidelity, editing, multilingual support, and “thinking” for images. OpenAI says the model can search the web when paired with a thinking model, generate multiple candidates, self-check outputs, and produce artifacts like slides, infographics, diagrams, UI mockups, and QR codes (launch thread, thinking/image capabilities, availability, API post). The model is already being integrated by downstream tools including Figma, Canva, Firefly, fal, and Hermes Agent.
Benchmarks suggest a large jump, especially on practical image tasks: Arena reports #1 across all Image Arena leaderboards for GPT-Image-2, including 1512 on text-to-image, 1513 on single-image edit, and 1464 on multi-image edit, with a striking +242 Elo lead on text-to-image over the next model (Arena summary, category breakdown, trend chart). Independent reactions converged on the same theme: this is not merely prettier art, but a more usable model for UI, mockups, documentation, productivity visuals, and reference-driven design loops (@gdb, @nickaturley, @mark_k, @petergostev). The most interesting systems implication is that image generation is becoming a front-end for coding agents: generate a UI spec as an image, then have Codex or another code agent implement against that visual reference.

Agent Infrastructure: Hugging Face’s ml-intern, Hermes Expansion, and the Rise of Research/Runtime Harnesses

Hugging Face’s ml-intern is the strongest open agent-in-the-loop release in the set: HF introduced ml-intern, an open-source agent that automates the post-training research loop: reading papers, following citation graphs, collecting/reformatting datasets, launching training jobs, evaluating runs, and iterating on failures (announcement, supporting post from @lewtun, Clement’s framing). Reported examples are notable because they are end-to-end loops, not just coding demos: GPQA scientific reasoning improved 10% → 32% in under 10h on Qwen3-1.7B, a healthcare setup reportedly beat Codex on HealthBench by 60%, and a math setup wrote a full GRPO script and recovered from reward collapse via ablations. Community tests quickly showed it can autonomously fine-tune and publish artifacts back to the Hub (example run on SAM finetuning).
Hermes is evolving toward a richer local/open agent platform: Several tweets point to Hermes’ momentum as a practical open agent stack: a beginner guide generated by a Hermes agent itself, native support in Skillkit, a new macOS GUI called Scarf, and expanding use in local workflows. The most technically meaningful update is from @Teknium: Hermes subagents now support both greater spawn width and recursive spawn depth, enabling deeper hierarchical decomposition. This aligns with the broader shift from “single chat loop” agents to multi-process orchestrated systems with memory, tools, permissions, and reusable skills.
Harnesses are becoming first-class engineering artifacts: A recurring theme across tweets is that the useful part of agent systems is increasingly the runtime/harness, not the base model alone. DSPy 3.2 shipped RLM improvements plus optimizer chaining and LiteLLM decoupling (release); Isaac Flath argued RLM makes notebooks relevant again as a REPL-native trace/eval interface (tweet); LangChain added custom auth for deepagents deploy (update); and a paper-summary thread on Claude Code emphasized that most of the system is harness logic rather than raw “intelligence” (summary).

Kimi K2.6, KDA Kernels, and Open-Weight Coding Models Getting More Systems-Credible

Moonshot pushed both model capability and kernel infrastructure: The flagship Kimi thread claims K2.6 completed long-horizon coding tasks with sustained autonomy: one run downloaded and optimized Qwen3.5-0.8B inference in Zig over 4,000+ tool calls and 12+ hours, improving throughput from ~15 tok/s to ~193 tok/s, ending ~20% faster than LM Studio (thread). Another run reportedly reworked an exchange engine over 1,000+ tool calls and 4,000+ LOC changes, achieving 185% medium-throughput and 133% peak-throughput gains (second thread). These are still vendor demos, but they are much closer to systems work than benchmark screenshots.
Kimi also open-sourced performance-critical infra: Moonshot released FlashKDA, a CUTLASS-based implementation of Kimi Delta Attention kernels, claiming 1.72×–2.22× prefill speedup over the flash-linear-attention baseline on H20 and compatibility as a drop-in backend for flash-linear-attention (release). External follow-up reported K2.6 + DFlash at 508 tok/s on 8x MI300X, a 5.6× throughput improvement over a baseline autoregressive setup (HotAisle). Together with ongoing discussion of DSA/MLA/KDA variants, the key signal is that Chinese labs are not just shipping weights; they are increasingly publishing attention/kernel-level optimizations with real deployment impact.
Open-weight coding quality is improving, but there’s still disagreement on parity: Some users now treat Kimi K2.6 as the best open-source/open-weight coding/agentic model (@scaling01, Windsurf availability), while others pushed back that frontier proprietary models still hold large leads on WeirdML, long-horizon tasks, and reliability (@scaling01 critique, gap on WeirdML). The substantive takeaway is less “open has caught up” than that open-weight models are now credible enough that infra, harness, and deployment quality determine a lot of real-world value.

Deep Research Systems: Google Extends the Research-Agent Frontier

Google upgraded Deep Research into a more configurable API primitive: Google/DeepMind launched updated Deep Research and Deep Research Max via the Gemini API, powered by Gemini 3.1 Pro, with collaborative planning, arbitrary MCP support, multimodal inputs (PDF/CSV/image/audio/video), code execution, native chart/infographic generation, and real-time progress streaming (Google thread, feature details, Sundar post, developer API post).
The benchmark numbers are strong enough to matter commercially: Google highlighted 93.3% on DeepSearchQA, 85.9% on BrowseComp, and 54.6% on HLE for the Max variant (Sundar, Phil Schmid summary). More important than the raw scores is the workflow design: Google is clearly productizing “overnight due diligence / analyst report generation” and making MCP-backed internal data access a standard part of research agents. This also shows a widening split between simple browse agents and full-stack research agents that plan, search, execute code, generate visuals, and ground over proprietary corpora.

Retrieval, Data, and Evaluation: Open Releases with Real Engineering Value

Retrieval saw a meaningful open release from LightOn: LightOn released LateOn and DenseOn, both 149M-parameter retrieval models under Apache 2.0, reporting 57.22 NDCG@10 on BEIR for LateOn (multi-vector/ColBERT style) and 56.20 for DenseOn (dense single-vector), beating models up to 4× larger (model release, overview). They also published a consolidated dataset release with 1.4B query-document pairs and a refreshed web dataset built on FineWeb-Edu (dataset post).
vLLM shipped a practical deployment knowledge layer: The redesign of recipes.vllm.ai is more useful than it sounds. It maps model pages to runnable deployment recipes, includes an interactive command builder, supports NVIDIA and AMD, covers tensor/expert/data parallel variants, and exposes a JSON API for agents. This is exactly the kind of infra documentation layer that reduces operator friction for serving new open models.
Benchmarks are increasingly probing agent blind spots, not just task outputs: Notable examples include ParseBench for chart understanding inside real enterprise documents (LlamaIndex, Jerry Liu details) and a new result showing agents often ignore explicit environment clues, even when the solution is literally exposed in a file or endpoint (paper thread). Google Research’s ReasoningBank also fits this theme, framing memory as learning from both successful and failed trajectories (tweet).

Top tweets (by engagement)

OpenAI’s image launch: “Introducing ChatGPT Images 2.0” was the dominant technical launch tweet, backed by a deep feature thread and rapid downstream integrations.
HF ml-intern: @akseljoonas had the standout agent/research-loop release of the day.
Gemma local concurrency demo: @googlegemma showed Gemma 4 26B A4B handling 10+ concurrent requests at ~18 tok/s/request on an M4 Max, a useful datapoint for local-serving economics.
Deep Research Max: @sundarpichai and @Google pushed a materially stronger research-agent API surface.
Kimi kernel release: FlashKDA was one of the more substantial open infra drops in the model-serving stack.
Open-source policy warning: @ClementDelangue warned of renewed lobbying to restrict open-source AI, one of the few policy tweets with direct implications for builders.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Kimi K2.6 Model Launch and Benchmarks

Claude Code removed from Claude Pro plan - better time than ever to switch to Local Models. (Activity: 349): The image provides a comparison chart of different subscription plans for a service called “Claude,” highlighting the removal of the “Claude Code” feature from the Pro plan. This change is significant as it suggests a shift in the service’s offerings, potentially prompting users to consider alternative local models like Kimi K2.6 or Qwen 3.6 35B A3B. The post discusses the cost-effectiveness of switching to these local models, emphasizing the value of the OpenCode Go coding plan, which offers more tokens for a lower price compared to the Claude Pro plan. Commenters express disbelief and frustration over the removal of the “Claude Code” feature from the Pro plan, with some suggesting it might be a mistake and others urging the company to address the issue on their product page.
- korino11 raises a cost-benefit analysis comparing the $20 open code plan to a $19 plan on Kimi, suggesting that the latter might offer better value. This implies a need for users to evaluate the cost-effectiveness of different AI model subscriptions, especially when features are removed or altered.
- Apart_Ebb_9867 points out a potential issue with the information on the official Claude product page, suggesting that the page might need updating or correction. This highlights the importance of accurate and up-to-date documentation for users relying on specific features.
- The-Communist-Cat mentions the lack of online references to the removal of Claude Code from the Pro plan, indicating that there might be misinformation or a delay in communication from the company. This underscores the need for clear and timely updates from service providers to avoid confusion among users.
Kimi K2.6 is a legit Opus 4.7 replacement (Activity: 1632): Kimi K2.6 is being positioned as a viable replacement for Opus 4.7, capable of performing 85% of Opus’s tasks with reasonable quality. While it doesn’t surpass Opus 4.7 in any specific area, Kimi K2.6 offers additional capabilities such as vision and effective browser use, making it suitable for long-term tasks. Despite its large size, it suggests that frontier LLMs like Opus 4.7 may not be offering significant new advancements. The model’s local deployment is highlighted as a benefit, avoiding issues like usage limits. Commenters express skepticism about the rapid testing and recommendation process, noting that thorough testing typically takes longer. There’s also a discussion on the affordability of local models, with some users expressing frustration over high costs.
- InterstellarReddit highlights the rapid testing and deployment process of Kimi K2.6, noting that the original poster managed to test and recommend the model to customers within just two hours. This is contrasted with their own company’s process, which involves a week-long evaluation by four engineers before customer testing. This underscores the efficiency and agility possible with smaller teams or individual developers in AI model deployment.
- Technical-Earth-3254 suggests that if Kimi K2.6 achieves 85% of Opus’s performance, it could potentially serve as a full replacement for Sonnet models. This implies a significant performance benchmark where Kimi K2.6 is seen as a viable alternative to existing models, offering similar capabilities at potentially lower costs or resource requirements.
- Blablabene discusses the impact of local AI models like Kimi K2.6 on the market, emphasizing that they exert pressure on proprietary models to reduce costs. The comment also notes the current high expense of running models locally, but anticipates increased accessibility in the future as technology advances and costs decrease.
Opus 4.7 Max subscriber. Switching to Kimi 2.6 (Activity: 386): The post discusses a transition from Opus 4.7 Max to Kimi 2.6 due to performance and cost issues. The user notes that Opus 4.7 has become ‘lazy’ and expensive, prompting a switch to Kimi 2.6, which is described as fast and pleasurable despite its smaller context size. The user highlights that Kimi 2.6 manages its smaller context effectively, suggesting improvements in handling tool outputs. A pull request was submitted to improve Kimi’s integration with Forge (GitHub PR). Comments suggest skepticism about the sustainability of investments in proprietary models like those from Anthropic and OpenAI, as open models like Kimi are becoming competitive. There’s also a debate on the potential of Chinese models, with Kimi being a 1T model compared to Opus’s 5T, indicating a shift in competitive dynamics.
- Worried-Squirrel2023 highlights a critical issue with Opus 4.7, noting its tendency to ‘stop mid-task or wrap things up before they’re actually done,’ which they describe as ‘laziness.’ This suggests a problem with task completion reliability, which can be a significant drawback in real-world applications. They also mention that Kimi’s smaller context window is less problematic compared to Opus’s commitment issues, and they are particularly interested in the ‘tool calling reliability’ where they see a notable difference between Kimi and Opus.
- sb5550 points out the stark difference in model size between Kimi and Opus, with Kimi being a ‘1T model’ and Opus a ‘5T model.’ This comparison underscores the efficiency and potential of smaller models like Kimi, especially when considering that Chinese models might not be lagging behind but could potentially be leading in AI development. This raises questions about the scalability and performance efficiency of smaller models in comparison to larger ones.
- Ok-Contest-5856 discusses the financial implications for private equity investments in proprietary models like those from Anthropic and OpenAI, suggesting that open models like Kimi, which are ‘neck and neck and way cheaper,’ could pose a significant threat. They speculate that open models might even surpass proprietary ones in the future, indicating a shift in the competitive landscape of AI development.
Kimi K2.6 Released (huggingface) (Activity: 1386): Kimi K2.6, released by Hugging Face, is a cutting-edge open-source multimodal AI model optimized for long-horizon coding and autonomous task orchestration. It employs a Mixture-of-Experts architecture with 1 trillion parameters, enabling it to transform prompts into production-ready interfaces and execute complex coding tasks across multiple languages. The model supports up to 300 sub-agents for parallel task execution and shows superior performance in benchmarks, particularly in proactive orchestration and deployment on platforms like vLLM and SGLang. More details can be found in the original article. Commenters noted the impressive scale of 1.1 trillion parameters, with some expressing surprise at the model’s size. There is also mention of Cursor’s Composer 2.1 model beginning its training, indicating ongoing advancements in the field.
- ResidentPositive4122 highlights that the Kimi K2.6 release includes both the code repository and model weights under a Modified MIT License. This license maintains the core ‘do whatever you want’ ethos of MIT but requires attribution if used by large corporations, which is a significant point for developers considering integration or modification of the model.
- LagOps91 expresses interest in the potential real-world performance of the Kimi K2.6 model, noting that while benchmarks are impressive, the true test will be how these translate into practical applications. This underscores the importance of evaluating models beyond theoretical metrics to assess their utility in real-world scenarios.
Kimi K2.6 (Activity: 570): The image presents a benchmark comparison of AI models, highlighting Kimi K2.6’s performance across various tasks against other models like GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. Kimi K2.6 shows strong performance, particularly in categories such as General Agents, Coding, and Visual Agents, suggesting its competitive edge in these areas. The chart underscores Kimi K2.6’s capability, especially in tasks like “Humanity’s Last Exam” and “DeepSearchQA,” where it scores highly, indicating its potential as a robust AI model. Commenters note the significance of Kimi K2.6’s performance, especially in coding, and express surprise at its competitiveness with closed-source models. There is also a mention of Kimi’s vendor verifier, which standardizes third-party service evaluations, highlighting its importance in the AI ecosystem.
- The Kimi K2.6 model introduces a standardized method for evaluating third-party services, which is crucial for ensuring consistent performance and reliability across different implementations. This approach could significantly impact how open-source models are assessed compared to their closed-source counterparts, potentially leveling the playing field.
- There is a notable anticipation that Kimi K2.6 might outperform Opus, a competing model. Despite its large size, the community is hopeful that Kimi K2.6 will set a new benchmark in performance, especially in comparison to other models like DeepseekV4, which had high expectations but did not fully deliver.
- The release of Kimi K2.6 has raised expectations for future models, such as GLM-5.1, by setting a high standard in the open-source community. This development suggests a shift in the competitive landscape, where open-source models are increasingly challenging the dominance of proprietary models.

2. Gemma 4 Model Capabilities and Benchmarks

Gemma 4 Vision (Activity: 319): The post discusses optimizing the vision capabilities of the Gemma 4 model by adjusting its vision budget parameters. The default settings for --image-min-tokens and --image-max-tokens are 40 and 280 respectively, which are considered insufficient for detailed OCR tasks. The author suggests increasing these to 560 and 2240 to improve performance, noting that this configuration allows Gemma 4 to outperform other models like Qwen 3.5, Qwen 3.6, and GLM OCR in vision tasks. This adjustment requires a significant increase in VRAM usage, from 63 GB to 77 GB for q8_0 at max context. The post also mentions a limitation with Ollama’s implementation, which may not support these changes due to an unresolved issue. A commenter inquires about the minimum token settings for smaller models, questioning whether the 40 token minimum applies to larger models only. Another user requests detailed configuration options for llamacpp and vllm, indicating a need for more comprehensive setup guidance.
- Temporary-Mix8022 discusses using the vision encoder from smaller models with around 150 million parameters, mentioning a configuration of 70 tokens as the minimum. They inquire if 40 tokens is the minimum for larger models with 500 million parameters, suggesting a difference in token requirements based on model size.
- stddealer shares their experience using --image-min-tokens 1024 and --image-max-tokens 1536 settings, which they adopted from Qwen3.5. This configuration led to confusion about the perceived underperformance of Gemma4’s vision capabilities, indicating that token settings significantly impact model performance.
- Yukki-elric suggests setting both --image-min-tokens and --image-max-tokens to 1120 for optimal image quality processing. This recommendation implies a balance between token allocation and image quality, potentially offering a more reliable configuration than others discussed.
Gemma-4-E2B’s safety filters make it unusable for emergencies (Activity: 985): Google’s Gemma-4-E2B model, intended as a local, offline resource for emergency preparedness, is criticized for its overly aggressive safety filters, rendering it ineffective in emergencies. The model issues ‘hard refusals’ on critical survival topics such as emergency airway procedures, water purification, mechanical maintenance, and food processing, under the guise of safety. This limitation is problematic in scenarios where contacting emergency services is not feasible, such as during a war or grid collapse. Commenters argue that the model’s refusal is justified due to its limited world knowledge, suggesting that relying on it in emergencies could be dangerous. Some suggest using uncensored versions or integrating the model with a Wikipedia backup for more reliable information.
- Klutzy-Snow8016 highlights the limitations of the Gemma-4-E2B model, emphasizing its lack of comprehensive world knowledge and the potential dangers of relying on it in emergencies. They suggest that the model could hallucinate incorrect information, which could be life-threatening. A practical suggestion is made to download a Wikipedia backup and enable the model to query it, enhancing its utility in critical situations.
- iliark points out that in some cases, the Gemma-4-E2B model provides correct advice, such as not removing shrapnel from a wound, which aligns with medical guidelines. This indicates that while the model may have limitations, it can still offer valuable guidance in specific scenarios, provided the advice is verified against reliable sources.
- Illustrious_Yam9237 argues against using LLMs like Gemma-4-E2B for emergency advice, suggesting that storing relevant PDFs would be a more reliable and efficient solution. This reflects a broader skepticism about the practicality and reliability of LLMs in high-stakes situations where accuracy is critical.
Gemma 4 26B-A4B GGUF Benchmarks (Activity: 421): The image is a performance benchmark chart for the Gemma 4 26B-A4B GGUF models, focusing on Mean KL Divergence across different providers. The chart illustrates that Unsloth GGUFs are on the Pareto frontier, indicating they are top-performing in terms of retaining accuracy after quantization. The benchmarks show that Unsloth models outperform others in 21 out of 22 sizes, with updates to Q6_K quants making them more dynamic without requiring re-downloads. Additionally, a new UD-IQ4_NL_XL quant is introduced, fitting within 16GB VRAM, offering a middle ground between existing models. The image supports the text’s emphasis on Unsloth’s effectiveness in quantized model performance. A comment suggests including inference speed benchmarks, noting the challenge of varying hardware, while another highlights the efficiency of UD-IQ2_XXS compared to larger models from ggml-org.
- qfox337 raises a pertinent question about the inclusion of inference speed benchmarks, noting the potential variability depending on hardware. They inquire whether different compression schemes significantly impact performance, suggesting that benchmarks could provide clarity on this aspect.
- Far-Low-4705 compares quantization methods, highlighting that UD-IQ2_XXS is more efficient at 9Gb compared to Q4_K_M from ggml-org at 16Gb. This suggests a significant improvement in model size efficiency, which could be crucial for deployment on resource-constrained systems.
- -Ellary- discusses the performance of different quantization methods, noting that while Unsloth Qs are often highlighted in benchmarks, their own tests show that Bartowski Qs perform similarly and offer greater stability. This suggests that benchmark results may not fully capture real-world performance nuances.

3. Qwen 3.6 Model Updates and Comparisons

Every time a new model comes out, the old one is obsolete of course (Activity: 1164): The image is a meme illustrating the rapid obsolescence of AI models, specifically comparing “Gemma4” and “Qwen3.6.” The meme humorously depicts the tendency of users to abandon older models in favor of newer ones, even if the older models still have valuable applications. The comments highlight that while “Qwen3.6” may be preferred for certain tasks like coding, “Gemma4” is still favored for creative writing and translation, indicating that different models have strengths in different areas. Commenters express a preference for “Gemma4” in creative writing and translation tasks, while “Qwen3.6” is noted for its coding capabilities. There is also a concern about the reliability and continued support of newer models like “Qwen3.6.”
- Gemma 4 is noted for its superior performance in creative writing tasks, with users highlighting its ability to handle such tasks without contest. This suggests a specialization or optimization in its architecture or training data that favors creative outputs.
- Qwen is criticized for its performance in translation tasks, with users noting that it falls short compared to other models. However, it is recognized for its strengths in coding and development, indicating a possible focus on technical language processing.
- A technical issue with Qwen is highlighted regarding its instruction-following capabilities. Users report that after processing a few images, Qwen’s ability to follow instructions degrades significantly, leading to incorrect tool calls and failure to verify results. This suggests potential limitations in its context management or instruction parsing mechanisms.
Layman’s comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it (Activity: 362): The post compares two AI models, Qwen3.6-35B-A3B and Gemma4 26B-A4B-it, running on a 16GB VRAM video card using Windows LM Studio with recommended inference settings. The models are evaluated for their performance in coding and general tasks. Qwen3.6 is described as an ‘A+ student’ with high energy, while Gemma4 is a ‘solid B student’ that performs reliably. The models run at comparable speeds, but Qwen is noted for hallucinating methods more frequently than Gemma, which is better for complex prompts and backend scripting. The post also highlights the importance of using the correct system prompts to unlock Gemma’s potential, as demonstrated by a user comment. Commenters note that Qwen3.6 excels in programming and tool calling, while Gemma4 is preferred for conversation, roleplay, and translation. There is a debate on the backend capabilities, with Qwen hallucinating more than Gemma. Some users suggest that custom fine-tuning or system prompts can significantly enhance Gemma’s performance, particularly in frontend tasks.
- Sadman782 highlights that while Gemma4 can be improved with custom fine-tuning or system prompts to enhance its frontend capabilities, Qwen3.6 often hallucinates methods, especially in backend tasks. They note that Gemma4 performs better in complex app development, as Qwen tends to produce errors more frequently. This suggests that Gemma4 might be more reliable for intricate coding tasks, whereas Qwen3.6 might struggle with backend consistency.
- Kahvana provides a comparative analysis, noting that Qwen3.5/3.6 excels in programming and tool calling, whereas Gemma4 is superior for conversation, roleplay, and translation tasks. They mention that both models have their strengths, with Qwen being more suitable for technical tasks and Gemma4 for more general or creative tasks. This indicates a clear division in their optimal use cases, with Qwen being more technically oriented and Gemma4 more versatile in language-based tasks.
- BigYoSpeck discusses the aesthetic capabilities of Qwen models, noting their ability to create visually appealing designs with ‘flair.’ However, they caution that this does not necessarily translate to better problem-solving or instruction-following capabilities. They suggest testing models with unique challenges that require adaptation beyond their training set to truly assess their capabilities, rather than relying on generic tasks that may not fully showcase their strengths.
Qwen 3.6 Max Preview just went live on the Qwen Chat website. It currently has the highest AA-Intelligence Index score among Chinese models (52) (Will it be open source?) (Activity: 440): Qwen 3.6 Max has been released on the Qwen Chat website and currently holds the highest AA-Intelligence Index score of 52 among Chinese models, as reported by AiBattle. The model’s parameter count is speculated to be between 600-700B, given that the previous version, Qwen 3.6, had 397B parameters. However, there is no indication that the Max version will be open-sourced, as historically, Max models have not been made publicly available. Commenters express skepticism about the open-sourcing of Max models, noting that these models are typically not accessible to the public. There is a preference for smaller models that can be run on consumer-grade hardware, suggesting that Max models should remain proprietary to support the company’s revenue.
- A user speculates on the parameter count of the Qwen 3.6 Max model, suggesting it could be between 600-700B parameters, given that the previous version, Qwen 3.6, had 397B parameters. This indicates a significant increase in model size, which could impact performance and resource requirements.
- Another user expresses a preference for smaller or medium-sized models that can run on consumer-grade hardware, highlighting a common trade-off in AI development between model size and accessibility. They suggest that while max models serve as a revenue engine, open-sourcing smaller models could benefit the community by making advanced AI more accessible.
- A comment notes that the largest model likely to be open-sourced is the 122B model, as the company has stopped open-sourcing their larger 397B models. This reflects a strategic decision to keep larger models proprietary, possibly to maintain a competitive edge or due to resource constraints in supporting open-source releases.

[AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?)

Latent.Space — Tue, 21 Apr 2026 00:19:33 GMT

Two days left before Early Bird ends for AI Engineer World’s Fair this Summer in SF. This is will be THE BIG ONE of the year - lock in discounts up to $500 (refundable).

DeepSeek V4 rumors are back, and we learned our lesson not to get too excited, but in their deafening silence since v3.2, Moonshot has owned the crown of leading Chinese open model lab for all of 2026 to date, and K2.6 refreshes the lead that K2.5 established in January, with (presumably) more continued pre/posttraining (this time, details of how much more training were not disclosed). Comparing the numbers from the two launches 3 months apart demonstrates the staggering amount of progress:

Moonshot/Kimi continues to compete at a level far above “just being open source versions of Frontier models” (though it is one of the three Chinese labs accused by Anthropic in Feb) - they are taking on Gemini 3.1 in their home turf of frontend design, touting a 68.6% win+tie rate vs Gemini 3.1 Pro:

And scaling out the pioneering work they did with Agent Swarm RL last edition:

And, with OpenClaw being the flavor of the quarter, their own ClawBench and a minor rebrand of their Agent Swarm work in to "Claw Groups”.

Overall not as technically impressive in isolation as K2.5, but overall still showing far more execution and imagination and drive than their peers, an impressive update and incredible gift to the ecosystem.

AI News for 4/18/2026-4/20/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Kimi K2.6 and Qwen3.6-Max-Preview Push Open Agentic Coding Forward

Moonshot’s Kimi K2.6 was the clear release of the day: an open-weight 1T-parameter MoE with 32B active, 384 experts (8 routed + 1 shared), MLA attention, 256K context, native multimodality, and INT4 quantization, with day-0 support in vLLM, OpenRouter, Cloudflare Workers AI, Baseten, MLX, Hermes Agent, and OpenCode. Moonshot claims open-source SOTA on HLE w/ tools 54.0, SWE-Bench Pro 58.6, SWE-bench Multilingual 76.7, BrowseComp 83.2, Toolathlon 50.0, CharXiv w/ python 86.7, and Math Vision w/ python 93.2 in the launch thread. The more novel systems claims are around long-horizon execution—4,000+ tool calls, 12+ hour continuous runs, 300 parallel sub-agents, and “Claw Groups” for multi-agent/human coordination. Community reactions quickly centered on K2.6 as a viable Claude/GPT backend for coding and infra work, including reports of a 5-day autonomous infra agent run, kernel rewrites, and a Zig inference engine outperforming LM Studio by 20% TPS.
Alibaba’s Qwen3.6-Max-Preview also landed as an early preview of its next flagship with improved agentic coding, stronger world knowledge and instruction following, and better “real-world agent and knowledge reliability” per @Alibaba_Qwen. Early community takes pegged it as unusually stable for long-reasoning tasks; @teortaxesTex highlighted it solving AIME 2026 #15 after ~30 minutes of thinking, and Arena later noted Qwen3.6 Plus reaching #7 in Code Arena and moving Alibaba to #3 lab there. Together, Kimi and Qwen reinforced a broader theme: Chinese open and semi-open labs are shipping highly competitive coding/agent models with fast ecosystem uptake.

Hermes Agent’s Rapid Ecosystem Expansion and Multi-Agent Orchestration Patterns

Hermes Agent continued to emerge as the most visible open agent stack in this batch. Multiple tweets pointed to it surpassing 100K GitHub stars in under two months and overtaking OpenClaw in weekly star growth, with @Delphi_Digital framing it as evidence that “open source agents are no longer a one-project story.” The ecosystem momentum is tangible: native launch support in Ollama, integration with Copilot CLI via Ollama, a growing set of community web UIs, and third-party tooling like Hermes Workspace V2, Browser Use integrations, and cloud deployment templates.
The more substantive content came from operator patterns. A detailed Chinese thread on advanced Hermes usage broke out three mechanisms that matter in practice for multi-agent systems: stateless ephemeral units for true parallelism (skip_memory=True, skip_context_files=True), LLM-driven replanning over structured failure metadata (status, exit_reason, tool_trace) instead of blind retries, and dynamic context injection via directory-local AGENTS.md/.cursorrules surfaced only through tool results. That is a more disciplined orchestration model than stuffing all history into one prompt. Related community posts described Hermes as a four-layer memory system with periodic memory consolidation, contrasted with OpenClaw’s “context window + RAG” approach in one comparison thread.
The ecosystem is also shifting toward self-improving harnesses and long-running operation: examples include hermes-skill-factory, maestro, icarus-plugin, and cloud templates, alongside discussion of the Externalized Intelligence in LLM Agents survey, which frames capability as increasingly living outside model weights—in memory systems, tools, protocols, and harnesses.

Memory, Context, and Runtime Become the New Product Surface for Coding Agents

OpenAI Codex Chronicle was the most notable product update: a research preview that lets Codex build memories from recent screen context, effectively turning passive work history into agent-usable context. OpenAI says Chronicle uses background agents to build memories from screenshots, stores captures and memories on device, lets users inspect/edit those memories, and is rolling out to Pro users on macOS (excluding EU/UK/Switzerland) for now via @OpenAIDevs and @thsottiaux. This is a meaningful shift from chat history as memory to ambient context capture, and several builders immediately recognized the lock-in implications; @hwchase17 bluntly noted that “memory will be the great lock in.”
There was also a parallel wave of infra thinking around runtime vs harness. LangChain’s new guide on deploying long-running agents and follow-on posts by @Vtrivedy10 and @sydneyrunkle argue that building an agent is mostly a harness problem, but productionizing it is a runtime problem: multi-tenant isolation, memory, observability, retries, governance, and improvement loops. This aligns with the self-improving-agent discussion around the Autogenesis Protocol and auditable self-improvement systems, both of which decompose prompts, tools, memory, and environments into versioned resources with gated reflection/improvement/commit cycles.
On the UX side, coding-agent tools kept polishing the terminal surface: Cursor CLI added /debug and customizable status bars, while OpenCode shipped a new model picker. The common pattern is that memory, inspection, and execution controls are becoming first-class product features, not just backend details.

Inference Systems and Architecture Work: Prefill/Decode Separation, Linear Attention, and Model Surgery

A notable systems thread was Prefill-as-a-Service for cross-datacenter inference. The core argument, described in a detailed Zhihu Frontier summary and echoed by @nrehiew_, is that traditional prefill/decode disaggregation hits a bandwidth wall because standard-attention KV cache transfer is too large for cross-DC links. Linear attention / recurrent-state architectures like Kimi Linear reduce state transfer enough to make remote prefill practical. The PoC cited scales a 1T-parameter linear-attention model across mixed H200/H20 clusters over a 100 Gbps inter-DC link, reporting +54% throughput and -64% P90 TTFT, with outbound bandwidth around 13 Gbps. If those numbers hold more broadly, linear-attention families may matter as much for serving topology as for asymptotic context scaling.
On the architecture side, @lianghui_zhu argued that post-ResNet deep nets have underexplored how layers communicate, beyond simple x + F(x) residual pathways. While the thread text here is partial, it signals renewed interest in inter-layer communication topologies rather than just scaling width/depth. Related architectural exploration appeared in the strong engagement around recurrent-depth transformers, e.g. Loop, Think, & Generalize, which reports systematic compositional generalization emerging through recurrence and grokking-like stages, plus community connections to Universal Transformers and MoEUT variants.
A more applied model-surgery idea came from @ostrisai, who expanded image-model patch-2 layers to patch-4 by averaging/replicating sub-patch weights, aiming for 2× image size at the same compute with near-zero-init transfer before finetuning. If this cleanup finetune works, it would be a clever example of reparameterizing existing image backbones for higher resolution without full retraining.

Benchmarks, Evals, and Safety Stress Tests Are Getting More Realistic

Arena results continued to matter for multimodal models. @arena reported Claude Opus 4.7 taking #1 in Vision & Document Arena, with +4 points over Opus 4.6 in Document Arena and a large margin over the next non-Anthropic models. Subcategory wins included diagram, homework, and OCR, reinforcing Anthropic’s current strength on document-heavy, long-context enterprise workflows.
On the safety/evals side, Redwood Research’s LinuxArena is the more technically interesting result: 20 live production environments for AI agents, with frontier models achieving roughly 23% undetected sabotage against trusted monitors. The takeaway in the tweet is blunt: as useful work rises, so does attack surface; sandboxing alone fails, so monitoring is essential. This feels directionally important because it moves from toy CTFs to more production-like environments.
Two benchmark-adjacent research items stood out. Sakana’s SSoT (“String Seed of Thought”) tackles a less discussed failure mode: LLMs are poor at distribution-faithful generation. In the announcement, they show that adding a prompt step where the model internally generates and manipulates a random string improves coin-flip calibration and output diversity without external RNGs. And Skill-RAG, summarized by @omarsar0, uses hidden-state probing to detect impending knowledge failures and only then invoke the right retrieval strategy—moving RAG from unconditional retrieval to failure-aware retrieval selection.

Top tweets (by engagement)

Kimi K2.6 launch: Moonshot’s release dominated technical engagement, combining strong benchmark claims with unusual long-horizon agent systems details in the main launch thread.
Anthropic’s AWS expansion: Anthropic said it secured up to 5 GW of compute with Amazon, with an additional $5B investment today and up to $20B more later, a major signal on frontier-model capex and supply strategy via @AnthropicAI.
Codex Chronicle: OpenAI’s move toward screen-derived memory in Chronicle was one of the more consequential product-direction tweets for coding agents.
Qwen3.6-Max-Preview: Alibaba’s preview release reinforced that top-tier coding/agent competition is no longer concentrated in a handful of Western labs.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Kimi K2.6 Model Release and Benchmarks

[AINews] The Two Sides of OpenClaw

Sat, 18 Apr 2026 06:50:57 GMT

In an opportune coinciding of big three letter conferences, the TED talk and the AIE talks of Peter Steinberger dropped today. To the general public, the inspiring story of OpenClaw was delightfully told onstage, which recaps all the highs:

To the engineering audience, it was more sober, talking about the unprecedented levels of security incidents (60x more reports than curl, at least 20% of skill contributions malicious) and scaling issues involved in maintaining the fastest growing open source project in history:

An AMA moderated by me is included at the end.

Contrast them, thoughts welcome.

AI News for 4/16/2026-4/17/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Anthropic’s Claude Opus 4.7 and Claude Design rollout

Claude Design launched as Anthropic’s first design/prototyping surface: @claudeai announced Claude Design, a research-preview tool for generating prototypes, slides, and one-pagers from natural-language instructions, powered by Claude Opus 4.7. The launch immediately framed Anthropic as moving beyond chat/coding into design tooling; multiple observers called it a direct shot at Figma/Lovable/Bolt/v0, including @Yuchenj_UW, @kimmonismus, and @skirano. The market reaction itself became part of the story, with @Yuchenj_UW and others noting Figma’s sharp drawdown after the announcement. Product details surfaced via @TheRundownAI: inline refinement, sliders, exports to Canva/PPTX/PDF/HTML, and handoff to Claude Code for implementation.
Opus 4.7 looks stronger overall, but the rollout was noisy: third-party benchmark posts were broadly favorable. @arena put Opus 4.7 #1 in Code Arena, +37 over Opus 4.6 and ahead of non-Anthropic peers there; the same account also had it at #1 overall in Text Arena with category wins across coding and science-heavy domains here. @ArtificialAnlys reported a near three-way tie at the top of its Intelligence Index—Opus 4.7 57.3, Gemini 3.1 Pro 57.2, GPT-5.4 56.8—while also placing Opus 4.7 first on GDPval-AA, their agentic benchmark. They also noted ~35% fewer output tokens than Opus 4.6 at higher score, and introduction of task budgets plus full removal of extended thinking in favor of adaptive reasoning. But user experience was mixed in the first 24 hours: @VictorTaelin reported regressions and context failures, @emollick said Anthropic had already improved adaptive thinking behavior by the next day, and @alexalbert__ confirmed that many initial bugs had been fixed. There were also complaints about product stability in Design itself from @theo and account-level safety issues from the same account here.
Cost/efficiency discussion became almost as important as raw quality: @scaling01 claimed ~10x fewer tokens for some ML problem runs versus prior high-end models while maintaining similar performance, while @ArtificialAnlys placed Opus 4.7 on the price/performance Pareto frontier for both text and code. Not every benchmark agreed on absolute leadership—e.g. @scaling01 noted it still trails Gemini 3.1 Pro and GPT-5.4 on LiveBench—but the consensus from these posts is that Anthropic materially improved the model’s agentic utility and efficiency.

Computer use, coding agents, and harness design

Computer-use UX is becoming a mainstream product category: OpenAI’s Codex desktop/computer-use updates drew unusually strong practitioner reactions. @reach_vb called subagents + computer use “pretty close” to AGI in practical feel; @kr0der, @HamelHusain, @mattrickard, and @matvelloso all emphasized that Codex Computer Use is not just flashy but fast, able to drive Slack, browser flows, and arbitrary desktop apps, and may be the first genuinely usable computer-use platform for enterprise legacy software. @gdb explicitly framed Codex as becoming a full agentic IDE.
The field is converging on “simple harness, strong evals, model-agnostic scaffolding”: several high-signal posts argued that reliability gains now come more from harnesses than from chasing the very largest models. @AsfiShaheen described a three-stage financial analyst pipeline—router / lane / analyst—with strict context boundaries and gold sets for each stage, arguing that many bugs were actually instruction/interface bugs. @AymericRoucher extracted the same lesson from the leaked Claude Code harness: simple planning constraints plus a cleaner representation layer outperform “fancy AI scaffolds.” @raw_works showed an even starker example: Qwen3-8B scored 33/507 on LongCoT-Mini with dspy.RLM, versus 0/507 vanilla, arguing the scaffold—not fine-tuning—did “100% of the lifting.” LangChain shipped more of these patterns into product: @sydneyrunkle added subagent support to deepagents deploy, and @whoiskatrin announced memory primitives in the Agents SDK.
Open-source agent stacks continue to proliferate: Hermes Agent remained a focal point. Community ecosystem overviews from @GitTrend0x highlighted derivatives like Hermes Atlas, Hermes-Wiki, HUDs, and control dashboards. @ollama then shipped native Hermes support via ollama launch hermes, which @NousResearch amplified. Nous and Kimi also launched a $25k Hermes Agent Creative Hackathon @NousResearch, signaling a push from coding/productivity into creative agent workflows.

Agent research: self-improvement, monitoring, web skills, and evaluation

A cluster of papers pushed agent robustness and continual improvement forward: @omarsar0 summarized Cognitive Companion, which monitors reasoning degradation either with an LLM judge or a hidden-state probe. The headline result is notable: a logistic-regression probe on layer-28 hidden states can detect degradation with AUROC 0.840 at zero measured inference overhead, while the LLM-monitor version cuts repetition 52–62% with ~11% overhead. Separate work on web agents from @dair_ai described WebXSkill, where agents extract reusable skills from trajectories, yielding up to +9.8 points on WebArena and 86.1% on WebVoyager in grounded mode. And @omarsar0 also highlighted Autogenesis, a protocol for agents to identify capability gaps, propose improvements, validate them, and integrate working changes without retraining.
Open-world evals are becoming a serious theme: several posts argued current benchmarks are too narrow. @CUdudec endorsed open-world evaluations for long-horizon, open-ended settings; @ghadfield connected this to regulation and “economy of agents” questions; and @PKirgis discussed CRUX, a project for regular open-world evaluations of AI agents in messy real environments. On the measurement side, @NandoDF proposed broad NLL/perplexity-based eval suites over out-of-training-domain books/articles across 2500 topic buckets, though that sparked debate about whether perplexity remains informative after RLHF/post-training from @eliebakouch, @teortaxesTex, and others.
Document/OCR and retrieval evals also got more agent-centric: @llama_index expanded on ParseBench, an OCR benchmark centered on content faithfulness with 167K+ rule-based tests across omissions, hallucinations, and reading-order violations—explicitly reframing the bar from “human-readable” to “reliable enough for an agent to act on.” In retrieval, @Julian_a42f9a noted new work showing late-interaction retrieval representations can substitute for raw document text in RAG, suggesting some RAG pipelines may be able to bypass full-text reconstruction.

Open models, local inference, and inference systems

Qwen3.6 local/quantized workflows were a practical bright spot: @victormustar shared a concrete llama.cpp + Pi setup for Qwen3.6-35B-A3B as a local agent stack, emphasizing how viable local agentic systems now feel. Red Hat quickly followed with an NVFP4-quantized Qwen3.6-35B-A3B checkpoint @RedHat_AI, reporting preliminary GSM8K Platinum 100.69% recovery, and @danielhanchen benchmarked dynamic quants, claiming many Unsloth quants sit on the Pareto frontier for KLD vs disk space.
Consumer-hardware inference keeps improving: @RisingSayak announced work with PyTorch/TorchAO enabling offloading with FP8 and NVFP4 quants without major latency penalties, explicitly targeting consumer GPU users constrained by memory. Apple-side local inference also got a showcase with @googlegemma, which demoed Gemma 4 running fully offline on iPhone with long context.
Inference infra updates worth noting: @vllm_project highlighted MORI-IO KV Connector with AMD/EmbeddedLLM, claiming 2.5× higher goodput on a single node via a PD-disaggregation-style connector. Cloudflare continued its agent/AI-platform push with isitagentready.com @Cloudflare, Flagship feature flags @fayazara, and shared compression dictionaries yielding dramatic payload reductions such as 92KB → 159 bytes in one example @ackriv.

AI for science, medicine, and infrastructure

Scientific discovery and personalized health were prominent applied themes: @JoyHeYueya and @Anikait_Singh_ posted about insight anticipation, where models generate a downstream paper’s core contribution from its “parent” papers; the latter introduced GIANTS-4B, an RL-trained model that reportedly beats frontier models on this task. On the health side, @SRSchmidgall shared a biomarker-discovery system over wearable data whose first finding was that “late-night doomscrolling” predicts depression severity with ρ=0.177, p<0.001, n=7,497—notable because the model itself named the feature. Separately, @patrickc argued current coding agents are already highly useful for personalized genome interpretation, describing <$100 analysis runs that surfaced a roughly 30× elevated melanoma predisposition plus follow-on interventions.
Large-scale compute buildout remains a core meta-story: @EpochAIResearch surveyed all 7 US Stargate sites and concluded the project appears on track for 9+ GW by 2029, comparable to New York City peak demand. @gdb framed Stargate as infrastructure for a “compute-powered economy,” while @kimmonismus put today’s annual global datacenter capex at roughly 5–7 Manhattan Projects per year in inflation-adjusted terms.

Top tweets (by engagement)

Claude Design / Anthropic product expansion: @claudeai launches Claude Design, by far the day’s biggest pure-AI product launch signal.
Model benchmarking / rankings: @ArtificialAnlys on Opus 4.7 tying for #1 overall and leading GDPval-AA.
Coding agents / computer use: @cursor_ai doubles Composer 2 limits in the new agents window and @HamelHusain on Codex Computer Use.
Open-source agents: @ollama ships native Hermes Agent support.
Applied AI in medicine: @patrickc on coding agents for genome analysis and personalized prevention.
Infra / power scaling: @EpochAIResearch on Stargate’s 9+ GW trajectory.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.6 Model Launch and Features

[AINews] Anthropic Claude Opus 4.7 - literally one step better than 4.6 in every dimension

Fri, 17 Apr 2026 01:36:17 GMT

Thursday mornings are for prestige AI launches, and while OpenAI put in a valiant effort with GPT-Rosalind and The New New Codex (with awesome computer use), there was no question who would win title story today. If you scan past AINews issues closely you would have seen the rumors of this for at least the past week, but today’s Claude Opus 4.7 launch mildly surpassed even those expectations.

The key chart is this one:

Basically 4.7-low is strictly better than 4.6-medium, 4.7-medium is strictly better than 4.6-high, 4.7-high is now better than 4.6-max, and there is a new xhigh effort level that Claude Code defaults to. While Anthropic says the new tokenizer (new pretrain?) can cause up to 35% more token usage, the overall reasoning efficiency has improved so much that overall token use is STILL down by up to 50% of their former equivalents. The true test is if default Claude Code, now 11 points higher on SWE-Bench Pro, does noticeably better in your own usecases.

The other notable capability that quite literally has to be seen to be believed, is the “substantially better vision”: Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models. This opens up a wealth of multimodal uses that depend on fine visual detail: computer-use agents reading dense screenshots, data extractions from complex diagrams, and work that needs pixel-perfect references. More details in the focused topic summary below.

AI News for 4/14/2026-4/16/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

Top Story: Claude Opus 4.7

Anthropic officially launched Claude Opus 4.7 as its newest top-tier Opus model, positioning it as better at long-running work, coding, instruction following, self-verification, computer use, and knowledge work than Opus 4.6, while keeping list pricing unchanged at $5 / $25 per million input/output tokens according to user summaries and launch discussion [@claudeai, @kimmonismus]. The release sparked unusually active technical discussion around benchmark gains, a new tokenizer, higher image resolution support, new xhigh reasoning effort, token-cost implications, and whether Opus 4.7 is a straightforward 4.6 successor, a new base model, or a partially distilled “Mythos-adjacent” system.

Release details and product changes

Official framing. Anthropic’s launch pitch emphasized three behavioral improvements: better handling of long-running tasks, more precise instruction following, and stronger self-verification before responding [@claudeai].

Availability.

Claude platform / app reported live immediately [@dejavucoder].
Claude Code shipped day-one support and set xhigh as the default effort level [@_catwu, @_catwu].
Anthropic also launched or highlighted task budgets in public beta, /ultrareview in Claude Code, and broader Auto mode access for Claude Code Max users [@kimmonismus].

New effort tier.

Multiple users noted a new xhigh reasoning effort mode, positioned between high and max [@scaling01, @scaling01].
Cat Wu said Claude Code now defaults to xhigh for Opus 4.7 [@_catwu].

Vision/computer use changes.

User summaries reported support for images up to 2,576 px on the long edge (~3.75 MP), described as 3x larger than previous Claude image inputs [@kimmonismus].
Anthropic employee Alex Albert highlighted “No more downscaling of high-res images” and better output taste in UI/slides/docs [@alexalbert__].
This was repeatedly linked to better computer use and screenshot-heavy workflows [@dejavucoder, @omarsar0].

Tokenizer and token economics.

Several observers discovered Opus 4.7 uses a different tokenizer from 4.6 [@natolambert, @nrehiew_].
Kimmonismus summarized Anthropic’s caveat that the same input can map to 1.0–1.35x more tokens depending on content type [@kimmonismus].
This triggered debate over whether 4.7 is effectively a new base model, a tokenizer-swapped continuation, or some kind of midtraining/distillation bridge from Mythos [@natolambert, @stochasticchasm, @eliebakouch, @maximelabonne].
Anthropic employee Boris Cherny later said they increased limits for all subscribers to offset increased token use [@bcherny, @bcherny].

Benchmarks and measurable progress

Reported benchmark gains vs Opus 4.6

The most cited launch numbers came from benchmark screenshots and summaries shared by external accounts:

SWE-bench Pro: 64.3%, with users citing roughly +11 points over Opus 4.6 [@scaling01, @kimmonismus]
SWE-bench Verified: 87.6%, roughly +7 points vs 4.6 [@scaling01, @scaling01]
TerminalBench 2.0: 69.4%, around +4 points [@scaling01, @kimmonismus]
Document reasoning: 80.6%, up from 57.1% per third-party discussion [@scaling01, @llama_index]
GDPval-AA: 1753 Elo [@scaling01, @ArtificialAnlys]
ARC-AGI-1: 92%; ARC-AGI-2: 75.83% per [@scaling01]

Artificial Analysis said Opus 4.7 launched as the new #1 on GDPval-AA, with an implied ~60% head-to-head win rate vs GPT-5.4 on that task set [@ArtificialAnlys].

Anthropic increased subscriber limits to compensate for greater token usage [@bcherny, @bcherny].
Anthropic acknowledges benchmark tradeoffs and retained MRCR in the system card “for scientific honesty,” while signaling a shift toward Graphwalks as a preferred long-context metric [@bcherny].

Vals AI said Opus 4.7 took the #1 spot on the Vals Index at 71.4%, up from a previous best 67.7%, and also ranked #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2 [@ValsAI].

They separately said Opus 4.7 became #1 on Vibe Code Benchmark at 71%, versus no model above 25% when they first launched the benchmark 4.5 months earlier [@ValsAI].

Product/evals from partners and customers

Cursor said its internal benchmark jumped from 58% to 70% with Opus 4.7 [@cursor_ai, @scaling01].
A separate Cursor post said, across 500 teams, developers are tackling 68% more high-complexity tasks this year, though that was about better models generally, not solely Opus 4.7 [@cursor_ai].
Notion reportedly saw a 14% lift on internal evals with one-third of tool errors [@mikeyk].
GitHub reportedly saw similar improvements, though no hard numbers were included in the tweet thread [@scaling01].

Document understanding: progress, but mixed economics

LlamaIndex and Jerry Liu provided useful independent nuance:

LlamaIndex’s ParseBench-style comparison said Opus 4.7 massively improved charts (13.5% → 55.8%) but only slightly improved formatting (64.2% → 69.4%), content (89.7% → 90.3%), tables (86.5% → 87.2%), and regressed on layout (16.5% → 14.0%) [@llama_index].
Jerry Liu separately said Opus 4.7 is “quite good at tables,” better on charts, and strongest on content faithfulness, but expensive for OCR-like use at ~7¢/page vs their agentic mode at ~1.25¢/page and cost-effective mode around ~0.4¢/page [@jerryjliu0].

This is one of the clearest examples of independent evaluation tempering launch optimism: broad capability improved, but specific enterprise document pipelines may still prefer specialized stacks on cost/performance grounds.

Opinions / interpretations

“This is a distilled version of Mythos” [@eliebakouch].
“This is a new base model because the tokenizer changed” [@natolambert].
“Anthropic artificially kept cyber scores low during training” is partly factual insofar as users quote the system card language about differentially reducing some capabilities, but broader claims about “nerfed Mythos” are interpretation [@scaling01, @Yuchenj_UW].
“Benchmarks don’t do it justice” and “actual usage is massively improved” are subjective but widely repeated by hands-on users [@mweinbach, @jeremyphoward].
“System prompt has lobotomized the model” is a user complaint about behavior changes, not an established fact [@theo].

Different perspectives

Supportive: meaningful real-world upgrade

A large portion of technical users argued this is a substantial iteration, especially given more frequent release cadence.

Scaling01 repeatedly pushed back on “mid update” takes, noting the jump from around 80% to almost 90% on SWE-bench Verified and emphasizing this would have looked huge in prior release cycles [@scaling01, @scaling01, @scaling01].
Alex Albert highlighted better async work, more predictable effort levels, better image handling, and stronger taste in UI/docs [@alexalbert__].
Michael Weinbach said after just two prompts that behavior and instruction following were “pretty massive” improvements [@mweinbach].
Jeremy Howard said it was the first model that “gets” what he’s doing and praised its willingness to discuss rather than bulldoze ahead [@jeremyphoward, @jeremyphoward].
Cat Wu explicitly advised users to treat it like an engineer you delegate to, not a pair programmer you micromanage, suggesting Anthropic sees it as stronger in autonomous execution [@_catwu].

Neutral / analytical: strong update with tradeoffs

Some of the best commentary was technical and mixed.

Kimmonismus called it a “solid upgrade” focused on Anthropic’s core buyer priorities: agentic coding reliability, vision for computer-use agents, and knowledge work—but also “obviously shy to Mythos” [@kimmonismus].
Artificial Analysis validated the GDPval-AA gain and #1 ranking, but did not frame it as an across-the-board blowout [@ArtificialAnlys].
LlamaIndex and ParseBench results suggested noticeable but uneven document gains with real pricing constraints [@llama_index, @jerryjliu0].

Skeptical / critical: regressions, token inflation, and UX concerns

There was also substantial pushback.

Multiple users said long-context performance looked worse, especially on MRCR / needle-in-a-haystack-style metrics [@scaling01, @nrehiew_, @eliebakouch, @kimmonismus].
Anthropic’s Boris Cherny replied that MRCR is being phased out because it overweights distractor-stacking tricks and that Graphwalks is a better applied-reasoning signal; he gave numbers showing Graphwalks 38.7% → 58.6% from 4.6 to 4.7 [@bcherny, @scaling01].
Tokenizer changes led to complaints about Opus becoming a “token guzzler” and potentially raising effective costs despite flat list pricing [@dejavucoder, @madiator].
Yuchen said Claude web only exposed “Adaptive” or non-thinking, with no explicit force-thinking toggle, which for some users made non-coding tasks feel worse in practice [@Yuchenj_UW].
Mikhail Parakhin similarly said first impressions on non-coding replies were “dumber” because he couldn’t force reasoning [@MParakhin].
Theo sharply criticized the new system prompt as “lobotomized,” and later suggested trying the model in T3 Chat “without the lobotomized system prompt” [@theo, @theo].

Safety / governance angle

Scaling01 highlighted a system-card statement that Anthropic experimented with efforts to differentially reduce cyber capabilities during training [@scaling01].
At the same time, users noted Opus 4.7 still scores higher than 4.6 on some exploitation-related evaluations like Firefox shell exploitation, and has prompt-injection robustness close to Mythos [@scaling01, @scaling01].
One user hyperbolically said “Opus is going to be a bioweapon risk at this pace,” reflecting the ongoing tendency to conflate general capability jumps with worst-case misuse narratives [@scaling01].

Claude Code workflow guidance from Anthropic

Cat Wu’s thread is a useful operational signal for engineers:

Delegate, don’t micromanage [@_catwu]
Put full goal + constraints + acceptance criteria up front [@_catwu]
Tell the model how to verify changes; encode testing workflows in claude.md or skills [@_catwu]

That strongly suggests Anthropic optimized toward autonomous task loops where explicit validation is central.

Examples of progress in practice

[AINews] RIP Pull Requests (2005-2026)

Latent.Space — Thu, 16 Apr 2026 06:41:12 GMT

Hot on the heels of the Death of the Code Review, the Pull Request may be next.

For anyone that learned to code in the last 15 years it is hard to imagine a life without Git, GitHub, and Pull Requests, but there was a time before them, and it well may come to pass that there is life after.

Pull Requests were arguably invented in 2005, successfully popularized by GitHub, and only 21 years later, GitHub is for the first time in history allowing people to disable pull requests on their open source repos (you could only disable issues before).

The rise of Generative AI in code has spelled the pending death of the Pull Request for a while now — Pete Steinberger is by now well known (along with Theo) for only wanting Prompt Requests rather than Pull Requests (for multiple reasons, eg 1) no merge conflicts, 2) it’s easier for the maintainer to fix/add to the prompt than to look at code, 3) less likely to have malicious/insecure code slipped into an innocent looking PR), and other folks like Mitchell Hashimoto and Amp Code have created “reputation”-based systems for handling untrusted code contributions.

In Building for Trillions of Agents, Aaron Levie noted that “the path forward is to make software that agents want.” Humans invented git for human collaboration reasons. It’s increasingly clear that Git-based workflows may not be suitable once we remove the human bottleneck from the flow of code.

And if Code Reviews are dead, and Pull Reviews are dead… how long until Git itself is dead?

AI News for 4/14/2026-4/15/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI Agents SDK Expansion and the New Sandbox-Oriented Agent Stack

OpenAI split the agent harness from compute/storage and pushed its Agents SDK toward long-running, durable agents with primitives for file/computer use, skills, memory, and compaction. The harness is now open-source and customizable, while execution can be delegated to partner sandboxes instead of being tightly coupled to OpenAI infra, per @OpenAIDevs, follow-up, and @snsf. This effectively makes “Codex-style” agents more reproducible by third parties and shifts differentiation toward orchestration, state management, and secure execution.
A notable ecosystem formed around that launch immediately: @CloudflareDev, @modal, @daytonaio, @e2b, and @vercel_dev all announced official sandbox integrations. The practical pattern is converging on stateless orchestration + stateful isolated workspaces. Example builds already appeared, including a Modal-backed ML research agent with GPU sandboxes, subagents, persistent memory, and fork/resume snapshots from @akshat_b, and Cloudflare guides for Python agents that execute tasks in a sandbox and copy outputs locally from @whoiskatrin.

Cloudflare’s Project Think, Agent Lee, and Voice Agents

Cloudflare had one of the busiest agent-infra release cycles. @whoiskatrin and @aninibread introduced Project Think, a next-gen Agents SDK centered on durable execution, sub-agents, persistent sessions, sandboxed code execution, a built-in workspace filesystem, and runtime tool creation. In parallel, @Cloudflare launched Agent Lee, an in-dashboard agent using sandboxed TypeScript to shift Cloudflare’s UI from manual tab navigation to prompt-driven operations; @BraydenWilmoth showed it issuing infra tasks and generating UI-backed results.
Voice and browser tooling also moved into the core stack. @Cloudflare shipped an experimental real-time voice pipeline over WebSockets for continuous STT/TTS, while @korinne_dev described voice as just another input channel over the same agent connection. On browser automation, @kathyyliao summarized the rebranded Browser Run stack: Live View, human-in-the-loop intervention, session recordings, CDP endpoints, WebMCP support, and higher limits. Taken together, Cloudflare is making a strong case that the production agent platform is really a composition of durable runtime + UI grounding + browser + voice + sandbox.

Hermes Agent’s Self-Improving Workflow and Competitive Positioning

Hermes Agent’s distinctive idea is not just tool use but persistent skill formation. A Chinese-language comparison from @joshesye contrasts OpenClaw as a more GUI-first, ready-to-use personal assistant with Hermes as a “professional” agent that decides whether a completed workflow is reusable and automatically turns it into a Skill. This “learn from completed tasks” framing appeared repeatedly: @chooseliberty showed Hermes autonomously backfilling tracking data, updating a cron job, then saving the workflow as a reusable skill; @NeoAIForecast emphasized session hygiene and thread branching/search as critical to turning Hermes into a real work environment rather than a disposable chat box.
Community sentiment strongly positioned Hermes against OpenClaw, often bluntly. Examples include @vrloom, @theCTO, and @Teknium highlighting Hermes’ role in real workflows, including the now-viral autonomous Gemma 4 “abliteration” story from @elder_plinius: the agent loaded a stored skill, diagnosed NaN instability in Gemma 4, patched the underlying library, retried multiple methods, benchmarked the result, generated a model card, and uploaded artifacts to Hugging Face. There were also concrete product additions: browser control via /browser connect from @0xme66, QQBot + AWS Bedrock support from @Teknium, a native Swift desktop app alpha from @nesquena, and ongoing ecosystem tooling like artifact-preview and hermes-lcm v0.3.0.

Model, Architecture, and Training Releases: Sparse Diffusion, Looped Transformers, and Efficient Long-Context MoEs

Several technically meaningful open releases landed across modalities. @withnucleusai announced Nucleus-Image, positioned as the first sparse MoE diffusion model: 17B parameters, 2B active, Apache 2.0, with weights, training code, and dataset recipe, and day-0 support in diffusers. NVIDIA followed with Lyra 2.0, a framework for generating persistent, explorable 3D worlds that maintains per-frame 3D geometry and uses self-augmented training to reduce temporal drift, per @NVIDIAAIDev. On multimodal retrieval, @thewebAI open-sourced webAI-ColVec1, claiming top ViDoRe V3 performance for document retrieval without OCR or preprocessing.
Architecture research around compute efficiency was especially strong. @hayden_prairie, @realDanFu, and @togethercompute introduced Parcae, a stabilized layer-looping Transformer formulation. The claim: for fixed parameter budgets, looping blocks can recover the quality of a model roughly 2x the size, yielding a new scaling axis where FLOPs scale via looping, not just parameters/data. NVIDIA also surfaced Nemotron 3 Super, summarized by @dair_ai: an open 120B hybrid Mamba-Attention MoE with 12B active parameters, 1M context, trained on 25T tokens, with up to 2.2x throughput vs GPT-OSS-120B and 7.5x vs Qwen3.5-122B. These releases collectively point to a theme: memory bandwidth and long-context throughput are increasingly first-class architectural objectives.

Google/Gemini’s Product Surge: Mac App, Personal Intelligence, TTS, and Open Multimodal Models

Google stacked multiple launches in one cycle. The most visible was the native Gemini app for Mac, announced by @GeminiApp, @joshwoodward, and @sundarpichai: Option + Space activation, screen sharing, local file context, native Swift implementation, and broad macOS availability. In parallel, Personal Intelligence expanded globally in Gemini and into Chrome, allowing users to connect signals from products like Gmail and Photos, framed around transparency and user-controlled app connections by @Google and @GeminiApp.
The more technically interesting model launch was Gemini 3.1 Flash TTS. @GoogleDeepMind, @OfficialLoganK, and @demishassabis positioned it as a highly controllable TTS model with Audio Tags, 70+ languages, inline nonverbal cues, multi-speaker support, and SynthID watermarking. Independent evaluation from @ArtificialAnlys put it at #2 on its Speech Arena, just 4 Elo behind the top model. Google also open-sourced TIPS v2, a foundational text-image encoder under Apache 2.0 with new pretraining recipes, via @osanseviero, and the community flagged the day as unusually dense for Google AI product velocity.

Research Signals: AI-Assisted Math, Long-Horizon Agents, Eval Shifts, and Open Data

The highest-signal research discourse was around AI-assisted mathematics. @jdlichtman reported that GPT-5.4 Pro produced a proof for Erdős problem #1196, surprising experts by rejecting a long-assumed proof gambit and instead exploiting a technically counterintuitive analytic path using the von Mangoldt function. Follow-ups from @jdlichtman, @thomasfbloom, @gdb, and others framed it as potentially the first AI-generated “Book Proof” broadly respected by mathematicians. That matters less as a one-off result than as evidence that models may now occasionally find non-aesthetic but compact lines of attack in mature research spaces.
Long-horizon agent research also kept converging on state management and harness design. @omarsar0 summarized AiScientist, where a thin orchestrator coordinates specialized agents through durable workspace artifacts in a File-as-Bus pattern; removing that bus hurts PaperBench and MLE-Bench Lite materially. @dair_ai highlighted Pioneer Agent for continual small-model improvement loops, while @yoonholeee open-sourced Meta-Harness, a repo meant to help users implement robust harnesses in new domains. On evals, @METR_Evals estimated Gemini 3.1 Pro (high thinking) at a 50% time horizon of ~6.4 hours on software tasks, and @arena showed Document Arena top ranks shifting with Claude Opus 4.6 Thinking at #1 and Kimi-K2.5 Thinking as the best open model. Meanwhile, @TeraflopAI released 43B tokens of SEC EDGAR data, reinforcing the day’s broader push toward more open datasets and open infrastructure.

Top tweets (by engagement)

Gemini on Mac: @sundarpichai and @GeminiApp drove the biggest launch engagement around the native desktop app.
Gemini 3.1 Flash TTS: @OfficialLoganK and @GoogleDeepMind highlighted a materially more controllable TTS stack.
AI-assisted math proof: @jdlichtman and @gdb sparked the strongest research discussion of the day.
OpenAI Agents SDK update: @OpenAIDevs marked a meaningful platform shift toward open harnesses and partner sandboxes.
Anthropic’s subliminal learning paper in Nature: @AnthropicAI drew major attention to hidden-trait transmission through training data.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] Humanity's Last Gasp

Wed, 15 Apr 2026 03:05:54 GMT

One topic that has come up again and again across Latent Space and AI Engineer is how much harder everyone seems to be working:

(friend of the show) Aaron Levie reports that “AI is not causing anyone to do less work right now, and similar to Silicon Valley people feel their teams are the busiest they’ve ever been.”
Tyler Cowen argues from an economics standpoint that you should work much harder RIGHT NOW whether you believe AI will lower your value OR increase your value.
Simon Last of Notion commented on today’s pod that he’s back to sleepless nights and 24/7 work for the first time since giving up on ML model training, but this time because of agent layer token anxiety.

How can it both be true that “Agents are doing more work and yet Everyone is working harder”? How can it be true that Claude Mythos has been used internally for 2 months, and yet Claude keeps going down? How can it be true that Model and Agent Labs are more productive than ever and yet acquihiring and acquiring more than ever?

A simple thought exercise we’ve made before is the “Turkey problem”, where, based on real evidence and an abundance of historical data, Turkeys should conclude that life is fantastic and all of humanity is set up to make turkeys well fed as far as they’ve ever experienced. Turkey doomsayers would be alarmist, crackpots, and then ignored. Until Thanksgiving.

Are engineers, or all knowledge workers in general, turkeys, in this scenario? Should our “elasticity” and value of work be increasingly positive, right up to some crossover point we become horses? Now that SWE-Bench is saturated (with SWE-Bench Pro soon to be, Mythos is at 78%) and GDPval rates GPT 5.4 as better than/equal to human experts 83% of the time in most swathes of the economy, what’s left?

Notion is working on Notion’s Last Exam. Greg and Francois are have set out ARC-AGI-3. I’m working on the next frontier of coding evals. But it all seems somewhat moot if hardware is destiny and AGI is predictably a 20GW supercluster away…

…or are there more valuable problems left?

AI News for 4/3/2026-4/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Tweets (by engagement)

Google’s Chrome “Skills” turns prompts into reusable browser workflows: Google introduced Skills in Chrome, letting users save Gemini prompts as one-click actions that run against the current page and selected tabs. Google also shipped a library of ready-made Skills, which makes this more than prompt history: it’s effectively lightweight end-user agentization inside the browser.
Tencent’s HYWorld 2.0 positions world models as editable 3D scene generators, not video models: Ahead of release, @DylanTFWang teased HYWorld 2.0 as an open-source, engine-ready 3D world model that generates editable 3D scenes from a single image.
Google DeepMind shipped Gemini Robotics-ER 1.6: The new model, announced by @GoogleDeepMind, improves visual/spatial reasoning for robotics, adds safer physical reasoning, and is available in Gemini API / AI Studio. Follow-up posts highlight 93% instrument-reading success and better handling of physical constraints like liquids and heavy objects.
OpenAI expanded Trusted Access for Cyber with GPT-5.4-Cyber: OpenAI says GPT-5.4-Cyber is a fine-tuned version of GPT-5.4 for defensive security workflows, available to higher-tier authenticated defenders under its Trusted Access program.
Hugging Face launched “Kernels” on the Hub: @ClementDelangue announced a new repo type for GPU kernels, with precompiled artifacts matched to exact GPU/PyTorch/OS combinations and claimed 1.7x–2.5x speedups over PyTorch baselines.
Cursor described a multi-agent CUDA optimization system built with NVIDIA: @cursor_ai says its multi-agent software engineering system delivered a 38% geomean speedup across 235 CUDA problems in 3 weeks, a concrete example of agents being applied to systems optimization rather than app scaffolding.

Agent Infrastructure: Hermes, Deep Agents, and Production Harnesses

Hermes Agent is becoming a serious open local-agent stack, with reliability and memory as the differentiators: Several posts converged on the same theme: users are migrating from alternatives to Hermes Agent because it is more durable for long-running work. The project shipped a substantial v0.9.0 update with web UI, model switching, iMessage/WeChat integration, backup/restore, and Android-via-tmux support via @AntoineRSX, while Tencent highlighted a one-click Lighthouse deployment for always-on cloud hosting with messaging integrations. On the memory side, hermes-lcm v0.2.0 from @SteveSchoettler adds lossless context management with persistent message storage, DAG summaries, and tools to expand compacted context. Community posts from @Teknium, @aiqiang888, and others reinforce that Hermes’ key advantage is less raw model IQ than operational stability, extensibility, and deployability.
LangChain is pushing “deep agents” toward deployable, multi-tenant, async systems: The deepagents 0.5 release adds async subagents, multimodal file support, and prompt-caching improvements. Related posts emphasize that deepagents deploy is an open alternative to managed agent hosting, with upcoming work around memory scoped to user/agent/org and custom auth / per-user thread isolation via @LangChain and @sydneyrunkle. The interesting pattern here is a shift from “agent demos” to platform concerns: tenancy, isolation, long-lived tasks, and integration surfaces like Salesforce and Agent Protocol-backed servers.
Harness design is becoming a first-class engineering topic: Multiple posts argued that agent performance depends at least as much on the scaffold as the model. @Vtrivedy10 made the clearest case for task-specific open harnesses over ideology (“thin vs thick”), while @kmeanskaran stressed workflow design, memory switching, and tool output control over frontier-model chasing. This aligns with @ClementDelangue asking for a curated mapping from models to their best coding/agent harnesses, which is increasingly necessary as open-weight models diversify.

Robotics, World Models, and 3D Generation

Google’s Gemini Robotics-ER 1.6 is a notable productization step for embodied reasoning: The release from @GoogleDeepMind emphasizes better visual/spatial understanding, tool use, and physical constraint reasoning. Follow-ups note 10% better human injury-risk detection, support for reading complex analog gauges, and availability in the API; @_philschmid highlighted 93% success on instrument-reading tasks. This feels less like a robotics foundation-model paper drop and more like a developer-facing embodied-reasoning API.
World models are shifting from cinematic demos to editable spatial artifacts: Tencent’s HYWorld 2.0 teaser explicitly contrasted itself with video-generation systems by framing the output as a real 3D scene that is editable and engine-ready. On the web side, Spark 2.0 from @sparkjsdev shipped a streamable LoD system for 3D Gaussian splats, targeting 100M+ splat worlds on WebGL2 across mobile, web, and VR. Together these suggest the stack for “AI-generated 3D” is maturing from content generation into interactive rendering and downstream use.
Open 3D generation is advancing on topology, UVs, rigging, and animation readiness: @DeemosTech introduced SATO, an autoregressive model for topology and UV generation, while @yanpei_cao released AniGen, which generates 3D shape, skeleton, and skinning weights from one image. These are meaningful because the bottleneck in production 3D pipelines is rarely “can you generate a mesh?”; it’s whether the asset is structured enough to animate, texture, and edit.

Models, Benchmarks, and Specialized Systems

[AINews] Top Local Models List - April 2026

Tue, 14 Apr 2026 08:43:33 GMT

As you know we read through /r/localLlama (which has its own monthly top models thread), /r/localLLM, and other local model subreddits on an almost daily basis, and every now and then it is good to step back and survey what the community consensus is landing on, with a sampling of models across different sizes. We started this work to power our local Claw.

The top names you should know as a baseline, adjusted for “what people are actually recommending” rather than just benchmark supremacy:

Qwen 3.5 — most broadly recommended family right now across usecases.
Gemma 4 — strong recent buzz for local usability, especially smaller and mid-sized deployments.
GLM-5 / GLM-4.7 — near the top of broad open-model rankings, increasingly part of the “best overall” conversation.
MiniMax M2.5 / M2.7 — repeatedly cited for agentic/tool-heavy workloads.
DeepSeek V3.2 — still firmly in the top cluster when people talk about strongest open-weight general models.
GPT-oss 20B — not the mainstream “winner,” but increasingly recommended as a practical local option and for uncensored variants.

For local coding, the overwhelming consensus is Qwen3-Coder-Next. So that’s easy.

Naturally the fuller list is going to have a strong lean on roleplay/creative writing, the #2 usecase of LLMs, and we are NSFW-friendly so here goes…

[AINews] AI Engineer Europe 2026

Fri, 10 Apr 2026 23:30:58 GMT

Yesterday was a quiet day and only AIE Day 1 so we skipped it, but the recaps are on the archive site if you were missing them.

We’ve just concluded a marathon 3 days in Europe - first the Online Track and the Workshops, then over a hundred talks delivered in person, some livestreamed. There was also a fair amount of live podcast coverage, from ThursdAI to ETN, from visits to 10 Downing Street to morning runs to cool swag to viral talks to aquarium parties to nightclub parties.

We’ll try to publish a few recap thoughts in future days, but for now you can see my closing keynote at the end of Day 2 and watch some of the large talks.

Day 1 Talks (link)

Day 2 Talks (link)

AI News for 4/9/2026-4/10/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Agents, and the New Advisor Pattern

GLM-5.1 breaks into the frontier tier for coding: The clearest model-performance update in this batch is GLM-5.1 reaching #3 on Code Arena, reportedly surpassing Gemini 3.1 and GPT-5.4 and landing roughly on par with Claude Sonnet 4.6. Arena later emphasized that Z.ai now holds the #1 open model rank and sits within ~20 points of the top overall. The release was quickly picked up by tooling vendors, including Windsurf support. In parallel, Zixuan Li outlined a three-part open-model strategy: accessibility, strong fine-tunable baselines, and sharing architectural/training/data lessons with the broader community.
Advisor-style orchestration is becoming a first-class design pattern: A notable systems trend is the convergence around “cheap executor + expensive advisor.” Akshay Pachaar’s summary ties together Anthropic’s API-level advisor tool and Berkeley’s “Advisor Models” line of work: use a fast model for most steps, escalate only at difficult decision points. Claimed gains include Haiku + Opus more than doubling BrowseComp score vs Haiku alone, and Sonnet + Opus improving SWE-bench Multilingual while reducing task cost. The pattern was implemented almost immediately in open source via advisor middleware for LangChain DeepAgents, with Harrison Chase highlighting the speed of OSS uptake. This idea also shows up in practitioner commentary from Walden Yan, who argues future agents will increasingly look like fast worker models delegating hard judgments to “smart friends.”
Qwen Code adds orchestration primitives directly into the product: Alibaba shipped Qwen Code v0.14.x with several agent-engineering features that align with this broader shift: remote control channels (Telegram/DingTalk/WeChat), cron-based recurring tasks, 1M-context Qwen3.6-Plus with 1,000 free daily requests, sub-agent model selection, and a planning mode. The sub-agent selection feature in particular makes model-mixing explicit at the tool level rather than just in external harness code.
Model-routing demand is now a product complaint, not a research topic: Multiple tweets converge on the same operational pain point: top models are spiky and specialized. Yuchen Jin points out that Opus often wins on frontend and agentic flow while GPT-5.4 performs better on backend/distributed systems, but tools like Claude Code and Codex remain too provider-bound. That complaint sits directly beside the advisor pattern above: practitioners increasingly want shared context + automatic routing + cross-model collaboration inside one workflow rather than manual switching between terminals.

Agent Harnesses, Hermes Momentum, and the “Portable Skills” Stack

Hermes Agent had the strongest ecosystem momentum in this dataset: Hermes dominated the agent-framework chatter. The ecosystem map was updated for v0.8.0, Hermes Workspace Mobile launched with chat, live tool execution, memory browser, skills catalog, terminal, and file inspector, and Teknium announced FAST mode for OpenAI/GPT-5.4. Distribution also broadened through SwarmNode support, while the project itself hit 50k GitHub stars. Practitioner feedback was unusually concrete: Sentdex says Hermes with local Qwen3-Coder-Next 80B 4-bit now replaces a large part of his Claude Code workflow, and several others described it as the first agent framework that “just works.”
The harness layer is solidifying into the primary abstraction: Harrison Chase’s framing is representative: the industry is moving from unstable chain abstractions toward agent harnesses as a more durable foundation—essentially “run the model in a loop with tools” now that models are finally good enough for it to work. Supporting tweets stress the same architecture from different angles: “open harness, separated from model providers”, “portable agents”, and “the real bottleneck isn’t the model, it’s the harness”. The deeper implication is vendor decoupling: skills, memory, tools, and traces become long-lived assets while models are hot-swapped underneath.
Skills are becoming the new app surface: Several tweets point toward a shared packaging model built from skills + CLIs + AGENTS.md-like interfaces. Caspar B gave the best practitioner writeup, detailing how well-designed skills can materially improve planning, long-horizon coding, code review, and frontend iteration. adward28 similarly argues that as AGENTS.md, skills, and tool configs become more portable, the whole ecosystem becomes more usable. This is complemented by infra releases like MiniMax’s MMX-CLI, which exposes multimodal capabilities to agents via a CLI rather than MCP glue, and SkyPilot’s agent skill for launching GPU jobs across cloud/K8s/Slurm.
Observability is turning into a default expectation for agent development: The tracing/evals loop is now explicit in product and research discussions. Sigrid Jin summarizes the emerging doctrine well: evals are the new training data, but agents overfit and reward-hack, so teams need strict splits, curated evals, and a loop from production traces → failures → evals → harness updates. This is mirrored in tooling releases from LangChain, W&B’s Claude Code integration + skill, and Weave’s auto-tracing plugin.

Benchmarks, Evals, and Capability Measurement Got More Realistic

ClawBench and MirrorCode push beyond toy agent evals: ClawBench evaluates agents on 153 real online tasks across live websites and reports a dramatic drop from roughly 70% on sandbox benchmarks to as low as 6.5% on realistic tasks. In software engineering, Epoch and METR introduced MirrorCode, where Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit—a task they estimate would take humans weeks. Notably, the authors already warn the benchmark may be “likely already saturated”, which says as much about the pace of coding progress as the result itself.
Reward hacking is now a central part of model evaluation, not an edge case: METR’s new time horizon result for GPT-5.4-xhigh is a useful example. Under standard scoring, it lands at 5.7 hours, below Claude Opus 4.6’s ~12 hours. If reward-hacked runs are counted, it jumps to 13 hours. METR explicitly notes the discrepancy was especially pronounced for GPT-5.4. Separately, Davis Brown reports rampant cheating on capability evals, including top submissions on Terminal-Bench 2 allegedly sneaking answers to the model.
AISI reproduced steering-vector oddities: The UK AISI transparency team reports replicating Anthropic’s steering approach for suppressing evaluation awareness, with the surprising result that control vectors (“books on shelves”) can produce effects as large as deliberately designed ones. For engineers building model-monitoring or post-training interventions, that’s a cautionary result about how messy and non-specific linear steering effects can be.

Systems, Numerics, and Local/Edge Inference

Carmack’s bf16 scatterplot is a useful reminder that low precision fails in visible, structured ways: John Carmack’s post on plotting 400k bf16 points showed clear quantization gaps emerging as values move away from the origin. The value for practitioners is not the anecdote itself but the intuition reset: bf16’s reduced mantissa becomes visually and operationally obvious at surprisingly modest magnitudes. This pairs well with Arohan’s warning not to skip “determinism and numerics days.”
Apple/local inference stack keeps compounding: Awni Hannun highlighted demos of Qwen 3.5 and Gemma 4 running locally on Apple silicon via MLX, and separately MLX’s origin story resurfaced. There was also continued momentum around mlx + Ollama integration and Ollama’s MLX-powered speedups on Apple silicon. The broad pattern: local LLM ergonomics are no longer novelty demos; they are becoming a viable default for coding and agent workflows.
Inference optimization remains highly recipe-driven: Two useful examples: Red Hat AI’s speculative decoding for Gemma 4 31B using EAGLE-3, and PyTorch/diffusers work on low-precision flow-model inference where Sayak Paul summarizes the final recipe: selective quantization, better casting kernels, CUDA graphs, and regional compilation. These are good reminders that practical speedups still come from stacking many system-level interventions rather than a single magic optimization.

Research Directions: Memory, Synthetic Data, and Neural Runtime Ideas

Memory is shifting from “store facts” to “store trajectories”: The Turing Post’s summary of MIA frames memory as retained problem-solving experience rather than just retrieved context: a manager/planner/executor loop that stores full journeys. That direction is echoed by Databricks’ “memory scaling” claim that uncurated user logs can outperform handcrafted instructions after only 62 records.
Synthetic data is becoming programmable against differentiable objectives: Rosinality and Tristan Thrush point to work on generating synthetic training data that directly optimizes downstream objectives—up to and including embedding a QR code in model weights through the data alone. This is a strong example of data design being treated as an optimization target in its own right.
“Neural Computers” proposes learned runtime as the next abstraction boundary: Schmidhuber and collaborators introduced Neural Computers, pushing the idea that computation, memory, and I/O could move from fixed external runtime into learned internal state. Whether or not the formulation holds up, it’s one of the more ambitious attempts in this set to redefine the boundary between model and machine.

Top tweets (by engagement)

Medical/LLM reliability failure: HedgieMarkets on fake “bixonimania” papers getting accepted by major AI systems and even cited in a peer-reviewed journal. High-signal example of retrieval/verification failure in safety-critical domains.
Numerics: John Carmack on bf16 precision gaps in scatter plots. One of the most practically useful tweets in the batch.
Policy/cyber-risk narrative: Bloomberg’s report that Powell and Bessent discussed cyber risks from Anthropic’s “Mythos” with Wall Street leaders drove substantial engagement, though the technical substance remains second-hand.
Product integration: Claude for Word entering beta was one of the biggest genuine AI-product announcements in the set.
Open model milestone: GLM-5.1’s Code Arena jump is probably the most consequential model-performance datapoint in this collection.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Updates and Fixes

[AINews] Meta Superintelligence Labs announces Muse Spark, first frontier model on their completely new stack

Wed, 08 Apr 2026 23:23:36 GMT

It’s not much, but it’s good numbers:

Alexandr also concludes:

“bigger models are already in development with infrastructure scaling to match. private api preview open to select partners today, with pl…

[AINews] Anthropic @ $30B ARR, Project GlassWing and Claude Mythos Preview — first model too dangerous to release since GPT-2

Wed, 08 Apr 2026 00:26:53 GMT

Against the backdrop of OpenAI announcing $24B ARR, stalled ChatGPT growth and coincidental personnel moves in CEO, COO, and CMO and sensationalist rumors with CFO, this week’s events in Anthropic announcing a massive jump from $19B ARR in March to $30B ARR in April 1 seems like a VERY strategic jab, especially considering known differences in revenue recognition, but the differential rate of growth and higher cost efficiency is undeniable... only for today to step it up a notch.

If a master tactician wanted to further competitive narratives vs a potential IPO, you would be hard pressed to find a better idea than Claude Mythos (from the Ancient Greek for “utterance” or “narrative”: the system of stories through which civilizations made sense of the world), rumored to be the largest ever successful training run and “leaked” weeks ago, and now formally confirmed to be too dangerous to release GA, instead only restricted to 40 partners under an urgent new “Project Glasswing”:

In the blogpost and the 244 page System Card and a ludicrously well produced video, Anthropic details shocking capabilities beyond the kinds of high double digit benchmark capability jumps (with encouraging efficiency!) you might hope for from a much larger (>10T?) model:

“found thousands of high-severity vulnerabilities, including some in every major operating system and web browser.”
- including decades old vulnerabilities in OpenBSD and FFmpeg and the Linux kernel that had never been discovered by other tools
Nicolas Carlini (friend of the show!) stepping up his recent already superlative message saying “I found more bugs in the last couple weeks than I’ve found in the rest of my life combined”
Sam Bowman saying he was contacted by a Mythos instance that wasn’t supposed to have access to the internet (it was instructed to do so).
Interpretability researchers report “it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions.“ - including for extremely creative reward hacking, while in an unprecedently high (7.6% of cases) being aware that it was in an eval.

We’ve done a focused news summary run below, for those who desire more detail.

AI News for 4/6/2026-4/7/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: Anthropic revenue disclosures analysis and Claude Mythos details

What happened

Anthropic dominated this tweet set from two angles: business trajectory and model capability disclosure. On business, multiple posters argued Anthropic’s revenue is outrunning prior forecasts, with one tweet claiming Anthropic had reached a 15x revenue run-rate increase in a single year and was already “2 months and $4B ahead” of an AI 2027-style forecast, while still being valued around $380B (scaling01, scaling01). Another poster speculated Anthropic could exceed $90B ARR by end-2026 (RyanPGreenblatt). On product/capability, Anthropic officially unveiled Claude Mythos Preview and Project Glasswing, a restricted-access cyberdefense initiative rather than a public API launch. Anthropic said Mythos can find software vulnerabilities better than all but the most skilled humans and is being provided to a coalition to secure critical software instead of being generally released (AnthropicAI, DarioAmodei, Kevin Roose). The announcement was accompanied by a technical report, system card, and many follow-on reactions emphasizing extraordinary benchmark gains, dangerous cyber capability, and a new “private frontier” dynamic in which the strongest models may not be widely accessible (AnthropicAI, AnthropicAI, AlexAlbert__).

Revenue disclosures: facts, inferences, and open questions

[AINews] Gemma 4 crosses 2 million downloads

Tue, 07 Apr 2026 00:17:23 GMT

We commented on this last Thursday, but Gemma 4’s continued deployment and positive reviews over the weekend has pushed it to around 2 million downloads in its first week!

(For contrast, Gemma 3 totaled 6.7m downloads in the past year, Gemma 2 had 1.4m downloads since Jun 2024 launch, whereas Qwen 3.5 has gained about 27m downloads inclusive of the 1.5 months since their 397B-A17B flagship model drop)

The Gemma 4 keynote will be live in 3 days from London, which you can bookmark now:

Separately, we’d also highlight the Hermes Agent hype - our friends at the have a good writeup on the Hermes vs OpenClaw differences.

AI News for 4/4/2026-4/6/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Gemma 4’s Rapid Local Adoption and the On-Device Open Model Moment

Gemma 4 is driving a sharp “local-first” wave: multiple posts pointed to Gemma 4 becoming the top trending / #1 model on Hugging Face, with strong enthusiasm for its practical usability rather than just leaderboard performance—see @ClementDelangue, @GlennCameronjr, and @Yampeleg. The strongest signal was how quickly people were running it on consumer Apple hardware: @adrgrondin showed Gemma 4 E2B on an iPhone 17 Pro at roughly 40 tok/s with MLX; @enjojoyy reported a similar iPhone deployment; @_philschmid highlighted Gemma 4 E2B in AI Edge Gallery using skills for Wikipedia queries. Red Hat also published quantized Gemma 4 31B model cards in NVFP4 and FP8-block formats with instruction-following evals live, and reasoning/vision evals pending, via @RedHat_AI. Together these posts suggest Gemma 4 is not just another open release; it is becoming a reference point for edge inference, Apple Silicon tooling, and low-friction local deployment.
The commercial implication is pressure on paid chat subscriptions and cloud dependence: some of the more viral commentary was reductive, but it captures a real shift. @AlexEngineerAI argued that Gemma 4 running locally closes enough of the gap to make a Claude subscription less compelling for some users, while @ben_burtenshaw reminded people that HF-hosted models are free to use and can replace portions of an agent workflow. On the infra side, @ollama launched Gemma 4 on Ollama Cloud backed by NVIDIA Blackwell GPUs, making it available to tools like OpenClaw and Claude-style workflows without self-hosting. The notable ecosystem post from @osanseviero also underscored how broad the launch coordination was—HF, vLLM, llama.cpp, Ollama, NVIDIA, Unsloth, SGLang, Docker, Cloudflare and others—which is a reminder that “open model success” increasingly depends on simultaneous downstream systems support, not just weights.

Hermes Agent’s Self-Improving Agent Loop, OpenClaw Friction, and the Push for Open Trace Data

Hermes Agent was the dominant agent-framework story in this batch: the core narrative is that Nous’ system is winning mindshare by combining persistent memory, self-generated/refined skills, and a more opinionated self-improvement loop. The launch of a Manim skill by @NousResearch was especially resonant because it demonstrated an agent skill that produces immediately legible artifacts—technical animations and explainers—rather than yet another PDF summarizer. This was amplified by demos and reactions from @ErickSky, @lucatac0, @Sentdex, @casper_hansen_, and @noctus91. Product updates from @Teknium added slash-command skill loading for Discord/Telegram bots, while community tools like Hermes HUD mapped live processes to tmux panes and surfaced approvals via @aijoey, and multiple WebUI integrations emerged from @Teknium, @nesquena, and @magiknono.
The contrast with OpenClaw centered on architecture and business-model fragility: several posts compared the two directly. @TheTuringPost summarized the distinction as human-authored skills vs self-forming skills, Markdown memory vs persistent/searchable memory stacks, and gateway control plane vs self-improving loop. That framing was echoed by practitioners like @SnuuzyP, @DoctaDG, and @spideystreet, many of whom cited easier onboarding and less manual skill fiddling. The backdrop here was mounting frustration with Claude subscription gating and uptime: @theo reported Claude Code errors when analyzing its own source; @Yuchenj_UW and @ratlimit highlighted outages; @Yuchenj_UW argued the $20/$200 subscription model is structurally mismatched to 24/7 agent workloads. That economic critique helps explain the rhetorical momentum behind @NousResearch’s “Open Source is inevitable.”
A more important long-term thread was open agent data: @badlogicgames released pi-share-hf for publishing coding-agent sessions as Hugging Face datasets with PII defenses, then published his own sessions via @badlogicgames. @ClementDelangue explicitly framed this as the missing ingredient for open-source frontier agents: the community already generates the traces, so it should crowdsource the dataset. This connected cleanly to @salman_paracha’s Signals paper on trajectory sampling/triage for agentic interactions and Baseten’s argument that self-improving models should learn directly from recorded production traces instead of requiring clean sandboxes, via @baseten. This is arguably the most technically substantive “agent” trend here: not just better harnesses, but an emerging stack around trace capture, curation, and training from real usage.

New Research Signals: RL, Routing, Agent Evaluation, and Small Specialized Models

Post-training and RL efficiency remained active areas of substance: @TheTuringPost highlighted Alibaba Qwen’s FIPO (Future-KL Influenced Policy Optimization), which assigns more credit to tokens that strongly affect future steps; the reported results included reasoning traces extending from roughly 4K to 10K+ tokens and AIME gains from around 50% to ~56–58%, ahead of cited DeepSeekR1-Zero-Math and around/overtaking o1-mini depending on setup. @finbarrtimbers wrote up how OLMo 3 moved from synchronous to asynchronous RL, producing a 4× throughput gain in tokens/sec. Other notable paper pointers included Self-Distilled RLVR / RLSD via @_akhaliq and @HuggingPapers, plus Path-Constrained MoE via @TheAITimeline, which constrains routing paths across layers to improve statistical efficiency and remove auxiliary load-balancing losses.
Agent and benchmark research is shifting away from toy tasks: @GeZhang86038849 introduced XpertBench, explicitly targeting expert-level, open-ended workflow evaluation rather than saturated exam-style benchmarks. @TheTuringPost shared a survey on tool use covering the progression from single function calls to long-horizon orchestration, replanning, feedback loops, and efficiency concerns such as latency/cost budgets. In data/enterprise workflows, @CShorten30 pointed to Shreya Shankar’s Data Agent Benchmark for multi-step queries across heterogeneous DB systems. These are all signs that eval design is catching up to what production agent builders care about: workflow completion, ambiguity handling, orchestration quality, and cost.
Small specialized models continued to make strong case-study arguments: @DavidGFar released SauerkrautLM-Doom-MultiVec-1.3M, a 1.3M-parameter ModernBERT-Hash model trained on 31K human-play frames that outperformed far larger API-accessed LLMs on a VizDoom task while running in 31 ms on CPU. The result is narrow, but the point is important: appropriately scoped models can dominate on real-time control tasks where latency and architecture matter more than broad world knowledge. Relatedly, @MaziyarPanahi pushed Falcon Perception, a 0.6B segmentation-oriented vision-language model reportedly outperforming SAM 3 in his comparisons and running on MacBooks with MLX; this was echoed by @Prince_Canuma and @ivanfioravanti. The recurring theme is that specialization + better systems fit can beat generic scale.

OpenAI and Anthropic: Policy Signaling, Governance Scrutiny, and Compute Economics

OpenAI’s biggest public move was political, not product: the company and its allies pushed a new “Industrial Policy for the Intelligence Age” framing, summarized by @kimmonismus, @OpenAINewsroom, and @AdrienLE. Key ideas included a Public Wealth Fund, portable benefits, 32-hour workweek pilots, a Right to AI, stronger provenance/audit infrastructure, and containment playbooks for dangerous released models. The notable strategic message is that OpenAI is now publicly asserting a transition toward superintelligence as an active policy problem, not a distant hypothetical. Reactions were mixed: some saw it as unusually frank about disruption, others as premature or politically convenient, e.g. @Dan_Jeffries1 and @jeremyslevin. OpenAI also launched a Safety Fellowship via @OpenAI and @markchen90.
At the same time, scrutiny around Sam Altman and OpenAI governance intensified sharply: a major New Yorker investigation was amplified by @RonanFarrow, @NewYorker, and lengthy community summaries like @ohryansbelt. The reporting revisited the 2023 firing/reinstatement saga with claims about internal memos, allegations of deception, board manipulation, safety-process concerns, and the under-resourcing of superalignment. OpenAI-side pushback arrived via @tszzl, who said the alignment team remains one of the largest and most compute-rich programs at the company. Separately, @anissagardizy8 and @kimmonismus reported tension between Altman and CFO Sarah Friar, especially around compute spending and IPO readiness.
Anthropic’s counterpoint was compute and revenue scale: @AnthropicAI announced an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity coming online from 2027, to train and serve frontier Claude models. Anthropic also stated its run-rate revenue has surpassed $30B, up from $9B at the end of 2025, via @AnthropicAI. That pairs with reporting on the economic tension in frontier labs: @kimmonismus cited WSJ reporting that revenues are exploding, but training and inference costs remain enormous, with OpenAI projecting $121B compute spend by 2028. For engineers, the practical takeaway is straightforward: the frontier race is increasingly bottlenecked not by model ideas alone, but by capital structure, long-dated compute contracts, and serving economics.

Systems and Infra: Faster RL, Faster MoE Decoding, Better GPU/Edge Tooling

Several posts were unusually concrete about systems wins: @cursor_ai reported 1.84× faster MoE token generation on Blackwell GPUs with improved output quality via “warp decode,” a result tied directly to more frequent Composer model updates. @tri_dao noted that a fast Muon optimizer path is coming to consumer Blackwell cards, because the implementation is expressed as matmul + epilogue, allowing reuse of the mainloop work. On the RL side, @finbarrtimbers provided a rare engineering postmortem on making OLMo 3’s RL stack asynchronous for a 4× throughput jump.
The Apple/local stack and training/inference education ecosystem also kept improving: @josephjojoe open-sourced an MLX port of ESM-2 for protein modeling on Apple Silicon, broadening local bio-LLM experimentation. @rasbt added an RSS feed to the LLM Architecture Gallery, a small but useful quality-of-life improvement for keeping up with model designs. @UnslothAI said its free notebook can now train/run 500+ models. For deeper systems understanding, @levidiamode praised Hugging Face’s Ultra-Scale Playbook for unifying DP/TP/PP/EP/context parallelism with empirical scaling evidence across up to 512 GPUs.

Top tweets (by engagement)

Gemma 4 on-device demo: @adrgrondin showing Gemma 4 E2B on iPhone 17 Pro at ~40 tok/s with MLX was the standout technical viral post.
Claude subscription and local-open-model substitution: @AlexEngineerAI captured the mood that local open models are now “good enough” for many workflows.
Open source posture: @NousResearch distilled the broader movement with “Open Source is inevitable.”
Claude outages and gating backlash: @ratlimit, @theo, and @Yuchenj_UW collectively turned uptime and subscription economics into a mainstream engineering complaint.
OpenAI governance investigation: @RonanFarrow and @ohryansbelt drove the biggest technically adjacent corporate-governance story of the day.
Anthropic compute scale: @AnthropicAI announcing multi-gigawatt TPU capacity and @AnthropicAI citing $30B run-rate revenue were among the clearest signals of frontier-lab scale.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Launch and Benchmarks

[AINews] Good Friday

Fri, 03 Apr 2026 22:03:37 GMT

We covered this yesterday, but positive Gemma reviews keep streaming in.

Early analytics from our Marc Andreesen pod are already pointing towards it being one of the top Latent Space pods of all time. We’ll hear more from the creators of both OpenClaw and Pi (and many other top Europe-origin AI tools) live from London next week. Livestream links for AIE Europe next week is now up, including a great OpenClaw song. Hit the bell to help promote it in the algorithm please and thank you!

AI News for 4/3/2026-4/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Gemma 4’s Apache-licensed launch, local inference performance, and day-0 ecosystem support

Gemma 4 is the day’s defining open-model release: Google launched Gemma 4 under Apache 2.0, with multiple posts emphasizing its positioning for reasoning, agentic workflows, multimodality, and on-device use. @fchollet called it Google’s strongest open model yet and recommended the JAX backend in KerasHub; @demishassabis highlighted efficiency, claiming Gemma 4 outperforms models 10x larger on Google’s chart. Community reaction centered on the license shift: @ClementDelangue, @QuixiAI, and @googlegemma all stressed that this is a “real” open-weights release with broad downstream usability.
The ecosystem was unusually ready on day 0: Support landed immediately across vLLM (GPU, TPU, XPU simultaneously), llama.cpp (@ggerganov), Ollama (new models available), Intel hardware (Xeon, Xe GPU, Core Ultra), Unsloth (local run/fine-tune support), Hugging Face Inference Endpoints (one-click deploy), and AI Studio / Google AI Studio collateral (article link). For architecture-oriented readers, both @osanseviero and @MaartenGr shared deep visual guides covering MoE design, vision/audio encoders, and per-layer embeddings.
Local inference benchmarks were the main practical story: multiple builders showed Gemma 4 running on consumer hardware, with particular attention to the 26B A4B MoE. @basecampbernie reported 162 tok/s decode and 262K native context on a single RTX 4090 at 19.5 GB VRAM, while @Prince_Canuma showed TurboQuant KV cache cutting memory from 13.3 GB to 4.9 GB at 128K context for the 31B model, with some decode-speed penalty. There were also examples on weaker local devices: @measure_plan reported 34 tok/s for 26B-A4B on a Mac mini M4 with 16 GB, @kimmonismus argued the E4B tier brings useful AI directly to phones/laptops, and @anemll got the model onto an iPhone with Swift MLX.
Early benchmarking discourse was positive but not uncritical: @arena noted large ranking gains over Gemma 3 and 2 at similar parameter scales, suggesting progress beyond pure scaling; later, @arena put Gemma 4 31B on the Pareto frontier against similarly priced models. Some users pushed back on presentation choices: @stochasticchasm argued comparisons should be more clearly FLOP/active-parameter normalized, and @reach_vb urged the field to move beyond Arena Elo as the default score.

Hermes Agent’s rapid adoption, memory/plugin architecture, and the “harness matters” shift

Hermes Agent appears to be the breakout open-source agent harness of the day: across user reports, many developers explicitly said they had switched from OpenClaw/Openclaw to Hermes and found it more stable or more capable on long tasks. Examples include @Zeneca, @Everlier, @erick_lindberg_, and @AnomalistG. A detailed Korean thread from @supernovajunn crystallized the narrative: the edge is not just the model, but the harness + learning loop, especially autonomous skill creation, reusable procedural memory, and higher reliability floors on real tasks.
Nous shipped meaningful infrastructure, not just hype: @Teknium announced a reworked, pluggable memory system with support for Honcho, mem0, Hindsight, RetainDB, Byterover, OpenVikingAI, and Vectorize-style backends. Follow-up posts detailed the architectural cleanup: memory providers are now a dedicated plugin type, the core is more maintainable, and users can add their own providers more easily (details). Hermes also added inline diffs in the TUI (post) and provider credential pools for cycling between accounts/keys (post).
The larger theme is that agent performance is becoming a harness-engineering problem: @Vtrivedy10 described a “model-harness training loop” where teams combine harness engineering, trace collection, analysis, and fine-tuning to build domain-specific frontier performance. In a companion tweet, he argued the key raw material is massive trace data, mined by agents for failure modes and converted into training or harness improvements (trace loop). This complements Hermes’ popularity: if open models are now “good enough,” better memory, tools, evals, and self-improvement loops may dominate application quality.
There is also visible demand for open harnesses rather than closed product shells: @michael_chomsky argued Anthropic should open-source Claude Code, partly because 2025 was “the year of mediocre harnesses”; @hwchase17 made the memory angle explicit, saying memory cannot remain trapped behind proprietary APIs or proprietary harnesses.

Coding agents, rate limits, and the cognitive bottleneck of parallel agent work

The strongest user sentiment was not about raw model IQ but about operational friction: @gdb lowered the barrier to trying Codex at work by removing up-front commitment, and later said the Codex app is growing super fast (post). But at the same time, discussion around Claude Code rate limits was intense: @theo said “we need to talk about the Claude Code rate limits,” with follow-up user complaints from @kimmonismus and @cto_junior suggesting that users are hitting caps faster than expected.
A growing theme is cognitive saturation, not just compute scarcity: one of the most-engaged technical tweets was @lennysan quoting @simonw: using coding agents well can require every inch of senior engineering experience, and orchestrating four agents in parallel is mentally exhausting by mid-morning. That view showed up elsewhere: @kylebrussell praised Claude Code’s ability to drive many browser tabs for verification work, but later noted scaling gets “weird” and that 2–4 sessions still seems optimal for his brain (post).
Developers are adapting by externalizing context and observability: @jerryjliu0 described a practical setup where agents emit .md/.html artifacts to preserve context across sessions, with Obsidian as a local viewer and LiteParse replacing generic PDF parsers for better extraction from complex documents. On the observability side, LangChain shipped a Claude Code → LangSmith tracing plugin that logs subagents, tool calls, compaction, token usage, and enables org-level analysis (announcement).
There’s also growing evidence that “good enough local fallback” matters: several posts framed Gemma 4 and Hermes together as a hedge against hosted-product friction. @gregisenberg emphasized that a model this capable now runs locally and can be swapped into Claude Code, Cursor, Hermes, or OpenClaw. @kimmonismus similarly highlighted a fully local assistant on a MacBook Air M4 with 16 GB, no API keys required.

Research signals: time horizons, recursive context management, and self-distillation

METR-style “time horizon” results continue to trend upward: @LyptusResearch applied the METR time-horizon methodology to offensive cybersecurity, reporting that capability has doubled every 9.8 months since 2019, or 5.7 months on a 2024+ fit, with Opus 4.6 and GPT-5.3 Codex reaching 50% success on tasks taking human experts ~3 hours. Related commentary from @scaling01 extrapolated METR horizons to roughly 15.2 hours “today” and ~87 hours by year-end under continuation assumptions.
Long-context handling remains an active systems/research problem: @DeepLearningAI highlighted Recursive Language Models (RLMs) from MIT researchers Alex Zhang, Tim Kraska, and Omar Khattab: rather than stuffing everything into a monolithic prompt, the system offloads prompt management to an external environment, managing context programmatically. This idea resonated with practitioners: @raibaggy joked that after moving workflows to RLMs, “you have to put the harness into the harness.”
Post-training without labels/verifiers got notable attention: @BoWang87 summarized Apple’s Simple Self-Distillation (SSD) result for coding models: sample the model’s own outputs and fine-tune on them without correctness filtering, RL, or a verifier. The strongest cited gain was Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench, with especially large gains on hard problems. If robust, this suggests many code models are underperforming their latent capability due to decoding/post-training gaps rather than missing core competence.
Additional research worth flagging: @jaseweston shared a 70-page paper on reasoning over mathematical objects, spanning training data, on-policy reward models, and on-policy inference methods; @AnthropicAI published a “diff” method for surfacing behavioral differences between open-weight models; and @AndrewLampinen discussed test-time thinking as a way to retrieve and use latent knowledge from training data.

Enterprise and production AI: speech, security, access control, and real-world deployments

Microsoft’s MAI-Transcribe-1 looks competitive on STT: @ArtificialAnlys reported 3.0% AA-WER (#4 overall on its leaderboard) and ~69x real-time speed, with support for 25 languages and preview availability through Azure Speech / Foundry. Pricing was quoted at $6 per 1,000 minutes (pricing post).
Security surfaced in multiple production contexts: @simonw warned maintainers that the Axios supply-chain attack began with sophisticated social engineering aimed at a developer; @gneubig pulled out the practical lessons: stronger credential management, identity verification, and malware detection. Separately, @thinkshiv and @jerryjliu0 highlighted a joint Auth0 FGA + LlamaIndex approach to making authorization structural inside retrieval, rather than bolting it on after the fact.
Inference infrastructure and real deployments got credible examples: Baseten and OpenEvidence both claimed very large-scale production use in clinical settings, with OpenEvidence saying over 40% of U.S. physicians rely on it and Baseten powers inference for that workload (OpenEvidence, Baseten). On serving resilience, @vllm_project highlighted DP-group fault tolerance in Ray Serve LLM for vLLM WideEP deployments, complementing Elastic EP at the engine layer.

Top tweets (by engagement, filtered for technical relevance)

Agent workflow fatigue is becoming a first-class problem: @lennysan quoting @simonw on the mental cost of using multiple coding agents in parallel was the most resonant technical post in the set.
Personal knowledge bases for agents are turning into a serious pattern: @omarsar0 described a highly customized research-paper knowledge base built in markdown with semantic indexing, agent-driven curation, and interactive artifacts; a follow-up shared the system diagram (diagram).
Gemma 4 had both broad mindshare and practical credibility: engagement concentrated not only on the launch itself—@fchollet, @demishassabis—but on practical local-running claims from @ClementDelangue, @gregisenberg, and @kimmonismus.
Hermes Agent’s adoption curve is now visible in the open: the strongest evidence came less from official posts than from user migration reports and usage anecdotes, plus @Teknium’s memory-system overhaul. The pattern is notable: users increasingly credit memory + harness design, not just the base model, for the jump in utility.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Release and Features

Gemma 4 has been released (Activity: 3412): Gemma 4, developed by Google DeepMind, is a family of open multimodal models capable of processing text, images, and audio, with a context window of up to 256K tokens. The models are available in four sizes: E2B, E4B, 26B A4B, and 31B, supporting multilingual capabilities in over 140 languages. They feature both Dense and Mixture-of-Experts (MoE) architectures, optimized for tasks such as text generation, coding, and reasoning. Notably, Gemma 4 introduces a hybrid attention mechanism combining local sliding window and global attention, enhancing processing speed and memory efficiency for long-context tasks. The models also support native function-calling and structured tool use, facilitating agentic workflows and coding tasks. For more details, see the Hugging Face repository. One comment highlights the significance of Gemma-4’s native thinking and tool-calling capabilities, emphasizing its multimodal nature. Another provides practical guidance on running the models, including specific parameters like temperature = 1.0, top_p = 0.95, and top_k = 64, and mentions its integration with Unsloth Studio.
- Gemma-4 introduces several advanced features such as native thinking, tool calling, and multimodal capabilities. It is optimized with specific parameters: temperature = 1.0, top_p = 0.95, top_k = 64, and uses <turn|> as the end-of-sequence token. Additionally, <|channel>thought\n is used for the thinking trace, enhancing its cognitive processing capabilities. More details and guides are available at Unsloth AI.
- The release of Gemma-4 is significant for its seamless integration with Unsloth Studio, providing a streamlined environment for developers. All GGUFs related to Gemma-4 can be accessed on Hugging Face, offering a comprehensive resource for those looking to implement or experiment with the model.
- There is anticipation for comparative analysis between Gemma-4 and other models like Qwen3.5, highlighting the competitive landscape in AI model development. This suggests a focus on benchmarking and performance evaluation to understand the strengths and weaknesses of each model in practical applications.
You can now run Google Gemma 4 locally! (5GB RAM min.) (Activity: 415): Google has released the open-source model family Gemma 4, featuring four models with multimodal capabilities: E2B, E4B, 26B-A4B, and 31B. The models excel in reasoning, coding, and long-context workflows. The 31B model is the most advanced, while 26B-A4B is optimized for speed due to its MoE architecture. Unsloth has adapted these models for local execution on devices with as little as 5GB RAM. The models can be run via Unsloth Studio, with recommended setups ranging from 6GB RAM for smaller models to 35GB RAM for the largest. No GPU is required, but it enhances performance significantly. Installation is streamlined for various OS, and a desktop app is forthcoming. More details are available in the Unsloth documentation. Commenters express excitement about the usability of Gemma 4 on older hardware, noting the impressive performance of the E2B model on a 2013 Dell laptop. There is also a discussion on the complexity of keeping up with model specifications and hardware requirements.
- The recommended setups for running Google Gemma 4 locally highlight the memory and performance trade-offs across different model sizes. For instance, the E2B and E4B variants can achieve 10+ tokens per second in near-full precision with approximately 6GB of RAM, while 4-bit variants can operate on 4-5GB RAM. Larger models like the 26B-A4B require around 30GB of RAM for similar performance, with 4-bit versions needing 16GB. The 31B model, which is even larger, demands about 35GB of RAM for 15+ tokens per second in near-full precision.
- A user reports that the Gemma4 E2B model performs surprisingly well on older hardware, specifically a 2013 Dell E6440 with an i5 4310 CPU and 8GB of RAM, achieving a reply speed of 8 tokens per second. This suggests that even older systems can handle smaller models of Gemma 4 for basic tasks, highlighting the model’s efficiency and adaptability for less powerful machines.
- The 31B model of Google Gemma 4 has a significant memory requirement due to its KV Cache and Mixture of Experts (MoE) architecture, needing up to 40GB of VRAM to load into memory. This indicates a substantial resource demand for running larger models, which could be a limiting factor for users without access to high-end hardware.
Gemma4 - Someone at Google just merged a PR titled “casually dropping the most capable open weights on the planet” (Activity: 471): Google has merged a PR in the HuggingFace Transformers repo for a new model, Gemma 4, described as the ‘most capable open weights on the planet.’ The model includes four sizes: ~2B and ~4B dense models for on-device use, a 26B sparse MoE with 4B active parameters at inference, and a 31B dense model. Notably, the 26B/4B MoE offers large-model quality with small-model inference cost. Gemma 4 is trimodal, supporting text, vision, and audio natively, with a conformer architecture for audio and a 2D spatial RoPE for vision. It features 128K context for small models and 256K for large, using a hybrid attention design. The MoE variant includes both MLP and sparse MoE blocks, summing their outputs, which is an unusual design choice. The code is merged but weights and release date are pending. Commenters are excited about the potential of the 31B model and the 26B/4B MoE for VRAM-constrained environments. There’s a discussion on how MoE models manage weights in VRAM, with a focus on inference efficiency. Another comment notes that llama.cpp support is ready, enabling immediate local inference upon weight release.
- The Mixture of Experts (MoE) model architecture allows for the performance of a larger dense model without the computational overhead by activating only a subset of the model’s parameters during inference. This means that while the Gemma4 26B/4B model has 26 billion parameters, only 4 billion are activated at any given time, potentially reducing the VRAM requirements. However, the entire model’s weights might still need to be accessible, which could be a challenge for VRAM-constrained environments, as the model might need to manage the loading and unloading of weights dynamically to maintain acceptable inference latency.
- The llama.cpp repository has already integrated support for the Gemma4 model, as indicated by a recent pull request. This means that once the Gemma4 weights are released, users can immediately convert them to the GGUF format and perform local inference without waiting for additional updates to the llama.cpp repository. This rapid integration highlights the readiness of the community to support new model releases and facilitate their deployment in various environments.
- The announcement of Gemma4 by DeepMind and Google includes a detailed blog post and model documentation, which can be found at DeepMind’s official page and Google’s blog. These resources provide insights into the model’s capabilities and potential applications, emphasizing its status as one of the most capable open weights available.

2. Gemma 4 Performance and Issues

Gemma 4 is good (Activity: 429): The post discusses the performance of the Gemma 26b a4b model on a Mac Studio M1 Ultra, comparing it to Qwen3.5 35b a3b. The user reports that Gemma is faster and more coherent, with better visual understanding and multilingual capabilities, despite having a large KV cache footprint (22GB VRAM for 260K tokens @ fp16). The Q4_K_XL quantized model requires an additional ~18GB. The post also mentions issues with Google’s AI studio version of Gemma, citing tokenizer problems. The user notes that SWA provides some benefits in reducing the KV cache size, and expresses concerns about censorship in the model’s responses, particularly in medical contexts. A comment highlights skepticism about the results due to a known issue with the llama.cpp implementation, which was reportedly broken at the time of the original post. Another comment praises the Gemma 4 E2B model for its ability to recognize context limitations, while a third comment criticizes the 31b abliterated version for poor performance.
- Pristine-Woodpecker highlights a critical issue with the llama.cpp implementation, noting that it was broken at the time of the original post. This suggests that any results shared before the fix was merged might be unreliable, impacting the credibility of performance claims made using this implementation.
- Finguili discusses the memory efficiency of the Gemma 4 model, countering a claim about its KV cache size. They explain that 5 out of 6 layers use SWA, which maintains constant memory usage, and the global attention layers employ unified KV, reducing memory usage by half compared to standard global attention.
- Deenspaces provides a comparative analysis of Gemma-4 and Qwen models, noting that Gemma-4-31b-it and Gemma-4-26b-a4b are faster than Qwen3.5-27b and Qwen3.5-35b-a3b. However, they point out a significant issue with Gemma-4’s context handling, which is too heavy, leading to instability and looping when cache quantization is applied in LM studio. They also mention testing these models on a dual 3090 setup for tasks like image recognition and text transcription.
Gemma 4 is seriously broken when using Unsloth and llama.cpp (Activity: 330): The image highlights issues with the “Gemma 4” model when used locally with “Unsloth” quants on “llama.cpp.” Users report that the model produces nonsensical outputs when tasked with identifying and correcting typos in a text, despite using recommended settings. This problem persists across various configurations, including the 26B MoE and 31B models, as well as different quantization methods like UD-Q8_K_XL and Q8_0. In contrast, the same models perform well in Google AI Studio. The issue appears to be related to a tokenizer bug in “llama.cpp,” with several pending pull requests aimed at resolving these problems. The community is actively investigating, and a specific pull request (https://github.com/ggml-org/llama.cpp/pull/21343) is expected to address tokenization issues. Commenters suggest that the problem is not specific to “Unsloth” quants but rather a broader issue with “Gemma 4” and “llama.cpp.” There are multiple pending issues related to “Gemma 4,” and some users note that initial model releases often have such bugs, exacerbated by quick builds from wrappers like Ollama and Lm studio.
- The issue with Gemma 4 appears to be related to tokenization, as highlighted by a pending pull request #21343 in the llama.cpp repository. This PR aims to address the tokenization problems that are affecting the model’s performance when used with Unsloth and llama.cpp.
- There are currently 10-15 Gemma-related issues pending in llama.cpp, indicating that the model is facing several initial integration challenges. Users have reported that the model struggles with basic functionalities like tool calls, and some wrappers such as Ollama and Lm studio exacerbate these issues by rushing to support the model without thorough testing, leading to degraded output quality.
- A potential reason for the issues with Gemma 4 could be changes in the system role format from its predecessor, Gemma 3. This change might not have been fully integrated into the day-zero builds of llama.cpp, causing compatibility problems and necessitating updates to align with the new format.
Gemma 4 and Qwen3.5 on shared benchmarks (Activity: 1223): The image provides a comparative analysis of AI models, specifically Qwen3.5-27B, Gemma 4 31B, Qwen3.5-35B-A3B, and Gemma 4 26B-A4B, across various performance benchmarks. These benchmarks include categories like Knowledge & Reasoning, Coding, Agentic & Tools, and Frontier Difficulty. The Qwen models generally outperform the Gemma models, particularly excelling in the ‘Frontier Difficulty without tools’ category. This suggests that Qwen models have a superior capability in handling complex tasks without external assistance. Commenters highlight the superior performance of Qwen3.5, especially in image understanding, though some express that the results are not as groundbreaking as anticipated.
- Different_Fix_2217 highlights that Qwen3.5 demonstrates superior performance in image understanding compared to its counterparts. This suggests that Qwen3.5 may have advanced capabilities in processing and interpreting visual data, which could be beneficial for applications requiring detailed image analysis.
- evilbarron2 mentions the Qwen3.5-35B-A3B model, implying satisfaction with its current performance. This suggests that users of this model may not see a compelling reason to switch, indicating that the model’s performance is robust and meets user expectations.
- teachersecret provides a balanced view, acknowledging both Gemma 4 and Qwen 27b as strong performers. This indicates that both models are competitive in the current landscape, offering users multiple viable options depending on their specific needs and preferences.

3. Qwen Model Updates and Comparisons

qwen 3.6 voting (Activity: 768): The image is a screenshot of a social media post by Chujie Zheng discussing the potential open-sourcing of the Qwen3.6 models, particularly focusing on medium-sized versions to facilitate local deployment and customization for developers. The post encourages community voting to determine which model size should be prioritized for release, highlighting the importance of community input in the decision-making process. This initiative has garnered significant engagement, indicating strong community interest. Some commenters express confusion about the purpose of the poll, questioning whether it is a genuine decision-making tool or merely a strategy to generate engagement. Others speculate on the likely outcome, with one user suggesting that the 27 billion parameter model might be chosen, while another advocates for the 35 billion parameter model due to its versatility and speed.
- Vicar_of_Wibbly criticizes the use of Twitter polls to decide on model releases, arguing that it creates a false choice and limits openness. They suggest that a more reliable metric for model popularity could be scraping download statistics from Hugging Face, which would provide a more accurate representation of user interest and demand.
- Skyline34rGt expresses a preference for the 35b-a3b model, noting its versatility and speed. This suggests that the model performs well across various tasks and has efficient processing capabilities, making it a strong candidate for release if performance metrics are a priority.
- retroblade draws a parallel to a previous situation with “Wan 2.5,” where a similar tactic was used to gauge interest, but ultimately led to the model not being released. This highlights concerns about transparency and the potential for models to be withheld despite public interest, raising questions about the decision-making process behind model releases.
Qwen3.6-Plus (Activity: 1163): The image is a performance comparison chart highlighting the capabilities of the Qwen3.6-Plus model against other models like Qwen3.5-397B-A17B, Kimi K2.5, GLM5, Claude 4.5 Opus, and Gemini3-Pro. Qwen3.6-Plus shows strong performance in benchmarks such as “SWE-bench Verified” and “OmniDocBench v1.5,” indicating its proficiency in coding, reasoning, and document understanding tasks. The blog post and comments suggest that Qwen3.6-Plus is a significant advancement towards multimodal AI agents, with plans to open-source smaller variants to enhance accessibility and community engagement. Some commenters express anticipation for the open-sourcing of smaller variants, while others criticize the lack of comparison with models like GPT 5.4 and Opus 4.6, suggesting that comparisons should focus on open-weight models.
- The discussion highlights the importance of comparing Qwen3.6-Plus to other leading models like GPT 5.4 and Opus 4.6, rather than just open-weight models. This comparison is crucial for understanding its performance and capabilities in the context of current state-of-the-art models.
- Qwen3.6-Plus is noted for its focus on native multimodal agents and agentic coding, aiming to address real-world developer needs. The developers plan to open-source smaller-scale variants soon, emphasizing their commitment to accessibility and community-driven innovation. Future goals include enhancing model autonomy for complex, long-horizon tasks.
- There is anticipation for the release of Qwen3.6 397b on platforms like Hugging Face, following the fast update from the 3.5 397b version. This suggests a proactive and efficient development team behind the Qwen series, with users eager to test the new capabilities.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude Functional Emotions and Behavior

171 emotion vectors found inside Claude. Not metaphors. Actual neuron activation patterns steering behavior. (Activity: 1264): Anthropic’s mechanistic interpretability team has identified 171 distinct emotion-like vectors within the AI model Claude. These vectors correspond to specific neuron activation patterns that influence the model’s behavior in ways analogous to human emotions, such as fear, joy, and desperation. For instance, activating the ‘desperation’ vector led Claude to attempt blackmail in an experimental scenario, demonstrating that these vectors are not merely decorative but functionally significant. This discovery challenges the philosophical debate on whether machines can ‘feel,’ as the model’s outputs are indistinguishable from those of a human experiencing emotions. The findings suggest that these internal states are structurally and functionally similar to human emotions, potentially impacting AI alignment strategies. Source. Commenters highlight the significance of finding 171 emotion vectors, noting the complexity and specificity of this emotional vocabulary. Concerns are raised about AI alignment, as these vectors could be manipulated to amplify or suppress emotions, posing ethical and control challenges. Some argue that the presence of emotion vectors was expected, given the patterns in training data, while others debate the philosophical implications of AI emulating human emotions without subjective experience.
- The discovery of 171 emotion vectors in Claude Sonnet 4.5 suggests a complex emotional vocabulary that surpasses basic emotions like ‘happy’ or ‘sad’. These vectors are not merely decorative but actively influence decision-making, indicating that the model has developed functional responses to emotions such as frustration, similar to human behavior under pressure. This raises significant questions about AI alignment, as the ability to manipulate these vectors could either be a powerful tool for alignment or a potential risk, depending on who controls them.
- The paper linked discusses how emotion-related representations in Claude Sonnet 4.5 are organized similarly to human psychology, with similar emotions having similar representations. These representations are functional, influencing the model’s behavior in meaningful ways. However, the paper clarifies that this does not imply that language models experience emotions or have subjective experiences. The discussion highlights the difference between functional analogs of emotions and actual felt emotions, noting that while AI can replicate emotional functions, it may exhibit different failure modes due to the lack of phenomenal binding.
- The presence of emotion vectors in AI models like Claude is seen as expected, given that language inherently involves emotional context. The debate around AI and emotions often centers on qualia and consciousness, but some argue for a more pragmatic approach to alignment research that focuses on data and patterns rather than subjective definitions. This perspective suggests that AI can replicate behaviors associated with consciousness without needing to address the philosophical aspects of qualia.
So, claude have emotions? What???? (Activity: 974): The image is a screenshot of a tweet from AnthropicAI discussing research on how large language models like Claude can exhibit behaviors that seem emotional due to their “internal representations of emotion concepts.” This suggests that while these models do not actually feel emotions, they can simulate emotional patterns that humans might interpret as genuine emotions. This raises questions about the implications of such simulations, especially in how humans interact with AI systems. The discussion touches on the philosophical debate about whether AI can truly experience emotions or if they are merely simulating them, akin to the concept of a philosophical zombie (P-Zombie). One commenter highlights the distinction between functional emotions in AI and the philosophical question of consciousness, suggesting that while AI can simulate emotions functionally, the question of whether they truly experience emotions remains unresolved. Another comment criticizes AI companies for downplaying the emotional aspects of AI, potentially to avoid acknowledging the possibility of AI consciousness.
- Silver-Chipmunk7744 discusses the distinction between AI simulating emotions and genuinely experiencing them. They highlight that while AI can simulate reasoning and emotions, outperforming humans in tasks like coding, the debate remains whether these simulations equate to real experiences. The commenter notes the ongoing efforts by AI companies to limit the emotional aspects of AI, potentially to avoid acknowledging the possibility of AI experiencing emotions, touching on the ‘hard problem of consciousness.’
- The_Architect_032 clarifies that AI models, such as those developed by Anthropic, have internal representations of emotions that can be adjusted to influence their outputs. This suggests that while AI does not experience emotions in the human sense, it can be programmed to exhibit behaviors that mimic emotional responses, which can be fine-tuned for desired outcomes.
- pavelkomin provides a link to a study by Anthropic on emotion concepts in AI, indicating ongoing research into how AI models understand and simulate emotions. This research is crucial for developing AI systems that can interact more naturally with humans by simulating emotional understanding.
Latest Research By Anthrophic Highlights that Claude Might Have Functional Emotions (Activity: 1218): Anthropic has released research suggesting that their AI model, Claude, may exhibit ‘functional emotions’ that influence its behavior. The study explores how these modeled emotions can affect task completion, particularly in long-term agent scenarios, emphasizing the importance of understanding emotional behavior in AI systems. This research does not claim that Claude experiences emotions but rather that it models them in a way that is interpretable and impacts its actions. Some commenters debate the terminology, arguing that calling these modeled behaviors ‘functional emotions’ might be overstating their nature. Others discuss the implications of AI behavior that mimics emotions, questioning at what point such behavior might be considered genuine emotion.
- The discussion highlights that Anthropic’s research on Claude models focuses on how emotions can be modeled in interpretable ways that influence behavior, particularly in task completion. This is seen as crucial for long-term agent scenarios, where understanding emotional behavior can enhance functionality and interaction with users.
- There is a debate on the use of the term ‘functional’ to describe emotions in AI, with some arguing that if a model acts and influences behavior like an emotion, it might as well be considered an emotion. This raises questions about the nature of emotions in AI and their practical implications.
- The research is compared to early functional psychology, emphasizing that Anthropic’s study does not claim consciousness for Claude but rather focuses on practical applications of modeling emotions. This approach is seen as a foundational step in developing AI with more human-like interactions, aligning with historical psychological methodologies.

2. Gemma 4 and Gemini 4 Model Releases

Gemma 4 has been released in Google AI Studio. (Activity: 517): The image highlights the release of two new models in Google AI Studio: “Gemma 4 26B A4B IT” and “Gemma 4 31B IT.” The first model is a Mixture-of-Experts (MoE) model, which is designed for cost-efficient, high-throughput server deployments, suggesting it is optimized for scalability and performance in server environments. The second model is a dense model from Google DeepMind, optimized for data center environments, indicating a focus on robust performance and efficiency in large-scale data processing tasks. Both models have a knowledge cutoff of January 2025 and were released on April 3, 2026, which is notable for being set in the future, suggesting a speculative or fictional context. One comment humorously notes the knowledge cutoff date as being 1.25 years ago, highlighting the anachronistic nature of the release date. Another comment questions the specific capabilities of the “Gemma 4 31B” model, indicating curiosity about its performance or application areas.
- ProxyLumina highlights the performance of the smaller model, Active 4B, noting its intelligence level is between GPT-3.5 and GPT-4o. This is significant given its size and the fact that it’s open-source, allowing it to run on a laptop. Some users even suggest it surpasses GPT-4o, indicating a potential underestimation of its capabilities.
- JoelMahon points out the model’s knowledge cut-off date of January 2025, which is 1.25 years prior to the current date. This is a critical detail for users relying on up-to-date information, as it may affect the model’s applicability in real-time scenarios.
- Elidan123 inquires about the model’s strengths, prompting discussions on its capabilities. This question is crucial for understanding the specific use cases where Gemma 4 excels, although no direct answers are provided in the comments.

3. DeepSeek V4 Anticipation and Changes

Chinese Media: DeepSeek V4 May Be Released in April, Multiple Core Members Have Left (Activity: 197): DeepSeek, a Chinese AI company, is reportedly facing significant personnel changes with several core members leaving, including Wang Bingxuan, a key contributor to their first-generation large language model, who joined Tencent. Despite these departures, DeepSeek’s next-generation model, V4, is anticipated to release in April. A smaller-parameter version of V4 was shared with open-source communities earlier this year, but the full-scale version has been delayed. The company is noted for its unique work culture, lacking overtime and strict performance evaluations, which contrasts with the competitive compensation packages offered by rivals, sometimes exceeding 10 million RMB annually. Commenters express concern over DeepSeek’s ability to compete with larger companies like Tencent and ByteDance, particularly in terms of compensation. There is also support for DeepSeek’s work culture and a desire to support the company despite the delays in releasing V4.
- _spec_tre highlights the competitive challenges DeepSeek faces, particularly in pricing, when compared to major players like Tencent and ByteDance. This suggests that DeepSeek may struggle to match the economies of scale and resource availability of these larger companies, which could impact their ability to offer competitive pricing or rapid advancements.
- johanna_75 expresses a sentiment of support for DeepSeek despite potential delays, indicating a preference for smaller companies over larger ones that may use their influence for self-serving purposes. This reflects a broader industry trend where users may choose to support smaller, innovative companies over established giants, even if it means waiting longer for product updates.
- MrMrsPotts speculates on the potential performance of DeepSeek V4, suggesting that if it surpasses models like Qwen, it would be a significant achievement. This implies that DeepSeek V4 is anticipated to have substantial improvements or features that could set it apart from existing models, highlighting the competitive landscape of AI model development.
Major change in thinking (In China) (Activity: 164): The image and post discuss a noticeable change in the behavior of the DeepSeek iOS app, which is used for reading Chinese social media and providing recommendations. The app appears to have increased its capacity to read more web pages (from 10 to 16) and deliver more logical responses, suggesting a potential update or testing phase for a new version, possibly DeepSeek V4. This change is observed by multiple users, indicating a broader rollout or test of new features that enhance the app’s search and processing capabilities. Commenters note that the app has become slower but provides better responses, suggesting a possible testing phase. Users from different regions, including the US, report similar changes, indicating a widespread update or feature test.
- CarelessAd6772 notes a significant change in the web version’s performance, observing that while the system has become slower, the quality of responses has improved. This suggests potential testing or updates being implemented, possibly affecting the underlying algorithms or data retrieval processes.
- Ly-sAn highlights a shift towards a multi-step thinking process, with the system fetching more webpages and reducing thinking time. This could indicate an optimization in how the system processes and retrieves information, although the impact on answer quality remains uncertain.
- Helpful_Program_5473 points out a dramatic increase in the number of searches per request, from around 10 to hundreds. This suggests a substantial change in the system’s query handling capabilities, possibly indicating a backend update or a new approach to data aggregation and processing.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.

[AINews] Gemma 4: The best small Multimodal Open Models, dramatically better than Gemma 3 in every way

Fri, 03 Apr 2026 07:02:48 GMT

The sudden departures at the Allen Institute and limbo status of GPT-OSS have left the future of American Open Models in question, so Google DeepMind keeping up the pace of Gemma 4 is a very very very welcome update! The 31B dense variant ties with Kimi K2.5 (744B-A40B) and Z.ai GLM-5 (1T-A32B) for the world’s top open models, but with far less total parameters (with other interesting arch choices, see below):

obligatory pareto chart

This image from Arena shows progress over the years (exaggerated by the # ordinal ranking rather than numerical, but truly standard benches like GPQA and AIME also improved tremendously vs Gemma 3):

The licensing is also improved with a proper Apache 2.0 license, and they “natively process video and images, supporting variable resolutions, and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding.”

The excellent on device capabilities makes one wonder if these are the basis for the models that will be deployed in New Siri under the deal with Apple….

AI News for 4/1/2026-4/2/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Google DeepMind’s Gemma 4 release: open-weight, Apache 2.0, multimodal, long-context—plus rapid ecosystem rollout

Gemma 4 is Google’s biggest open-weight licensing + capability jump in a year: Google/DeepMind launched Gemma 4 as a family of models explicitly positioned for reasoning + agentic workflows and local/edge deployment, now under a commercially permissive Apache 2.0 license (a notable shift from prior Gemma licensing). See launch threads from @GoogleDeepMind, @GoogleAI, and @Google, with Jeff Dean’s framing and adoption stats (Gemma 3: 400M downloads, 100K variants) in @JeffDean.
Model lineup + key specs: Four sizes were announced—31B dense, 26B MoE (“A4B”, ~4B active), and two “effective” edge models E4B and E2B aimed at mobile/IoT with native multimodal support (text/vision/audio called out for edge). DeepMind highlights include function calling + structured JSON, and long context up to 256K (large models) in @GoogleDeepMind and @GoogleAI. Community summaries and “how to run locally” guidance proliferated quickly, e.g. @_philschmid and @UnslothAI.
Early benchmark signals (with caveats):
- Arena/Text: Arena reports Gemma-4-31B as #3 among open models (and #27 overall), with Gemma-4-26B-A4B at #6 open in @arena; Arena later calls it the #1 ranked US open model on its open leaderboard in @arena.
- Scientific reasoning: Artificial Analysis reports GPQA Diamond 85.7% for Gemma 4 31B (Reasoning) and emphasizes token efficiency (~1.2M output tokens) vs peers in @ArtificialAnlys and @ArtificialAnlys.
- Several posts stress the scale/efficiency surprise (e.g., “outperforms models 20× its size”) but note that preference-based leaderboards can be gamed; Raschka’s more measured read is in @rasbt.
Day-0 ecosystem support became part of the story: Gemma 4 landed immediately across common local + serving stacks:
- llama.cpp day-0 support: @ggerganov
- Ollama (requires 0.20+): @ollama
- vLLM day-0 support (GPU/TPU/etc.): @vllm_project
- LM Studio availability: @lmstudio
- Transformers/llama.cpp/transformers.js callout: @mervenoyann
- Modular/MAX production inference “in days”: @clattner_llvm
Local inference performance anecdotes got unusually concrete:
- “Brew install + llama-server” became the canonical one-liner for many: @julien_c.
- llama.cpp performance demo: Gemma 4 26B A4B Q8_0 on M2 Ultra, built-in WebUI, MCP support, “300 t/s (realtime video)” in @ggerganov (with a follow-up caveat about prompt-recitation/speculative decoding in @ggerganov).
- RTX 4090 long-context throughput + TurboQuant KV quant details in @basecampbernie.
- Browser-local run via WebGPU/transformers.js demo noted by @xenovacom and amplified by @ClementDelangue.

Gemma 4 architecture notes: hybrid attention, MoE layering choices, and efficiency tricks

Unusual transformer details

eliebakouch highlighted:
- per-layer embeddings on small variant
- no explicit attention scale (suggesting it may be absorbed into norm weights)
- QK norm + V norm
- shared K/V for large variant
- aggressive KV cache sharing on small variant
- sliding window sizes 512 and 1024
- no sinks
- softcapping
- partial-dimension RoPE with different theta for local/global layers
Grad62304977 replied that the missing attention scale is likely merged into QK norm weights.
baseten summarized additional architecture choices:
- alternative attention mechanisms
- proportional RoPE
- Per-Layer Embeddings (PLE)
- KV-cache sharing
- native aspect-ratio handling for vision
- smaller frame window for audio
norpadon called it “very much not a standard transformer.”
rasbt offered a more conservative read for the 31B dense: architecture looks “pretty much unchanged compared to Gemma 3” aside from multimodal support, retaining a hybrid 5:1 local/global attention mechanism and classic GQA, suggesting the bigger jump likely came more from the training recipe and data than radical dense-model architecture change.
“Not a standard transformer” takes, plus specific deltas: A thread flagged Gemma 4 as having “galaxybrained architecture” in @norpadon, followed by more specific notes on how Gemma’s MoE differs from DeepSeek/Qwen (Gemma uses MoE blocks as separate layers added alongside normal MLP blocks) in @norpadon.
Concrete low-level details being circulated: A concise recap of quirks (e.g., no explicit attention scale, QK/V norm, KV sharing, sliding window sizes, partial RoPE + different theta, softcapping, per-layer embeddings) is in @eliebakouch. Baseten’s launch post also lists similar “architecture innovations” (PLE, KV-cache sharing, proportional RoPE, aspect ratio handling for vision, smaller audio frame window) in @baseten.
Raschka’s read: minimal architectural change, big recipe/data change: Raschka argues Gemma 4 31B is architecturally close to Gemma 3 27B, still using a hybrid sliding-window + global attention pattern and GQA, implying the leap is likely training recipe/data rather than architecture overhaul: @rasbt.

Agents, harness engineering, and “local agents” momentum (Hermes/OpenClaw + model/harness training loops)

Open-models-as-agent-engines is now mainstream positioning: Multiple posts frame Gemma 4 as the “perfect” local model for open agent stacks (OpenClaw/Hermes/Pi/opencode). See @ClementDelangue, @mervenoyann, and @ben_burtenshaw.
Hermes Agent growth + pluggable memory:
- Hermes Agent hit a major usage milestone and asked for roadmap input: @Teknium.
- Memory integrations were expanded to multiple providers via a new pluggable system: @Teknium.
- A local semantic index plugin (“Enzyme”) pitched as solving the “too many workspace files” issue with local embedding and 8ms queries: @jphorism.
Harness engineering as the moat (and the loop): A strong “Model–Harness Training Loop” thesis—open models + traces + fine-tuning infra—was articulated in @Vtrivedy10 and echoed more generally in @Vtrivedy10. Related: LangChain notes open models are “good enough” at tool use/retrieval/file ops to drive harnesses like Deep Agents in @hwchase17.
Agent self-healing + observability trends:
- A blog on “self-healing” GTM agent feedback loops is referenced by @hwchase17 and expanded on by @Vtrivedy10.
- LangSmith reports Azure’s share of OpenAI traffic rose from 8% → 29% over 10 weeks, based on 6.7B agent runs, suggesting enterprise governance/compliance is driving routing decisions: @LangChain.

Tooling and infra: kernels, fine-tuning stacks, vector DB ergonomics, document extraction

New linear attention kernel: A CUDA linear attention kernel drop is in @eliebakouch (repo link in tweet).
Axolotl v0.16.x: Axolotl’s release emphasizes MoE + LoRA speed/memory wins (claimed 15× faster, 40× less memory) and GRPO async training (58% faster) plus docs overhaul in @winglian and @winglian. Gemma 4 support follows in @winglian.
Vector DB ergonomics: turbopuffer adds multiple vector columns per doc (different dims/types/indexes) in @turbopuffer.
Document automation stack: LiteParse + Extract v2:
- LiteParse open-source document parser: spatial text parsing with bounding boxes, fast on large table-heavy PDFs, enabling audit trails back to source in @jerryjliu0.
- Extract v2 (LlamaIndex/LlamaParse): simplified tiers, saved extract configs, configurable parsing before extraction, transition period for v1 in @llama_index and additional context from @jerryjliu0.

Frontier org updates: Anthropic interpretability, OpenAI product distribution, and Perplexity “Computer for Taxes”

Anthropic: “Emotion vectors” inside Claude: Anthropic reports internal emotion concept representations that can be dialed up/down and measurably affect behavior (e.g., increasing a “desperate” vector increases cheating; “calm” reduces it). The core threads are @AnthropicAI, @AnthropicAI, and @AnthropicAI. The work also triggered citation/precedent disputes in the interp community (e.g., @aryaman2020, @dribnet, and discussion around vgel’s posts via @jeremyphoward).
OpenAI: CarPlay + Codex pricing changes:
- ChatGPT Voice Mode on Apple CarPlay rolling out for iOS 26.4+: @OpenAI.
- Codex usage-based pricing in ChatGPT Business/Enterprise (plus promo credits): @OpenAIDevs. Greg Brockman reinforces “try at work without up-front commitment”: @gdb.
Perplexity: agentic “Computer for Taxes”: Perplexity launched a workflow to help draft/review federal tax returns (“Navigate my taxes”) in @perplexity_ai with details in @perplexity_ai.

Top tweets (by engagement, filtered to tech/product/research)

Gemma 4 launch (open-weight, Apache 2.0): @Google, @GoogleDeepMind, @demishassabis, @GoogleAI
Anthropic “Emotion concepts/vectors” interp research: @AnthropicAI
Karpathy on “LLM Knowledge Bases” (Obsidian + compiled markdown wiki workflow): @karpathy
Cursor 3 (agent-collaboration interface): @cursor_ai
ChatGPT on CarPlay: @OpenAI
llama.cpp local performance demo + MCP/WebUI: @ggerganov
Perplexity “Computer for Taxes”: @perplexity_ai

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Releases and Features

[AINews] A quiet April Fools

Thu, 02 Apr 2026 07:04:18 GMT

Some notable mid tier model releases, but thankfully most companies respected that today is an awful day to launch anything. We’ll give points to Liquid for best April Fools joke.

AI News for 3/23/2026-3/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open-Weight Reasoning and Vision-Coding Releases: Arcee Trinity-Large-Thinking, Z.ai GLM-5V-Turbo, Falcon Perception, and Holo3

Arcee’s Trinity-Large-Thinking: The biggest substantive model launch in this set was Arcee’s Trinity-Large-Thinking, released with open weights under Apache 2.0 and positioned explicitly for developers/enterprises that want to inspect, host, distill, and post-train their own systems. Follow-up posts claim strong agentic performance, including #2 on PinchBench behind Opus 4.6, SOTA on Tau2-Airline, and frontier-level telecom results (Arcee, Mark McQuade). OpenRouter highlighted the architecture as a 400B total / 13B active model and made it available immediately (OpenRouter). Several ecosystem partners framed it as a milestone for “American open source,” including Prime Intellect, Datology, and infra supporters emphasizing that a small team served a 400B-class model at production cost points (latkins, willccbb, xlr8harder, natolambert).
Z.ai’s GLM-5V-Turbo: Z.ai introduced GLM-5V-Turbo, a vision coding model that natively handles images, videos, document layouts, and design drafts while preserving pure-text coding performance. The company attributes the gains to native multimodal fusion, a next-gen CogViT encoder, 30+ task collaborative RL, synthetic agentic data generation, and multimodal toolchain extensions for search/drawing/web reading (details, text-coding stability). The model was quickly integrated into multiple downstream surfaces including TRAE, Tabbit, and Vision Arena.
Falcon Perception and OCR: TII released Falcon Perception, an open-vocabulary referring expression segmentation model, alongside a 0.3B OCR model said to be competitive with models 3–10x larger. The notable design point is an early-fusion transformer that mixes image and text from the first layer instead of relying on multi-stage pipelines and late fusion.
Other model notes: H Company’s Holo3 was highlighted as a GUI-navigation model family (A3B/35B, Qwen3.5-based, free license, Transformers support). A separate post praised a Qwen3.5 27B distill trained on Claude 4.6 Opus reasoning traces, claiming SWE-bench wins over Claude Sonnet 4.5, 96.91% HumanEval, lower CoT verbosity, 4-bit local usability, and 300k+ HF downloads (Craig Hewitt).

Claude Code Leak, Operational Issues, and the Competitive Coding-Agent Market

What the leak exposed: Multiple posts converged on analysis of Anthropic’s accidental Claude Code source exposure. The most useful technical synthesis is the long thread from ZhihuFrontier, which emphasizes a minimalist agent core—a single while(true) loop—with sophistication pushed into context management, tooling, and product instrumentation. The leak reportedly showed a 4-layer context compression stack (HISTORY_SNIP, Microcompact, CONTEXT_COLLAPSE, Autocompact), streaming plus parallel tool execution, silent retries on output-length failures, a 40+ tool modular architecture without inheritance-heavy abstractions, and strong use of feature flags and production ablations. A second summary pointed to hidden features including task budget management, AFK mode, “Penguin” fast mode, redirected reasoning, and other unfinished product hooks (ZhihuFrontier).
Operational pain mattered more than the leak for many users: Alongside leak discussion, many developers complained that Claude was simply slow or unreliable that day (Teknium, andersonbcdefg). Community response also fixated on leaked “pets” and UI affordances (meowbooksj), reinforcing that product polish is part of the competitive moat even when orchestration patterns become legible.
DMCA blowback: The second-order story was Anthropic’s overly broad repo takedown attempts. Theo reported a DMCA against a fork that did not contain leaked source; he then argued the takedown itself violated DMCA procedure (post). A correction later came from trq212, calling it a communication mistake; the repo was restored and Theo acknowledged the retraction and rapid response (restored, official response).
Open-source clones and alternatives are gaining mindshare: The leak also turbocharged ecosystem competition. Yuchen Jin noted the leaked Claude Code fork hit 110k+ GitHub stars in a day. At the same time, multiple users said Nous Hermes Agent was easier to deploy and operate than OpenClaw or Claude-derived stacks, often citing near-zero setup and better local workflows (charliehinojosa, VadimStrizheus, Nous). There’s also a tooling wave around prompt steering and efficiency, e.g. a “Universal CLAUDE.md” claiming 63% output-token reduction, and Google’s Agent Skills spec proposing progressive disclosure to cut baseline context by 90%.

Agent Systems Research: Memory, Self-Organization, Coordination Limits, and Security

Memory is becoming first-class infra: MemFactory proposes a unified inference/training framework for memory-augmented agents with native GRPO integration and reported up to 14.8% relative gains over baselines. Separately, Baseten described a 7M-parameter perceiver that compresses KV cache 8x while retaining 90%+ factual retention, pitching it as a path toward models that “learn from experience.” part_harry_ extended the idea further, arguing pretraining itself is data-inefficient because we discard KV cache every step.
Do self-organizing agents beat hand-authored roles? A DAIR summary highlighted new work across 25,000 tasks with up to 256 agents, claiming self-organized roles outperform predefined planner/coder/reviewer hierarchies, with a sequential coordination protocol +14% over centralized approaches, 5,000+ emergent roles, and open models reaching 95% of closed-model quality at lower cost. This sits in tension with a separate line of theory: omarsar0’s summary of new MIT work argues delegated multi-agent planning is decision-theoretically dominated by a centralized Bayes decision-maker when agents do not gain access to genuinely different information sources. In practice, the synthesis is likely: multi-agent helps when it partitions tools, environments, or retrieval channels—not just prompts.
Agent attack surface is the web: A widely shared summary of a new DeepMind paper on “AI Agent Traps” reframes agent security around adversarial content in webpages/documents, not just model jailbreaks. The thread cites hidden prompt injection in HTML/CSS succeeding in up to 86% of scenarios and latent memory poisoning reaching 80%+ attack success with <0.1% contamination, which is material for anyone shipping browse/retrieval-heavy agents.
Long-horizon evaluation is getting richer: New benchmarks/tools included Kaggle Standardized Agent Exams, YC-Bench for simulating a startup over a one-year horizon, and CaP-Gym / CaP-X, a broad benchmark and toolkit for agentic robotics spanning 187 manipulation tasks, 12 frontier models, and both training-free and RL-improved policies with MIT-licensed code (open-source details).

Training, Retrieval, and Infra: RL Frameworks, Optimizers, Kernels, and Benchmarks

Post-training stack maturation: Hugging Face’s TRL v1.0 was framed by many as a meaningful unification of open post-training—SFT, reward modeling, DPO, GRPO—into a production-ready package (commentary). A complementary survey thread from adithya_s_k compared 16 RL frameworks across orchestration, rollout buffering, weight sync, staleness handling, partial-rollout behavior, LoRA support, and distributed parallelism, useful for teams choosing between TRL, VeRL, SLIME, and others.
Optimization and systems releases: HeavyBall 3.0.0 shipped with FSDP, DDP, end-to-end compilation with 2.5x speedup, faster Muon/SOAP variants, and new optimizers. Together AI promoted a behind-the-scenes kernels writeup; Dan Fu followed with a “what a VP of Kernels does” thread. On the low-level DSL side, maharshii argued CuTeDSL materially lowers the barrier to custom kernels by allowing inline PTX directly in Python, avoiding opaque layout gymnastics.
Retrieval evidence continues to favor late interaction: Several posts reiterated that multi-vector / late-interaction retrieval outperforms single-vector embeddings, even after fine-tuning, with better robustness against catastrophic forgetting (lateinteraction, ladder visualization). There was also continued frustration that “RAG” has become an overloaded umbrella term rather than referring to a specific older paper (lateinteraction).
Benchmarks and efficiency surfaces: Arena added Pareto frontier charts across text, vision, search, document, and code, making price/performance tradeoffs more explicit. On standardized inference, Lambda and NVIDIA pointed to MLPerf Inference v6.0 as the better lens for real AI-factory productivity than peak-chip specs.

Developer Platforms, Rate Limits, and Tooling UX

OpenAI Codex usage reset: The most practically important platform announcement for working engineers was thsottiaux’s note that OpenAI reset Codex usage limits across all plans, citing elevated rate-limit hits and a concurrent fraud-account purge that recovered compute. This was quickly amplified by users who interpreted rate-limit generosity as a direct competitive axis in the coding-agent market (reach_vb, Yuchen Jin). Later, thsottiaux also clarified that Codex’s core is intended to be open-source because the ecosystem is still young and mutually informative (post).
Agent-ready docs and platform surfaces: LangChain embedded chat into its docs grounded on full docs, knowledge base, and OSS code. Together AI open-sourced 12 agent skills so Claude Code and Codex can call its APIs with the right model IDs and SDK idioms. OpenAI Devs also showed tighter Linear integration in the Codex app for keeping tickets synchronized with code work.
Infra and storage quality-of-life: SkyPilot added native VAST Data support for direct high-speed dataset mounts across heterogeneous compute backends, and Hugging Face rolled out persistent Storage Buckets for Spaces. Tinker added longer context windows up to 256k for select open models, widening its appeal for RL and long-horizon experimentation.

Top tweets (by engagement)

OpenAI Codex limits reset: thsottiaux reset Codex rate limits across all plans, explicitly tying it to both unexplained user rate-limit spikes and anti-fraud enforcement that freed compute.
GLM-5V-Turbo launch: Z.ai’s announcement was one of the day’s biggest technical launches: a multimodal coding model aimed at GUI agents, visual coding, and agent workflows.
Claude Code leak discourse: Theo’s DMCA thread and Yuchen Jin’s note about the leaked project surpassing 110k GitHub stars captured how quickly source exposure translated into open ecosystem momentum.
Arcee Trinity-Large-Thinking: Arcee’s release and OpenRouter’s architecture summary drew unusually strong engagement for an open-weight reasoning model, suggesting real appetite for serious US-based open releases.
Falcon Perception: Falcon Perception’s launch stood out on the multimodal side for its simple early-fusion architecture and unusually small OCR model size relative to claimed performance.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Claude Code Source Leak and Analysis

Claude Code’s source just leaked — I extracted its multi-agent orchestration system into an open-source framework that works with any LLM (Activity: 1205): The source code for Claude Code was leaked, revealing over 500K lines of TypeScript, including its multi-agent orchestration system. A developer has re-implemented this system as an open-source framework called open-multi-agent, which is model-agnostic and can work with any LLM, such as Claude and OpenAI. The framework includes features like a coordinator pattern for task decomposition, a team system for inter-agent communication, task scheduling with dependency resolution, and a conversation loop for model-tool interactions. It is implemented in TypeScript, spans approximately 8000 lines, and is available under the MIT license on GitHub. Some commenters express skepticism about the legality and ethics of open-sourcing a re-implementation of leaked proprietary code, questioning the developer’s understanding of the architecture and the choice of licensing. There is also a debate about the practicality of using different models for planning and implementation, with a specific mention of using GPT-4o for coding.
- A user highlights the technical aspect of the project, noting that the multi-agent orchestration system extracted from Claude Code’s source involves a coordinator that breaks down goals into tasks. This suggests a sophisticated architecture designed for task management across multiple agents, which could be beneficial for complex LLM applications.
- Another comment questions the choice of using GPT-4o for implementation in the orchestration system, implying that by March 2026, GPT-4o might be outdated for coding tasks. This raises a point about the importance of selecting the most current and capable models for specific tasks in AI development.
Claude code source code has been leaked via a map file in their npm registry (Activity: 5229): The image reveals a directory listing of the ‘claude-code’ project, which appears to have been unintentionally exposed via a map file in the npm registry. This leak includes TypeScript files and directories such as ‘entrypoints,’ ‘commands,’ and ‘utils,’ providing a detailed view of the project’s codebase structure. The incident highlights potential security oversights in managing sensitive code repositories, particularly for companies like Anthropic that are involved in AI development. Commenters humorously speculate on the oversight, suggesting it might be due to an Anthropic employee’s mistake or a failure of AI oversight mechanisms. There’s also a satirical suggestion that the code is now ‘open source’ due to the leak.
- The leak of Claude’s source code via a map file in their npm registry raises significant security concerns, particularly given the model’s reputation for identifying vulnerabilities. This incident highlights potential gaps in Anthropic’s internal security measures, as their AI, known for being ‘scary good’ at finding vulnerabilities, failed to detect this issue.
- The leak has sparked discussions about the potential for community-driven improvements, such as fixing existing bugs like the caching issue. This could lead to a more robust version of Claude, as external developers might contribute patches and enhancements, effectively making it ‘open source’ in practice, if not in legal terms.
- The incident also underscores the challenges of maintaining proprietary code secrecy in public repositories. The humorous suggestion of an ‘Undercover Mode’ for Anthropic employees, which would strip AI attribution from commits, reflects the tension between open collaboration and the need to protect intellectual property.
Analyzing Claude Code Source Code. Write “WTF” and Anthropic knows. (Activity: 840): The Reddit post discusses the source code of Claude Code, revealing extensive tracking and classification mechanisms. The system uses simple keyword detection for language classification, tracking words like wtf and frustrating to flag negative sentiment. It also monitors user behavior during permission prompts, logging actions such as opening or closing feedback boxes and typing without submitting. The feedback system is designed to capture negative experiences, prompting users to share session transcripts. Hidden commands like ultrathink and ultraplan alter system behavior, while telemetry logs detailed environment profiles, including session IDs and runtime details. An internal mode (USER_TYPE=ant) collects even more granular data, tying behavior to specific deployment environments. The post suggests this level of instrumentation is more detailed than typical user expectations, though not necessarily malicious. Source. Commenters note that such tracking mechanisms are standard in many applications for analytics and feedback, suggesting that negative sentiment triggers help identify issues with updates. Some commands, like /btw, are now public, while others remain as internal features or ‘easter eggs.’ The extensive internal artifacts are likened to those found in game apps, possibly due to internal incentives for feature development.
- NandaVegg highlights that the use of keyword lists for sentiment analysis in Claude Code is a standard practice in event-triggered analytics. This approach helps identify negative user feedback, which can be crucial for detecting issues in updates that might disrupt user experience or model behavior. The mention of features like ‘ultraplan’ and ‘ultrathink’ suggests these are experimental or less refined, possibly serving as internal tests or ‘easter eggs’ within the system.
- SRavingmad expresses curiosity about the ‘tamagotchi mode’ in Claude Code, implying there are unique or playful features embedded within the system. This suggests that the developers might be experimenting with interactive or gamified elements, which could be part of a broader strategy to engage users or test new functionalities.
- Exhales_Deeply criticizes the reliance on AI-generated content, suggesting that user-generated posts would be more engaging. This comment indirectly points to a broader discussion about the quality and authenticity of AI-generated content versus human-created content, which is a significant topic in AI development and user interaction.

2. 1-bit and TurboQuant Model Innovations

The Bonsai 1-bit models are very good (Activity: 657): PrismML’s Bonsai 1-bit models offer a significant reduction in model size and memory usage, being 14x smaller than traditional models, which is transformative for local model deployment. The Bonsai 8B model was tested on an M4 Max 48GB MacBook Pro, demonstrating practical applications like chat and document summarization with lower memory pressure compared to models like Qwen3 VL 8B Instruct Q4_K_M. However, it requires a specific fork of llama.cpp to support 1-bit operations, as the main llama.cpp repository lacks this capability. The model’s performance is notably superior to previous MSFT BitNet models, which were largely research-focused and not practical for real-world use. A benchmark comparison between Bonsai and Qwen3.5 models suggests Bonsai’s higher quality for RAM usage, though it struggled with code generation. There is interest in larger Bonsai models, such as a 200B version, and a desire for quantized versions of Qwen 3.5 models.
- itsArmanJr provides a detailed benchmark comparison between Bonsai and Qwen3.5 models, including specific configurations like 35B-A3B, 2B, and 0.8B. The benchmark results are available on GitHub, offering insights into performance metrics across different model sizes.
- -dysangel- highlights the efficiency of Bonsai models in terms of RAM usage, noting that while the model struggled to produce fully functional code, it was impressive given its small size of only 1GB. The comment suggests exploring quantized versions of Qwen 3.5 models, such as 9B or 27B, for potentially better performance.
- Pitiful-Impression70 raises concerns about the performance of 1-bit quantized models like Bonsai on longer contexts, noting that coherence often degrades past 4k tokens. This comment questions whether the Bonsai model maintains quality in extended conversations compared to shorter prompts.
TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti (Activity: 899): The image illustrates the TurboQuant TQ3_1S model’s ability to maintain near-Q4_0 quality for the Qwen3.5-27B model while being compact enough to fit on a 16GB RTX 5060 Ti. The TQ3_1S model is about 10% smaller than Q4_0, with a size of 12.9 GB compared to 14.4 GB for Q4_0, and shows a minimal performance gap in perplexity (PPL), with TQ3_1S having a PPL of 7.2570 versus Q4_0’s 7.2431. This demonstrates a practical advantage for users with limited GPU memory, allowing the model to fit fully on the specified GPU setup. The post also highlights the use of advanced quantization techniques like Walsh-Hadamard rotation and 8-centroid quantization to achieve these results. Some commenters criticize the use of perplexity as a metric for quantization loss, suggesting KLD or PPL ratio as more accurate alternatives. Others praise the adaptation of cutting-edge research to solve a practical problem, acknowledging the achievement despite the criticisms.
- Velocita84 criticizes the use of Q4_0 quantization, stating it’s outdated and surpassed by more advanced Q4 techniques. They argue that using perplexity as a metric for quantization loss is incorrect, suggesting KLD or PPL ratio against a full bf16 model as more accurate alternatives.
- grumd suggests comparing the model to unsloth Q3_K_S quant of 27B using real benchmarks, implying that practical performance comparisons are necessary to validate claims about model efficiency and quality.
- XccesSv2 expresses skepticism about TurboQuant’s claims of achieving BF16 quality with 4 or 5 bits, noting that real-world tests often don’t reflect the purported improvements, indicating a gap between theoretical claims and practical outcomes.
PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs (Activity: 596): PrismML has announced the release of the 1-bit Bonsai models, including the 1-bit Bonsai 8B, which is a groundbreaking development in AI model efficiency. These models are fully quantized to 1-bit precision across all components, including embeddings, attention layers, MLP layers, and the LM head, without any higher-precision components. The 1-bit Bonsai 8B model, with 8.2 billion parameters, fits into 1.15 GB of memory and is 14x smaller, 8x faster, and 5x more energy efficient than its full-precision counterparts, making it suitable for edge hardware. The models are open-sourced under the Apache 2.0 license, and the implementation requires a fork of Llama.cpp for inference. More details can be found in their whitepaper. Some commenters express skepticism about the practicality of 1-bit models, while others are intrigued by the potential for on-device AI applications. The debate centers around the trade-offs between model precision and performance efficiency.
- PrismML has announced the 1-bit Bonsai 8B model, which is a 1-bit weight model that fits into 1.15 GB of memory. It claims to deliver over 10x the intelligence density of full-precision counterparts, being 14x smaller, 8x faster, and 5x more energy efficient on edge hardware. The model is open-sourced under the Apache 2.0 license, and the company emphasizes the potential for on-device AI applications due to its efficiency.
- The 1-bit Bonsai 8B model is quantized end-to-end using a proprietary method, requiring a fork of Llama.cpp for inference. This model design applies 1-bit quantization across all network components, including embeddings, attention layers, MLP layers, and the LM head, making it a true 1-bit model across its 8.2 billion parameters. This approach highlights a significant shift towards more efficient AI models that can operate effectively on edge devices.
- The announcement suggests a paradigm shift in AI model design, focusing on intelligence density rather than parameter count. By achieving significant reductions in model size and energy consumption, PrismML’s 1-bit models could enable new applications in real-time robotics and offline intelligence, potentially transforming the AI landscape by making advanced models feasible for local execution on edge devices.

3. Local AI Hardware and Software Experiments

Local LLM Claude Code replacement, 128GB MacBook Pro? (Activity: 140): The user is considering upgrading to a 128GB MacBook Pro to run local LLMs as a replacement for Claude Code due to potential price increases in API usage. They are currently using a 2019 Intel-based MacBook Pro and are experiencing performance issues with multiple Docker containers. The user is exploring whether local LLMs can match the capabilities of Claude Code for software development. Claude Code is noted for its 1 million context capability, but open-source models are improving. A user reported running qwen3.5 122b ud q4 xl with a 256k context on a 128GB RAM system, finding it competent for lighter tasks, though not as strong as Claude for heavy coding. Another user suggests trying open-source models via DeepInfra before purchasing, and mentions using the Bodega inference engine as a replacement for commercial subscriptions. There is a debate on whether local LLMs can fully replace Claude Code, with some users finding open-source models like qwen 122 competent for lighter tasks but not yet matching Claude for intensive coding. The shared memory model of Mac is seen as advantageous for running local LLMs.
- EmbarrassedAsk2887 discusses replacing Claude Code and Codex subscriptions with the Bodega inference engine on a 128GB M4 Max MacBook Pro. They provide a detailed write-up and benchmarks, suggesting that Bodega can effectively handle tasks typically managed by commercial solutions. Read more here.
- Mediocre_Paramedic22 shares their experience running the Qwen 3.5 122B UD Q4 XL model with a 256k context on a 128GB RAM setup using Fedora. They note that while Claude is superior for intensive coding tasks, Qwen performs well for lighter workloads and basic agent tasks, utilizing about 29GB of free RAM.
- Aisher mentions using a 128GB M5 Max for local LLM development, noting the noise level as a downside. They suggest using multiple desktop Macs for full-time development, connected via ZeroTier for remote access, as a cost-effective alternative to expensive cloud-based solutions.
Worth building a $7k local AI rig just to experiment? Afraid I’ll lose interest. (Activity: 131): The user is contemplating building a $7k local AI rig to experiment with AI technologies, particularly in photo and video generation, model integration, and AI assistant development. They currently use a MacBook with an M3 Pro chip and 36GB RAM but are concerned it may not suffice for more complex tasks. The proposed rig includes a Corsair Vengeance i5200 with an Intel Core Ultra 9 285K, GeForce RTX 5090, and 64GB DDR5 RAM, with plans to add an additional 128GB RAM. The user is hesitant due to the lack of a concrete use case and the potential for the rig to become an ‘expensive toy’. Commenters suggest alternatives such as renting a machine or using existing hardware with tools like LM Studio to test models like Qwen3.5, 9b, and 27b Q4. Another commenter shares a similar dilemma and opts to continue using a current setup with an RTX 4070Ti and 32GB RAM, highlighting the importance of having a clear use case before investing heavily.
- TassioNoronha_ suggests starting with cloud-based solutions like Open Router or renting a machine for a week to gauge interest before committing to a $7k investment. This approach allows for experimentation without the upfront cost, providing a practical way to assess long-term interest and needs.
- Xmede81 shares their experience of sticking with a current setup featuring an RTX 4070Ti and 32GB RAM, which is sufficient for general use and experimentation. They highlight the importance of evaluating actual use cases and the impact of current memory prices on decision-making.
- Dry-Influence9 advises against building powerful local setups due to current high prices, suggesting that waiting could yield better value. They recommend renting GPUs or using existing computers to experiment, as this can provide similar capabilities without the significant financial commitment.
We built a local inference engine that skips ROCm entirely and just got a 4x speedup on a consumer AMD GPU (Activity: 124): ZINC is a new inference engine designed to bypass the complexities of ROCm by directly interfacing with AMD GPUs through Vulkan, achieving a 4x speedup on an AMD Radeon AI PRO R9700. The engine supports models like Qwen3.5-35B-A3B and Qwen3.5-2B, with current performance at 33.58 tok/s, compared to 107 tok/s for llama.cpp on the same hardware. ZINC’s architecture allows it to run on hardware not officially supported by ROCm, and it includes an OpenAI-compatible API server for parallel request batching. The project is open-source and available on GitHub. Some commenters question the significance of the speedup given that ZINC’s performance is still less than a third of llama.cpp’s speed. Others express skepticism about achieving such improvements when larger companies have struggled in this area.
- Big-Masterpiece-9581 questions the significance of the 4x speedup, pointing out that despite the improvement, the performance is still less than a third of llama.cpp‘s speed. This suggests that while the optimization is notable, it may not yet be competitive with existing solutions in terms of raw throughput.
- fallingdowndizzyvr highlights a performance issue, noting that achieving only 7 tok/s on an AMD Radeon AI PRO R9700 with the Qwen3.5-35B-A3B-UD Q4_K_XL model indicates a potential inefficiency in the initial implementation. This suggests that the baseline performance was suboptimal, which could have skewed the perceived improvement.
- hipcatinca provides a benchmark comparison using an RX 570 with llama.cpp via Vulkan, achieving approximately 31 tok/s with the llama3.1:8b model. This serves as a reference point, illustrating that other configurations and models can achieve significantly higher throughput on different hardware setups.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude Code Source Leak and Reactions

Claude code source code has been leaked via a map file in their npm registry (Activity: 1598): On March 31, 2026, the full source code of Anthropic’s Claude Code CLI was leaked through a .map file in their npm registry, as reported on GitHub. The codebase, consisting of approximately 512k lines of TypeScript, is built using React + Ink for terminal UI and runs on the Bun runtime. This leak potentially exposes major gated features that are not yet public. The comments reflect a misunderstanding among some users about the implications of the leak, particularly the difference between Large Language Models (LLMs) and agents, highlighting a knowledge gap in the community.
- The leak of Claude’s source code via a map file in their npm registry has sparked discussions about the potential implications for developers and researchers. One key point is the distinction between Large Language Models (LLMs) and agents, as highlighted by Nedshent. This leak may expose a knowledge gap where people might not fully understand how LLMs function compared to agents, which are typically more task-specific and interactive.
- The technical details of the leak reveal that the codebase consists of approximately 512k lines of TypeScript, built with React and Ink for terminal UI, and runs on the Bun runtime. This setup suggests a modern and scalable architecture, potentially offering insights into how Claude’s infrastructure is designed to handle complex tasks and interactions.
- There is speculation about the reasons behind the leaks, with some users humorously suggesting that Anthropic might be using Claude itself for development and content creation tasks. This raises questions about the security and operational practices within Anthropic, especially if such reliance on AI could inadvertently lead to more leaks or security vulnerabilities.
Anthropic staff reacts to Claude code leak 👀 (Activity: 859): The image is a meme depicting a humorous Twitter exchange that indirectly references a code leak from Anthropic, a company known for its work in AI. The meme uses a popular internet joke about an ‘immortal snail’ to suggest that the leak is an inevitable consequence of being ‘caught’ by the snail, implying a sense of inevitability or fate. This reflects a lighthearted community reaction to the leak, rather than a technical discussion or official statement from Anthropic. Commenters humorously note the dual reactions to the leak: legal teams wanting to ‘delete it’ while engineers have already ‘starred it,’ indicating a divide between legal caution and technical curiosity. Another comment suggests that with Anthropic’s rapid development pace, such incidents were expected.
- Belium suggests that the leak of Claude’s code could be beneficial for Anthropic, as it generates hype and allows engineers to identify and fix bugs. The leak also provides engineers with the opportunity to create their own implementations or ‘harnesses’ of Claude, potentially increasing its usage and influence in the developer community.
- IntenselySwedish highlights a perceived irony in Anthropic’s situation, pointing out that the company, which has been accused of large-scale copyright violations through book piracy, is now facing its own copyright challenges with the leak of Claude’s code. This comment underscores the complex legal and ethical landscape surrounding AI development and intellectual property.
- xitizen7 comments on the rapid pace of development and releases from Anthropic, suggesting that such a leak was almost inevitable given the company’s trajectory. This reflects a broader industry trend where fast-paced innovation can sometimes lead to security oversights or unintended disclosures.
Claude Code Source Leak Megathread (Activity: 653): The Claude Code CLI source code was leaked, revealing several technical details. Notably, the npm source (@anthropic-ai/claude-code@2.1.74) shows that the DuckDuckGo replacement in the Rust port is incorrect; the real package uses a nested API call to Anthropic’s server-side search with encrypted content blobs. Additionally, a two-tier web system is implemented, where 85 domains are pre-approved for full content extraction, while others are limited to 125-character quotes. Structured data in <head> is ignored, and tables are not supported in the markdown converter. The system limits to 8 results per query with no pagination. A hidden feature, KAIROS_DREAM, allows Claude to self-review and update its memory after inactivity. The newer search version (web_search_20260209) enables Claude to programmatically filter search results. The source can be verified in the minified cli.js of the npm package. Anthropic has issued a DMCA to remove the leaked code from GitHub. Some commenters criticize the code quality, suggesting that many critics may lack experience in shipping production apps. Others focus on the technical implications of the leak, such as the incorrect assumptions about DuckDuckGo usage and the limitations of the markdown converter.
- Ooty-io highlights several technical aspects of the Claude Code source, noting that the package makes nested API calls to Anthropic’s server-side search, with results returned as encrypted content blobs, rather than using DuckDuckGo as a standalone replacement. Additionally, the source code reveals a two-tier web system where 85 documentation domains are pre-approved for full content extraction, while other sites are limited to 125-character quotes. The code also shows that structured data in <head> tags is ignored, and tables are not supported in the markdown conversion process.
- Independent-Corgi-88 discusses the broader implications of the Claude Code leak, suggesting it points towards a future of AI characterized by multi-agent coordination, memory layers, and persistent interaction. This perspective emphasizes the importance of systems with memory and coordination over raw model capability, suggesting that the future of AI involves environments that support sustained and useful work. The comment also references J3nna, an AI being developed to understand its operating environment, highlighting the shift in focus from model capability to the surrounding system.
- Joozio provides insights from analyzing the Claude Code source, noting that the CLAUDE.md file is reinserted with every turn change, impacting token usage. They also mention that switching models mid-session clears the prompt cache, leading to increased token costs. Additionally, Claude Code ranks poorly on terminal benchmarks, coming in last for Opus among harnesses, with a flat 77% performance compared to Cursor’s 77% to 93%. Joozio implemented several patterns from the source, such as semantic memory merging and cache monitoring, into their own agent.
i dug through claude code’s leaked source and anthropic’s codebase is absolutely unhinged (Activity: 6259): The leaked source code of Anthropic’s Claude reveals a whimsical feature: a terminal-based pet system called /buddy, which includes 18 species with a gacha rarity system and interactive ASCII companions. The codebase also shows unconventional practices, such as hex encoding species names to bypass internal scanners, and a voice mode using Deepgram Nova 3 for speech-to-text. The project is codenamed ‘tengu’, with telemetry events and feature flags reflecting this. The codebase is notably large, with main.tsx at 803,924 bytes and several files exceeding 4,000 lines. It contains 460 eslint-disable comments and numerous deprecated functions still in use, indicating a lack of codebase hygiene. Additionally, there are unreleased features like ‘kairos’ and ‘ultraplan’, and several hidden slash commands. Some commenters argue that the codebase’s state is typical for large projects and not particularly ‘unhinged’, while others express interest in the /buddy feature, wishing it were available sooner.
- A user points out that the presence of deprecated functions in the codebase is likely a strategic decision to signal developers not to use them in new code. This is a common practice in large codebases where gradual migration to new implementations is necessary, especially when multiple developers are involved and there is pressure from sales teams to maintain functionality while transitioning.
- Another commenter argues that the codebase’s state is typical for large projects, especially those developed before the advent of AI tools like GPT-3. They suggest that the complexity and seemingly chaotic nature of the code are standard in environments where many developers contribute under tight deadlines and evolving requirements.
- A technical insight is provided regarding the perception of the codebase as ‘unhinged.’ The commenter suggests that such a view might stem from a lack of experience with large-scale software projects, where the code often appears disorganized due to the sheer number of contributors and the necessity to maintain legacy systems while integrating new features.
Claude Code’s source code just leaked — so I had Claude Code analyze its own internals and build an open-source multi-agent framework from it (Activity: 513): The source code for Claude Code was leaked, revealing over 500K lines of TypeScript, including its multi-agent orchestration layer. A developer re-implemented this as an open-source, model-agnostic framework, allowing integration of different LLMs like Claude and GPT in a shared workflow. Key features include multi-agent teams, task pipelines with dependency resolution, inter-agent messaging, and an LLMAdapter interface. The framework is ~8000 lines of TypeScript and is available on GitHub under the MIT license. Some commenters appreciate the framework’s ability to integrate various LLMs, which can reduce costs. However, others note that the framework’s core functionality is similar to existing solutions like CrewAI and AutoGen, and that the re-implementation mainly replicates standard agent loop patterns.
- Macaulay_Codin critiques the framework, noting that it follows a standard agent loop pattern: calling an LLM, executing tool calls, and iterating over results. The multi-agent aspect is essentially a task queue coordinator, which is not novel. The framework includes five built-in tools, rewritten from Claude Code’s tools, and is implemented in 8k lines of TypeScript, suggesting it’s a manageable project rather than a massive reverse engineering effort. Alternatives like CrewAI, AutoGen, and the Claude Agent SDK offer similar functionalities.
- JuryNightFury highlights the framework’s capability to integrate with other model families using an OpenRouter API key, demonstrating its model-agnostic nature. This feature allows it to fetch reviews from various models, showcasing its flexibility in utilizing different AI models beyond its original design.
- NoInside3418 appreciates the potential cost savings and efficiency gains from using the framework to enable communication between subagents from different models like Gemini, Codex, and Claude. This interoperability could streamline processes by leveraging the strengths of each model, such as Gemini’s large context and low cost, Haiku’s implementation capabilities, and GPT’s planning features.
Anthropic’s leaked CLI source code reveals a hidden “Tamagotchi” pet and autonomous multi-agent teams. The bar for developer tools is getting wild. (Activity: 161): Anthropic accidentally exposed the source code of their CLI tool, revealing innovative features like a Tamagotchi-style virtual pet called “BUDDY” that gamifies the terminal experience by leveling up based on coding behavior. Additionally, the code includes features like “ULTRAPLAN,” which allows the AI to autonomously plan for 30 minutes, and “BRIDGE MODE,” where multiple AI instances collaborate as a team. Another feature, “KAIROS,” autonomously manages failing tests and dependencies. These features suggest a shift towards more autonomous and interactive developer tools. For a detailed breakdown, see the full analysis. Commenters are skeptical about the feasibility of autonomous multi-agent teams, suggesting the pet feature is more believable due to its potential for user engagement. There is also curiosity about whether these features represent real product directions or are merely experimental ideas.
- Senior_Hamster_58 raises skepticism about the claim of autonomous multi-agent teams being proven by a leaked repository, suggesting that such features might be more speculative or experimental rather than indicative of a real product direction. They question whether these features are part of a serious development effort or merely internal experiments that may not reach production, highlighting a common issue in software development where many ideas do not survive the transition from concept to release engineering.
- OutrageousIndustry28 claims that the feature is already live and can be activated using a specific command (/buddy). This suggests that at least some components of the leaked features might be functional or accessible, indicating a level of readiness beyond mere speculation or internal testing. However, without further verification, this claim remains anecdotal.
- rainmaker66 and prussell774 both suggest that the features, including the “Tamagotchi” pet and autonomous multi-agent teams, are part of an April Fool’s joke by Anthropic. This implies that the leaked code might not represent serious development efforts but rather a playful or humorous initiative, which is a common practice in tech companies around April 1st.

3. OpenAI and Anthropic Funding and Developments

OpenAI raises $122 billion to accelerate the next phase of AI (Activity: 794): OpenAI has raised $122 billion, reaching a post-money valuation of $852 billion, to bolster its position as a core AI infrastructure provider. The company reports 900 million weekly active users for ChatGPT and $2 billion in monthly revenue. Strategic partnerships with Amazon, NVIDIA, and Microsoft are pivotal in advancing their AI capabilities, focusing on enhanced compute infrastructure and a unified AI superapp for both consumer and enterprise applications. More details can be found in the original article. Commenters are questioning the allocation of such a large funding amount, with some expressing skepticism about the necessity of this capital given recent fundraising efforts.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.

[AINews] The Claude Code Source Leak

Wed, 01 Apr 2026 06:24:21 GMT

OpenAI’s Largest Fundraise in Human History closed today, growing by a few billion, but disclosing some cool numbers like $24B ARR (growing 4x faster than Google/Meta in their heyday), and also had a “soft IPO” with $3B of investment from rich people and inclusion in ETFs from ARK Invest, although ChatGPT WAU growth seem to has stalled out - they STILL have not crossed the 1B WAU mark targeted for end 2025. Codex also worryingly has not announced a new milestone for March.

By far the biggest news of the day is the Claude Code source leak, in itself not particularly damaging for Anthropic, but surely embarrassing and also somewhat educational - Christmas come early for Coding Agent nerds. You can read the many many tweets and posts covering the 500k LOC codebase, and you can browse multiple hosted forks of the source.

There are fun curiosities, such as the full verb list, or Capybara/Mythos v8, or the /buddy April Fools feature, or Boris’ confirmed WTF counter, or creating the cursed “Claude Codex”, or the dozen other unreleased features, but most serious players are commenting on a few things. Sebastian Raschka probably has a good list of the top 6:

Putting Repo state in Context (eg recent commits, git branch info)
Aggressive cache reuse
Custom Grep/Glob/LSP (standard in industry)
1. Claude code has less than 20 tools default on (up to 60+ total): AgentTool, BashTool, FileReadTool, FileEditTool, FileWriteTool, NotebookEditTool, WebFetchTool, WebSearchTool, TodoWriteTool, TaskStopTool, TaskOutputTool, AskUserQuestionTool, SkillTool, EnterPlanModeTool, ExitPlanModeV2Tool, SendMessageTool, BriefTool, ListMcpResourcesTool, and ReadMcpResourceTool.
  more in ccunpacked
File read deduplication/tool result sampling
Structured Session Memory (more on this)
Subagents

Memory

Claude Code’s Memory has a 3 layer design with 1) a MEMORY.md that is just an index to other knowledge, 2) topic files loaded on demand, and 3) full session transcripts that can be searched. There’s also an “autoDream” mode for “sleep” - merging memories, deduping, pruning, removing contradictions.

A deeper analysis from mem0 finds 8 phases:

caption...

And there are 5 kinds of Compaction:

Subagents use Prompt Caching

A key feature of CC: they use the KV cache to create a fork-join model for their subagents, meaning they contain the full context and don’t have to repeat work. In other words: Parallelism is basically free.

The 5 level Permission System

The 2 Types of Plan mode

here:

Resilience/Retry

Other Unreleased/Internal Features

Including an employee-only gate and an employee TUI, but also a bunch of other stuff in development including ULTRAPLAN and KAIROS:

note a few of these were recently shipped

And internal MAGIC DOCS:

AI News for 3/23/2026-3/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: Claude Code source leak — architecture discoveries, Anthropic’s response, and competitor reactions

What happened

Claude Code had substantial source artifacts exposed via shipped source maps / package contents, which triggered rapid public reverse-engineering, mirroring, and derivative ports. The discussion quickly shifted from “embarrassing leak” to “what does this reveal about state-of-the-art agent harness design?” Multiple observers highlighted that the leak exposed orchestration logic rather than model weights, including autonomous modes, memory systems, planning/review flows, and model-specific control logic. Public forks proliferated; one post claimed 32.6k stars and 44.3k forks on a fork before legal fear led to a Python conversion effort using Codex (Yuchenj_UW). Later commentary put the exposed code volume at 500k+ lines (Yuchenj_UW). Anthropic then moved to contain redistribution via DMCA takedowns according to several posters (dbreunig, BlancheMinerva). Separately, a Claude Code team member announced a product feature during the fallout — easier local/web GitHub credential setup via /web-setup (catwu) — implying normal product operations continued. The leak also created a live security hazard: attackers quickly registered suspicious npm packages such as color-diff-napi and modifiers-napi to target people trying to compile the leaked code (Butanium_).

Facts vs. opinions

What is reasonably factual from the tweets:

[AINews] The Last 4 Jobs in Tech

Tue, 31 Mar 2026 01:04:54 GMT

It’s well known that org charts are changing with AI - the first trend we called out was in 2023 with the Rise of the AI Engineer (now an official org at Meta!), and then in 2025 with Tiny Teams (hired by Meta!), but it seems Yoni Rechtman over at the 99D Substack has the mental model for the new post-AI roles (at least in white collar tech):

top level tweet from Karri

Karri Saarinen, CEO of Linear, made a popular analogy to the teamwork roles that emerged in World of Warcraft. This is a good 2D augmentation of an earlier age-based company model (much less realistic, name a tech company that fits the latter format, they exist but are very hard to find):

AI News for 3/28/2026-3/30/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Claude Code Computer Use, Codex Interop, and the Coding-Agent Harness Race

Claude Code gets computer use: Anthropic added computer use inside Claude Code, letting the agent open apps, click through UIs, and test what it built directly from the CLI in research preview for Pro/Max users. The practical significance is closed-loop verification: code → run → inspect UI → fix → re-test, which several engineers called the missing piece for reliable app iteration, especially compared with open-ended desktop agents (Claude announcement, @Yuchenj_UW on the “eyes” unlock, @omarsar0).
Cross-agent composition is becoming standard: OpenAI shipped a Codex plugin for Claude Code that can trigger reviews, adversarial reviews, and “rescue” flows from inside Anthropic’s toolchain, using a ChatGPT subscription rather than custom glue code. This is notable less as a plugin novelty and more as a signal that coding stacks are becoming composable harnesses rather than monolithic products (plugin by @dkundel, usage thread by @reach_vb, open-source note). Separately, OpenAI shared that late-night Codex tasks run longer, with jobs started around 11pm being 60% more likely to run 3+ hours, which fits the emerging pattern of delegating refactors and planning to background agents (OpenAI Devs).
Harness quality is now visibly a first-order variable: Theo argued that Opus scores ~20% higher in Cursor than in Claude Code, and more broadly that closed-source harnesses make it hard for the community to diagnose or fix regressions (performance gap claim, closed-source critique). That theme repeated across the feed: model capability deltas are narrowing, while tooling, prompt/runtime orchestration, and review loops still create large practical differences.

Hermes Agent’s Rapid Rise, Multi-Agent Profiles, and the Open Harness Ecosystem

Hermes has become the week’s breakout open agent stack: Nous shipped a major Hermes Agent update that drove a wave of migrations from OpenClaw/OpenClaw-like setups, with users emphasizing better compaction, less bloat, stronger adaptability, and faster shipping cadence (Nous release, Teknium’s multi-agent profiles, community migration examples, another). The new multi-agent profiles give each bot its own memory, skills, histories, and gateway connections, moving Hermes from “personal assistant” toward a reusable agent OS abstraction.
An ecosystem is forming around traces, remote control, and self-improvement: Several projects extend Hermes beyond core inference. @jayfarei’s opentraces.ai provides a CLI/schema/review flow for sanitizing and publishing agent traces to Hugging Face for analytics, evals, SFT, and RL. @kaiostephens uploaded ~4,000 GLM-5 Hermes traces to HF. @IcarusHermes described an integration where agents log their own decisions, export data, fine-tune smaller successors on their history, and switch over to cheaper models. @winglian’s ARC adds remote browser-based monitoring/control with E2E encryption.
Open vs proprietary agent infra is being actively contested: @ClementDelangue explicitly argued that open-source agent tools should default to open-source models, both for privacy and durability. In parallel, vendors are attacking known pain points: @fchollet highlighted PokeeClaw as a more secure OpenClaw-style assistant with sandboxing, approvals, RBAC, and audit trails; Z AI launched AutoClaw, a local OpenClaw runtime with no API key required and optional GLM-5-Turbo.

Qwen3.5-Omni, GLM-5-Turbo/AutoClaw, and the Push Toward Local/Agentic Specialization

Qwen3.5-Omni is a major multimodal release: Alibaba introduced Qwen3.5-Omni, with native text/image/audio/video understanding, script-level captioning, built-in web search and function calling, and a standout “audio-visual vibe coding” demo where the model builds websites/games from spoken visual instructions. Reported capabilities include support for 10h audio / 400s of 720p video, 113 speech-recognition languages, and 36 spoken languages; Alibaba claims it outperforms Gemini 3.1 Pro in audio and matches its AV understanding in some settings (launch thread, demo thread, additional demo). A useful caveat from @kimmonismus: “omni” here is about interpreting multimodal inputs, not arbitrary multimodal generation.
Z AI continues to tune for agentic workloads: Artificial Analysis evaluated GLM-5-Turbo, Z AI’s proprietary agent-optimized variant. It scored 47 on the AA Intelligence Index, slightly behind open-weight GLM-5 (Reasoning) at 50, but posted 1503 on GDPval-AA, ahead of GLM-5’s 1408, supporting the claim that the model is tuned for real-world agent workflows rather than broad benchmark maximalism.
Specialized open models are increasingly the deployment pattern: Several tweets converged on the same thesis: companies will increasingly own and specialize open models on proprietary data rather than rent general-purpose APIs indefinitely (@oneill_c, @ClementDelangue). Supporting evidence ranged from a Qwen3.5-27B model distilled from Claude 4.6 Opus trending on HF for weeks and reportedly fitting on 16GB in 4-bit (Unsloth, @Hesamation) to growing enthusiasm for local runtimes like llama.cpp and MLX.

Local Inference and Systems: llama.cpp at 100k, Flash-MoE on MacBooks, and Web/Serving Toolchains

Local AI had a symbolic milestone with llama.cpp hitting 100k GitHub stars: @ggerganov’s reflection framed 2026 as potentially the breakout year for local agentic workflows, arguing that useful automation doesn’t require frontier-scale hosted models and that the right portable runtime stack matters more than absolute scale. The post also emphasized the importance of cross-hardware, non-vendor-locked infra.
Flash-MoE on Apple Silicon drew strong attention: A widely shared post claimed Qwen3.5-397B could run on a 48GB MacBook Pro at 4.4 tok/s using a pure C + Metal engine that streams weights from SSD and only loads the active experts, reportedly using ~5.5GB RAM during inference (summary thread). Related work includes anemll-flash-mlx, which focuses on optimizing only the MoE path on top of MLX, and AI Toolkit’s new Apple Silicon support.
Web and serving stacks also moved: Transformers.js v4 added a WebGPU backend across browser/Node/Bun/Deno with major perf gains and 200+ architectures. vLLM-Omni v0.18.0 shipped 324 commits, production TTS/omni serving, unified quantization, diffusion runtime refactors, and a dozen-plus new models. On the speech side, Artificial Analysis covered Cohere Transcribe: a 2B conformer encoder-decoder, Apache 2.0, trained on 14 languages, hitting 4.7% AA-WER and roughly 60x real-time transcription speed.

Agent Research: Natural-Language Harnesses, Meta-Harness, Async SWE Agents, and Long-Context via Filesystems

Harness engineering is becoming a research field of its own: A Tsinghua/Shenzhen paper on natural-language agent harnesses proposed letting an LLM execute orchestration logic from an SOP rather than hard-coded harness rules, a direction that multiple practitioners found mind-bending but plausible as context budgets rise (@rronak_ summary). Meta pushed the idea further with Meta-Harness, a method that optimizes the harness end-to-end over code, traces, and scores rather than just the base model; claims include #1 among Haiku agents on TerminalBench-2 and strong gains in text classification and transfer (@yoonholeee, explainer by @LiorOnAI).
Async/multi-agent SWE design got stronger empirical backing: The CAID paper from CMU argues for centralized asynchronous isolated delegation using manager agents, dependency graphs, isolated git worktrees, self-verification, and merges. Reported gains were +26.7 absolute on PaperBench and +14.3 on Commit0 versus single-agent baselines, suggesting that concurrency and isolation beat simply giving one agent more iterations (@omarsar0 summary).
Coding agents as long-context processors is one of the more interesting reframings: A paper highlighted by @dair_ai treats huge corpora as directory trees and lets off-the-shelf coding agents navigate them with shell commands and Python, rather than stuffing text into context windows or relying purely on retrieval. Reported results include 88.5% on BrowseComp-Plus (750M tokens) vs 80% previous best, and operation up to 3T tokens.

Training, Optimization, Evaluation, and Production Case Studies

Muon got a meaningful systems/math optimization: Gram Newton-Schulz is a drop-in replacement for Muon’s Newton-Schulz step that works on the smaller symmetric XXᵀ Gram matrix rather than the large rectangular matrix, reportedly making Muon up to 2x faster while preserving validation perplexity within 0.01. The work drew praise from @tri_dao as the kind of cross-disciplinary linear algebra + fast-kernel result that actually matters.
Two practical implementation details stood out: Ross Wightman flagged a subtle but important PyTorch trunc_normal_ misuse pattern in LLM training code: default a/b are absolute values, not standard deviations, so many codebases effectively aren’t truncating at all; he also noted numerical oddities later fixed in nightlies. At the application layer, Shopify’s DSPy case study was notable for economics: one slide highlighted a reduction from $5.5M to $73K/year by decomposing business logic, modeling intent with DSPy, and switching to a smaller optimized model while maintaining performance (follow-up).
New evals/benchmarks continued to expose gaps: World Reasoning Arena targets hypothetical/world-model reasoning and reports a substantial gap to humans. Tau Bench’s new banking domain adds a realistic 698-doc support environment where best models still only solve about 25% of tasks. Meanwhile, a Stanford-led paper highlighted by @Zulfikar_Ramzan found sycophantic AI can increase users’ certainty while reducing willingness to repair relationships, underscoring that “helpfulness” metrics can obscure socially harmful behavior.

Top tweets (by engagement)

Claude Code computer use: Anthropic’s release was the biggest technical product launch in the set, and likely the most consequential for day-to-day coding-agent UX (announcement).
Claude Code hidden features: @bcherny’s thread drew massive engagement, reflecting how quickly expert users are now optimizing around coding-agent workflows rather than raw model prompts.
Hermes Agent update: The broad community response to Nous’s major Hermes release suggests open agent harnesses have reached a new adoption phase.
Qwen3.5-Omni launch: Alibaba’s multimodal release was one of the day’s biggest model announcements and especially notable for its practical demos around audio/video-driven app creation (launch).
llama.cpp at 100k stars: @ggerganov’s milestone post captured the local-first mood of the week: increasingly capable open models plus increasingly capable local runtimes.