

The New Kings of Open Source AI (Oct 2023 Recap)
Mistral is the new open source unicorn in town, top takes from the AI Engineer Summit, and our usual highest-signal recap of top items for the AI Engineer from Oct 2023
We’re sorry that this monthly recap is delayed - it feels futile to cover >1 month old news in AI but we’re still committed to recapping things monthly, so as to provide useful historical perspective for future readers of this newsletter. This work is as much a part of our process for keeping up to date as it is for you to read.
Join us to celebrate the One Year Anniversary of ChatGPT and at Modular ModCon!
Move over, Meta, there are new open source kings in town. Mistral 7B, released at the tail end of Sept 2023, is Apache 2.0 licensed, smaller than Llama 2, and better on benchmarks, and Mistral is now rumored to be raising $400m at a $2.5b valuation from a16z:

Sure, 75% of the 97.5% shrinkage highlighted in Mistral AI’s launch post comes from the Chinchilla paper itself, whose death we covered on the pod and which is well documented by now. But the real story is the shift in the efficient frontier of size-vs-performance tradeoffs:

Mistral calls it their “teaser model”, and we’ve already covered speculation that Mistral 7B is an 800B token checkpoint on a whopping 8T token dataset. Llama 2 trained on 2T tokens, but if this speculation is true, and with Together AI also releasing a new 30T token dataset this month, it seems that the “Token Crisis” is not yet a problem (as leaders of both OpenAI and EleutherAI believe).
Finetunes. In the month since release, Mistral’s release has revolutionized the open models landscape, most prominently with Zephyr 7B, HuggingFace’s finetune of Mistral based on the UltraChat dataset using Direct Preference Optimization, a simple alternative to PPO-based RLHF. This finetune pushes Mistral 7B over the edge to beat Llama 2 70B on MT Bench, an impressive feat for a 10x smaller model. The Nous Research community (which had a big coming out party at OpenAI DevDay) has also been very active, switching the Hermes base model to Mistral, beating previous Hermes 13B and 70B models. Naturally, there is a ton of interest in custom finetuning, and both Brev and WandB have been on top of it.
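For readers new to DPO: its appeal is that the whole PPO apparatus (reward model, rollouts, value head) collapses into a single logistic loss on log-probability margins, computed against a frozen reference copy of the model. A minimal sketch of the per-pair loss, with variable names of our choosing (not the TRL/paper API):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed log-probabilities of each completion under the
    policy being trained (pi_*) and the frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # completion over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Logistic loss on the scaled margin; minimized as the policy learns
    # to prefer the chosen completion more strongly than the reference.
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy and reference agree, the margin is zero and the loss is log 2; gradient descent then pushes the policy to widen the margin, which is the entire "RLHF without RL" trick.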
What does Open mean?
Though Mistral’s weights are truly no-BS open source licensed, their dataset is not open, and though a paper was published, it offered only red herrings on Sliding Window Attention. We don’t REALLY know why it performs so well and it seems the research community did not benefit from Mistral’s publication — so is Mistral ACTUALLY open or are we simply settling for licenses for weights being the bar for ‘open’?
In June, OSS Capital attempted to define an Open Weights foundation but it failed to catch on. In reality there is no incentive for companies to obey any standards but their own to open their data, since it only invites lawsuits and competition. There’s also very little incentive to open their model architecture and training process, except to the extent they want to allow inference and finetuning, as Mistral and Meta have done.
This trend was covered in more detail in AI and Open Source 2023.
Stanford to the rescue: The new Foundation Model Transparency Index, out of Percy Liang’s CRFM lab at Stanford, is the best step forward for open models because it defines openness standards SET BY PEERS:
Instead of a binary open/not open model imposed by the OSI or some other gatekeeper foundation, the FMTI scores models and labs on 100 points of transparency, allowing them to pick and choose degrees of transparency, but using the one thing they care about more than impotent open source complainers: actual peer pressure.
Now, perhaps, the open model movement has a chance to measurably improve and converge towards the maximum acceptable openness as collectively decided by the actions of model labs, not by the masses with no skin in the game, who will always want more openness than the labs are comfortable giving.
The AI Engineer Summit
October was also our first AI Engineer Summit, where we were so happy to meet many of you in person for the first time! We are still processing the many talks and workshops and off the record conversations from that event, but by far the most hotly debated talk was GitHub VP of Product Mario Rodriguez’s Day 2 Keynote, on Copilot’s product journey, philosophy, and $100m ARR milestone:
This is because it happened to coincide with a very poorly sourced press writeup that Copilot loses $20/month per user, causing easily excitable engineers to do some very bad napkin math, despite repeated denials from GitHub executives (now formally confirmed by GitHub’s CEO at their Universe conference, but unfortunately without any further specifics).
We are still editing the individual speaker videos and dripping them out, but in the meantime, catch up with the almost 30k developers who watched us live:
A separate roundup post is forthcoming on summit takeaways.
Latent Space News
Oct 7 — AIE Summit Preview #1 - Swyx on Software 3.0 and the Rise of the AI Engineer
Oct 8 — AIE Summit Preview #2 - The AI Horcrux — Swyx on Cognitive Revolution
Oct 14 — Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue
Oct 19 — The End of Finetuning — with Jeremy Howard of Fast.ai
Oct 26 — Powering your Copilot for Data – with Artem Keydunov of Cube.dev
And last month’s recap if you missed it: The AI OS (Sept 2023 Recap)
News Items
The raw notes from which I draw everything above. You can always see the raw material on GitHub. Of course I can’t see everything, so check the other newsletters and podcasters that do.
OpenAI
Oct 4 - 1.0 release of Python SDK soon. Can begin beta testing now. changelog
Oct 6 - sama on Joe Rogan podcast
Oct 6 - Finetuning UI has end to end jobs, no code required (tweet - not our fault)
Oct 11 - OpenAI’s technology explained
Oct 19 - DallE3 in chatgpt plus and enterprise together with research paper
Oct 20 - Tender offer @ 86b valuation
Oct 26 - GPT4 knowledge cutoff updated to Apr 2023
Oct 26 - Preparedness Challenge and hiring
Oct 30 - slow rollout of GPT4 32k context with ‘all tools’ model
Third party: Collection of ChatGPT System Prompts including Voice and Custom Instructions
Frontier Model News
(other members of the Frontier Model Forum + Meta)
Oct 16 — Inflection Pi internet access & therapy mode
Oct 16 — Anthropic Claude is rolled out to 95 more countries
Oct 31 — All Metamates given GPT-4 access internally, while Google Gemini is still nowhere to be found
Did we miss something, or was October a super quiet month in non-OpenAI land?
Notable Fundraising
Anthropic
Google invests $500 million upfront and $1.5 billion over time in new round
after their $300m for 10% in Feb and participation in $450m round in May
after Amazon committed $1.25bn upfront, and up to $4bn in Sept
TheInformation speculates at a valuation of $20bn
yes, everyone is confused - this is corporate, not fund, “investing”
but this is in line with their $5b, 4 year plan to take on OpenAI
FTX stake is owned by bankruptcy estate
Modal Labs - $16m Series A with Redpoint (swyx is an advisor)
Anysphere - $8m Seed with OpenAI (we were the first podcast to cover Cursor)
Induced AI - $2.3m seed with Sam Altman and Peak15 (Sequoia India/SEA)
Mistral is rumored to be raising $400m at a $2.5b valuation from a16z and General Catalyst
Open Models
We use t-shirt sizing as a shorthand to make models comparable within weight classes
Most notable model
Mistral 7B outperforms Llama 2 13B on all tested benchmarks
Mistral Mission: "Our ambition is to become the leading supporter of the open generative AI community, and bring open models to state-of-the-art performance… Mistral AI will progressively and methodically release new models that close the performance gap between black-box and open solutions - making open solutions the best options on a growing range of enterprise use-cases."
“…The field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost) the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.”
introduces Sliding Window Attention
Pieter Abbeel: "convnets again?"
likely a red herring
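For reference, the mechanism itself is simple to state: each token attends only to the previous W tokens, and stacking layers re-expands the effective receptive field beyond W. A toy mask construction (pure Python, sizes illustrative, not Mistral's actual W=4096):

```python
def sliding_window_mask(seq_len, window):
    """Causal attention mask where token i may attend only to tokens j
    with i - window < j <= i (itself plus window-1 predecessors).
    Information still propagates further than `window` tokens, because
    each successive layer shifts the receptive field back again.
    """
    return [
        [1 if i - window < j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

The payoff is that KV-cache memory and attention FLOPs per token become O(W) instead of O(sequence length), which is why the idea alone doesn't explain the quality gap.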
Mistral 7b Finetunes:
Zephyr 7B beats Llama 2 70B on MT Bench
noncommercial license for now
The Hermes 2 model was trained on 900,000 instructions, and surpasses all previous versions of Hermes 13B and below, and matches 70B on some benchmarks! Hermes 2 changes the game with strong multiturn chat skills, system prompt capabilities, and uses ChatML format.
Large
Medium
Adept Fuyu-8B: Multimodal model (tweet) that powers Adept Workflows
Reka Yasa-1: Multimodal LLM that natively supports images, audio, and short video clips as inputs, with internal RAG deployment and code interpreter execution. (tweet, blog)
Small
Open source projects and templates
Daniel Gross’ LocalPilot: “In my experience, 7b isn't usefully fast enough for autocomplete on M1, but M2 Max is the punctuated equilibrium; it's suddenly good enough. (34b quantized models are fast enough for Q&A.)”
Llama 2 in C (HN/Github, Karpathy comment) - compare to Llama 2 in Mojo 🔥
Local LLM calculator: select an LLM and a GPU, and see if it can run locally
SlowLlama: Finetune llama2-70B and codellama on MacBook Air without quantization
Cloudflare AI projects
LlamaResearcher: personalized research assistant that sends summaries of papers of interest straight to your inbox
Audioflare: An all-in-one AI audio playground using Cloudflare AI Workers to transcribe, analyze, summarize, and translate any audio file.
Dadjokes: Mistral finetune on /r/dadjokes
Fast Whisper distributions
Whisper turbo - purely in browser (tweet context), using webgpu
Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller
Redpajama Dataset v2 - 30T tokens, but the real innovation is offering customizable filtering that is likely to be the new standard in open datasets (this is a step up from Eleuther’s The Pile)
Riffusion with Lyrics (HN) now generating short song segments with VOICE!
OpenLLMetry – OpenTelemetry-based observability for LLMs (HN)
Other launches
Stanford CRFM ecosystem graphs: tracks foundation models, upstream datasets, and downstream products
Langchain (friend of the pod)
LangServe: The best way to deploy your LangChains
Templates: deployable reference architectures for a wide variety of tasks (aka Langserve Hub)
Mojo 🔥 (friend of the pod! we are attending ModCon in Dec) is now available on Apple silicon Macs and has LLaMa.cpp level performance (Announcement, Performance thread)
Perplexity PPLX API: Mistral 7B, Llama2 13B, Code Llama 34B, and Llama2 70B models supported. notes from @danjl on Latent Space Discord:
currently included with perplexity pro, no $/token (for now? I'm assuming only in public beta, that won't scale)
really leans into OpenAI API compatibility. They actually just use the openai Python client: all you have to do is switch api_base, api_key, and model to point an application from OpenAI to Perplexity in Python
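To make the "just switch three fields" point concrete, here is a sketch of the only settings that differ between the two providers. The Perplexity base URL, env var name, and model identifiers are our assumptions from the announcement, so check their docs before relying on them:

```python
def client_settings(provider):
    """The three knobs an openai-client application changes to switch
    providers: base URL, API key source, and model name. With the
    openai Python client these map onto openai.api_base, openai.api_key,
    and the model= argument. Values below are illustrative.
    """
    settings = {
        "openai": {
            "api_base": "https://api.openai.com/v1",
            "api_key_env": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
        },
        "perplexity": {
            "api_base": "https://api.perplexity.ai",
            "api_key_env": "PERPLEXITY_API_KEY",
            "model": "mistral-7b-instruct",
        },
    }
    return settings[provider]
```

Everything else (message format, streaming, response shape) is designed to stay identical, which is exactly what makes the OpenAI API the de facto wire standard.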
Show HN: Shortbread – Create AI comics in minutes (shortbread.ai)
great demo of what a nice stable diffusion 1.5 app can do today
Midjourney released new 2x and 4x upscalers
Adobe Firefly 2 Image Model - 4MP output, depth of field control, and high-frequency details like skin pores. Features like Generative Match enable style transfer from reference images. (X)
Baidu claims their new model Ernie 4 rivals GPT4, but this can’t be verified (Announcement, Thread)
Databricks MLflow 2.8 supports LLM-as-a-judge metrics - resulting in significant savings in time (from 2 weeks with human workforce to 30 minutes with LLM judges) and costs (from $20 per task to $0.20 per task)
Morph Code Index is an OSS semantic code search engine for you, your codebase, and your personal AI SWE.
Papers and Good Reads
Nathan Benaich's annual State of AI Report is out! lots of good charts.
Models
Efficient streaming language models with attention sinks (HN)
Attention Sinks: Use and maintain "attention sinks", initial tokens that the model focuses on.
Rolling Cache: Keep a rolling collection of recent tokens to optimize speed without sacrificing accuracy.
Placeholder Token: Add a special token during training to act as a dedicated attention sink, enhancing streaming deployment.
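The eviction policy behind this is small enough to sketch: keep the first few tokens in the KV cache forever, and roll everything else. A toy version in pure Python (sizes are illustrative, not the paper's):

```python
def streaming_cache_positions(n_tokens, n_sinks=4, window=1020):
    """Which token positions survive in the KV cache under
    StreamingLLM-style eviction: the first n_sinks tokens (the
    "attention sinks" the softmax dumps probability mass onto) plus
    a rolling window of the most recent tokens.
    """
    if n_tokens <= n_sinks + window:
        return list(range(n_tokens))  # nothing evicted yet
    # Sinks are pinned; the rest of the budget goes to recent tokens.
    return list(range(n_sinks)) + list(range(n_tokens - window, n_tokens))
```

The surprising empirical finding is that dropping those first few tokens (a plain sliding window) collapses perplexity, while pinning them keeps generation stable over millions of tokens.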
Think before you speak: Training Language Models With Pause Tokens (HN)
adding up to 10 "pause tokens" lets models improve reasoning - tested up to 1B params on C4
seems similar to the backspace token paper
Expressive text-to-image generation with rich text (HN) - being able to modify generated images by modifying text using fonts and text colors. going from words -> token maps (masks).
FontoGen - The model takes a font description as an input, and produces a font file as an output.
NEFTune - a "one simple trick" to get higher quality finetunes by adding noise (Thread, Github)
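The trick is genuinely one line: during finetuning, add uniform noise to the token embeddings, scaled by alpha over sqrt(sequence length times embedding dim). A minimal sketch (alpha=5 is a value the paper reports working well; treat everything here as a tunable, not the authors' code):

```python
import math
import random

def neftune_noise(embeddings, alpha=5.0):
    """NEFTune-style noisy embeddings: add uniform noise in [-1, 1]
    scaled by alpha / sqrt(L * d) to each embedding entry, applied
    only at training time (inference uses clean embeddings).
    """
    L = len(embeddings)        # sequence length
    d = len(embeddings[0])     # embedding dimension
    scale = alpha / math.sqrt(L * d)
    return [
        [x + random.uniform(-1.0, 1.0) * scale for x in row]
        for row in embeddings
    ]
```

The scaling keeps the noise magnitude roughly constant as sequence length and model width change, which is what makes a single alpha transfer across setups.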
Prompting
Emotional jailbreaks work. https://arxiv.org/abs/2307.11760 "This is very important to my career" and the dead grandma jailbreak
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: 94.4 on HumanEval with GPT-4, 86.9 on HumanEval with GPT-3.5
Large Language Models as Analogical Reasoners from Deepmind
When given a task, the LLM is prompted to: first, create relevant examples (problems and their solutions) for the task; then, use these examples as guidance to solve the main task.
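Since the whole method lives in the prompt, it can be sketched as a template builder; the wording below is our paraphrase of the recipe, not the paper's exact prompt:

```python
def analogical_prompt(problem, n_examples=3):
    """Single-prompt analogical reasoning: instead of supplying
    few-shot exemplars by hand, ask the model to recall relevant
    problems itself, then solve the target problem using them.
    """
    return (
        f"Problem: {problem}\n\n"
        f"First, recall {n_examples} relevant problems and explain "
        "how each was solved.\n"
        "Then, solve the problem above, using the recalled problems "
        "as guidance."
    )
```

The appeal over standard few-shot prompting is that the exemplars are self-generated and task-specific, so no human-curated example bank is needed.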
RAG
MemGPT - LLMs as operating systems (twitter, site, paper, HN)
MemWalker - LLM as an interactive agent, allowing it to decide how to read the text via iterative prompting. (Paper)
We introduce MemWalker, a method that first processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
1. Build a data structure (memory tree). 2. Traverse it via LLM prompting. Outperforms long context, retrieval, & recurrent baselines. (tweet)
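Those two steps can be sketched with the LLM calls stubbed out as plain functions; this is our reconstruction of the idea, not the authors' implementation:

```python
def build_tree(segments, summarize, fanout=2):
    """Bottom-up memory tree: leaves hold raw text segments, each
    internal node holds a summary of its children. `summarize` stands
    in for an LLM summarization call.
    """
    nodes = [{"summary": summarize(s), "text": s, "children": []}
             for s in segments]
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            group = nodes[i:i + fanout]
            joined = " ".join(n["summary"] for n in group)
            parents.append({"summary": summarize(joined), "children": group})
        nodes = parents
    return nodes[0]

def navigate(root, choose):
    """Walk from root to a leaf. `choose` stands in for prompting the
    LLM with the query plus child summaries and asking it to pick a
    branch (or back up, in the full method)."""
    node = root
    while node["children"]:
        node = node["children"][choose(node["children"])]
    return node["text"]
```

In the real system both `summarize` and `choose` are LLM prompts, and navigation can also decide it has gathered enough information and answer early.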
RAG vs Long Context (tweet, paper): we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation.
Vec2Text: Text embeddings can be inverted (twitter, paper)
"These results imply that text embeddings present the same threats to privacy as the text from which they are computed, and embeddings should be treated with the same precautions as raw data."
"Vec2Text is trained to invert two state-of-the-art embedding models: GTR-base (Niet al., 2021), a T5-based pre-trained transformer for text retrieval, and text-embeddings-ada-002 available via the OpenAI API"
"Vec2Text is able to recover 94% of first names, 95% of last names, and 89% of full names (first, last format) while recovering 26% of the documents exactly."
Choosing vector database: a side-by-side comparison (HN)
"Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. IMO currently Vespa does the best job of this, though lately it seems Lucene (Elasticsearch and Opensearch) are really working hard to compete"
Vespa.ai is spinning out of Yahoo as a separate company. People speak highly of Vespa; it is targeting search and recsys problems rather than specifically vector db problems
Evals
Evaluating LLMs is a minefield
the reports of ChatGPT having a liberal bias were a result of oversensitive prompts
GPT4 passing the bar exam and USMLE is a sign of contamination
Efficiency
Running Stable Diffusion XL 1.0 in 298MB of RAM
OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. A WeightsProvider specialization can implement any type of loading, caching and prefetching of the model parameters. For example, a custom WeightsProvider can decide to download its data from an HTTP server directly, without loading or writing anything to disk (hence the word "Stream" in "OnnxStream"). Two default WeightsProviders are available: DiskNoCache and DiskPrefetch.
Agents
OpenAgents: An Open Platform for Language Agents in the Wild
open replication of ChatGPT Plus tools based on their own comparison table (see chart), using: 1. Data agent: data analysis and tools; 2. Plugin agent: 200+ APIs from RapidAPI and the OpenAI plugin store; 3. Web agent (tools: Chrome debugger API)
from the Executable Language Grounding (XLang) Lab! We are part of the HKU NLP Group at the University of Hong Kong.
didn’t find a lot of insight while reading through; too much marketing in the paper
Multimodality
Multi-modal prompt injection image attacks against GPT-4V (simonwillison.net)
Ferret: Refer and Ground Anything Anywhere at Any Granularity - nice attempt at Open GPT4-V, and has a nice GRIT dataset others can use
Meta released MetaCLIP - fully OSS replication of CLIP pipeline
Paper: Demystifying CLIP Data
Learning
This page contains interactive charts for exploring how large language models represent truth https://saprmarks.github.io/geometry-of-truth/dataexplorer/
Language Modeling is Compression from DeepMind echoes what Ilya Sutskever said in a recent talk
Large Language Models in 2023 (tweet, recorded talk) from Hyung Won Chung, OpenAI & Google Brain
emergence (Wei et al) is still underappreciated
Perspective of "yet": "This idea doesn't work" -> "This idea doesn't work YET"
Document experiments that failed because of insufficient “intelligence”
Do not declare failure yet and make it easy to rerun in the future
As soon as the new model comes out, rerun them
Learn what works and what doesn’t
Update your intuition on emergent abilities and scale
We need post-training
instruction tuning - FLAN
Reward Model training
Policy model training
bitter lesson
Many Transformer variants have been proposed but almost all fancy variations don’t scale well
More useful to abstract away Transformer as sequence of functions and think about input and output shapes and types
Misc
OpenAI is too cheap to beat - "At $366K ($166K AWS + $200K talent), we’re paying around $80 per-fine-tuning run, about 8-20x higher than what we’re paying OpenAI!"
TimeGPT - some skepticism in HN commentary
Safety Summit debates blew up - we don’t focus on those in Latent Space
Biden executive order vs Andrew Ng tweet
Memes
Although it must be observed that any system that has GPT4 ranking as the 3rd most open LLM is a little suspect?