The New Kings of Open Source AI (Oct 2023 Recap)
Mistral is the new open source unicorn in town, top takes from the AI Engineer Summit, and our usual highest-signal recap of top items for the AI Engineer from Oct 2023
We’re sorry that this monthly recap is delayed - it feels futile to cover >1 month old news in AI but we’re still committed to recapping things monthly, so as to provide useful historical perspective for future readers of this newsletter. This work is as much a part of our process for keeping up to date as it is for you to read.
Move over, Meta, there are new open source kings in town. Mistral 7B, released at the tail end of Sept 2023, is Apache 2.0 licensed, smaller than Llama 2, and outperforms it on benchmarks, and Mistral is now rumored to be raising $400m at a $2.5b valuation from a16z:
Sure, 75% of the 97.5% shrinkage highlighted in Mistral AI’s launch post is from the Chinchilla paper itself, whose death was covered by us on the pod and is well documented by now. But the real story is the shift in efficient frontier of size-vs-performance tradeoffs:
Mistral calls it their “teaser model”, and we’ve already covered speculation that Mistral 7B is an 800B token checkpoint on a whopping 8T token dataset. Llama 2 trained on 2T tokens, but if this speculation is true, and with Together AI also releasing a new 30T token dataset this month, it seems that the “Token Crisis” is not yet a problem (as leaders of both OpenAI and EleutherAI believe).
Finetunes. In the month since its release, Mistral 7B has revolutionized the open models landscape, most prominently with Zephyr 7B, HuggingFace’s finetune of Mistral based on the UltraChat dataset using Direct Preference Optimization, a simple alternative to PPO-based RLHF. This finetune pushes Mistral 7B over the edge to beat Llama 2 70B on MT Bench, an impressive feat for a 10x smaller model. The Nous Research community (which had a big coming out party at OpenAI DevDay) has also been very active, switching the Hermes base model to Mistral and beating the previous Hermes 13B and 70B models. Naturally, there is a ton of interest in custom finetuning, and both Brev and WandB have been on top of it.
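To make “simple alternative to PPO-based RLHF” concrete: DPO drops the separate reward model and PPO loop entirely, and trains directly on preference pairs with a single classification-style loss. A minimal one-pair sketch of that loss (our illustration from the DPO paper’s formula, not Zephyr’s training code):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are summed token log-probs of the chosen/rejected completions
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. Minimizing this pushes the policy
    to widen its margin for the chosen answer, no reward model needed.
    """
    # implicit reward margin, scaled by beta
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin))
    return math.log(1.0 + math.exp(-margin))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; as the policy learns to prefer the chosen completion, the loss falls toward zero.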
What does Open mean?
Though Mistral’s weights are truly no-BS open source licensed, their dataset is not open, and though a paper was published, it offered only red herrings on Sliding Window Attention. We don’t REALLY know why it performs so well and it seems the research community did not benefit from Mistral’s publication — so is Mistral ACTUALLY open or are we simply settling for licenses for weights being the bar for ‘open’?
In June, OSS Capital attempted to define an Open Weights foundation but it failed to catch on. In reality there is no incentive for companies to obey any standards but their own to open their data, since it only invites lawsuits and competition. There’s also very little incentive to open their model architecture and training process, except to the extent they want to allow inference and finetuning, as Mistral and Meta have done. We covered this trend in more detail in AI and Open Source 2023.
Stanford to the rescue: The new Foundation Model Transparency Index, out of Percy Liang’s CRFM lab at Stanford, is the best step forward for open models because it defines openness standards SET BY PEERS:
Instead of a binary open/not open model imposed by the OSI or some other gatekeeper foundation, the FMTI scores models and labs on 100 points of transparency¹, allowing them to pick and choose degrees of transparency, but using the one thing they care about more than impotent open source complainers: actual peer pressure.
Now, perhaps, the open model movement has a chance to measurably improve and converge towards the maximum acceptable openness as collectively decided by the actions of model labs, rather than by bystanders with no skin in the game, who will always demand more openness than the labs are comfortable giving.
The AI Engineer Summit
October was also our first AI Engineer Summit, where we were so happy to meet many of you in person for the first time! We are still processing the many talks and workshops and off the record conversations from that event, but by far the most hotly debated talk was GitHub VP of Product Mario Rodriguez’s Day 2 Keynote, on Copilot’s product journey, philosophy, and $100m ARR milestone:
This is because it happened to coincide with a very poorly sourced press writeup claiming that Copilot loses $20/month per user, which caused easily excitable engineers to do some very bad napkin math despite repeated denials from GitHub executives (denials now formally reiterated by GitHub’s CEO at their Universe conference, unfortunately without any further specifics).
We are still editing the individual speaker videos and dripping them out, but in the meantime, catch up with the almost 30k developers who watched us live:
A separate roundup post is forthcoming on summit takeaways.
Latent Space is a reader-supported publication. To support our work, consider becoming a free or paid subscriber!
Latent Space News
And last month’s recap if you missed it: The AI OS (Sept 2023 Recap)
Oct 6 - sama on Joe Rogan podcast
Oct 11 - OpenAI’s technology explained
Oct 20 - Tender offer @ 86b valuation
Oct 26 - GPT4 knowledge cutoff updated to Apr 2023
Oct 26 - Preparedness Challenge and hiring
Oct 30 - slow rollout of GPT4 32k context with ‘all tools’ model
Third party: Collection of ChatGPT System Prompts including Voice and Custom Instructions
Frontier Model News
(other members of the Frontier Model Forum + Meta)
Oct 16 — Inflection Pi internet access & therapy mode
Oct 16 — Anthropic Claude is rolled out to 95 more countries
Oct 31 — All Metamates given GPT-4 access internally, while Google Gemini is still nowhere to be found
did we miss something or was October a super quiet month in non-OpenAI land?
yes, everyone is confused - this is corporate, not fund, “investing”
but this is in line with their $5b, 4 year plan to take on OpenAI
FTX stake is owned by bankruptcy estate
Modal Labs - $16m Series A with Redpoint (swyx is an advisor)
Induced AI - $2.3m seed with Sam Altman and Peak15 (Sequoia India/SEA)
Mistral is rumored to be raising $400m at a $2.5b valuation from a16z and General Catalyst
We use t-shirt sizing as a shorthand to make models comparable within weight classes
Most notable model
Mistral 7B outperforms Llama 13B on all tested benchmarks
Mistral Mission: "Our ambition is to become the leading supporter of the open generative AI community, and bring open models to state-of-the-art performance… Mistral AI will progressively and methodically release new models that close the performance gap between black-box and open solutions - making open solutions the best options on a growing range of enterprise use-cases."
“…The field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost) the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.”
introduces Sliding Window Attention
Mistral 7b Finetunes:
The Hermes 2 model was trained on 900,000 instructions; it surpasses all previous Hermes models at 13B and below, and matches 70B on some benchmarks! Hermes 2 changes the game with strong multiturn chat skills and system prompt capabilities, and uses the ChatML format.
Open source projects and templates
Daniel Gross’ LocalPilot: “In my experience, 7b isn't usefully fast enough for autocomplete on M1, but M2 Max is the punctuated equilibrium; it's suddenly good enough. (34b quantized models are fast enough for Q&A.)”
Local LLM calculator: select an LLM and a GPU, and see if it can run locally
Cloudflare AI projects
Dadjokes: Mistral finetune on /r/dadjokes
Fast Whisper distributions
Redpajama Dataset v2 - 30T tokens, but the real innovation is offering customizable filtering that is likely to be the new standard in open datasets (this is a step up from Eleuther’s The Pile)
Stanford CRFM ecosystem graphs: tracks foundation models, upstream datasets, and downstream products
Langchain (friend of the pod)
Perplexity PPLX API: Mistral 7B, Llama2 13B, Code Llama 34B, and Llama2 70B models supported. notes from @danjl on Latent Space Discord:
currently included with perplexity pro, no $/token (for now? I'm assuming only in public beta, that won't scale)
really leans into OpenAI API compatibility. They actually just use the openai python client: all you have to do is switch api_base, api_key, and model to point an application from OpenAI to Perplexity in Python
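Because the API is OpenAI-compatible (same /chat/completions route and payload shape), the switch really is just a new base URL, key, and model name. A dependency-free sketch using only the stdlib; the endpoint path and model name here are assumptions based on that compatibility, so check Perplexity’s docs for current values:

```python
import json
import urllib.request

API_BASE = "https://api.perplexity.ai"
API_KEY = "pplx-..."  # your Perplexity API key

def chat(prompt, model="mistral-7b-instruct"):
    """Send one user message to an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Existing code written against the openai client needs only the equivalent three-line change (base URL, key, model) to make the same switch.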
great demo of what a nice stable diffusion 1.5 app can do today
Midjourney released new 2x and 4x upscalers
Databricks MLflow 2.8 supports LLM-as-a-judge metrics - resulting in significant savings in time (from 2 weeks with human workforce to 30 minutes with LLM judges) and costs (from $20 per task to $0.20 per task)
Morph Code Index is an OSS semantic code search engine for you, your codebase, and your personal AI SWE.
Papers and Good Reads
Nathan Benaich's annual State of AI Report is out! lots of good charts.
Attention Sinks: Use and maintain "attention sinks", initial tokens that the model focuses on.
Rolling Cache: Keep a rolling collection of recent tokens to optimize speed without sacrificing accuracy.
Placeholder Token: Add a special token during training to act as a dedicated attention sink, enhancing streaming deployment.
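The eviction policy behind the first two bullets can be shown in a few lines: keep the initial “sink” tokens forever, keep a rolling window of recent tokens, evict everything in between. A toy illustration of which cache positions survive (our sketch, not the StreamingLLM code; the default sizes are made up):

```python
def streaming_cache(token_ids, n_sink=4, window=8):
    """Return the cached positions a StreamingLLM-style policy keeps:
    the first n_sink 'attention sink' tokens plus a rolling window of
    the most recent tokens. Everything in between is evicted, so the
    cache stays fixed-size no matter how long the stream grows."""
    if len(token_ids) <= n_sink + window:
        return list(token_ids)
    return list(token_ids[:n_sink]) + list(token_ids[-window:])
```

For a 100-token stream with the defaults, positions 0-3 and 92-99 remain cached, giving constant memory use during streaming.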
adding up to 10 "pause tokens" lets models improve reasoning - tested up to 1B params on C4
seems similar to the backspace token paper
Expressive text-to-image generation with rich text (HN) - being able to modify generated images by modifying text using fonts and text colors. going from words -> token maps (masks).
FontoGen - The model takes a font description as an input, and produces a font file as an output.
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: 94.4 on HumanEval with GPT-4, 86.9 on HumanEval with GPT-3.5
Large Language Models as Analogical Reasoners from Deepmind
When given a task, the LLM is prompted to: first, create relevant examples (problems and their solutions) for the task; then, use these examples as guidance to solve the main task.
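The two-step recipe fits in a single prompt template, since the model generates its own exemplars instead of being fed hand-written few-shot examples. A sketch (wording paraphrased from the paper’s idea, not its exact prompt):

```python
def analogical_prompt(problem, n_exemplars=3):
    """Build a single analogical-reasoning prompt: the model is asked to
    recall relevant solved problems first, then solve the actual task
    using those self-generated exemplars as guidance."""
    return (
        f"Problem: {problem}\n\n"
        f"Instructions:\n"
        f"1. Recall {n_exemplars} relevant and distinct problems, and for "
        f"each, describe the problem and explain its solution.\n"
        f"2. Then solve the initial problem, using the recalled problems "
        f"as guidance.\n"
    )
```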
MemWalker - LLM as an interactive agent, allowing it to decide how to read the text via iterative prompting. (Paper)
We introduce MemWalker, a method that first processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
1. Build a data structure (memory tree)
2. Traverse it via LLM prompting
Outperforms long context, retrieval, & recurrent baselines. (1/n) (tweet)
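A toy sketch of those two steps (our illustration, not the authors’ code): the long text is pre-summarized into a tree, and at query time a chooser walks from the root toward the relevant leaf. In the paper the chooser is the LLM deciding via iterative prompting; here a keyword heuristic stands in for it:

```python
class Node:
    """One node of the memory tree: a summary, plus either children
    (internal nodes) or the raw text segment (leaves)."""
    def __init__(self, summary, text=None, children=()):
        self.summary = summary
        self.text = text
        self.children = list(children)

def navigate(node, query, choose):
    """Descend the summary tree until a leaf, letting `choose` pick the
    child whose summary best matches the query at each step."""
    while node.children:
        node = choose(query, node.children)
    return node.text

def keyword_choose(query, children):
    # stand-in for an LLM navigation prompt: pick the child whose summary
    # shares the most words with the query
    words = set(query.lower().split())
    return max(children,
               key=lambda c: len(words & set(c.summary.lower().split())))
```

Only the summaries along one root-to-leaf path ever enter the context window, which is how the method sidesteps the context-length limit.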
RAG vs Long Context (tweet, paper): we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation.
"These results imply that text embeddings present the same threats to privacy as the text from which they are computed, and embeddings should be treated with the same precautions as raw data."
"Vec2Text is trained to invert two state-of-the-art embedding models: GTR-base (Niet al., 2021), a T5-based pre-trained transformer for text retrieval, and text-embeddings-ada-002 available via the OpenAI API"
"Vec2Text is able to recover 94% of first names, 95% of last names, and 89% of full names (first, last format) while recovering 26% of the documents exactly."
"Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. IMO currently Vespa does the best job of this, though lately it seems Lucene (Elasticsearch and Opensearch) are really working hard to compete"
Vespa.ai is spinning out of Yahoo as a separate company. People speak highly of Vespa; it targets search and recsys problems rather than specifically vector db problems
the reports of ChatGPT having a liberal bias were a result of oversensitive prompts
GPT4 passing the bar exam and USMLE is a sign of contamination
OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from `WeightsProvider`. A `WeightsProvider` specialization can implement any type of loading, caching, and prefetching of the model parameters. For example, a custom `WeightsProvider` can decide to download its data from an HTTP server directly, without loading or writing anything to disk (hence the word "Stream" in "OnnxStream"). Two default `WeightsProvider` implementations are provided.
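OnnxStream itself is C++, but the decoupling pattern is easy to sketch in Python (class and method names here are illustrative, not OnnxStream’s actual API): the engine asks the provider for each tensor as needed, so the provider alone decides whether weights come from disk, a cache, or straight off the network, and the full model never has to sit in RAM at once.

```python
from abc import ABC, abstractmethod

class WeightsProvider(ABC):
    """Interface the inference engine depends on: it only ever asks for
    one named tensor at a time, never for the whole model."""
    @abstractmethod
    def get(self, name: str) -> bytes:
        """Return the raw bytes for one weight tensor."""

class HttpStreamingProvider(WeightsProvider):
    """Hypothetical provider that fetches each tensor over HTTP on
    demand, never touching disk (the 'Stream' in OnnxStream)."""
    def __init__(self, base_url, fetch):
        self.base_url = base_url
        self.fetch = fetch  # injected fetch function, e.g. urllib-based

    def get(self, name):
        return self.fetch(f"{self.base_url}/{name}")
```

Swapping loading strategies then means swapping provider instances, with no change to the engine.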
open replication of ChatGPT Plus tools based on their own comparison table (see chart), using 1. Data agent: data analysis and tools; 2. Plugin agent: 200+ APIs from API providers (RapidAPI, OAI plugin store); 3. Web agent (tools: Chrome debugger API)
from the Executable Language Grounding (XLang) Lab! We are part of the HKU NLP Group at the University of Hong Kong.
didnt find a lot of insight while reading thru. too much marketing in the paper
Ferret: Refer and Ground Anything Anywhere at Any Granularity - nice attempt at Open GPT4-V, and has a nice GRIT dataset others can use
Meta released MetaCLIP - fully OSS replication of CLIP pipeline
Paper: Demystifying CLIP Data
This page contains interactive charts for exploring how large language models represent truth https://saprmarks.github.io/geometry-of-truth/dataexplorer/
emergence (Wei et al) is still underappreciated
Perspective of "yet": "This idea doesn't work" -> "This idea doesn't work YET"
Document experiments that failed because of insufficient “intelligence”
Do not declare failure yet and make it easy to rerun in the future
As soon as the new model comes out, rerun them
Learn what works and what doesn’t
Update your intuition on emergent abilities and scale
We need post-training
instruction tuning - FLAN
Reward Model training
Policy model training
Many Transformer variants have been proposed but almost all fancy variations don’t scale well
More useful to abstract away Transformer as sequence of functions and think about input and output shapes and types
OpenAI is too cheap to beat - "At $366K ($166K AWS + $200K talent), we’re paying around $80 per-fine-tuning run, about 8-20x higher than what we’re paying OpenAI!"
TimeGPT - some skepticism in HN commentary
Safety Summit debates blew up - we don’t focus on those in Latent Space
¹ Although it must be observed that any system that has GPT4 ranking as the 3rd most open LLM is a little suspect?