Latent Space

The New Kings of Open Source AI (Oct 2023 Recap)

Mistral is the new open source unicorn in town, top takes from the AI Engineer Summit, and our usual highest-signal recap of top items for the AI Engineer from Oct 2023

swyx
and
Noah Hein
Nov 12, 2023

We’re sorry that this monthly recap is delayed - it feels futile to cover >1 month old news in AI but we’re still committed to recapping things monthly, so as to provide useful historical perspective for future readers of this newsletter. This work is as much a part of our process for keeping up to date as it is for you to read.

Join us to celebrate the One Year Anniversary of ChatGPT and at Modular ModCon!


Move over, Meta: there are new open source kings in town. Mistral 7B, released at the tail end of Sept 2023, is Apache 2.0 licensed and both smaller and better than Llama 2, and Mistral is now rumored to be raising $400m at a $2.5b valuation from a16z:

this chart went slightly viral

Sure, 75% of the 97.5% shrinkage highlighted in Mistral AI’s launch post is from the Chinchilla paper itself, whose death we covered on the pod and which is well documented by now. But the real story is the shift in the efficient frontier of size-vs-performance tradeoffs:

per the Mistral paper. Note that neither axis starts at 0, which exaggerates the degree of difference, but this is still a major accomplishment

Mistral calls it their “teaser model”, and we’ve already covered speculation that Mistral 7B is an 800B token checkpoint on a whopping 8T token dataset. Llama 2 trained on 2T tokens, but if this speculation is true, and with Together AI also releasing a new 30T token dataset this month, it seems that the “Token Crisis” is not yet a problem (as leaders of both OpenAI and EleutherAI believe).
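The token-count speculation is easy to sanity-check against the Chinchilla rule of thumb of roughly 20 training tokens per parameter. A back-of-envelope sketch (the 8T figure is rumor, not a confirmed number, and the 20x ratio is the usual heuristic, not an exact law):

```python
# Back-of-envelope: how far past "Chinchilla-optimal" the rumored
# Mistral 7B training run would be. The 8T-token figure is speculation.
def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return params * tokens_per_param

mistral_params = 7e9
optimal = chinchilla_optimal_tokens(mistral_params)  # 140B tokens
rumored = 8e12                                       # 8T tokens (rumor)
print(f"Chinchilla-optimal: {optimal / 1e9:.0f}B tokens")
print(f"Rumored dataset: {rumored / optimal:.0f}x past optimal")
```

If the rumor holds, Mistral trained nearly 60x past the compute-optimal point - exactly the "optimize for inference cost, not training cost" shift the launch post argues for.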

Finetunes. In the month since its release, Mistral 7B has revolutionized the open models landscape, most prominently with Zephyr 7B, HuggingFace’s finetune of Mistral based on the UltraChat dataset using Direct Preference Optimization (DPO), a simple alternative to PPO-based RLHF. This finetune pushes Mistral 7B over the edge to beat Llama 2 70B on MT Bench, an impressive feat for a 10x smaller model. The Nous Research community (which had a big coming out party at OpenAI DevDay) has also been very active, switching the Hermes base model to Mistral and beating previous Hermes 13B and 70B models. Naturally, there is a ton of interest in custom finetuning, and both Brev and WandB have been on top of it.
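Part of DPO’s appeal is that it collapses the reward-model-plus-PPO pipeline into a single classification-style loss on preference pairs. A minimal sketch of the per-pair objective (the log-probabilities below are toy numbers, not from any real model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected response;
    ref_logp_*: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# Toy numbers: the policy prefers the chosen answer more than the
# reference does, so the loss dips below -log(0.5) ~= 0.693.
loss = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
print(round(loss, 3))
```

Minimizing this pushes the policy to widen the chosen-vs-rejected margin relative to the reference model, with beta controlling how far it may drift.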

What does Open mean?

Though Mistral’s weights are truly no-BS open source licensed, their dataset is not open, and though a paper was published, it offered only red herrings on Sliding Window Attention. We don’t REALLY know why it performs so well and it seems the research community did not benefit from Mistral’s publication — so is Mistral ACTUALLY open or are we simply settling for licenses for weights being the bar for ‘open’?

In June, OSS Capital attempted to define an Open Weights foundation but it failed to catch on. In reality there is no incentive for companies to obey any standards but their own to open their data, since it only invites lawsuits and competition. There’s also very little incentive to open their model architecture and training process, except to the extent they want to allow inference and finetuning, as Mistral and Meta have done.

Ahead of AI
covered this trend in more detail in AI and Open Source 2023.

Updated chart from How Open Source eats AI, which we can’t believe was over a year ago today

Stanford to the rescue: The new Foundation Model Transparency Index, out of Percy Liang’s CRFM lab at Stanford, is the best step forward for open models because it defines openness standards SET BY PEERS:

Instead of a binary open/not open model imposed by the OSI or some other gatekeeper foundation, the FMTI scores models and labs on 100 points of transparency[1], allowing them to pick and choose degrees of transparency, but using the one thing they care about more than impotent open source complainers: actual peer pressure.

Now, perhaps, the open model movement has a chance to measurably improve and converge towards the maximum acceptable openness as collectively decided by the actions of model labs, not the masses with no skin in the game who will always want more openness than they are comfortable giving.

The AI Engineer Summit

October was also our first AI Engineer Summit, where we were so happy to meet many of you in person for the first time! We are still processing the many talks and workshops and off the record conversations from that event, but by far the most hotly debated talk was GitHub VP of Product Mario Rodriguez’s Day 2 Keynote, on Copilot’s product journey, philosophy, and $100m ARR milestone:

This is because it happened to coincide with a very poorly sourced press writeup claiming that Copilot loses $20/month per user, causing easily excitable engineers to do some very bad napkin math, despite repeated denials from GitHub executives (now formally confirmed by GitHub’s CEO at their Universe conference, though unfortunately without any further specifics).
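The napkin-math failure mode is worth making explicit: dividing total inference cost by total users ignores that usage is heavily skewed, so a small tail of power users can drag the average into the red while the median user is cheap to serve. A toy illustration (every number below is invented, not GitHub’s actual economics):

```python
# Toy illustration of why per-user averages mislead. All numbers are
# made up; this is not GitHub's actual cost structure.
price = 10.0                        # $/user/month subscription
costs = [2.0] * 90 + [200.0] * 10   # 90% light users, 10% heavy users
avg_cost = sum(costs) / len(costs)
median_cost = sorted(costs)[len(costs) // 2]
print(f"average cost ${avg_cost:.2f}/user -> margin ${price - avg_cost:+.2f}")
print(f"median  cost ${median_cost:.2f}/user -> most users are profitable")
```

On these invented numbers the *average* user loses money while 90% of users are comfortably profitable, which is why a single "loses $X/month per user" figure tells you very little.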

We are still editing the individual speaker videos and dripping them out, but in the meantime, catch up with the almost 30k developers who watched us live:

A separate roundup post is forthcoming on summit takeaways.

Latent Space is a reader-supported publication. To support our work, consider becoming a free or paid subscriber!

Latent Space News

  • Oct 5 — RAG Is A Hack - with Jerry Liu from LlamaIndex

  • Oct 7 — AIE Summit Preview #1 - Swyx on Software 3.0 and the Rise of the AI Engineer

  • Oct 8 — AIE Summit Preview #2 - The AI Horcrux — Swyx on Cognitive Revolution

  • Oct 14 — Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

  • Oct 19 — The End of Finetuning — with Jeremy Howard of Fast.ai

  • Oct 26 — Powering your Copilot for Data – with Artem Keydunov of Cube.dev

And last month’s recap if you missed it: The AI OS (Sept 2023 Recap)

News Items

The raw notes from which I draw everything above. You can always see the raw material on GitHub. Of course I can’t see everything, so check the other newsletters and podcasters that do.

OpenAI

  • Oct 4 - 1.0 release of Python SDK soon. Can begin beta testing now. changelog

  • Oct 6 - sama on Joe Rogan podcast

  • Oct 6 - Finetuning UI has end to end jobs, no code required (tweet - not our fault)

  • Oct 11 - OpenAI’s technology explained

  • Oct 19 - DALL·E 3 in ChatGPT Plus and Enterprise, together with research paper

  • Oct 20 - Tender offer at $86b valuation

  • Oct 26 - GPT4 knowledge cutoff updated to Apr 2023

  • Oct 26 - Preparedness Challenge and hiring

  • Oct 30 - slow rollout of GPT4 32k context with ‘all tools’ model

  • Third party: Collection of ChatGPT System Prompts including Voice and Custom Instructions

Frontier Model News

(other members of the Frontier Model Forum + Meta)

  • Oct 16 — Inflection Pi internet access & therapy mode

  • Oct 16 — Anthropic Claude is rolled out to 95 more countries

  • Oct 31 — All Metamates given GPT-4 access internally, while Google Gemini is still nowhere to be found

Did we miss something, or was October a super quiet month in non-OpenAI land?

Notable Fundraising

  • Anthropic

    • Google invests $500 million upfront and $1.5 billion over time in new round

      • after their $300m for 10% in Feb and participation in $450m round in May

      • after Amazon committed $1.25bn upfront, and up to $4bn in Sept

      • TheInformation speculates at a valuation of $20bn

    • yes, everyone is confused - this is corporate, not fund, “investing”

    • but this is in line with their $5b, 4 year plan to take on OpenAI

    • FTX stake is owned by bankruptcy estate

  • Modal Labs - $16m Series A with Redpoint (swyx is an advisor)

  • Anysphere - $8m Seed with OpenAI (we were the first podcast to cover Cursor)

  • Induced AI - $2.3m seed with Sam Altman and Peak15 (Sequoia India/SEA)

  • Mistral is rumored to be raising $400m at $2.5b valuation from a16z and General Catalyst

Open Models

We use t-shirt sizing as a shorthand to make models comparable within weight classes.

  • Most notable model

    • Mistral 7B paper

      • Mistral 7B outperforms Llama 2 13B on all tested benchmarks

      • Mistral Mission: "Our ambition is to become the leading supporter of the open generative AI community, and bring open models to state-of-the-art performance… Mistral AI will progressively and methodically release new models that close the performance gap between black-box and open solutions - making open solutions the best options on a growing range of enterprise use-cases."

      • “…The field has so far put the emphasis on scaling laws in 2 dimensions (directly associating model capabilities to training cost) the problem is rather 3 dimensional (model capabilities, training cost, inference cost), and much remains to be explored to obtain the best performance with the smallest possible model.”

      • introduces Sliding Window Attention

        • Pieter Abbeel: "convnets again?"

        • likely a red herring

      • no info on data at all

      • Mistral 7b Finetunes:

        • Zephyr 7b

          • Zephyr 7B beats Llama 2 70B on MT Bench

          • nathan lambert take

          • noncommercial license for now

        • Open Hermes 2

          • The Hermes 2 model was trained on 900,000 instructions; it surpasses all previous versions of Hermes 13B and below, and matches 70B on some benchmarks! Hermes 2 changes the game with strong multiturn chat skills and system prompt capabilities, and uses ChatML format.

        • CollectiveCognition v1

  • Large

    • Lemur-70B & Lemur-70B-Chat: Llama2 finetunes for Language Agents - The closest open model to GPT-3.5 on 15 agent tasks (tweet, paper)

  • Medium

    • Adept Fuyu-8B: Multimodal model (tweet) that powers Adept Workflows

    • Reka Yasa-1: Multimodal LLM that natively supports images, audio, and short video clips as inputs, with internal RAG deployment and code interpreter execution. (tweet, blog)

  • Small

    • Stable LM 3B: Bringing Sustainable, High-Performance LMs to Smart Devices (Stability AI)
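Sliding Window Attention, flagged above as a likely red herring, is at least simple to state: each token attends only to the previous w positions, and stacked layers widen the effective receptive field to roughly layers × w. A minimal mask sketch (window and sequence length are illustrative, not Mistral’s actual values):

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True iff position i may attend to position j:
    causal (j <= i) and within the last `window` positions."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# Row 5 attends to positions 3, 4, 5 only.
print([j for j, ok in enumerate(mask[5]) if ok])  # [3, 4, 5]
```

The practical win is memory: the KV cache per layer is bounded by the window size instead of growing with the full sequence length.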

Open source projects and templates

  • Daniel Gross’ LocalPilot: “In my experience, 7b isn't usefully fast enough for autocomplete on M1, but M2 Max is the punctuated equilibrium; it's suddenly good enough. (34b quantized models are fast enough for Q&A.)”

  • Headshot AI template

  • Llama 2 in C (HN/Github, Karpathy comment) - compare to Llama 2 in Mojo 🔥

  • Local LLM calculator: select an LLM and a GPU, and see if it can run locally

  • SlowLlama: Finetune llama2-70B and codellama on MacBook Air without quantization

  • Cloudflare AI projects

    • LlamaResearcher: personalized research assistant that sends summaries of papers of interest straight to your inbox

    • Audioflare: An all-in-one AI audio playground using Cloudflare AI Workers to transcribe, analyze, summarize, and translate any audio file.

  • Dadjokes: Mistral finetune on /r/dadjokes

  • Fast Whisper distributions

    • Whisper turbo - purely in browser (tweet context), using webgpu

    • Distil-Whisper: distilled version of Whisper that is 6 times faster, 49% smaller

  • Redpajama Dataset v2 - 30T tokens, but the real innovation is offering customizable filtering that is likely to be the new standard in open datasets (this is a step up from Eleuther’s The Pile)

  • Riffusion with Lyrics (HN) now generating short song segments with VOICE!

  • OpenLLMetry – OpenTelemetry-based observability for LLMs  (HN)
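The "Local LLM calculator" entry above boils down to one line of arithmetic: weights ≈ parameter count × bytes per weight, plus some headroom for the KV cache and runtime. A rough sketch (the 1.2x overhead factor is my guess, not the calculator’s actual formula):

```python
def fits_locally(params_b: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Rough check: do quantized weights (plus ~20% headroom for the
    KV cache and runtime) fit in the available (V)RAM?"""
    weight_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weight_gb * overhead <= vram_gb

# Llama 2 70B at 4-bit is ~35 GB of weights: too big for a 24 GB GPU.
# A 7B model at 4-bit (~3.5 GB) fits easily.
print(fits_locally(70, 4, 24))  # False
print(fits_locally(7, 4, 24))   # True
```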

Other launches

  • Stanford CRFM ecosystem graphs: tracks foundation models, upstream datasets, and downstream products

  • Langchain (friend of the pod)

    • LangServe: The best way to deploy your LangChains

    • Templates: deployable reference architectures for a wide variety of tasks (aka Langserve Hub)

  • Mojo 🔥 (friend of the pod! we are attending ModCon in Dec) is now available on Apple silicon Macs and has LLaMa.cpp level performance (Announcement, Performance thread)

  • Perplexity PPLX API: Mistral 7B, Llama2 13B, Code Llama 34B, and Llama2 70B models supported. notes from @danjl on Latent Space Discord:

    • currently included with perplexity pro, no $/token (for now? I'm assuming only in public beta, that won't scale)

    • really leans into OpenAI API compatibility. They actually just use the openai Python client; all you have to do is switch api_base, api_key, and model to point an application from OpenAI to Perplexity
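That drop-in compatibility amounts to pointing the same chat-completions payload at a different base URL. A stdlib-only sketch of the request construction (the endpoint path follows OpenAI’s API shape, and the model name is an assumption, not verified against Perplexity’s docs):

```python
import json

def chat_request(api_base: str, api_key: str, model: str, messages):
    """Build an OpenAI-style chat-completions request. Swapping
    api_base/api_key/model is the whole 'migration' to a compatible API."""
    url = f"{api_base}/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

url, headers, body = chat_request(
    "https://api.perplexity.ai", "pplx-...", "mistral-7b-instruct",
    [{"role": "user", "content": "hello"}])
print(url)
```

With the real openai client, the same switch is just setting api_base, api_key, and model before making calls.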

  • Show HN: Shortbread – Create AI comics in minutes (shortbread.ai)

    • great demo of what a nice stable diffusion 1.5 app can do today

  • SiFive Rolls Out RISC-V Cores Aimed at Generative AI and ML

    • Publicly discussed problems with SiFive - see our Chris Lattner pod

  • Midjourney released new 2x and 4x upscalers

  • Adobe Firefly 2 Image Model - 4MP output, depth of field control, and high-frequency details like skin pores. Features like Generative Match enable style transfer from reference images. (X)

  • Baidu claims their new model Ernie 4 rivals GPT4, but this can’t be verified (Announcement, Thread)

  • Defog Agents: AI Assistants for complex data workflows

  • Databricks MLflow 2.8 supports LLM-as-a-judge metrics - resulting in significant savings in time (from 2 weeks with human workforce to 30 minutes with LLM judges) and costs (from $20 per task to $0.20 per task)

  • Morph Code Index is an OSS semantic code search engine for you, your codebase, and your personal AI SWE.

Papers and Good Reads

  • Nathan Benaich's annual State of AI Report is out! lots of good charts.

  • Models

    • Efficient streaming language models with attention sinks (HN)

      • Attention Sinks: Use and maintain "attention sinks", initial tokens that the model focuses on.

      • Rolling Cache: Keep a rolling collection of recent tokens to optimize speed without sacrificing accuracy.

      • Placeholder Token: Add a special token during training to act as a dedicated attention sink, enhancing streaming deployment.
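The three bullets above amount to a KV-cache eviction policy: pin the first few "sink" tokens and keep a rolling window of recent ones. A minimal sketch of just the eviction rule (cache sizes are illustrative; the real method operates on per-layer key/value tensors, not token ids):

```python
def evict(cache: list, n_sinks: int, window: int) -> list:
    """StreamingLLM-style eviction: always keep the first n_sinks
    entries, plus the most recent `window` entries."""
    if len(cache) <= n_sinks + window:
        return cache
    return cache[:n_sinks] + cache[-window:]

tokens = list(range(10))                   # token positions 0..9
print(evict(tokens, n_sinks=2, window=4))  # [0, 1, 6, 7, 8, 9]
```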

    • Think before you speak: Training Language Models With Pause Tokens (HN)

      • adding up to 10 "pause tokens" lets models improve reasoning - tested up to 1B params on C4

      • seems similar to the backspace token paper

    • Expressive text-to-image generation with rich text (HN) - being able to modify generated images by modifying text using fonts and text colors. going from words -> token maps (masks).

      • Notable criticism of this method over regular prompting

    • FontoGen - The model takes a font description as an input, and produces a font file as an output.

    • NEFTune - a "one simple trick" to get higher quality finetunes by adding noise (Thread, Github)

  • Prompting

    • Emotional jailbreaks work. https://arxiv.org/abs/2307.11760 "This is very important to my career" and the dead grandma jailbreak

    • Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: 94.4 on HumanEval with gpt4, 86.9 on HumanEval with gpt3.5

    • Large Language Models as Analogical Reasoners from Deepmind

      • When given a task, the LLM is prompted to: first, create relevant examples (problems and their solutions) for the task; then, use these examples as guidance to solve the main task.
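Mechanically, that is a two-part prompt: ask the model to generate related worked examples first, then solve with those examples in context. A sketch of the template (wording is mine, not the paper’s):

```python
def analogical_prompt(problem: str, n_exemplars: int = 3) -> str:
    """Build a single prompt asking the model to self-generate
    exemplars before solving. Template wording is illustrative."""
    return (
        f"Problem: {problem}\n\n"
        f"First, recall {n_exemplars} relevant problems and their "
        "solutions as worked examples.\n"
        "Then, using those examples as guidance, solve the problem "
        "step by step."
    )

print(analogical_prompt("What is the area of a circle of radius 3?"))
```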

    • using Midjourney and GPT4 to code an Angry Birds clone

  • RAG

    • MemGPT - LLMs as operating systems (twitter, site, paper, HN)

    • MemWalker - LLM as an interactive agent, allowing it to decide how to read the text via iterative prompting. (Paper)

      • We introduce MemWalker, a method that first processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.

      • 1. Build a data structure (memory tree) 2. Traverse it via LLM prompting Outperforms long context, retrieval, & recurrent baselines. (1/n) (tweet)
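A toy version of one level of that descent: summarize each segment, then pick the segment whose summary best matches the query. Here "summary" is just the segment’s first words and matching is word overlap; the real method uses LLM prompting for both steps and repeats the descent down a tree:

```python
def summarize(text: str) -> str:
    return " ".join(text.split()[:5])   # stand-in for an LLM summary

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def pick_segment(segments: list, query: str) -> str:
    """Choose the segment whose (toy) summary best matches the query."""
    return max(segments, key=lambda s: overlap(summarize(s), query))

docs = ["the cat sat on the mat all day",
        "quarterly revenue grew nine percent in Europe"]
print(pick_segment(docs, "how did revenue grow?"))
```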

    • RAG vs Long Context (tweet, paper): we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation.

    • Vector database is not a separate database category (HN)

    • Vec2Text: Text embeddings can be inverted (twitter, paper)

      • "These results imply that text embeddings present the same threats to privacy as the text from which they are computed, and embeddings should be treated with the same precautions as raw data."

      • "Vec2Text is trained to invert two state-of-the-art embedding models: GTR-base (Niet al., 2021), a T5-based pre-trained transformer for text retrieval, and text-embeddings-ada-002 available via the OpenAI API"

      • "Vec2Text is able to recover 94% of first names, 95% of last names, and 89% of full names (first, last format) while recovering 26% of the documents exactly."

    • Choosing vector database: a side-by-side comparison (HN)

      • "Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.

      • They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.

      • Vectors, IMO, are just one feature that a regular search engine should have. IMO currently Vespa does the best job of this, though lately it seems Lucene (Elasticsearch and Opensearch) are really working hard to compete"

    • Vespa.ai is spinning out of Yahoo as a separate company. People speak highly of Vespa; it is targeting search and recsys problems rather than specifically vector db problems

  • Evals

    • Evaluating LLMs is a minefield

      • the reports of ChatGPT having a liberal bias were a result of oversensitive prompts

      • GPT4 passing the bar exam and USMLE is a sign of contamination

        • "OpenAI’s method to detect contamination is superficial and sloppy"

  • Efficiency

    • Running Stable Diffusion XL 1.0 in 298MB of RAM

      • OnnxStream is based on the idea of decoupling the inference engine from the component responsible of providing the model weights, which is a class derived from WeightsProvider. A WeightsProvider specialization can implement any type of loading, caching and prefetching of the model parameters. For example a custom WeightsProvider can decide to download its data from an HTTP server directly, without loading or writing anything to disk (hence the word "Stream" in "OnnxStream"). Two default WeightsProviders are available: DiskNoCache and DiskPrefetch.
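The design described in that quote is dependency inversion applied to weight loading: the inference engine depends on an interface, and subclasses decide where the bytes come from. A Python rendering of the pattern (class names mirror the quote; the method name and the InMemory helper are my own for illustration):

```python
from abc import ABC, abstractmethod

class WeightsProvider(ABC):
    """The inference engine depends only on this interface; subclasses
    decide how weights are fetched (disk, HTTP stream, cache, ...)."""
    @abstractmethod
    def get(self, name: str) -> bytes: ...

class DiskNoCache(WeightsProvider):
    def __init__(self, root: str):
        self.root = root
    def get(self, name: str) -> bytes:
        with open(f"{self.root}/{name}", "rb") as f:
            return f.read()

class InMemory(WeightsProvider):       # hypothetical, handy for tests
    def __init__(self, tensors: dict):
        self.tensors = tensors
    def get(self, name: str) -> bytes:
        return self.tensors[name]

def run_layer(weights: WeightsProvider, name: str) -> int:
    return len(weights.get(name))      # stand-in for real compute

print(run_layer(InMemory({"w0": b"\x00" * 8}), "w0"))  # 8
```

Because the engine never touches the filesystem directly, a streaming provider can fetch weights over HTTP layer by layer, which is what keeps peak RAM so low.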

  • Agents

    • OpenAgents: An Open Platform for Language Agents in the Wild

      • open replication of ChatGPT Plus tools, based on their own comparison table (see chart), using: 1. Data agent (data analysis and tools); 2. Plugin agent (200+ APIs from RapidAPI and the OpenAI plugin store); 3. Web agent (Chrome debugger API)

      • from the Executable Language Grounding (XLang) Lab! We are part of the HKU NLP Group at the University of Hong Kong.

      • didn’t find a lot of insight while reading through; too much marketing in the paper

  • Multimodality

    • Multimodality and Large Multimodal Models (Chip Huyen)

    • Multi-modal prompt injection image attacks against GPT-4V (simonwillison.net)


    • Ferret: Refer and Ground Anything Anywhere at Any Granularity - nice attempt at Open GPT4-V, and has a nice GRIT dataset others can use

    • Meta released MetaCLIP - fully OSS replication of CLIP pipeline

      • Paper: Demystifying CLIP Data

  • Learning

    • This page contains interactive charts for exploring how large language models represent truth https://saprmarks.github.io/geometry-of-truth/dataexplorer/

    • Language Modeling is Compression from DeepMind echoes what Ilya Sutskever said in a recent talk

    • Large Language Models in 2023 (tweet, recorded talk) from Hyung Won Chung, OpenAI & Google Brain

      • emergence (Wei et al) is still underappreciated

      • Perspective of "yet": "This idea doesn't work" -> "This idea doesn't work YET"

        • Document experiments that failed because of insufficient “intelligence”

        • Do not declare failure yet and make it easy to rerun in the future

        • As soon as the new model comes out, rerun them

        • Learn what works and what doesn’t

        • Update your intuition on emergent abilities and scale

      • We need post-training

        • instruction tuning - FLAN

        • Reward Model training

        • Policy model training

      • bitter lesson

        • Many Transformer variants have been proposed but almost all fancy variations don’t scale well

        • More useful to abstract away Transformer as sequence of functions and think about input and output shapes and types

  • Misc

    • LLMs overview from Flyte

      • OpenAI is too cheap to beat - "At $366K ($166K AWS + $200K talent), we’re paying around $80 per-fine-tuning run, about 8-20x higher than what we’re paying OpenAI!"

        • Open, general-purpose LLM companies might not be viable

    • The Killer Use Case for LLMs Is Summarization (HN)

    • PaLI-3 Vision Language Models (arxiv.org)

    • TimeGPT - some skepticism in HN commentary

    • Safety Summit debates blew up - we don’t focus on those in Latent Space

      • Biden executive order vs Andrew Ng tweet

    • The New Yorker on the future of training methods (HN)

Memes


[1] Although it must be observed that any system that ranks GPT4 as the 3rd most open LLM is a little suspect.
