Latent Space
Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
How to train your own Large Multimodal Model — with Hugo Laurençon & Leo Tronchon of HuggingFace M4

The journey from LLMs to Large Multimodal Models (LMMs), from BLOOM-175B to the 9B-80B IDEFICS model & OBELICS dataset, and an exclusive peek into HuggingFace Research & the exploding Paris AI scene.

Latent Space is heating up! Our paper club ran into >99 person Discord limits, oops.

We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezone (led by Ivan), and AI in Action: hands-on application of AI (led by KBall).

To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar).

In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 70B - now dead?) and a pre-existing, CLIP-style vision encoder and training a “glue” adapter layer between them, inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA.

However (for reasons we discuss in today’s conversation), DeepMind’s Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale:

from Victor Sanh, coauthor on IDEFICS

The effort started in March 2023, and the models were released in August 2023.

We happened to visit Paris last year, and visited HuggingFace HQ to learn all about HuggingFace’s research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) “LMM” people. In other words:

yes, today we’re going to learn how to train our own multimodal models!

What is IDEFICS?

IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes. As an attempt to re-create an open-access version of Flamingo, it seems to track the original very well on a range of multimodal benchmarks (which we discuss in the pod):

You can see the models' ability to reason over a combination of interleaved images + text: users can have them describe images, answer questions about the images, or extend/combine the images into new creative works (e.g. poetry).

Example 1: multiturn conversation about an image
Example 2: one-turn creative text generation on 2 images interleaved with text

📷 From IDEFICS’s model card and blog post

The demo screenshots above actually show the fine-tuned instruct versions of IDEFICS, which are also available in 9B and 80B sizes.

IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above.
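To make the "glue" idea concrete, here is a minimal pure-Python sketch of the gated cross-attention used in the Flamingo paper, where text vectors attend over image vectors and the result is added back through a tanh gate that starts at zero. This is an illustration under simplifying assumptions, not the IDEFICS implementation: it uses a single head and no learned projections, and the names `cross_attend` and `gated_glue` are made up for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_vecs, image_vecs):
    """Each text vector attends over all image vectors.

    Simplification (assumption): the raw vectors serve as queries, keys
    and values, whereas a real model uses learned projections.
    """
    d = len(image_vecs[0])
    out = []
    for q in text_vecs:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in image_vecs
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, image_vecs)) for j in range(d)]
        )
    return out


def gated_glue(text_vecs, image_vecs, gate):
    """Residual update: text + tanh(gate) * cross_attention(text, image)."""
    attended = cross_attend(text_vecs, image_vecs)
    g = math.tanh(gate)
    return [
        [t + g * a for t, a in zip(tv, av)]
        for tv, av in zip(text_vecs, attended)
    ]


text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5], [2.0, -1.0]]

untouched = gated_glue(text, image, gate=0.0)  # gate closed: LLM sees text only
mixed = gated_glue(text, image, gate=1.0)      # gate open: image info mixed in
```

With the gate at zero the frozen language model's inputs pass through unchanged, which is why this kind of adapter can be bolted onto a pretrained LLM without disturbing it at the start of training.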

OBELICS: a new type of Multimodal Dataset

IDEFICS’ training data used the usual suspect datasets, but to reach parity with Flamingo the team needed to create a new dataset.

Enter OBELICS: “An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”:

  • 115B text tokens

  • 141M English documents

  • 353M images

These documents were carefully curated and filtered from Common Crawl dumps between February 2020 and February 2023. We discuss the 2 months of mind-numbing, unglamorous work that went into creating this pipeline:

There are a lot of mentions of ‘multimodal web documents’, which deserve some explanation. We’ll show you instead of telling you:

Plenty of datasets look like the left side (isolated image-text pairs). The right side (a full web document with images interleaved in the text) is far more informative for visual QA, and was OBELICS’ focus.
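For readers who think in data structures, here is a small sketch of the difference. It assumes an interleaved document is stored as two parallel lists, `images` and `texts`, where each position holds exactly one of the two; that layout, the `<image>` placeholder and the `interleave` helper are assumptions made for illustration, and the URLs and sentences are invented.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking an image slot


def interleave(doc):
    """Flatten an interleaved document into (prompt, image_urls).

    Assumes doc["images"] and doc["texts"] are parallel lists in which
    each position holds exactly one non-None entry.
    """
    images, texts = doc["images"], doc["texts"]
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    parts, image_urls = [], []
    for image, text in zip(images, texts):
        if (image is None) == (text is None):
            raise ValueError("each position needs exactly one of image or text")
        if image is not None:
            parts.append(IMAGE_TOKEN)
            image_urls.append(image)
        else:
            parts.append(text)
    return " ".join(parts), image_urls


# An image-text pair: one image, one caption, no surrounding context.
pair_example = {
    "image": "https://example.com/cat.jpg",
    "caption": "A cat on a sofa.",
}

# An interleaved web document: images appear where they sat in the page.
doc_example = {
    "images": [None, "https://example.com/cat.jpg", None,
               "https://example.com/dog.jpg", None],
    "texts": ["Our cat Miso loves the sofa.", None,
              "The dog prefers the garden.", None, "They rarely share."],
}

prompt, urls = interleave(doc_example)
print(prompt)
```

The interleaved form keeps the text before and after each image, so a model trained on it sees images in context rather than next to a single caption.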

You can see from this graph that OBELICS ends up outperforming an image-text pairs dataset (LAION in this case) when the two are compared head-to-head.

You can view a subset of OBELICS and perform visualizations on them here:

Nomic Labs Visualization tool

2024 Update: WebSight et al

Most of this interview was recorded on Halloween 2023 at HuggingFace’s headquarters in Paris:

We also held a smol HuggingFace/Latent Space (HuggingSpace?) meetup!

The interview was recorded in anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS code samples representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for screenshot-to-code workflows such as Vercel’s V0 or TLDraw, and will be part of the dataset for IDEFICS-2.
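As a rough idea of what "rendered with Playwright" can look like, here is a minimal sketch that pairs an HTML string with a screenshot using Playwright's Python sync API. It is not the WebSight pipeline: the function names, the viewport size and the record keys `code` and `screenshot` are assumptions for illustration, and running the renderer requires Playwright and a Chromium build to be installed.

```python
from pathlib import Path


def render_with_playwright(html, out_path, width=1280, height=720):
    """Render an HTML string in headless Chromium and save a screenshot."""
    # Imported lazily so the rest of the module works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html)
        page.screenshot(path=str(out_path), full_page=True)
        browser.close()


def make_pair(html, out_path, render=render_with_playwright):
    """Build one (code, screenshot) training record.

    The record keys are hypothetical, not the dataset's actual schema.
    """
    out_path = Path(out_path)
    render(html, out_path)
    return {"code": html, "screenshot": str(out_path)}
```

Calling `make_pair("<h1>Hello</h1>", "hello.png")` would write the screenshot and return the record. Passing the renderer as a parameter keeps the pairing logic testable without a browser.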

As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time putting it to use.


  • [0:00:00] Intro

  • [0:00:00] Hugo, Leo’s path into multimodality

  • [0:09:16] From CLIP to Flamingo

  • [0:12:54] Benchmarks and Evals

  • [0:16:54] OBELICS dataset

  • [0:34:47] Together RedPajama v2

  • [0:37:12] GPT4 Vision

  • [0:38:44] IDEFICS model

  • [0:40:57] Query-Key Layernorm for training

  • [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SigLIP

  • [0:49:02] IDEFICS v2

  • [0:52:39] Multimodal Hallucination

  • [0:59:12] Why Open Source Multimodality

  • [1:05:29] Naming: M4, OBELICS, IDEFICS

  • [1:08:56] 2024 Update from Leo

Show Notes
