Inference, Fast and Slow

When System 1/System 2 analogies are not enough: The 6 types of LLM inference

Nov 05, 2024


We are experimenting with a “screenshot essay” series for Latent Space supporters, with visuals we find helpful for putting things in context. Let us know your feedback!


o1 made it consensus that thoughtful, slower “System 2” models are a great complement to the standard fastest-time-to-first-token “System 1” models. However, AI-engineered systems have more needs than just “fast” and “slow,” and we can benefit from a useful mental model of the tradeoffs between cost, speed, and intelligence.

With the release of OpenAI’s new Predicted Outputs API, which lets you submit a draft response and then leverages prompt-lookup/lookahead speculative decoding for a 2x-5.8x faster response at ~50% higher token costs, 2024 has now produced a full map of inference paradigms for the discerning AI engineer:
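For concreteness, here is a minimal sketch of what Predicted Outputs looks like with the openai Python SDK, assuming its `prediction` parameter and a `gpt-4o`-class model; the file name and edit request are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# The draft response we already have -- e.g. the current contents of a file
# we want the model to lightly edit, so most tokens should survive unchanged.
draft = open("app.py").read()  # hypothetical file

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rename the variable `user` to `account` in this file:\n\n" + draft,
        }
    ],
    # Predicted Outputs: the API treats the draft as a speculative-decoding
    # reference, cheaply accepting the spans that match the model's output.
    prediction={"type": "content", "content": draft},
)

print(response.choices[0].message.content)
```

The speedup comes from matching prediction tokens being verified in parallel rather than generated one at a time, while the ~50% token-cost premium reflects that rejected prediction tokens are still billed, so this pays off best when draft and final output overlap heavily.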

Update (Apr 2025): there’s also now Flex pricing, which is effectively a shorter-latency version of Batch.
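Flex applies batch-like pricing to the synchronous API. A minimal sketch, assuming the `service_tier` parameter and a Flex-eligible model (the model choice and prompt are assumptions):

```python
from openai import OpenAI

client = OpenAI(timeout=900)  # Flex responses can be slow; allow a long timeout

response = client.chat.completions.create(
    model="o4-mini",        # assumption: a Flex-eligible model
    service_tier="flex",    # batch-like pricing; slower, may hit capacity limits
    messages=[{"role": "user", "content": "Summarize the tradeoffs of batch inference."}],
)

print(response.choices[0].message.content)
```

In exchange for the lower price you accept slower responses and occasional capacity errors, so retry logic matters more here than on the default tier.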

Plotting these use cases on a rough, indicative price-latency tradeoff curve, we get a nicely kinked chart.
