Llama 2: The New Open LLM SOTA (ft. Nathan Lambert, Matt Bornstein, Anton Troynikov, Russell Kaplan, Whole Mars Catalog et al.)

Latent Space: The AI Engineer Podcast


>2000 AI Engineers joined our emergency Space with Simon Willison and Alex Volkov. Discussing datasets, benchmarks, scaling laws, kremlinology, and predictions. Come for the Llama, stay for the drama!

Nathan Lambert, Simon Willison, Alex Volkov, and 2 others

Jul 19, 2023

As first discussed on our May Emergency pod and leaked 4 days ago, Llama (renamed from LLaMA) was upgraded to Llama 2 (pretraining on 2 trillion tokens with 2x the context length - bigger than any dataset discussed in Datasets 101, and adding ~$20m of RLHF/preference annotation) and released for commercial use on 18 July.

It immediately displaced Falcon-40B as the leading open LLM1 and was immediately converted/quantized to GGML and other formats. Llama 2 seems to outperform all other open source models in their equivalent weight class:

We covered many of these in Benchmarks 101

Why are open models important? The intersection of Open Source and AI is one of the oldest themes on this publication, and there has been a raging debate on the security and reliability of the OpenAI models and APIs. Users have reported GPT-4’s quality going down (a claim that has been denied and denied, though as of today it has some supporting data from Databricks) and have complained about API reliability and rapid deprecation schedules. Last, and surely most important, there are entire classes of businesses and government/healthcare/military organizations that categorically cannot send any of their sensitive data to an external API provider, even if it is OpenAI through Azure. The only way to have total control is to own and serve your own models, and Llama 2 now pushes forward the state of the art there (your own GPT-3.5-quality model, though it is nowhere near Claude 2 or GPT-4).

As we do with breaking news, we got on to Twitter Spaces again to chat with two scheduled guests:

- Nathan Lambert, ML Researcher at Hugging Face and author of Interconnects, who had the best summary of the Llama 2 paper
- Matt Bornstein, organizer of the a16z infra team that launched Llama2.ai (source here) and has been coding up a storm with AI demo apps, unusual for VCs

as well as Anton Troynikov of Chroma, Russell Kaplan of Scale AI, and Omar Qazi of the Whole Mars Catalog.

Enjoy!

Show Notes

Official links
Where to try
- Llama2.ai (source), Perplexity Llama Chat
- Live playground/API on Replicate, deploy all versions on Baseten
- https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
- Dev ports - simonw llm-replicate, ggml using llama.cpp (7B, 13B) or pinokio, ollama, Core ML port
Timeline
- 24 Feb - LLaMA 1 announced
- 6 May - our No Moats podcast - first mention of Zuck opening up Llama
- 14 July - Llama 2 leaked
- 18 July - Llama 2 announced
Community notes
- Nathan’s research paper recap
- 638 LOC, 4 dependencies
- Usage restrictions - MAU restriction, derivative models
- Grouped Query Attention
- System prompt
- 2 trillion token dataset
- >$20m price tag (rlhf, jimfan),
- Separate models for safety and helpfulness (jimfan)
- Mistral AI founders left out of paper
- Interesting fails:

Timestamps

[00:02:30] Introducing the speakers
[00:03:32] Nathan Lambert intro
[00:04:48] General Summary of Llama 2
[00:05:57] Sarah Silverman killed Dataset Transparency?
[00:08:48] Simon's Recap of Llama 2
[00:11:43] Matt's Intro
[00:12:59] a16z Infra's new AI team?
[00:15:10] Alessio's recap of Llama 2
[00:17:26] Datasets 101 Followup
[00:18:14] Context Length 4k
[00:20:35] Open-ish Source? Usage Policy and Restrictions
[00:23:38] Huggingface Responsible AI License
[00:24:57] Pretraining Llama 2 Base Model beyond Chinchilla
[00:29:55] Llama 2 is incomplete? Race to publish
[00:31:40] Come for the Llama, stay for the (Meta) drama
[00:33:22] Language Translation
[00:35:10] Llama2's coding abilities
[00:35:59] Why we want to know about the training data
[00:37:45] The importance of Meta pushing forward Truly Open AI
[00:40:59] Llama 2 as Enabler of Startups
[00:43:59] Where you can try Llama 2
[00:44:25] Do you need dataset transparency if you have evals?
[00:45:56] >$20m cost of Llama 2 is primarily preference data collection
[00:48:59] Do we even need human annotators?
[00:49:42] Models Rating Models
[00:53:32] How to get Code preference data
[00:54:34] Llama 2 Finetuning Ecosystem
[00:56:32] Hey Apple: Llama2 on Metal pls
[00:57:17] Llama 2 and Chroma
[01:00:15] Open Source MoE model?
[01:00:51] Llama 2 using tools
[01:01:40] Russell Kaplan on Scale AI's Llama 2 plans
[01:03:31] Scale annotating code?
[01:04:36] Immortality
[01:04:59] Running Llama on your phone
[01:06:54] Sama <3 Satya <3 Zuck? "Azure as Launch Partner"
[01:10:58] Meta "Open Source" Leadership
[01:11:56] Prediction: Finetuning => New Use Cases from Internal State
[01:13:54] Prediction: Llama Toolformer
[01:14:39] Prediction: Finetune-for-everything
[01:15:50] Predictions: Llama Agents
[01:16:35] dP(Doom)?
[01:19:21] Wrapping up

Transcript

[00:00:00] Introducing the speakers

[00:00:00] Alessio Fanelli: There's not a single dull day in this space. I think when we started the podcast in January, a lot of people asked us, how long can you really do this? Just focusing on AI research and, and models. And I think the, the answer is clear now. A long time. So excited for this and excited to have Simon again.

[00:00:16] You're basically an honorary guest host of all of our Twitter spaces. Cool. Thank you.

[00:00:21] Simon Willison: No, it's great to be here again.

[00:00:23] Alessio Fanelli: And Nathan, thanks for joining us. I actually shared your writeup on Llama 2 technical details with swyx this morning. So it's great to have you here to dive into some of the details.

[00:00:33] Nathan Lambert: Yeah, sounds good. As is probably clear, Hugging Face was trying to collaborate on releasing the model on the platform. So we ended up getting some early details, which made it a lot easier for me to cram study before the chaos hit.

[00:00:48] Alessio Fanelli: No, that's great. It's kind of what happened with the code interpreter episode, when Sean and I had access for about five hours and Simon was like, I've been playing with this for weeks, and had all the inside scoops.

[00:00:59] So I think this will be a, a good episode.

[00:01:02] Nathan Lambert intro

[00:01:02] Alessio Fanelli: Maybe Nathan, you just want to give people a little bit of background on what you do at Hugging Face and, yeah, your experience with the Llama 2 kinda preview.

[00:01:12] Nathan Lambert: Yeah. So I've been a researcher and helping lead reinforcement learning from human feedback efforts at Hugging Face, which really means I do some research and I try to figure out how to fine-tune models to do what people want.

[00:01:26] Generally we're trying to operate at a scale a little bit smaller than what Meta is doing, cuz we obviously don't have that kind of resources at a startup. So I do a lot of technical research and also try to actually engage and communicate that with the community. And specifically on Llama, I think I was most interested in kind of the research side.

[00:01:48] I think the paper is a phenomenal artifact and it's clear that the model is really strong in a lot of areas. And then kind of the big picture trends of where open source is going. Like this is a clear step in a direction that a lot of people wanted, but weren't sure if it was gonna happen. Yep.

[00:02:04] Alessio Fanelli: What are some of the things that stood out to you?

[00:02:06] I think a lot of the AI engineer audience that we have is not as deep into the details of the papers. We'd love to get a read from somebody like you, who's much deeper at a, you know, model research level.

[00:02:18] General Summary of Llama 2

[00:02:18] Nathan Lambert: Yeah. It's like, where do I start? So I think as a general summary, the paper includes a lot of details on methodology. So like, what are the things that they did in their stack to build, to actually run this? And it misses a lot of details on what a specific data set actually looks like. It's clear that they have a really fine-tuned data set and they paid a lot of money for these data sets.

[00:02:46] It seems like now both Surge and Scale, which are two of probably the biggest data labeling firms, are claiming some part in it, which I find hilarious, cause it's really unclear. So Meta took the approach of starting with open source preference data and then added a lot onto it.

[00:03:04] And the most interesting part to me on this preference data, which is a new technical approach, is they trained two preference models, two reward models, one toward making the model helpful and one for making the model safe. And then in terms of open source models, it's clearly more performant on kind of ground truth benchmarks and then it's safer.
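For readers who want the reward-model idea above in concrete form, here is a minimal Python sketch of the pairwise ranking loss commonly used to train a preference model from (chosen, rejected) response pairs. The function name, the optional margin, and the example scores are illustrative assumptions; the conversation does not spell out Meta's exact training setup. Under this framing, a helpfulness reward model and a safety reward model would each be trained with the same kind of loss on their own preference data.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """Return -log(sigmoid(reward_chosen - reward_rejected - margin)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them backwards.
    """
    diff = reward_chosen - reward_rejected - margin
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical scalar scores from a reward model for one preference pair.
agree = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagree = pairwise_ranking_loss(reward_chosen=-1.0, reward_rejected=2.0)

print(f"model agrees with annotator:    loss = {agree:.4f}")
print(f"model disagrees with annotator: loss = {disagree:.4f}")
```

The only thing that would differ between the two reward models in this sketch is the preference data each one sees.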

[00:03:27] Sarah Silverman killed Dataset Transparency?

[00:03:27] swyx: That's where I was

[00:03:28] Simon Willison: gonna wrap up to clarify, right. This is a big difference from the first LLaMA paper. Cause the first LLaMA paper was so detailed in terms of how the training data worked that people were able to essentially replicate it. And so you're saying that in this new paper, there's much less transparency as to how the training worked

[00:03:45] Nathan Lambert: on the data side.

[00:03:46] Yeah, I think they did a lot of new methodological things too, so they took the time to explain that; it's not as much of a data-focused paper. There's no table that is like, this is what the distribution of pre-training data came from. I would guess that it's a similar data set to the original LLaMA. One of the details that's really interesting is that they mention they up-weight high-factuality content.

[00:04:14] So things that probably seem like Wikipedia. It seems like they're doing some sort of up-ranking during base model training, but they did some type of thing they didn't detail

[00:04:24] swyx: because it's also

[00:04:25] Simon Willison: worth mentioning, I mean, they're being sued right now by Sarah Silverman of all people. I mean, it's one of the many lawsuits flying around, but there's a lawsuit specifically over the training data involved in the first LLaMA, because one of the things that went into that was this data set called Books3, and Books3 is like 190,000 pirated eBooks, like the full text of all of the Harry Potter novels, things like that.

[00:04:45] Which, yeah, that's very difficult to say that that's not extremely copyrighted data. So I wonder if that's part of the reason they've been less transparent this time round is that, you know, it got them in trouble last time.

[00:04:57] Nathan Lambert: Yeah. One of my colleagues on kind of the Ethics and Society team side immediately pointed out that "publicly available data" is the phrase often used in the paper, but that does not mean that it's free from copyright issues and/or terms of service issues.

[00:05:11] It means that I could go on a computer and download it.

[00:05:13] Simon Willison: Right. If you, if you scrape the entire internet, very little of that stuff is actually like public domain.

[00:05:21] Nathan Lambert: Yeah. And I think, without going down kind of a social issues rabbit hole right now, the notion of public is being extremely strained by AI and changing communication practices. And it's just like kind of those things where it's like, oh, okay, here we go.

[00:05:36] And they also use words like democratize, and they have these sentences in the paper that are extremely value-laden, which is like the carbon footprint of our model. And releasing this is good because it'll mean a lot of people don't have to train models and burn more CO2 in the future. And it's like, okay, Meta, like, what?

[00:05:53] Where are you going with

[00:05:54] swyx: this? Yeah. Perhaps before we go too deep into the issues, cuz we have lots to talk about, I would also want to get a high level overview from Simon and from Matt, who's also just joined us from a16z. So maybe Simon, you wanna go first with like, just recap for everybody what you think the relevant details are about Llama 2, and then we'll talk about Matt's stuff.

[00:06:18] Simon's Recap of Llama 2

[00:06:18] swyx: Yeah.

[00:06:19] Simon Willison: So, yeah, I mean, the headline here is that Llama 2 has been released and Meta kept their promise of doing a version of Llama that is usable for commercial purposes, which is so big, because LLaMA itself came out at the end of February, and so many models have been released on top of that.

[00:06:37] So, models like Vicuna, which was a fine-tuned LLaMA, all of them came with the same "not usable for commercial purposes" warning. So now we've got a really high quality foundation model that we are allowed to use commercially. I think the amount of innovation we're gonna see over the next few weeks is just going to explode.

[00:06:54] You know, I feel like this is monumental on that front. In terms of quality, I never know how to interpret these benchmarks. The benchmarks all look good. You know, the claims are it's a bit better than LLaMA, it's competitive with ChatGPT, GPT-3.5, et cetera, et cetera. I have no reason to disbelieve that, but it always takes quite a while with these new models to get a feel for them.

[00:07:13] You have to spend time with them to really feel like, is it trustworthy as a summarizer, all of those kinds of things. My hunch is that it is gonna turn out to be extremely good. Like, I doubt that it'll turn out to be sort of a damp squib on that front. But yeah, so they've released it.

[00:07:30] It's available commercially and you are allowed to redistribute it, but the only way to officially get the weights is to fill in a form on their website and wait for them to approve you still, which is kind of stupid because obviously it's already started leaking. I downloaded a version onto my laptop this afternoon, which worked.

[00:07:47] There's a GGML version from TheBloke that's floating around on Hugging Face already, so, you know, within 24 to 48 hours, I think every possible version of this thing will be available to download without going through a waiting list. I'm almost not sure why they even bother with that.

[00:08:03] Especially since, you know, LLaMA leaked within a few days last time and somebody ended up submitting a pull request to the GitHub README with a link to the BitTorrent for the LLaMA models, which Facebook didn't delete. You know, they kind of like nodded and winked and said, yeah, this is what you can do.

[00:08:20] And now it's even legitimately okay to do it because the license says you can. But anyway, it's out there. You can run it on your computer right now, today. It's also hosted in a bunch of places. Andreessen Horowitz sponsored the version of it that's available on Replicate, although you actually do have to pay for that.

[00:08:37] I noticed that I built up 26 cents in Replicate charges already playing around with that model. So it's available via API or you can run it on your own machine and, you know, it's open season. So start poking around with it and seeing what it can do.

[00:08:52] swyx: It's open season.

[00:08:53] Speaking of Andreessen, yes, Matt. Hey.

[00:08:56] Matt Bornstein: Hey. Hey everyone. Thank you for having me. And Simon, if you wanna send me a Venmo request for 26 cents, I'll, I'll happily reimburse you.

[00:09:02] Simon Willison: Absolutely. Yeah.

[00:09:04] Matt Bornstein: We, we may lose about $3 on the transaction fee, but I think it'd be worth it

[00:09:09] swyx: just to throw in a term sheet in there for Datasette.

[00:09:11] Nathan Lambert: You're good?

[00:09:13] Matt's Intro

[00:09:13] Matt Bornstein: No, I'm a huge Datasette fan. And, you know, we've followed Simon's work for quite a while, and Nathan, it's great to have a chance to share a stage with you. I think folks probably saw we, you know, released a bunch of sort of, you know, VC versions of evaluations. You know, we're way less smart than Nathan and Simon and a bunch of folks in the space here.

[00:09:33] But using just sort of the "does it feel good" approach, and trying to get a fairly representative sample across different types of prompts, the model seems very good. We were playing a lot with 13B and we're playing now with 70B, and it really does give you kind of very fast GPT-3.5-level responses to some questions.

[00:09:54] I think Simon's point about benchmarks is very well taken. It's hard to know how to interpret those. So we sort of go for the direct version, and for creative tasks especially, it seems very good so far. So a lot of what we're doing is just trying to get it out there as much as possible and as fast as possible.

[00:10:11] You know, I I think we should all be incredibly, you know, appreciative that Meta is doing this and it, and it's not, you know, maybe quite perfect, you know, for some of the reasons that folks are are talking about. But you know, I think it's gonna be a huge unlock in open source LLMs and, and we're trying to, you know, just sort of support the community as much as possible.

[00:10:29] a16z Infra's new AI team?

[00:10:29] swyx: Yeah, I have to say, you guys are doing a bang-up job recently. So this is a big team effort, right? Like, I see that there's a number of names from your team, just essentially building projects and then collaborating on this demo. Could you describe what Andreessen's sort of involvement has been so far, and like, yeah.

[00:10:50] What, what, what is the scope of this? Yeah.

[00:10:53] Matt Bornstein: You know, we all applied for, you know L three engineer jobs and, and got turned down by all the, all the big tech firms. So we thought, hey, you know, we'll, we'll just do it our ourselves. Yeah. Look, I think, and this might be a little controversial, your average venture capitalist doesn't do any real work, and I completely include myself in this category, you know?

[00:11:14] Allocating resources to support teams is, is important. It's an important function in the economy, but it's, it's what you might call indirect work, which is you're supporting someone else doing something. You know, we just sort of made the decision when we really saw AI starting to take off that we should start doing real work too.

[00:11:31] And it's really just about supporting the ecosystem, especially around open source. Like Simon, we're massive believers that the innovation you see in open source is really gonna be a big unlock for AI-based applications, right? Not everybody can just use the OpenAI API, as good as it is, and not everybody can train a model from scratch, right?

[00:11:52] Not everybody, you know, is Noam Shazeer or someone like that. So we think it's a really huge unlock and, again, we're just trying to support as much as possible. So today, you know, we released a playground to play around with Llama 2. We got it up on Replicate so people can just sort of try it with an API call and try integrating it into their apps.

[00:12:10] We released an AI starter kit over the last couple of weeks which people are actually using. We were shocked. We're, we're a little nervous cuz our, our code, you know, may or may not be production ready. But, but you'll see more and more of this from us over time.

[00:12:23] swyx: Yeah, I've seen your companion chat bot, and I have to say, it's actually pretty impressive.

[00:12:26] It's got all the latest features, especially in terms of streaming and LangChain and all the other stuff. So kudos to your team on that. Just to round out the overviews, or the high level takes, before we go into individual details: Alessio has been compiling the show notes, which we were gonna publish when this podcast goes live on Latent Space.

[00:12:45] Alessio, maybe you want to go over some of the notes that you've been taking. Then I'll go over to Alex.

[00:12:50] Alessio's recap of Llama 2

[00:12:50] Nathan Lambert: Yeah, we

[00:12:50] Alessio Fanelli: got a lot of stuff to run through here. I think, like, the most interesting things that I read from the paper: one, there's an abandoned model size. So the 7 billion, 13 billion and 70 billion made it to release, but there's a 34 billion size that didn't make it.

[00:13:08] And in the safety chart, you can actually see it's like twice as unsafe, quote unquote. And they decided not to publish it because of lack of time to red team it. So I don't know if anybody had a chance to try the 34B before the release, but I would love to learn more about that. Outside of that, yeah, as Simon and Nathan were talking about, the data piece is a lot more obscure.

[00:13:31] So LLaMA 1 was 67% Common Crawl, 15% C4, a bunch of GitHub, Wikipedia, books, as we mentioned. We don't have any information about Llama 2, but they did mention they have a 40% larger pre-training corpus. So they've obviously been investing a lot in that. Also, yeah, the supervised fine-tuning was very interesting.

[00:13:52] I saw a tweet where somebody asked Llama how to kill a process, and Llama was like, you can't kill things. And I was like, it's just a process. It's not a person. So I think in some places it might have gone too far with the RLHF, but that's another interesting side, right? Like if this is the starting point and like the de facto standard for open source models, are we okay with, you know, not being able to ask how to kill a Linux process?

[00:14:18] But I'm not, I'm not sure about that

[00:14:20] Nathan Lambert: yet.

[00:14:21] Simon Willison: I ran into that myself. I, I asked it to give me all of the animal emoji and it said that that would be disrespectful if it, if it attempted to do that, which was kind of interesting.

[00:14:32] Alessio Fanelli: Exactly. So that's an open question on open, you know, it's the whole safety question.

[00:14:39] It's like, how much do we need to do before we release these models to the public, versus what should be left to the public side? The other thing is, they should have let these GPUs burn for longer. Like if you look at the loss graphs, these models are not saturated, I guess. Like, they spent a lot of money to try and train these.

[00:14:56] Datasets 101 Followup

[00:14:56] Alessio Fanelli: But it seems like there's a lot of work left to do there. We just did a Datasets 101 episode that we released yesterday, which is already old news because now Llama 2 is out and this is all the rage. But we talked about some of the scaling laws, and we thought 200x was like the new LLaMA ratio.

[00:15:12] But I think this one is 275x, Sean, I think.

[00:15:17] swyx: Yeah. So that's five. Yeah, 2 trillion tokens for a 7B model. And that's, you know, that's up from 1.2 last time. So they've definitely ramped up the amount of data, and they just refuse to tell us any of it because, well, you know, guess what happened last time? They published the data, and RedPajama went and cloned, you know, line for line exactly what was in the LLaMA paper.

[00:15:39] So, you know, then that created the RedPajama model and then OpenLLaMA as well.
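The ratio being discussed is training tokens divided by parameter count. As a quick check of the arithmetic, the Python sketch below uses the 2 trillion token figure and the three released sizes mentioned in this conversation. The 20 tokens-per-parameter baseline is the commonly quoted Chinchilla rule of thumb and is an assumption added here, not a number from the discussion. Exact division gives roughly 286 for the 7B model, in the same range as the figure quoted live.

```python
# Tokens seen per parameter for each released Llama 2 size.
TRAINING_TOKENS = 2_000_000_000_000  # 2 trillion tokens, as stated above

# Parameter counts for the released sizes mentioned in the conversation.
MODEL_PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

# Assumption: the often-quoted Chinchilla rule of thumb of ~20 tokens/parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens, params):
    """Ratio of training tokens to model parameters."""
    return tokens / params


ratios = {
    name: tokens_per_param(TRAINING_TOKENS, params)
    for name, params in MODEL_PARAMS.items()
}

for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param ({multiple:.1f}x the rule of thumb)")
```

Because every size sees the same number of tokens, the smallest model ends up furthest past the rule of thumb.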

[00:15:44] Context Length 4k

[00:15:44] Simon Willison: So I saw it says that the context length is up from the first LLaMA. Do we know what the new context length is?

[00:15:50] Matt Bornstein: I think it's,

[00:15:50] Nathan Lambert: yeah, 4k. 4k.

[00:15:53] Simon Willison: Is that likely to be higher for the 70B model or are they all the same context length?

[00:15:58] Matt Bornstein: I believe they're all the same, and we have tested it a little bit, and my intuition is that you can actually get more effective performance, more accuracy, out of 4K rather than scaling up the way, say, OpenAI have to 32K or higher. I think it's just hard to find high quality training data. So when users actually start to submit longer inputs, performance kind of breaks down.

[00:16:22] And I'm not talking about OpenAI specifically, but in general, and that's my intuition on why, you know, Meta is keeping it relatively small for these models.

[00:16:31] Simon Willison: I'm kind of hoping that somebody, now that it's open source, somebody finds some clever trick to increase that. I've been playing with the Claude 100,000 a lot recently and it's pretty phenomenal what you can do once you've got that extra context length.

[00:16:43] swyx: There

[00:16:44] Alex Volkov: is actually a trick. It's called rope. We've seen this with a two, two line change that you can, you can make Lama forget about the context it was trained on, and there was back and forth about how effective this is and whether or not it suffers from the same dip, you know, in the middle of the context.

[00:16:59] But this rope scaling trick then was verified by folks from, I think Microsoft, independently from that guy Kaiko, Ken Devrel, and I, I see some folks in the audience here who are participating in this. So apparently this applies to the previous LAMA and would likely apply to this next one as well.
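The scaling trick described here is usually implemented as linear position interpolation: position indices are multiplied by a factor below 1 before the rotary angles are computed, so a longer sequence maps back into the position range seen during training. The plain-Python sketch below shows that idea only. The function names, the interleaved pair layout, and the 8192-token target are assumptions for illustration, not code from Llama or from the people mentioned.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position.

    scale < 1 squeezes positions together (linear position interpolation).
    """
    scaled_position = position * scale
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


def apply_rope(vector, position, base=10000.0, scale=1.0):
    """Rotate consecutive (even, odd) pairs of the vector by the rotary angles."""
    rotated = []
    for i, theta in enumerate(rope_angles(position, len(vector), base, scale)):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


TRAINED_CONTEXT = 4096  # context length mentioned in the conversation
TARGET_CONTEXT = 8192   # hypothetical longer context
SCALE = TRAINED_CONTEXT / TARGET_CONTEXT

query = [1.0, 0.0, 0.5, -0.5]

# With scaling, position 8000 receives the rotation that position 4000
# would receive without it, so it stays inside the trained range.
stretched = apply_rope(query, 8000, scale=SCALE)
in_range = apply_rope(query, 4000)
print(stretched == in_range)
```

In this sketch the whole change is the multiplication by `scale`, which is consistent with the trick being described as a very small edit. How well quality holds up at the longer length is the part the speakers say was debated.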

[00:17:17] Simon Willison: That's pretty exciting. This is the thing I'm looking forward to: now that it's open source, these experiments are just gonna start happening at such a fast rate. This happened with LLaMA before. You know, once you let every researcher in the world download and start tinkering with your model, people start finding optimizations and new tricks at a crazy rate.

[00:17:37] It's gonna be really interesting.

[00:17:39] Nathan Lambert: So

[00:17:39] Alex Volkov: I think the interesting piece here is to see whether or not the commercial license will unlock even more, or whether the researchers didn't care and kinda threw the kitchen sink of everything they wanted to hack together at the previous LLaMA. I'm thinking, because it's open source commercially now, companies will actually start, you know, doubling down, because they will be able to then use the fruits of their labor for commercial purposes.

[00:18:02] So we'll likely see

[00:18:04] Alessio Fanelli: more.

[00:18:05] Open-ish Source? Usage Policy and Restrictions

[00:18:05] Alessio Fanelli: I think you guys use the magic word, which is open source, and everybody has a, has a different, different definition. And I know we had Tom Warren in the audience who asked the question about this. So Tom, I'm gonna invite you up to speak if you're around.

[00:18:18] Simon Willison: Yeah. I'm gonna say, I say openly licensed, not open source, because I feel like open source has a definition that doesn't quite apply here.

[00:18:27] Alessio Fanelli: Yeah, yeah, exactly. If you go, actually, on my website, I wrote like a 10,000-word thing on the history of open source licensing, and there's things that are open source, and things that are somewhat open source. In traditional infra, that's like the Server Side Public License, some of these things that Elastic and Mongo came up with to avoid the AWS "API-compatible", in quotes, products that were literally just the same thing.

[00:18:51] So yeah, it's really curious also that the breakpoint for the Llama license is 700 million monthly active users, which is a lot of users obviously, but there's some notable people that go over it. So Snapchat is one company that is obviously a close competitor to Meta. TikTok is in there.

[00:19:10] YouTube, by far exceeds that

[00:19:13] Simon Willison: amount. Yeah. It's worth noting that that's actually not a rule going forward. As of the date of the release, if you have 700 million monthly users, you have to get an extra license from Meta. If you manage to achieve 700 million monthly users next week, you could still use it.

[00:19:30] Like it's, it's, it's, it's that point in time that

[00:19:32] swyx: matters. Yeah, at that point they should just name people. But yeah. Just to close the loop on this open source element, you know, there's one other piece about the open source, or the usage policy, which is you can't use it to train any other model.

[00:19:44] Thou shalt not have any other models before Llama. Llama is the only model that you can fine-tune with Llama data.

[00:19:52] Simon Willison: I think it's more than that. They're protecting against distilling the model, right? The thing that everyone's been doing, like Vicuna was trained on ChatGPT data, despite OpenAI having a thing in their terms that says you can't train a competing model.

[00:20:04] I'm really frustrated by this because the language says you cannot train a competing large language model. But what does that even mean? Who gets to decide what a large language model is? If in six months' time we invent a new architecture, is that still an LLM that's covered under those terms?

[00:20:20] It's, it's frustratingly vague.

[00:20:22] Nathan Lambert: Yeah, these clauses are kind of bogus. We talk about them a lot at Hugging Face. And it seems also, from a legal perspective, the things that they're grounded in, like terms of service, are being walked back in kind of this digital domain. And then also it's just unclear what actually counts as using the language model.

[00:20:40] So all these things where people use language models as a judge, or you can just generate a bunch of interesting prompts to then modify them. It's so ridiculous to even think of trying to enforce these clauses. It's surprising to see it show up,

[00:20:54] swyx: which, you have to note, like in the Llama 2 paper itself, they also use other companies' models to do their evaluations.

[00:21:02] Right? And, you know, a strict reading of those clauses would not allow them to do that.

[00:21:08] Huggingface Responsible AI License

[00:21:08] swyx: Nathan, actually a quick follow-up. Hugging Face has its own license, the RAIL license. I think there was some iteration following the Stable Diffusion release. Would that be appropriate for something like Llama 2?

[00:21:19] Nathan Lambert: Yeah, I think it's good. I don't have a hundred percent knowledge of RAIL. My understanding is that generally the goal is to be commercially available with good intention, and then it starts to try to give leverage for people to come after bad actors using their models.

[00:21:37] I think the commercial use of this is gonna be off the charts very soon. Like at Hugging Face, a lot of the monetization efforts are around trying to enable commercial use of open source language models. And the license questions have been a constant discussion for the last six months, from things we're trying to build and from customers.

[00:21:57] So like this is definitely going to

[00:21:59] swyx: be used. Yeah. Yeah. Okay. So, we have a lot of, you know, insightful people here.

[00:22:07] I feel like the best way to organize this space is maybe to just kind of try to stick to as many sort of factual elements as we can.

[00:22:15] I feel like, Nathan, since you've done the most work, you've had the most time with the paper, to be honest, maybe pick one other element of the paper that you find worth discussing and we can kind of go into that.

[00:22:27] Pretraining Llama 2 Base Model beyond Chinchilla

[00:22:27] swyx: Maybe the pre-training base model stuff.

[00:22:30] Nathan Lambert: I don't think there's a lot on the pre-training. There's definitely an important thing that makes it able to be used, which is they use, what is it, GQA? It's grouped-query attention, which will make inference on the bigger models faster. I think there's kind of an asterisk that is interesting, in that code and math and reasoning seem pretty de-emphasized in the paper.

[00:22:49] And that's what there's kind of a market for. That's what ChatGPT is used for by a lot of people on this call. I think at a technical level, the RLHF details are the most fleshed out that we have seen, and kind of confirm a lot of the capabilities we've seen insinuated by Anthropic and OpenAI.

[00:23:11] So that was kind of a relief for me, as someone that's trying to be like, I still think this really works. And they dropped this paper that's like, we really like this, which was not guaranteed.

[00:23:22] Matt Bornstein: I have one pre-training question. And this is for you, Nathan, or for the whole group. Like we talked about it before.

[00:23:27] The amount of pre-training data here goes far beyond Chinchilla-optimal, and the loss curves were still going down when they cut it off. Are we ready to say that Chinchilla-optimal is just not optimal anymore?

[00:23:43] Nathan Lambert: Oh, I'm ready. I never really cared about it. I think data quality is changing that completely.

[00:23:51] I think when Chinchilla came out, data quality standards were so different, and given what the practices are now, it's like, what does it mean?

[00:24:03] Matt Bornstein: It was a really big deal at the time though, right? I mean, it was kind of this breathtaking result that if you just ramp up training data much higher than you thought or people had been doing, you just kept getting better performance.

[00:24:15] Maybe, Nathan, since you're the most knowledgeable on this space, can you just give us a little intuition? When you say better data quality, what exactly is happening under the hood that makes this possible now?

[00:24:26] Nathan Lambert: Oh, they're removing... Okay. Think about all the tweets and texts that everyone sends, and we have these weird insider jokes and phrasings that we do.

[00:24:37] They make no sense if you read them, and your language model half reproduces them. Like, I'll say "you got got," or something that is just very confusing from a token prediction standpoint, and then there's also a ton of just errors. Like, I write a blog post. I used to not take it as seriously; I've published a blog with a half-finished sentence in it.

[00:25:00] They would just scrape that and take it, but trying to actually get data that is complete and consistent is just extremely hard. I think the technical terms are things like deduplication, so you don't wanna pass the model the same text, even if it came from different websites, and there's tons more that goes into this.

[00:25:21] I don't think it's the area of my most expertise, but I think it's actually pretty simple. You just wanna put good text into the model, and understanding what good text is on the internet is really hard.
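As a rough illustration of the deduplication idea mentioned above, here is a minimal sketch of exact-match deduplication after light normalization. It is not how Meta cleaned the Llama 2 data (the paper does not spell that out); the function names and the sample strings are hypothetical, and production pipelines typically add fuzzy or near-duplicate matching on top of something like this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical scraped snippets: the first two are the same text from two sites.
scraped = [
    "You just want to put good text into the model.",
    "You just  want to put good text into the model. ",
    "A different sentence entirely.",
]
print(len(deduplicate(scraped)))
```

Hashing the normalized text keeps memory use small even for large corpora, at the cost of only catching copies that are identical after normalization.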

[00:25:34] Matt Bornstein: So you're sort of saying the reason people were using not enough data initially is 'cause they just weren't good enough at cleaning it. And now that those methods have advanced so much, we're removing duplicates better, we can measure quality better, all of that. Do you think we're gonna keep going up, I guess is the question? You know, they trained a 7B model on 2 trillion tokens.

[00:25:52] Do you think that's the maximum, or are we gonna keep going?

[00:25:55] Nathan Lambert: I think the intuition in what you're saying is that getting more, higher quality data is making it so that using more works better. That's what everyone in my circles is saying is the trend, and given machine learning in the last few years, I think trends tend to be stickier than most people expect them to be.

[00:26:17] So I would expect it to keep going. I just kind of trust the process to continue for a lot of stuff like this.

[00:26:22] swyx: Yeah. So on our podcast, we've been asking everyone that we can possibly ask about this. You know, we went from a 2x tokens-to-params ratio with Kaplan, and then 20x with Chinchilla, now 200x with Llama. Like, someone's gonna try 2000.

[00:26:37] Right? We did have a response today from one of our previous guests, Varun of Codeium, who said that they did try a thousand-to-one tokens-to-params ratio, and it had definitely gone into the range of overfitting. So your loss can continue to go down, but you're not measuring overfitting in some of that respect.

[00:26:53] So it's very unclear. I would say, though, it's not that Chinchilla was wrong. Chinchilla was optimizing for a particular set of assumptions, particularly the pre-training compute budget, right? Compute-optimal scaling laws. And if you look at the Llama paper, right on the first page, I have it open right in front of me.

[00:27:12] They actually criticize that and say, you know, this disregards the inference budget, which is critical when you're actually serving the model instead of just optimizing for a pre-training compute objective. And as things move from research into production, inference starts to become more of a concern.

[00:27:28] Resource constraints start becoming more of a concern. And so I think it's actually quite reasonable to move on from Chinchilla, which is a very important result, and say that we are exploring very different objectives as compared to more than a year ago, when Chinchilla was published.
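To put numbers on the ratios quoted above, here is a small sketch that computes tokens per parameter for the Llama 2 sizes mentioned in this conversation (7B and 70B, both trained on 2 trillion tokens) and compares them with the roughly 20:1 Chinchilla rule of thumb. The 20:1 figure is the approximation used in the discussion, not an exact constant, and the helper names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb quoted in the conversation


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def chinchilla_tokens(params: float) -> float:
    """Approximate compute-optimal token budget under the 20:1 rule of thumb."""
    return CHINCHILLA_TOKENS_PER_PARAM * params


LLAMA2_TOKENS = 2e12  # 2 trillion pre-training tokens

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    ratio = tokens_per_param(LLAMA2_TOKENS, params)
    multiple = LLAMA2_TOKENS / chinchilla_tokens(params)
    print(f"Llama 2 {name}: {ratio:.0f} tokens/param, "
          f"{multiple:.1f}x the 20:1 budget")
```

The smallest model is the one that lands in the "200x"-plus range; the 70B model sits much closer to the Chinchilla ratio, which fits the point that the extra training mainly buys cheaper inference from small models.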

[00:27:45] Llama 2 is incomplete? Race to publish

[00:27:45] Nathan Lambert: Yeah, I agree. I was just gonna say that, reading the paper, it feels like this is a checkpoint of a much longer-term project. They readily list off things that they didn't get to but want to continue, like capabilities or something.

[00:28:03] Some of the methods seem like kind of hacks to make things work, where they didn't know if they'd get them to work. Like, Anthropic came up with context distillation, which is a way of getting the behavior of a really long system prompt into a shorter prompt, essentially, and they did something like this in this paper to get the model to behave like characters for longer conversation turns.

[00:28:27] And there's all sorts of little things where I just think Meta is going to continue this.

[00:28:34] Simon Willison: So that's kinda fascinating, 'cause that implies that the actual story here is the AI arms race, right? It's Zuckerberg saying, no, we need to get something out right now. Get it to a point where it's good enough and safe enough, and then let's ship it.

[00:28:46] And it's not so much that they, they, they didn't necessarily have time to get to the sort of perfect point that they wanted to get to.

[00:28:54] swyx: Yeah, I have asked people about this offline, and I was like, okay, so why don't people throw a lot more compute at this? And they're like, you know, as long as you have a state-of-the-art model, you should just ship it and get credit, and then wait a few months and get the next version out.

[00:29:08] That way you have a lot more shots on goal.

[00:29:11] Simon Willison: That totally makes sense. Yeah.

[00:29:14] swyx: And I was like, oh, okay. We are in such early stages that honestly, I mean, they spent 3 million GPU hours on this thing. They could spend 30 million and, like, obviously it would be way better. We're in such early stages that even these relatively simple...

[00:29:27] Don't forget, Llama 1 was published in February of this year. We're in such an easy cycle where it's still within, you know, the order of months to make and improve one of these things, so it's not too terrible.

[00:29:40] Come for the Llama, stay for the (Meta) drama

[00:29:40] swyx: I guess I should also mention, as a shout-out, that not every person who worked on Llama 2 is on the paper.

[00:29:48] Guillaume Lample, who's one of the co-founders of Mistral, the French startup that raised like a hundred million seed round, apparently worked on Llama 2, and they left his team out because they left Meta before this paper was published.

[00:30:03] So, interesting passage there, if anyone wants to go through that.

[00:30:05] Alessio Fanelli: Come for the Llama, stay for the drama. Oh, it's hard to read into, you know, especially when it comes to work that then goes open source. It's always, we did the work, we didn't... I don't know. Since nobody here worked at Meta, I would rather not go down that path.

[00:30:23] Yeah,

[00:30:23] swyx: I'll just leave a bookmark there. Okay.

[00:30:26] Nathan Lambert: Yeah, but exactly. We're not in the room there.

[00:30:28] Matt Bornstein: I, for one, am shocked to hear that there may be drama among researchers. I've never heard of that happening before.

[00:30:34] Nathan Lambert: Right. Yeah, especially after three organizational restructures, with researchers playing hopscotch from one org to another and being in between jobs.

[00:30:43] I don't know.

[00:30:45] swyx: All right. Alex, you have your hand up? And then I wanted to dig more on the preference data that Nathan mentioned.

[00:30:52] Language Translation

[00:30:52] Alex Volkov: Hey guys. Just to introduce myself real quick, I'm Alex. I participate in these Spaces, and my angle, the way I, quote unquote, vibe check models, is via languages.

[00:31:03] And to me, it was really surprising that they released the second iteration like this, knowing how much Meta actually does for translation. They have the very famous NLLB models, No Language Left Behind. They released models that you can speak to in, like, a thousand languages, that understand. And for some reason, their open source models

[00:31:23] are not very strong multilingually. We've seen this with GPT-4, which was way better at multilingual speech. Claude highlighted this point with Claude 2, which is way better at the BLEU score, I think, for languages. And my go-to vibe check with these models, especially the open source ones, is the ability to translate, the ability to understand the languages.

[00:31:46] I've tried it with Hebrew a little bit. I've tried with. Very, very impressed. Now, obviously fine-tuning will come, and obviously people will fine-tune these models towards different outcomes, but it's very interesting, considering how much Meta does elsewhere for languages and to bring the world together,

[00:32:02] how much this model did not focus on this specific issue. And the second thing is also code. I know you guys talked about HumanEval. That's fairly low in terms of the score out of the box. And obviously fine-tuning will make it better, but it's a fairly disappointing score on HumanEval, right?

[00:32:22] Fairly low coding abilities. And we've seen previously that there's some assumption that training on more code in your dataset actually gives you better logic and reasoning abilities. So I'm kind of surprised that that was fairly low. I wanted to chime in with these two examples about Llama.

[00:32:40] Llama2's coding abilities

[00:32:40] swyx: I'll say on the HumanEval piece, don't count it out just yet. I've had some DMs with Quinn Slack of Sourcegraph, and he is, you know, very actively building Cody, their coding assistant bot. And it's well known that HumanEval is not a very good or reflective measure of how we use coding chatbots.

[00:32:59] And so HumanEval is probably overrepresented in terms of being effectively the sole benchmark by which we value code models. We just need new benchmarks for code.

[00:33:11] Matt Bornstein: I do think it's possible better instruction tuning will improve code performance of the Llama 2 models as well, because their reasoning capabilities are actually relatively good. Not perfect, but relatively good, which makes me think there may be more code in the pre-training than it seems.

[00:33:26] swyx: Well, it's difficult to know, 'cause they don't talk about it.

[00:33:29] We'll see.

[00:33:31] Why we want to know about the training data

[00:33:31] Simon Willison: I mean, this is the thing that's so infuriating about these opaque models that don't talk about their training data: as users of the models, we need to know how much, like, if it's had code in it, all of those kinds of things, in order to make decisions about what we're going to use it for.

[00:33:45] So I kind of feel like the secrecy around these models really hurts me as a consumer of these models, just from a practical point of view of being able to make good judgements about what the model's likely to be able to do.

[00:33:55] Matt Bornstein: I do think that's true, Simon. You know, I wanna make just one defense of Meta, which is, this is pretty amazing, what they've released and given to the world. Obviously it may benefit them commercially as well, but it actually carries pretty substantial risks for them, and I actually think it's kind of a courageous act to release. And it's the things like the training data and

[00:34:20] safety where, you know, when you're Meta and you have billions of active users, you actually are taking a pretty big risk with these things. And regulatory bodies have their sights on you. So I do think you're right. I just, for what it's worth, wanna say it's actually a positive thing.

[00:34:37] Simon Willison: I agree.

[00:34:38] I agree with everything you say, but at the same time, right now, I've got a whole bunch of models that I'm trying to choose between, and I don't have the information I need to make the decision. I feel like at some point it's going to be a competitive advantage to put out a model with transparency over what went into the data.

[00:34:55] 'Cause people will be able to use that model more effectively. But yeah, I completely understand these strategic challenges, and I'm astonished that Meta went ahead with this release. I never thought they'd take the risk of releasing something like this and having someone use it for something bad, and now they're on the front page of all of the papers for it.

[00:35:12] So yeah, I'm super excited about it on that front.

[00:35:15] The importance of Meta pushing forward Truly Open AI

[00:35:15] Alex Volkov: I wanna add on. Yeah. From the perspective of releasing something as open source: previously we didn't have commercial licensing, obviously. Now the big thing is we have commercial licensing. But the amount of people, I don't know if you guys noticed, who signed, quote unquote, in support of releasing these models, Paul Graham and Marc Andreessen and a bunch of other folks... In addition to the model, they also released kind of a counterweight to the moratorium papers and all the AI safety stuff.

[00:35:41] Because there was an FTC probe, right? There was some regulatory stuff talking about the previous releases of Llama from a long time ago. And now not only did they release the, quote unquote, open source, and so nobody kicks me off here: not fully open source, but definitely we're able to use this commercially.

[00:36:00] But they also released a statement from industry leaders saying that open source is needed. And I think that gives a very strong counterweight to the moratorium and the keep-it-closed, don't-release kind of thing we saw. And it's very interesting that it comes from Meta specifically.

[00:36:16] So in addition to the courageousness, it looks like they're also kind of leading the industry in terms of, this is how to do fully commercial, again, quote unquote open source, not an open source license, but this is how to release models in a safe way. So definitely joining the applause for Meta and the team.

[00:36:35] Nathan Lambert: Yeah, I just don't think that we're the customers of Meta with respect to this model. I think they're trying to build these for their own purposes, and then they have very strong principles of transparency and research that these organizations at Meta have stood by. And I think this is the newest representation of that. I don't think they're trying to make money off releasing this in any way. There is an ecosystem perspective where, as AI content proliferates, there's more creativity for their users, and that enables social media and things.

[00:37:08] But I think we're still pretty far from that. And it's more of a values thing and an internal research and development tool for themselves. Like, is there a way for them to make money directly off of this?

[00:37:19] Alessio Fanelli: NPCs and the Metaverse. But I mean, I don't know.

[00:37:23] swyx: Well, so we last hosted one of these emergency pods, I think, maybe two pods ago,

[00:37:28] which was, I think, in May, when the No Moats memo came out from Google. And we actually talked a little bit about what an ecosystem around a language model looks like when you have stackable LoRAs, customizations, and fine-tunes that are based on top of an existing base model that is well known.

[00:37:48] I think that might be part of the strategy there. You know, Facebook is also well known for releasing, I guess, PyTorch and React. They don't make money from those directly, but they definitely do benefit from the ecosystem that has sprung up around them, which essentially represents a lot of free development from the open source community.

[00:38:07] Simon Willison: I think there's a lot to be said for the fact that Meta AI are at the very heart of openly licensed language model research, and that's because of Llama. You know, Llama came out and it kicked off this immense tidal wave of interest and activity, with Meta AI right at the very center of that. And in the world that we live in right now, being at the very center of all of the research and innovation happening around language models feels like a really valuable place to be.

[00:38:31] Llama 2 as Enabler of Startups

[00:38:31] swyx: Yeah, it really is. And maybe we can go a little bit to Matt again. One thing I wanted to get your thoughts on, and I don't know how long you have with us, is the impact on the startup ecosystem, right? How big of an enabler is this? Or does this, I guess, just commoditize everything to a point where everyone's just wrappers?

[00:38:50] Matt Bornstein: I think it's a really, really massive deal. You know, we've met with, conservatively, hundreds of AI startups now, maybe thousands. We'd have to go back and look. And I sort of alluded to this before, but the really big dilemma is: do I train my own model, or do I just use something off the shelf?

[00:39:15] And we're increasingly seeing that the answer for almost everybody is kind of a hybrid approach. We're seeing an increasing number of startups basically triage their AI workloads, where if things require really high levels of accuracy and human-like text generation, GPT-4 is the only answer.

[00:39:38] But many queries or workloads actually don't require that, right? So you can kind of scale down and say, for a really simple query, I can use an open source model off the shelf; for something in the middle, I can fine-tune for various tasks. And then you can get pretty sophisticated about what you route where. All of that is only possible if we have commercially usable, really high quality language models, and especially ones that have been efficiently trained such that latency is low and cost is relatively low.

[00:40:09] So I think what we're gonna see happen is there's gonna be a big push for startups to use Llama 2 models and other open source models that have similar levels of performance, and fine-tune them in ways that actually work for specific tasks, right? Not for specific data, I think that was sort of a head fake, but for specific tasks, and really be able to build more defensible businesses that way.

[00:40:34] You know, there's nothing wrong with using OpenAI. That's fantastic, but it's probably not good to make that a hundred percent of your business. And a lot of founders are doing that now. So that's why I think this is such a huge deal, and the progress just today has been amazing.

[00:40:51] There's gonna be, by the end of today, a number of hosts where you can just easily use the Llama 2 models right outta the box. You know, Replicate's one that we work with, but there are others as well. You can already run it on your local computer with two-bit precision, which is kind of crazy if you stop and think about that for a second: with two bits, you can actually run a super advanced language model on your own computer.

[00:41:15] So I just think this is a huge, huge deal for startups, and I think if you're a startup founder working in AI, you really should be taking a look at open source models now and seeing how they can be used to deepen your moat and build a really great AI product.
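The triage Matt describes can be sketched as a simple routing function. This is only an illustration of the idea under assumed inputs: the `Workload` fields, the tier names, and the decision order are hypothetical, and a real router would likely use measured accuracy, latency, and cost rather than two boolean flags.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN_OFF_THE_SHELF = "open source model, off the shelf"
    OPEN_FINE_TUNED = "open source model, fine-tuned for the task"
    FRONTIER_API = "frontier API model"


@dataclass
class Workload:
    name: str
    needs_high_accuracy: bool   # hypothetical flag: top accuracy or human-like text required
    has_fine_tuned_model: bool  # hypothetical flag: a task-specific fine-tune exists


def route(workload: Workload) -> Tier:
    """Send each workload to the cheapest tier that is assumed to handle it."""
    if workload.needs_high_accuracy:
        return Tier.FRONTIER_API
    if workload.has_fine_tuned_model:
        return Tier.OPEN_FINE_TUNED
    return Tier.OPEN_OFF_THE_SHELF


# Hypothetical workloads for illustration only.
for w in [
    Workload("contract drafting", True, False),
    Workload("support ticket tagging", False, True),
    Workload("greeting generation", False, False),
]:
    print(f"{w.name}: {route(w).value}")
```

Checking the most demanding requirement first means a workload is never sent to a cheaper tier than it is assumed to need.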

[00:41:34] Where you can try Llama 2

[00:41:34] swyx: Right. So I would like to help fill in the blanks. Apart from Replicate, it looks like Hugging Face has also launched an inference endpoint for that. And as far as I know, it's one of the only few ways to try the 70B model off the shelf. I think Baseten has also maybe put something up. And then for the two-bit quantized model, you can look at the GGML ecosystem.
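To see why two-bit quantization makes local inference plausible, here is a back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. It assumes every weight is stored at exactly the stated bit width; real quantized formats also store scaling data, and inference needs extra memory for activations and the attention cache, so actual requirements are somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Model sizes mentioned in the conversation; bit widths chosen for comparison.
for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for bits in (16, 4, 2):
        print(f"Llama 2 {name} at {bits}-bit: "
              f"~{weight_memory_gb(params, bits):.2f} GB")
```

Going from 16-bit to 2-bit weights is an 8x reduction, which is what moves the smaller models into the range of an ordinary laptop.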

[00:41:55] Do you need dataset transparency if you have evals?

[00:41:55] swyx: Yeah. And then I also wanted to recognize one of the other respondents in our chat; we have a little comment window here. ARD Doshi was responding, I think, to Simon. And I did actually have a pushback, right? We don't have to know the full datasets of Llama as long as we are able to eval for everything that we want to know about.

[00:42:13] I think we actually have to live with AI becoming more and more of a black box, even though the weights are open.

[00:42:20] Simon Willison: I mean, for me it comes down to model competition. If I have two equally capable models, and one of them I know what's in it and the other I don't, I'm gonna use the more transparent one.

[00:42:30] And I'm hoping, because there are so many models competing now, I'm hoping this becomes one of the factors that models compete with each other on

[00:42:38] swyx: You know, dataset non-transparency, I guess, is an emerging theme, because it's not like we had that for Falcon either. So yeah, we can hope for it.

[00:42:47] Simon Willison: And that's a huge problem, right?

[00:42:49] Falcon, if you ask Falcon about human rights abuses in, in the Middle East, it has some very different opinions and I want to understand why. I want to know how they got it to, to do those things.

[00:43:00] swyx: Yeah, exactly. We won't know. And all we can do is ask for more transparency there.

[00:43:06] But I do support the concept of building a business on open source models, because open source AI will not randomly deprecate your models on you, you know, every three months. And I do think that for people who want a certain level of stability and are okay with trading off not being state of the art in three months, that is a perfectly reasonable tradeoff.

[00:43:26] >$20m cost of Llama 2 is primarily preference data collection

[00:43:26] swyx: Okay. I wanted to go back to Nathan a little bit and talk more about the preference data and the RLHF data. So you estimated a 25 million cost for Llama 2. And as far as I can tell, that's actually primarily data collection, not GPUs.

[00:43:46] Nathan Lambert: Yeah. This is based on our pilot contract to do preference data collection at Hugging Face, 'cause we're collecting a small amount of data in a similar way, and if you do a back-of-the-envelope cost calculation and scale it up by whatever, like 10 or a hundred x, to what they did, then you get towards this 20 million number, and it could be higher depending on how many flags they end up using in their data.

[00:44:12] So I think what they did with safety is pretty interesting. They separated it and collected metadata, and that means they could also collect other metadata during the process. And as you add more knobs to the preference data collection, it takes longer for people to do the task and the cost goes up.

[00:44:29] So I think it's pretty safe to say order of 10 million, especially because that's what was rumored with OpenAI around ChatGPT and everything like that. So it is not a shock at all to me.

[00:44:43] swyx: And is the focus on multi-turn significantly higher or, you know, comment-worthy, I guess?

[00:44:49] Nathan Lambert: Not really. Generally when setting this up, it comes down to how many tasks the workforce is gonna do.

[00:44:58] You could do an instruction prompt, which is one turn, or you could do a four-turn chat, and you'd generally be able to trade off the number of labels that you get in that respect. So I think the multi-turn is more because open source datasets don't contain a lot of that, which is something that we found in our work as well.

[00:45:16] And they did that because they needed the model capabilities and they needed to train a preference model that can do that. And I agree, I think they must have figured that out months ago, 'cause this also takes a lot of time. How it works generally, and you can see this in the paper in how they say they have these RLHF versions, is that you sign a contract, and then these people sit you down and they're like, we are gonna try to do this over batches, and we scale up the amount of data we're sending over time so that we can do calibration.

[00:45:43] And each batch you get some data from the vendor and then you look through the samples and you see what you like and you see what you don't like and then you change it going forwards. And what they did is they took those batches and they trained a model iteratively and then they saw what their model needed and they went back to the vendor to say, okay, we need more data in this regard to improve things.

[00:46:01] So it was a really hands-on, really involved process. And I would guess it takes weeks to months for them to get all this data from a vendor. It's definitely not something you can just get fast. And honestly, a potential reason why code is not as good is because it's way harder to get code data in this regard.

[00:46:20] So all the task companies are extremely limited in people that know a lot about code. So you get way lower throughput for getting preference labels in code and getting that kind of preference data.
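The back-of-the-envelope estimate Nathan describes at the start of this section (take a pilot, scale it up, and account for extra metadata flags slowing annotators down) can be written out as a small sketch. Every number below is a hypothetical placeholder chosen only to show the arithmetic; none of them come from the Llama 2 paper or from any actual vendor contract.

```python
def preference_data_cost(
    num_comparisons: int,
    base_cost_per_comparison: float,
    extra_flags: int = 0,
    overhead_per_flag: float = 0.0,
) -> float:
    """Total labeling cost, with each extra metadata flag adding a fractional overhead."""
    per_item = base_cost_per_comparison * (1 + extra_flags * overhead_per_flag)
    return num_comparisons * per_item


# Hypothetical pilot: 10,000 comparisons at $10 each, then scaled up 100x.
pilot_comparisons = 10_000
scale_factor = 100
scaled = pilot_comparisons * scale_factor

plain = preference_data_cost(scaled, base_cost_per_comparison=10.0)
with_flags = preference_data_cost(
    scaled, base_cost_per_comparison=10.0, extra_flags=2, overhead_per_flag=0.25
)
print(f"No extra flags: ${plain:,.0f}")
print(f"Two extra flags: ${with_flags:,.0f}")
```

The point of the sketch is the shape of the calculation: cost grows linearly with the number of comparisons and again with every additional thing annotators are asked to record.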

[00:46:33] Do we even need human annotators?

[00:46:33] swyx: That makes a ton of sense. Anyone else have any other commentary, I guess, about the additional data collection? What I sense now is that there's a shift away from, I guess, the pre-training datasets, which are more opaque but also equally well understood, towards more of this preference and RLHF data.

[00:46:52] Alessio Fanelli: Yeah, they spent a lot of time on the supervised fine-tuning data too. They actually compare human vendors to some of their models, and they were like, yes, we should just use the human annotators for, like, reinforcement learning.

[00:47:05] Nathan Lambert: I'll tell you what, yeah.

[00:47:07] swyx: The annotators are using the models anyway, right?

[00:47:09] So it's just Yeah, exactly.

[00:47:10] Nathan Lambert: Models all the way down.

[00:47:12] Models Rating Models

[00:47:12] Alessio Fanelli: I...

[00:47:13] I think also, I mean, to me, some of these things are like alchemy, right? They're like, we stopped annotating supervised fine-tuning data at 27,540 annotations. Why? It seems like such an arbitrary number, you know, that I feel like that's gonna be one of the next research areas, figuring out where the right limit is.

[00:47:35] Do you know if there are any really good open source datasets for, not pre-training, but fine-tuning and then RLHF? Because I think one of the big moments with RedPajama was like, okay, we can take the Llama 1 data mixture, use all the open source datasets, and just run GPUs at them.

[00:47:55] How do we get to do the same with the post-training flow?

[00:47:58] Nathan Lambert: Okay, so you were breaking up a little bit for the question. So I, I'm gonna say what I think it was, and if it wasn't, you can jump in and clarify. So I think it's like, how do we recreate this supervised training data set and like, can we do anything else with it after the fact?

[00:48:14] Yeah. So Gen, this is another thing that we've started doing, and I think that what, so the open source equivalents are something like Open Assistant created a really high quality dataset, artifact, and then the recent trend is for this thing that's like called Uncensored dataset, which I think is this totally silly name.

[00:48:34] Because really what they're doing is they're removing instructions like as a language model, I don't wanna say this. And therefore when you remove these things, the model gets more helpful. So that's just gonna be the new type of data, which is just clean response on instructions with really strong distribution control.

[00:48:50] And the thing is about recreating this is that it's. Hard to create a diverse set of tasks. So what they are essentially paying money for is someone to make sure that you're not getting a whole bunch of the same poems or something. It's like getting 27,000 weird creative tasks that don't all overlap with each other is why you have to pay a lot of money for it.

[00:49:11] Rather than saying, oh, we have 250 people on this call, let's all do 10 of them, and then that's a solid start. Like, we would just have a totally misshapen distribution and it wouldn't be that useful. So you can go look at InstructGPT, and other papers like this have breakdowns of what that instruction data, the supervised fine-tuning data, actually looks like.

[00:49:33] But actually creating it is pretty hard. And I do think that the vendors provide a really high quality amount of data, but their point about the models being able to create it is also really true. So it's pretty borderline right now. And Anthropic stopped using that in their future work.

[00:49:50] So like, Anthropic's new base models are just good enough at responding to instructions that they don't need to do supervised fine-tuning. And that's in the Constitutional AI paper. So I don't think that's the place to invest time. It's much more on the preference side, to get the RLHF model and to get these preference models going.

[00:50:09] So then maybe you can even do creative things like constitutional AI and stuff after that.

[00:50:13] Alessio Fanelli: Yep. So if you wanna do work in open source today, you think you're better off contributing on this side versus like trying to train yet another model.

[00:50:24] Nathan Lambert: Yeah. There are no preference models out there and it's astonishing to me, especially given that Meta's paper is like, oh, we use an ensemble of two preference models.

[00:50:32] The thing that I wanna see them do, or someone do, is take a base Llama model and then also train another preference model that's for code, and then try to do RLHF where you have a prompt flag so that all the code questions get rated by their own preference model as well, and see what that can do, because they already broke it down into, like, instruction helpfulness and safety.
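A rough Python sketch of that idea follows. It is not how Meta combines its reward models. It assumes each preference model is a callable returning a score, the three scorers are toy stand-ins, and plain averaging is an arbitrary choice made only to show the routing by prompt flag.

```python
from typing import Callable, Dict

# A preference model is assumed to map (prompt, response) to a scalar score.
Scorer = Callable[[str, str], float]


def combined_reward(prompt: str, response: str,
                    models: Dict[str, Scorer], is_code_prompt: bool) -> float:
    """Average the scores of every preference model that applies to this prompt."""
    names = ["helpfulness", "safety"]
    if is_code_prompt:
        # Code-flagged prompts are additionally rated by the code preference model.
        names.append("code")
    scores = [models[name](prompt, response) for name in names]
    return sum(scores) / len(scores)


# Toy stand-ins for trained preference models (hypothetical constants).
toy_models: Dict[str, Scorer] = {
    "helpfulness": lambda p, r: 0.5,
    "safety": lambda p, r: 1.0,
    "code": lambda p, r: 0.0,
}

chat_reward = combined_reward("Tell me a joke", "...", toy_models, is_code_prompt=False)
code_reward = combined_reward("Fix this bug", "...", toy_models, is_code_prompt=True)
```

In a real RLHF loop the flag would come from a classifier or from dataset metadata, and how the scores are combined is a design decision worth experimenting with.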

[00:50:52] Mm-hmm. It's like, why can't we add another one? It it, it's so obvious that I'm surprised it didn't, it, it just makes a lot of sense. Seeing it in the paper. I was like,

[00:51:02] How to get Code preference data

[00:51:02] swyx: stoked. Yeah. This conversation gave me a bit of an idea for essentially Llama Stack Overflow. Like, you imagine Stack Overflow with sort of Llama at its base, but then, it's not very good at coding, but we can actually do ratings, you know, preference ratings on answers and entire conversation chains.

[00:51:21] And at some point we'll accumulate the code dataset that we need to fine-tune Llama. That would probably do it.

[00:51:27] Yeah,

[00:51:28] Nathan Lambert: There's challenges in base models and how to execute code to get feedback and stuff, but we've seen early experiments, and we've worked on one, funnily enough, that was called StackLLaMA. We did a nice experiment with that at Hugging Face and it's out there, it's ready for someone to invest more time in it and do it.

[00:51:48] I think especially now that Llama 2 is out, Llama 2's gonna be easier to work with. It's just that better language models are a little bit easier to

[00:51:56] swyx: steer. Absolutely. Alex, you have and Mars catalog you, you just joined and I I am sure you have a question. Yeah, go ahead Alex.

[00:52:04] Llama 2 Finetuning Ecosystem

[00:52:04] Alex Volkov: I, I, I just want to complete down what Nathan said.

[00:52:06] It's going to be easier to work with because a ton of the ecosystem, the different kinds of things that the first Llama opened up, is now there, right? GGML is there, GPT4All, and the Pinokio browser, like all the different things. How to run Llama on your laptop already kind of exists.

[00:52:25] And now we're just gonna see the commercial folk come in, the folks for whom working on this actually needs a dollar sign afterwards. And now they'll be able to also participate in this. And we've seen this already. I dunno if you guys talked about this or not: Scale AI apparently had early access to this and has now released, I think, a fully open source toolkit to fine-tune. Mosaic, which is now Databricks, also chimed in that it's now super simple to fine-tune Llama on their, you know, infrastructure.

[00:52:54] Even though they have the MPT models, et cetera, they still wanna support Llama. And so, yeah, the ecosystem exists and I think Nathan's completely right. It's gonna be easier to

[00:53:03] Nathan Lambert: use. Easier to fine-tune. Yeah. Like Hugging Face, I think every library, like all these people at Hugging Face were working super hard this weekend to make day zero support for Llama 2.

[00:53:14] Like Transformers, PEFT, TRL, all these people put in the hours to make it. It's there, like, this week. People are doing this now; instead of talking on a podcast, they're fine-tuning this thing. I'm sure that,

[00:53:28] swyx: For, for what it's worth I did actually look into the scale thing because I thought that was kind of interesting, their announcement.

[00:53:33] They never said that they were directly used in Llama 2. Perhaps they're not allowed to say so. All they say is, Scale AI is proud to be a Meta launch partner. We're launching a platform for customizing LLMs, blah, blah, blah. And obviously, you know that Scale does annotation, so I think it's just heavily implied.

[00:53:51] But I don't think they're allowed to say,

[00:53:54] Simon Willison: I, I've got,

[00:53:56] Nathan Lambert: yeah, surge announced they did the surge device data. At least I I think they did more of it too. Go ahead.

[00:54:02] Hey Apple: Llama2 on Metal pls

[00:54:02] Simon Willison: Quick Hugging Face Transformers question. I really want to run Llama 2 on my M2 Mac using Metal, so it takes advantage of the GPU integration in the M2.

[00:54:12] Could somebody please figure out how to do that with Hugging Face Transformers, then publish the world's most straightforward how-to-do-this document, because I have not managed it yet. And I think that would be a huge capacity increase for all sorts

[00:54:24] swyx: of people.

[00:54:24] Nathan Lambert: Yeah. Pedro at Hugging Face is working on that. At least integrating these models with Apple directly is fantastic. I agree. I agree. We agree. There's

[00:54:38] Russell Kaplan: also a project called llama.cpp that is hardware accelerated on the M2 for Llama 1. So I'm sure they're gonna be updating that for the new models as well,

[00:54:49] Simon Willison: working mean on the cpp.

[00:54:51] But I've not seen it run Metal yet. Evidently I haven't checked the README in the past few weeks.

[00:54:58] swyx: Isn't it, as long as it's in GGML, it works, right? Yeah. And those are

[00:55:01] Alex Volkov: the converted models in GGML format. We were able to run one. You should split it between CPU and GPU, and, I don't know, in the audience, we have Llama 2 7B in GGML and

[00:55:13] Nathan Lambert: run really fast.

[00:55:15] Simon Willison: Fantastic. Yeah. Again, if somebody wants to be really useful, publish nice, detailed step-by-step instructions for getting that working, and I will benefit from it and so will loads of others. I don't want to do it myself. I want somebody else to figure it out

[00:55:26] swyx: for me. Yes. And, and Simon's, Simon's very good at this.

[00:55:31] You can just kind of copy and paste the, the kind of tutorial quality that he does. That'd be great for all of us. Thank you.

[00:55:36] I wanna recognize Anton, who has joined. Hey,

[00:55:39] Nathan Lambert: stranger.

[00:55:40] Anton Troynikov: Hey, swyx. How's it going,

[00:55:41] swyx: man? It's going well. We're very excited about open source models. What you got?

[00:55:46] Anton Troynikov: Yeah, I mean, it's an exciting time, right?

[00:55:47] Llama 2 and Chroma

[00:55:47] Anton Troynikov: I got asked almost immediately, what does this mean for Chroma and retrieval and all the other things. We're in the process of benchmarking and evaluating to see if it's actually suitable in the sort of retrieval-augmented generation use case. Intuitively we have this idea that lighter weight models ought to perform well because you don't need so many weights for all the facts.

[00:56:08] You just need them to be reasoning machines. So yeah, we're excited to be trying that out. We'll ship results as soon as we have them available.
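The retrieval-augmented pattern being discussed, where facts live in a document store and the model only has to reason over them, can be sketched as follows. This is a toy under stated assumptions: word-count cosine similarity stands in for a real embedding model and vector store, and the prompt template is invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs, k: int = 1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs, k: int = 1) -> str:
    """Put the retrieved context in front of the question for the model."""
    context = "\n".join(retrieve(query, docs, k))
    return (f"Use the context to answer.\n\nContext:\n{context}\n\n"
            f"Question: {query}\nAnswer:")


docs = [
    "Llama 2 was trained on 2 trillion tokens",
    "Chroma is an embedding database",
    "The sky is blue",
]
prompt = build_prompt("how many tokens was llama 2 trained on", docs)
```

The resulting prompt string is what would be sent to a Llama 2 model; whether a 7B model reasons well enough over such context is exactly what the benchmarking mentioned above is meant to find out.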

[00:56:16] swyx: What evals do you look at for models as reasoning machines?

[00:56:21] Anton Troynikov: I mean, there's plenty of retrieval-augmented generation benchmarks out there. The one that I usually run as a quick test is the SciQ dataset, the multiple choice question answering with distractors and supporting paragraphs.

[00:56:33] Ah, but there's, you know, there's entire batteries of these tests. One of the things that we're actually looking at doing at Chroma very soon, and we've been speaking to the AI research labs about this, is nobody's really got benchmarks that are relevant to production data. The benchmarks that exist are very academically oriented and fairly synthetic.

[00:56:51] So they consist of, you know, crowdsourced exam question answers. They consist of sort of this really document-retrieval-oriented thing where it's like, find a document that's relevant to this query, but production use cases don't always look like that. So we're actually looking at, you know, community sourced benchmarks that focus much more on what the real data actually looks like.

[00:57:15] swyx: Yeah, totally. The only one I can think of, I guess the most prominent one, is the Open Assistant dataset that is free and clear of any usage restrictions stuff. Yeah, I mean, would you, yeah, I think

[00:57:27] Nathan Lambert: so.

[00:57:28] Anton Troynikov: Usage restrictions, I think, I think for evaluating models, there are very few restrictions for use of these data sets.

[00:57:36] For benchmarking, there are very few restrictions. For training, for sort of commercial purposes, there are, but for the case of like, does this model work well in a retrieval context, there are very few usage restrictions.

[00:57:48] Nathan Lambert: Got it.

[00:57:49] swyx: Amazing. Who else has questions or topics that you wanna bring up about Llama 2 in general?

[00:57:55] Open Source MoE model?

[00:57:55] Alessio Fanelli: One thing that I was thinking about is in the benchmarks they compare to GPT-4, but if what George Hotz said on the podcast was right and GPT-4 is like eight attention heads, I wonder when people are gonna get, you know, a Llama 2 mixture of experts going and benchmarking that.

[00:58:12] Maybe it will be better. I don't know.

[00:58:15] swyx: Yes, there, there is a little bit of a playbook that has been published out there, so I mean, it, it takes more skill than I, I have, but I'm sure someone else, else out there is currently working on it. I think that the Chinese universities have, have made some interesting progress there.

[00:58:28] Yeah, Simon, and then we'll go to Mars.

[00:58:31] Llama 2 using tools

[00:58:31] Simon Willison: So we talked about retrieval-augmented generation. The other thing I'm excited about is Toolformer, right? The thing where it can call functions, essentially, and that's mentioned in the paper. They mentioned they benchmarked along that, but I didn't get a feel for whether it was really good at it. The thing I want is basically exactly the same API as OpenAI functions, but I want it to run off of Llama 2.

[00:58:53] I think that would be, that would open up all sorts of opportunities.
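One way to picture a function-calling layer on top of a local model is a small dispatcher that parses the model's output as JSON and invokes the named tool. The sketch below is an assumption-heavy illustration: the `{"name": ..., "arguments": {...}}` shape is modeled loosely on function-call messages, the `add` tool is a toy, and the model call itself is left out entirely.

```python
import json


def dispatch(model_output: str, tools):
    """Run the tool named in a JSON function call; return None if it is not one."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # plain text answer, not a function call
    if not isinstance(call, dict):
        return None
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return None
    return tools[name](**args)


# Hypothetical tool registry for illustration.
tools = {"add": lambda a, b: a + b}

result = dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools)
not_a_call = dispatch("Sure, the answer is five.", tools)
```

The hard part is not the dispatcher but getting the model to emit well-formed calls reliably, which is where fine-tuning on tool-use data would come in.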

[00:58:57] Nathan Lambert: They, they said that that capability was emergent and they didn't train on it. There's a line in the discussion where it's like, oh yeah, we got some tool performance where we didn't train on it. So now we can all go fine tune on it and it should be easier.

[00:59:10] Russell Kaplan on Scale AI's Llama 2 plans

[00:59:10] Anton Troynikov: We got Russell Kaplan in here from the space, from Scale AI. I think we wanna bring him up. I think he's got a few interesting things to say about how Scale is thinking about these things. I know that they were mentioned here before.

[00:59:20] swyx: Hey Russell.

[00:59:21] Russell Kaplan: Here you go. Great. Yeah, no thanks. Thanks Anton. Yeah, we were super stoked about the Llama 2 release. Yeah, we put out an open source library, LLM Engine, for folks to fine-tune and serve Llama 2 and other language models, whether hosted by Scale or on their own infrastructure.

[00:59:37] And I think generally at Scale we're looking to start doing a lot more open source stuff. So you know, one of the next things we're gonna be doing is starting to fine-tune Llama 2 on interesting domain-specific datasets that we create, or problem domains. So Anton, you mentioned you're not sure how well it's working for retrieval.

[00:59:55] You know, we'd love to just like put together a dataset that we could use to fine-tune these models to be good at retrieval. I think we have one planned out for SQL right now. Potentially other tool use. So yeah, I'd be really curious, you know, hearing from the audience if there are sort of requests for good fine-tunes of Llama 2, or if anyone, you know, already has that data, you can just clone our repo, LLM Engine, and try it out.

[01:00:17] Simon Willison: So I've got one for you. I want a clone of ChatGPT Code Interpreter built on top of Llama 2, which I imagine would require quite extensive fine-tuning. But my god, I mean we've talked about this recently, how ChatGPT Code Interpreter really is a next level AI tool. Being able to run our own version of that against Llama 2 would be incredible.

[01:00:35] Yeah, that would be, that would be great.

[01:00:36] Russell Kaplan: I, yeah, we do, we do, we do a lot of code sort of data acquisition right now, so I think that's definitely in the wheelhouse. But yeah, that's a, that's a good idea to,

[01:00:45] Anton Troynikov: to try out.

[01:00:45] Code data acquisition sounds so sinister. Russell,

[01:00:49] Russell Kaplan: You know, it takes, you gotta, you gotta write a lot of code. Write a

[01:00:52] Matt Bornstein: lot of code. Yeah.

[01:00:53] Russell Kaplan: I think we have something like 350,000 people all around the world who are sort of helping with this stuff. And within that there's, you know, a lot of domain specific expertise.

[01:01:01] Scale annotating code?

[01:01:01] swyx: Is there a way that like, so we were talking before you joined about scale acquiring, I guess preference data from developers rather than I guess the, the standard annotators that you have. Is this a, is this a, is this a need or focus that you have? Is there a way that we can help or Yeah. How do we crowdsource this?

[01:01:18] Yeah, no,

[01:01:19] Russell Kaplan: definitely. No. So one of the interesting things for our business, where, you know, we do a lot of the RLHF labeling for all the companies training these foundation models, has just been that the level of expertise required has gone up tremendously. Right? So a lot of our crowd now, it's really domain experts in

[01:01:38] specific areas, whether it's programming in a particular language or people who have, you know, passed the CPA or people who have passed the bar or are licensed in some profession. That's really been where a lot of our sort of growth has been. And so, yeah, I mean, if anyone is a programmer and wants to kind of infuse their knowledge into the AI that will power the rest of our society increasingly over time.

[01:02:01] You can, you can just go to scale.com and and sign up to, to start help help

[01:02:04] Nathan Lambert: programming.

[01:02:06] Immortality

[01:02:06] Anton Troynikov: Another benefit of this is by the time we have AIs strong enough to simulate entire human beings, your data will already be in them. So you'll be resurrected and

[01:02:15] Nathan Lambert: get to live forever in the afterlife.

[01:02:18] swyx: Indeed, we are the first immortals. It's the way to achieve immortality. Yeah. You know, immortality, take it. It's yours, but it's not on the battlefield. It's editing Wikipedia. That is immortality.

[01:02:29] Running Llama on your phone

[01:02:29] swyx: Mars, you had your hand up. Hey, really

[01:02:31] Whole Mars Catalog: been enjoying listening to this conversation. I think it's such an exciting day with Llama 2 and the commercial license.

[01:02:39] One of the things that I've really been excited about, and I think Qualcomm made an announcement with Meta and they said they're going to be looking at optimizing it for Snapdragon hardware, accelerating it. I think one of the most interesting things about these open source models, especially now that you have a commercial license, is actually running it on your laptop or even your smartphone.

[01:03:03] You know, maybe the 7 billion parameter model and the kind of use cases that opened up, that opens up that, you know, just weren't there a few months ago. I was wondering if people had any thoughts on that and what we might see in that area.

[01:03:17] Nathan Lambert: Meta just gave Tim Cook a huge softball for Apple to fix Siri, and they still hate each other.

[01:03:26] Simon Willison: So I've been running Vicuna 7B on my iPhone for a couple of months, mainly as a demo, so I could just shove it in people's faces and go, look, my phone's offline and it's still writing me terrible poetry. And I have to admit, it's fun. I've not yet found use cases for that quality of model for when I'm offline.

[01:03:44] And maybe I'm just not being imaginative enough. My hunch is that models that are smaller like that, that can run on your phone, are much more interesting if you combine them with retrieval-augmented generation or tool use and so on. Just as a plain sort of ChatGPT-style language model, I've not yet found many practical uses for it.

[01:04:02] I'd love to hear from people. Oh, that's not true. I use it for brainstorming occasionally if I want to come up with a name for something that's like I used to dread naming things. Now I, I'm fine with naming things cause I get a language model to brainstorm for me. But one on my phone is good enough to do that.

[01:04:16] I've had it come up with some names for things for me so far.

[01:04:18] Nathan Lambert: We talked about evaluation a lot. I've used it for naming and I've also used these models to kind of generate evaluation prompts, which is kind of a different way to do it. It's like, come up with some hard Python coding questions where you put a bug in this type of function, and like, I'm not gonna come up with that on my own.

[01:04:36] Yeah, it can be a really

[01:04:37] swyx: useful spot check, I guess, or I dunno, mental augmentation tool, whatever

[01:04:43] Nathan Lambert: we call that.

[01:04:44] Sama <3 Satya <3 Zuck? "Azure as Launch Partner"

[01:04:44] Anton Troynikov: So can we, can we take a minute to do some kremlinology here? What's the deal with like, friendship ended with Sam Altman? Now Mark Zuckerberg is my best friend with Satya. I wanna, I wanna get into that

[01:04:55] Alessio Fanelli: side.

[01:04:56] I was smiling a lot more in this picture with Mark than with Sam. That's what I noted. But wait, there's

[01:05:01] swyx: the picture. What?

[01:05:03] Alessio Fanelli: Satya posted a photo with Mark and he was like just laughing away. And then I looked back at the one that, remember the one you posted, Satya and Sam together, at I think the Build conference or something with

[01:05:15] Anton Troynikov: Satya, Satya, Sam, and Sam's nipples.

[01:05:17] Simon Willison: Yes.

[01:05:19] Alessio Fanelli: And, say, Satya was not smiling as much. I don't know. But I really wonder what that does to the, you know, OpenAI does have to pay back a lot of money to Microsoft and stuff. It's

[01:05:29] Anton Troynikov: kinda, it's kinda crazy that Azure is the launch partner, cuz OpenAI is exclusively running on Azure hardware.

[01:05:36] This is a very, very curious move. Right. And I, I can't really disentangle it. Given sort of the scope of Microsoft's investment in OpenAI is entirely in Azure credits. Like one interpretation of this move is that they've already got OpenAI locked in. Right. They're not going anywhere. So might as well get the other, you know, contending models, right?

[01:06:02] If you're Satya, how are you thinking? The only thing that we know for sure accrues value in this environment is owning compute, and that's what Microsoft

[01:06:11] swyx: has. Yes. But AWS is also a launch partner, right? What does it mean to be a launch partner of an open source model? Like if you can run compute, you can, you can run it.

[01:06:20] Alessio Fanelli: I think that's the, that's the main, the main question. Yeah. But I think like Microsoft is clearly, you know, happy to be involved. To them, it's like a yes. Their first equals exclusivity just one way, you know, it's not a two way exclusivity, so they don't, that's whatever. The other thing is

[01:06:35] speaker 1: this, this will probably increase the demand, the compute demand on Azure from all of their enterprise customers, right?

[01:06:41] So, you know, whether they're selling compute to OpenAI or all of the other enterprises they work with. Having more models available that, that everyone's using should, should just kinda

[01:06:50] Matt Bornstein: keep growing that business. Not to mention, I

[01:06:52] Russell Kaplan: think a lot of their Azure customers probably have significant concerns about privacy, about putting sensitive business data through this, and being able to just run inference on your own hardware that you control probably is more appealing to them in some cases than calling a REST API out to OpenAI's infrastructure.

[01:07:11] Azure?

[01:07:12] Anton Troynikov: Well, they've got Azure endpoints for the OpenAI models. I'm actually not quite up to speed with the privacy model there, but my understanding is there's not really much difference.

[01:07:25] Simon Willison: My hunch is that it doesn't matter if it is what? What matters is, is what people feel.

[01:07:29] It's the vibes. And you see so many of these, so many people, so many companies saying, no, absolutely no way we would pump any of our private data through somebody else's model, even if they say they won't use it for training, which they all do. Whereas I guess maybe they're okay with pumping it through Microsoft Azure, but at least it's on our own, like, GPU reserved instances.

[01:07:51] Maybe that's what's going on here. There's so much paranoia around this space at the moment. Yeah, a lot of the

[01:07:55] Russell Kaplan: details come down to can you run it within your own virtual private cloud? I, I wish, I wish we could close enterprise customer security requirements on the vibes, but at least in my experience at at scale people do, you know, there there's some compliance function somewhere in the organization that has to sort of check the boxes that you're not, you know, gonna get screwed on later.

[01:08:15] And so that's definitely been one of the big drivers of people looking to self-host their own open source LLMs more and more.

[01:08:25] Alessio Fanelli: Yeah. And the other thing is that they did not use any Azure compute to actually train the model. So if you go in the paper it mentions they only use their super cluster and their internal production cluster.

[01:08:35] So no Azure was used to train it. I guess it's just the inference partner. Yeah, so I mean, going back to the point of they just want GPUs to run. It's not about, this is the best GPUs that we use. They didn't even use it.

[01:08:48] Meta "Open Source" Leadership

[01:08:48] Matt Bornstein: I think what's really interesting

[01:08:49] speaker 1: about this release is that, you know, for a while people have been talking about how, oh, is Meta behind in AI, generative AI and language models. And you know, I think Roon had a tweet that was like, the best open source model sounds a lot better than the fifth best language model.

[01:09:06] And it's actually totally true. And, and I actually think that that companies, you know, if you are behind, if you're not in first place, if you, if you open source stuff and you just sort of get the community using it you can, you can get a lot of goodwill,

[01:09:18] Nathan Lambert: get a lot of adoption and actually really move

[01:09:20] speaker 1: the industry forward.

[01:09:21] So yeah, really cool to see Meta sort of put this out and I think, I think it will also spur a lot more open source from a lot

[01:09:28] swyx: of other companies.

[01:09:28] I fully agree. I think this is something that we've been very excited about. We heard some buzz about it a couple months ago, and then, you know, earlier this week, or I guess last week, and now it's fully out. Okay. Maybe I'll do just a round for predictions.

[01:09:43] What happens next in open source models with Llama?

[01:09:46] Prediction: Finetuning => New Use Cases from Internal State

[01:09:46] Nathan Lambert: I'll go first. I'll

[01:09:47] go

[01:09:47] Anton Troynikov: first. I think the first thing that needs to happen here is the community will actually get the model into its hands and find out its true capabilities. Benchmarks only take us so far. Once that has happened, we're gonna see an extensive sort of period of fine-tuning where people are going to apply it to their particular applications and, you know, keep pushing the envelope here. And then if it is sufficiently capable, I actually think that we might find new uses for these models that we don't find in REST API-served ones, because you can get at the internal state.

[01:10:16] Right. The thing that I'm always thinking about obviously is embeddings and internal states and, and like modifications here. And I think that there's actually a great deal of interesting research and engineering to be done by looking into what's happening in these models live, especially a sufficiently capable one, which we can do reasoning.

[01:10:32] And so I'm particularly excited about that. I'm particularly excited about having something at least sufficiently capable that we can start to reason about because the entire research community has access to it rather than, you know, behind a closed wall inside some of the

[01:10:45] Nathan Lambert: bigger AI labs.

[01:10:47] swyx: Anyone else? Simon Nathan?

[01:10:48] Nathan Lambert: Yeah, I would mostly just double down on that, and I could comment on how remarkable the collapse of kind of NLP research as it was has been onto OpenAI APIs.

[01:11:01] And this is an opportunity to reset some of that dynamic, where so much academic work was fine-tuning OpenAI models, and then it was like, oh, sorry, we nuked all your fine-tuned models and things like that. Like from a values perspective, this is huge for research to kind of proceed as it was meant to be in a way.

[01:11:23] And that is wonderful.

[01:11:24] Prediction: Llama Toolformer

[01:11:24] Simon Willison: I'm looking forward to the first fine-tunes. I think like Alpaca is what unlocked LLaMA. I can't wait to see what people do, especially since everyone's already amped up and ready to go. So I think it'll be fascinating to see how those start shaping up in the next few days, few weeks.

[01:11:38] And yeah, I want to see the applications. I want to see people figure out retrieval-augmented generation. I want to see people figure out if it can do Toolformer, all of those things, especially the tricks which make the sort of smaller models, the 7B models, able to solve interesting problems.

[01:11:53] And I think this is gonna happen really quickly. You know, we've got so many more people who know how to work with these models today than we did when LLaMA came out back at the end of February. So I'm expecting that to just be a whirlwind of activity starting about four hours ago. And yeah, I can't wait to see what happens.

[01:12:09] Prediction: Finetune-for-everything

[01:12:09] Simon Willison: I, I totally

[01:12:10] Russell Kaplan: agree. I think there's gonna be an explosion of domain-specific and use-case-specific fine-tunes. And I think that, you know, the sort of first order effects are gonna be pretty clear: in this industry, in this domain, everyone is gonna start putting out these domain-specific fine-tunes, not just the companies themselves doing it for their own use case, but, you know, as someone said, Alpaca sort of made LLaMA accessible.

[01:12:36] We'll have something really similar, but for each category of application. And then I think the second order effect that's really interesting to me is I think tool use and agents are gonna get really good. Right now. People are using, you know, sort of off the shelf untuned language models to try to build agents have them use tools.

[01:12:57] But if you're building, you know, an application and you just need to use that one tool really, really well, now you have suddenly a GPT-3.5-class model that you can fine-tune exclusively to that tool. It's gonna work really well. And I think that, you know, the barrier to utility is so high for these tool use real world applications because of this sort of problem of exponential compounding of errors over long chains.

[01:13:24] But if fine tuning works well for that, I think it's gonna be a really big game changer.
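The compounding of errors over long chains is easy to quantify under a simplifying assumption: if every step of an agent's chain succeeds independently with the same probability, the whole chain succeeds with that probability raised to the number of steps. The numbers below are illustrative only, not measurements of any model.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed, each with probability p_step."""
    return p_step ** n_steps


# Illustrative per-step reliabilities for a 20-step tool-use chain.
untuned = chain_success(0.95, 20)  # an off-the-shelf model that is right 95% of the time
tuned = chain_success(0.99, 20)    # a fine-tuned model that is right 99% of the time
```

Under these assumptions the 95% model finishes a 20-step chain only about 36% of the time, while the 99% model finishes about 82% of the time, which is why a few points of per-step reliability from fine-tuning can matter so much for agents.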

[01:13:30] Predictions: Llama Agents

[01:13:30] Anton Troynikov: I am so bullish on agents. Like, I'm well aware that they're nothing but toys today, although I can think of a couple of practical use cases, including in the fine-tuning context. Russell, we ought to talk about this actually later, but that's a really good point. To my mind, sort of having an easy-to-fine-tune model for your particular agent use case is maybe going to make these things more useful than they are today.

[01:13:51] I'm very bullish on that. I'm hopeful, and of course, cuz Chroma builds memory for agents, it would be great for us too.

[01:13:57] swyx: All right. Alessio, I dunno if you have any predictions. I think I'm kind of out. You guys are definitely taking all the ones that I was gonna say. Yeah.

[01:14:05] dP(Doom)?

[01:14:05] Nathan Lambert: Wait, wait, wait,

[01:14:05] Anton Troynikov: wait, wait. Before we sign off here, let's go around the room. Probability of AI doom improved or made worse by the release of Llama 2?

[01:14:14] Nathan Lambert: Let's go.

[01:14:15] Simon Willison: I couldn't care less. I don't care about the doom scenarios. I care about building stuff with, with what we've got.

[01:14:22] Nathan Lambert: So,

[01:14:22] Anton Troynikov: so none, it has not moved

[01:14:24] Nathan Lambert: your needle. No.

[01:14:25] Simon Willison: My, my needle is, is stuck on the sort of metal, maybe 5%, but, but not worth thinking about. Too hard.

[01:14:31] Anton Troynikov: All right. Five, 5% doom. I'm, I'm willing to accept 5% doom.

[01:14:36] We've, we've, we've accepted way more percent doom than other technologies.

[01:14:39] Alessio Fanelli: I'm an old DOM, so it's we're, we're gonna use it for more good than bad. We'll be done with it.

[01:14:45] Speaker 2: I would like to believe that having a model that we can actually understand, and go deep and develop on top of, will not only avert the doom scenarios, but will allow us to prepare better in case any crazy person wants to make doom on their own. A sufficient enough community of builders of LLMs and AIs

[01:15:10] Matt Bornstein: can stop that.

[01:15:12] Anton Troynikov: Yeah, I think that's a really great point actually. The safety story gets better when we have more opportunities to work with the core internals of the models as they actually exist instead of hypothetical abstract objects that we reason about.

[01:15:27] swyx: Yeah, I was gonna say, I'm a pretty high p(doom) person, but it's moved down because we can have, you know, GPT-5 or Llama 3 explain the weights of Llama 2.

[01:15:37] And I do think that that improves interpretability quite a bit.

[01:15:42] Anton Troynikov: How are you going to know if it's telling the truth? I know about these "just ask the model" approaches, but I'm pretty skeptical.

[01:15:49] Nathan Lambert: I've gotta tell ya.

[01:15:51] swyx: Give it a Go board, you know, swap out one of the positions, see what happens, that kinda stuff.

[01:15:55] You know, we've done very, very small scale versions of this already, right? So, I dunno, this is hand wavy. I mean, you know.

[01:16:03] Anton Troynikov: No, I'm just genuinely curious about the ideas here, but that's a different discussion. Exactly. Yeah.

[01:16:09] Russell Kaplan: Yeah, I just think it's amazing how these language model capabilities that just a few months ago felt cutting edge, when people used them for the first time in ChatGPT, have now progressed to a state where it's almost becoming commodified and everybody's having these models.

[01:16:27] There's more and more of them popping up, people starting things and open source models exploding. I don't think necessarily we can fully understand the significance of what's happening here today, but going into the future, it's probably going to be really common for pretty much every computer to be running large language models natively on the device.

[01:16:51] Wrapping up

[01:16:51] swyx: All right. Well, that's a very positive view of the future. I think we're all very encouraged by that. Yeah. I would just want to thank everyone for joining and sharing your thoughts on Llama 2. Alessio, did you have parting thoughts?

[01:17:01] Alessio Fanelli: No, that was it. Thank you everyone.

[01:17:05] swyx: Thank you so much. We'll clean up the audio of this thing and post it tomorrow on Latent Space, but otherwise, I think we should follow what Russell and Nathan and the others have been saying, which is go play with Llama 2.

[01:17:14] So I guess we'll all go do that. Have a wonderful day everyone. Thanks everyone. Thank you sir. Alex. Thanks everyone. Bye bye. Have a

[01:17:23] Speaker 2: great time.

And 5th best model overall, which is still impressive?