Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

Latent Space: The AI Engineer Podcast

0:00

-1:05:02

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

On raising $200m to build agent operating systems that can reason and code, why LLMs beat reinforcement learning for agent usecases, and how to build a Scenius with top AI people

Kanjun Qiu

Oct 14, 2023

Thanks to the over 11,000 people who joined us for the first AI Engineer Summit! A full recap is coming, but you can 1) catch up on the fun and videos on Twitter and YouTube, 2) help us reach 1000 people for the first comprehensive State of AI Engineering survey and 3) submit projects for the new AI Engineer Foundation.

See our Community page for upcoming meetups in SF, Paris, NYC, and Singapore.

This episode had good interest on Twitter.

Last month, Imbue was crowned as AI’s newest unicorn foundation model lab, raising a $200m Series B at a >$1 billion valuation. As “stealth” foundation model companies go, Imbue (f.k.a. Generally Intelligent) has stood as an enigmatic group given they have no publicly released models to try out1. However, ever since their $20m Series A last year their goal has been to “develop generally capable AI agents with human-like intelligence in order to solve problems in the real world”.

From RL to Reasoning LLMs

Along with their Series A, they announced Avalon, “A Benchmark for RL Generalization Using Procedurally Generated Worlds”. Avalon is built on top of the open source Godot game engine, and is ~100x faster than Minecraft to enable fast RL benchmarking and a clear reward with adjustable game difficulty.

After a while, they realized that pure RL isn’t a good path to teach reasoning and planning. The agents were able to learn mechanical things like opening complex doors, climbing, but couldn’t go to higher level tasks. A pure RL world also doesn’t include a language explanation of the agent reasoning, which made it hard to understand why it made certain decisions. That pushed the team more towards the “models for reasoning” path:

“The second thing we learned is that pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were able to learn all sorts of crazy things: They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing.”

Inspired by Chelsea Finn’s work on SayCan at Stanford, the team pivoted to have their agents do the reasoning in natural language instead. This development parallels the large leaps in reasoning that humans have developed as the scientific method:

“We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask:
What was the original claim that was made?
What evidence is there for this claim?
Does the evidence support the claim?
Is the claim correct?
This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we can generate data that's much more specific to them.“

The Full Stack Model Lab

One year later, it would seem that the pivot to reasoning has had tremendous success, and Imbue has now reached a >$1B valuation, with participation from Astera Institute, NVIDIA, Cruise CEO Kyle Vogt, Notion co-founder Simon Last, and others. Imbue tackles their work with a “full stack” approach:

Models. Pretraining very large (>100B parameter) models, optimized to perform well on internal reasoning benchmarks, with a ~10,000 Nvidia H100 GPU cluster2 lets us iterate rapidly on everything from training data to architecture and reasoning mechanisms.
Tools and Agents. Building internal productivity tools from coding agents for fixing type checking and linting errors, to sophisticated systems like CARBS (for hyperparameter tuning and network architecture search).
Interface Invention. Solving agent trust and collaboration (not merely communication) with humans by creating better abstractions and interfaces — IDEs for users to program computers in natural language.
Theory. Publishing research about the theoretical underpinnings of self-supervised learning, as well as scaling laws for machine learning research.

Kanjun believes we are still in the “bare metal phase” of agent development, and they want to take a holistic approach to building the “operating system for agents”. We loved diving deep into the Imbue approach toward solving the AI Holy Grail of reliable agents, and are excited to share our conversation with you today!

Recorded at the beautiful StudioPod studios in San Francisco

Timestamps

[00:00:00] Introductions
[00:06:07] The origin story of Imbue
[00:09:39] Imbue's approach to training large foundation models optimized for reasoning
[00:12:18] Imbue's goals to build an "operating system" for reliable, inspectable AI agents
[00:15:37] Imbue's process of developing internal tools and interfaces to collaborate with AI agents
[00:17:27] Imbue's focus on improving reasoning capabilities in models, using code and other data
[00:19:50] The value of using both public benchmarks and internal metrics to evaluate progress
[00:21:43] Lessons learned from developing the Avalon research environment
[00:23:31] The limitations of pure reinforcement learning for general intelligence
[00:28:36] Imbue's vision for building better abstractions and interfaces for reliable agents
[00:31:36] Interface design for collaborating with, rather than just communicating with, AI agents
[00:37:40] The future potential of an agent-to-agent protocol
[00:39:29] Leveraging approaches like critiquing between models and chain of thought
[00:45:49] Kanjun's philosophy on enabling team members as creative agents at Imbue
[00:53:51] Kanjun's experience co-founding the communal co-living space The Archive
[01:00:22] Lightning Round

Show Notes

Imbue
MIT Media Lab
Research mentioned:
Agent Protocol - part of the AI Engineer Foundation
Xerox PARC
Michael Nielsen
Jason Benn
Outset Capital
Scenius - Kevin Kelly

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO at Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. [00:00:19]

Swyx: Hey, and today in the studio we have Kanjun from Imbue. Welcome. So you and I have, I guess, crossed paths a number of times. You're formerly named Generally Intelligent and you've just announced your rename, rebrand in huge, humongous ways. So congrats on all of that. And we're here to dive in into deeper detail on Imbue. We like to introduce you on a high level basis, but then have you go into a little bit more of your personal side. So you graduated your BS at MIT and you also spent some time at the MIT Media Lab, one of the most famous, I guess, computer hacking labs in the world. Then you graduated MIT and you went straight into BizOps at Dropbox, where you're eventually chief of staff, which is a pretty interesting role we can dive into later. And then it seems like the founder bug hit you. You were basically a three times founder at Ember, Sorceress, and now at Generally Intelligent slash Imbue. What should people know about you on the personal side that's not on your LinkedIn? That's something you're very passionate about outside of work. [00:01:12]

Kanjun: Yeah. I think if you ask any of my friends, they would tell you that I'm obsessed with agency, like human agency and human potential. [00:01:19]

Swyx: That's work. Come on.

Kanjun: It's not work. What are you talking about?

Swyx: So what's an example of human agency that you try to promote? [00:01:27]

Kanjun: With all of my friends, I have a lot of conversations with them that's kind of helping figure out what's blocking them. I guess I do this with a team kind of automatically too. And I think about it for myself often, like building systems. I have a lot of systems to help myself be more effective. At Dropbox, I used to give this onboarding talk called How to Be Effective, which people liked. I think like a thousand people heard this onboarding talk, and I think maybe Dropbox was more effective. I think I just really believe that as humans, we can be a lot more than we are. And it's what drives everything. I guess completely outside of work, I do dance. I do partner dance. [00:02:03]

Swyx: Yeah. Lots of interest in that stuff, especially in the sort of group living houses in San Francisco, which I've been a little bit part of, and you've also run one of those. [00:02:12]

Kanjun: That's right. Yeah. I started the archive with two friends, with Josh, my co-founder, and a couple of other folks in 2015. That's right. And GPT-3, our housemates built. [00:02:22]

Swyx: Was that the, I guess, the precursor to Generally Intelligent, that you started doing more things with Josh? Is that how that relationship started? Yeah. [00:02:30]

Kanjun: This is our third company together. Our first company, Josh poached me from Dropbox for Ember. And there we built a really interesting technology, laser raster projector, VR headset. And then we were like, VR is not the thing we're most passionate about. And actually it was kind of early days when we both realized we really do believe that in our lifetimes, like computers that are intelligent are going to be able to allow us to do much more than we can do today as people and be much more as people than we can be today. And at that time, we actually, after Ember, we were like, work on AI research or start an AI lab. A bunch of our housemates were joining OpenAI, and we actually decided to do something more pragmatic to apply AI to recruiting and to try to understand like, okay, if we are actually trying to deploy these systems in the real world, what's required? And that was Sorceress. That taught us so much about maybe an AI agent in a lot of ways, like what does it actually take to make a product that people can trust and rely on? I think we never really fully got there. And it's taught me a lot about what's required. And it's kind of like, I think informed some of our approach and some of the way that we think about how these systems will actually get used by people in the real world. [00:03:42]

Swyx: Just to go one step deeper on that, you're building AI agents in 2016 before it was cool. You got some muscle and you raised $30 million. Something was working. What do you think you succeeded in doing and then what did you try to do that did not pan out? [00:03:56]

Kanjun: Yeah. So the product worked quite well. So Sorceress was an AI system that basically looked for candidates that could be a good fit and then helped you reach out to them. And this was a little bit early. We didn't have language models to help you reach out. So we actually had a team of writers that like, you know, customized emails and we automated a lot of the customization. But the product was pretty magical. Like candidates would just be interested and land in your inbox and then you can talk to them. As a hiring manager, that's such a good experience. I think there were a lot of learnings, both on the product and market side. On the market side, recruiting is a market that is endogenously high churn, which means because people start hiring and then we hire the role for them and they stop hiring. So the more we succeed, the more they... [00:04:39]

Swyx: It's like the whole dating business. [00:04:40]

Kanjun: It's the dating business. Exactly. Exactly. And I think that's the same problem as the dating business. And I was really passionate about like, can we help people find work that is more exciting for them? A lot of people are not excited about their jobs and a lot of companies are doing exciting things and the matching could be a lot better. But the dating business phenomenon like put a damper on that, like it's actually a pretty good business. But as with any business with like relatively high churn, the bigger it gets, the more revenue we have, the slower growth becomes because if 30% of that revenue you lose year over year, then it becomes a worse business. So that was the dynamic we noticed quite early on after our Series A. I think the other really interesting thing about it is we realized what was required for people to trust that these candidates were like well vetted and had been selected for a reason. And it's what actually led us, you know, a lot of what we do at Imbue is working on interfaces to figure out how do we get to a situation where when you're building and using agents, these agents are trustworthy to the end user. That's actually one of the biggest issues with agents that, you know, go off and do longer range goals is that I have to trust, like, did they actually think through this situation? And that really informed a lot of our work today. [00:05:52]

Alessio: Let's jump into GI now, Imbue. When did you decide recruiting was done for you and you were ready for the next challenge? And how did you pick the agent space? I feel like in 2021, it wasn't as mainstream. Yeah. [00:06:07]

Kanjun: So the LinkedIn says that it started in 2021, but actually we started thinking very seriously about it in early 2020, late 2019, early 2020. So what we were seeing is that scale is starting to work and language models probably will actually get to a point where like with hacks, they're actually going to be quite powerful. And it was hard to see that at the time, actually, because GPT-3, the early versions of it, there are all sorts of issues. We're like, oh, that's not that useful, but we could kind of see like, okay, you keep improving it in all of these different ways and it'll get better. What Josh and I were really interested in is how can we get computers that help us do bigger things? Like, you know, there's this kind of future where I think a lot about, you know, if I were born in 1900 as a woman, like my life would not be that fun. I'd spend most of my time like carrying water and literally like getting wood to put in the stove to cook food and like cleaning and scrubbing the dishes and, you know, getting food every day because there's no refrigerator, like all of these things, very physical labor. And what's happened over the last 150 years since the industrial revolution is we've kind of gotten free energy, like energy is way more free than it was 150 years ago. And so as a result, we've built all these technologies like the stove and the dishwasher and the refrigerator, and we have electricity and we have infrastructure, running water, all of these things that have totally freed me up to do what I can do now. And I think the same thing is true for intellectual energy. We don't really see it today, but because we're so in it, but our computers have to be micromanaged. You know, part of why people are like, oh, you're stuck to your screen all day. Well, we're stuck to our screen all day because literally nothing happens unless I'm doing something in front of my screen. I don't, you know, I can't send my computer off to do a bunch of stuff for me. And there is a future where that's not the case, where, you know, I can actually go off and do stuff and trust that my computer will pay my bills and figure out my travel plans and do the detailed work that I am not that excited to do so that I can like be much more creative and able to do things that I as a human, I'm very excited about and collaborate with other people. And there are things that people are uniquely suited for. So that's kind of always been the thing that has been really exciting to me. Like Josh and I have known for a long time, I think that, you know, whatever AI is, it would happen in our lifetimes. And the personal computer kind of started giving us a bit of free intellectual energy. And this is like really the explosion of free intellectual energy. So in early 2020, we were thinking about this and what happened was self-supervised learning basically started working across everything. So worked in language, SimClear came out, I think MoCo had come out, Momentum Contrast had come out earlier in 2019, SimClear came out in early 2020. And we're like, okay, for the first time, self-supervised learning is working really well across images and text and suspect that like, okay, actually it's the case that machines can learn things the way that humans do. And if that's true, if they can learn things in a fully self-supervised way, because like as people, we are not supervised. We like go Google things and try to figure things out. So if that's true, then like what the computer could be is much bigger than what it is today. And so we started exploring ideas around like, how do we actually go? We didn't think about the fact that we could actually just build a research lab. So we were like, okay, what kind of startup could we build to like leverage self-supervised learning? So that eventually becomes something that allows computers to become much more able to do bigger things for us. But that became General Intelligence, which started as a research lab. [00:09:39]

Alessio: So your mission is you aim to rekindle the dream of the personal computer. So when did it go wrong and what are like your first products and user facing things that you're building to rekindle it? [00:09:53]

Kanjun: Yeah. So what we do at Imbue is we train large foundation models optimized for reasoning. And the reason for that is because reasoning is actually, we believe the biggest blocker to agents or systems that can do these larger goals. If we think about something that writes an essay, like when we write an essay, we like write it. We put it and then we're done. We like write it and then we look at it and we're like, oh, I need to do more research on that area. I'm going to go do some research and figure it out and come back and, oh, actually it's not quite right. The structure of the outline. So I'm going to rearrange the outline, rewrite it. It's this very iterative process and it requires thinking through like, okay, what am I trying to do? Is the goal correct? Also like, has the goal changed as I've learned more? So as a tool, like when should I ask the user questions? I shouldn't ask them questions all the time, but I should ask them questions in higher risk situations. How certain am I about the like flight I'm about to book? There are all of these notions of like risk certainty, playing out scenarios, figuring out how to make a plan that makes sense, how to change the plan, what the goal should be. That are things that we lump under the bucket of reasoning and models today, they're not optimized for reasoning. It turns out that there's not actually that much explicit reasoning data on the internet as you would expect. And so we get a lot of mileage out of optimizing our models for reasoning in pre-training. And then on top of that, we build agents ourselves and we, I can get into, we really believe in serious use, like really seriously using the systems and trying to get to an agent that we can use every single day, tons of agents that we can use every single day. And then we experiment with interfaces that help us better interact with the agents. So those are some set of things that we do on the kind of model training and agent side. And then the initial agents that we build, a lot of them are trying to help us write code better because code is most of what we do every day. And then on the infrastructure and theory side, we actually do a fair amount of theory work to understand like, how do these systems learn? And then also like, what are the right abstractions for us to build good agents with, which we can get more into. And if you look at our website, we build a lot of tools internally. We have a like really nice automated hyperparameter optimizer. We have a lot of really nice infrastructure and it's all part of the belief of like, okay, let's try to make it so that the humans are doing the things humans are good at as much as possible. So out of our very small team, we get a lot of leverage. [00:12:18]

Swyx: And so would you still categorize yourself as a research lab now, or are you now in startup mode? Is that a transition that is conscious at all? [00:12:26]

Kanjun: That's a really interesting question. I think we've always intended to build, you know, to try to build the next version of the computer, enable the next version of the computer. The way I think about it is there's a right time to bring a technology to market. So Apple does this really well. Actually, iPhone was under development for 10 years, AirPods for five years. And Apple has a story where iPhone, the first multi-touch screen was created. They actually were like, oh wow, this is cool. Let's like productionize iPhone. They actually brought, they like did some work trying to productionize it and realized this is not good enough. And they put it back into research to try to figure out like, how do we make it better? What are the interface pieces that are needed? And then they brought it back into production. So I think of production and research as kind of like these two separate phases. And internally we have that concept as well, where like things need to be done in order to get to something that's usable. And then when it's usable, like eventually we figure out how to productize it. [00:13:20]

Alessio: What's the culture like to make that happen, to have both like kind of like product oriented, research oriented. And as you think about building the team, I mean, you just raised 200 million. I'm sure you want to hire more people. What are like the right archetypes of people that work at Imbue? [00:13:35]

Kanjun: I would say we have a very unique culture in a lot of ways. I think a lot about social process design. So how do you design social processes that enable people to be effective? I like to think about team members as creative agents, because most companies, they think of their people as assets and they're very proud of this. And I think about like, okay, what is an asset? It's something you own that provides you value that you can discard at any time. This is a very low bar for people. This is not what people are. And so we try to enable everyone to be a creative agent and to really unlock their superpowers. So a lot of the work I do, you know, I was mentioning earlier, I'm like obsessed with agency. A lot of the work I do with team members is try to figure out like, you know, what are you really good at? What really gives you energy and where can we put you such that, how can I help you unlock that and grow that? So much of our work, you know, in terms of team structure, like much of our work actually comes from people. Carbs, our hyperparameter optimizer came from Abe trying to automate his own research process doing hyperparameter optimization. And he actually pulled some ideas from plasma physics. He's a plasma physicist to make the local search work. A lot of our work on evaluations comes from a couple of members of our team who are like obsessed with evaluations. We do a lot of work trying to figure out like, how do you actually evaluate if the model is getting better? Is the model making better agents? Is the agent actually reliable? A lot of things kind of like, I think of people as making the like them shaped blob inside imbue and I think, you know, yeah, that's the kind of person that we're, we're hiring for. We're hiring product engineers and data engineers and research engineers and all these roles. We have projects, not teams. We have a project around data, data collection and data engineering. That's actually one of the key things that improve the model performance. We have a pre-training kind of project with some fine tuning as part of that. And then we have an agent's project that's like trying to build on top of our models as well as use other models in the outside world to try to make agents then we actually use as programmers every day. So all sorts of different, different projects. [00:15:37]

Swyx: As a founder, you're now sort of a capital allocator among all of these different investments effectively at different projects. And I was interested in how you mentioned that you were optimizing for improving reasoning and specifically inside of your pre-training, which I assume is just a lot of data collection. [00:15:55]

Kanjun: We are optimizing reasoning inside of our pre-trained models. And a lot of that is about data. And I can talk more about like what, you know, what exactly does it involve? But actually big, maybe 50% plus of the work is figuring out even if you do have models that reason well, like the models are still stochastic. The way you prompt them still makes, is kind of random, like makes them do random things. And so how do we get to something that is actually robust and reliable as a user? How can I, as a user, trust it? We have all sorts of cool things on the, like, you know, I was mentioning earlier when I talked to other people building agents, they have to do so much work, like to try to get to something that they can actually productize and it takes a long time and agents haven't been productized yet for, partly for this reason is that like the abstractions are very leaky. We can get like 80% of the way there, but like self-driving cars, like the remaining 20% is actually really difficult. We believe that, and we have internally, I think some things that like an interface, for example, that lets me really easily like see what the agent execution is, fork it, try out different things, modify the prompt, modify like the plan that it is making. This type of interface, it makes it so that I feel more like I'm collaborating with the agent as it's executing, as opposed to it's just like doing something as a black box. That's an example of a type of thing that's like beyond just the model pre-training, but on the model pre-training side, like reasoning is a thing that we optimize for. And a lot of that is about what data do we put in. [00:17:27]

Swyx: It's interesting just because I always think like, you know, out of the levers that you have, the resources that you have, I think a lot of people think that running foundation model company or a research lab is going to be primarily compute. And I think the share of compute has gone down a lot over the past three years. It used to be the main story, like the main way you scale is you just throw more compute at it. And now it's like, Flops is not all you need. You need better data, you need better algorithms. And I wonder where that shift has gone. This is a very vague question, but is it like 30-30-30 now? Is it like maybe even higher? So one way I'll put this is people estimate that Llama2 maybe took about three to $4 million of compute, but probably 20 to $25 million worth of labeling data. And I'm like, okay, well that's a very different story than all these other foundation model labs raising hundreds of millions of dollars and spending it on GPUs. [00:18:20]

Kanjun: Data is really expensive. We generate a lot of data. And so that does help. The generated data is close to actually good, as good as human labeled data. [00:18:34]

Swyx: So generated data from other models? [00:18:36]

Kanjun: From our own models. From your own models. Or other models, yeah. [00:18:39]

Swyx: Do you feel like there's certain variations of this? There's the sort of the constitutional AI approach from Anthropic and basically models sampling training on data from other models. I feel like there's a little bit of like contamination in there, or to put it in a statistical form, you're resampling a distribution that you already have that you already know doesn't match human distributions. How do you feel about that basically, just philosophically? [00:19:04]

Kanjun: So when we're optimizing models for reasoning, we are actually trying to like make a part of the distribution really spiky. So in a sense, like that's actually what we want. We want to, because the internet is a sample of the human distribution that's also skewed in all sorts of ways. That is not the data that we necessarily want these models to be trained on. And so when we're generating data, we're not really randomly generating data. We generate very specific things that are like reasoning traces and that help optimize reasoning. Code also is a big piece of improving reasoning. So generated code is not that much worse than like regular human written code. You might even say it can be better in a lot of ways. So yeah. So we are trying to already do that. [00:19:50]

Alessio: What are some of the tools that you thought were not a good fit? So you built Avalon, which is your own simulated world. And when you first started, the metagame was like using games to simulate things using, you know, Minecraft and then OpenAI is like the gym thing and all these things. And I think in one of your other podcasts, you mentioned like Minecraft is like way too slow to actually do any serious work. Is that true? Yeah. I didn't say it. [00:20:17]

Swyx: I don't know. [00:20:18]

Alessio: That's above my pay grade. But Avalon is like a hundred times faster than Minecraft for simulation. When did you figure that out that you needed to just like build your own thing? Was it kind of like your engineering team was like, Hey, this is too slow. Was it more a long-term investment? [00:20:34]

Kanjun: Yeah. At that time we built Avalon as a research environment to help us learn particular things. And one thing we were trying to learn is like, how do you get an agent that is able to do many different tasks? Like RL agents at that time and environments at that time. What we heard from other RL researchers was the like biggest thing keeping holding the field back is lack of benchmarks that let us explore things like planning and curiosity and things like that and have the agent actually perform better if the agent has curiosity. And so we were trying to figure out in a situation where, how can we have agents that are able to handle lots of different types of tasks without the reward being pretty handcrafted? That's a lot of what we had seen is that like these very handcrafted rewards. And so Avalon has like a single reward it's across all tasks. And it also allowed us to create a curriculum so we could make the level more or less difficult. And it taught us a lot, maybe two primary things. One is with no curriculum, RL algorithms don't work at all. So that's actually really interesting. [00:21:43]

Swyx: For the non RL specialists, what is a curriculum in your terminology? [00:21:46]

Kanjun: So a curriculum in this particular case is basically the environment Avalon lets us generate simpler environments and harder environments for a given tasks. What's interesting is that the simpler environments, what you'd expect is the agent succeeds more often. So it gets more reward. And so, you know, kind of my intuitive way of thinking about it is, okay, the reason why it learns much faster with a curriculum is it's just getting a lot more signal. And that's actually an interesting general intuition to have about training these things as like, what kind of signal are they getting? And like, how can you help it get a lot more signal? The second thing we learned is that reinforcement learning is not a good vehicle, like pure reinforcement learning is not a good vehicle for planning and reasoning. So these agents were not able to, they were able to learn all sorts of crazy things. They could learn to climb like hand over hand in VR climbing, they could learn to open doors like very complicated, like multiple switches and a lever open the door, but they couldn't do any higher level things. And they couldn't do those lower level things consistently necessarily. And as a user, I do not want to interact with a pure reinforcement learning end to end RL agent. As a user, like I need much more control over what that agent is doing. And so that actually started to get us on the track of thinking about, okay, how do we do the reasoning part in language? And we were pretty inspired by our friend Chelsea Finn at Stanford was I think working on SACAN at the time where it's basically an experiment where they have robots kind of trying to do different tasks and actually do the reasoning for the robot in natural language. And it worked quite well. And that led us to start experimenting very seriously with reasoning. [00:23:31]

Alessio: How important is the language part for the agent versus for you to inspect the agent? You know, like is it the interface to kind of the human on the loop really important or? [00:23:43]

Kanjun: Yeah, I personally think of it as it's much more important for us, the human user. So I think you probably could get end to end agents that work and are fairly general at some point in the future. But I think you don't want that. Like we actually want agents that we can like perturb while they're trying to figure out what to do. Because, you know, even a very simple example, internally we have like a type error fixing agent and we have like a test generation agent. Test generation agent goes off rails all the time. I want to know, like, why did it generate this particular test? [00:24:19]

Swyx: What was it thinking? [00:24:20]

Kanjun: Did it consider, you know, the fact that this is calling out to this other function? And the formatter agent, if it ever comes up with anything weird, I want to be able to debug like what happened with RL end to end stuff. Like we couldn't do that. Yeah. [00:24:36]

Swyx: It sounds like you have a bunch of agents operating internally within the company. What's your most, I guess, successful agent and what's your least successful one? [00:24:44]

Kanjun: The agents don't work. All of them? I think the only successful agents are the ones that do really small things. So very specific, small things like fix the color of this button on the website or like change the color of this button. [00:24:57]

Swyx: Which is now sweep.dev is doing that. Exactly. [00:25:00]

Kanjun: Perfect. Okay. [00:25:02]

Swyx: Well, we should just use sweep.dev. Well, I mean, okay. I don't know how often you have to fix the color of a button, right? Because all of them raise money on the idea that they can go further. And my fear when encountering something like that is that there's some kind of unknown asymptote ceiling that's going to prevent them, that they're going to run head on into that you've already run into. [00:25:21]

Kanjun: We've definitely run into such a ceiling. But what is the ceiling? [00:25:24]

Swyx: Is there a name for it? Like what? [00:25:26]

Kanjun: I mean, for us, we think of it as reasoning plus these tools. So reasoning plus abstractions, basically. I think actually you can get really far with current models and that's why it's so compelling. Like we can pile debugging tools on top of these current models, have them critique each other and critique themselves and do all of these, like spend more computer inference time, context hack, retrieve augmented generation, et cetera, et cetera, et cetera. Like the pile of hacks actually does get us really far. And a way to think about it is like the underlying language model is kind of like a noisy channel. Actually I don't want to use this analogy. It's actually a really bad analogy, but you kind of like trying to get more signal out of the channel. We don't like to think about it that way. It's what the default approach is, is like trying to get more signal out of this noising channel. But the issue with agents is as a user, I want it to be mostly reliable. It's kind of like self-driving in that way. Like it's not as bad as self-driving, like in self-driving, you know, you're like hurtling at 70 miles an hour. It's like the hardest agent problem. But one thing we learned from Sorceress and one thing we learned by using these things internally is we actually have a pretty high bar for these agents to work. You know, it's actually really annoying if they only work 50% of the time and we can make interfaces to make it slightly less annoying. But yeah, there's a ceiling that we've encountered so far and we need to make the models better. We also need to make the kind of like interface to the user better. And also a lot of the like critiquing. I hope what we can do is help people who are building agents actually like be able to deploy them. I think, you know, that's the gap that we see a lot of today is everyone who's trying to build agents to get to the point where it's robust enough to be deployable. It just, it's like an unknown amount of time. Okay. [00:27:12]

Swyx: So this goes back into what Embu is going to offer as a product or a platform. How are you going to actually help people deploy those agents? Yeah. [00:27:21]

Kanjun: So our current hypothesis, I don't know if this is actually going to end up being the case. We've built a lot of tools for ourselves internally around like debugging, around abstractions or techniques after the model generation happens. Like after the language model generates the text and like interfaces for the user and the underlying model itself, like models talking to each other, maybe some set of those things kind of like an operating system. Some set of those things will be helpful for other people. And we'll figure out what set of those things is helpful for us to make our agents. Like what we want to do is get to a point where we can like start making an agent, deploy it, it's reliable, like very quickly. And there's a similar analog to software engineering, like in the early days, in the seventies and the sixties, like to program a computer, like you have to go all the way down to the registers and write things and eventually we had assembly. That was like an improvement. But then we wrote programming languages with these higher levels of abstraction and that allowed a lot more people to do this and much faster. And the software created is much less expensive. And I think it's basically a similar route here where we're like in the like bare metal phase of agent building. And we will eventually get to something with much nicer abstractions. [00:28:36]

Alessio: We had this conversation with George Hotz and we were like, there's not a lot of reasoning data out there. And can the models really understand? And his take was like, look, with enough compute, you're not that complicated as a human. Like the model can figure out eventually why certain decisions are made. What's been your experience? Like as you think about reasoning data, like do you have to do a lot of like manual work or like is there a way to prompt models to extract the reasoning from actions that they [00:29:03]

Swyx: see? [00:29:03]

Kanjun: So we don't think of it as, oh, throw enough data at it and then it will figure out what the plan should be. I think we're much more explicit. You know, a way to think about it is as humans, we've learned a lot of reasoning strategies over time. We are better at reasoning now than we were 3000 years ago. An example of a reasoning strategy is noticing you're confused. Then when I notice I'm confused, I should ask like, huh, what was the original claim that was made? What evidence is there for this claim? Does the evidence support the claim? Is the claim correct? This is like a reasoning strategy that was developed in like the 1600s, you know, with like the advent of science. So that's an example of a reasoning strategy. There are tons of them. We employ all the time, lots of heuristics that help us be better at reasoning. And we didn't always have them. And because they're invented, like we can generate data that's much more specific to them. So I think internally, yeah, we have a lot of thoughts on what reasoning is and we generate a lot more specific data. We're not just like, oh, it'll figure out reasoning from this black box or like it'll figure out reasoning from the data that exists. Yeah. [00:30:04]

Alessio: I mean, the scientific method is like a good example. If you think about hallucination, right, people are thinking, how do we use these models to do net new, like scientific research? And if you go back in time and the model is like, well, the earth revolves around the sun and people are like, man, this model is crap. It's like, what are you talking about? Like the sun revolves around the earth. It's like, how do you see the future? Like if the models are actually good enough, but we don't believe them, it's like, how do we make the two live together? So you're like, you use Inbu as a scientist to do a lot of your research and Inbu tells you, hey, I think this is like a serious path you should go down. And you're like, no, that sounds impossible. Like how is that trust going to be built? And like, what are some of the tools that maybe are going to be there to inspect it? [00:30:51]

Kanjun: Really there are two answers to this. One element of it is as a person, like I need to basically get information out of the model such that I can try to understand what's going on with the model. Then the second question is like, okay, how do you do that? And that's kind of some of our debugging tools, they're not necessarily just for debugging. They're also for like interfacing with and interacting with the model. So like if I go back in this reasoning trace and like change a bunch of things, what's going to happen? Like, what does it conclude instead? So that kind of helps me understand like, what are its assumptions? And, you know, we think of these things as tools. And so it's really about like, as a user, how do I use this tool effectively? I need to be willing to be convinced as well. It's like, how do I use this tool effectively? And what can it help me with? [00:31:36]

Swyx: And what can it tell me? There's a lot of mention of code in your process. And I was hoping to dive in even deeper. I think we might run the risk of giving people the impression that you view code or you use code just as like a tool within InView just for coding assistance. But I think you actually train code models. And I think there's a lot of informal understanding about how adding code to language models improves their reasoning capabilities. I wonder if there's any research or findings that you have to share that talks about the intersection of code and reasoning. Hmm. Yeah. [00:32:08]

Kanjun: So the way I think about it intuitively is like code is the most explicit example of reasoning data on the internet. [00:32:15]

Swyx: Yeah. [00:32:15]

Kanjun: And it's not only structured, it's actually very explicit, which is nice. You know, it says this variable means this, and then it uses this variable. And then the function does this. As people, when we talk in language, it takes a lot more to extract that explicit structure out of our language. And so that's one thing that's really nice about code is I see it as almost like a curriculum for reasoning. I think we use code in all sorts of ways. The coding agents are really helpful for us to understand what are the limitations of the agents. The code is really helpful for the reasoning itself. But also code is a way for models to act. So by generating code, it can act on my computer. And, you know, when we talk about rekindling the dream of the personal computer, kind of where I see computers going is, you know, like computers will eventually become these much more malleable things where I, as a user today, I have to know how to write software code, like in order to make my computer do exactly what I want it to do. But in the future, if the computer is able to generate its own code, then I can actually interface with it in natural language. And so one way we think about agents is kind of like a natural language programming language. It's a way to program my computer in natural language that's much more intuitive to me as a user. And these interfaces that we're building are essentially IDEs for users to program our computers in natural language. Maybe I should say what we're doing that way. Maybe it's clearer. [00:33:47]

Swyx: I don't know. [00:33:47]

Alessio: That's a good pitch. What do you think about the different approaches people have, kind of like text first, browser first, like multi-on? What do you think the best interface will be? Or like, what is your, you know, thinking today? [00:33:59]

Kanjun: In a lot of ways, like chat as an interface, I think Linus, Linus Lee, you had on this. I really like how he put it. Chat as an interface is skeuomorphic. So in the early days, when we made word processors on our computers, they had notepad lines because that's what we understood these like objects to be. Chat, like texting someone is something we understand. So texting our AI is something that we understand. But today's word documents don't have notepad lines. And similarly, the way we want to interact with agents, like chat is a very primitive way of interacting with agents. What we want is to be able to inspect their state and to be able to modify them and fork them and all of these other things. And we internally have, think about what are the right representations for that? Like architecturally, like what are the right representations? What kind of abstractions do we need to build? And how do we build abstractions that are not leaky? Because if the abstractions are leaky, which they are today, like, you know, this stochastic generation of text is like a leaky abstraction. I cannot depend on it. And that means it's actually really hard to build on top of. But our experience and belief is actually by building better abstractions and better tooling, we can actually make these things non-leaky. And now you can build like whole things on top of them. So these other interfaces, because of where we are, we don't think that much about them. [00:35:17]

Swyx: Yeah. [00:35:17]

Alessio: I mean, you mentioned, this is kind of like the Xerox Spark moment for AI. And we had a lot of stuff come out of Parc, like the, what you see is what you got editors and like MVC and all this stuff. But yeah, but then we didn't have the iPhone at Parc. We didn't have all these like higher things. What do you think it's reasonable to expect in like this era of AI, you know, call it like five years or so? Like what are like the things we'll build today and what are things that maybe we'll see in kind of like the second wave of products? [00:35:46]

Kanjun: That's interesting. I think the waves will be much faster than before. Like what we're seeing right now is basically like a continuous wave. Let me zoom a little bit earlier. So people like the Xerox Parc analogy I give, but I think there are many different analogies. Like one is the like analog to digital computer is kind of an example, like another analogy to where we are today. The analog computer Vannevar Bush built in the 1930s, I think, and it's like a system of pulleys and it can only calculate one function. Like it can calculate like an integral. And that was so magical at the time because you actually did need to calculate this integral bunch, but it had a bunch of issues like in analog errors compound. And so there was actually a set of breakthroughs necessary in order to get to the digital computer, like Turing's decidability, Shannon. I think the like whole like relay circuits can be thought of as can be mapped to Boolean operators and a set of other like theoretical breakthroughs, which essentially were abstractions. They were like creating abstractions for these like very like lossy circuits. They were creating abstractions for these like very analog circuits and digital had this nice property of like being error correcting. And so when I talk about like less leaky abstractions, that's what I mean. That's what I'm kind of pointing a little bit to. It's not going to look exactly the same way. And then the Xerox PARC piece, a lot of that is about like, how do we get to computers that as a person, I can actually use well. And the interface actually helps it unlock so much more power. So the sets of things we're working on, like the sets of abstractions and the interfaces, like hopefully that like help us unlock a lot more power in these systems. Like hopefully that'll come not too far in the future. I could see a next version, maybe a little bit farther out. It's like an agent protocol. So a way for different agents to talk to each other and call each other. Kind of like HTTP. [00:37:40]

Swyx: Do you know it exists already? [00:37:41]

Kanjun: Yeah, there is a nonprofit that's working on one. I think it's a bit early, but it's interesting to think about right now. Part of why I think it's early is because the issue with agents, it's not quite like the internet where you could like make a website and the website would appear. The issue with agents is that they don't work. And so it may be a bit early to figure out what the protocol is before we really understand how these agents get constructed. But, you know, I think that's, I think it's a really interesting question. [00:38:09]

Swyx: While we're talking on this agent to agent thing, there's been a bit of research recently on some of these approaches. I tend to just call them extremely complicated chain of thoughting, but any perspectives on kind of meta-GPT, I think it's the name of the paper. I don't know if you care about at the level of individual papers coming out, but I did read that recently and TLDR, it beat GPT-4 and human eval by role-playing software agent development agency, instead of having sort of single shot or single role, you have multiple roles and how having all of them criticize each other as agents communicating with other agents. [00:38:45]

Kanjun: Yeah, I think this is an example of an interesting abstraction of like, okay, can I just plop in this like multi-role critiquing and see how it improves my agent? And can I just plop in chain of thought, tree of thought, plop in these other things and see how they improve my agent? One issue with this kind of prompting is that it's still not very reliable. It's like, there's one lens, which is like, okay, if you do enough of these techniques, you'll get to high reliability. And I think actually that's a pretty reasonable lens. We take that lens often. And then there's another lens that's like, okay, but it's starting to get really messy what's in the prompt and like, how do we deal with that messiness? And so maybe you need like cleaner ways of thinking about and constructing these systems. And we also take that lens. So yeah, I think both are necessary. Yeah. [00:39:29]

Swyx: Side question, because I feel like this also brought up another question I had for you. I noticed that you work a lot with your own benchmarks, your own evaluations of what is valuable. I would say I would contrast your approach with OpenAI as OpenAI tends to just lean on, hey, we played StarCraft or hey, we ran it on the SAT or the, you know, the AP bio test and that did results. Basically, is benchmark culture ruining AI? [00:39:55]

Swyx: Or is that actually a good thing? Because everyone knows what an SAT is and that's fine. [00:40:04]

Kanjun: I think it's important to use both public and internal benchmarks. Part of why we build our own benchmarks is that there are not very many good benchmarks for agents, actually. And to evaluate these things, you actually need to think about it in a slightly different way. But we also do use a lot of public benchmarks for like, is the reasoning capability in this particular way improving? So yeah, it's good to use both. [00:40:26]

Swyx: So for example, the Voyager paper coming out of NVIDIA played Minecraft and set their own benchmarks on getting the Diamond X or whatever and exploring as much of the territory as possible. And I don't know how that's received. That's obviously fun and novel for the rest of the engineer, the people who are new to the scene. But for people like yourselves, you build Avalon just because you already found deficiencies with using Minecraft. Is that valuable as an approach? Oh, yeah. I love Voyager. [00:40:57]

Kanjun: I mean, Jim, I think is awesome. And I really like the Voyager paper and I think it has a lot of really interesting ideas, which is like the agent can create tools for itself and then use those tools. [00:41:06]

Swyx: He had the idea of the curriculum as well, which is something that we talked about earlier. Exactly. [00:41:09]

Kanjun: And that's like a lot of what we do. We built Avalon mostly because we couldn't use Minecraft very well to like learn the things we wanted. And so it's like not that much work to build our own. [00:41:19]

Swyx: It took us, I don't know. [00:41:22]

Kanjun: We had like eight engineers at the time, took about eight weeks. So six weeks. [00:41:27]

Swyx: And OpenAI built their own as well, right? Yeah, exactly. [00:41:30]

Kanjun: It's just nice to have control over our environment. But if you're doing our own sandbox to really trying to inspect our own research questions. But if you're doing something like experimenting with agents and trying to get them to do things like Minecraft is a really interesting environment. And so Voyager has a lot of really interesting ideas in it. [00:41:47]

Swyx: Yeah. Cool. One more element that we had on this list, which is context and memory. I think that's kind of like the foundational, quote unquote, RAM of our era. I think Andrej Karpathy has already made this comparison. So there's nothing new here. And that's just the amount of working knowledge that we can fit into one of these agents. And it's not a lot, right? Especially if you need to get them to do long running tasks. If they need to self-correct from errors that they observe while operating in their environment. Do you see this as a problem? Do you think we're going to just trend to infinite context and that'll go away? Or how do you think we're going to deal with it? [00:42:22]

Kanjun: I think when you talked about what's going to happen in the first wave and then in the second wave, I think what we'll see is we'll get like relatively simplistic agents pretty soon. And they will get more and more complex. And there's like a future wave in which they are able to do these like really difficult, really long running tasks. And the blocker to that future, one of the blockers is memory. And that was true of computers too. You know, I think when von Neumann made the von Neumann architecture, he was like, the biggest blocker will be like, we need this amount of memory, which is like, I don't remember exactly like 32 kilobytes or something to store programs. And that will allow us to write software. He didn't say it this way because he didn't have these terms, but that only really was like happened in the seventies with the microchip revolution. It may be the case that we're waiting for some research breakthroughs or some other breakthroughs in order for us to have like really good long running memory. And then in the meantime, agents will be able to do all sorts of things that are a little bit smaller than that. I do think with the pace of the field, we'll probably come up with all sorts of interesting things like, you know, RAG is already very helpful. [00:43:26]

Swyx: Good enough, you think? [00:43:27]

Kanjun: Maybe good enough for some things. [00:43:29]

Swyx: How is it not good enough? I don't know. [00:43:31]

Kanjun: I just think about a situation where you want something that's like an AI scientist. As a scientist, I have learned so much about my fields and a lot of that data is maybe hard to fine tune or on, or maybe hard to like put into pre-training. Like a lot of that data, I don't have a lot of like repeats of the data that I'm seeing. You know, like if I'm a scientist, I've like accumulated so many little data points. And ideally I'd want to store those somehow, or like use those to fine tune myself as a model somehow, or like have better memory somehow. I don't think RAG is enough for that kind of thing. But RAG is certainly enough for like user preferences and things like that. Like what should I do in this situation? What should I do in that situation? That's a lot of tasks. We don't have to be a scientist right away. Awesome. [00:44:21]

Swyx: I have a hard question, if you don't mind me being bold. Yeah. I think the most comparable lab to InView is Adept. You know, a research lab with like some amount of product situation on the horizon, but not just yet, right? Why should people work for InView over Adept? And we can cut this if it's too like... Yeah. [00:44:40]

Kanjun: The way I think about it is I believe in our approach. The type of thing that we're doing is we're trying to like build something that enables other people to build agents and build something that really can be maybe something like an operating system for agents. I know that that's what we're doing. I don't really know what everyone else is doing. You know, I can kind of like talk to people and have some sense of what they're doing. And I think it's a mistake to focus too much on what other people are doing, because extremely focused execution on the right thing is what matters. To the question of like, why us? I think like strong focus on reasoning, which we believe is the biggest blocker, on inspectability, which we believe is really important for user experience and also for the power and capability of these systems. Building non-leaky, good abstractions, which we believe is solving the core issue of agents, which is around reliability and being able to make them deployable. And then really seriously trying to use these things ourselves, like every single day, and getting to something that we can actually ship to other people that becomes something that is a platform. Like, it feels like it could be Mac or Windows. I love the dogfooding approach. [00:45:49]

Swyx: That's extremely important. And you will not be surprised how many agent companies I talk to that don't use their own agent. Oh no, that's not good. That's a big surprise. [00:45:59]

Kanjun: Yeah, I think if we didn't use our own agents, then we would have all of these beliefs about how good they are. Wait, did you have any other hard questions you wanted to ask? [00:46:08]

Swyx: Yeah, mine was just the only other follow-up that you had based on the answer you just gave was, do you see yourself releasing models or do you see yourself, what is the artifacts that you want to produce that lead up to the general operating system that you want to have people use, right? And so a lot of people just as a byproduct of their work, just to say like, hey, I'm still shipping, is like, here's a model along the way. Adept took, I don't know, three years, but they released Persimmon recently, right? Like, do you think that kind of approach is something on your horizon? Or do you think there's something else that you can release that can show people, here's kind of the idea, not the end products, but here's the byproducts of what we're doing? [00:46:51]

Kanjun: Yeah, I don't really believe in releasing things to show people like, oh, here's what we're doing that much. I think as a philosophy, we believe in releasing things that will be helpful to other people. [00:47:02]

Swyx: Yeah. [00:47:02]

Kanjun: And so I think we may release models or we may release tools that we think will help agent builders. Ideally, we would be able to do something like that, but I'm not sure exactly what they look like yet. [00:47:14]

Swyx: I think more companies should get into the releasing evals and benchmarks game. Yeah. [00:47:20]

Kanjun: Something that we have been talking to agent builders about is co-building evals. So we build a lot of our own evals and every agent builder tells me, basically evals are their biggest issue. And so, yeah, we're exploring right now. And if you are building agents, please reach out to me because I would love to, like, figure out how we can be helpful based on what we've seen. Cool. [00:47:40]

Swyx: That's a good call to action. I know a bunch of people that I can send your way. Cool. Great. [00:47:43]

Kanjun: Awesome. [00:47:44]

Swyx: Yeah. We can zoom out to other interests now. [00:47:46]

Alessio: We got a lot of stuff. So we have Sherif from Lexicon, the podcast. He had a lot of interesting questions on his website. You similarly have a lot of them. Yeah. [00:47:55]

Swyx: I need to do this. I'm very jealous of people with personal websites right there. Like, here's the high level questions of goals of humanity that I want to set people on. And I don't have that. [00:48:04]

Alessio: It's never too late, Sean. [00:48:05]

Swyx: Yeah. [00:48:05]

Alessio: It's never too late. [00:48:06]

Kanjun: Exactly. [00:48:07]

Alessio: There were a few that stuck out as related to your work that maybe you're kind of learning [00:48:12]

Swyx: more about it. [00:48:12]

Alessio: So one is why are curiosity and goal orientation often at odds? And from a human perspective, I get it. It's like, you know, would you want to like go explore things or kind of like focus on your career? How do you think about that from like an agent perspective? Where it's like, should you just stick to the task and try and solve it as in the guardrails as possible? Or like, should you look for alternative solutions? [00:48:34]

Swyx: Yeah. [00:48:34]

Kanjun: I think one thing that's really interesting about agents actually is that they can be forked. Like, you know, we can take an agent that's executed to a certain place and said, okay, here, like fork this and do a bunch of different things. I try a bunch of different things. Some of those agents can be goal oriented and some of them can be like more curiosity driven. You can prompt them in slightly different ways. And something I'm really curious about, like what would happen if in the future, you know, we were able to actually go down both paths. As a person, why I have this question on my website is I really find that like I really can only take one mode at a time and I don't understand why. And like, is it inherent in like the kind of context that needs to be held? That's why I think from an agent perspective, like forking it is really interesting. Like I can't fork myself to do both, but I maybe could fork an agent to like add a certain point in a task. [00:49:26]

Swyx: Yeah. Explore both. Yeah. [00:49:28]

Alessio: How has the thinking changed for you as the funding of the company changed? That's one thing that I think a lot of people in the space think is like, oh, should I raise venture capital? Like, how should I get money? How do you feel your options to be curious versus like goal oriented has changed as you raise more money and kind of like the company has grown? [00:49:50]

Kanjun: Oh, that's really funny. Actually, things have not changed that much. So we raised our Series A $20 million in late 2021. And our entire philosophy at that time was, and still kind of is, is like, how do we figure out the stepping stones, like collect stepping stones that eventually let us build agents, kind of these new computers that help us do bigger things. And there was a lot of curiosity in that. And there was a lot of goal orientation in that. Like the curiosity led us to build CARBS, for example, this hyperparameter optimizer. Great name, by the way. [00:50:28]

Swyx: Thank you. [00:50:29]

Kanjun: Is there a story behind that name? [00:50:30]

Swyx: Yeah. [00:50:31]

Kanjun: Abe loves CARBS. It's also cost aware. So as soon as he came up with cost aware, he was like, I need to figure out how to make this work. But the cost awareness of it was really important. So that curiosity led us to this really cool hyperparameter optimizer. That's actually a big part of how we do our research. It lets us experiment on smaller models. And for those experiment results to carry to larger ones. [00:50:56]

Swyx: Which you also published a scaling laws, which is great. I think the scaling laws paper from OpenAI was like the biggest. And from Google, I think, was the greatest public service to machine learning that any research lab can do. Yeah, totally. [00:51:10]

Kanjun: What was nice about CARBS is it gave us scaling laws for all sorts of hyperparameters. So yeah, that's cool. It basically hasn't changed very much. So there's some curiosity. And then there's some goal oriented parts. Like Avalon, it was like a six to eight week sprint for all of us. And we got this thing out. And then now different projects do like more curiosity or more goal orientation at different times. Cool. [00:51:36]

Swyx: Another one of your questions that we highlighted was, how can we enable artificial agents to permanently learn new abstractions and processes? I think this is might be called online learning. [00:51:45]

Kanjun: Yeah. So I struggle with this because, you know, that scientist example I gave. As a scientist, I've like permanently learned a lot of new things. And I've updated and created new abstractions and learned them pretty reliably. And you were talking about like, okay, we have this RAM that we can store learnings in. But how well does online learning actually work? And the answer right now seems to be like, as models get bigger, they fine tune faster. So they're more sample efficient as they get bigger. [00:52:15]

Swyx: Because they already had that knowledge in there. You're just kind of unlocking it. [00:52:23]

Kanjun: Partly maybe because they already have like some subset of the representation. Partly they just memorize things more, which is good. So maybe this question is going to be solved, but I still don't know what the answer is. [00:52:36]

Swyx: As I've had a platform that continually fine tunes for you as you work on that domain, which is something I'm working on. Well, that's great. We would love to use that. We'll talk more. Two more questions just about your general activities. I think you've just been very active in the San Francisco tech scene. You're a founding member of Software Commons. [00:52:56]

Kanjun: Oh yeah, that's true. [00:52:57]

Swyx: Tell me more. By the time I knew about SPC, it was already a very established thing. But what was it like in the early days? What was the story there? [00:53:05]

Kanjun: Yeah, the story is Ruchi, who started it, was the VP of operations at Dropbox. And I was the chief of staff and we worked together very closely. She's actually one of the investors in Sorceress. And SPC is an investor in Vue. And at that time, Ruchi was like, you know, I would like to start a space for people who are figuring out what's next. And we were figuring out what's next post-Ember, those three months. And she was like, do you want to just like hang out in this space? And we're like, sure. And it was a really good group. Wasim and Jeff from Pilot, the folks from Zulip, and a bunch of other people at that time. It was a really good group. We just hung out. There was no programming. It's much more official than it was at that time. [00:53:44]

Swyx: Yeah, now it's like a YC before YC type of thing. That's right, yeah. [00:53:48]

Kanjun: At that time, we literally, it was a bunch of friends hanging out in the space together. [00:53:51]

Swyx: And was this concurrent with the Archive? [00:53:53]

Kanjun: Oh yeah, actually, I think we started the Archive around the same time. [00:53:56]

Swyx: You're just like really big into community. But also like, so, you know, I run a Hacker House and I'm also part of hopefully what becomes like the next Software Commons or whatever. What are the principles in organizing communities like that with really exceptional people that go on to do great things? Do you have to be really picky about who joins? Like all your friends just magically turn out super successful like that. You know, it's not normal, right? Like this is very special. And a lot of people want to do that and fail. And you had the co-authors of GPT-3 in your house. That's true. [00:54:32]

Kanjun: And a lot of other really cool people that you'll eventually hear about. [00:54:35]

Swyx: Co-founders of Pilot and anyone else. I don't want you to pick your friends, but there's some magic special sauce in getting people together and in one workspace, living space, whatever, right? And that's part of why I'm here in San Francisco. And I would love for more people to learn about it and also maybe get inspired to build their own. [00:54:52]

Kanjun: Your question is really more about like, how do you actually build a community that where people in it are like eventually are awesome? [00:54:59]

Swyx: Okay. [00:55:00]

Kanjun: Which is different than like why live in a co-living house. So one adage we had when we started the archive was you become the average of the five people closest to you. [00:55:08]

Swyx: Yes. [00:55:08]

Kanjun: And I think that's roughly true. And good people draw good people. So there are really two things. One, we were quite picky and it mattered a lot to us. Is this someone where if they're hanging out in the living room, we'd be really excited to come hang out. Yeah. Two is I think we did a really good job of creating a high growth environment and an environment where people felt really safe. We actually apply these things to our team and it works remarkably well as well. So I do a lot of basically how do I create safe spaces for people where it's not just like safe law, but like it's like a safe space where people really feel inspired by each other. And I think at the archive, we really made each other better. My friend, Michael Nielsen called it a self-actualization machine. [00:55:52]

Swyx: My goodness. Okay. [00:55:54]

Kanjun: And I think, yeah, people came in. Was he a part of the archive? He was not, but he hung out a lot. Honorary member. Friend of the archive. [00:56:02]

Swyx: Yeah. [00:56:02]

Kanjun: The culture was that we learned a lot of things from each other about like how to make better life systems and how to think about ourselves and psychological debugging. And a lot of us were founders. So having other founders going through similar things was really helpful. And a lot of us worked in AI. And so having other people to talk about AI with was really helpful. And so I think all of those things led to a form of idea flux and also kind of like, so I think a lot about like the idea flux and default habits or default impulses. It led to a set of idea flux and default impulses that led to some really interesting things and led to us doing much bigger things, I think, than we otherwise would have decided to do because it felt like taking risks was less risky. So that's something we do a lot of on the team. It's like, how do we make it so that taking risks is less risky? And there's a term called senious. [00:56:57]

Swyx: Yes. I was thinking Kevin Kelly. Kevin Kelly, senious. I was going to feed you that word, but I didn't want to like bias you. Yes. [00:57:02]

Kanjun: I think maybe like a lot of what I'm interested in is constructing a kind of senious. And the archive was definitely a senious in a particular, or like getting toward a senious in a particular way. And Jason Ben, my archive housemate and who now runs the neighborhood, [00:57:17]

Swyx: has a good way of putting it. [00:57:17]

Kanjun: If genius is from your genes, senious is from your scene. Yeah, I think like maybe a lot of the community building impulse is from this like interest in what kind of idea flux can be created. You know, there's a question of like, why did Xerox PARC come out with all of this interesting stuff? It's their senious. Why did Bell Labs come out with all this interesting stuff? Maybe it's their senious. Why didn't the transistor come out of Princeton? And the other people working on it at the time. [00:57:44]

Swyx: I just think it's remarkable how you hear a lot about Alan Kay. And I just read a bit. And apparently Alan Kay was like the most junior guy at Xerox PARC. Yeah, definitely. [00:57:53]

Kanjun: He's just the one who talks about it. He talks the most. [00:57:57]

Swyx: Yeah, exactly. Yeah. So I, you know, hopefully I'm also working towards contributing that senious. I called mine the more provocative name of the arena. Interesting. That's quite provocative. In the arena. [00:58:08]

Kanjun: So are you fighting other people in the arena? [00:58:11]

Swyx: No. You never know. [00:58:12]

Alessio: On any day in the mission, it's an adventure. [00:58:15]

Swyx: We're in the arena trying stuff, as they say. You are also a GP at Outset Capital, where you also co-organize the Thursday Nights in AI, where hopefully someday I'll eventually speak. You're on the roster. [00:58:28]

Kanjun: I'm on the roster. [00:58:29]

Swyx: Thank you so much. So why spend time being a VC and organizing all these events? You're also a very busy CEO and, you know, why spend time with that? Why is that an important part of your life? [00:58:39]

Kanjun: Yeah, for me personally, I really like helping founders. So Allie, my investing partner, is fortunately amazing and she does everything for the fund. So she like hosts the Thursday night events and she finds folks who we could invest in. And she does basically everything. Josh and I are her co-partners. So Allie was our former chief of staff at Sorceress. We just thought she was amazing. She wanted to be an investor. And Josh and I also like care about helping founders and kind of like giving back to the community. What we didn't realize at the time when we started the fund is that it would actually be incredibly helpful for Imbue. So talking to AI founders who are building agents and working on, you know, similar things is really helpful. They could potentially be our customers and they're trying out all sorts of interesting things. And I think being an investor, looking at the space from the other side of the table, it's just a different hat that I routinely put on. And it's helpful to see the space from the investor lens as opposed to from the founder lens. So I find that kind of like hat switching valuable. It maybe would lead us to do slightly different things. [00:59:44]

Swyx: Awesome. Appreciate that. [00:59:46]

Alessio: Yeah, you've been really generous with your time. Let's just wrap with the lightning round. Okay. So we have two questions, acceleration, exploration, and then a takeaway. So the acceleration question is, what's something that already happened in AI that you thought would take much longer to be here? [01:00:03]

Kanjun: I think the rate at which we discover new capabilities of existing models and kind of like build hacks on top of them to make them work better is something that has been surprising and awesome. And the research community building on its own ideas, that's probably, you want something very specific. Yeah, I think the rate at which we discovered capabilities probably. [01:00:22]

Swyx: Cool. Exploration slash requests for startups. If you weren't building Imbue, what AI company would you build? Hmm. Every founder has like their like number two. Really? Yeah, I don't know. [01:00:33]

Kanjun: Wow. I cannot imagine building any other thing than Imbue. [01:00:37]

Swyx: Wow. Well, that's a great answer too. [01:00:38]

Kanjun: It's like obviously the thing to build. [01:00:42]

Swyx: Okay. [01:00:42]

Kanjun: It's like obviously work on the fundamental platform. Yeah. [01:00:46]

Swyx: So that was my attempt at innovating this question, but the previous one was, but what was the most interesting unsolved question in AI? [01:00:53]

Kanjun: My answer is kind of boring, but the most interesting unsolved questions are these questions of, how do we make these stochastic systems into things that we can like reliably use and build on top of? [01:01:04]

Swyx: Yep. [01:01:05]

Alessio: And yeah, take away what's one message you want everyone to remember? [01:01:09]

Kanjun: Maybe two things. One is just the like, we're in a historic moment. I didn't think in my lifetime I would necessarily be in, like able to work on the things I'm excited to work on in this moment, but we're in a historic moment that where we'll look back and be like, oh my God, the future was invented in these years. And I think like, there may be a set of messages to take away from that. One is like, AI is a tool like any technology. And you know, when it comes to things like, what might the future look like? Like we like to think about it as, it's like just a better computer. It's like much more powerful computer that gives us a lot of free intellectual energy that we can now like solve so many problems with. You know, there are so many problems in the world [01:01:53]

Swyx: where we're like, [01:01:53]

Kanjun: oh, it's not worth a person thinking about that. And so things get worse and things get worse. No one wants to work on maintenance. And like this technology gives us the potential to actually be able to like allocate intellectual energy to all of those problems. And the world could be much better, like could be much more thoughtful because of that. I'm so excited about that. And there are definitely risks and dangers. And we actually do a fair, something I didn't talk about is we do a fair amount of work on the policy side. On the safety side, like we think about safety and policy in terms of engineering theory and also regulation. And kind of comparing to like the automobile or the airplane or any new technology, there's like a set of new possible capabilities and a set of new possible dangers that are unlocked with every new technology. And so on the engineering side, like we think a lot about engineering safety, like how do we actually engineer these systems so that they are inspectable and why we reason in natural language so that the systems are very inspectable so that we can like stop things if anything weird is happening. That's why we don't think end-to-end black boxes [01:02:58]

Swyx: are a good idea. [01:02:58]

Kanjun: On the theoretical side, we like really believe in like deeply understanding, like when we actually fine tune on individual examples, like what's going on, when we're pre-training, what's going on, like debugging tools for these agents to understand like what's going on. And then on the regulation side, I think there's actually a lot of regulation that already covers many of the dangers like that people are talking about. And there are areas where there's not much regulation. And so we focus on those areas where there's not much regulation. So some of our work is actually, we built an agent that helped us analyze the 20,000 pages of policy proposals submitted to the Department of Commerce request for AI policy proposals. We looked at what were the problems people brought up and what were the solutions they presented and then like did a summary analysis and kind of like, you know, build agents to do that. And now the Department of Commerce is like interested in using that as a tool to like analyze proposals. And so a lot of what we're trying to do on the regulation side is like actually figure out where is there regulation missing and how do we actually in a very targeted way try to solve those missing areas. So I guess if I were to say like, what are the takeaways? It's like the takeaway is like the future could be really exciting if we can actually get agents that are able to do these bigger things. Reasoning is the biggest blocker plus like these sets of abstractions to make things more robust and reliable. And there are, you know, things where we have to be quite careful and thoughtful about how do we deploy these and what kind of regulation should go along with it so that this is actually a technology that where we, when we deploy it, it is protective to people and not harmful. [01:04:36]

Swyx: Awesome, wonderful. [01:04:38]

Alessio: Thank you so much for your time, Kanjun. [01:04:40]

Kanjun: Thank you. [01:04:41]

Swyx: Thank you. [01:04:48]

Even the relatively taciturn Adept recently released Persimmon-8B, and shared their design thinking at the AI Engineer Summit.

As Emad Mostaque points out, 10k H100s would usually cost $300m or $175m per year, but Kanjun replies that they got an “unusually good deal”.

Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue

From RL to Reasoning LLMs

The Full Stack Model Lab

Timestamps

Show Notes

Transcript

Discussion about this episode

Ready for more?