Amazon is betting on agents to win the AI race

Hello, and welcome to Decoder! This is Alex Heath, your Thursday episode guest host and deputy editor at The Verge. One of the biggest topics in AI these days is agents — the idea that AI is going to move from chatbots to reliably completing tasks for us in the real world. But the problem with agents is that they really aren’t all that reliable right now. 

There’s a lot of work happening in the AI industry to try to fix that, and that brings me to my guest today: David Luan, the head of Amazon’s AGI research lab. I’ve been wanting to chat with David for a long time. He was an early research leader at OpenAI, where he helped drive the development of GPT-2, GPT-3, and DALL-E. After OpenAI, he cofounded Adept, an AI research lab focused on agents. And last summer, he left Adept to join Amazon, where he now leads the company’s AGI lab in San Francisco. 

We recorded this episode right after the release of OpenAI’s GPT-5, which gave us an opportunity to talk about why he thinks progress on AI models has slowed. The work that David’s team is doing is a big priority for Amazon, and this is the first time I’ve heard him really lay out what he’s been up to.

I also had to ask him about how he joined Amazon. David’s decision to leave Adept was one of the first of many deals I call reverse acquihires, in which a Big Tech company all but buys a buzzy AI startup while sidestepping antitrust scrutiny. I don’t want to spoil too much, but let’s just say that David left the startup world for Big Tech last year because he says he knew where the AI race was headed. I think that makes his predictions for what’s coming next worth listening to.

This interview has been lightly edited for length and clarity. 

David, welcome to the show.

Thanks so much for having me on. I’m really excited to be here.

It’s great to have you. We have a lot to talk about. I’m super interested in what you and your team are up to at Amazon these days. But first, I think the audience could really benefit from hearing a little bit about you and your history, and how you got to Amazon, because you’ve been in the AI space for a long time, and you’ve had a pretty interesting career leading up to this. Could you walk us through a little bit of your background in AI and how you ended up at Amazon?

First off, I find it absolutely hilarious that anyone would say I’ve been around the field for a long time. It’s true in relative terms, because this field is so new, and yet, nonetheless, I’ve only been doing AI stuff for about the last 15 years. So compared with many other fields, it’s not that long. 

Well, 15 years is an eternity in AI years.

It is an eternity in AI years. I remember when I first started working in the field. I worked on AI just because I thought it was interesting. I thought having the opportunity to build systems that could think like humans, and, ideally, deliver superhuman performance, was such a cool thing to do. I had no idea that it was going to blow up the way that it did. 

But my personal background, let’s see. I led the research and engineering teams at OpenAI from 2017 to mid-2020, where we did GPT-2 and GPT-3, as well as CLIP and DALL-E. Every day was just so much fun, because you would show up to work and it was just your best friends and you’re all trying a bunch of really interesting research ideas, and there was none of the pressure that exists right now.

Then, after that, I led the LLM effort at Google, where we trained a model called PaLM, which was quite a strong model for its time. But shortly after that, a bunch of us decamped to various startups, and my team and I ended up launching Adept. It was the first AI agent startup. We effectively invented the computer-use agent; some good research had been done beforehand, but we had the first production-ready agent. Amazon brought us in to run agents for it about a year ago.

Great, and we’ll get into that and what you’re doing at Amazon. But first, given your OpenAI experience, we’re now talking less than a week from the release of GPT-5. I’d love to hear you reflect on that model, what GPT-5 says about the industry, and what you thought when you saw it. I’m sure you still have colleagues at OpenAI who worked on it. But what does that release signify? 

I think it really signifies a high level of maturity at this point. The labs have all figured out how to reliably tape out increasingly better models. One of the things that I always harp on is that your job, as a frontier-model lab, is not to train models. Your job as a frontier-model lab is to build a factory that repeatedly churns out increasingly better models, and that’s actually a very different philosophy for how to make progress. In the I-build-a-better-model path, all you do is think about, “Let me make this tweak. Let me make this tweak. Let me try to glom onto people to get a better release.”

If you care about it from the perspective of a model factory, what you’re actually doing is trying to figure out how you can build all the systems and processes and infrastructure to make these things smarter. But with the GPT-5 release, I think what I find most interesting is that a lot of the frontier models these days are converging in capabilities. I think, in part, there’s an explanation that one of my old colleagues at OpenAI, Phillip Isola, who’s now a professor at MIT, came up with called the Platonic representation hypothesis. Have you heard of this hypothesis?

No.

So the Platonic representation hypothesis is this idea, similar to Plato’s cave allegory, which is really what it’s named after, that there is one reality. But we, as humans, see only a particular rendering of that reality, like the shadows on the wall in Plato’s cave. It’s the same for LLMs, which “see” slices of this reality through the training data they’re fed.

So every incremental YouTube video of, for example, someone going for a nature walk in the woods, is all ultimately generated by the actual reality that we live in. As you train these LLMs on more and more and more data, and the LLMs become smarter and smarter, they all converge to represent this one shared reality that we all have. So, if you believe this hypothesis, what you should also believe is that all LLMs will converge to the same model of the world. I think that’s actually happening in practice from seeing frontier labs deliver these models.

Well, there’s a lot to that. I would maybe suggest that a lot of people in the industry don’t necessarily believe we live in one reality. When I was at the last Google I/O developer conference, cofounder Sergey Brin and Google DeepMind chief Demis Hassabis were onstage, and they both seemed to believe that we were existing in multiple realities. So I don’t know if that’s a thing that you’ve encountered in your social circles or work circles over the years, but not everyone in AI necessarily believes that, right?

[Laughs] I think that hot take is above my pay grade. I do think that we only have one.

Yeah, we have too much to cover. We can’t get into multiple realities. But to your point about everything converging, it does feel as if benchmarks are starting to not matter as much anymore, and that the actual improvements in the models, like you said, are commodifying. Everyone’s getting to the same point, and GPT-5 will be the best on LMArena for a few months until Gemini 3.0 comes out, or whatever, and so on and so on. 

If that’s the case, I think what this release has also shown is that maybe what is really starting to matter is how people actually use these things, and the feelings and the attachments that they have toward them. Like how OpenAI decided to bring back its 4o model because people had a real emotional attachment to it. People on Reddit have been saying, “It’s like my best friend’s been taken away.” 

So it really doesn’t matter that it’s better at coding or that it’s better at writing; it’s your friend now. That’s freaky. But I’m curious. When you saw that and you saw the reaction to GPT-5, did you predict that? Did you see that we were moving that way, or is this something new for everyone?

There was a project called LaMDA or Meena at Google in 2020 that was basically ChatGPT before ChatGPT, but it was available only to Google employees. Even back then, we started seeing employees developing personal attachments to these AI systems. Humans are so good at anthropomorphizing anything. So I wasn’t surprised to see that people formed bonds with certain model checkpoints. 

But I think that when you talk about benchmarking, the thing that stands out to me is what benchmarking is really all about, which at this point is just people studying for the exam. We know what the benchmarks are in advance. Everybody wants to post higher numbers. It’s like the megapixel wars from the early digital camera era. Megapixel counts just clearly don’t matter anymore; they have a very loose correlation with how good a photo the camera actually takes.

I think the question, and the lack of creativity in the field that I’m seeing, boils down to the fact that AGI is way more than just chat. It’s way more than just code. Those just happen to be the first two use cases that we all know work really well for these models. There’s so many more useful applications and base model capabilities that people haven’t even started figuring out how to measure well yet. 

I think the better questions to ask now if you want to do something interesting in the field are: What should I actually run at? Why am I trying to spend more time making this thing slightly better at creative writing? Why am I trying to spend my time trying to make this model X percent better at the International Math Olympiad when there’s so much more left to do? When I think about what keeps me and the people who are really focused on this vision for agents going, it’s looking to solve a much greater breadth of problems than what people have worked out so far.

That brings me to this topic. I was going to ask about it later. But you’re running the AGI research lab at Amazon. I have a lot of questions about what AGI means to Amazon, specifically, but I’m curious first for you, what did AGI mean to you when you were at OpenAI helping to get GPT off the ground, and what does it mean to you now? Has that definition changed at all for you?

Well, the OpenAI definition for AGI we had was a system that could outperform humans at economically valuable tasks. While I think that was an interesting, almost doomer North Star back in 2018, I think we have gone so much beyond that as a field. What gets me excited every day is not how do I replace humans at economically valuable tasks, but how do I ultimately build toward a universal teammate for every knowledge worker. 

What keeps me going is the sheer amount of leverage we could give to humans on their time if we had AI systems to which you could ultimately delegate a large chunk of the execution of what you do every day. So my definition for AGI, which I think is very tractable and very much focused on helping people — as the first most important milestone that would lead me to say we’re basically there — is a model that could help a human do anything they want to do on a computer.

I like that. That’s actually more concrete and grounded than a lot of the stuff I’ve heard. It also shows how different everyone feels about what AGI means. I was just on a press call with Sam Altman for the GPT-5 launch, and he was saying he now thinks of AGI as a model that can improve itself. Maybe that’s related to what you’re saying, but it sounds as if you’re grounding it more in the actual use case.

Well, the way that I look at it is that self-improvement is interesting, but to what end, right? Why do we, as humans, care if the AGI is improving itself? I don’t really care, personally. I think it’s cool from a scientist’s perspective. I think what’s more interesting is how do I build the most useful form of this super generalist technology, and then be able to put it in everybody’s hands? And I think the thing that gives people tremendous leverage is if I can teach this agent that we’re training to handle any useful task that I need to get done on my computer, because so much of our life these days is in the digital world.

So I think it’s very tractable. Going back to our discussion about benchmarking: the field cares so much about MMLU, MMLU-Pro, Humanity’s Last Exam, AMC 12, et cetera, but we don’t have to live in that box of “that’s what AGI does for me.” I think it’s way more interesting to look at the box of all useful knowledge-worker tasks. How many of them are doable on your machine? How can these agents do them for you?

So it’s safe to say that for Amazon, AGI means more than shopping for me, which is the cynical joke I was going to make about what AGI means for Amazon. I’d be curious to go back to when you joined Amazon, and you were talking to the management team and Andy Jassy, and how still to this day you guys talk about the strategic value of AGI as you define it for Amazon, broadly. Amazon is a lot of things. It’s really a constellation of companies that do a lot of different things, but this idea kind of cuts across all of that, right?

I think that if you look at it from the perspective of computing, so far the building blocks of computing have been: Can I rent a server somewhere in the cloud? Can I rent some storage? Can I write some code to go hook all these things up and deliver something useful to a person? The building blocks of computing are changing. At this point, the code’s written by an AI. Down the line, the actual intelligence and decision-making are going to be done by an AI.

So, then what happens to your building blocks? In that world, it’s super important for Amazon to be good specifically at solving the agent problem, because agents are going to be the atomic building blocks of computing. And when that is true, I think so much economic value will be unlocked as a result, and it really lines up well with the strengths that Amazon already has on the cloud side, and putting together ridiculous amounts of infrastructure and all that.

I see what you’re saying. I think a lot of people listening to this, even people who work in tech, understand conceptually that agents are where the industry’s headed. But I would venture to guess that the vast majority of the listeners to this conversation have either never used an agent or have tried one and it didn’t work. I would pretty much say that’s the lay of the land right now. What would you hold out as the best example of an agent, the best example of where things are headed and what we can expect? Is there something you can point to?

So I feel for all the people who have been told over and over again that agents are the future, and then they go try the thing, and it just doesn’t work at all. So let me try to give an example of what the actual promise of agents is relative to how they’re pitched to us today.

Right now, the way that they’re pitched to us is, for the most part, as just a chatbot with extra steps, right? It’s like, Company X doesn’t want to put a human customer service rep in front of me, so now I have to go talk to a chatbot. Maybe behind the scenes it clicks a button. Or you’ve played with a product that does computer use and is supposed to help you with something in your browser, but in reality it takes four times as long, and one out of three times it screws up. This is kind of the current landscape of agents.

Let’s take a concrete example: I want to do a particular drug discovery task where I know there’s a receptor, and I need to be able to find something that ends up binding to this receptor. If you pull up ChatGPT today and you talk to it about this problem, it’s going to go and find all the scientific research and write you a perfectly formatted piece of markdown of what the receptor does, and maybe some things you want to try. 

But that’s not an agent. An agent, in my book, is a model and a system that you can literally hook up to your wet lab, and it’s going to go and use every piece of scientific machinery you have in that lab, read all the literature, propose the right optimal next experiment, run that experiment, see the results, react to that, try again, et cetera, until it’s actually achieved the goal for you. The degree to which that gives you leverage is so, so, so much higher than what the field is currently able to do right now.
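
To make that loop concrete, here is a minimal sketch in Python of the propose-run-observe cycle he’s describing. Everything in it is a hypothetical toy stand-in (the receptor label, the compounds, the affinity threshold), not any real lab system, but it shows the shape of an agent that acts on results and iterates rather than just drafting a report.

```python
# Hedged sketch of a closed-loop discovery agent. All names and values are toy
# stand-ins; the point is the loop: propose -> run -> observe -> check goal -> repeat.

def run_discovery_agent(receptor, propose, run_experiment, binds, max_rounds=20):
    history = []                                  # everything observed so far
    for _ in range(max_rounds):
        candidate = propose(receptor, history)    # model suggests the next compound
        result = run_experiment(candidate)        # drives the (simulated) wet lab
        history.append((candidate, result))
        if binds(result):                         # goal check against real feedback
            return candidate, history
    return None, history

# Toy stand-ins so the sketch runs end to end: pretend compound "C7" is the binder.
propose = lambda receptor, history: f"C{len(history) + 1}"
run_experiment = lambda compound: {"affinity": 0.9 if compound == "C7" else 0.1}
binds = lambda result: result["affinity"] > 0.5

best, trace = run_discovery_agent("example-receptor", propose, run_experiment, binds)
print(best, "found after", len(trace), "experiments")   # C7 found after 7 experiments
```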

Do you agree, though, that there’s an inherent limitation in large language models and decision-making and executing things? When I see how LLMs, even still the frontier ones, still hallucinate, make things up, and confidently lie, it’s terrifying to think of putting that technology in a construct where now I’m asking it to go do something in the real world, like interact with my bank account, ship code, or work in a science lab. 

When ChatGPT can’t spell right, that doesn’t feel like the future we’re going to get. So, I’m wondering, are LLMs it, or is there more to be done here?

So we started with a topic of how these models are increasingly converging in capability. While that’s true for LLMs, I don’t think that’s been true, to date, for agents, because the way that you should train an agent and the way that you train an LLM are quite different. With LLMs, as we all know, the bulk of their training happens from doing next-token prediction. I’ve got a giant corpus of every article on the internet, let me try to predict the next word. If I get the next word right, then I get a positive reward, and if I get it wrong, then I’m penalized. But, in reality, what’s actually happening is what we in the field call behavioral cloning or imitation learning. It’s the same thing as cargo culting, right?

The LLM never learns why the next word is the right answer. All it learns is that when I see something that is similar to the previous set of words, I should go say this particular next word. So the issue with this is that this is great for chat. This is great for creative-use cases where you want some of the chaos and randomness from hallucinations. But if you want it to be an actual successful decision-making agent, these models need to learn the true causal mechanism. It’s not just cloning human behavior; it’s actually learning if I do X, the consequence of it is Y. So the question is, how do we train agents so that they can learn the consequences of their actions? The answer, obviously, cannot be just doing more behavioral cloning and copying text. It has to be something that looks like actual trial and error in the real world.
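
For a rough sense of the distinction he’s drawing, here is a toy sketch (not anyone’s actual training code) of the two signals: one scores a model on guessing the next word in text, the other scores it on whether its actions actually achieved a goal when executed in an environment.

```python
import random

def imitation_signal(predict_next, tokens):
    """Behavioral cloning: the reward is just 'did I guess the next word?'
    The model never observes any consequence beyond the text itself."""
    score = 0
    for i in range(len(tokens) - 1):
        guess = predict_next(tokens[: i + 1])
        score += 1 if guess == tokens[i + 1] else 0
    return score

def outcome_signal(execute_in_env, actions):
    """Agent-style reward: run the actions, then check what actually happened."""
    final_state = execute_in_env(actions)                 # trial...
    return 1.0 if final_state.get("goal_met") else 0.0    # ...and error

# Toy stand-ins so the sketch runs end to end.
predict_next = lambda context: random.choice(["the", "cat", "sat", "on", "mat"])
execute_in_env = lambda actions: {"goal_met": actions[-1] == "submit_form"}

print(imitation_signal(predict_next, "the cat sat on the mat".split()))
print(outcome_signal(execute_in_env, ["open_page", "fill_fields", "submit_form"]))
```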

That’s basically the research roadmap for what we’re doing in my group at Amazon. My friend Andrej Karpathy has a really good analogy here, which is imagine you have to train an agent to go play tennis. You wouldn’t have it spend 99 percent of its time watching YouTube videos of tennis, and then 1 percent of its time actually playing tennis. You would have something that’s far more balanced between these two activities. So what we’re doing in our lab here at Amazon is large-scale self-play. If you remember, the concept of self-play was the technique that DeepMind really made popular in the mid-2010s, when it beat humans at playing Go.

So for playing Go, what DeepMind did was spin up a bajillion simulated Go environments, and then it had the model play itself over and over and over again. Every time it found a strategy that was better at beating a previous version of itself, it would effectively get a positive reward via reinforcement learning to go do more of that strategy in the future. If you spent a lot of compute on this in the Go simulator, it actually discovered superhuman strategies for how to play Go. Then when it played the world champion, it made moves that no human had ever seen before and contributed to the state of the art of that whole field. 
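
As a cartoon of the self-play mechanic only (vastly simpler than anything DeepMind ran for Go, and using a made-up toy game), here is the basic loop: the current policy plays a frozen earlier copy of itself, and moves that beat the old self get reinforced.

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def sample(policy):
    return random.choices(MOVES, weights=[policy[m] for m in MOVES])[0]

policy = {m: 1.0 for m in MOVES}              # learnable weights over moves
for generation in range(20):
    frozen = dict(policy)                     # a frozen earlier version of itself
    for _ in range(500):
        mine, theirs = sample(policy), sample(frozen)
        if BEATS[mine] == theirs:             # beat the old self: reinforce the move
            policy[mine] += 0.02
        elif BEATS[theirs] == mine:           # lost to the old self: dampen the move
            policy[mine] = max(0.01, policy[mine] - 0.02)

print({m: round(w, 2) for m, w in policy.items()})
```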

What we’re doing is, rather than doing more behavioral cloning or watching YouTube videos, we’re creating a giant set of RL [reinforcement learning] gyms, and each one of these gyms, for example, is an environment that a knowledge worker might be working in to get something useful done. So here’s a version of something that’s like Salesforce. Here’s a version of something that’s like an enterprise resource planning system. Here’s a computer-aided design program. Here’s an electronic medical record system. Here’s accounting software. Here is every interesting domain of possible knowledge work as a simulator. 

Now, instead of training an LLM just to do tech stuff, we have the model actually propose a goal in every single one of these different simulators as it tries to solve that problem and figure out if it’s successfully solved or not. It then gets rewarded and receives feedback based on, “Oh, did I do the depreciation correctly?” Or, “Did I correctly make this part in CAD?” Or, “Did I successfully book the flight?” to choose a consumer analogy. Every time it does this, it actually learns the consequences of its actions, and we believe that this is one of the big missing pieces left for actual AGI, and we’re really scaling up this recipe at Amazon right now.
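
Here is a minimal sketch of what one of those gyms could look like in code. All of the names here are hypothetical (this is not the Nova training stack), but it captures the loop he describes: the agent acts inside a simulated app over multiple steps, and a verifier, not an imitation target, decides the reward.

```python
from dataclasses import dataclass, field

@dataclass
class InvoiceGym:
    """Toy stand-in for one simulated knowledge-work app, e.g. accounting software."""
    ledger: dict = field(default_factory=dict)

    def reset(self, goal_amount):
        self.ledger = {"booked": 0.0}
        return {"goal": f"book an invoice for {goal_amount}", "screen": dict(self.ledger)}

    def step(self, action):
        if action.get("type") == "book_invoice":
            self.ledger["booked"] += action.get("amount", 0.0)
        return {"screen": dict(self.ledger)}

    def verify(self, goal_amount):
        # The reward comes from checking the outcome, not from matching reference text.
        return 1.0 if abs(self.ledger["booked"] - goal_amount) < 0.01 else 0.0

def placeholder_policy(observation, goal_amount=120.0):
    """Stand-in for the model proposing the next action from what's on screen."""
    if observation["screen"]["booked"] < goal_amount:
        return {"type": "book_invoice", "amount": goal_amount}
    return {"type": "noop"}

env = InvoiceGym()
obs = env.reset(goal_amount=120.0)
for _ in range(3):                                       # a short multistep episode
    obs = env.step(placeholder_policy(obs))
print("episode reward:", env.verify(goal_amount=120.0))  # 1.0 feeds the policy update
```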

How unique is this approach in the industry right now? Do you think the other labs are onto this as well? If you’re talking about it, I would assume so.

I think what’s interesting is this: ultimately, you have to be able to do something like this, in my opinion, to get beyond the fact that there’s a limited amount of free-floating data on the internet that you can train your models on. The thing about what we’re doing at Amazon is, because this came from what we did at Adept and Adept has been doing agents for so long, we just care about this problem way more than everybody else, and I think we’ve made a lot of progress toward this goal.

You called these gyms, and I was thinking physical gyms, for a second. Does this become physical gyms? You have a background in robotics, right?

That’s a good question. I’ve also done robotics work before. Here we also have Pieter Abbeel, who came from Covariant and is a Berkeley professor whose students ended up creating the majority of the RL algorithms that work well today. It’s funny that you say gyms, because we were trying to find an internal code name for the effort. We kicked around Equinox and Barry’s Bootcamp and all this stuff. I’m not sure everybody had the same sense of humor, but we call them gyms because at OpenAI we had a very useful early project called OpenAI Gym.

This was before LLMs were a thing. OpenAI Gym was a collection of video game and robotics tasks. For example, can you balance a pole that’s on a cart and can you train an RL algorithm that can keep that thing perfectly centered, et cetera. What we were inspired to ask was, now that these models are smart enough, why have toy tasks like that? Why not put the actual useful tasks that humans do on their computers into these gyms and have the models learn from these environments? I don’t see why this wouldn’t also generalize to robotics.
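
For reference, the cart-pole task he mentions is still a few lines to spin up today via Gymnasium, the maintained successor to OpenAI Gym (assuming the gymnasium package is installed). Below is just the bare observe-act-reward loop, with a random policy standing in for a trained one.

```python
import gymnasium as gym   # pip install gymnasium

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(500):
    action = env.action_space.sample()      # random policy: push the cart left or right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                  # +1 for every step the pole stays upright
    if terminated or truncated:
        break

print("episode return:", total_reward)
env.close()
```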

Is the end state of this an agent framework or system that gets deployed through AWS?

The end state of all this is a model plus a system that is rock-solid reliable, like 99 percent reliable, at all sorts of valuable knowledge-work tasks that are done on a computer. And this is going to be something that we think will be a service on AWS that’s going to underpin, effectively, so many useful applications in the future.

I did a recent Decoder episode with Aravind Srinivas, the CEO of Perplexity, about his Comet Browser. A lot of people on the consumer side think that the browser interface is actually going to be the way to get to agents, at scale, on the consumer side. 

I’m curious what you think of that. This idea that it’s not enough to just have a chatbot, you really need to have ChatGPT, or whatever model, sit next to your browser, look at the web page, act on it for you, and learn from that. Is that where all this is headed on the consumer side?

I think chatbots are definitely not the long-term answer, or at least not chatbots in the way we think about them today if you want to build systems that take actions for you. The best analogy I have for this is this: my dad is a very well-intentioned, smart guy, who spent a lot of his career working in a factory. He calls me all the time for tech support help. He says, “David, something’s wrong with my iPad. You got to help me with this.” We’re just doing this over the phone, and I can’t see what’s on the screen for him. So, I’m trying to figure out, “Oh, do you have the settings menu open? Have you clicked on this thing yet? What’s going on with this toggle?” Chat is such a low-bandwidth interface. That is the chat experience for trying to get actions done, even with a very competent human on the other side trying to handle things for you.

So one of the big missing pieces, in my opinion, right now in AI, is our lack of creativity with product form factors, frankly. We are so used to thinking that the right interface between humans and AIs is this perpendicular one-on-one interaction where I’m delegating something, or it’s giving me some news back or I’m asking you a question, et cetera. One of the real things we’ve always missed is this parallel interaction where both the user and the AI actually have a shared canvas that they’re jointly collaborating on. I think if you really think about building a teammate for knowledge workers or even just the world’s smartest personal assistant, you would want to live in a world where there’s a shared collaborative canvas for the two of you.

Speaking of collaboration, I’m really curious how your team works with the rest of Amazon. Are you pretty walled off from everything? Do you work on Nova, Amazon’s foundational model? How do you interact with the rest of Amazon?

What Amazon’s done a great job with, for what we’re doing here, is allowing us to run pretty independently. I think there’s recognition that some of the startup DNA right now is really valuable for maximum speed. If you believe AGI is two to five years away (some people are getting more bullish, some more bearish; it doesn’t matter), that’s not a lot of time in the grand scheme of things. You need to move really, really fast. So, we’ve been given a lot of independence, but we’ve also taken the tech stack that we’ve built and contributed a lot of that upstream to the Nova foundation model as well.

So is your work, for example, already impacting Alexa Plus? Or is that not something that you’re part of in any way?

That’s a good question. Alexa Plus can handle things like this: if your toilet breaks, you’re like, “Ah, man, I really need a plumber. Alexa, can you get me a plumber?” Alexa Plus then spins up a remote browser, powered by our technology, that goes and uses Thumbtack, like a human would, to get a plumber to your house, which I think is really cool. It’s the first production web agent that’s been shipped, if I remember correctly.

The early response to Alexa Plus has been that it’s a dramatic leap for Alexa but still brittle. There’s still moments where it’s not reliable. And I’m wondering, is this the real gym? Is this the at-scale gym where Alexa Plus is how your system gets more reliable much faster? You have to have this in production and deployed to… I mean, Alexa has millions and millions of devices that it’s on. Is that the strategy? Because I’m sure you’ve seen the earlier reactions to Alexa Plus are that it’s better, but still not as reliable as people would like it to be.

Alexa Plus is just one of many customers that we have, and what’s really interesting about being within Amazon is, to go back to what we were talking about earlier, web data is effectively running out, and it’s not useful for training agents. What’s actually useful for training agents is lots and lots of environments, and lots and lots of people doing reliable multistep workflows. So, the interesting thing at Amazon is that, in addition to Alexa Plus, basically every Fortune 500 business’s operations are represented, in some way, by some internal Amazon team. There’s One Medical, there’s everything happening on supply chain and procurement on the retail side, there’s all this developer-facing stuff on AWS. 

Agents are going to require a lot of private data and private environments to be trained. Because we’re in Amazon, that’s all now 1P [first-party selling model]. So they’re just one of many different ways in which we can get reliable workflow data to train the smarter agent.

Are you doing this already through Amazon’s logistics operations, where you can do stuff in warehouses, or [through] the robotic stuff that Amazon is working on? Does that intersect with your work already?

Well, we’re really close to Pieter Abbeel’s group on the robotics side, which is awesome. In some of the other areas, we have a big push for internal adoption of agents within Amazon, and so a lot of those conversations or engagements are happening.

I’m glad you brought that up. I was going to ask: how are agents being used inside Amazon today?

So, again, as we were saying earlier, because Amazon has an internal effort for almost every useful domain of knowledge work, there has been a lot of enthusiasm to pick up a lot of these systems. We have this internal channel called… I won’t tell you what it’s actually called. 

It’s related to the product that we’ve been building. It’s just been crazy to see teams from all over the world within Amazon pick this up. One of the main bottlenecks we’ve had is that we didn’t have availability outside the US for quite a while, and it was crazy just how many international Amazon teams wanted to start picking this up and using it themselves on various operations tasks that they had.

This is just your agent framework that you’re talking about. This is something you haven’t released publicly yet.

We released Nova Act, which was a research preview that came out in March. But as you can imagine, we’ve added way more capability since then, and it’s been really cool. The thing we always do is we first dogfood with internal teams.

Your colleague, when you guys released Nova Act, said it was the most effortless way to build agents that can reliably use browsers. Since you’ve put that out, how are people using Nova Act? It’s not something that, in my day-to-day, I hear about, but I assume companies are using it, and I’d be curious to hear what feedback you guys have gotten since you came out with it.

So, a wide range of enterprises and developers are using Nova Act. And the reason you don’t hear about it is we’re not a consumer product. If anything, the whole Amazon agent strategy, including what I did before at Adept, is sort of doing normcore agents, not the super sexy stuff that works one out of three times, but super reliable, low-level workflows that work 99-plus percent of the time. 

So, that’s the target. Since Nova Act came out, we’ve actually had a bunch of different enterprises end up deploying with us that are seeing 95-plus percent reliability. As I’m sure you’ve seen from the coverage of other agent products out there, that’s a material step up from the average 60 percent reliability that folks see with those systems. I think that the reliability bottleneck is why you don’t see as much agent adoption overall in the field.

We’ve been having a lot of really good luck, specifically by focusing extreme amounts of effort on reliability. So we’re now used for things like, for example, doctor and nurse registrations. We have another customer called Navan, formerly TripActions, which uses us basically to automate a lot of backend travel bookings for its customers. We’ve got companies that basically have 93-step QA workflows that they’ve automated with a single Nova Act script. 

I think the early progress has been really cool. Now, what’s up ahead is how do we do this extreme large-scale self-play on a bajillion gyms to get to something where there’s a bit of a “GPT for RL agents” moment, and we’re running as fast as we can toward that right now.

Do you have a line of sight to that? Do you think we’re two years from that? One year?

Honestly, I think we’re sub-one year. We have line of sight. We’ve built out teams for every step of that particular problem, and things are just starting to work. It’s just really fun to go to work every day and realize that one of the teams has made a small but very useful breakthrough that particular day, and the whole cycle that we’re doing for this training loop seems to be going a little bit faster every day. 

Going back to GPT-5, people have said, “Does this portend a slowdown in AI progress?” And 100 percent I think the answer is no, because when one S-curve peters out… the first one being pretraining, which I don’t think has petered out, by the way, but it’s definitely, at this point, less easy to get gains than before. And then you’ve got RL with verifiable rewards. But then every time one of these S-curves seems to slow down a little bit, there’s another one coming up, and I think agents are the next S-curve, and the specific training recipe we were talking about earlier is one of the main ways of getting that next giant amount of acceleration.

It sounds like you and your colleagues have identified the next turn that the industry is going to take, and that starts to put Nova, as it exists today, into more context for me, because Nova, as an LLM, is not an industry-leading LLM. It’s not in the same conversation as Claude, GPT-5, or Gemini. 

Is Nova just not as important, because what’s really coming is what you’ve been talking about with agents, which will make Nova more relevant? Or is it important that Nova is the best LLM in the world as well? Or is that not the right way to think about it?

I think the right way to think about it is that every time you have a new upstart lab trying to join the frontier of the AI game, you need to bet on something that can really leapfrog, right? I think what’s interesting is every time there’s a recipe change for how these models are trained, it creates a giant window of opportunity for someone new who’s starting to come to the table with that new recipe, instead of trying to catch up on all the old recipes. 

Because the old recipes are actually baggage for the incumbents. So, to give some examples of this, at OpenAI, of course, we basically pioneered giant models. The whole LLM thing came out of GPT-2 and then GPT-3. But those LLMs, initially, were text-only training recipes. Then we discovered RLHF [reinforcement learning from human feedback], and then they started getting a lot of human data via RLHF.

But then in the switch to multimodal input, you kind of have to throw away a lot of the optimizations you did in the text-only world, and that gives time for other people to catch up. I think that was actually part of how Gemini was able to catch up — Google bet on certain interesting ideas on native multimodal that turned out well for Gemini. 

After that, reasoning models gave another opportunity for people to catch up. That’s why DeepSeek was able to surprise the world, because that team straight quantum-tunneled to that instead of doing every stop along the way. I think with the next turn being agents — especially agents without verifiable rewards — if we, at Amazon, can figure out that recipe earlier, faster, and better than everybody else, with all the scale that we have as a company, it basically brings us to the frontier.

I haven’t heard that articulated from Amazon before. That’s really interesting. It makes a lot of sense. Let’s end on the state of the talent market and startups, and how you came to Amazon. I want to go back to that. So Adept, when you started it, was it the first startup to really focus on agents at the time? I don’t think I had heard of agents until I saw Adept.

Yeah, actually we were the first startup to focus on agents, because when we were starting Adept, we saw that LLMs were really good at talking but could not take action, and I could not imagine a world in which that was not a crucial problem to be solved. So we got everybody focused on solving that. 

But when we got started, the word “agent,” as a product category, wasn’t even coined yet. We were trying to find a good term, and we played with things like large action models, and action transformers. So our first product was called Action Transformer. And then, only after that, did agents really start picking up as being the term.

Walk me through the decision to leave that behind and join Amazon with most of the technical team. Is that right?

Mm-hmm.

I have a phrase for this. It’s a deal structure that has now become common with Big Tech and AI startups: the reverse acquihire, where basically the core team, such as you and your cofounders, joins. The rest of the company still exists, but the technical team goes away. And the “acquirer” — I know it’s not an acquisition — pays a licensing fee, or something to that effect, and shareholders make money. 

But the startup is then kind of left to figure things out without its founding team, in most cases. The most recent example is Google and Windsurf, and then there was Meta and Scale AI before that. This is a topic we’ve been talking about on Decoder a lot. The listeners are familiar with it. But you were one of the first of these reverse acquihires. Walk me through when you decided to join Amazon and why.

So I hope, in 50 years, I’m remembered more as being an AI research innovator rather than a deal structure innovator. First off, humanity’s demand for intelligence is way, way, way higher than the amount of supply. So, therefore, for us as a field, to invest ridiculous amounts of money in building the world’s biggest clusters and bringing the best talent together to drive those clusters is actually perfectly rational, right? Because if you can spend an extra X dollars to build a model that has 10 more IQ points and can solve a giant new concentric circle of useful tasks for humanity, that is a worthwhile trade that you should do any day of the week.

So I think it makes a lot of sense that all these companies are trying to put together critical mass on both talent and compute right now. From my perspective on why I joined Amazon, it’s because Amazon knows how important it is to win on the agent side, in particular, and that agents are a crucial bet for Amazon to build one of the best frontier labs possible. To get to the level of scale, you’re hearing all these CapEx numbers from the various hyperscalers. It’s just completely mind-boggling and it’s all real, right?

It’s over $340 billion in CapEx this year alone, I think, from just the top hyperscalers. It’s an insane number.

That sounds about right. At Adept, we raised $450 million, which, at the time, was a very large number. And then, today is…

It’s chump change now.

[Laughs] It’s chump change.

That’s one researcher. Come on, David.

[Laughs] Yes, one researcher. That’s one employee. So if that’s the world that you live in, it’s really important, I think, for us to partner with someone who’s going to go fight all the way to the end, and that’s why we came to Amazon.

Did you foresee that consolidation and those numbers going up when you did the deal with Amazon? You knew that it was going to just keep getting more expensive, not only on compute but on talent.

Yes, that was one of the biggest drivers.

And why? What did you see coming that, at the time, was not obvious to everyone?

There were two things I saw coming. One, if you want to be at the frontier of intelligence, you have to be at the frontier of compute. And if you are not on the frontier of compute, then you have to pivot and go do something that is totally different. For my whole career, all I’ve wanted to do is build the smartest and most useful AI systems. So, the idea of turning Adept into an enterprise company that sells only small models or turns into a place that does forward-deployed engineering to go help you deploy an agent on top of someone else’s model, none of those things appealed to me. 

I want to figure out, “Here are the four crucial remaining research problems left to AGI. How do we nail them?” Every single one of them is going to require two-digit billion-dollar clusters to go run it. How else am I — and this whole team that I’ve put together, who are all motivated by the same thing — going to have the opportunity to go do that?

If antitrust scrutiny did not exist for Big Tech like it does, would Amazon have just acquired the company completely?

I can’t speak to general motivations and deal structuring. Again, I’m an AI research innovator, not an innovator in legal structure. [Laughs]

You know I have to ask. But, okay. Well, maybe you can answer this. What are the second-order effects of these deals that are happening, and, I think, will continue to happen? What are the second-order effects on the research community, on the startup community?

I think it changes the calculus for someone joining a startup these days, knowing that these kinds of deals happen, and can happen, and take away the founder or the founding team that you decided to join and bet your career on. That is a shift. That is a new thing for Silicon Valley in the last couple of years.

Look, there are two things I want to talk about. One is, honestly, the founder plays a really important role. The founder has to want to really take care of the team and make sure that everybody is treated pro rata and equally, right? The second thing is, it’s very counterintuitive in AI right now, because there’s only a small number of people with a lot of experience, and because the next couple of years are going to move so fast, a lot of the value, the market positioning, et cetera, is going to be decided in that window. 

If you’re sitting there responsible for one of these labs, and you want to make sure that you have the best possible AI systems, you need to hire the people who know what they’re doing. So, the market demand, the pricing for these people, is actually totally rational, just solely because of how few of them there are.

But the counterintuitive thing is that it doesn’t take that many years, actually, to find yourself at the frontier, if you’re a junior person. Some of the best people in the field were people who just started three or four years ago, and by working with the right people, focusing on the right problems, and working really, really, really hard, they found themselves at the frontier. 

AI research is one of those areas where if you ask four or five questions, you’ve already discovered a problem that nobody has the answer to, and then you can just focus on that and on becoming the world expert in that particular subdomain. So I find it really counterintuitive that so few people really know what they’re doing, and yet it’s very easy, in terms of the number of years, to become someone who does.

How many people actually know what they’re doing in the world from your definition? This is a question I get asked a lot. I was literally just asked this on TV this morning. How many people are there, who can actually build and conceptualize training a frontier model, holistically?

I think it depends on how generous or tight you want to be. I would say the number of people who I would trust with a giant dollar amount of compute to go do that is probably sub-150.

Sub-150?

Yes. But there are many more people, let’s say, another 500 people or so, who would be extremely valuable contributors to an effort that was populated by a certain critical mass of that 150 who really know what they’re doing.

But for the total market, that’s still less than 1,000 people.

I’d say it’s probably less than 1,000 people. But again, I don’t want to trivialize this: I think junior talent is extremely important, and people who come from other domains, like physics or quant finance, or who have just been doing undergrad research, these people make a massive difference really, really, really fast. But you want to surround them with a couple of folks who have already learned all the lessons from previous training attempts in the past.

Is this very small group of elite people building something that is inherently designed to replace them? Maybe you disagree with that, but I think superintelligence, conceptually, would make some of them redundant. Does it mean there’s actually fewer of them, in the future, making more money, because you only need some orchestrators of other models to build more models? Or does the field expand? Do you think it’s going to become thousands and thousands of people?

The field’s definitely going to expand. There are going to be more and more people who really learn the tricks that the field has developed so far, and discover the next set of tricks and breakthroughs. But I think one of the dynamics that’s going to keep the field smaller than other fields, such as software, is that, unlike regular software engineering, foundation model training breaks so many of the rules that we think we should have. In software, let’s say our job here is to build Microsoft Word. I can say, “Hey, Alex, it’s your job to make the save feature work. It’s David’s job to make sure that cloud storage works. And then someone else’s job is to make sure the UI looks good.” You can factorize these problems pretty independently from one another.

The issue with foundation model training is that every decision you take interferes with every other decision, because there’s only one deliverable at the end. The deliverable at the end is your frontier model. It’s like one giant bag of weights. So what I do in pretraining, what this other person does in supervised fine-tuning, what this other person does in RL, and what this other person does to make the model run fast, all interact with one another in sometimes pretty unpredictable ways. 

So, with the number of people, it has one of the worst diseconomies of scale of anything I’ve ever seen, except maybe sports teams. Maybe that’s the one other case where you don’t want to have 100 midlevel people; you want to have 10 of the best, right? Because of that, the number of people who are going to have a seat at the table at some of the best-funded efforts in the world, I think, is actually going to be somewhat capped.

Oh, so you think the elite stays relatively where it is, but the field around it — the people who support it, the people who are very meaningful contributors — expands?

I think the number of people who know how to do super meaningful work will definitely expand, but it will still be a little constrained by the fact that you cannot have too many people on any one of these projects at once.

What advice would you give someone who’s either evaluating joining an AI startup, or a lab, or even an operation like yours in Big Tech on AI, and their career path? How should they be thinking about navigating the next couple of years with all this change that we’ve been talking about?

First off, tiny teams with lots of compute are the correct recipe for building a frontier lab. That’s what we’re doing at Amazon with its staff and my team. It’s really important that you have the opportunity to run your research ideas in a particular environment. If you go somewhere that already has 3,000 people, you’re not really going to have a chance. There’s so many senior people ahead of you who are all too ready to try their particular ideas. 

The second thing is, I think people underestimate the codesign of the product, the user interface, and the model. I think that’s going to be the most important game that people are going to play in the next couple of years. So going somewhere that actually has a very strong product sense, and a vision for how users are actually going to deeply embed this into their own lives, is going to be really important.

One of the best ways to tell is to ask, are you just building another chatbot? Are you just trying to fight one more entrant in the coding assistant space? Those just happen to be two of the earliest product form factors that have product market fit and are growing like crazy. I bet when we fast-forward five years and we look back on this period, there will be six to seven more of these crucial product form factors that will look obvious in hindsight but that no one’s really solved today. If you really want to take an asymmetrical upside bet, I would try to spend some time and figure out what those are now.

Thanks, David. I’ll let you get back to your gyms.

Thanks, guys. This was really fun.

Questions or comments about this episode? Hit us up at [email protected]. We really do read every email!

 
