Eiso Kant, Co-Founder of Poolside, will explore why combining reinforcement learning and software development might be the fastest path to human-level AI. He'll share Poolside's mission to build an AI that doesn't just autocomplete code, but learns like a real developer.


EISO:

If you want to get highly capable foundation models in software development, you can't focus only on code. Software development is a representation of understanding the world. So our models are very capable general-purpose models, by the way. They're quite good at writing poems and helping you with all the other things that aren't software development. But what we do is focus all of our efforts earlier in the training towards software development, and we put huge amounts of emphasis in our training data and our approaches to training on reasoning. Where it is now moving towards is increasingly more agentic. And what this means is that the tasks are becoming more complex and more abstract: not just write a function, but try to implement a whole feature, try to figure out a complex bug. And here it also means that it's a multi-step process.

Well, first of all, thank you for having me here today. It is definitely worth going a little bit back into my background. I'm at my core a computer geek. I've been programming for most of my life. In 2015 I read an article by Andrej Karpathy titled "The Unreasonable Effectiveness of Recurrent Neural Networks." I have to say, that article pretty much launched me down a rabbit hole of language modeling. Now, it's important to note that this was 2015, because in 2016 the next thing that really caught my attention was, as for a lot of the world, AlphaGo. And when AlphaGo came out, it was, in effect, the unreasonable effectiveness of reinforcement learning. I think those two things, happening in pretty short succession, really framed my thinking to this date. So in 2016 I pivoted the company I was building to focus entirely on building AI that was capable of writing code. This was using language models.

CRAIG:

Did you say 2020?

EISO:

2015 to 2016, yeah.

CRAIG:

Before Transformers? Yeah.

EISO:

Before Transformers. So at the time we were training models with LSTMs, the precursor to Transformers, and we were starting to do some of the world's first work around reinforcement learning via code execution. Back then we used to call it RL via compilers. And you can kind of see how those two things, from those two origins, came together at that moment; I was kind of lucky to get the right information in front of me at the right time. Over the course of the following four years of working on this with a whole team, I built up an incredibly strong conviction that language modeling, as done with Transformers post-LSTMs, was going to be able to go all the way to human-level capabilities, but it was not going to be able to do so just by predicting the next token. It was going to have to go hand in hand with reinforcement learning. And along the way, by the end of 2017, I met my now co-founder, Jason. Jason was the CTO at GitHub. This was about two years before they got acquired by Microsoft, and he made an acquisition offer to buy that company. We had the world's first code completion models that were working.

EISO:

And he saw a similar future to the one that I believed in: that AI was going to approximate human-level capabilities in software development, given enough time and resources. I turned down the offer, but we became good friends and kind of never stopped talking, and at some point had a podcast together for years, which is just a great way for two grown men to meet with each other on a regular schedule. And, you know, fast forward: that company didn't succeed. We spent about four years on it, and we were too early. We actually had quite a bit of technology working, but it was challenging to get developers willing to connect their editor to the internet. Sounds crazy today, but not that long ago that was a big part of the challenge. And, you know, life happened, and then November 2022 hits and ChatGPT comes out. At that point, everything I had always believed was going to happen was starting to happen. And that was, to some extent, the thing that looked like the biggest failure in my career: I had spent years working on this with a whole research team, with this strong conviction, and it didn't end up succeeding as a company.

EISO:

And then all of a sudden, you know, it's taking off in the world. And at that point, to both me and Jason, since we were still speaking on a weekly basis, it was very clear what the next ten years were going to look like. It was now just going to be a massive acceleration of closing the gap between models and human intelligence, and even beyond. So we started asking ourselves the question: what does it take to go into that race? We looked around, and we realized that the sentiment in the world was: all we need to do to reach AGI is to scale up parameter size and data and just do more language modeling. And that wasn't our view. Our view was: yes, scale massively matters. It has a direct relationship with intelligence in models, as has been empirically shown to date and continues to be shown. But our view was that it had to go hand in hand with reinforcement learning. And that's what we built poolside around. That's why we started this company.

CRAIG:

And you built it as a code generation platform, not as a step towards AGI.

EISO:

So, both. We wrote something on our website on day one, two years ago, and it's still there: poolside's AI vision. It lays out our view for the next years in three steps. Step one: make AI capable of assisting everyone in building software. Step two: allow anyone to build software with it. Step three: generalize to all domains. And we very clearly stated: this company exists to be in the race to AGI. The reason we focused on software development is that we felt it was going to do two things for us. One, it was going to unlock the first area where AI was going to have massive economic impact. We reasoned from the fact that we saw AI getting increasingly more capable in the domain, but also that software developers and people who build software have traditionally always been on the frontier of adopting technology early. So it was obvious to us that it was going to be the first place with big impact in terms of productivity and changing the way we work. The second part of it was that we wanted to put blinders on. We knew that if we were going to say, hey, we're general-purpose for everything from writing poems to helping with medical knowledge to software development, we were going to spread ourselves too thin. And software development is a great proxy for intelligence: it requires understanding the world, it requires complex reasoning, it requires planning over long-term objectives, it requires interacting with the digital world in front of us, it requires visual understanding. It requires a lot of, not all, but a lot of what makes up valuable human intelligence. So our view was that by pushing the frontier there, we were going to naturally converge on the same point everyone else will in the future, which is often referred to as AGI.

CRAIG:

Right. And, you know, things like GitHub Copilot are based on OpenAI's models, and most of the other code generation platforms are based on the general foundation models. I think you guys built proprietary models from scratch. Is that right? Trained solely on code?

EISO:

No, that second part is not correct. The first part is right: we're entering the race to AGI, and we build foundation models from the ground up. But if you want to get highly capable foundation models in software development, you can't focus only on code. Software development is a representation of understanding the world. So our models are very capable general-purpose models, by the way. They're quite good at writing poems and helping you with all the other things that aren't software development. But what we do is focus all of our efforts earlier in the training towards software development, and we put huge amounts of emphasis in our training data and our approaches to training on reasoning, on the ability for models to take in longer-term objectives and successfully work through them, both by reasoning through the information that's presented to them and by tool use, by being able to actually interact with their environment. But all of us live in a world with a fixed parameter budget: I can't serve a 10-trillion-parameter model cost-efficiently to users; that's not going to happen on today's hardware. So if you think about it, every model company can serve a maximum-size model, doesn't matter if it's sparse or dense, but there's effectively a certain number of cents that you can spend per inference request, right? Because we all operate on effectively the same three different types of hardware that are out there. So we all have the same cost profile. Don't get me wrong, it can be 20 or 30% more efficient here or there, and DeepSeek did a great job, but we're all on the same budget. And our view has always been: let's use that parameter budget far more towards software development, by pushing it towards software-development-related capabilities earlier in its training, by applying more training compute to software-development-related knowledge tasks, etc. And a lot of that has to do with our work in reinforcement learning.
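To make the fixed-budget arithmetic concrete, here is a rough back-of-envelope sketch. It assumes the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; every number in it is an illustrative assumption, not a poolside figure.

```python
# Back-of-envelope serving cost for a dense decoder-only model.
# Rule of thumb: ~2 FLOPs per active parameter per generated token.
# All numbers below are illustrative assumptions, not poolside figures.

params_active = 70e9          # active parameters (dense 70B, assumed)
gpu_peak_flops = 1.0e15       # ~1 PFLOP/s peak for a modern accelerator (assumed)
mfu = 0.4                     # fraction of peak FLOPs actually achieved (assumed)
gpu_cost_per_hour = 2.0       # USD, illustrative cloud price

flops_per_token = 2 * params_active
tokens_per_sec = gpu_peak_flops * mfu / flops_per_token
cost_per_million_tokens = gpu_cost_per_hour / 3600 / tokens_per_sec * 1e6

print(f"{tokens_per_sec:,.0f} tokens/s -> ${cost_per_million_tokens:.2f} per 1M tokens")
# Scaling params_active 10x scales cost ~10x: the "cents per request" budget
# is why a 10-trillion-parameter dense model can't be served economically today.
```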

CRAIG:

Yeah. And you use reinforcement learning from code execution, right? I mean, there's been a lot of work on RLAIF, you know, feedback from other models, but this is specifically running code that's been written to see whether or not it executes. Is that right?

EISO:

Yeah, absolutely. Our view has always been, and I think this is generally held, that in the pre-training stage we're pushing these models on the most general task you could possibly have: predicting the next token. But when we talk about skills like software development, they operate inside an environment, right? The code gets written, it gets executed, it gets tested. It needs to run in a full system. We as developers are modifying it; we're getting feedback from errors and bugs that we're introducing ourselves. All of that feedback makes building software an iterative process, and our view is that models should learn through something very akin to that experiential, iterative process. So step one for us two years ago was building a code execution environment, and it's grown a lot by now. We are starting to approximate a million repositories: real-world code bases that are fully containerized with their test suites, that you can make any changes in, with tens of millions of revisions as a total set. And you can define tasks in there, synthetic but also humanly written, and have the models explore the solution space.
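For illustration, one way such a containerized task could be specified; this is purely a hypothetical schema, not poolside's format, and every field value is a placeholder.

```python
# Hypothetical schema for a containerized RL task over a real repository:
# a pinned code base with its test suite, plus a task the model must solve.
from dataclasses import dataclass

@dataclass
class CodeExecutionTask:
    repo_url: str          # real-world code base, fully containerized
    revision: str          # pinned commit so runs are reproducible
    image: str             # container with toolchain and dependencies
    instruction: str       # synthetic or human-written task description
    test_command: str      # verifier: reward comes from executing this

task = CodeExecutionTask(
    repo_url="https://example.org/some/repo.git",  # placeholder URL
    revision="a1b2c3d",
    image="repo-env:py3.11",
    instruction="Fix the off-by-one error in the pagination helper.",
    test_command="pytest tests/ -q",
)
```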

EISO:

And with RL it's kind of always the same thing, right? You have a task, you have a number of samples that you're rolling out in terms of the possible trajectories to solving that task, and then you've got the rewards from when it successfully solves it or fails. So by having this extremely large environment, our job becomes increasingly improving the quality of the tasks that we generate, most of which are synthetic, and increasingly improving the reward signal that we bring to the model. The obvious one is the unit tests passing, but there's a lot more reward signal that can join on top of that. And then we're always making sure that the diversity of that environment is constantly growing, so that you're getting both more diverse tasks and more challenging tasks, and those models are getting more capable. With our latest generation of model it's no longer just a single- or multi-turn rollout. Now it's an agent that is entering into that environment, getting access to more tools and doing more complex things to learn.
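As a minimal sketch of that loop, here is one way grouped rollouts could be scored against a containerized test suite, in the general style of GRPO-like objectives. The helpers `sample_solutions` and `run_test_suite` are hypothetical stand-ins for the model's sampler and the repo's test harness; this is not poolside's implementation.

```python
# Minimal sketch of RL-from-code-execution rewards over grouped rollouts.
from statistics import mean, pstdev

def grouped_advantages(task, num_samples=16):
    solutions = sample_solutions(task, n=num_samples)  # candidate patches
    # Reward 1.0 if the repo's unit tests pass with the patch applied, else 0.0.
    rewards = [1.0 if run_test_suite(task.repo, s).passed else 0.0
               for s in solutions]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    # Group-relative: each rollout's advantage is its reward normalized within
    # its own group, so successes and failures both carry learning signal.
    advantages = [(r - mu) / sigma for r in rewards]
    return list(zip(solutions, advantages))
```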

CRAIG:

Yeah, and for the initial training, you have two models, right? Two main foundation models. What's the difference between them? And then the question I was going to ask is on training: are you training them in the traditional way, if "traditional" applies to something so new, you know, feeding the transformer algorithm huge amounts of data for it to encode knowledge in its weights? Or are you, as I understand DeepSeek did, training directly with reinforcement learning?

EISO:

So if you look at the different stages of training a foundation model, the usual, traditional pipeline is: we're pre-training on effectively the web, after you've done a lot of work in terms of optimizing the web, rewriting it with models, making it more coherent, tagging it, weighting it, tons of experiments. That's the next-token-prediction stage, and it takes up the majority of the pre-training budget. Along the way you'll shift some of the distributions of your data sets towards the end of training. Then you have the post-training step, and within post-training you find a mix of techniques. There's supervised fine-tuning, so, like you said, providing data sets with examples, often more conversational, so the model gets used to this back-and-forth user-assistant style. And then you have the reinforcement learning component, which is giving the model these sets of tasks in these environments and rewarding it when it successfully completes them, or when steps in the process are correct. And then you might do some later SFT again. That's the standard pipeline, I would say, of 2025 model building for everybody. Now, there are places where all of us in the industry are innovating to push further than that. I would say where we have really put our focus is: how much further can we push the reinforcement learning in that? What's the limit of doing RL on verified rewards, on code execution? And deep exploration of what the limits of reinforcement learning are on non-verified rewards.

EISO:

That last one is the part where we don't go into too much detail yet, because we hold it a little more proprietary. And then there's what we internally often refer to as bread-and-butter model building. Effectively, what you're doing in model building is improving the effectiveness of your data and improving the effectiveness of your compute. There is a ton of work that you are constantly doing, right? Whether that's your latest sweep over an improved data set, over architecture, over attention mechanisms. But it was important for us that we moved out of the world of artisanal model building, where every single one of these became its own project and effort by researchers, to having really a model factory. How would we be able to go very quickly from these ideas to results? How can you run a sweep with a thousand different variations of hyperparameters or different data setups? How can you make this deterministic, so that your results from idea to experiment are perfectly traceable? By now, two years in, building models is more about working on the factory that builds models than it is about the latest artisanal idea that you're pulling components from to try to make work.
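As a toy illustration of that factory idea, not poolside's tooling: deterministic, traceable sweeps largely come down to enumerating configurations and deriving stable run IDs and seeds from them. All names and values here are invented for the example.

```python
# Toy sketch of a deterministic hyperparameter sweep: every config is
# enumerated, hashed to a stable run ID, and seeded from that ID, so any
# result can be traced back to the exact experiment that produced it.
import hashlib, itertools, json

grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [256, 512],
    "dataset_mix": ["web_v3", "web_v3_plus_code"],  # illustrative names
}

def run_id(config: dict) -> str:
    # Stable hash of the full config: the experiment's traceable identity.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    rid = run_id(config)
    config["seed"] = int(rid, 16) % 2**31  # deterministic seed per config
    print(rid, config)  # in practice: submit this run to the training cluster
```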

CRAIG:

Yeah. On the RL, the code execution feedback: is that a continual loop, where the model writes code, executes it, gets a result, feeds the result back into its training, and writes code again and executes, so it's getting better and better at writing code? How does that work?

EISO:

So that's close to where it started. Where it started was: here's a repository, containerized, where code can be executed and tested. And it started with defining some very concrete tasks, like: let's remove a function, hide it from the model, translate that function into an instruction to try to write it, have the model think through its thoughts and then its solution, do maybe ten or 15 or 20 samples, and then positively score the ones that are correct and negatively score the ones that fail. And there are different RL algorithms; the latest popular one is GRPO, where you group these rollouts, so you take advantage of both the positive and the negative samples. That's where RL started for us. Where it is now moving towards is increasingly more agentic, so RL with agents in the loop. And what this means is that the tasks are becoming more complex, more abstract: not just write a function, but try to implement a whole feature, try to figure out a complex bug. And here it also means that it's a multi-step process. So the agent is in the container. It no longer only edits code. It can run commands, it can search things, it can open files, it can read them, it can store things, it can execute different binaries. So you can see it trying to pull a dependency, build it from source, and try to install it. You see that the agents are effectively getting access to the same tools we have as developers, right? We don't just write code, we make a lot more changes in the system. And the rewards are increasingly about successfully completing a longer-range task. RL is a very finicky beast, and getting reinforcement learning stable means that you're always on the border of looking for tasks that are complex enough for the model to learn from, but not so complex that it can never get any solution right, so there's nothing to learn. Not unlike us as humans, by the way.
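A minimal sketch of that early mask-a-function setup, with hypothetical helpers (`extract_functions`, `generate_instruction`, `sample_completions`, `run_tests`) standing in for the real pipeline; a sketch of the described recipe, not poolside's code.

```python
# Sketch of the early "remove a function, have the model rewrite it" task.
import random

def make_masked_function_task(repo):
    fn = random.choice(extract_functions(repo))  # pick a target function
    masked_repo = repo.without(fn)               # hide its body from the model
    instruction = generate_instruction(fn)       # describe what it should do
    return masked_repo, instruction

def score_samples(masked_repo, instruction, n=16):
    scored = []
    for sample in sample_completions(masked_repo, instruction, n=n):
        # Execution is the reward signal: splice the candidate back in and
        # check whether the repository's own test suite still passes.
        passed = run_tests(masked_repo.with_patch(sample)).passed
        scored.append((sample, 1.0 if passed else -1.0))
    return scored
```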

CRAIG:

Yeah, yeah. That's interesting. Who else is using this kind of feedback from code execution? Because it's the first time I've seen it, but that doesn't mean it isn't the standard.

EISO:

So when I started in this space in 2016, I think we were the first to ever even look at it. I'm sure good ideas pop up in many places, I'm sure there were others, but it wasn't really a thing. Two years ago, when we started poolside, I don't think anyone was focused almost at all on RL from code execution feedback. Today, it is clear that almost every reasoning model out there, whether the ones you referred to, or from OpenAI or Anthropic, etc., will have a version of this as part of its training loop. What I would say is unique about us is the extremely large scale of our environment for code execution, and the vast amount of diversity in terms of tasks and what the model can learn. And now, I would say, our work is increasingly going beyond that: what does RL look like for places that are unverifiable, so not just code and software development? Because one of the points of view that we really hold is that knowledge is important to learn. You need knowledge to understand the world; you need representations of knowledge. But there is something slightly more universal about successful reasoning and thought over a domain. And if you can push those capabilities in a domain like software development, you see that it's kind of like an onion, or, a better analogy, a stone dropped in the water that has a ripple effect on everything else that sits nearby. Improving the capabilities in code improves the capabilities in math, improves the capabilities in reasoning in other domains, as far afield as legal and others. And I don't think this is unique anymore; I think others have seen this as well. So I think right now at the frontier it's a race of scaling up reinforcement learning compute for everybody.

CRAIG:

You said at the beginning that you're building really for developers, but that the long-term goal is to build a system where anybody could write software, presumably with natural language. With the current iteration of the product, with this reinforcement learning from code execution feedback, does that mean that it won't stop working, or won't return a result to the user, until the code it's written executes successfully?

EISO:

So think about this from a model-and-product-working-together perspective; there is something really interesting in how RLCEF, reinforcement learning from code execution feedback, has evolved till now. In the beginning, RLCEF was something we did in our training across code bases, and we had specific training code for that. Now, the agent that is going into the RL loop is the exact same agent that we're going to start shipping as product to users. And by the way, we've already seen examples of this in the world: if you've seen these deep research products that are out there, they're effectively an agent trained for deep research in the RL loop, and then that agent gets a UI put on top of it so that it becomes available to users. And this is why it's increasingly becoming clear to us that for general-purpose software development agents, the most capable agents are going to come out of the foundation model companies. That doesn't mean there won't be very capable specific agents built on top of models by many other people. But because the agent itself, the tools it has access to, the prompts that surround it, is what's actually going into the training loop, that agent becomes increasingly more capable. Now, for the end user, how you interface with AI for software development really has to do with the fact that AI is not yet at human-level capabilities. Everything that we're building around these interfaces today is usually there to deal with the fact that the models are still highly fallible. They still make quite a lot of mistakes. And that means we are often building things around them so that you can review the code, so that you can come back, have a multi-turn conversation, tell it when it was wrong, give it feedback from the environment that you're in; say your tests failed, how do you pass that along? But we are increasingly on a trajectory where that layer of code and features built around the model is becoming thinner and thinner.

CRAIG:

Yeah, yeah. Because, I'm not a coder, and I've been talking to people about this long before the initial gen AI applications hit the market. There was a guy in machine programming at Intel who used to talk about this, and at that time it seemed like a distant dream. And then, just a few years later, it's happening. But again, not being a coder, my problem in using any of the code generation tools is I can't spot a mistake. I can't read the code myself. So it'll give me an output, I run it, get an error, send the error back to the model. The model says, oh, yes, of course, this is what's wrong, and fixes it. I run the code, hit another error, and it's just this endless loop. It never ends. It seems to me, though, that with RL that could be overcome if you let the model work in a container, as you say, not giving the user the output until it's executed, or maybe it says, you know, I'm sorry, I can't get this to run.

EISO:

You're heading in the right direction, correct. An agent is essentially a model, a loop, and an environment. The environment is where it's operating and where you provide the tools to the model; the model is what matters, and the rest is just code around it, the loop. And so what we're starting to see already is increasingly larger tasks of longer duration and higher complexity that models are able to iterate on, to try to successfully get to completion, or give up at some point and say, hey, I can't figure this out. And that's the world we're moving towards, right? We're moving from a multi-turn, back-and-forth chat experience to a world of: hey, I have this task, can you go and do this? Step one being the model asking the right clarification questions. "Ask the human" is a tool in itself for the model, right? It can go and ask someone and then go off and try. And if the environment is well set up, like the container you described, that's where you're going to see a lot of work getting done. But it's important to note that we're not yet living in a world where models reach the same level of capabilities that we have. For certain types of tasks this absolutely can go successfully to completion ten out of ten times, but for a lot of tasks it's not fully there yet. And this is where the notion of test-time compute often gets referred to: giving the model more time to run more inference, to try to get to a solution. But it's kind of like this: if tomorrow I give you a quantum physics problem, and I'm assuming you're not a quantum physicist, Craig, an hour or 500 hours of test-time compute is probably not going to get you to solve it. And that's the same with models, right? They still have real limitations and gaps. Their boundaries are just not always as obvious as ours are; they often fail in surprising ways.
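For illustration, that model-loop-environment structure can be sketched in a few lines; `model.next_action` and `env.execute` are hypothetical stand-ins, not poolside's agent runtime.

```python
# Minimal agent loop: the model proposes actions, the environment (a
# container with developer tools) executes them and returns observations.

def run_agent(model, env, task, max_steps=50):
    transcript = [("task", task)]
    for _ in range(max_steps):
        action = model.next_action(transcript)   # e.g. edit file, run command, ask human
        if action.kind == "finish":
            return action.result                 # agent believes the task is done
        if action.kind == "give_up":
            return None                          # "I can't get this to run"
        observation = env.execute(action)        # compiler errors, test output, file contents
        transcript.append((action, observation)) # feedback drives the next step
    return None                                  # step budget exhausted
```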

CRAIG:

Yeah. And you're using transformers. You mentioned LSTMs. I had Sepp Hochreiter, I can never pronounce his name, on the program, and he's continuing his research on LSTMs. For listeners, he was the guy who came up with long short-term memory, the algorithm that was the standard for a long time. He believes that you can widen the memory window and that there are still a lot of applications. And then there's this Mamba, which is, you know, layers of, uh, of...

EISO:

Of a state-space-style approach.

CRAIG:

Thank you. Yeah, yeah. And then you have a transformer that kind of sums up what's going on, and then another block of these state-space layers. Are you using either of those, or are you sticking with transformers?

EISO:

So it's a really good question. Coming back to the factory approach: what you're talking about, attention mechanisms and model architecture, are an important detail, and one that you spend time experimenting on. But the way to think about them is that in many cases those details unlock either inference speed or training speed, and potentially that goes hand in hand with longer context windows, right, this longer working memory of a model. We've done quite a lot of work around RNN-style attention. You mentioned Mamba; there's also RWKV, one of whose co-authors works at poolside. We've had quite a bit of success with RNN-style attention. In the end, what you see is that the different attention mechanisms in combination, as a hybrid, often give really good results. That's what you were talking about with Mamba, the hybrids where you have transformer blocks or global attention layers. These are interesting details, and I can geek out about them for hours. But I think it's also really important to realize that they are optimizations; they're not fundamental breakthroughs. We definitely, on our horizon, could still have fundamental breakthroughs on architecture that could have massive impact on, for example, the memory of models. Because right now we don't truly have memory: we either have knowledge embedded at training time, by updating the weights and activations, or just the working memory, whether that's an RNN state or a Mamba-style state or a transformer's context.

EISO:

It all effectively passes once the inference call is done. So those are things we continue to push on. But also, at this point, it does feel like we are at a moment, and we've been saying this for a little while, where we don't need a fundamental breakthrough in architecture to be able to get all the way to human-level capabilities in coding and software development. It would be massively helpful, and there are definitely architecture breakthroughs that, if they happen, could upend our entire industry, right? All of a sudden it doesn't require these billions of dollars anymore to train models at the frontier. But right now, the unreasonable effectiveness of neural nets is really there. You scale them up, you scale up more compute, whether that's on traditional next-token prediction, because you have the data and synthetic data generation, or on RL, which I think is increasingly going to be a bigger part of the compute budget of training a model. So much so that I'm willing to make the prediction that it becomes the largest part of the compute budget of training a model in the following years. That way you can go all the way. But yes, those things matter, because at the end of the day the effectiveness of our compute often relates to the architecture.

CRAIG:

Yeah. You were saying that, as smart as the models are, they're not at human level yet. They're certainly at human level with natural language. What is it about code that prevents them from getting to that level?

EISO:

So I think what we see, having trained on most of the web, and now all of us have been rewriting the web into better and cleaner and synthetic forms of it, is that you definitely get an incredible amount of knowledge encoded. You get incredible language understanding. But the web is an output product. It's the final article written, not the thought process that went into it. It's the final research paper, not every single step of thinking through an experiment. It's Einstein's relativity theory, but not the hundreds or thousands of hours that he spent thinking it through and the things that didn't work. I think this is really important, because the process of the creation of work turns out to be a really important training data set, and it doesn't exist in enough quantity that next-token prediction alone will get there. In the limit, if you had infinite data and infinite compute, you could get to AGI that way, right? Next-token prediction is an incredibly strong optimization pressure on the model's learning. But we don't have trillions of tokens of thought process, of correct reasoning in math and coding and medicine and law and all of these areas. In my view, reasoning is a subset of thinking: reasoning is goal-oriented, it requires an outcome, while thinking can be much broader. That is something that isn't well represented. And because we don't have that, we see that models can spectacularly fail on things that we consider very simple. My favorite example of this is when you look at how they do math.

EISO:

So if I take something simple, like a large number multiplied by a large number, and I throw it into an LLM today, the number that it outputs will be wrong, but it will only be off by maybe, you know, 5% left or right. Say 3385 times 9802: it will actually be directionally correct. And the truth is, if you gave me that problem and required an instant response, I'd probably be worse, though hopefully directionally correct. When we get models to reason about it the way that we do, applying the little algorithm that we learned in school for how to do that multiplication, it gets a correct output. That's the part that is underdeveloped in models. And reinforcement learning now offers the promise, and is starting to offer the results, that we can develop that complex reasoning. Now, software development and coding is just a really great task that requires a lot of complex, multi-step reasoning to do something correctly, often because it operates in environments that are so much larger than our working memory, right? A huge code base, an entire system. You want the model to be an agent; you want it to be able to interact with the feedback it gets when it makes a mistake, because you can't reasonably expect it today to be perfect. And neither are we, by the way. So that's the part that is missing, and it doesn't just hold true in coding. It holds true in almost every advanced knowledge-work domain where you apply AI today.
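For reference, the schoolbook algorithm on that example gives 3385 × 9802 = 33,179,770. A short sketch of the same digit-by-digit procedure, which is essentially what asking a model to reason step by step recovers:

```python
# The grade-school long-multiplication algorithm, made explicit step by step.
def long_multiply(a: int, b: int) -> int:
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10**place   # one partial product per digit of b
        print(f"{a} x {digit} x 10^{place} = {partial}")
        total += partial
    return total

print(long_multiply(3385, 9802))  # 33179770
```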

CRAIG:

Yeah. And I mean, I was thinking that since code is deterministic, it would be easier for a model to follow the steps and understand the reasoning behind the steps. But maybe that's overly naive.

EISO:

I wish that were the case.

EISO:

Now, I do think that because it is deterministic, it makes training with reinforcement learning to improve that level of reasoning a lot easier than in non-deterministic domains, right? You can do a lot to be able to say: does this code compile? Was it correct? Did it pass unit tests? There's another notion that I often refer to internally as large language model arbitrage, which is that models are better at reasoning about code, what it is supposed to do, what the successful inputs and outputs are, than they are at writing it. And by the way, so are we: it's easier to observe something and reason about what it's supposed to do than to actually create that thing. And the combination of this model arbitrage with the fact that code has deterministic outcomes and can be executed makes it the world's greatest target for RL. That's what got me excited about it in 2016. It's what got me to start this company together with Jason. And it's now, I think, what's leading a lot of the improvements we're seeing in the reasoning capabilities of models, not just in code but across the board, because it turns out that reasoning is a representation that we're learning in models that touches every other representation of knowledge.

CRAIG:

Yeah. So where does the product stand now? And I have to ask: I use Manus, you know, the Chinese autonomous multi-agent software that operates in a virtual environment in the cloud, so that it's not using your computer while it's reasoning, with mixed results. Where does that stand in the universe of code generation? Is it just an engineering exercise that's interesting, or do you think they're doing something? And when you talk about agents in poolside, is it a similar kind of architecture, where it's operating in a virtual environment and returns a response without you having to keep your eye on everything?

EISO:

I haven't used Manus, but I have seen some of the demo videos, and I think it's a great showcase of where model capabilities are when combined with a great environment, right, to the earlier point: model, loop, environment. And that environment is where you provide the tools, where you allow the model to run in that loop and execute. If we look at where we are today: we started before models were at agentic-level capabilities. We started with this back-and-forth, multi-turn chat conversation that a developer has in the editor or in a web assistant with poolside. Then we moved towards the model presenting a plan and code changes, so changes or native edits that it was making in the files. Now we are increasingly moving to an agentic world, where the agent itself is effectively a runtime that can be interfaced with from the editor, from a CLI tool, or from an API call. Right now it's still very much about having it run in your local environment. Our work is definitely moving towards remote execution environments, right? How do we get these agents to run in remote execution environments? Because a lot of the code that we develop and work on doesn't run on our laptops; it runs somewhere on a Kubernetes cluster, or somewhere in CI. So I do think capable agents in software development will follow this pattern everywhere across the board. This is not just an us thing.

EISO:

I think this will happen. Our job at poolside is really to do two things well. One is to focus on building the most capable models for software development, and to keep scaling up to push that frontier of what's possible. The second is: how do we bring all of that product experience around the model? For two things: one, to allow others in the future to build on top of us; and two, to have our own view on user experience, because the user experience is constantly evolving as models get more capable, and to bring that to end users. And here we made a decision two years ago, one that we are still very glad we made, to really focus on the enterprise. We want to bring poolside over time to every developer in the world. But we looked at one of the most complex environments with the largest number of developers working in it, and it's enterprises, right? A US bank in New York can have 50,000 software developers. And in those environments, if you're looking at a view where in 12 months people are running thousands or tens of thousands, maybe hundreds of thousands of agents: how do you manage that chaos? How do you orchestrate those agents? How do you audit-log them? How do you monitor and observe them? How do you allow people in your organization to elastically scale them up and down? Because our view is that we're moving to a world where agents are effectively an elastic AI workforce.

CRAIG:

And how does that relate to poolside's product? Are you moving towards AI orchestration, or...?

EISO:

We've been building the whole stack, right? That's what we've been doing for two years. So think about it: the model; everything that serves the model; the API middle layer that exposes it; the user interfaces, where today we focus on VS Code, IntelliJ, Visual Studio, and then the CLI and the web; and then the API layer, which is critical to how people interact. And what's now being added to that is an admin and orchestration side. And I think this is frankly where you'll see everyone in the industry already moving. It's clear now that the gap between agents and models, they're effectively the same thing, the agent just being the wrapper around the model, is closing with each new level of capability, so we can entrust them with longer-duration tasks. Which means we want to be able to manage them more centrally, with more oversight, and with all of the relevant enterprise features that are needed for that.

CRAIG:

Yeah, but at the bottom layer, you're still talking about code generation, is that right?

EISO:

We talk about full software development, because software development goes beyond code generation, right? It's about helping you build a product, you know, a PRD document. It's about helping you monitor the logs in a system. Our view is that agents are going to be running everywhere. They're going to be working synchronously with developers, asynchronously on tasks that are sent off, and also autonomously. They'll be running in your CI, they'll be running in your containers, they'll be observing your logs. I think we are still underestimating the surface area in the world where agents are going to be running.

CRAIG:

Yeah. But again, on code generation: is the intent then that developers in a large enterprise would be using poolside to write code, as a partner or assistant, in the way that people are using GitHub Copilot?

EISO:

Absolutely, absolutely. And they already are. We're already deployed in enterprises today where developers are working side by side with poolside, moving now from a multi-turn chat experience to an increasingly more agentic experience.

CRAIG:

Right. And how do you evaluate your effectiveness? I mean, this is a big topic these days, evaluation. Everyone trains to a benchmark and then comes out with a product and says, look, ours beats theirs. But that doesn't necessarily transfer to the user experience. How are you evaluating poolside? How do you sell poolside to replace Copilot?

EISO:

It's a great question. We could do a whole episode on evals. The way that we break them down, and I think most in our industry do, is: benchmarks that you hill-climb, which you think are really representative of the broad range of capabilities you're trying to improve in your model; evaluations and benchmarks that you run but don't want to hill-climb, where you want to understand if something is going wrong throughout your model training, these shouldn't drop all of a sudden; and then you have, you know, your famous vibe checks and red-teaming. This is where it's really users, internal, external, paid, free, who are spending time with the versions of your model behind your product, and they're giving their feedback on what it's better at and what it's worse at. A lot of things can't be caught initially in evals. Effectively, what you learn from your vibe checks, you go build evaluation sets for, to capture in your next generation. If you learn your model is stubborn, okay, how do you tease that out? So evals are a living set of things. It's important that they range very broadly, everything from general reasoning to specific coding capabilities to, you know, personality and consistency of the model.

EISO:

And so you're constantly improving on that. You have a dedicated team for it, but you also have an internal evaluation framework that everyone is constantly contributing to. They're never perfect in our space. And if you only hill-climb a benchmark and make every decision around that, you can end up with a model that seems great on paper but isn't great to use. We've seen some examples in our industry of that. So it's an art that we're all trying to increasingly make more of a science. In terms of how we sell poolside: a couple of things. We focus on enterprises, places with more than 5,000 developers. We do this across finance, public sector and defense, and core strategic industries: big industrials, big tech. We work very closely with Amazon Web Services; we're one of very few companies that have a first-party partnership with them. It means that if an enterprise is looking to purchase poolside, it fully retires their committed spend with Amazon. And that's something that I believe only four companies historically have had with them, could be wrong, four or five or so. That means we have a very strong joint go-to-market motion. But outside of AWS, we are also willing to bring poolside on-prem.

EISO:

This is something we do quite a bit in defense and in the public sector, where we're increasingly bringing it onto a server in a data center, or, as we do in defense and government, a workstation that goes into a lab or a SCIF. But the notion, at the end of the day, is that because we are able to bring our models behind the firewall of customers, we also see an increasing future where the model weights change for the customer, where we can start bringing reinforcement learning to the customer side. There's still a lot of work to do there, we're early in that, but when you are going to be running tens of thousands of agents in an enterprise environment, you want the shared knowledge of all the trajectories that they take, every thought, every decision, every interaction with someone in your company, every action, code change, etc., to be able, over time, to change the weights of the model, so that the model becomes your company's model. And that's something we feel very strongly about. We think that over the next couple of years we're not going to just see a central model with a big context window. We'll see models that increasingly adapt to the environment they're working in.

CRAIG:

Yeah. And when you say model or product, these are proprietary. You're not open source, are you?

EISO:

We're not open source, no. We think it's really valuable that the world is building open source models, but with the capital investments that we're making right now, we just haven't seen that as a strategy for us.

CRAIG:

Yeah, but when you install something on-prem, how do you protect your IP? I mean, this is...

EISO:

A really good question. We've never been too worried about the weights. At current model capabilities, you are constantly working on the next generation of your model. So if I take the worst-case scenario and risk-manage it: if somebody were to leak the weights of our model on a torrent website, a lot of things would have to happen for that to occur inside an enterprise, which is quite regulated and pretty solid in terms of how people operate. Someone would have to take it out of the building and upload it. But even if that happens, three, four or five months later we're on to the next generation of model; a year from now, that model is obsolete. Now, I think there are very fair concerns for the future, when models reach certain levels of capability, where you want to think differently about this, and there are technical solutions you can take towards it. But right now, we've not taken the same level of paranoia about weights as maybe some others in the industry have.

CRAIG:

Yeah. Another question, and I'm coming up on an hour, so I don't want to keep you too long. At the coding level, is there some metric for how often poolside returns a function that's executable and how often it fails? And, I know you were saying that you're working in a larger context, on features and the whole software stack, but ultimately it comes down to writing executable functions.

EISO:

Oh, across the stack. In our own training, everything like this gets measured. And on the customer side, we expose every metric we can possibly gather, including through APIs so customers can import it into BI tools. And this is not just in places where code can be executed; where it's not an agent, where code is just being suggested, we also save everything: do people accept the changes, do they reject them? I think it's really important for our customers to have really granular visibility into how AI is helping their teams, where it's failing and where it's succeeding, and to be able to break that down across any dimension they like, whether that's a programming language or a team or a specific project. We're still in the realm where AI makes tons of mistakes, and you want to be able to very clearly observe where that sits, because we're definitely living in a world, and will for some time, where we're still the primary actors, right? We're instructing the AI. But yes, it's critical. And this is frankly one of the things I've really enjoyed seeing our customers be very happy about, because by providing that level of transparency, they can make much more informed decisions about where they need to invest in adoption of AI, because it's already making part of the team massively productive, but not yet everyone, and where, hey, this doesn't work well enough yet, let's wait for the next generation of models.

CRAIG:

Is there anything I haven't asked about that you think listeners should know?

EISO:

I think you have touched upon this, but I think it's important to take a step back from all the noise in our market, because there's a lot: there's a new tool every month, every three months, and there's a reason for the popularity. But if you really take a step back and look at the next five years, and if you hold the same assumption that we hold, that model capabilities will converge with human-level capabilities in software development, and frankly in the vast majority of knowledge work that we do behind a laptop, then a lot of things are going to fundamentally change in terms of how we work and how we structure ourselves. Whatever the latest noise in the market is, bring it back to: is this still relevant in that world? You'll find that certain things are highly relevant and are going to become increasingly so, and other things you probably won't hear about in two or three months or two or three years. And I think that's the best advice I can hopefully give: measure against, will this be relevant when AI reaches that level of capability? It will be a good filter for sifting through everything that's happening right now.

CRAIG:

Yeah. Well, that's a good point to end on. Fascinating. And if people want to give you a spin, they go to poolside.ai, is it, or...?

EISO:

Yeah. So right now we're only available in enterprises that we work with; we're not generally available. We want to make sure we get there. But if you are working in a large enterprise, do definitely reach out to our team from there, and we're always happy to find a way to engage. It's definitely our goal to make sure poolside becomes available to everybody outside of the enterprise as well.
