
Leon Song, VP of Research at Together AI, joins us to explore how open-source models and cutting-edge infrastructure are reshaping the AI landscape. He shares how Together AI is building one of the fastest, most cost-efficient AI clouds—leveraging techniques like speculative decoding and FlashAttention to help enterprises fine-tune, deploy, and scale open-source models at the level of GPT-4 and beyond.
LEON: Our mission of supporting open-source models has gotten so much attention, and today that mission is no longer something people dismiss with, oh, this may not be that great compared to closed-source models. That really changed when major players started releasing open-source models. The platform is now used widely across different enterprises. Many companies have their own models, or closed-source models, for their internal business, but our customers use our platform to do other things too. The key here is the flexibility that we provide.
CRAIG: Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. The AGNTCY is an open-source collective building the Internet of Agents: a collaborative layer where agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for inter-agent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The AGNTCY is dropping code, specs, and services. No strings attached. Build with other engineers who care about high-quality multi-agent software. Visit AGNTCY.org and add your support. That's A-G-N-T-C-Y.
LEON: My name is Leon Song. I'm the VP of Research at Together AI. I lead the R&D arm of the company, which pushes out a lot of the wonderful, innovative research you see powering our platform today: kernel work like FlashAttention, data work like RedPajama version one and version two, and a lot of open-source research on speculative decoding, work that has been widely used by industry, like Medusa and others. We've also done a lot of work providing modeling and data optimizations to the open-source community. We recently released a small 14B coder model that performs at the o3-mini level; we released it just last week with Berkeley. So there's a lot of research going on, but it's a research-driven, product-centric org that I'm leading. Prior to Together, I was at Microsoft. I was very fortunate to experience the whole ChatGPT era at Microsoft, working on training-system projects for OpenAI workloads in DeepSpeed. I then created one of the first AI for Science initiatives, centered around Microsoft and its partners, using AI to address scientific challenges at large scale. Later on at Microsoft, I worked on different machine-learning system optimization techniques for inference, so that we could serve the GPT series of models on Microsoft platforms.
CRAIG: Yeah, okay. Can we go back and talk a little, to begin, about AI for science? Because there's been a lot happening; you still follow it, I would imagine. Anthropic has done some amazing stuff. Can you talk about what you were focused on at Microsoft, where you see the field having moved since you left, and what you see in its future?
LEON: Yeah, I'll talk a little bit about that. I think this is something the field has been quite excited about for a long time, since before we got into the generative AI era in early 2023, when ChatGPT came out. A lot of these scientific models used to be supported by less general approaches: a lot of diffusion models, or attempts to put different models together to address complex biology problems, or to simulate weather and climate change, things like that. When I was there, we had a great initiative. We started with, I think, two major projects, with internal folks from MSR and also external folks. One project was to provide system support for scaling protein structure prediction training. You've probably heard of AlphaFold from Google. Back then, in 2023, there was an open-source community effort led by Columbia called OpenFold, which used PyTorch Lightning with a DeepSpeed backend to train these types of models, trying to give the community a way to study protein structure prediction with open-source models. The problem with that model was that it required very large activations during training, so training couldn't really scale: they couldn't do a really long context window, and they couldn't do a larger parameter size.
LEON: It was bounded by the system itself. By system, I mean the hardware you're running on: the GPUs, the CPUs, the cloud. So we stepped in and provided a lot of important kernels so the project could scale and train, I remember, three-times-longer sequences. We reduced memory consumption during training by something like 75 percent, a significant amount. That project took off; now you can use it as an open-source project. People are using it in academia, and I think there are startups using it as well. They've since formed, I think, a consortium focused on that effort. There were other, more internal Microsoft efforts using very creative ways to do protein structure prediction differently from AlphaFold and the OpenFold approach. In the publications you can find, they use a probability-prediction approach that looks at the equilibrium of more complex protein structures, rather than the more straightforward way the other models use. Those were very successful projects. The other area we looked at was weather and climate models, which had a very significant impact at the time, and I'm pretty sure still does, on the Bing service, because a lot of users access that through Bing.
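As an aside for readers: the setup Leon describes, PyTorch Lightning with a DeepSpeed backend, looks roughly like the sketch below. This is a minimal, hypothetical example, not OpenFold's actual training code; the toy model and data are placeholders, while the trainer flags are standard Lightning options. ZeRO sharding (plus activation checkpointing inside the model) is what attacks the activation- and state-memory limits Leon mentions.

```python
import torch
import torch.nn as nn
import lightning.pytorch as pl

class TinyRegressor(pl.LightningModule):
    """Stand-in for a large structure-prediction model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

if __name__ == "__main__":
    ds = torch.utils.data.TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    loader = torch.utils.data.DataLoader(ds, batch_size=64)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        precision="bf16-mixed",
        strategy="deepspeed_stage_2",  # ZeRO stage 2: shard optimizer state and gradients across GPUs
        max_epochs=1,
    )
    trainer.fit(TinyRegressor(), loader)
```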
CRAIG: Yeah. On OpenFold, did you follow the same architecture as AlphaFold?
LEON: Not strictly, no. I think there were a lot of changes, even in 2023 when we were working on it. I haven't followed it very closely since my time there, but my understanding is they're trying to pull the community, the scientists, into the project together and ask them to contribute different model components, so you can do a mix of models rather than one transformer architecture dominating. You can do different things in that community. I think that's what they're trying to do.
CRAIG: And one thing that... I've spoken to DeepMind about AlphaFold, but at the time I spoke to them it was still early in their research. These models predict protein structures, but the predictions aren't necessarily 100 percent accurate.
LEON: Right. There's always a verification and validation process that needs to happen after training. You can fine-tune the model to be better. These days, with DeepSeek models being very creative and using RL-type training strategies, a model learns from its mistakes and does better with reward signals. If you think about protein structure prediction, some people believe it's very restrictive and structured, very limited in scope: the amino acids can be arranged in certain ways, and other arrangements are just not viable. But if you think of it as a restrictive, scoped problem, then you can make an analogy to coding, right? Or math. And today's models do those really, really well using RL training, which can achieve a much higher rate of successful prediction. Back then it was not RL-based, not reasoning-model-based. It was quite simple, what they studied and adapted from AlphaFold. Yeah.
CRAIG: Yeah. And the remarkable thing is you can use these models to generate novel proteins, right? And when you do that, and maybe this goes beyond training a model, how do you then synthesize a novel protein? And do you test how that novel protein behaves, to validate the prediction?
LEON: Yeah, that question is a little outside my domain; I'm a systems researcher, a computational scientist. But what I can say is that people do use these predicted structures and then do post-modeling work, validation and verification. The structures have to be validated in labs, to see whether they're legitimate or just random compounds. But I do sometimes see news and read papers saying this wonderful new structure could really mean something, and it's not something we would otherwise have seen so soon. That sounds amazing.
CRAIG: Okay. And then jumping to what you're doing at Together. Together was started when?
LEON: I believe officially it was late 2022. Yeah.
CRAIG: And it was started by Percy Liang?
LEON: There are several; I mean, Chris Ré, Percy, Ce Zhang, and our CEO. I think there are four co-founders, and Percy is one of them. Yeah.
CRAIG: Okay. And this latest model that you've come out with, can you talk about that?
LEON: From us?
CRAIG: Yeah.
LEON: Yeah, yeah. So I want to say something first: we're not a modeling company.
CRAIG: Yeah, I know.
LEON: So we work with other modeling companies to come up with innovative models, sometimes with alternative architectures, not transformers, that can then be served on actual hardware with some control over efficiency and economics. The recent one is with a lab in Berkeley. We worked with them on training that 14B coder model, and the coder model performs really well. It's based on a mixture-of-agents type of strategy, and they trained the model with their own datasets and their own training strategies. I think the coder performs really well; it's at the o3-mini level.
CRAIG: And what was Together's contribution to that? And the Berkeley lab, which lab is it?
LEON: I'm not very familiar with this particular project because it's external. We have one or two people working with them, primarily on the data side.
CRAIG: I see. Okay. Well, tell me about the speculative decoding work. Is that something you're more directly involved in?
LEON: Yeah. I thought of this as more about covering Together's innovations on our platform, but I understand you'd also like to know my involvement. That's okay; we can go through those topics as well.
CRAIG: Yeah, and we'll cover Together in a minute.
LEON: Sure, sure, no problem. Speculative decoding is something Together is really good at; a lot of companies in the inference-provider realm offer speculative decoding. Speculative decoding is a lossless way of doing inference: you don't lose accuracy, but you can boost inference performance. The basic concept is that you have the bigger, original model acting as a verifier that checks a very tiny model called the speculator. Where the outputs of the verifier and the speculator agree, the inference time is dramatically reduced; otherwise the verifier's output is used. So you're trading compute between the two models for efficiency. But the difficulty of building speculators is that you want a very good acceptance rate. If your acceptance rate is very low, your overall performance will still be dominated by the larger model, and you'll never reach the peak performance you could gain. Together has been working on this since before my time; I joined in early 2023, and before that Together had already started working on speculative decoding and produced several open-source works for the community, like one called Medusa. We also have our own way of training speculators: an automatic pipeline we built internally, where customers can provide their data and we automatically train the speculator and perform different kinds of model optimizations.
CRAIG: Okay, so I'm going to slow it down a little bit. What is a speculator?
LEON: It's a smaller model.
CRAIG: A smaller model, distilled?
LEON: It can be, it can be. I'll give you a couple of examples to clarify. Let's say you have a 670B DeepSeek R1 model. It's a large MoE reasoning model, right?
CRAIG: A mixture-of-experts reasoning model.
LEON: When you serve this model, you run it on multiple nodes, and it's going to be really slow, depending on your decoding steps, how many tokens you're generating. The decoding process is going to be really slow.
CRAIG: Yeah. And just again, for listeners: the decoding process, what are you decoding?
LEON: An autoregressive transformer generates tokens one at a time. You have the prefill phase, where the model takes in the prompt, and it generates a token; you feed that token back into the decoder, and whatever logically comes next is generated and attached, and you keep generating, each new token conditioned on all the previous tokens. That's a very simple way of explaining it. What makes transformers so powerful is this autoregressive decoding process, which can generate the context by itself. That's the wonderfulness of transformers, and all the powerful models we're using today are based on that architecture. All the companies use it: OpenAI and the other modeling companies. Yeah.
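To make the prefill/decode distinction concrete, here is a bare-bones greedy decoding loop using the Hugging Face transformers library, with GPT-2 standing in for any autoregressive transformer. The first forward pass over the full prompt is the prefill; each later pass feeds only the newest token and reuses the KV cache. Production servers do the same thing with batching and far more careful cache management.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):  # 20 decode steps
        # Prefill on the first pass (whole prompt), then one token at a time.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```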
CRAIG: Yeah, right. And so you have a smaller model; what does the speculator do?
LEON: That's what I was getting to. Let's say you have a large model. You don't want to perform every decoding step on the 670B model, right? So I can train a 1B model, a model with 1 billion parameters, and pair it with the verifier, the larger model, to perform speculative decoding. The smaller model's inference is going to be really fast, because you've literally reduced the parameter count by orders of magnitude. Now you have a 1B model. How do you make this 1B model get very close to the larger model in prediction capability, in acceptance rate? You can distill from the larger model. Or, since model providers often release several sizes of a model, you can distill from a medium or smaller size down to the 1B, 2B, or 3B size you're looking for to support your model. That's what the speculator model means: when you run your inference, your compute is spent on the smaller model rather than the verifier.
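Here is a hedged sketch of the mechanism Leon just walked through: the small speculator drafts k tokens cheaply, the big verifier scores the whole draft in a single forward pass, and the longest agreeing prefix is kept plus one verifier-corrected token. This is the greedy variant, for readability; production systems use a rejection-sampling acceptance rule that makes the method provably lossless under sampling. `verifier` and `speculator` are assumed to be Hugging Face-style causal LMs that share a tokenizer.

```python
import torch

@torch.no_grad()
def speculative_step(verifier, speculator, ids, k=4):
    """One speculative-decoding step; returns ids extended by 1..k+1 tokens."""
    draft = ids
    for _ in range(k):  # cheap: k passes of the small model
        nxt = speculator(draft).logits[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)

    v_logits = verifier(draft).logits             # expensive model runs ONCE
    L = ids.shape[1]
    v_pred = v_logits[:, L - 1:-1].argmax(-1)     # verifier's choice at each draft slot
    drafted = draft[:, L:]
    agree = (v_pred == drafted).long().cumprod(-1)  # 1s up to first disagreement
    n_ok = int(agree.sum())

    # Accepted prefix, plus the verifier's token at the first mismatch
    # (or a free bonus token if the whole draft was accepted).
    fixed = v_logits[:, L - 1 + n_ok].argmax(-1, keepdim=True)
    return torch.cat([ids, drafted[:, :n_ok], fixed], dim=-1)
```

With per-token acceptance rate α and draft length k, each expensive verifier pass yields roughly (1 − α^(k+1)) / (1 − α) tokens instead of one, which is why the acceptance rate Leon keeps stressing is the whole game.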
CRAIG: I see, okay. And these are products you're making available to the open-source community? I mean, how does Together AI work?
LEON: Together AI is an AI-accelerated cloud company for enterprises. That's the main thread of the business. We provide the Together platform, with different tiers of service. The tier that's open to the public, to whoever has a credit card, whether you're a startup or an individual, is our serverless endpoint. There you can use all the models we offer: open-source models, now up to something like 200 models served on the platform, from Meta, from DeepSeek, from Alibaba, from all the different open-source providers. You can use them through the UI, where you just type, or through the APIs and other tooling in your own work, and you can build applications on top of the token service. That's just one tier of service. Other tiers include dedicated instances, where, say, small and medium-sized companies want to reserve GPU resources for different things. Our goal is to provide the entire AI lifecycle, from pre-training to fine-tuning to the point where you serve inference for your applications on our platform.
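Concretely, the serverless tier Leon describes is reachable with a few lines of Python via Together's SDK. A minimal sketch, assuming `pip install together`, a TOGETHER_API_KEY in the environment, and a model name that may have rotated out of the catalog by the time you read this:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative serverless model name
    messages=[{"role": "user",
               "content": "Explain speculative decoding in two sentences."}],
)
print(resp.choices[0].message.content)
```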
LEON: So a lot of companies use dedicated instances. They buy these on-demand dedicated instances when they need them and use our software for whatever business purpose they have. The key distinguishing feature of our business, supporting open-source models versus the closed-source companies, is that customers own their model. They own their data. Whatever fantastic technology we provide to improve your model, build your speculator, or tune your serving and training strategies, you own that piece: the model, the data sovereignty, and the security of the deployment. We've gone through a lot of compliance work so that we can serve customers in North America and some other regions.
CRAIG: And the models, the instance of the model that you own, is on your cloud? Where does it exist?
LEON: Yes, these are open-source models, on our cloud.
CRAIG: On your cloud. But...
LEON: Other people can use them too. They can directly download them from Hugging Face.
CRAIG: Right. But I mean, an enterprise customer can have a private instance of an open-source model on your cloud.
LEON: Absolutely.
CRAIG: Fine-tune it for themselves. Yeah, yeah. And what if... Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. The AGNTCY is an open-source collective building the Internet of Agents: a collaborative layer where agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for inter-agent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The AGNTCY is dropping code, specs, and services. No strings attached. Build with other engineers who care about high-quality multi-agent software. Visit AGNTCY.org and add your support. That's A-G-N-T-C-Y.
LEON: ...do different things with us. For instance, you say: okay, I want this model to be better, but I don't know how to do it, at least not in a frontier way. We'll have a team helping you get the strategy right, from the pre-training data all the way to the post-training data pipeline, to make your model perform better. And then, coming down to the inference side, we can also help you customize the inference solution so you reduce your cost and bring performance and accuracy into the range the customer demands. That's what we do.
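For flavor, a sketch of what starting a fine-tuning job on the platform can look like with the same `together` SDK. The method names and parameters here are recalled from Together's docs and should be treated as assumptions to verify; the file and base-model names are placeholders.

```python
from together import Together

client = Together()
# Upload a JSONL training file, then launch a fine-tuning job against a base model.
f = client.files.upload(file="my_chat_data.jsonl")      # assumed signature
job = client.fine_tuning.create(
    training_file=f.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",      # placeholder base model
    n_epochs=3,
)
print(job.id, job.status)
```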
CRAIG: Yeah, yeah. And open-source models have typically lagged the proprietary models. Can you bring open-source models up to compete with, I think you mentioned, o3-mini, or those?
LEON: Yeah, absolutely. Let me talk about the field and why our mission of supporting open-source models gets so much attention. Today this mission is no longer something where people feel, oh, this may not be that great compared to closed-source models. That really changed when major players started releasing open-source models, which is super aligned with our original vision in creating the company. Models today like the DeepSeek V3 model, a non-reasoning model, the DeepSeek R1 model, and the Llama 4 models just released last week: state-of-the-art multimodal models in different sizes, with super long context windows. All these models are very, very competitive. When R1 was released by DeepSeek, it was the biggest news for weeks, because the model was so good, super competitive with closed-source models like OpenAI's, and in many aspects better. So the open-source community can finally use these models to create their own enterprise and business opportunities. That's really hard for the closed-source modeling companies to counter, and I think it's making a lot of them very nervous. DeepSeek, for example, not only open-weights their models, they also tell you exactly how they trained them in creative ways through reinforcement learning, and how they created a new attention mechanism different from what everyone else was using: when everyone else was using multi-head attention, they created something called multi-head latent attention, which compresses the KV cache really, really well, down to a very small size. All of these creations, the open-sourced communication libraries and all of that, really strengthened the community's conviction that one day the best models will come from the open-source community. And these days customers and users really enjoy using these open-source models, because they're either better than or at the same level as the closed-source models.
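A toy illustration of the KV-cache compression idea behind the multi-head latent attention Leon mentions (a simplification, not DeepSeek's actual MLA, which among other things handles rotary position embeddings separately): instead of caching full per-head keys and values, you cache one small latent vector per token and expand it per head only when attention is computed.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

down = nn.Linear(d_model, d_latent, bias=False)           # output of this is what gets cached
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head keys at attention time
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head values

x = torch.randn(1, 512, d_model)                # 512 cached tokens
latent = down(x)                                # [1, 512, 128] <- all the cache stores per token
k = up_k(latent).view(1, 512, n_heads, d_head)  # rebuilt on the fly
v = up_v(latent).view(1, 512, n_heads, d_head)

# Standard KV cache: 2 * n_heads * d_head = 2048 values per token.
# Latent cache: d_latent = 128 values per token, a 16x reduction here.
```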
CRAIG: And what do you think is going to happen to the closed-source models? These companies spend an enormous amount of money on research, and their models have to be used through an API, which loses them a huge chunk of the enterprise market: anybody who needs to keep their data on premises.
LEON: Yeah.
CRAIG: And now, if these cheaper open-source models can do nearly the same thing, it just intuitively seems that the proprietary system will eventually disappear, because you can't keep investing tremendous amounts of money, particularly if it goes primarily into scaling compute.
LEON: Well, first, let's not deny the amazing work of the transformer era that OpenAI and all the other pioneering companies have delivered since 2023. Those were really the biggest innovations of this decade or more. Then in January, when the DeepSeek models were released, it was really the moment the entire community realized: okay, I can do this at the same level. It's a good question. My prediction is that these companies will have to figure out their model-release strategies. I think more and more of them, including OpenAI, are likely to release open-source models. What will likely happen is they'll release an open-source model with some gap to their closed-source model. The open-source model they release will be more for smaller devices, something more commodity, while they maintain the cloud-level, large-system-level reasoning models and push those as far as they can, which is very competitive and very hard these days. But I think their business model has to transform. My feeling is they'll have to compete in the open-source market too. To maintain that top position, they'll have to open-source models and beat everyone else, so the community understands these companies can still build the best models. But that game just gets harder and harder, and differentiating products becomes very difficult. Yeah.
CRAIG: Yeah. Together, you're, I guess, an MLOps or a gen-AI-ops...
LEON: An ML systems company.
CRAIG: Yeah, an ML systems company. Do any of the proprietary model makers use Together AI solutions, or are you really focused on the open-source community?
LEON: I can't go too deep into that; it's related to the business side. But my understanding is the platform is currently used widely across different enterprises. Many companies may have their own models, closed-source models or whatever, for their internal business, but we have customers who use our platform to do other things. The key here is the flexibility we provide, the innovations we provide, as well as the data sovereignty, security, and model ownership we provide on our platform. It's just a very different business model. Excluding one or two of the top model companies, everyone else, to my understanding, can use our platform and do their work on it.
CRAIG: Yeah. I mean, it seems to me the space you're in is increasingly crowded. I talk to a lot of companies that accelerate training, that make training cheaper, that can elevate open source to compete with the best reasoning models. How do you operate in that market?
LEON: It is competitive. But we have one of the best reputations in this realm for providing the open-source community with the key technologies. They couldn't operate these models today without FlashAttention and the other works we contributed. We could have kept them closed source, right? We didn't have to release them. But because of the way we did it, we have the community's trust that we always produce the best solutions, and the numbers show it. You can go to different analytics sites, like Artificial Analysis and others, and look at our numbers: superior performance, superior cost, economic benefits to enterprises. So I see this game as: you really have to have a talented research team behind the operation. You can sell GPU hardware, right? I can just acquire GPU hardware and resell it, undercutting everyone else on price; some companies do that. But look at the numbers supporting these models.
LEON: And the flexibility: we can customize unique enterprise solutions for different companies, because different companies have their own serving conditions. Some companies want very extreme serving conditions: they want large batch sizes, they want a large-throughput regime; some care about latency but don't really care about tokens per second. For all these different models, fine-tuned or modified, how do they really perform on today's hardware? That's the question we answer, and we connect the pieces. So it's a research-leading team behind this whole operation of an accelerated AI cloud. It's not about just taking an open-source framework and getting it to run on your platform; that's not a sufficient solution for our customers, at least the ones we have. We've seen the market. Yes, there are more players, but we're confident; we've been a leader in this community for a while.
CRAIG: Yeah, yeah. And if somebody's an enterprise working with an open-source model, how do they decide whom to use, other than your prominence in the market? And do you compete with people like CRM AI, or Aleph Alpha? Do you know these guys?
LEON: Hmm. I haven't heard of them.
CRAIG: Or RunPod, or...
LEON: RunPod I know. Yeah.
CRAIG: Yeah. Are they a competitor?
LEON: I think there is some overlapping business there, some portion of it, but we're running very different business models in the market. Now, I can answer your previous question about how they choose. My experience talking to customers, especially enterprise customers, is that their models are quite unique, even when they're based on open-source models. When they bring us their models, the process usually plays out where we help them further enhance the model anyway. Most of the time they want these very extreme serving regimes, as I talked about. Without a really sophisticated serving engine, and without a team behind it that can connect modeling, inference, training, and fine-tuning together across the entire AI lifecycle, you'll have a very hard time producing those solutions for these enterprise customers. That's what we've been really proud of: our team can do that. So I think that's what enterprise customers want, not just taking an open-weights model and finding somewhere to serve it. And of course, our pricing is competitive, very, very competitive. We're trying to give our customers not only ownership but also the best economics. Our team works very hard every day to make that happen.
CRAIG: Yeah, yeah. And then, on the cloud: you operate a cloud, is that right? And is it primarily a GPU cloud? I'm very curious about these new accelerators built for inference.
LEON: Sure. I have more knowledge in that realm...
CRAIG: Yeah, sure.
LEON: ...than in modeling questions. Yes, today Together AI primarily operates a GPU cloud. We recently formed a big partnership with NVIDIA, an important partnership for cloud serving. But I want to talk about the technology, because we create it with the future in mind. Of course, NVIDIA GPUs and other GPUs are fantastic, but as a technology company we're also looking at other possibilities to mix into serving solutions for our customers in the near future. What I mean is, in a serving setup you won't only have GPUs; you'll have other kinds of accelerators, and they perform different roles, depending on the economics and on the customer's demands for latency and throughput. I think the complete solution will be a cloud that utilizes different groups of hardware, like what's being offered on the market today. As a company we're evaluating this very carefully, because we're very experienced with user workloads and with serverless endpoint use cases. So we've carefully evaluated all this hardware, and eventually our hope is to integrate it into the Together universe, the Together serving platforms. We have a couple of ideas for really reducing cost and significantly improving throughput with this hardware as part of the solution.
CRAIG: Yeah. Because, you know, I've had Andrew Feldman from Cerebras on a couple of times, and I've had Rodrigo Liang from SambaNova. I haven't talked to Groq yet, Groq with a q. But their argument is that their inference speed is light-years ahead of NVIDIA's, and consequently cheaper. But it doesn't seem like there's a lot of uptake; everyone's still depending on NVIDIA, and I'm kind of waiting for that...
LEON: Yeah.
CRAIG: ...point at which people shift inference to other hardware.
LEON: My personal opinion this doesn't represent together opinions, purely my personal opinion on this is that A6 are great, but they have to overcome a lot of hurdles of being the primary and only hardware serving in a cloud setup for large enterprise. So you can just look at the decoding number. So A6 when they make A6 the the pure.
LEON: The.
CRAIG: Explain what ASICs are, for the listeners.
LEON: It's customized hardware. You take a generic version of a model, let's say a transformer model, and you understand its computational pattern; you map that pattern into very fine-grained circuits, the gating, the data flow, so the hardware is customized for that generic version of the model. That lets you enhance things: I can give it more shared memory or on-chip memory if I know I need much more, and if I know what the currently most popular models look like, I can customize the hardware pattern for them. Now, an NVIDIA GPU is an AI supercomputer, but it's not just an AI computer; it's general-purpose hardware with something called a tensor core, and the tensor core does the AI matrix multiplications in different precisions. So the ASIC's advantage is that, because you customize purely for a pattern, the decoding we talked about, the autoregressive generation of tokens...
LEON: ...that process can be really fast, because you've customized the hardware and you control the latency, the throughput, and the size of the memory for it. But there are other factors for a service. You have to consider the flexibility of these ASICs. If today we're serving a range of dense models, say a 400-billion-parameter dense model, or mixture-of-experts models in the one-to-two-trillion-parameter range, you can spend a lot of time customizing the hardware for that. But in a month or two, our field moves on to a different realm: say I go beyond two trillion parameters, or I need an entirely new transformer engine or transformer architecture. Then the ASIC vendors have to adjust really quickly, which, you can say, is very possible, that's their expertise, but sometimes it's also less flexible. The GPU's benefit is that it's very flexible, because it's general-purpose computing plus an AI core. Does that make sense?
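A back-of-the-envelope calculation shows why decode speed is where ASICs attack: single-stream autoregressive decoding is memory-bandwidth bound, because every generated token has to stream the model's weights through the chip. The numbers below are illustrative, not vendor specifications.

```python
# Rough single-stream decode ceiling: tokens/s ≈ bandwidth / bytes moved per token.
params = 70e9              # 70B-parameter dense model (illustrative)
bytes_per_param = 2        # fp16/bf16 weights
hbm_bw = 3.35e12           # ~3.35 TB/s, roughly high-end GPU HBM (illustrative)

bytes_per_token = params * bytes_per_param     # weights read once per generated token
print(hbm_bw / bytes_per_token, "tokens/s upper bound")   # ≈ 24 tokens/s
```

Hardware that keeps weights in faster on-chip memory raises that ceiling, which is the pitch of the SRAM- and dataflow-based ASICs; the GPU counterargument is the flexibility Leon just described.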
CRAIG: Yeah, absolutely. And ASIC stands for what?
LEON: That's a good question. I've been using the term forever. I don't want to say it wrong... sorry, I don't want to get this wrong.
CRAIG: Is it A-S, uh...
LEON: ASIC chips... oh, it's application-specific integrated circuit.
CRAIG: Oh, right. Okay.
LEON: I remembered the first part; I didn't remember the rest.
CRAIG: Yeah. And is SambaNova an ASIC?
LEON: Yes.
CRAIG: Yeah.
LEON: All of these companies, Groq, SambaNova, Cerebras, these are ASIC vendors. Groq, yeah, as you mentioned.
CRAIG: Okay. And then, what's next for Together AI? Where are you going with this?
LEON: Yeah. So our goal is to provide this AI-accelerated cloud. We want to expand the hardware capacity and diversity with which we can serve our customers. We want to provide better software solutions, especially focused on inference. If you visit our website now, you can see tons of applications built with open-source models through the Together platform and Together solutions. So we want to encompass all these capabilities of our research and translate that research into our platform, not only for enterprise users, where we put in a lot of effort, but also for the broader community, so they can enjoy and use our applications and serverless endpoints. For instance, we recently released something called Together Chat. I was playing around with it this morning. If you Google Together Chat, you can register an account, and you have a bunch of models you can select, from reasoning models all the way to multimodal models.
LEON: Multilingual models, models for coding tasks, for image generation. We have all these models served within this one chat format, powered by open-source models. And we're also going to inject a lot of things like what we've created called deep research, combining search functions with a mixture of agents to enhance the power of Together Chat. Users will have all of this packaged in one application they can use. Of course, these technologies can also be taken apart and put into our software stack for enterprise customers. And that's just one example of what we're trying to do. We're trying to do the cloud really well, through AI: super fast, highly efficient in both performance and economics, putting a lot of this AI research in modeling and data into the platform, transforming it into products for our customers, and being competitive against the closed-source models through open source. Yeah.
CRAIG: Okay, well, that's fascinating, and I think I'm going to end it there. But this Together Chat: is there a free tier?
LEON: I think you can use it for free. I'm not really sure up to what point, but right now I think it's freely offered, so you can use it and test it. Yes.
CRAIG: Okay. Well, I'm going to go play with it.
LEON: Yeah, go play with it. It's quite good. Yeah.