AI Pioneer Fei-Fei Li Has a Vision for Computer Vision

Stanford University professor Fei-Fei Li has already earned her place in the history of AI. She played a major role in the deep learning revolution by laboring for years to create the ImageNet dataset and competition, which challenged AI systems to recognize objects and animals across 1,000 categories. In 2012, a neural network called AlexNet sent shockwaves through the AI research community when it outperformed all other types of models and won the ImageNet contest. From there, neural networks took off, powered by the vast amounts of free training data now available on the Internet and by GPUs that deliver unprecedented computing power.
In the 13 years since ImageNet, computer vision researchers have mastered object recognition and moved on to image and video generation. Li founded Stanford’s Institute for Human-Centered AI (HAI) and continues to push the boundaries of computer vision. Just this year she launched a startup, World Labs, which generates 3D scenes that users can explore. World Labs is dedicated to giving AI “spatial intelligence,” or the ability to generate, reason within, and interact with 3D worlds. Li gave a keynote yesterday at NeurIPS, the big AI conference, about her vision for machine vision, and she gave IEEE Spectrum an exclusive interview before her talk.
Why did you title your talk “Climbing the Ladder of Visual Intelligence”?
Fei-Fei Li: I think it is intuitive that intelligence has different levels of complexity and sophistication. In the talk, I want to convey the sense that over the past decades, especially the past 10-plus years of the deep learning revolution, the things we have learned to do with visual intelligence are just breathtaking. We are becoming more and more capable with the technology. And I was also inspired by Judea Pearl’s “ladder of causality” (in his 2020 book The Book of Why).
The talk is also subtitled, “From Seeing to Doing.” This is something people don’t appreciate enough: that seeing is closely coupled with interaction and doing things, for animals as well as AI agents. And this is a departure from language. Language is primarily a communication tool used to convey ideas. In my mind, these are very complementary, but equally profound, modalities of intelligence.
Do you mean that we instinctively respond to certain scenes?
Li: I’m not just talking about instinct. If you look at the evolution of perception and the evolution of animal intelligence, they are deeply, deeply intertwined. Every time we are able to get more information from the environment, the force of evolution pushes capability and intelligence forward. If you cannot sense the environment, your relationship with the world is very passive; whether you eat or get eaten is a passive act. But as soon as you are able to pick up cues from the environment through perception, the evolutionary pressure really increases, and that drives intelligence forward.
Do you think that’s how we create deeper and deeper machine intelligence? By allowing machines to better understand the environment?
Li: I don’t know if “deep” is the adjective to use. I think we are creating a lot of capabilities. I think it’s becoming more complex, more capable. I think it is absolutely true that solving the problem of spatial intelligence is a fundamental and critical step towards holistic intelligence.
I saw the demos at World Labs. Why do you want to research spatial intelligence and build these 3D worlds?
Li: I think spatial intelligence is where visual intelligence is going. If we are serious about cracking the problem of vision and also connecting it to doing, there is a very simple, plain-as-day truth: The world is 3D. We do not live in a flat world. Our physical agents, whether they are robots or devices, will live in a 3D world. Even the virtual world is becoming more and more 3D. If you talk to artists, game developers, designers, architects, or doctors, even when they work in a virtual world, much of it is 3D. If you take a moment and recognize this simple but profound truth, there is no question that cracking the problem of 3D intelligence is important.
I’m curious how the scenes from World Labs maintain object permanence and obey the laws of physics. That feels like an exciting step forward, because video generation tools like Sora still get confused about such things.
Li: Once you respect the 3D-ness of the world, a lot of this comes naturally. For example, in one of the videos we posted on social media, basketballs are dropped into a scene. Because it’s 3D, it allows you to have that kind of capability. If the scene is just 2D-generated pixels, the basketball has nowhere to go.
Or, as in Sora, it might go somewhere but then disappear. What are the biggest technical challenges you face as you try to push the technology forward?
Li: No one has solved this problem, right? It’s very, very hard. You can see (in a World Labs demo video) that we took a Van Gogh painting and generated the entire scene around it in a consistent style: the artistic style, the lighting, even what kind of buildings would be in that neighborhood. If you turn around and it becomes skyscrapers, it would be completely unconvincing, right? And it has to be 3D. You have to be able to navigate it. So it’s not just pixels.
Can you talk about the data you used to train it?
Li: A lot.
Do you have technical challenges regarding the compute burden?
Li: It is a lot of compute. It’s the kind of compute that the public sector cannot afford. This is part of the reason I feel so excited to take this sabbatical, to do this the private sector way. And it’s also part of the reason I have been advocating for public sector compute access, because my own experience underscores the importance of innovation with an adequate amount of resourcing.
It would be good to empower the public sector, since it is usually more motivated by gaining knowledge for its own sake and knowledge for the benefit of humanity.
Li: Knowledge discovery needs to be supported by resources, right? In Galileo’s time, it was the best telescope that let astronomers observe new celestial bodies. It was Hooke who realized that magnifying glasses could become microscopes and discovered cells. Every time there is a new technological tool, it helps the search for knowledge. And now, in the age of AI, the technological tools include compute and data. We need to recognize that for the public sector.
What would you like to happen at a federal level to provide resources?
Li: This has been the work of Stanford HAI for the past five years. We have been working with Congress, the Senate, the White House, industry, and other universities to create the NAIRR, the National AI Research Resource.
Assuming we can get AI systems to actually understand the 3D world, what does it give us?
Li: It will unlock a lot of creativity and productivity for people. I would love to design my house in a much more efficient way. I know that many medical applications involve understanding a very particular 3D world, which is the human body. We often talk about a future where humans create robots to help us, but robots navigate a 3D world, and they need spatial intelligence as part of their brain. We are also talking about virtual worlds that will allow people to visit places, learn concepts, or be entertained. And those use 3D technology, especially the hybrid ones, which we call AR (augmented reality). I would love to walk through a national park with a pair of glasses that give me information about the trees, the trail, the clouds. I would also love to learn different skills with the help of spatial intelligence.
What kind of skills?
Li: My lame example is if I get a flat tire on the highway, what do I do? Right now, I open a “how to change a tire” video. But if I could put on glasses that see what’s going on with my car and then guide me through the process, that would be great. But that’s a lame example. You can think about cooking, you can think about carving: fun things.
How far do you think we can take it in our lifetime?
Li: Oh, I think it will happen in our lifetime because the pace of technological development is so fast. You see what the last 10 years have brought. This is definitely a sign of what is to come.
2024-12-12 20:46:25