Transformers, large language models, foundation models, agentic AI … The past two years have seen an explosion of generative AI technology, accompanied by the requisite new lexicon of terminology, acronyms, and concepts to get familiar with. As we tidy up the last loose ends of 2024 and look ahead to ’25, it’s time to get acquainted with another AI term: world models.
What is a world model? There’s no single agreed-upon definition, but I like Runway’s take from last year:
A world model is an AI system that builds an internal representation of an environment, and uses it to simulate future events within that environment.
A good way to explain a world model is to liken it to the way we humans understand the world around us. Human beings create mental models of our surroundings to help us make sense of them and act in them. These models help us anticipate (predict, really) what’s going to happen in a given situation, and decide how to react.
This world models research paper uses the example of a batter in a baseball game leveraging mental models to hit a 100 mph fastball. A ball travelling that fast will reach home plate before visual signals from the pitch make it to the batter’s brain. So the batter instead relies on what they know about how to hit a fastball — their mental model of the situation — to decide when and where to swing. Swinging at the pitch is essentially instinctual for a seasoned hitter who’s created a mental model of facing fastballs based on countless at-bats over a lifetime of playing.
That added dimension of understanding is the big difference between a generative model and a world model. A generative AI model can predict what the next frame should look like in a video of a leaf floating along in the breeze; a world model will understand the physics that make the leaf move (and can generate a video based on that understanding). While world models are capable of more than video generation, there’s evidence, and lots of optimism, that they might be key to advancing the state of the art in AI-generated video. A world model that understands physics, for instance, would know that the fingers on a human hand can’t merge, split, or otherwise morph into other shapes in real time. Ergo, it wouldn’t spit out the kinds of surreal video clips involving hands that current Gen AI video systems are known for.
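If you like thinking in code, here’s a minimal, purely illustrative sketch of that distinction. Everything in it (the leaf, the numbers, the class and function names) is an assumption made up for this post, not anyone’s actual model: the frame-level predictor just extrapolates what it has already seen, while the toy world model keeps an internal state and simulates the dynamics forward.

```python
from dataclasses import dataclass

def next_frame_by_extrapolation(recent_positions: list[float]) -> float:
    """Generative-style guess: repeat the last observed change; no notion of physics."""
    return recent_positions[-1] + (recent_positions[-1] - recent_positions[-2])

@dataclass
class LeafWorldModel:
    """Toy world model: an internal state plus a rule for how that state evolves."""
    position: float = 0.0
    velocity: float = 0.0
    wind: float = 0.3   # assumed constant breeze pushing the leaf
    drag: float = 0.1   # assumed air resistance

    def step(self, dt: float = 1.0) -> float:
        # Simulate the environment's dynamics, then "render" the next observation.
        self.velocity += (self.wind - self.drag * self.velocity) * dt
        self.position += self.velocity * dt
        return self.position

    def rollout(self, steps: int) -> list[float]:
        """Simulate future events within the environment (Runway's definition)."""
        return [self.step() for _ in range(steps)]

observed = [0.0, 0.28, 0.53]                  # leaf positions in the frames seen so far
print(next_frame_by_extrapolation(observed))  # naive guess: 0.78
print(LeafWorldModel(position=0.53, velocity=0.25).rollout(3))  # physics-based rollout
```

The point of the toy: the second model can answer “what happens next if the wind changes?” because it models the cause of the motion, not just the pattern of past frames.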
[For what it’s worth, OpenAI published a research paper about the potential of their own video model, Sora, as a world simulator.]
Researchers and business leaders alike are also excited about the potential of world models to advance the state of the art in robotics, self-driving vehicles, and other aspects of physical AI. I’ve heard and read several very accomplished AI research leaders speaking in excited tones recently about world models that “predict the future.”
Expect big world model announcements from some big names in the coming months. Like this one DeepMind just dropped: Genie 2: A large-scale foundation world model.
As Import AI reported:
DeepMind has demonstrated Genie 2, a world model that makes it possible to turn any still image into an interactive, controllable world … “Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action (e.g. jump, swim, etc.)” DeepMind writes. “It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.” … Genie 2 hints at a future where entertainment is generated on the fly and is endlessly customizable and interactive.
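To make “the consequences of taking any action” a bit more concrete, here’s the rough shape of that interaction loop in code. The names, types, and structure are made up for illustration; this is not DeepMind’s code or API, just a sketch of an action-conditioned world model seeded from a single image.

```python
from typing import Protocol

class ActionConditionedWorldModel(Protocol):
    """Assumed interface: build a world from one image, then step it with actions."""
    def init_state(self, image: bytes) -> object: ...
    def step(self, state: object, action: str) -> tuple[object, bytes]: ...

def play(model: ActionConditionedWorldModel,
         start_image: bytes,
         actions: list[str]) -> list[bytes]:
    """Turn a still image into an interactive, controllable world."""
    state = model.init_state(start_image)   # internal representation of the scene
    frames = []
    for action in actions:                  # e.g. "jump", "swim", "turn left"
        # The model simulates the consequences of the action and renders the next frame.
        state, frame = model.step(state, action)
        frames.append(frame)
    return frames
```

That loop, run fast enough and conditioned on a player’s inputs, is what makes “entertainment generated on the fly” plausible rather than hand-wavy.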
Little by little and then all at once, digital entertainment may well be about to change.