The Direct from Imagination Era Has Begun
Generative AI, virtual worlds, parallel compute and compositional frameworks will transcend the holodeck from Star Trek.
When I was a kid, I wanted to build a holodeck, the immersive 3D simulation system from Star Trek. So I started making games, beginning with online multiplayer games for bulletin board systems. Eventually, I even got to make a massively multiplayer mobile game based on Star Trek that a few million people played.
Although one feature of the holodeck, manipulating physical force fields, may remain the domain of science fiction, just about everything else is rapidly becoming technological reality (just as Star Trek foresaw so many other things, like mobile phones, voice recognition and tablet computers).
What we’re entering is what I call the direct-from-imagination era: you’ll speak entire worlds into existence.
This is core to my creative vision for what the metaverse is really about: not only a convergence of technologies, but a place for us to express our digital identities creatively.
This article explores the technological and business trends enabling this future.
If you wanted to build a holodeck:
You’d need a way to generate and compose ideas: “Computer, make me a fantasy world with elves and dragons… except made of Legos.”
You’d need a way to visualize the experience: physics and realistic light simulation (ray tracing).
You’d need a way to have a persistent world with data, continuity, rules, systems.
Let’s look at a couple of the ways that generative artificial intelligence taps into your creativity:
ChatGPT as a Virtual Engine
There are a number of ways to conceptualize a large language model like ChatGPT, but one is to see it as a virtual world engine. An example of this is how it can be used to dream up virtual machines and text adventure games.
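To make that concrete, here is a minimal sketch of coaxing a chat model into acting as a text-adventure engine via the OpenAI Python client. The prompt wording and model name are my own illustrative assumptions, not a documented recipe, and SDK call styles vary by version:

```python
# Minimal sketch: prompting a chat model to behave like a text-adventure engine.
# The prompt wording and model name are illustrative assumptions; adjust the call
# style to whichever version of the OpenAI SDK you have installed.
import openai  # pip install openai

openai.api_key = "sk-..."  # your own API key

history = [{
    "role": "system",
    "content": (
        "You are a text adventure engine. Describe the world, track my location "
        "and inventory, and reply only with what the game would print."
    ),
}]

def play(command: str) -> str:
    """Send one player command and return the game's narration."""
    history.append({"role": "user", "content": command})
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=history)
    text = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": text})
    return text

print(play("Computer, make me a fantasy world with elves and dragons... except made of Legos."))
print(play("Look around, then walk toward the dragon."))
```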
Lensa and Self-Expression
Lensa grew to tens of millions of dollars in revenue in only a few weeks. It lets you imagine different versions of yourself and share them with your friends, gratifying both our egos and our creativity.
Its enabling technology, Stable Diffusion, is disruptive because it dramatically reduces the cost of generating artwork, enabling new use cases like Lensa.
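As a rough illustration of why the cost drop is so dramatic, here is a minimal sketch of generating an image locally with the open-source diffusers library. The model ID and prompt are illustrative assumptions, and a GPU with several GB of VRAM is assumed:

```python
# Minimal sketch: generating concept art locally with Stable Diffusion via the
# Hugging Face diffusers library. Model ID and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # fall back to "cpu" (much slower) if no GPU is available

prompt = "portrait of an elven ranger built entirely out of Lego bricks, studio lighting"
image = pipe(prompt).images[0]
image.save("elf_lego.png")
```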
Why Building Virtual Worlds is Hard
The above diagram illustrates just a few of the steps involved in creating a virtual world: a game, an MMORPG, a simulation, a metaverse, or whatever term you prefer. In practice there are far more iterative loops and returns to earlier phases throughout the process of building a large world, and a few types of content (audio, for example) are left out.
A number of emerging technologies — not only generative AI, but advancements in compositional frameworks, computer graphics and parallel computation — are organizing, simplifying and eliminating formerly labor-intensive elements of the process. The impact of this is vast: not only accelerating production velocity and reducing costs, but enabling new use cases.
Compositional Frameworks
Before generative AI and “professional” platforms for worldbuilding became available, a number of other platforms existed. Let’s take a look at those:
Dungeons & Dragons
I’ve often called D&D the first metaverse: it’s an imaginative space with enough structure to allow collaborative storytelling and simulation. It had persistent, virtual worlds called campaigns.
For its first few decades, it was mostly non-digital. But more recently, tools have helped dematerialize the experience and make it easier to run your campaign online. Generative AI tools have also helped dungeon masters create imagery to share with their groups.
Minecraft: the Sandbox
Minecraft is not only a creative tapestry for individuals — it is a space of shared imagination where people compose vast worlds.
The screenshots here are taken from Divine Journey 2, a colossal modpack composed of many other mods and deployed on servers for players to experience together:
Roblox: the Walled Garden
Roblox is not a game — it is a multiverse of games, each created by members of its community.
Many of the most popular experiences on Roblox are not “games” in the traditional sense.
Many would not have gotten greenlit in the mainstream game publishing business — but in a shared space of creativity, new types of virtual worlds flourish.
3D Engines
A decade ago, if you wanted to build an immersive world in 3D, you’d need to know a lot about graphics APIs and matrix math.
For people who can’t fully realize their creativity in a sandbox or walled garden, platforms like Unreal and Unity enable the creation of real-time, immersive worlds that simulate reality.
Persistent Worlds
3D engines provide a window into a world. But the memory of what happens in a world — the history, economy, social structure — as well as the rules that undergird a world, require a means of achieving consensus between all participants.
Walled gardens like Roblox do this for you; but large-scale worlds have otherwise required the work of large engineering teams building that infrastructure from scratch.
You Will Speak Worlds into Existence
Compositional Frameworks will use generative AI to accelerate the worldbuilding process; begin with words, refine with words.
Physics-based methods such as ray tracing will simplify the creative process while delivering amazing experiences.
Generative AI will become part of the loop of games and online experiences, creating undreamt-of interactive forms.
Compute-on-demand will enable scalable, persistent worlds with whatever structure the creator imagines.
Parallel Computation
Computers can dream of worlds — and we can see into them — due to advances in parallel computing.
The next few sections will explain the exponential rise in computation — in your devices and in the cloud — driving the direct-from-imagination revolution, and then return to what the near-term future has in store.
Compute before 2020 is a rounding error vs. today
The top 500 supercomputing clusters in the world show us the exponential rise in computing power over the last few decades.
However, the top 500 only captures a small fraction of the overall compute that’s available. A few metrics to be aware of (Discussion):
Most 2022 phones had 2+ TFLOPs* of compute (2x10¹²), which is roughly 100,000,000 times faster than the computer that sent Apollo to the moon
The Frontier supercomputer passed 1.0 exaflops (10¹⁸)
The “virtual supercomputer” assembled for the Folding@Home COVID-19 simulation reached 1.5 exaflops
Top500 Supercomputing clusters add up to ~10 exaflops
NVIDIA RTX-4090s shipped to date add up to at least 13 exaflops
PlayStation 5s combined surpass 250 exaflops
Apple shipped over 1 zettaflop (10²¹) of compute in 2022
Intel is working toward a zettaflop supercomputer
By 2027, hundreds of zettaflops seems plausible. By then, the compute available at the start of 2023 will seem like a rounding error again.
Technical note: in all these comparisons I blur single vs. double precision and matrix vs. vector ops, so it isn’t quite apples-to-apples. That will be a topic for a future post on global compute; meanwhile, this still ought to provide a rough order of magnitude.
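As a quick back-of-the-envelope check on those ratios, using the round numbers above (the Apollo figure is simply the one implied by the 100,000,000x comparison):

```python
# Back-of-the-envelope check of the figures above; all values approximate, with
# single/double precision and matrix/vector ops deliberately blurred, as noted.
PHONE_FLOPS      = 2e12    # ~2 TFLOPs in a 2022 phone
APOLLO_FLOPS     = 2e4     # rough throughput implied by the 100,000,000x comparison
FRONTIER_FLOPS   = 1.1e18  # Frontier supercomputer, just over 1 exaflop
APPLE_2022_FLOPS = 1e21    # ~1 zettaflop shipped by Apple in 2022

print(f"Phone vs. Apollo:        {PHONE_FLOPS / APOLLO_FLOPS:.0e}x")        # ~1e8
print(f"Apple 2022 vs. Frontier: {APPLE_2022_FLOPS / FRONTIER_FLOPS:.0f}x") # ~900x
```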
Parallel Computation
Much of the increase in global computation capacity has occurred because of parallel computation. Within parallel computation there are two main types:
Programs that are especially parallel-friendly: this includes just about everything that benefits from matrix math, such as graphics and artificial intelligence. This software benefits from adding lots of GPU cores (and the cores themselves keep getting more specialized, like Tensor cores for AI or ray-tracing cores for real-time physics-based rendering). A minimal code sketch of the distinction follows this list.
Programs that remain CPU-bound (more complicated programs with a lot of steps along the way), which benefit from multiple CPU cores.
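Here is that minimal sketch, assuming PyTorch is installed: the matrix multiply fans out across thousands of GPU cores at once, while the sequential loop cannot benefit from extra cores because each step depends on the last.

```python
# Minimal sketch of parallel-friendly vs. CPU-bound work. Requires PyTorch;
# falls back to CPU if no GPU is present.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Parallel-friendly: one big matmul spreads across all available cores at once.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b

# CPU-bound: each iteration depends on the previous one, so extra cores don't help.
x = 1.0
for _ in range(1_000_000):
    x = (x * 1.0000001) % 10.0

print(c.shape, x)
```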
Although having multiple CPU cores has helped us run multitasking programs more efficiently, most of the continuity of Moore’s Law is the result of the growth in GPUs.
However, AI models are growing a lot faster than GPU performance:
Fortunately, cost per FLOP is decreasing at the same time:
Similarly, algorithms are getting much better. ImageNet training costs have decreased more than 95% over 4 years, and new AI systems are even discovering more efficient ways of running themselves:
Scaling Parallel Compute
For huge workloads, like training a massive AI model or running a persistent virtual world for millions of people, you have two main options:
Build an actual supercomputer (CPUs/GPUs all in one location, with high-speed interconnects and shared memory spaces). Currently, this is needed for workloads like certain kinds of simulations or training large models like GPT-3.
Build a virtual supercomputer. Examples:
Folding@Home, an example of a distributed set of workloads that can be performed asynchronously and without shared memory. This approach is good for huge workloads when latency and shared memory don’t matter much. Folding@Home was able to simulate 0.1 seconds of protein dynamics by distributing the workload across >1 exaflop of citizen-scientist computers on the internet (see the sketch after this list).
Ethereum network — good for cryptographic and smart contract workloads
Put code into containers and orchestrate them over large CPU capacity using Kubernetes, Docker Swarm, Amazon ECS/EKS, etc.
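Here is the sketch promised above: a toy version of the Folding@Home pattern, with local processes standing in for volunteer machines. The work function is purely illustrative; the point is that the units are independent and need no shared memory.

```python
# Minimal sketch of the Folding@Home-style pattern: many independent work units,
# no shared memory, farmed out to whatever workers are available. Here the
# "workers" are local processes; in a real virtual supercomputer they would be
# machines volunteered across the internet.
from concurrent.futures import ProcessPoolExecutor

def simulate_work_unit(seed: int) -> float:
    """Stand-in for one chunk of a protein-folding (or world) simulation."""
    value = float(seed)
    for _ in range(100_000):
        value = (value * 1.0001 + 1.0) % 1_000.0
    return value

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate_work_unit, range(64)))
    print(f"completed {len(results)} independent work units")
```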
Because of the combination of innovation in the speed and density of computing cores (mostly GPUs) alongside networking those GPUs together into supercomputers, we’ve exponentially increased the amount of compute that’s available. This is illustrated in Gwern’s diagram of the power of the supercomputing clusters used to train the largest AI models created so far:
Scaling for Users
When a workload is more compute-bound but can be broken down into separate containers (microservices, lambdas, etc.), you can use orchestration technologies like Kubernetes and Amazon ECS to rapidly deploy a large number of virtual machines to service demand. This is mostly useful for making software available to a large number of users (rather than simply making software run faster). How many virtual machines? This chart gives you an idea of how quickly one can provision containers using state-of-the-art orchestration technologies in large datacenters:
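Alongside that chart, here is a rough sketch of what the provisioning step itself can look like, using the official Kubernetes Python client. The deployment name, namespace and replica count are hypothetical placeholders, not a recommended configuration:

```python
# Rough sketch: scaling a containerized world service up to meet player demand
# with the official Kubernetes Python client. Deployment name, namespace and
# replica count are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() when running inside a pod
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="world-shard",              # hypothetical deployment of world servers
    namespace="default",
    body={"spec": {"replicas": 50}}, # spin up 50 instances to absorb a player spike
)
```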
Vast worlds may be simulated on-device
It’s important to understand all aspects of this rise in compute. Cloud-based capacity and actual supercomputers are enabling training huge models and unifying applications that need to be accessed by millions of concurrent users.
However, the exponential rise in compute at the edge and in your own devices is just as important for building metaverses and holodecks. Here’s why: many things are simply done most efficiently right in front of you. For one thing, there’s the speed of light: we’ll never be able to generate real-time graphics in the cloud and ship them to you as quickly as we can on your device, not to mention that it is far more bandwidth-efficient to use the network to send updates to geometry and vectors than to send rasterized images. And many of the more interesting applications will need to perform local inference and localized graphics computation; cloud-based approaches will simply be too slow, too cumbersome, or in violation of privacy norms.
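A quick back-of-the-envelope comparison makes the bandwidth point concrete. All of the numbers below are my own rough assumptions, not measurements:

```python
# Back-of-the-envelope illustration (all numbers are assumptions): streaming
# rasterized frames vs. streaming updates to the geometry/vectors that the
# device renders locally.
FRAME_BYTES   = 1920 * 1080 * 3   # one uncompressed 1080p frame (~6 MB)
FPS           = 60

ENTITY_UPDATE = 7 * 4             # position (3) + rotation quaternion (4) as 32-bit floats
ENTITIES      = 500               # moving objects whose transforms changed this tick
TICK_RATE     = 20                # network updates per second

video_mbps    = FRAME_BYTES * FPS * 8 / 1e6
geometry_mbps = ENTITY_UPDATE * ENTITIES * TICK_RATE * 8 / 1e6

print(f"raw frames:       {video_mbps:,.0f} Mbps")    # ~3,000 Mbps before compression
print(f"geometry updates: {geometry_mbps:.1f} Mbps")  # ~2 Mbps
```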
At the same time as our local hardware is getting better, the software is also improving at an exponential rate. This is illustrated in Unreal Engine, which has a few key features worth noting:
World partitioning allows open worlds to be stitched together
Nanite allows designers to create assets of virtually any geometric complexity and place them in any world (cutting down on the optimization passes, constantly refining objects down to lower polygon counts, and so on, that come up in real-time graphics systems).
Lumen is a global illumination system that uses ray tracing. It looks amazing, runs on consumer hardware, and spares developers from having to “bake” lighting before each build. That matters because most real-time lighting systems in use today (such as in games) require a time-consuming “baking” process to pre-calculate lighting in an environment before shipping the graphics to the user. This is not only a nuisance from a productivity standpoint; it also limits your creativity: dynamic global illumination means you can have environments that change dynamically (e.g., allowing people to build their own structures within a virtual world).
Realtime ray tracing was a demo in 2018 that required a $60,000 PC. Now, it’s possible on a PlayStation 5. Technologies like Lumen, as well as more specialized GPUs such as that found in the NVIDIA RTX-4090, demonstrate how far both physics-based hardware and software have come in a short period of time.
Similarly, these improvements will not be confined to a cloud-based realm of AI model training and web-based inference apps like ChatGPT. Hardware and algorithm improvements will make it possible to train your own models for your team, your game studio and even yourself; and on-device inference will unlock games, applications and virtual-world experiences that were only dreams until recently.
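As a minimal sketch of what on-device inference looks like in practice (the tiny network here is just a placeholder for a locally hosted generative model):

```python
# Minimal sketch of on-device inference: pick the best local accelerator and run
# a forward pass there, with no round trip to a datacenter. The tiny model below
# is a placeholder for a locally hosted generative model.
import torch
import torch.nn as nn

if torch.cuda.is_available():
    device = "cuda"   # discrete GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon GPU
else:
    device = "cpu"

model = nn.Sequential(nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 512)).to(device)
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 512, device=device))

print(f"inference ran locally on {device}, output shape {tuple(out.shape)}")
```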
Generative AI
A complete tour of generative AI would fill a bookshelf (or maybe even a whole library). I want to share a few examples of how generative technologies will replace some of the steps in the production process of building virtual worlds.
At this point, you’ve probably been bombarded with AI-generated art. But it’s worth a reminder of just how far use cases like concept-art generation have come in only a year:
3D Generative Art
One must distinguish between generative art that merely “looks like 3D” and art that is actually 3D. The former is simply another example of generative 2D art; the latter produces mesh geometry that can be rendered with physics-based lighting systems. We’ll need the latter to build virtual worlds.
This is a domain that’s still in its infancy, but remember how quickly 2D generative art developed; 3D is likely to improve dramatically in the near future. OpenAI has already demonstrated the ability to generate point clouds of 3D objects from a text prompt:
Neural Radiance Fields (NeRF)
NeRF generates 3D scenes and meshes from 2D images taken from a small number of viewpoints. The simplest way to think about NeRFs is as “inverse ray tracing,” where the 3D structure of a scene is learned from the way light falls on different cameras. Some of the applications include:
Make 3D creation accessible to photographers — more storytelling and virtual world content
An alternative to complicated photogrammetry
Beyond the immediate applications, inverse ray tracing is a domain that will eventually help us generate accurate 3D models from photos.
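For the mathematically curious, the core of the original NeRF formulation is a volume-rendering integral: the color of a camera ray is the radiance accumulated along it, weighted by how much light survives to each point, where σ is the learned density and c the view-dependent color:

```latex
% Expected color of a camera ray r(t) = o + t d (Mildenhall et al., 2020):
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```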
Text-to-NeRF
Natural language is becoming the unifying interface for many of the generative technologies, and NeRF is another example of that:
AI Could Generate Entire Multiplayer Worlds
Text interfaces will also become a means of organizing larger-scale compositions. At Beamable, we made a proof-of-concept illustrating how you could use ChatGPT to generate the Unreal Engine Blueprints that would include the components necessary to build persistent virtual worlds:
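The proof of concept itself targeted Unreal Engine Blueprints. As a simpler, hypothetical stand-in for the general pattern (the prompt, schema and model name below are my own assumptions, not Beamable’s implementation), you can ask a chat model for a machine-readable world specification and hand it to engine-side tooling:

```python
# Hypothetical sketch of the general pattern (not the Beamable proof-of-concept):
# ask a chat model for a machine-readable description of a world's components,
# which engine-side tooling could then turn into Blueprints, prefabs, etc.
import json
import openai

prompt = (
    "Describe a persistent fantasy world as JSON with keys 'regions', 'factions', "
    "'economy', and 'services' (matchmaking, inventory, leaderboards). JSON only."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

world_spec = json.loads(response["choices"][0]["message"]["content"])
print(world_spec["regions"])  # hand this spec to engine-side code generation
```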
AI can play sophisticated social games
In 2022, Meta AI showed that an AI (CICERO) could be trained on games recorded on a web-based Diplomacy platform. This requires a combination of strategic reasoning and natural language processing. It hints at a future with AI that will:
Help you work through longer, more-complex plans like composing an entire world
Participate “in-the-loop” of virtual experiences and games, acting as social collaborators and competitors
AI can learn and play compositional methods
In 2022, OpenAI demonstrated through a method called Video Pre-Training (VPT) that an AI could learn to play Minecraft.
This resulted in the ability to perform common gameplay behaviors — as well as compositional activities like building a base.
This further reinforces the idea of AI-based virtual beings that can populate worlds — as well as act as partners in the creative process.
AI Can Watch Videos to Make a Game
In a demo called GAN Theft Auto, an AI was trained by watching videos of Grand Theft Auto V. It learned to play the game, and from that learning process it was also able to generate a playable game based on what it saw. The result was a bit rough, but it’s still extremely compelling to imagine how this will improve over time.
Real-Time Compositional Frameworks
What happens when you combine real-time ray tracing and generative AI within an online compositional framework? You should watch this video of NVIDIA’s Omniverse platform for yourself:
Connecting Persistent Worlds
One of the big “last mile” problems in delivering virtual worlds is connecting all of this amazing composition and real-time graphics back to a persistent world engine. That’s what the team at Beamable has focused on. Rather than hiring teams of programmers to code server programs and DevOps personnel to provision and manage servers, Beamable makes it possible to drag and drop persistent-world features into Unity and Unreal. This sort of simplified compositional framework is the key to unlocking the metaverse for everyone:
Decentralizing the Metaverse
Where workloads live today:
Today, most AI training happens in the cloud (such as with foundation models or proprietary models like GPT-3). Similarly, most inference still happens in the cloud (the AI is running on a computing cluster, not your own device).
And although the technology now exists to deliver ray tracing on-device, it’s unevenly distributed, so developers are still pre-rendering graphics, baking lighting and leaning on their in-house shader wizard to make things look great.
Multiplayer consensus in big persistent worlds tends to be the domain of centralized CPU computing backends (for example, walled-garden systems like Roblox, the enormous datacenters run for World of Warcraft, or managed services like AWS).
Where workloads are going:
Personalized AI on-device
Localized AI inference
Teams that train their own models to generate hyperspecialized graphics, content, narratives, etc.
Physics-based simulation on your device (including ray tracing); product teams will shift to focusing on the deliverables, rather than the process.
More of the work related to multiplayer consensus will shift to decentralized approaches: this includes identity, blockchain-based economies, containerized code and distributed use of virtual machines.
Augmented Reality transcends the Holodeck
One of the beneficiaries of decentralization will be augmented reality (AR), because offloading all of the inference and graphics generation to the cloud will simply be too slow for changing the view of reality around us.
A feature of the Holodeck was actual force-feedback. But beyond some simple haptic feedback, it may not be great to get slammed by force fields.
Unlike the holodeck, the metaverse will infuse the real world with digital holograms, AI inference of the local environment and computation driven by digital twins. We’ll collaborate, play and learn in ways unbounded by any one environment.
Digital Twins in Virtual Worlds
Just as augmented reality will exploit many of the technologies I’ve discussed in this article, digital twins (digital models of real-world things that provide real-time data about themselves) will make their way into all manner of virtual worlds: online games, simulations, and virtual reality. A number of companies are even working toward a planet-scale digital twin of the Earth; Cesium, used in the flight simulator example above, is one such company.
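As a hypothetical sketch of the idea (the field names and telemetry are illustrative, not any particular vendor’s schema), a digital twin can be as simple as an object in the virtual world that ingests live readings from its real-world counterpart:

```python
# Hypothetical sketch of a digital twin: a virtual-world object that mirrors
# live telemetry from its real-world counterpart. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TurbineTwin:
    """Virtual stand-in for one real wind turbine placed in a 3D world."""
    asset_id: str
    latitude: float
    longitude: float
    rpm: float = 0.0
    temperature_c: float = 0.0
    last_update: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def ingest(self, telemetry: dict) -> None:
        """Apply a real-time sensor reading so the virtual world stays in sync."""
        self.rpm = telemetry["rpm"]
        self.temperature_c = telemetry["temperature_c"]
        self.last_update = datetime.now(timezone.utc)

twin = TurbineTwin(asset_id="turbine-042", latitude=41.3, longitude=-70.6)
twin.ingest({"rpm": 14.2, "temperature_c": 38.5})
print(twin)
```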
Who will be Disrupted?
Venture capital firm a16z estimates that games will be impacted most.
The impact will not simply be the disruption from letting people make the same games but cheaper — it will be making new kinds of games with new and smaller teams.
Many categories of traditional media are projecting into virtual worlds, becoming more game-like. Consider that by January 2023, 198M people had viewed the Travis Scott music concert that originally appeared inside Fortnite.
All media will follow where games will lead.
Everything will be Disrupted
However, it is important to realize that this is a disruption that will affect everything and everyone. Two areas in particular that will be disrupted by, but also benefit from, the technologies covered here are the creator economy and the experiences of the metaverse:
The creator economy will dramatically expand to include far more participants. However, team sizes are going to shrink, and I expect a tough time ahead for many teams that compete purely on the scale of the workforce they can bring to bear on a project. Smaller teams will do work that only much larger teams could do in the past. Eventually, a single auteur may be able to imagine an experience and sculpt it into something that currently requires hundreds of people.
Direct-from-Imagination
The world that is arriving is one where we can imagine anything — and experience these virtual worlds alongside our friends.
The metaverse of multiverses beckons us.
And the universe said you are the universe tasting
itself, talking to itself, reading its own code
–Julian Gough, Minecraft End Poem
Further Reading
If you’d like this entire discussion in a compact, shareable deck, here is how it originally appeared on LinkedIn:
Here are some other articles you might enjoy:
Explore the proof-of-concept deck on how we used ChatGPT to implement a persistent virtual world with Beamable and Unreal Engine.
Composability is the Most Powerful Creative Force in the Universe explains the importance and power of compositional frameworks.
The Metaverse Value Chain explores the key elements of the industries powering the metaverse.
My article on the a16z Future blog, Unbundling Digital Identity Unlocks New Ways to Play and Build, sets the stage: self-expression and creativity are proximally how we express our identity (avatars), but ultimately they extend into the worlds we shape around us.