The Cloud Native Game Development Canon

Live Services games: containers, microservices, distributed actors and data fabrics

Mar 10, 2023

This article is for people who want to understand the intersection between “cloud native” technologies and the field of game development. It is both a high-level framework for the key topics and a set of articles and videos I’ve found helpful for learning or for filling in the gaps.

For anyone working on a live service game (i.e., “games as a service”), I think you’ll find something helpful in the articles here.

This will be a living document to keep track of articles about cloud-native game development. Did I miss an article that you really love? Let me know in the comments—if it’s great, I’ll want to include it!

Midjourney image: yes, I know the difference between a canon and a cannon, but it looked cool—OK?

Cloud Native Games: Foundational

In What Does Cloud-Native Even Mean, Raph Koster does an excellent job of outlining what this is all about:

We design and architect knowing the game is on the server, not baked into the client — with everything that entails for how the game is distributed, accessed, and played. The game should meet the player where the player is.

We design and architect our servers as citizens of the modern Web, so that they can draw on all that infrastructure, all those APIs, and the vast amount of scalable compute that is available, instead of just putting a LAN server up on cloud hosting.

— Raph Koster

If you read nothing else in this list, I’d urge you to read Enabling LiveOps Across Games with Shared Operational Excellence (DORA) by Molly Sheets—this is simply the best overall view into the components of cloud-native architecture. It drills into the real technological elements, and proposes the business metrics behind why these technologies matter:

Time is the single most valuable resource in the world and hard to recreate, except in this context. Operational excellence, achieved through testing in production, is an adventure of sharing empathy as we gift time to each other. — Molly Sheets

My own article, Cloud-Native Worlds, lays out some high-level principles of cloud-oriented game development and virtual world building (visibility, designed-in scalability and composability).

The rest of this article goes into a deeper analysis of some of the specific technologies that are the building-blocks of the cloud-native toolkit for game developers:

Containers
Microservices
Distributed Actors
Data Fabric

In addition, I go into a few topics related to “Cloud Gaming” (which normally refers to GPU-in-the-cloud rendering, where games are streamed instead of installed on your device—which is part of the cloud-native possibility-space, but not synonymous with it).

I also touch on blockchain as a possible component of cloud architectures, as well as provide some historical content for those interested in the origin story of cloud-native gaming.

Scaling with Containers

What is Kubernetes? is a brief description, including a simple explanation of the difference between Docker and Kubernetes.

How to Use Docker To Make Local Development A Breeze is a good video explaining one of the most important reasons to use containers: to make it so that your local development matches what you can expect in the cloud (if you’re still logging into remote shells for everything you do, stop everything right now and watch this).

Kubernetes for Game Development by Jonas Lundgren—a bachelor’s thesis on how Kubernetes can be applied to game development; don’t let the “bachelor’s” mislead you—this is a detailed and thorough explanation of how k8s can be applied to the world of game development, and it’s better than most of what you’ll find online.

Scaling Containers on AWS in 2022 shows performance benchmarks for various approaches to cloud-based code (Lambdas vs. containers on ECS and EKS).
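
One habit that helps a containerized game server behave the same way on a laptop and in the cloud is to read every setting from environment variables, so the same image runs unchanged in both places. The sketch below is a minimal illustration and is not taken from any of the linked articles; the variable names (GAME_PORT, GAME_MAX_PLAYERS, GAME_REGION) and their defaults are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    port: int
    max_players: int
    region: str


def load_config(env=None) -> ServerConfig:
    """Build the config from environment variables, with local-friendly defaults."""
    # Hypothetical variable names; a real project would choose its own.
    env = os.environ if env is None else env
    return ServerConfig(
        port=int(env.get("GAME_PORT", "7777")),
        max_players=int(env.get("GAME_MAX_PLAYERS", "16")),
        region=env.get("GAME_REGION", "local"),
    )


if __name__ == "__main__":
    # Locally this prints the defaults; in a cluster the orchestrator
    # would inject different values without rebuilding the image.
    print(load_config())
```

Because the code never checks "am I in the cloud?", the only difference between environments is the set of variables the container is started with.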

Microservices

Microservices are a pattern that allows you to isolate functionality into simpler, independent modules that can scale more easily and can be built by more efficient teams.

What are Microservices? is an Amazon article that gives a good overview of the microservices pattern.

Why you should run your game servers independently from your chat is a specific use-case for microservices in the field of game development. The example of isolating chat from the rest of the functionality can be extended to many other features of a game to make it easier to scale and manage.
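
To make the chat example concrete, here is a small sketch of a game server that treats chat as a separate service behind a narrow interface. All names (ChatClient, InMemoryChat, GameServer) are hypothetical, and the in-memory class stands in for what would be a network call in a real deployment. The point is only that a chat outage degrades one feature rather than taking the match down.

```python
from typing import Protocol


class ChatUnavailable(Exception):
    """Raised by a chat client when the chat service cannot be reached."""


class ChatClient(Protocol):
    def send(self, player_id: str, text: str) -> None: ...


class InMemoryChat:
    """Stand-in for a remote chat service; set `up = False` to simulate an outage."""

    def __init__(self) -> None:
        self.up = True
        self.messages = []

    def send(self, player_id: str, text: str) -> None:
        if not self.up:
            raise ChatUnavailable("chat service is down")
        self.messages.append((player_id, text))


class GameServer:
    def __init__(self, chat: ChatClient) -> None:
        self.chat = chat
        self.positions = {}

    def move(self, player_id: str, x: int, y: int) -> None:
        # Core gameplay never touches the chat service.
        self.positions[player_id] = (x, y)

    def say(self, player_id: str, text: str) -> bool:
        # Chat failures are contained: report them and keep the match running.
        try:
            self.chat.send(player_id, text)
            return True
        except ChatUnavailable:
            return False


if __name__ == "__main__":
    chat = InMemoryChat()
    server = GameServer(chat)
    server.say("p1", "hello")
    chat.up = False                   # simulate a chat outage
    print(server.say("p1", "anyone?"))  # chat call fails gracefully
    server.move("p1", 3, 4)           # gameplay is unaffected
    print(server.positions)
```

The same shape applies to other features that can be split out, such as leaderboards or matchmaking: the game server depends on a small interface, and each service behind it can be scaled or restarted on its own.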

Non-Gaming Case Studies on Microservices

Game developers can learn a lot from how the largest services on the internet have scaled; due to the complexity and high level of interaction within games, the breaking-points for scalability often come a lot sooner than they do elsewhere!

Breaking the Monolith at Twitch — tells the story of how Twitch went from a monolithic architecture to a microservices architecture.
How Discord Stores Trillions of Messages — explains how Discord approached the problem of managing trillions of messages, and migrated from monoliths to “data services” (along with notes on their movement across database platforms).
How Netflix Scales its API with GraphQL Federation — explains the migration across several architectures at Netflix, from monoliths to microservices and then towards a federated, load-balanced gateway that organizes their microservices.

Evolution of an API Architecture — From: How Netflix Scales its API with GraphQL Federation

Once you’ve looked at those, watch Krazam’s hilarious, cautionary tale on microservices. Yes, there’s such a thing as going too far with this pattern.

Distributed Actors

Distributed Actors are a programming model that improves fault-tolerance and scalability by implementing independent software services that communicate through asynchronous messaging. The programming model isn’t new (it dates to the 1970s!) but has become more popular recently with the large-scale demands placed on servers.
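
As a rough illustration of the messaging idea, the sketch below implements a single actor in one process using Python's asyncio. It is not distributed and is not how Orleans or any other framework is implemented; the class and message names (CounterActor, "increment", "get", "stop") are made up for the example. What it does show is the core contract: the actor owns its state, and the only way to affect that state is to put a message in its mailbox.

```python
import asyncio


class CounterActor:
    """An actor owns its state and handles one mailbox message at a time."""

    def __init__(self) -> None:
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self.count = 0  # private state; only the actor's own loop mutates it

    async def run(self) -> None:
        while True:
            message, reply = await self.mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self.count += 1
            elif message == "get" and reply is not None:
                reply.set_result(self.count)

    async def tell(self, message: str) -> None:
        """Fire-and-forget: enqueue a message without waiting for a result."""
        await self.mailbox.put((message, None))

    async def ask(self, message: str) -> int:
        """Request-reply: enqueue a message and await the actor's answer."""
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((message, reply))
        return await reply


async def demo() -> int:
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    for _ in range(3):
        await actor.tell("increment")
    total = await actor.ask("get")
    await actor.tell("stop")
    await task
    return total


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because messages are processed one at a time, the counter needs no locks. Production actor frameworks add the parts this sketch leaves out, such as sending messages between machines, supervising failed actors, and persisting state.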

Actor Model on Wikipedia gives a good overview of the field.

Architecting & Launching the Halo 4 Services explains the use of the Orleans actor model within the Halo franchise.

From: Architecting & Launching the Halo 4 Services

Since Orleans, a number of other actor frameworks have emerged, namely Akka and Proto.Actor. Both were inspired by Orleans, and Proto.Actor was created by the creator of Akka.NET, the .NET port of Akka. Benchmark: .NET virtual actor frameworks compares the performance of these implementations:

From: Benchmark: .NET virtual actor frameworks
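
The term "virtual actor" in that benchmark refers to actors that callers address by a stable identity, with the framework activating them on demand rather than the caller creating them explicitly. The following is a deliberately tiny, single-process sketch of that idea under my own assumptions; VirtualActorRegistry and PlayerActor are hypothetical names, and real frameworks also handle placement across machines and state persistence, which are omitted here.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class VirtualActorRegistry(Generic[T]):
    """Hands out actors by identity, activating each one on first use."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory
        self._active: Dict[str, T] = {}

    def get(self, actor_id: str) -> T:
        # Callers never construct actors; asking for an id is enough.
        if actor_id not in self._active:
            self._active[actor_id] = self._factory(actor_id)
        return self._active[actor_id]

    def deactivate(self, actor_id: str) -> None:
        # In this sketch the in-memory state is simply dropped;
        # a real framework would persist it before deactivating.
        self._active.pop(actor_id, None)

    @property
    def active_count(self) -> int:
        return len(self._active)


class PlayerActor:
    def __init__(self, player_id: str) -> None:
        self.player_id = player_id
        self.score = 0

    def add_score(self, points: int) -> int:
        self.score += points
        return self.score


if __name__ == "__main__":
    players = VirtualActorRegistry(PlayerActor)
    players.get("alice").add_score(10)
    players.get("alice").add_score(5)   # same activation, state is kept
    print(players.get("alice").score, players.active_count)
```

The appeal for game servers is that code can talk to "player alice" or "match 42" without caring whether that actor is currently loaded, or on which machine.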

The following video is an interview with the creator of Proto.Actor, where he explains some of the thinking behind why he created it:

Cloud Gaming

“Cloud Gaming” refers to games where all or part of the game’s rendering happens in the cloud. Potential advantages of cloud gaming include better security, faster (or nonexistent) installation, and the ability to deliver complex graphical experiences to lower-end devices.

Behind the “cloud game” you’ll still want to build with the same cloud-native technologies mentioned above: containers, distributed actors, etc. Indeed, containers are likely to be one of the most efficient ways to distribute the rendering logic out to the edge networks where rendering will happen—as well as the game-server logic that implements rules, multiplayer functionality, etc.

NVIDIA’s Cloud Gaming is the Industry’s Future is a good non-technical explanation of cloud gaming, along with a focus on NVIDIA’s recent efforts in this area (GeForce NOW).

An Analysis of Cloud Gaming Platforms Behavior under Different Network Constraints is an academic paper comparing the performance and quality of experience (QoE) of some of the major cloud gaming platforms.

Cloud-gaming: Analysis of Google Stadia traffic describes the ill-fated Stadia platform. While Stadia has been canceled, this is one of the better overviews explaining how Google approached this technology, with some time spent on the concept of “negative latency,” an attempt to apply artificial intelligence predictive technology to reduce the perceived latency while playing. I believe that artificial intelligence research applied to cloud gaming will be one of the ways we’ll eventually produce the illusion of real-time even though it is impossible to actually exceed the speed-of-light constraint of delivering streamed games over the internet. At this point, “negative latency” still remains an in-the-laboratory dream.
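
The speed-of-light constraint can be put into rough numbers. The sketch below computes a lower bound on network round-trip time for a given distance and compares it with the time available per frame. It assumes light in optical fiber travels at about two thirds of its vacuum speed (an approximation), and it ignores routing, encoding, rendering, and display delays, so real latency will be higher. The function names are my own.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 0.67  # assumption: light in fiber moves at roughly 2/3 of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


def frame_budget_ms(fps: int) -> float:
    """Time available per frame at a given frame rate."""
    return 1000 / fps


if __name__ == "__main__":
    for km in (100, 1000, 5000):  # hypothetical distances to a data center
        print(f"{km:>5} km: at least {min_round_trip_ms(km):.1f} ms round trip")
    print(f"60 fps frame budget: {frame_budget_ms(60):.1f} ms")
```

Under these assumptions, a data center 1,000 km away costs roughly 10 ms of round trip before any work is done, which is already more than half of a 60 fps frame. That is why edge placement and prediction both matter for streamed games.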

Polystream has a different take on cloud gaming: namely, that game servers stream GPU commands to the player's device but do not render the frames themselves. If this worked (and people have enough GPU power on their devices) then it could provide a more scalable economic model compared to the energy and computational costs of cloud-based rendering.

Meta (f/k/a Facebook) isn’t particularly known for delivering games, but their article Under the hood: Meta’s cloud gaming infrastructure does give a good technical overview of what a cloud gaming architecture looks like.

Data Fabrics

Data Fabrics are an architecture for data that utilizes cloud-native software development patterns to provide a composable framework for working with data across a range of applications. Gartner forecasts that by 2024, data fabrics will cut data management tasks in half and quadruple data efficiency.

My article, Data Fabric for the Metaverse, explains why this approach can accelerate development around data in the same way that microservices can accelerate development around code modules.

Data Fabric for Dummies is a Hitachi e-book on data fabrics that goes into depth on the subject. If you want a simple overview, this explainer from IBM does a great job.

Blockchain

This is Hard Architecture by Molly Sheets is the skeptical take on blockchain technologies—pointing out that they just don’t scale well enough for many use cases. Molly also raises a lot of security concerns.

Polygon: A Game Changer for Web3 Gaming and NFTs explains Polygon’s approach to scaling and distributed consensus, and what some of the proper use cases are.

Historical Stuff

“Cloud computing” first came about in 2006, when Amazon launched AWS. But it took over a decade to think of “cloud-native” as anything more than simply deploying code at someone else’s datacenter. Cloud Computing History gives a good overview of where things went from 2006 to 2020.

I wrote Anatomy of an MMORPG in 2007—some of the things I wrote here probably remain true for some games (World of Warcraft?) but it is mostly a historical artifact to think about how far we’ve come.

Life’s a Game, and then You Die was an article I commissioned from science-fiction author Charlie Stross in 2007. One of his questions that’s pertinent to cloud-native development: “how big can we make an MMO shard?” Again, it’s a fun read for video game archaeologists.

Metavert Meditations
