At StarkWare, we faced a problem. Our backend was complex: API servers, a blockchain node, databases, microservices, the works. It was also new (this was 2019, and the entire company was new), so few of us knew it well, and most of us had little to no experience in incident response. We could have put everyone through an on-call rotation and waited for bad things to happen in production. But trial by fire puts production at unnecessary risk and stresses people out. We felt there was another way.
So we came up with "Gameshow."
It’s similar in spirit to Google SRE’s “The Wheel of Misfortune,” but with a crucial difference: Google’s version is a tabletop exercise, essentially Dungeons & Dragons. Gameshow is Live Action Role-Playing (LARP). It’s entirely hands-on.
The Setup
We spun up three fully-fledged, non-production copies of our live backend on Kubernetes. No mocks. A real blockchain node, a real database, the whole shebang. We then pumped synthetic, production-like traffic into them.
Then, an experienced engineer (Shahar Papini, in this case) intentionally sabotaged these systems. He drew on actual past malfunctions, like misconfiguring a database endpoint, to wreak havoc. Three broken systems gave us about an hour of gameplay.
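To make the sabotage step concrete, here is a minimal sketch of how a fault like the misconfigured database endpoint could be described as data. It is not the tooling we used; the Scenario class, the sabotage function, the config keys, and the hostnames are all hypothetical, and the config is assumed to be a flat dictionary of strings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Scenario:
    """One hypothetical Gameshow fault: a single config key set to a bad value."""
    name: str
    prompt: str      # the vague complaint handed to the player
    key: str         # config key to break (assumed flat string config)
    bad_value: str   # the wrong value to inject


def sabotage(config: Dict[str, str], scenario: Scenario) -> Dict[str, str]:
    """Return a broken copy of the config; the healthy original is left untouched."""
    if scenario.key not in config:
        raise KeyError(f"unknown config key: {scenario.key}")
    broken = dict(config)
    broken[scenario.key] = scenario.bad_value
    return broken


# Hypothetical sandbox config and fault, for illustration only.
healthy = {"DB_HOST": "db.sandbox-1.internal", "DB_PORT": "5432"}
scenario = Scenario(
    name="wrong-db-endpoint",
    prompt="A user complained their transactions are failing",
    key="DB_HOST",
    bad_value="db.sandbox-9.internal",
)
broken = sabotage(healthy, scenario)
```

Keeping the healthy config untouched means the sandbox can be reset between players, and writing faults down as data makes it easy to build a catalog out of past incidents.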
The Gameplay
We gathered the team in a room. One by one, engineers were called to the "stage," connecting their laptops to the main screen.
We handed them a broken system and a vague prompt: "A user complained their transactions are failing" or "X alert just fired."
The player on stage was the driver, but the entire room was involved in helping debug. Once the system was declared fixed, the moderator either confirmed it or provided more fake user feedback that it was still broken.
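The moderator's call can also be backed by the synthetic traffic rather than by gut feeling. The sketch below assumes each synthetic transaction is recorded as a simple pass/fail boolean; the verdict function and the 99% threshold are illustrative assumptions, not something we actually ran.

```python
from typing import Iterable


def verdict(results: Iterable[bool], min_success_rate: float = 0.99) -> str:
    """Turn recent synthetic-transaction outcomes into the moderator's reply.

    The reply is phrased as user feedback and does not hint at the root cause.
    """
    outcomes = list(results)
    if not outcomes:
        return "No traffic observed yet, keep looking."
    succeeded = sum(outcomes)  # True counts as 1, False as 0
    if succeeded / len(outcomes) >= min_success_rate:
        return "Confirmed: fixed."
    failed = len(outcomes) - succeeded
    return f"A user reports {failed} of their last {len(outcomes)} transactions still failed."
```

For example, verdict([True] * 90 + [False] * 10) returns a fresh complaint about 10 failed transactions out of 100, while a clean run of 100 successes returns the confirmation.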
Key points
Scientific Mindset: Form a hypothesis, verify or disprove it, and reason out loud. Restarting a service sometimes makes the symptom go away, but it can be a temporary fix that covers up the actual issue.
Team Effort: The room sounds out opinions. This limits the stress on the "driver" while giving them hands-on keyboard time.
Poker Face: The moderator gives nothing away. The system itself must be the only source of feedback (with the exception of the initial fake user complaint).
I recently adapted this methodology on a smaller scale at doubleAI to onboard a new engineer. It felt to me just as effective for 1-on-1 instruction as it did for a whole team.
Tabletop exercises are fine, and sometimes the ROI of replicating a large system for the sake of training does not justify the effort. In a small startup you play to your strengths: being small and agile, you can usually spin up a sandbox and get everyone hands-on experience, LARPing your way forward.