Epic Games explained to Fortnite players this week why it took down the Playground LTM and how it eventually fixed the mode. In a letter to players, the company described how the system got so bogged down and the lengths it went to in order to get everything running smoothly again. Below is a snippet from the letter that details what was changed to bring the mode back online on July 2nd.
The first thing we did after disabling the mode was to split Playground MMS to run on its own service cluster. This was necessary not only to keep a traffic jam from affecting the base game modes, but also to allow us to iterate and tweak the service as often as we needed while we worked to get Playground back online. We tried increasing levels of dramatic re-architecturing, and tested at each stage until we reached the acceptance criteria to re-release the mode.
Once we identified the root of the problem as the exhaustion of sessions from local lists, the solution was to give the cluster the ability to bulk rebalance sessions from other nodes to ensure repeated lookups were not necessary. With the system constantly shifting regional capacity from nodes with an excess to nodes that might be running low, the odds of a node running dry for a particular region and having to search outside its local list have been drastically reduced. While not an issue right now in the primary Fortnite Battle Royale game modes, this is an upgrade we are bringing over to the main MMS cluster as well to future-proof the system.
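To make the rebalancing idea concrete, here is a minimal Python sketch of how free sessions for one region might be evened out across nodes. It is not Epic's implementation: the `Node` class, the `rebalance` function, and the "NA-East" region label are all hypothetical, and the even-split policy is an assumption chosen for simplicity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical matchmaking node holding free session counts per region."""
    name: str
    free: dict[str, int] = field(default_factory=dict)


def rebalance(nodes: list[Node], region: str) -> int:
    """Spread one region's free sessions evenly across all nodes.

    Returns how many sessions had to move between nodes.
    """
    total = sum(n.free.get(region, 0) for n in nodes)
    base, extra = divmod(total, len(nodes))
    # Nodes that already hold the most sessions keep the remainder.
    ranked = sorted(nodes, key=lambda n: n.free.get(region, 0), reverse=True)
    moved = 0
    for i, node in enumerate(ranked):
        target = base + (1 if i < extra else 0)
        have = node.free.get(region, 0)
        if have > target:
            moved += have - target  # surplus handed to nodes running low
        node.free[region] = target
    return moved


if __name__ == "__main__":
    cluster = [Node("a", {"NA-East": 90}), Node("b", {"NA-East": 6}), Node("c")]
    print(rebalance(cluster, "NA-East"), [n.free["NA-East"] for n in cluster])
```

Running a pass like this on a schedule means a node rarely finds its local list empty, which is the situation the letter says forced the repeated lookups.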
We pushed the load-testing process to the limits during our MMS restructuring, because the scale of what we were trying to simulate was so far beyond normal usage or testing patterns. We needed to spin up many millions of theoretical users and hurl them at our Playground MMS system in a big, crashing wave in an attempt to strain our new session rebalancer. While the tweak – test – evaluate cycle took several hours per loop, it allowed us to develop and refine the rebalance behavior to a point where we felt it could stand up to the traffic, as well as to identify and fix edge-case bugs that could have torpedoed the effort to bring Playground back online.
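A toy version of that kind of test can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `run_wave` function, the round-robin arrival pattern, and the tiny request counts are hypothetical and nowhere near the scale Epic describes. The sketch only shows the shape of the experiment: send a wave of requests and count how often a node's local list runs dry, with and without periodic rebalancing.

```python
def run_wave(free, requests, rebalance_every=None):
    """Send `requests` round-robin at the nodes; return how many missed locally.

    `free` is a list of free session counts, one entry per node, for one region.
    """
    free = list(free)  # work on a copy so the caller's list is untouched
    misses = 0
    for i in range(requests):
        if rebalance_every and i % rebalance_every == 0:
            # Periodic bulk rebalance: split the remaining sessions evenly.
            base, extra = divmod(sum(free), len(free))
            free = [base + (1 if j < extra else 0) for j in range(len(free))]
        node = i % len(free)
        if free[node] > 0:
            free[node] -= 1
        else:
            misses += 1  # local list exhausted: stands in for a remote lookup
            donor = max(range(len(free)), key=free.__getitem__)
            if free[donor] > 0:
                free[donor] -= 1
    return misses


if __name__ == "__main__":
    start = [900, 0, 0]  # all capacity begins on a single node
    print("no rebalancing:", run_wave(start, 600))
    print("rebalance every 100 requests:", run_wave(start, 600, rebalance_every=100))
```

Comparing the two miss counts after each tweak is a small-scale stand-in for the tweak, test, evaluate loop described in the letter.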