AWS offers a variety of tools that enable users to create highly scalable, durable, and resilient architectures, and the user community has developed a broad range of best practices and frameworks for building rock-solid systems on top of AWS.
Many customers, such as Netflix, Airbnb, and SmugMug, have already demonstrated these practices in production systems. While the theoretical concepts behind resilient architectures are well established, the practice of maintaining such systems is less well understood, mostly because production environments behave unpredictably under stress.
To address this issue, some organizations have adopted Game Days: exercises that simulate unexpected failures to test resilience, detect and fix flaws, and, more importantly, train operations teams to handle emergencies. This session covers best practices learned from the many AWS customers who have implemented Game Days, along with the failure simulation techniques that can be used on AWS.