Architecting & Launching the Halo 4 Services - SRE CON 15

Halo 4 is a first-person shooter on the Xbox 360, with fast-paced, competitive gameplay. To complement the code on disc, a set of services was developed and deployed in Azure to store player statistics, display player presence information, deliver daily challenges, modify playlists, catch cheaters, and more. As of June 2013, Halo 4 had 11.6 million players who played 1.5 billion games, logging 270 million hours of gameplay.

The Halo 4 services were built from the ground up to support high demand, low latency, and high availability. In addition, video games have unique load patterns: the majority of traffic and sales occur within the first few weeks after launch, making this a critical period for the game and its supporting services. Halo 4 went from 0 to 1 million users on day 1, and reached 4 million users within the first week.

Caitie McCaffrey

March 18, 2015

Transcript

  1. Architecting & Launching the Halo 4 Services SRECON ‘15

  2. Caitie McCaffrey • Distributed Systems Engineer • @Caitie • CaitieM.com

  3. None
  4. • Halo Services Overview • Architectural Challenges • Orleans Basics • Tales From Production
  5. None
  6. • Presence • Statistics • Title Files • Cheat Detection • User Generated Content

  7. None
  8. None
  9. None
  10. None
  11. • Halo: CE - 6.43 million • Halo 2 - 8.49 million • Halo 3 - 11.87 million • Halo 3: ODST - 6.22 million • Halo Reach - 9.52 million
  12. • $220 million in sales • 1 million players online • Day One
  13. • $300 million in sales • 4 million players online • 31.4 million hours • Week One
  14. • 11.6 million players • 1.5 billion games • 270 million hours • Overall
  15. Architectural Challenges

  16. Load Patterns

  17. • Azure Worker Roles • Azure Table • Azure Blob • Azure Service Bus

  18. Always Available

  19. Low Latency & High Concurrency

  20. Stateless 3 Tier Architecture

  21. Latency Issues

  22. Add A Cache

  23. Concurrency Issues

  24. Data Locality

  25. The Actor Model: A framework & basis for reasoning about concurrency • "A Universal Modular Actor Formalism for Artificial Intelligence" • Carl Hewitt, Peter Bishop, Richard Steiger (1973)
  26. • Send A Message • Create a New Actor • Change Internal State

  27. Stateful Services

  28. Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, Jorgen Thelin • Orleans: Distributed Virtual Actors for Programmability and Scalability • eXtreme Computing Group, MSR
  29. “Orleans is a runtime and programming model for building distributed systems, based on the actor model”
  30. Virtual Actors: “An Orleans actor always exists, virtually. It cannot be explicitly created or destroyed”
  31. Virtual Actors • Perpetual Existence • Automatic Instantiation • Location Transparency • Automatic Scale Out
  32. Runtime • Messaging • Hosting • Execution

  33. Orleans Programming Model

  34. Reliability “Orleans manages all aspects of reliability automatically”

  35. TOO!

  36. None
  37. TOO!

  38. TOO!

  39. Performance & Scalability

  40. “Orleans applications run at very high CPU Utilization. We have run load tests with full saturation of 25 servers for many days at 90%+ CPU utilization without any instability”
  41. None
  42. Load Patterns

  43. Orleans is AP

  44. • Stateful Services • Virtual Actor Abstraction • Self Healing Frameworks
  45. Orleans & Halo

  46. None
  47. None
  48. Get Orleans: https://github.com/dotnet/orleans

  49. Tales From Production

  50. DevOps (noun): 1. The Decisions You Make Now Will Affect the Quality of Sleep You Get Later
  51. Load Patterns

  52. Story: No Data Like Prod Data, aka Halo 4 launch night was not the first time Azure & Orleans saw Production Data
  53. New Technology • Orleans: MSR Technology • Azure • Dispatcher

  54. Halo Reach: Presence Service

  55. None
  56. Memory Leak

  57. Practice DevOps

  58. Story: Validate Dependencies, aka the time we broke Azure Service Bus
  59. STOP WHAT YOU’RE DOING!!!!

  60. WHAT WERE YOU DOING???

  61. None
  62. None
  63. Backup the Backup

  64. Story: Clients are Jerks, aka remember that time the game DoS’d us at Launch
  65. Different Priorities

  66. Release Valves

  67. Back Pressure

  68. Protect Your Services

  69. Let’s Wrap it Up

  70. Distributed Systems is hard

  71. CAP Theorem aka why we can’t have nice things

  72. Know Your Tradeoffs (hint: you are making one whether you know it or not)
  73. Consistency or Availability

  74. Questions @Caitie
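
The virtual actor idea from slides 25 to 31 (send a message, automatic instantiation, perpetual existence) can be illustrated with a small single-process sketch. This is not the Orleans API (Orleans is a .NET framework) and not the Halo 4 implementation. Every name here (PlayerStatsActor, ToyRuntime, record_game, the "player-1" id) is hypothetical, and the in-memory dict only stands in for durable storage. The sketch assumes callers address an actor by id only, that the runtime activates it on first use, and that each actor handles one message at a time.

```python
import asyncio


class PlayerStatsActor:
    """Hypothetical actor holding in-memory statistics for one player."""

    def __init__(self, player_id, state=None):
        self.player_id = player_id
        # Restore saved state if an earlier activation was deactivated.
        self.state = state or {"games": 0, "kills": 0}

    async def record_game(self, kills):
        self.state["games"] += 1
        self.state["kills"] += kills
        return dict(self.state)


class ToyRuntime:
    """Toy runtime (an assumption, not Orleans): callers never create or
    destroy actors, they only send messages to an actor id."""

    def __init__(self, actor_cls):
        self._actor_cls = actor_cls
        self._active = {}   # actor_id -> (actor, lock)
        self._storage = {}  # actor_id -> saved state (stand-in for durable storage)

    def _activate(self, actor_id):
        # Automatic instantiation: activate on first message.
        if actor_id not in self._active:
            actor = self._actor_cls(actor_id, self._storage.get(actor_id))
            self._active[actor_id] = (actor, asyncio.Lock())
        return self._active[actor_id]

    async def call(self, actor_id, method, *args):
        actor, lock = self._activate(actor_id)
        # One message at a time per actor, even if the method awaits.
        async with lock:
            return await getattr(actor, method)(*args)

    def deactivate(self, actor_id):
        # Free memory but keep the state, so the actor still "exists" virtually.
        actor, _ = self._active.pop(actor_id)
        self._storage[actor_id] = dict(actor.state)

    def is_active(self, actor_id):
        return actor_id in self._active


async def demo():
    runtime = ToyRuntime(PlayerStatsActor)
    # Ten concurrent messages to the same actor id.
    await asyncio.gather(
        *(runtime.call("player-1", "record_game", 5) for _ in range(10))
    )
    runtime.deactivate("player-1")
    # The next message transparently reactivates the actor with its state.
    return await runtime.call("player-1", "record_game", 2)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A real runtime would also place activations across servers, persist state durably, and handle messages that arrive during deactivation. None of that is modeled here.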