Scaling and load testing Battlelog in Battlefield 4

Håkan Rosenhorn
March 06, 2015

Talk given at Uppsala Tech Meetup #2
http://www.meetup.com/Uppsala-Tech-Meetup/events/220347895/

Handling 6 million simultaneous users on launch day: a look at the architecture used and at how load testing was done to achieve this.

Transcript

  1. AGENDA → ABOUT BATTLELOG → PREPARING FOR BATTLEFIELD 4 LAUNCH → ARCHITECTURE → LOAD TESTING → MINIMIZING RISK WHEN LIVE → POST-LAUNCH NUMBERS → QUESTIONS?
  2. ESTIMATIONS FOR LAUNCH → 6 MILLION SIMULTANEOUS USERS → 50 000 HTTP REQUESTS/SEC → < 300 MS RESPONSE TIME → LESS THAN 0.1% REQUEST ERRORS (WIKIPEDIA IS THE 6TH LARGEST SITE AND SERVES 100 000 REQ/S DURING PEAKS)
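
The launch estimates can be turned into a few derived numbers. The sketch below assumes that the 50 000 requests/sec figure is the load generated by the 6 million simultaneous users, and that every request takes the full 300 ms; the slide states the targets but not how they relate, so treat the results as rough orders of magnitude.

```python
# Launch-day targets taken from the slide.
SIMULTANEOUS_USERS = 6_000_000
REQUESTS_PER_SEC = 50_000
MAX_RESPONSE_TIME_S = 0.300
MAX_ERROR_RATE = 0.001  # 0.1 %

# Assumption: all users together produce the full request rate,
# so this is the average gap between two requests from one user.
seconds_between_requests = SIMULTANEOUS_USERS / REQUESTS_PER_SEC

# Little's law: requests in flight = arrival rate x time in the system.
# Using the 300 ms ceiling gives an upper bound.
max_requests_in_flight = REQUESTS_PER_SEC * MAX_RESPONSE_TIME_S

# Error budget: how many failed requests per second still meet the target.
max_errors_per_sec = REQUESTS_PER_SEC * MAX_ERROR_RATE

print(f"one request per user every {seconds_between_requests:.0f} s")
print(f"up to {max_requests_in_flight:.0f} requests in flight")
print(f"up to {max_errors_per_sec:.0f} failed requests per second")
```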
  3. APP SERVERS → BUILT IN PYTHON → SINGLE-THREADED NON-BLOCKING IO → ONE OS PROCESS PER CPU CORE → IDENTICAL SERVERS → BEHIND A SINGLE LOAD BALANCER → EASY TO ADD OR REMOVE SERVERS
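
A minimal sketch of what single-threaded non-blocking IO buys: many requests wait on IO at the same time inside one thread. The slides do not say which framework Battlelog used; `asyncio` is a stand-in chosen here, and `handle_request`, `serve` and `run_demo` are hypothetical names.

```python
import asyncio
import time


async def handle_request(request_id):
    # Stand-in for a non-blocking database or cache call.
    await asyncio.sleep(0.05)
    return f"response {request_id}"


async def serve(n_requests):
    # All handlers share one thread; the event loop switches to another
    # handler whenever the current one is waiting on IO.
    return await asyncio.gather(*(handle_request(i) for i in range(n_requests)))


def run_demo(n_requests=100):
    started = time.perf_counter()
    responses = asyncio.run(serve(n_requests))
    elapsed = time.perf_counter() - started
    return responses, elapsed


if __name__ == "__main__":
    responses, elapsed = run_demo()
    # Handled one after another, 100 requests would take about 5 seconds.
    print(f"{len(responses)} responses in {elapsed:.2f} s")
```

Because one such process only uses one core, running one process per core (for example `os.cpu_count()` of them behind the load balancer) is the usual way to use the whole machine.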
  4. DATABASE SERVERS → NO JOINS → NO MASTER/SLAVE → HOT AND COLD DATA → APP-LEVEL SHARDING → SERVERS ON BARE METAL; IN-MEMORY KEY-VALUE STORAGE → PERSISTENT → ONLY HOT DATA → APP-LEVEL SHARDING → SERVERS ON BARE METAL; + CASSANDRA FOR AN INTERNAL SERVICE
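
App-level sharding means the application, not the database, decides which server holds a given key. The sketch below shows one common way to do that (hash the key, take it modulo the number of shards). The slides do not describe Battlelog's actual scheme; the shard names and the `shard_for` helper are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical shard names, for illustration only.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key: str, shards: Sequence[str] = SHARDS) -> str:
    """Pick the shard that owns `key`, using a stable hash of the key."""
    # hashlib gives the same result in every process, unlike the built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user_id in ("user:1001", "user:1002", "user:1003"):
        print(user_id, "->", shard_for(user_id))
```

A plain modulo moves most keys when the number of shards changes; schemes such as consistent hashing or a lookup table reduce that, at the cost of more bookkeeping.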
  5. CACHE SERVERS → MEMCACHED → FINE-GRAINED AND CONDITIONAL CACHING → APP-LEVEL SHARDING → ~300K REQ/S AT FULL LOAD → SERVERS ON BARE METAL
  6. OUR RECIPE: 1) USE IDENTICAL SETUP TO PRODUCTION 2) SIMULATE USER BEHAVIOR 3) START WITH THE “HAPPY CASE” 4) LOAD TESTING INSIDE DEV TEAM 5) REAL-TIME MONITORING OF SERVERS 6) ENSURE FAST TURNAROUND TIMES FOR CODE 7) MINIMIZE SUBSYSTEMS THAT NEED TESTING
  7. STEP 2: SIMULATE USER BEHAVIOR → DESCRIBE USER BEHAVIOR WITH CODE → DISTRIBUTED → CAN SIMULATE MILLIONS OF USERS → ADJUST LOAD AT RUN-TIME → BUILT IN PYTHON → WE RUN LOCUST ON AMAZON’S CLOUD, EC2 → ~40 SERVERS → PAY PER HOUR → ENABLES RAPID SCALE UP/DOWN OF TESTS (LOCUST: AN OPEN SOURCE LOAD TESTING TOOL FROM ESN)
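
To illustrate what "describe user behavior with code" can mean, here is a library-free sketch: each simulated user repeatedly picks a weighted task. This is not Locust's API, only the idea behind it; the paths, the weights and the names `SimulatedUser` and `run_swarm` are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical page mix: (path, relative weight).
USER_TASKS = [
    ("/soldier/stats", 5),
    ("/servers/browse", 3),
    ("/forum", 1),
]


class SimulatedUser:
    def __init__(self, rng):
        self.rng = rng
        self.requests = []

    def step(self):
        paths = [path for path, _ in USER_TASKS]
        weights = [weight for _, weight in USER_TASKS]
        path = self.rng.choices(paths, weights=weights, k=1)[0]
        # A real load test would issue an HTTP request here and time it.
        self.requests.append(path)
        return path


def run_swarm(n_users, steps_per_user, seed=0):
    """Run all users for a fixed number of steps and count requests per path."""
    rng = random.Random(seed)
    users = [SimulatedUser(rng) for _ in range(n_users)]
    for _ in range(steps_per_user):
        for user in users:
            user.step()
    return Counter(path for user in users for path in user.requests)


if __name__ == "__main__":
    print(run_swarm(n_users=100, steps_per_user=20))
```

Changing `n_users` while a test is running corresponds to adjusting the load at run-time; a distributed tool spreads the users over many machines instead of one loop.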
  8. STEP 3: START WITH “HAPPY CASE” → BEGIN TESTING WITH OPTIMAL CONDITIONS → REMOVE EXTERNAL SUBSYSTEMS → DISABLE NON-VITAL INTERNAL SERVICES → EMPTY AND UNPOPULATED DATABASES → GRADUALLY ADD MORE SUBSYSTEMS → POPULATE DATA
  9. STEP 3: START WITH “HAPPY CASE” → NUCLEUS: USERS, CREDENTIALS AND ENTITLEMENTS → ORIGIN FRIENDS: USER PRESENCE AND FRIEND RELATIONS → BLAZE: GAME SERVERS AND PLAYER STATS → MOCKLEUS: READ-ONLY MOCK SERVER → AMIGO: READ-ONLY MOCK SERVER → BLAZE: NOT YET
  10. STEP 5: REAL-TIME MONITORING → INSTRUMENT CODE WITH KEY METRICS → CODE PROFILER EMBEDDED → LIVE TRACING OF NETWORK CALLS → DATABASE, CACHES, INTERNAL + EA SERVICES → REAL-TIME SYSTEM METRICS → MEASURE CPU, MEMORY AND I/O
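
Instrumenting code with key metrics can be as small as a timing wrapper around each network call. A minimal sketch follows; the slides do not show Battlelog's instrumentation, so the `timed` helper, the in-memory `TIMINGS` store and the metric name are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Metric name -> list of measured durations in seconds (in-memory only).
TIMINGS = defaultdict(list)


@contextmanager
def timed(metric_name):
    """Record how long the wrapped block took, even if it raises."""
    started = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS[metric_name].append(time.perf_counter() - started)


def summary(metric_name):
    samples = TIMINGS[metric_name]
    if not samples:
        return {"count": 0, "avg_s": 0.0, "max_s": 0.0}
    return {
        "count": len(samples),
        "avg_s": sum(samples) / len(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    with timed("db.query"):
        time.sleep(0.01)  # stand-in for a database call
    print(summary("db.query"))
```

In a live system the samples would be shipped to a metrics service rather than kept in a dictionary, so they can be graphed in real time across all servers.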
  11. STEP 6: FAST CODE CHANGES → ITERATE AND USE SAME BRANCH AS REST OF DEV TEAM → ENSURE QUICK BUILDS ON BUILD SERVER → IN OUR CASE, LESS THAN 10 MINUTES → WE USE TORRENTS TO DEPLOY BUILDS TO 100+ SERVERS → TAKES LESS THAN 2 MINUTES FOR 800 MB BUILDS
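
A back-of-the-envelope calculation suggests why torrents help here. It assumes exactly 100 servers, 1 MB = 10^6 bytes and the full 2 minutes; the slide only gives "100+", "800 MB" and "less than 2 minutes", so the results are approximate.

```python
# Figures from the slide, with the assumptions stated above.
BUILD_SIZE_MB = 800
SERVERS = 100
DEPLOY_WINDOW_S = 120

# If one build server had to send every copy itself:
total_mb = BUILD_SIZE_MB * SERVERS
single_source_mb_per_s = total_mb / DEPLOY_WINDOW_S
single_source_gbit_per_s = single_source_mb_per_s * 8 / 1000

# What each receiving server has to download in the same window:
per_server_mbit_per_s = BUILD_SIZE_MB / DEPLOY_WINDOW_S * 8

print(f"single source: {single_source_gbit_per_s:.1f} Gbit/s outbound")
print(f"each server:   {per_server_mbit_per_s:.0f} Mbit/s inbound")
```

With a peer-to-peer protocol the servers also upload pieces to each other, so the origin does not have to carry the full outbound rate on its own.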
  12. STEP 7: MINIMIZE NEW TESTING → DON’T RE-TEST PERFORMING SERVICES → KNOW THEIR PERFORMANCE CHARACTERISTICS → KNOW HOW TO SCALE THEM OUT → SERVICES WE DIDN’T RE-TEST FOR BF4: BEACONPUSH (USER PRESENCE), SERAPHINE (GAME SERVER INDEX), MYSQL, MEMCACHED AND SO ON
  13. 7 INDISPENSABLE TOOLS → KILL SWITCHES → ZERO DOWNTIME UPDATES → MAINTENANCE CONSOLE → METRICS → INSTANT CONFIG CHANGES → FEATURE ROLLOUT → CIRCUIT BREAKERS
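
Of the tools listed, the circuit breaker is the one that most benefits from a concrete example. The sketch below shows the general pattern (stop calling a failing dependency for a while, then allow a trial call); the slides do not describe Battlelog's implementation, and the class name, thresholds and `clock` parameter are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout passed: let one trial call through ("half-open").
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

While the circuit is open, callers fail fast instead of tying up resources waiting on a dependency that is already struggling.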
  14. KILL SWITCHES: ABILITY TO TURN OFF FEATURES INSTANTLY → OVER 50 KILL SWITCHES → 5 EXAMPLES: BATTLELOG IN-GAME, FORUM, ACTIVITY FEED, BLAZE CALLS / EVENTS, ORIGIN FRIENDS CALLS
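
A kill switch is, at its simplest, a named flag that request handlers check before doing optional work. A minimal sketch under that assumption follows; the switch names echo two of the examples on the slide, while the `KillSwitches` class and `render_profile_page` are hypothetical.

```python
class KillSwitches:
    def __init__(self, names):
        # True means the feature has been switched off.
        self._killed = {name: False for name in names}

    def kill(self, name):
        self._killed[name] = True

    def restore(self, name):
        self._killed[name] = False

    def is_enabled(self, name):
        return not self._killed[name]


def render_profile_page(switches):
    """Build the list of page sections, skipping any killed feature."""
    sections = ["stats"]
    if switches.is_enabled("activity_feed"):
        sections.append("activity_feed")
    if switches.is_enabled("forum"):
        sections.append("forum_link")
    return sections


if __name__ == "__main__":
    switches = KillSwitches(["forum", "activity_feed"])
    print(render_profile_page(switches))
    switches.kill("activity_feed")  # e.g. the feed backend is overloaded
    print(render_profile_page(switches))
```

For the switch to act instantly across a fleet, the flag state would need to live in shared configuration that every app server reads, rather than in process memory as it does here.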
  15. PEAKED AT ~15 000 REQ/S, EXCLUDING STATIC CONTENT → 0 MINUTES OF DOWNTIME SINCE LAUNCH → 450 COMMITS DEPLOYED TO PRODUCTION IN THE FIRST 7 DAYS, SPREAD OVER 7 ZERO DOWNTIME UPDATES