This is a talk I gave at the AWS Summit in Stockholm in 2014, telling the story of QuizUp, the choices that were made and why, and how those choices impacted the success of QuizUp (and our lives) in the days after launch.
QuizUp? • Social trivia game for iOS and Android • Launched on iOS November 7th 2013 • .. on Android March 6th 2014 • Currently 16+ million users • One of the fastest growing apps/networks
QuizUp Background • .. started making small, topical trivia apps • such as: • Eurovision QuizUp • Twilight QuizUp • Math QuizUp • NatGeo QuizUp • First: proofs of concept for investors • Then: satellites to pull users into the QuizUp network
Engineering Team • Small server team (3-4 people, depending on perspective) • Backgrounds in telecom, finance, design, math and music • Me: f/oss devops guy, first in web-tech, then telecom (mobile)
The story: iOS Launch • Expected 1M users in 2013 • Got 1M users in 8 days • Capacity planning was hard • Executed all scaling strategies within a week
Why and how: use “the cloud”? • Mostly in IaaS fashion (AWS) • Prefer SaaS to in-house solutions • Allows a small team to accomplish a lot quickly • We also use Heroku for many internal apps • Intended to use more PaaS
QuizUp Architecture • Inspired by • 12factor.net • Netflix Engineering • Most moving parts are scalable • Stateless “immutable” app servers • Sharded player data • Scalable datastores
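To make the “sharded player data” point above concrete, here is a minimal sketch of deterministic shard lookup by player id. The hashing scheme, shard count, and hostnames are assumptions for illustration, not QuizUp's actual setup.

```python
# Hypothetical sketch: map a player id to a fixed list of database shards.
# Shard hostnames and the use of MD5 are illustrative assumptions only.
import hashlib

SHARDS = [
    "players-shard-0.example.internal",
    "players-shard-1.example.internal",
    "players-shard-2.example.internal",
    "players-shard-3.example.internal",
]

def shard_for_player(player_id: str) -> str:
    """Deterministically map a player id to one of the shards."""
    digest = hashlib.md5(player_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for_player("player-12345"))  # always lands on the same shard
```

A fixed modulo scheme like this is the simplest option, but it is exactly why resharding (as described later in the talk) requires data migration and some downtime.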
QuizUp Architecture • Worse is (often) better • Optimizing is a luxury problem • Outsource to SaaS what we can • Pusher, DataDog, Pingdom, PagerDuty, Travis, Sentry • …etc
QuizUp Architecture • A lot inherited from “legacy” • Large monolithic quizup-server API implemented in Python • Decoupling now • Separate services • Routing requests to different ELBs
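To illustrate the decoupling point above: a hedged sketch of routing request paths to separate ELBs. The path prefixes and ELB hostnames are made-up examples; in practice this mapping would usually live in DNS or a reverse proxy rather than in application code.

```python
# Illustrative sketch only: fan a monolithic API out by sending path
# prefixes to separate load balancers. Prefixes and hostnames are fictional.
ROUTES = {
    "/games":   "games-elb.example.amazonaws.com",
    "/players": "players-elb.example.amazonaws.com",
    "/topics":  "topics-elb.example.amazonaws.com",
}
DEFAULT_BACKEND = "monolith-elb.example.amazonaws.com"

def backend_for(path: str) -> str:
    """Pick the load balancer that should receive this request."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    return DEFAULT_BACKEND

assert backend_for("/games/123") == "games-elb.example.amazonaws.com"
assert backend_for("/healthcheck") == DEFAULT_BACKEND
```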
How did we prepare? • Metrics, metrics, metrics • Code freeze — an entire *week* before launch! • Load testing (Locust, 20x m1.small nodes) • 5 weeks of beta • Force-update and graceful maintenance features built into API and clients • Coordinate with infrastructure vendor: • Pre-warm ELBs • Increase instance limits
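A minimal sketch of the kind of Locust user class used for this sort of load test, written against the current Locust API (the 2013-era API used HttpLocust/TaskSet); the endpoint paths and weights are placeholders, not QuizUp's real API.

```python
# Hypothetical Locust load-test sketch; endpoints are made-up placeholders.
from locust import HttpUser, task, between

class QuizUpPlayer(HttpUser):
    # Simulated players wait 1-3 seconds between actions
    wait_time = between(1, 3)

    @task(3)
    def browse_topics(self):
        self.client.get("/v1/topics")

    @task(1)
    def start_game(self):
        self.client.post("/v1/games", json={"topic": "general-knowledge"})
```

Run with something like `locust -f locustfile.py --host https://api.example.com` and scale out workers (e.g. across those m1.small nodes) to generate launch-level traffic.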
Downtime? • Went from 1 hi1.4xlarge database master to 8 in 6 days • Yes, there was downtime: • First db sharding ~2 days after launch (29 min) • Second sharding 5 days after launch (90 min) • Third sharding 6 days after launch (40 min)
What lessons did we learn? • Monitoring and metrics PAY OFF • Tools to help deal with users • Invest in configuration management • Dynamic configuration with switches/throttles
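A sketch of what the “dynamic configuration with switches/throttles” point can look like in practice, assuming an in-memory stand-in for a runtime-updatable config store; the key names and percentages are illustrative only, not QuizUp's actual configuration.

```python
# Hypothetical sketch of runtime switches and throttles; in production the
# values would come from a store that can be changed without a redeploy.
import random

class DynamicConfig:
    """In-memory stand-in for a runtime-updatable config store."""

    def __init__(self):
        self._switches = {"matchmaking_enabled": True}
        self._throttles = {"push_notifications": 0.25}  # fraction of traffic allowed

    def switch_on(self, name: str) -> bool:
        return self._switches.get(name, False)

    def allow(self, name: str) -> bool:
        """Probabilistically admit a request under the named throttle."""
        return random.random() < self._throttles.get(name, 1.0)

config = DynamicConfig()
if config.switch_on("matchmaking_enabled") and config.allow("push_notifications"):
    pass  # proceed with the feature, at reduced volume if throttled
```

Flipping a switch or lowering a throttle is far faster than shipping new code, which is exactly what you want when traffic is 8x your capacity plan.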