Slide 1

Scaling from First Principles - Case Study
Piyush Goel | Nov 19th | @pigol1

Slide 2

What didn’t work?

Slide 3

Context - Capillary (Circa Aug 2012)
● Problem Statement
  ○ ~10K desktop applications (.NET) talking to the cloud
  ○ Control the apps via commands
    ■ Sync logs -- the most common ask
    ■ Refresh configuration files
    ■ Sync telemetry data on demand
      ● Scheduled push already happening

Slide 4

First Principles
● Need a custom command protocol that the server and desktop app understand. ✅
● Sync approach:
  ○ Poll?
    ■ 5 sec ⇒ 10K apps ÷ 5 s = 2,000 reqs/sec. Really? ❌ **
  ○ Long-polling connections? ✔
  ○ WebSockets? ✅
● But we want fancy tech!! =)

** This number was incorrectly presented as 10K in the talk.
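As a concrete illustration of the kind of command protocol the first bullet calls for, here is a minimal JSON envelope sketched in TypeScript. All type and field names are assumptions for illustration; the command names simply mirror the asks on the context slide, not the actual Capillary protocol.

```typescript
// Hypothetical command envelope -- names are illustrative, not the
// actual Capillary protocol. Commands mirror the context slide.
type CommandName = "SYNC_LOGS" | "REFRESH_CONFIG" | "SYNC_TELEMETRY";

interface Command {
  id: string;                    // unique id so the client can ack exactly once
  name: CommandName;             // what the desktop app should do
  issuedAt: string;              // ISO-8601 timestamp
  args?: Record<string, string>; // optional command parameters
}

interface CommandAck {
  commandId: string;             // echoes Command.id
  status: "OK" | "FAILED";
  detail?: string;               // error message on failure
}
```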

Slide 5

Solution
● XMPP
  ○ Gives persistent/long-lived connections out of the box.
  ○ XMPP Extension Protocols (XEPs)
    ■ Ad-hoc commands (https://xmpp.org/extensions/xep-0050.html)
  ○ What more?
    ■ We can embed our own chat clients -- that’s so cool!!
    ■ Inventing a problem that doesn’t exist!
● eJabberd? → We don’t know Erlang!
● OpenFire Server -- Java
  ○ Allows user plugins => we can hack custom protocols.
  ○ What’s the big deal?
● .NET client libs for XMPP
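For reference, an XEP-0050 ad-hoc command travels inside an IQ stanza along the lines below. The `http://jabber.org/protocol/commands` namespace and the `action='execute'` attribute come from the XEP itself; the addresses and the `sync-logs` node are placeholders, not values from the actual deployment.

```typescript
// Illustrative XEP-0050 stanza, held in a TypeScript string for uniformity.
// Namespace and action are from the XEP; JIDs and node are placeholders.
const executeSyncLogs: string = `
<iq type='set' to='client-123@desktop-apps.example.com' id='exec1'>
  <command xmlns='http://jabber.org/protocol/commands'
           node='sync-logs'
           action='execute'/>
</iq>`;
```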

Slide 6

Solution - Where did it go wrong?
● OpenFire documentation ⇒ sparse!
● New technology!
● Only basic plugins for reference; had to reverse-engineer from OpenFire code.
● Scaling challenges
  ○ Concurrency constructs not very obvious ⇒ poor documentation kills you!
● Unknown unknowns galore!
● Took ~3 months to stabilize for production.
● The developer had already lost steam towards the end.
● It worked until it stopped.
● Facing an unknown enemy!
● It was a mess!

Slide 7

Solution - First Principles
● A simple Node app on WebSockets could have solved this for us (see the sketch below).
● Roughly 200-300 lines of JS code.
● Could have gone live in less than 3 weeks.
● Lower maintenance headache & fewer unknowns!
● Wouldn’t have had to solve problems that didn’t exist!
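A minimal sketch of what that Node WebSocket command server might look like, written in TypeScript against the widely used `ws` npm package (an assumption; any WebSocket library would do). The message shapes and helper names are illustrative, not recovered code.

```typescript
import { WebSocketServer, WebSocket } from "ws";
import { randomUUID } from "node:crypto";

// Minimal sketch, assuming the `ws` package. Message shapes are
// illustrative -- the real protocol is whatever the .NET desktop
// apps and the server agree on.
const wss = new WebSocketServer({ port: 8080 });
const clients = new Map<string, WebSocket>(); // appId -> live socket

wss.on("connection", (socket) => {
  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "HELLO") {
      // Desktop app identifies itself right after connecting.
      clients.set(msg.appId, socket);
    } else if (msg.type === "ACK") {
      console.log(`app ${msg.appId} finished command ${msg.commandId}`);
    }
  });
  socket.on("close", () => {
    for (const [appId, s] of clients) if (s === socket) clients.delete(appId);
  });
});

// Push a command (e.g. SYNC_LOGS) to one connected app.
function sendCommand(appId: string, name: string): void {
  const socket = clients.get(appId);
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ id: randomUUID(), name }));
  }
}
```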

Slide 8

What worked?

Slide 9

Context - TravelTriangle (Circa Dec 2016)
● A travel marketplace.
● 3 types of products
  ○ Consumer-facing (B2C)
    ■ Read-heavy
    ■ 150K => 750K unique users/day; 1M => 6M page views/day
  ○ Seller-facing (B2B)
    ■ Reads & writes
    ■ Heavy listing and ranking use-cases. Long sessions (8-10 hours)
    ■ 2500 => 10K DAU
  ○ Admin products (operations support)
    ■ Reads & writes
    ■ Heavy listing, ranking, & suggestion use-cases. Long sessions (8-9 hours)
    ■ 250 => 1000 DAU

Slide 10

Tech Stack - TravelTriangle
● Monolith Ruby on Rails app
  ○ Separate service deployments for each product.
● MySQL
● ElasticSearch
● Redis
● Message queues
● Amazon Web Services
  ○ 30 EC2 instances -- 16 at the application services layer.
  ○ RDS
  ○ ElastiCache
  ○ Elastic Load Balancers
● Cloudinary & Akamai

Slide 11

Problem Statement
● Assuming your data stores can handle the extra traffic, scale up the application services infra to meet the 5X growth in traffic.
● Constraints:
  ○ You have limited $$ at your disposal to spend on infra.
  ○ You have only 1 senior engineer to spare. :)
● Welcome to a start-up! =)

Slide 12

Thoughts?
● Containers (Docker, LXC, et al.)?
● Cluster managers (Docker Swarm, Kubernetes - kops)?
● Auto-Scaling Groups on EC2?
● New technologies
  ○ Learning curve -- engineers love to tinker with “cool” tech.
  ○ Maintenance cost?
    ■ Known unknowns vs unknown unknowns!

Slide 13

First Principles
● Need more servers to increase capacity ✅
● Elasticity factor
  ○ Seconds? -- Lambda or Cloud Functions? ❌
  ○ Mins? -- Containers? ❌
  ○ 30 mins - 1 hour? -- Virtual machines, perhaps? ✅

Slide 14

Traffic Patterns

Slide 15

Traffic Patterns

Slide 16

Traffic Patterns

Slide 17

First Principles
● Need more servers to increase capacity ✅
● Elasticity factor
  ○ Seconds? -- Lambda or Cloud Functions? ❌
  ○ Mins? -- Containers? ❌
  ○ Hour? -- Virtual machines, perhaps? ✅
● Cost savings
  ○ Do I need to save costs for extra seconds/mins? ❌
  ○ Do I need to save costs for extra hours? ✅

Slide 18

Solution - Poor Man’s Scaling =)

Slide 19

Solution - Poor man’s scaling (scheduled)
● Jenkins cron -- every 15 mins (Jenkins already used heavily within the team)
● AWS SimpleDB
  ○ JSON documents: service => time & server-count mapping
  ○ Easy GUI to manipulate JSON documents.
● Ruby scripts (flow sketched below)
  ○ Scale up
    ■ Capistrano for deploying the latest code builds.
    ■ Register under the load balancer.
  ○ Scale down
    ■ Fail the liveness probe for the load balancer.
    ■ Wait for the request queue to drain -- Ruby scripts check Passenger queue lengths.
    ■ De-register & shut down.
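The whole job can be pictured as one reconcile loop run by the cron. The original was Ruby; the TypeScript below only sketches the flow, and every `declare`d helper is a hypothetical stand-in for the SimpleDB / Capistrano / ELB / Passenger glue named on the slide.

```typescript
// Flow sketch only -- the original was Ruby scripts on a 15-minute
// Jenkins cron. Every declared helper below is a hypothetical stand-in.

interface ScheduleEntry {
  service: string;      // e.g. "b2c-web" (hypothetical name)
  desiredCount: number; // servers wanted for the current time slot
}

declare function readScheduleFromSimpleDB(now: Date): ScheduleEntry[];
declare function runningServers(service: string): string[]; // server ids
declare function bootAndDeploy(service: string): string;    // Capistrano deploy; returns new id
declare function registerWithELB(service: string, id: string): void;
declare function failHealthCheck(id: string): void;         // LB stops sending new traffic
declare function passengerQueueLength(id: string): number;
declare function deregisterAndShutdown(id: string): void;

async function reconcile(): Promise<void> {
  for (const entry of readScheduleFromSimpleDB(new Date())) {
    const current = runningServers(entry.service);
    if (current.length < entry.desiredCount) {
      // Scale up: deploy the latest build, then take traffic.
      for (let i = current.length; i < entry.desiredCount; i++) {
        registerWithELB(entry.service, bootAndDeploy(entry.service));
      }
    } else {
      // Scale down: drain before shutdown so no request is dropped.
      for (const id of current.slice(entry.desiredCount)) {
        failHealthCheck(id);
        while (passengerQueueLength(id) > 0) {
          await new Promise((r) => setTimeout(r, 5_000)); // poll every 5 s
        }
        deregisterAndShutdown(id);
      }
    }
  }
}
```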

Slide 20

Solution - Poor man’s scaling - Cost?
● 3 days to move to pre-production & 2 days to move to prod.
● 1 engineer.
● Can’t get cheaper than this =))

Slide 21

Solution - Poor man’s scaling - ROI?
● Scaled up to 7X traffic volumes.
● The solution worked for 18 months before we explored containers.
● Saved 40% cost over on-demand instances.
  ○ Rough calculations showed we could have saved an additional ~10% with container-based scaling.
  ○ At what cost?
● Templatized the solution
  ○ QA environment moved to this model.
  ○ Regression & sanity suites invoke the scaling jobs as a prerequisite.
  ○ Each QA group could spawn its environment at will.

Slide 22

Questions?

Slide 23

Online Coordinates
● @pigol1
● [email protected]