Slide 1

Scaling from First Principles - Case Study
Piyush Goel | Nov 19th | @pigol1

Slide 2

What didn’t work?

Slide 3

Context - Capillary (Circa Aug 2012)
● Problem Statement
  ○ ~10K desktop applications (.NET) talking to the cloud
  ○ Control the apps via commands
    ■ Sync logs -- the most common ask
    ■ Refresh configuration files
    ■ Sync telemetry data on demand
      ● Scheduled push already happening

Slide 4

First Principles
● Need a custom command protocol that the server and desktop app understand. ✅
● Sync approach:
  ○ Poll?
    ■ 5 sec ⇒ 10K apps ÷ 5 s = 2,000 reqs/sec. Really? ❌ **
  ○ Long-polling connections? ✔
  ○ WebSockets? ✅
● But we want fancy tech!! =)

** This number was incorrectly presented as 10K in the talk.
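As a concrete illustration of the kind of command protocol the first bullet calls for, here is a minimal JSON envelope sketched in TypeScript. All type and field names are assumptions for illustration; the command names simply mirror the asks on the context slide, not the actual Capillary protocol.

```typescript
// Hypothetical command envelope -- names are illustrative, not the
// actual Capillary protocol. Commands mirror the context slide.
type CommandName = "SYNC_LOGS" | "REFRESH_CONFIG" | "SYNC_TELEMETRY";

interface Command {
  id: string;                    // unique id so the client can ack exactly once
  name: CommandName;             // what the desktop app should do
  issuedAt: string;              // ISO-8601 timestamp
  args?: Record<string, string>; // optional command parameters
}

interface CommandAck {
  commandId: string;             // echoes Command.id
  status: "OK" | "FAILED";
  detail?: string;               // error message on failure
}
```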

Slide 5

Solution
● XMPP
  ○ Gives persistent/long-lived connections out of the box.
  ○ XMPP Extension Protocols (XEPs)
    ■ Ad-hoc commands (https://xmpp.org/extensions/xep-0050.html)
  ○ What more?
    ■ We can embed our own chat clients -- that’s so cool!!
    ■ Inventing a problem that doesn’t exist!
● eJabberd? → We don’t know Erlang!
● OpenFire Server -- Java
  ○ Allows user plugins => we can hack custom protocols.
  ○ What’s the big deal?
● .NET client libs for XMPP
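For reference, an XEP-0050 ad-hoc command travels inside an IQ stanza along the lines below. The `http://jabber.org/protocol/commands` namespace and the `action='execute'` attribute come from the XEP itself; the addresses and the `sync-logs` node are placeholders, not values from the actual deployment.

```typescript
// Illustrative XEP-0050 stanza, held in a TypeScript string for uniformity.
// Namespace and action are from the XEP; JIDs and node are placeholders.
const executeSyncLogs: string = `
<iq type='set' to='client-123@desktop-apps.example.com' id='exec1'>
  <command xmlns='http://jabber.org/protocol/commands'
           node='sync-logs'
           action='execute'/>
</iq>`;
```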

Slide 6

Solution - Where did it go wrong?
● OpenFire documentation ⇒ sparse!
● New technology!
● Only basic plugins for reference; had to reverse-engineer from OpenFire code.
● Scaling challenges
  ○ Concurrency constructs not very obvious ⇒ poor documentation kills you!
● Unknown unknowns galore!
● Took ~3 months to stabilize for production.
● The developer had already lost steam towards the end.
● It worked until it stopped.
● Facing an unknown enemy!
● It was a mess!

Slide 7

Solution - First Principles
● A simple Node app on WebSockets could have solved this for us (see the sketch below).
● Roughly 200-300 lines of JS code.
● Could have gone live in less than 3 weeks.
● Lower maintenance headache & fewer unknowns!
● Wouldn’t have had to solve problems that didn’t exist!
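A minimal sketch of what that Node WebSocket command server might look like, written in TypeScript against the widely used `ws` npm package (an assumption; any WebSocket library would do). The message shapes and helper names are illustrative, not recovered code.

```typescript
import { WebSocketServer, WebSocket } from "ws";
import { randomUUID } from "node:crypto";

// Minimal sketch, assuming the `ws` package. Message shapes are
// illustrative -- the real protocol is whatever the .NET desktop
// apps and the server agree on.
const wss = new WebSocketServer({ port: 8080 });
const clients = new Map<string, WebSocket>(); // appId -> live socket

wss.on("connection", (socket) => {
  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === "HELLO") {
      // Desktop app identifies itself right after connecting.
      clients.set(msg.appId, socket);
    } else if (msg.type === "ACK") {
      console.log(`app ${msg.appId} finished command ${msg.commandId}`);
    }
  });
  socket.on("close", () => {
    for (const [appId, s] of clients) if (s === socket) clients.delete(appId);
  });
});

// Push a command (e.g. SYNC_LOGS) to one connected app.
function sendCommand(appId: string, name: string): void {
  const socket = clients.get(appId);
  if (socket && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ id: randomUUID(), name }));
  }
}
```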

Slide 8

What worked?

Slide 9

Context - TravelTriangle (Circa Dec 2016)
● A travel marketplace.
● 3 types of products
  ○ Consumer-facing (B2C)
    ■ Read-heavy
    ■ 150K => 750K unique users/day; 1M => 6M page views/day
  ○ Seller-facing (B2B)
    ■ Reads & writes
    ■ Heavy listing and ranking use-cases. Long sessions (8-10 hours)
    ■ 2500 => 10K DAU
  ○ Admin products (operations support)
    ■ Reads & writes
    ■ Heavy listing, ranking, & suggestion use-cases. Long sessions (8-9 hours)
    ■ 250 => 1000 DAU

Slide 10

Tech Stack - TravelTriangle
● Monolith Ruby on Rails app
  ○ Separate service deployments for each product.
● MySQL
● ElasticSearch
● Redis
● Message queues
● Amazon Web Services
  ○ 30 EC2 instances -- 16 at the application services layer.
  ○ RDS
  ○ ElastiCache
  ○ Elastic Load Balancers
● Cloudinary & Akamai

Slide 11

Problem Statement
● Assuming your data stores can handle the extra traffic, scale up the application services infra to meet the 5X growth in traffic.
● Constraints:
  ○ You have limited $$ at your disposal to spend on infra.
  ○ You have only 1 senior engineer to spare. :)
● Welcome to a start-up! =)

Slide 12

Thoughts?
● Containers (Docker, LXC, et al.)?
● Cluster managers (Docker Swarm, Kubernetes - kops)?
● Auto-Scaling Groups on EC2?
● New technologies
  ○ Learning curve -- engineers love to tinker with “cool” tech.
  ○ Maintenance cost?
    ■ Known unknowns vs unknown unknowns!

Slide 13

First Principles
● Need more servers to increase capacity ✅
● Elasticity factor
  ○ Seconds? -- Lambda or Cloud Functions? ❌
  ○ Mins? -- Containers? ❌
  ○ 30 mins - 1 hour? -- Virtual machines, perhaps? ✅

Slide 14

Traffic Patterns

Slide 15

Traffic Patterns

Slide 16

Traffic Patterns

Slide 17

First Principles
● Need more servers to increase capacity ✅
● Elasticity factor
  ○ Seconds? -- Lambda or Cloud Functions? ❌
  ○ Mins? -- Containers? ❌
  ○ Hour? -- Virtual machines, perhaps? ✅
● Cost savings
  ○ Do I need to save costs for extra seconds/mins? ❌
  ○ Do I need to save costs for extra hours? ✅

Slide 18

Solution - Poor Man’s Scaling =)

Slide 19

Solution - Poor man’s scaling (scheduled)
● Jenkins cron -- every 15 mins (Jenkins already used heavily within the team)
● AWS SimpleDB
  ○ JSON documents: service => time & server-count mapping
  ○ Easy GUI to manipulate JSON documents.
● Ruby scripts (flow sketched below)
  ○ Scale up
    ■ Capistrano for deploying the latest code builds.
    ■ Register under the load balancer.
  ○ Scale down
    ■ Fail the liveness probe for the load balancer.
    ■ Wait for the request queue to drain -- Ruby scripts check Passenger queue lengths.
    ■ De-register & shut down.
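The whole job can be pictured as one reconcile loop run by the cron. The original was Ruby; the TypeScript below only sketches the flow, and every `declare`d helper is a hypothetical stand-in for the SimpleDB / Capistrano / ELB / Passenger glue named on the slide.

```typescript
// Flow sketch only -- the original was Ruby scripts on a 15-minute
// Jenkins cron. Every declared helper below is a hypothetical stand-in.

interface ScheduleEntry {
  service: string;      // e.g. "b2c-web" (hypothetical name)
  desiredCount: number; // servers wanted for the current time slot
}

declare function readScheduleFromSimpleDB(now: Date): ScheduleEntry[];
declare function runningServers(service: string): string[]; // server ids
declare function bootAndDeploy(service: string): string;    // Capistrano deploy; returns new id
declare function registerWithELB(service: string, id: string): void;
declare function failHealthCheck(id: string): void;         // LB stops sending new traffic
declare function passengerQueueLength(id: string): number;
declare function deregisterAndShutdown(id: string): void;

async function reconcile(): Promise<void> {
  for (const entry of readScheduleFromSimpleDB(new Date())) {
    const current = runningServers(entry.service);
    if (current.length < entry.desiredCount) {
      // Scale up: deploy the latest build, then take traffic.
      for (let i = current.length; i < entry.desiredCount; i++) {
        registerWithELB(entry.service, bootAndDeploy(entry.service));
      }
    } else {
      // Scale down: drain before shutdown so no request is dropped.
      for (const id of current.slice(entry.desiredCount)) {
        failHealthCheck(id);
        while (passengerQueueLength(id) > 0) {
          await new Promise((r) => setTimeout(r, 5_000)); // poll every 5 s
        }
        deregisterAndShutdown(id);
      }
    }
  }
}
```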

Slide 20

Solution - Poor man’s scaling - Cost?
● 3 days to move to pre-production & 2 days to move to prod.
● 1 engineer.
● Can’t get cheaper than this =))

Slide 21

Solution - Poor man’s scaling - ROI?
● Scaled up to 7X traffic volumes.
● The solution worked for 18 months before we explored containers.
● Saved 40% cost over on-demand instances.
  ○ Rough calculations showed we could have saved an additional ~10% with container-based scaling.
  ○ At what cost?
● Templatized the solution
  ○ QA environment moved to this model.
  ○ Regression & sanity suites invoke the scaling jobs as a prerequisite.
  ○ Each QA group could spawn its environment at will.

Slide 22

Questions?

Slide 23

Online Coordinates
● @pigol1
● [email protected]