Scaling from First Principles - Session 1

pigol
November 19, 2020

Transcript

  1. Context - Capillary (Circa Aug 2012)

     • Problem Statement
       ◦ ~10K Desktop applications (.NET) talking to the Cloud
       ◦ Control the apps via commands
         ▪ Sync Logs -- the most common ask
         ▪ Refresh the Configuration file
         ▪ Sync the Telemetry data on demand
           • Scheduled pushes already happening
  2. First Principles

     • Need a custom Command Protocol that the Server and Desktop App understand ✅
     • Sync Approach:
       ◦ Poll every 5 sec ⇒ 2000 reqs/sec. Really? ❌ **
       ◦ Long-Polling Connections? ✔
       ◦ Web-Sockets ✅
     • But we want fancy tech!! =)

     ** This number was incorrectly presented as 10K in the talk.
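The polling number above is worth making explicit. A back-of-the-envelope sketch (the client count and interval are the slide's own figures):

```javascript
// ~10K desktop clients, each polling the server every 5 seconds.
const clients = 10_000;
const pollIntervalSec = 5;

// Steady-state request rate the server would have to absorb
// just to ask "any commands for me?" -- mostly for an empty answer.
const requestsPerSec = clients / pollIntervalSec;
console.log(requestsPerSec); // 2000
```

2000 requests per second of almost entirely no-op traffic is why polling was rejected in favor of long-lived connections.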
  3. Solution

     • XMPP
       ◦ Gives persistent/long-lived connections out of the box
       ◦ XMPP Extension Protocols (XEPs)
         ▪ Ad-hoc Commands (https://xmpp.org/extensions/xep-0050.html)
       ◦ What more?
         ▪ We can embed our own chat clients -- that’s so cool!!
         ▪ Inventing a problem that doesn’t exist!
     • eJabberd? → We don’t know Erlang!
     • OpenFire Server -- Java
       ◦ Allows User Plugins ⇒ we can hack Custom Protocols
       ◦ What’s the big deal?
     • .NET client libs for XMPP
  4. Solution - Where did it go wrong?

     • OpenFire documentation ⇒ Sparse!
     • New Technology!
     • Only basic plugins for reference -- had to reverse-engineer from OpenFire code
     • Scaling challenges
       ◦ Concurrency constructs not very obvious ⇒ Poor documentation kills you!
     • Unknown Unknowns galore!
     • Took ~3 months to stabilize for production
     • The developer had already lost steam towards the end
     • It worked until it stopped
     • Facing an Unknown Enemy!
     • It was a mess!
  5. Solution - First Principles

     • A simple Node App on Web-Sockets could have solved this for us
     • Less than 200-300 lines of JS code
     • Could have gone live in less than 3 weeks
     • Low maintenance headache & fewer unknowns!
     • Didn’t have to solve problems that didn’t exist!
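The deck never shows that hypothetical Node app, but the core of it is just a tiny command protocol. The sketch below is illustrative only: the message shape (`cmd`/`args`) and command names are assumptions based on the Capillary use-cases, and the WebSocket wiring (e.g. via the `ws` npm package) is reduced to comments so the dispatch logic stays dependency-free.

```javascript
// Hypothetical command protocol for pushing server commands to desktop
// clients over a WebSocket. Names and shapes are illustrative.
const HANDLERS = {
  sync_logs: () => 'uploading logs',
  refresh_config: () => 'reloading configuration file',
  sync_telemetry: () => 'pushing telemetry snapshot',
};

// Server side: encode a command as a JSON text frame.
function encodeCommand(cmd, args = {}) {
  return JSON.stringify({ cmd, args });
}

// Client side: decode a frame and dispatch to the matching handler.
function dispatch(frame) {
  const { cmd, args } = JSON.parse(frame);
  const handler = HANDLERS[cmd];
  if (!handler) return { ok: false, error: `unknown command: ${cmd}` };
  return { ok: true, result: handler(args) };
}

// With a real WebSocket server (e.g. the `ws` package) the glue is roughly:
//   wss.on('connection', (sock) => sock.send(encodeCommand('sync_logs')));
// and on the client:
//   sock.on('message', (frame) => dispatch(frame.toString()));
```

A couple of hundred lines of this, plus reconnection handling, is the whole surface area -- versus an XMPP server, a plugin model, and a protocol stack built for a different problem.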
  6. Context - TravelTriangle (Circa Dec 2016)

     • A Travel Marketplace
     • 3 types of products
       ◦ Consumer-Facing (B2C)
         ▪ Read-heavy
         ▪ 150K ⇒ 750K unique users/day; 1M ⇒ 6M page views/day
       ◦ Seller-Facing (B2B)
         ▪ Reads & Writes
         ▪ Heavy Listing and Ranking use-cases; long sessions (8-10 hours)
         ▪ 2500 ⇒ 10K DAU
       ◦ Admin Products (Operations Support)
         ▪ Reads & Writes
         ▪ Heavy Listing, Ranking, & Suggestion use-cases; long sessions (8-9 hours)
         ▪ 250 ⇒ 1000 DAU
  7. Tech Stack - TravelTriangle

     • Monolith Ruby on Rails App
       ◦ Separate service deployments for each product
     • MySQL
     • ElasticSearch
     • Redis
     • Message Queues
     • Amazon Web Services
       ◦ 30 EC2 instances -- 16 of them at the application services layer
       ◦ RDS
       ◦ ElastiCache
       ◦ Elastic Load Balancers
     • Cloudinary & Akamai
  8. Problem Statement

     • Assuming your data stores can handle the extra traffic, scale up the application services infra to meet the 5X growth in traffic.
     • Constraints:
       ◦ You have limited $$ at your disposal to spend on Infra.
       ◦ You have only 1 Senior Engineer to spare. :)
     • Welcome to a start-up! =)
  9. Thoughts?

     • Containers (Docker, LXC, et al.)?
     • Cluster Managers (Docker Swarm, Kubernetes - kops)?
     • Auto-Scaling Groups on EC2?
     • New Technologies
       ◦ Learning Curve -- engineers love to tinker with “cool” tech
       ◦ Maintenance cost?
         ▪ Known Unknowns vs Unknown Unknowns!
  10. First Principles

     • Need more servers to increase capacity ✅
     • Elasticity Factor
       ◦ Seconds? -- Lambda or Cloud Functions? ❌
       ◦ Minutes? -- Containers? ❌
       ◦ 30 mins - 1 hour? -- Virtual Machines, perhaps? ✅
  11. First Principles

     • Need more servers to increase capacity ✅
     • Elasticity Factor
       ◦ Seconds? -- Lambda or Cloud Functions? ❌
       ◦ Minutes? -- Containers? ❌
       ◦ Hour? -- Virtual Machines, perhaps? ✅
     • Cost Savings
       ◦ Do I need to save costs for extra seconds/mins? ❌
       ◦ Do I need to save costs for extra hours? ✅
  12. Solution - Poor man’s scaling (scheduled)

     • Jenkins Cron -- every 15 mins (Jenkins already used heavily within the team)
     • AWS SimpleDB
       ◦ JSON documents: Service ⇒ Time & Server Count mapping
       ◦ Easy GUI to manipulate the JSON documents
     • Ruby Scripts
       ◦ Scale-up
         ▪ Capistrano for deploying the latest code builds
         ▪ Register under the Load Balancer
       ◦ Scale-down
         ▪ Fail the liveness probe for the Load Balancer
         ▪ Wait for the request queue to drain -- Ruby scripts check Passenger queue lengths
         ▪ De-register & shut down
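The heart of this scheme is the schedule lookup: each cron tick reads the SimpleDB document and decides how many servers the service should have right now. The actual scripts were Ruby and are not shown in the deck; the sketch below is in JavaScript (to keep one language across these notes), and the document shape, service name, and times are invented for illustration.

```javascript
// Hypothetical shape of a SimpleDB schedule document:
// per service, a map of "HH:MM" => desired server count.
const schedule = {
  b2c_web: { '08:00': 10, '12:00': 16, '22:00': 6 },
};

// Desired count = the entry with the latest time <= now,
// wrapping around midnight to the previous day's last entry.
function desiredCount(doc, service, hhmm) {
  const entries = Object.entries(doc[service]).sort();
  let count = entries[entries.length - 1][1]; // before first entry of the day
  for (const [time, n] of entries) {
    if (time <= hhmm) count = n;
  }
  return count;
}

// A 15-minute Jenkins cron would compare desiredCount() against the
// instances currently registered under the load balancer, then invoke the
// scale-up path (Capistrano deploy + register) or the scale-down path
// (fail the liveness probe, drain Passenger queues, de-register, shut down).
```

Editing capacity then means editing a JSON document in the SimpleDB GUI -- no redeploy, no new infrastructure.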
  13. Solution - Poor man’s scaling - Cost?

     • 3 days to move to pre-production & 2 days to move to prod
     • 1 Engineer
     • Can’t get cheaper than this =))
  14. Solution - Poor man’s scaling - ROI?

     • Scaled up to 7X traffic volumes
     • The solution worked for 18 months before we explored containers
     • Saved 40% cost over on-demand instances
       ◦ Rough calculations showed container-based scaling could have saved an additional ~10%
       ◦ At what cost?
     • Templatized the solution
       ◦ QA environment moved to this model
       ◦ Regression & Sanity suites invoke the scaling jobs as a prerequisite
       ◦ Each QA group could spawn its environment at will