Scaling from First Principles - Session 1

pigol
November 19, 2020

Transcript

  1. Context - Capillary (Circa Aug 2012)

     • Problem Statement
       ◦ ~10K Desktop applications (.NET) talking to the Cloud
       ◦ Control the apps via commands
         ▪ Sync Logs -- the most common ask
         ▪ Refresh the Configuration file
         ▪ Sync the Telemetry data on demand
           • Scheduled pushes already happening
  2. First Principles

     • Need a custom Command Protocol that the Server and Desktop App understand ✅
     • Sync Approach:
       ◦ Poll every 5 sec ⇒ 2000 reqs/sec. Really? ❌ **
       ◦ Long-Polling Connections? ✔
       ◦ Web-Sockets ✅
     • But we want fancy tech!! =)

     ** This number was incorrectly presented as 10K in the talk.
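The polling number above is worth making explicit. A back-of-the-envelope sketch (the client count and interval are the slide's own figures):

```javascript
// ~10K desktop clients, each polling the server every 5 seconds.
const clients = 10_000;
const pollIntervalSec = 5;

// Steady-state request rate the server would have to absorb
// just to ask "any commands for me?" -- mostly for an empty answer.
const requestsPerSec = clients / pollIntervalSec;
console.log(requestsPerSec); // 2000
```

2000 requests per second of almost entirely no-op traffic is why polling was rejected in favor of long-lived connections.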
  3. Solution

     • XMPP
       ◦ Gives persistent/long-lived connections out of the box
       ◦ XMPP Extension Protocols (XEPs)
         ▪ Ad-hoc Commands (https://xmpp.org/extensions/xep-0050.html)
       ◦ What more?
         ▪ We can embed our own chat clients -- that’s so cool!!
         ▪ Inventing a problem that doesn’t exist!
     • eJabberd? → We don’t know Erlang!
     • OpenFire Server -- Java
       ◦ Allows User Plugins ⇒ we can hack Custom Protocols
       ◦ What’s the big deal?
     • .NET client libs for XMPP
  4. Solution - Where did it go wrong?

     • OpenFire documentation ⇒ Sparse!
     • New Technology!
     • Only basic plugins for reference -- had to reverse-engineer from OpenFire code
     • Scaling challenges
       ◦ Concurrency constructs not very obvious ⇒ Poor documentation kills you!
     • Unknown Unknowns galore!
     • Took ~3 months to stabilize for production
     • The developer had already lost steam towards the end
     • It worked until it stopped
     • Facing an Unknown Enemy!
     • It was a mess!
  5. Solution - First Principles

     • A simple Node App on Web-Sockets could have solved this for us
     • Less than 200-300 lines of JS code
     • Could have gone live in less than 3 weeks
     • Low maintenance headache & fewer unknowns!
     • Didn’t have to solve problems that didn’t exist!
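The deck never shows that hypothetical Node app, but the core of it is just a tiny command protocol. The sketch below is illustrative only: the message shape (`cmd`/`args`) and command names are assumptions based on the Capillary use-cases, and the WebSocket wiring (e.g. via the `ws` npm package) is reduced to comments so the dispatch logic stays dependency-free.

```javascript
// Hypothetical command protocol for pushing server commands to desktop
// clients over a WebSocket. Names and shapes are illustrative.
const HANDLERS = {
  sync_logs: () => 'uploading logs',
  refresh_config: () => 'reloading configuration file',
  sync_telemetry: () => 'pushing telemetry snapshot',
};

// Server side: encode a command as a JSON text frame.
function encodeCommand(cmd, args = {}) {
  return JSON.stringify({ cmd, args });
}

// Client side: decode a frame and dispatch to the matching handler.
function dispatch(frame) {
  const { cmd, args } = JSON.parse(frame);
  const handler = HANDLERS[cmd];
  if (!handler) return { ok: false, error: `unknown command: ${cmd}` };
  return { ok: true, result: handler(args) };
}

// With a real WebSocket server (e.g. the `ws` package) the glue is roughly:
//   wss.on('connection', (sock) => sock.send(encodeCommand('sync_logs')));
// and on the client:
//   sock.on('message', (frame) => dispatch(frame.toString()));
```

A couple of hundred lines of this, plus reconnection handling, is the whole surface area -- versus an XMPP server, a plugin model, and a protocol stack built for a different problem.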
  6. Context - TravelTriangle (Circa Dec 2016)

     • A Travel Marketplace
     • 3 types of products
       ◦ Consumer-Facing (B2C)
         ▪ Read-heavy
         ▪ 150K ⇒ 750K unique users/day; 1M ⇒ 6M page views/day
       ◦ Seller-Facing (B2B)
         ▪ Reads & Writes
         ▪ Heavy Listing and Ranking use-cases; long sessions (8-10 hours)
         ▪ 2500 ⇒ 10K DAU
       ◦ Admin Products (Operations Support)
         ▪ Reads & Writes
         ▪ Heavy Listing, Ranking, & Suggestion use-cases; long sessions (8-9 hours)
         ▪ 250 ⇒ 1000 DAU
  7. Tech Stack - TravelTriangle

     • Monolith Ruby on Rails App
       ◦ Separate service deployments for each product
     • MySQL
     • ElasticSearch
     • Redis
     • Message Queues
     • Amazon Web Services
       ◦ 30 EC2 instances -- 16 of them at the application services layer
       ◦ RDS
       ◦ ElastiCache
       ◦ Elastic Load Balancers
     • Cloudinary & Akamai
  8. Problem Statement

     • Assuming your data stores can handle the extra traffic, scale up the application services infra to meet the 5X growth in traffic.
     • Constraints:
       ◦ You have limited $$ at your disposal to spend on Infra.
       ◦ You have only 1 Senior Engineer to spare. :)
     • Welcome to a start-up! =)
  9. Thoughts?

     • Containers (Docker, LXC, et al.)?
     • Cluster Managers (Docker Swarm, Kubernetes - kops)?
     • Auto-Scaling Groups on EC2?
     • New Technologies
       ◦ Learning Curve -- engineers love to tinker with “cool” tech
       ◦ Maintenance cost?
         ▪ Known Unknowns vs Unknown Unknowns!
  10. First Principles

     • Need more servers to increase capacity ✅
     • Elasticity Factor
       ◦ Seconds? -- Lambda or Cloud Functions? ❌
       ◦ Minutes? -- Containers? ❌
       ◦ 30 mins - 1 hour? -- Virtual Machines, perhaps? ✅
  11. First Principles

     • Need more servers to increase capacity ✅
     • Elasticity Factor
       ◦ Seconds? -- Lambda or Cloud Functions? ❌
       ◦ Minutes? -- Containers? ❌
       ◦ Hour? -- Virtual Machines, perhaps? ✅
     • Cost Savings
       ◦ Do I need to save costs for extra seconds/mins? ❌
       ◦ Do I need to save costs for extra hours? ✅
  12. Solution - Poor man’s scaling (scheduled)

     • Jenkins Cron -- every 15 mins (Jenkins already used heavily within the team)
     • AWS SimpleDB
       ◦ JSON documents: Service ⇒ Time & Server Count mapping
       ◦ Easy GUI to manipulate the JSON documents
     • Ruby Scripts
       ◦ Scale-up
         ▪ Capistrano for deploying the latest code builds
         ▪ Register under the Load Balancer
       ◦ Scale-down
         ▪ Fail the liveness probe for the Load Balancer
         ▪ Wait for the request queue to drain -- Ruby scripts check Passenger queue lengths
         ▪ De-register & shut down
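The heart of this scheme is the schedule lookup: each cron tick reads the SimpleDB document and decides how many servers the service should have right now. The actual scripts were Ruby and are not shown in the deck; the sketch below is in JavaScript (to keep one language across these notes), and the document shape, service name, and times are invented for illustration.

```javascript
// Hypothetical shape of a SimpleDB schedule document:
// per service, a map of "HH:MM" => desired server count.
const schedule = {
  b2c_web: { '08:00': 10, '12:00': 16, '22:00': 6 },
};

// Desired count = the entry with the latest time <= now,
// wrapping around midnight to the previous day's last entry.
function desiredCount(doc, service, hhmm) {
  const entries = Object.entries(doc[service]).sort();
  let count = entries[entries.length - 1][1]; // before first entry of the day
  for (const [time, n] of entries) {
    if (time <= hhmm) count = n;
  }
  return count;
}

// A 15-minute Jenkins cron would compare desiredCount() against the
// instances currently registered under the load balancer, then invoke the
// scale-up path (Capistrano deploy + register) or the scale-down path
// (fail the liveness probe, drain Passenger queues, de-register, shut down).
```

Editing capacity then means editing a JSON document in the SimpleDB GUI -- no redeploy, no new infrastructure.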
  13. Solution - Poor man’s scaling - Cost?

     • 3 days to move to pre-production & 2 days to move to prod
     • 1 Engineer
     • Can’t get cheaper than this =))
  14. Solution - Poor man’s scaling - ROI?

     • Scaled up to 7X traffic volumes
     • The solution worked for 18 months before we explored containers
     • Saved 40% cost over on-demand instances
       ◦ Rough calculations showed container-based scaling could have saved an additional ~10%
       ◦ At what cost?
     • Templatized the solution
       ◦ QA environment moved to this model
       ◦ Regression & Sanity suites invoke the scaling jobs as a prerequisite
       ◦ Each QA group could spawn its environment at will