

# Oh Crap, We’re Popular!

# Description/Abstract
The VP of Sales just announced the company closed a major deal that is 100x larger than any previous contract. The excitement pours over you — you can finally expense that Herman Miller chair! However, you quickly realize you’ve never tested the production application at that scale before. What do you do?

Scaling a web application is an exciting, and sometimes nerve-wracking, problem. Modern frameworks buy us a lot of affordances, but we still need to think like engineers to make things scale in a maintainable, and affordable, manner.

We’ll walk through this hypothetical scenario together, taking time to explore options and tradeoffs for major decisions. These topics will include:

* Effective caching
* Keeping your database performing efficiently
* Tradeoffs of moving to microservices
* Security and safety measures

Most importantly, we’ll explore how to know when, and if, it is time for any of these things — and even how we can defer some decisions to keep our team working efficiently. For that, we’ll need to learn:

* How to add and review telemetry for our application
* How to determine key metrics to identify when we’re in trouble
* How to collaborate with the business for best success

Buckle up, because we’re going from zero to web scale in one talk!

Matt Machuga

November 04, 2023



  1. Paraphrased from someone smarter than me…can’t remember who though: “You should teach something important within the first 30 seconds of a talk”
  2. DevOps Today, you are a DevOps Engineer. That is, you

    are an engineer who follows DevOps principles. It’s not a job title.
  3. Microsoft “DevOps is the union of people, process, and technology

    to continually provide value to customers.”
  4. Project Management Institute “As a way of working, agile is

    an iterative approach to work that helps teams deliver value faster and with fewer headaches”
  5. Based on True Stories • Names changed • Numbers conveniently rounded • Simplified to one scenario • Share your solutions! Told through HIMYM characters
  6. HappyHours • 2yrs old • 250 employees • Hosted SaaS Platform • 300 customers in NA Time Tracking Simplified
  7. HappyHours • PHP & Laravel Monolith • PostgreSQL database •

    “RESTful” • Single droplet on Digital Ocean • DB and App on same instance • Datadog & Sentry for telemetry Tech Info
  8. HappyHours • Barney: Head of Engineering • Robin: Staff Engineer • Ted: Head of Product • Lily: Lead Designer • Marshall: Engineer with Obscure Knowledge Product Delivery
  9. Big News! • $3.5m ARR contract with PartyHearty! • 2yr

    commitment • Everything is awesome! The Captain, VP of Sales
  10. Slight Concerns • Current: 5,000 users in NA • Zero

    load testing in the past • No ops team • Roadmap deadlines looming Maybe some mild panic 1 million is a lot more than 5k, Ted
  11. Family Feud Scaling Edition Top 5 answers to scale an app in an emergency: 1. Panic 2. “Throw money at it” 3. Microservices 4. Kubernetes 5. Kafka
  12. Good News! • Gradual rollout over 4 months • Start

    onboarding in 2 weeks • December 1: 50k pilot in NA • January 1: 200k in NA & EU • February 1: 250k globally • March 1: 500k globally Sales bought us some time
  13. What do we know? What can we learn? • During NA business hours (11hrs) • CPU: 80% • Memory: Irrelevant for simplicity • DB: ~40% of CPU, no details • Background workers are most of off-hours traffic • Traffic bursts toward beginning- and end-of-day
  14. Scaling Vertically More resources on the same machine • Machine

    has 4 cores • 32-core is the highest available • 5,000 / 4 = 1,250 users per core • 1,250 * 32 = 40,000 • 40,000 < 55,000
  15. 40% of CPU is DB Can we move the DB off the app machine?
  16. Scaling Vertically More resources on the same machine • 40%

    of the CPU usage is DB • 1,250 users consume 40% per core, so 2,500 users consume 80% per core • 2,500 * 32 = 80,000 users • 80,000 > 55,000
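The two capacity estimates above can be sketched as arithmetic. This is a back-of-the-envelope illustration only: it assumes capacity scales perfectly linearly with cores, which real systems rarely achieve, so treat the numbers as a ceiling rather than a promise.

```python
def users_supported(cores, users_per_core):
    """Linear estimate of how many users a single machine can serve."""
    return cores * users_per_core

TARGET = 55_000  # 50k pilot users plus the existing 5k

# App + DB on one box: 5,000 users saturate 4 cores.
combined = users_supported(32, 5_000 / 4)
print(combined, combined >= TARGET)   # 40000.0 False

# DB moved off-box: the DB ate ~40 of the 80 CPU points, so each core
# serves roughly twice as many users for the app alone.
app_only = users_supported(32, 2 * 5_000 / 4)
print(app_only, app_only >= TARGET)   # 80000.0 True
```

The doubling assumption is exactly the risk slide 17 calls out: the 40% DB share comes from limited telemetry and may not hold under load.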
  17. Risks Nothing is free • Network latency may increase • Database consumption may not actually always be 40%, our telemetry is limited • This won’t scale past the first two weeks
  18. Verifying Testing • Create a new node `test` • Create new independent DB • Populate with Prod Data • Load test with similar traffic
  19. Okay, here is the plan • Team A: Cloning and load testing the vertical scaling solution. • Team B: Determine, document, and possibly automate recreating a production environment • Team C: Improve application telemetry and look for our largest bottlenecks and improvements • Team D: Research easy hosting solution • Barney and Ted will work with stakeholders Distribute work
  20. Team A Second Instance • Cloned the droplet • Changed out keys for sandboxes • Set up a database on AWS, cloned data with `pg_dump` overnight and populated the new instance • Bombarded the application with JMeter and k6 for 4 hours, increasing load until failure. • 66 rps with CPU around 65-70% on average
  21. Team B Reproducible Environment • Can create environment from scratch!

    • All keys and secrets stored in a 1Password Vault • Ubuntu, PHP, Composer up-to-date • All documented in playbook
  22. Team C Telemetry and Improvements • Slow queries and database

    performance on new DB • HTTP request/response info • Request tracing (App -> DB) • Looking for quick wins next
  23. Team D New Hosting Provider • Decision: Deploy with Heroku

    • Reminder: No ops specialty internally • Easy and automatic scaling • Simple permissions and security controls • Can rollout changes quickly
  24. Plan Let’s ship some stuff • Enable maintenance mode • Parallelize • Database • Dump the database • Provision the new DB on Heroku • Load the SQL dump to the new DB • Provision the new 32-core droplet using playbook • All secrets loaded from 1Password • DB refers to the new production DB • Verified can connect to DB from app • Disable maintenance mode • Test
  25. Team C Telemetry and Improvements • Hot spots can be optimized • Percentiles for web request performance • DB performance in prod (5k) is 👍 • DB performance in test (55k) is 😐
  26. Family Feud DB Performance Edition Top 5 wrong answers to optimize database performance: 1. Panic 2. MongoDB 3. Index every permutation of columns 4. “Throw money at it” 5. Kafka
  27. DB Performance What to Consider • Telemetry to identify slow

    queries • Explain/Analyze to identify query planner bottlenecks • Index types • Cache optimization • Unlogged vs. Logged tables • Triggers • Foreign Keys • Paging • Vacuum/Autovacuum tuning • IOPS Allocation • Memory • Scheduling • Clustering • Replication delays • Replication types 🫢🫨😬😳😱🤯
  28. Superstitions, Beliefs, and Instincts No comment on whether they’re right

    • Foreach is better than mapping over collections • For loops are better than foreach loops • Pass by reference vs pass by value • Closures vs. method calls • Single quotes > double quotes • The framework is slow
  29. Usually Safe to Optimize Usually • Separation of concerns •

    Loops with DB calls (N+1) • Can they be joins? • Can we inline the items to be queried? • Decouple IO from inner loops - batch • From slurping to streaming
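The N+1 pattern above is worth seeing concretely. This sketch uses an in-memory "database" stand-in; `fetch_user` and `fetch_users` are illustrative names, not APIs from the talk.

```python
USERS = {1: "Ted", 2: "Robin", 3: "Barney"}
QUERY_COUNT = 0

def fetch_user(user_id):
    """One round-trip per call -- this is what makes the loop N+1."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return USERS[user_id]

def fetch_users(user_ids):
    """A single WHERE id IN (...)-style round-trip for the whole batch."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return {uid: USERS[uid] for uid in user_ids}

entries = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]

# N+1: one query per loop iteration.
names = [fetch_user(e["user_id"]) for e in entries]
n_plus_one = QUERY_COUNT              # 3 queries for 3 rows

# Batched: inline the items to be queried into one call.
QUERY_COUNT = 0
by_id = fetch_users([e["user_id"] for e in entries])
batched = QUERY_COUNT                 # 1 query regardless of row count
print(n_plus_one, batched)            # 3 1
```

The batched version's query count stays constant as `entries` grows, which is exactly why these loops show up as hot spots only at scale.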
  30. Performance Degradations Less than ideal • CPU is back around

    80% on average • Overall slowdown in response times, not critical • Timeouts at the top of every hour • Support tickets about people unable to sign in periodically • Ultimately everything is functional most of the time • SLA could be at risk • We can’t tell why yet
  31. Change our View Slice up data differently • HTTP Responses • Visualize p50, p95, p99 response times • Visualize responses by status code • Visualize 5xx responses by endpoint • Visualize response times by endpoint • Visualize deployments or infrastructure changes • Visualize which functions consume most CPU time • Visualize slowest and most frequent queries
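The deck uses Datadog for these visualizations, but the p50/p95/p99 idea itself is simple enough to hand-roll. A minimal nearest-rank sketch, with invented sample data, showing why averages hide the slow tail:

```python
def percentile(samples, p):
    """Nearest-rank percentile: value at the p-th percent position."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 fake response times in ms: mostly fast, with a slow tail.
times = [50] * 90 + [400] * 9 + [3000]

print(percentile(times, 50), percentile(times, 95), percentile(times, 99))
# 50 400 400
```

The mean of this data is about 112 ms, which looks healthy; the p95 of 400 ms is what users at the tail actually experience.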
  32. What do we see? Fresh perspective • Most endpoints are fine most of the time • Slowest endpoints: • Manager Dashboard • Some Profile updates • Some Time updates • High CPU consumption: Authentication • Timeouts most commonly begin at the top of every hour; continue for ~2 minutes • Timeouts correlate to beginning and end of reporting runs
  33. Caching Greatness incarnate • Caching is storing data to be retrieved faster later • Expensive queries may be cached (memory, file, Redis, DB) for quicker secondary lookups • Large templates may be cached rather than regenerating • Things that cannot be made cheap every time may be made cheap most of the time
  34. Caching Evil incarnate • Caching is pain and suffering • One of the hardest problems in computing • Cache eviction/busting • Cache lifespan • Caching for the right audience • Eventual consistency
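Both sides of caching can be seen in one minimal read-through cache with a TTL: the win (skip the expensive computation) and the pain (the cached value goes stale until it expires). `expensive_report` is an invented stand-in, not code from the talk.

```python
import time

CACHE = {}  # key -> (expires_at, value)

def cached(key, ttl_seconds, compute):
    now = time.monotonic()
    entry = CACHE.get(key)
    if entry and entry[0] > now:       # fresh hit: no recomputation
        return entry[1]
    value = compute()                  # miss or expired: pay the full cost once
    CACHE[key] = (now + ttl_seconds, value)
    return value

CALLS = 0
def expensive_report():
    global CALLS
    CALLS += 1
    return "report-v%d" % CALLS

first = cached("dashboard", 60, expensive_report)
second = cached("dashboard", 60, expensive_report)
print(first, second, CALLS)  # report-v1 report-v1 1
```

Note the eventual-consistency trade baked in: if the underlying data changes within those 60 seconds, readers keep seeing `report-v1` — choosing that TTL, and deciding who can tolerate staleness, is the hard part.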
  35. Image Upload/Processing Things to know • Uploads are unbounded currently (can be any size) • Profile pictures are processed and optimized on the server (expensive) • Manual time sheet screenshots are sometimes uploaded as evidence • These need to be converted to a standard size and format
  36. Rate Limiting Protection • Protecting your application (and users) from misuse or overuse • Preventing a dangerous number of requests based on some identifier • IP address • User • Token • Tenant (company) • Often per-endpoint • In-app or on-edge
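The identifier-keyed limiting above can be sketched as a fixed-window counter. This is an illustration only — production limiters typically live in framework middleware or at the edge, and often use sliding windows or token buckets instead; the window size and limit here are invented.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
LIMIT = 100
_counts = defaultdict(int)  # (identifier, window index) -> request count

def allow(identifier, now):
    """True if this request fits within the identifier's current window."""
    window = int(now // WINDOW_SECONDS)
    _counts[(identifier, window)] += 1
    return _counts[(identifier, window)] <= LIMIT

# One tenant sends 150 requests inside a single window:
results = [allow("tenant:partyhearty", now=0) for _ in range(150)]
print(sum(results))  # 100 accepted, the remaining 50 rejected
```

The key choice is what `identifier` is: per-IP protects against a single noisy client, while per-tenant keeps one large customer from starving everyone else.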
  37. On the way to 255k Users Things to consider or remedy • Successfully migrated to Heroku, scaled out to multiple nodes • Rate limiting in place - tenant and IP • File uploads limited to 1.5 MB • Image processing moved to background job on a worker node • Database performance has been optimized • Response times are better overall • Caching has improved the Manager Page for many users, but it is still unusable for others
  38. Timeouts Postgres and load balancer protecting themselves • The Manager Dashboard pages are unpaginated • Managers with more than 1000 indirect reports are impacted • Each manager can see their staff and drill down into their time allocations on-screen.
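One common fix for an unpaginated view like this is keyset (cursor) pagination: fetch a bounded page after a cursor instead of every row at once. The data shape and page size below are invented for illustration; the deck doesn't prescribe a specific pagination scheme.

```python
# In-memory stand-in for a manager's (potentially huge) report list.
REPORTS = [{"id": i, "name": "report-%d" % i} for i in range(1, 2501)]

def page_after(cursor_id, page_size=100):
    """Return up to page_size rows with id > cursor_id, plus the next cursor."""
    rows = [r for r in REPORTS if r["id"] > cursor_id][:page_size]
    next_cursor = rows[-1]["id"] if rows else None
    return rows, next_cursor

first_page, cursor = page_after(0)
second_page, _ = page_after(cursor)
print(len(first_page), cursor, second_page[0]["id"])  # 100 100 101
```

In SQL this becomes `WHERE id > :cursor ORDER BY id LIMIT :page_size`, which stays fast on large tables where `OFFSET`-based paging degrades.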
  39. New UX Needed • Design a new flow for managers • Explain our pain points • Avoid loading all data at once • Explain what we are changing in the backend • Feature Flag Lily, Lead Designer
  40. Feature Flags Feature toggles, whatever you’d like to call them • Allow functionality to be enabled/disabled for a subset of users/customers • Essential for trunk-based development • Great for testing new ideas that may be disruptive for customers
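A percentage-rollout flag check can be as small as a hash into a stable bucket, so the same user always sees the same variant. The flag name and 25% rollout below are invented for illustration; real systems usually delegate this to a flag service or library.

```python
import hashlib

FLAGS = {"paginated_dashboard": 25}  # percent of users enabled

def enabled(flag, user_id):
    """Deterministic per-user flag check with percentage rollout."""
    rollout = FLAGS.get(flag, 0)     # unknown flags default to off
    digest = hashlib.sha256(("%s:%s" % (flag, user_id)).encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return bucket < rollout

# Stable per user, not a coin flip per request:
print(enabled("paginated_dashboard", "user-1") ==
      enabled("paginated_dashboard", "user-1"))  # True
```

Hashing on `flag + user` rather than `user` alone keeps bucket assignments independent across flags, so one rollout doesn't always hit the same cohort of users.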
  41. Partner with Product • Don’t change things without Product knowing!

    • Partner to solve the problem • Partner for rollout strategy Collaboration is key
  42. Recap for stage of growth TL;DR • Allow for horizontal

    scaling, autoscaling where possible • Introduce rate limiting to protect your application • Put bounds around unbounded contexts or resources • Move heavy work to background jobs • Explore and visualize your data • Implement caching (carefully) • Have conversations with your partners in Product and Design!
  43. Things to Consider at Global Scale What is acceptable? •

    What expectations do customers/support have for you? • Uptime • Response time • Support • What happens if you’re down while the team is sleeping? • What happens if jobs fail for hours? • What does your team do when failures arise?
  44. On-call Rotations • Use Datadog monitoring to trigger alerts •

    Tools • PagerDuty • Opsgenie • Grafana OnCall • Splunk On-Call • Beware of local laws Also a global concern
  45. Incident Management • Learn from when things go wrong •

    Transparency for the future • Blameless culture of improvement • Determine where your processes are lacking Continuous learning and swift response
  46. CDNs and Locality • CDNs can cache assets and serve them from close to users • Distributed databases allow a similar tactic with data • Host in multiple regions • fly.io is a player in this area for applications Keep things close to users
  47. Update Nearing 500k users • Load on app servers is okay - autoscaling well • Multiple worker nodes are now active for background jobs • New UX flow + pagination made PartyHearty happy. • Database has been getting slower. • Reports take longer to generate • Degrades DB performance • Degrades app performance • Endpoints that use time entry data are slowing over time
  48. Reporting Recomputing continuously • Generated reports, rollups, historical info • Only change at given intervals • Computationally and time intensive • No reason to impact the production database
  49. Followers / Replicas Multiple instances • A follower database is a replica of the primary database • Move load off the primary database • Can be used to separate reads/writes • Can also be used as a failover if needed
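Read/write splitting can be sketched as a tiny router that sends SELECTs to the follower and everything else to the primary. The string "connections" are stand-ins for real connection pools, and the SELECT-prefix heuristic is deliberately crude.

```python
class Router:
    """Route queries to a follower for reads, the primary for writes."""

    def __init__(self, primary, follower):
        self.primary = primary
        self.follower = follower

    def connection_for(self, sql):
        # Beware replication lag: a read that must observe your own
        # just-committed write may still need to go to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return self.follower
        return self.primary

router = Router(primary="db-primary", follower="db-follower")
print(router.connection_for("SELECT * FROM time_entries"))   # db-follower
print(router.connection_for("UPDATE time_entries SET ..."))  # db-primary
```

This is why the deck routes the report jobs (pure reads at known intervals) to the follower first: they tolerate a little replication lag and were the heaviest load on the primary.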
  50. Update jobs to use follower DB Better! • Reports generate

    much more quickly • No more lag periods during report runs • Time entry endpoints are still slowing over time
  51. We have 1.5B time entries Queries and indices need to dig through a lot • Rely on Explain/Analyze to find easy wins • Eventually easy wins will stop showing up • Remember that list from earlier?
  52. DB Performance Time to Consider • Telemetry to identify slow

    queries • Explain/Analyze to identify query planner bottlenecks • Index types • Cache optimization • Unlogged vs. Logged tables • Triggers • Foreign Keys • Paging • Vacuum/Autovacuum tuning • IOPS Allocation • Memory • Scheduling • Clustering • Replication delays • Replication types
  53. Deus Ex Machina This slide deck is long enough as-is!

    • Decide to partition time entry data by project keys (hashed) • Lookups can be done by loading projects or by user through projects • Entire table can still be fetched • Some things get harder
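The hashed-project-key routing can be sketched as a pure function from key to partition name. The partition count and table-naming scheme below are invented for illustration; in PostgreSQL this work is typically done declaratively with `PARTITION BY HASH`.

```python
import hashlib

N_PARTITIONS = 16

def partition_for(project_key):
    """Stable partition name for a project: same key, same partition."""
    digest = hashlib.sha256(project_key.encode()).hexdigest()
    return "time_entries_p%02d" % (int(digest, 16) % N_PARTITIONS)

# Lookups by project touch exactly one partition; a whole-table query
# simply visits all N_PARTITIONS of them -- which is one of the ways
# "some things get harder".
print(partition_for("project-42") == partition_for("project-42"))  # True
```

The "lookups by user through projects" path on the slide works because user-to-project membership tells you which partitions to consult, keeping most queries away from the full 1.5B rows.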
  54. Did you notice I didn’t mention microservices? Use cases • Different scaling/deployment needs in distinct areas • Completely orthogonal application concerns • Concrete example • A service that ingests data continuously from a Kafka stream • No API • Single instance • An API service • Serves this ingested data from a database • Minimum 3 instances at any point to support query traffic
  55. Scaling Problems Never End They just change along the way

    • Don’t be attached to any one solution • Don’t be attached to any one language • In general • Use money to solve the problem until you can’t • Outsource what you cannot support • Outsource what is not important to your business
  56. We did it! • Our newest customer is happy •

    Our teams are happy • Our product roadmap can resume • We understand scaling For now!
  57. Glossary Some things I am going to be lazy about saying/typing • ARR: Annual Recurring Revenue • RPS: Requests per Second (throughput) • Droplet: Server instance / virtual machine • Dyno: Server instance / virtual machine • DevOps: A software development paradigm breaking down barriers between Developers and Ops. Does not belong in a job title. • Ops: Short for Operations. Engineers keeping the infrastructure and networking online. Should not be confused with DevOps • Vertical Scaling: Increasing capacity of a server by increasing the resources • Horizontal Scaling: Increasing capacity of a system by adding more instances to it • Telemetry: Metrics that allow us to measure, monitor, and alert on system behavior • IOPS: Input/Output operations per second • Auth: For this talk I specifically mean authentication, not authorization