

# Oh Crap, We’re Popular!

# Description/Abstract
The VP of Sales just announced the company closed a major deal that is 100x larger than any previous contract. The excitement pours over you — you can finally expense that Herman Miller chair! However, you quickly realize you’ve never tested the production application at that scale before. What do you do?

Scaling a web application is an exciting, and sometimes nerve-wracking, problem. Modern frameworks buy us a lot of affordances, but we still need to think like engineers to make things scale in a maintainable, and affordable, manner.

We’ll walk through this hypothetical scenario together, taking time to explore options and tradeoffs for major decisions. These topics will include:

* Effective caching
* Keeping your database performing efficiently
* Tradeoffs of moving to microservices
* Security and safety measures

Most importantly, we’ll explore how to know when, and if, it is time for any of these things — and even how we can defer some decisions to keep our team working efficiently. For that, we’ll need to learn:

* How to add and review telemetry for our application
* How to determine key metrics to identify when we’re in trouble
* How to collaborate with the business for best success

Buckle up, because we’re going from zero to web scale in one talk!

Matt Machuga

November 04, 2023



  1. Paraphrased from someone smarter than me…can’t remember who though: “You should teach something important within the first 30 seconds of a talk”
  2. DevOps Today, you are a DevOps Engineer. That is, you

    are an engineer who follows DevOps principles. It’s not a job title.
  3. Microsoft “DevOps is the union of people, process, and technology

    to continually provide value to customers.”
  4. Project Management Institute “As a way of working, agile is

    an iterative approach to work that helps teams deliver value faster and with fewer headaches”
  5. Based on True Stories • Names changed • Numbers conveniently rounded • Simplified to one scenario • Share your solutions! Told through HIMYM characters
  6. HappyHours • 2yrs old • 250 employees • Hosted SaaS Platform • 300 customers in NA Time Tracking Simplified
  7. HappyHours • PHP & Laravel Monolith • PostgreSQL database •

    “RESTful” • Single droplet on Digital Ocean • DB and App on same instance • Datadog & Sentry for telemetry Tech Info
  8. HappyHours • Barney: Head of Engineering • Robin: Staff Engineer • Ted: Head of Product • Lily: Lead Designer • Marshall: Engineer with Obscure Knowledge Product Delivery
  9. Big News! • $3.5m ARR contract with PartyHearty! • 2yr

    commitment • Everything is awesome! The Captain, VP of Sales
  10. Slight Concerns • Current: 5,000 users in NA • Zero

    load testing in the past • No ops team • Roadmap deadlines looming Maybe some mild panic 1 million is a lot more than 5k, Ted
  11. Family Feud Scaling Edition Top 5 answers to scale an app in an emergency: 1. Panic 2. “Throw money at it” 3. Microservices 4. Kubernetes 5. Kafka
  12. Good News! • Gradual rollout over 4 months • Start

    onboarding in 2 weeks • December 1: 50k pilot in NA • January 1: 200k in NA & EU • February 1: 250k globally • March 1: 500k globally Sales bought us some time
  13. What do we know? What can we learn? • During NA business hours (11hrs) • CPU: 80% • Memory: Irrelevant for simplicity • DB: ~40% of CPU, no details • Background workers are most of off-hours traffic • Traffic bursts toward beginning- and end-of-day
  14. Scaling Vertically More resources on the same machine • Machine

    has 4 cores • 32-core is the highest available • 5,000 / 4 = 1,250 users per core • 1,250 * 32 = 40,000 • 40,000 < 55,000
  15. 40% of CPU is DB Can we move the DB off the app machine?
  16. Scaling Vertically More resources on the same machine • 40%

    of the CPU usage is DB • 1,250 users consume 40% per core, so 2,500 users consume 80% per core • 2,500 * 32 = 80,000 users • 80,000 > 55,000
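The two capacity estimates above can be sketched as arithmetic. This is a back-of-the-envelope illustration only: it assumes capacity scales perfectly linearly with cores, which real systems rarely achieve, so treat the numbers as a ceiling rather than a promise.

```python
def users_supported(cores, users_per_core):
    """Linear estimate of how many users a single machine can serve."""
    return cores * users_per_core

TARGET = 55_000  # 50k pilot users plus the existing 5k

# App + DB on one box: 5,000 users saturate 4 cores.
combined = users_supported(32, 5_000 / 4)
print(combined, combined >= TARGET)   # 40000.0 False

# DB moved off-box: the DB ate ~40 of the 80 CPU points, so each core
# serves roughly twice as many users for the app alone.
app_only = users_supported(32, 2 * 5_000 / 4)
print(app_only, app_only >= TARGET)   # 80000.0 True
```

The doubling assumption is exactly the risk slide 17 calls out: the 40% DB share comes from limited telemetry and may not hold under load.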
  17. Risks Nothing is free • Network latency may increase • Database consumption may not actually always be 40%, our telemetry is limited • This won’t scale past the first two weeks
  18. Verifying Testing • Create a new node `test` • Create new independent DB • Populate with Prod Data • Load test with similar traffic
  19. Okay, here is the plan • Team A: Cloning and load testing the vertical scaling solution. • Team B: Determine, document, and possibly automate recreating a production environment • Team C: Improve application telemetry and look for our largest bottlenecks and improvements • Team D: Research easy hosting solution • Barney and Ted will work with stakeholders Distribute work
  20. Team A Second Instance • Cloned the droplet • Changed out keys for sandboxes • Set up a database on AWS, cloned data with `pg_dump` overnight and populated the new instance • Bombarded the application with JMeter and k6 for 4 hours, increasing load until failure. • 66 rps with CPU around 65-70% on average
  21. Team B Reproducible Environment • Can create environment from scratch!

    • All keys and secrets stored in a 1Password Vault • Ubuntu, PHP, Composer up-to-date • All documented in playbook
  22. Team C Telemetry and Improvements • Slow queries and database

    performance on new DB • HTTP request/response info • Request tracing (App -> DB) • Looking for quick wins next
  23. Team D New Hosting Provider • Decision: Deploy with Heroku

    • Reminder: No ops specialty internally • Easy and automatic scaling • Simple permissions and security controls • Can rollout changes quickly
  24. Plan Let’s ship some stuff • Enable maintenance mode • Parallelize • Database • Dump the database • Provision the new DB on Heroku • Load the SQL dump to the new DB • Provision the new 32-core droplet using playbook • All secrets loaded from 1Password • DB refers to the new production DB • Verified can connect to DB from app • Disable maintenance mode • Test
  25. Team C Telemetry and Improvements • Hot spots can be optimized • Percentiles for web request performance • DB performance in prod (5k) is 👍 • DB performance in test (55k) is 😐
  26. Family Feud DB Performance Edition Top 5 wrong answers to optimize database performance: 1. Panic 2. MongoDB 3. Index every permutation of columns 4. “Throw money at it” 5. Kafka
  27. DB Performance What to Consider • Telemetry to identify slow

    queries • Explain/Analyze to identify query planner bottlenecks • Index types • Cache optimization • Unlogged vs. Logged tables • Triggers • Foreign Keys • Paging • Vacuum/Autovacuum tuning • IOPS Allocation • Memory • Scheduling • Clustering • Replication delays • Replication types 🫢🫨😬😳😱🤯
  28. Superstitions, Beliefs, and Instincts No comment on whether they’re right

    • Foreach is better than mapping over collections • For loops are better than foreach loops • Pass by reference vs pass by value • Closures vs. method calls • Single quotes > double quotes • The framework is slow
  29. Usually Safe to Optimize Usually • Separation of concerns •

    Loops with DB calls (N+1) • Can they be joins? • Can we inline the items to be queried? • Decouple IO from inner loops - batch • From slurping to streaming
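The N+1 pattern above is worth seeing concretely. This sketch uses an in-memory "database" stand-in; `fetch_user` and `fetch_users` are illustrative names, not APIs from the talk.

```python
USERS = {1: "Ted", 2: "Robin", 3: "Barney"}
QUERY_COUNT = 0

def fetch_user(user_id):
    """One round-trip per call -- this is what makes the loop N+1."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return USERS[user_id]

def fetch_users(user_ids):
    """A single WHERE id IN (...)-style round-trip for the whole batch."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    return {uid: USERS[uid] for uid in user_ids}

entries = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]

# N+1: one query per loop iteration.
names = [fetch_user(e["user_id"]) for e in entries]
n_plus_one = QUERY_COUNT              # 3 queries for 3 rows

# Batched: inline the items to be queried into one call.
QUERY_COUNT = 0
by_id = fetch_users([e["user_id"] for e in entries])
batched = QUERY_COUNT                 # 1 query regardless of row count
print(n_plus_one, batched)            # 3 1
```

The batched version's query count stays constant as `entries` grows, which is exactly why these loops show up as hot spots only at scale.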
  30. Performance Degradations Less than ideal • CPU is back around

    80% on average • Overall slowdown in response times, not critical • Timeouts at the top of every hour • Support tickets about people unable to sign in periodically • Ultimately everything is functional most of the time • SLA could be at risk • We can’t tell why yet
  31. Change our View Slice up data differently • HTTP Responses • Visualize p50, p95, p99 response times • Visualize responses by status code • Visualize 5xx responses by endpoint • Visualize response times by endpoint • Visualize deployments or infrastructure changes • Visualize which functions consume most CPU time • Visualize slowest and most frequent queries
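The deck uses Datadog for these visualizations, but the p50/p95/p99 idea itself is simple enough to hand-roll. A minimal nearest-rank sketch, with invented sample data, showing why averages hide the slow tail:

```python
def percentile(samples, p):
    """Nearest-rank percentile: value at the p-th percent position."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 fake response times in ms: mostly fast, with a slow tail.
times = [50] * 90 + [400] * 9 + [3000]

print(percentile(times, 50), percentile(times, 95), percentile(times, 99))
# 50 400 400
```

The mean of this data is about 112 ms, which looks healthy; the p95 of 400 ms is what users at the tail actually experience.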
  32. What do we see? Fresh perspective • Most endpoints are fine most of the time • Slowest endpoints: • Manager Dashboard • Some Profile updates • Some Time updates • High CPU consumption: Authentication • Timeouts most commonly begin at the top of every hour; continue for ~2 minutes • Timeouts correlate to beginning and end of reporting runs
  33. Caching Greatness incarnate • Caching is storing data to be retrieved faster later • Expensive queries may be cached (memory, file, Redis, DB) for quicker secondary lookups • Large templates may be cached rather than regenerating • Things that cannot be made cheap every time may be made cheap most of the time
  34. Caching Evil incarnate • Caching is pain and suffering • One of the hardest problems in computing • Cache eviction/busting • Cache lifespan • Caching for the right audience • Eventual consistency
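Both sides of caching can be seen in one minimal read-through cache with a TTL: the win (skip the expensive computation) and the pain (the cached value goes stale until it expires). `expensive_report` is an invented stand-in, not code from the talk.

```python
import time

CACHE = {}  # key -> (expires_at, value)

def cached(key, ttl_seconds, compute):
    now = time.monotonic()
    entry = CACHE.get(key)
    if entry and entry[0] > now:       # fresh hit: no recomputation
        return entry[1]
    value = compute()                  # miss or expired: pay the full cost once
    CACHE[key] = (now + ttl_seconds, value)
    return value

CALLS = 0
def expensive_report():
    global CALLS
    CALLS += 1
    return "report-v%d" % CALLS

first = cached("dashboard", 60, expensive_report)
second = cached("dashboard", 60, expensive_report)
print(first, second, CALLS)  # report-v1 report-v1 1
```

Note the eventual-consistency trade baked in: if the underlying data changes within those 60 seconds, readers keep seeing `report-v1` — choosing that TTL, and deciding who can tolerate staleness, is the hard part.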
  35. Image Upload/Processing Things to know • Uploads are unbounded currently (can be any size) • Profile pictures are processed and optimized on the server (expensive) • Manual time sheet screenshots are sometimes uploaded as evidence • These need to be converted to a standard size and format
  36. Rate Limiting Protection • Protecting your application (and users) from misuse or overuse • Preventing a dangerous number of requests based on some identifier • IP address • User • Token • Tenant (company) • Often per-endpoint • In-app or on-edge
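The identifier-keyed limiting above can be sketched as a fixed-window counter. This is an illustration only — production limiters typically live in framework middleware or at the edge, and often use sliding windows or token buckets instead; the window size and limit here are invented.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
LIMIT = 100
_counts = defaultdict(int)  # (identifier, window index) -> request count

def allow(identifier, now):
    """True if this request fits within the identifier's current window."""
    window = int(now // WINDOW_SECONDS)
    _counts[(identifier, window)] += 1
    return _counts[(identifier, window)] <= LIMIT

# One tenant sends 150 requests inside a single window:
results = [allow("tenant:partyhearty", now=0) for _ in range(150)]
print(sum(results))  # 100 accepted, the remaining 50 rejected
```

The key choice is what `identifier` is: per-IP protects against a single noisy client, while per-tenant keeps one large customer from starving everyone else.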
  37. On the way to 255k Users Things to consider or remedy • Successfully migrated to Heroku, scaled out to multiple nodes • Rate limiting in place - tenant and IP • File uploads limited to 1.5 MB • Image processing moved to background job on a worker node • Database performance has been optimized • Response times are better overall • Caching has improved the Manager Page for many users, but it is still unusable for others
  38. Timeouts Postgres and load balancer protecting themselves • The Manager Dashboard pages are unpaginated • Managers with more than 1000 indirect reports are impacted • Each manager can see their staff and drill down into their time allocations on-screen.
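One common fix for an unpaginated view like this is keyset (cursor) pagination: fetch a bounded page after a cursor instead of every row at once. The data shape and page size below are invented for illustration; the deck doesn't prescribe a specific pagination scheme.

```python
# In-memory stand-in for a manager's (potentially huge) report list.
REPORTS = [{"id": i, "name": "report-%d" % i} for i in range(1, 2501)]

def page_after(cursor_id, page_size=100):
    """Return up to page_size rows with id > cursor_id, plus the next cursor."""
    rows = [r for r in REPORTS if r["id"] > cursor_id][:page_size]
    next_cursor = rows[-1]["id"] if rows else None
    return rows, next_cursor

first_page, cursor = page_after(0)
second_page, _ = page_after(cursor)
print(len(first_page), cursor, second_page[0]["id"])  # 100 100 101
```

In SQL this becomes `WHERE id > :cursor ORDER BY id LIMIT :page_size`, which stays fast on large tables where `OFFSET`-based paging degrades.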
  39. New UX Needed • Design a new flow for managers • Explain our pain points • Avoid loading all data at once • Explain what we are changing in the backend • Feature Flag Lily, Lead Designer
  40. Feature Flags Feature toggles, whatever you’d like to call them • Allow functionality to be enabled/disabled for a subset of users/customers • Essential for trunk-based development • Great for testing new ideas that may be disruptive for customers
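A percentage-rollout flag check can be as small as a hash into a stable bucket, so the same user always sees the same variant. The flag name and 25% rollout below are invented for illustration; real systems usually delegate this to a flag service or library.

```python
import hashlib

FLAGS = {"paginated_dashboard": 25}  # percent of users enabled

def enabled(flag, user_id):
    """Deterministic per-user flag check with percentage rollout."""
    rollout = FLAGS.get(flag, 0)     # unknown flags default to off
    digest = hashlib.sha256(("%s:%s" % (flag, user_id)).encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return bucket < rollout

# Stable per user, not a coin flip per request:
print(enabled("paginated_dashboard", "user-1") ==
      enabled("paginated_dashboard", "user-1"))  # True
```

Hashing on `flag + user` rather than `user` alone keeps bucket assignments independent across flags, so one rollout doesn't always hit the same cohort of users.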
  41. Partner with Product • Don’t change things without Product knowing!

    • Partner to solve the problem • Partner for rollout strategy Collaboration is key
  42. Recap for stage of growth TL;DR • Allow for horizontal

    scaling, autoscaling where possible • Introduce rate limiting to protect your application • Put bounds around unbounded contexts or resources • Move heavy work to background jobs • Explore and visualize your data • Implement caching (carefully) • Have conversations with your partners in Product and Design!
  43. Things to Consider at Global Scale What is acceptable? •

    What expectations do customers/support have for you? • Uptime • Response time • Support • What happens if you’re down while the team is sleeping? • What happens if jobs fail for hours? • What does your team do when failures arise?
  44. On-call Rotations • Use Datadog monitoring to trigger alerts •

    Tools • PagerDuty • Opsgenie • Grafana OnCall • Splunk On-Call • Beware of local laws Also a global concern
  45. Incident Management • Learn from when things go wrong •

    Transparency for the future • Blameless culture of improvement • Determine where your processes are lacking Continuous learning and swift response
  46. CDNs and Locality • CDNs can cache assets and serve them from close to users • Distributed databases allow a similar tactic with data • Host in multiple regions • fly.io is a player in this area for applications Keep things close to users
  47. Update Nearing 500k users • Load on app servers is okay - autoscaling well • Multiple worker nodes are now active for background jobs • New UX flow + pagination made PartyHearty happy. • Database has been getting slower. • Reports take longer to generate • Degrades DB performance • Degrades app performance • Endpoints that use time entry data are slowing over time
  48. Reporting Recomputing continuously • Generated reports, rollups, historical info • Only change at given intervals • Computationally and time intensive • No reason to impact the production database
  49. Followers / Replicas Multiple instances • A follower database is a replica of the primary database • Move load off the primary database • Can be used to separate reads/writes • Can also be used as a failover if needed
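Read/write splitting can be sketched as a tiny router that sends SELECTs to the follower and everything else to the primary. The string "connections" are stand-ins for real connection pools, and the SELECT-prefix heuristic is deliberately crude.

```python
class Router:
    """Route queries to a follower for reads, the primary for writes."""

    def __init__(self, primary, follower):
        self.primary = primary
        self.follower = follower

    def connection_for(self, sql):
        # Beware replication lag: a read that must observe your own
        # just-committed write may still need to go to the primary.
        if sql.lstrip().upper().startswith("SELECT"):
            return self.follower
        return self.primary

router = Router(primary="db-primary", follower="db-follower")
print(router.connection_for("SELECT * FROM time_entries"))   # db-follower
print(router.connection_for("UPDATE time_entries SET ..."))  # db-primary
```

This is why the deck routes the report jobs (pure reads at known intervals) to the follower first: they tolerate a little replication lag and were the heaviest load on the primary.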
  50. Update jobs to use follower DB Better! • Reports generate

    much more quickly • No more lag periods during report runs • Time entry endpoints are still slowing over time
  51. We have 1.5B time entries Queries and indices need to dig through a lot • Rely on Explain/Analyze to find easy wins • Eventually easy wins will stop showing up • Remember that list from earlier?
  52. DB Performance Time to Consider • Telemetry to identify slow

    queries • Explain/Analyze to identify query planner bottlenecks • Index types • Cache optimization • Unlogged vs. Logged tables • Triggers • Foreign Keys • Paging • Vacuum/Autovacuum tuning • IOPS Allocation • Memory • Scheduling • Clustering • Replication delays • Replication types
  53. Deus Ex Machina This slide deck is long enough as-is!

    • Decide to partition time entry data by project keys (hashed) • Lookups can be done by loading projects or by user through projects • Entire table can still be fetched • Some things get harder
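The hashed-project-key routing can be sketched as a pure function from key to partition name. The partition count and table-naming scheme below are invented for illustration; in PostgreSQL this work is typically done declaratively with `PARTITION BY HASH`.

```python
import hashlib

N_PARTITIONS = 16

def partition_for(project_key):
    """Stable partition name for a project: same key, same partition."""
    digest = hashlib.sha256(project_key.encode()).hexdigest()
    return "time_entries_p%02d" % (int(digest, 16) % N_PARTITIONS)

# Lookups by project touch exactly one partition; a whole-table query
# simply visits all N_PARTITIONS of them -- which is one of the ways
# "some things get harder".
print(partition_for("project-42") == partition_for("project-42"))  # True
```

The "lookups by user through projects" path on the slide works because user-to-project membership tells you which partitions to consult, keeping most queries away from the full 1.5B rows.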
  54. Did you notice I didn’t mention microservices? Use cases • Different scaling/deployment needs in distinct areas • Completely orthogonal application concerns • Concrete example • A service that ingests data continuously from a Kafka stream • No API • Single instance • An API service • Serves this ingested data from a database • Minimum 3 instances at any point to support query traffic
  55. Scaling Problems Never End They just change along the way

    • Don’t be attached to any one solution • Don’t be attached to any one language • In general • Use money to solve the problem until you can’t • Outsource what you cannot support • Outsource what is not important to your business
  56. We did it! • Our newest customer is happy •

    Our teams are happy • Our product roadmap can resume • We understand scaling For now!
  57. Glossary Some things I am going to be lazy about saying/typing • ARR: Annual Recurring Revenue • RPS: Requests per Second (throughput) • Droplet: Server instance / virtual machine • Dyno: Server instance / virtual machine • DevOps: A software development paradigm breaking down barriers between Developers and Ops. Does not belong in a job title. • Ops: Short for Operations. Engineers keeping the infrastructure and networking online. Should not be confused with DevOps • Vertical Scaling: Increasing capacity of a server by increasing the resources • Horizontal Scaling: Increasing capacity of a system by adding more instances to it • Telemetry: Metrics that allow us to measure, monitor, and alert on system behavior • IOPS: Input/Output operations per second • Auth: For this talk I specifically mean authentication, not authorization