Surviving Black Friday - Tales from an e-commerce engineer

Black Friday through Cyber Monday is a critical sales period for US retail companies.

At Glossier, a fast-growing e-commerce brand for skincare and beauty products, this peak holiday traffic can represent over a month of typical sales in a few short days.

In this talk, I’ll share the story of how Glossier’s Tech team prepared for Black Friday 2018, with an emphasis on our technical infrastructure and cross-team coordination.

I’ll present how our Marketing, Logistics, Customer Experience, Data, and Tech teams worked closely to plan for the holiday surge. I’ll share how our Tech team used load testing to ensure we met our target capacity, and how we leveled up our debugging and system engineering skills to fix bugs and remove bottlenecks.

I’ll share our successes, surprises, lessons learned, and how we’re preparing for next year.

Aaron Suggs

April 17, 2019

Transcript

  1. SURVIVING BLACK FRIDAY: PEAK HOLIDAY TRAFFIC - THE “13TH MONTH”

    ▸ Weekend after Thanksgiving in the US (Fri-Mon)
    ▸ A typical month’s worth of revenue in a few days
    ▸ Glossier runs our sole 20% off promotion
  2. CAPACITY TESTING:

    1. DEFINE A TARGET
    2. MEASURE YOUR CAPACITY
    3. REMOVE BOTTLENECKS UNTIL YOU MEET THE TARGET
  6. CAPACITY TESTING: 1. DEFINE A TARGET

    ▸ Peak orders / minute
    ▸ Peak page views / minute across homepage, PLPs, and PDPs
  7. CAPACITY TESTING: 1. DEFINE A TARGET - BUT HOW?

    ▸ Used prior data from Oct-Nov 2017
    ▸ Assumed same proportional change in 2018 to make targets
    ▸ Discussed expected customer behavior:
      ▸ Steady growth throughout the morning, peaking ~11am
      ▸ Not a countdown to a flash sale
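
The proportional projection on this slide can be sketched as simple arithmetic. This is a minimal illustration, not the team's actual model: it assumes the "proportional change" means the ratio of the 2017 peak to the 2017 pre-holiday baseline, and every number and name below (`project_peak_target`, the orders-per-minute figures) is hypothetical.

```python
def project_peak_target(prior_baseline: float, prior_peak: float,
                        current_baseline: float) -> float:
    """Scale this year's baseline by last year's baseline-to-peak ratio."""
    if prior_baseline <= 0:
        raise ValueError("prior_baseline must be positive")
    surge_ratio = prior_peak / prior_baseline
    return current_baseline * surge_ratio


# Hypothetical figures in orders per minute (not real Glossier data).
october_2017 = 10.0   # typical pre-holiday rate last year
peak_2017 = 80.0      # busiest minute of last year's peak weekend
october_2018 = 25.0   # typical pre-holiday rate this year

target_2018 = project_peak_target(october_2017, peak_2017, october_2018)
print(f"Target peak: {target_2018:.0f} orders/minute")  # Target peak: 200 orders/minute
```

The same function could be applied separately to page views per minute for each page type, since the targets on the previous slide are tracked per metric.
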
  8. CAPACITY TESTING: 2. MEASURE YOUR CAPACITY

    ▸ flood.io was very helpful; we wrote TypeScript files to model our test.
    ▸ Used production env, Stripe test sandbox
    ▸ Ensured orders were not fulfilled by the warehouse
    ▸ Ensured orders were excluded from biz reporting
    ▸ Pro tip: crowdsource capacity estimates, celebrate the winner
    ▸ BONUS: Revealed several bugs; fixing them made order processing more reliable!
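
Running load tests against production means test orders must be kept out of fulfillment and reporting. The slides do not say how this was done; the sketch below assumes a hypothetical `is_load_test` flag set on each order by the load-test scripts, and filters on it in both places.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int
    is_load_test: bool  # hypothetical flag set by the load-test scripts


def fulfillable(orders: List[Order]) -> List[Order]:
    """Orders the warehouse should actually ship."""
    return [o for o in orders if not o.is_load_test]


def reported_revenue_cents(orders: List[Order]) -> int:
    """Revenue for business reporting, excluding load-test orders."""
    return sum(o.total_cents for o in orders if not o.is_load_test)


orders = [
    Order("real-1", 4000, False),
    Order("loadtest-1", 4000, True),
    Order("real-2", 2500, False),
]

print([o.order_id for o in fulfillable(orders)])  # ['real-1', 'real-2']
print(reported_revenue_cents(orders))             # 6500
```
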
  9. CAPACITY TESTING: 3. REMOVE BOTTLENECKS UNTIL YOU MEET CAPACITY TARGET

    ▸ Scale up (add capacity)
    ▸ Optimize (use existing capacity more efficiently)
    ▸ Performance improvements only increase capacity if they’re the bottleneck.
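
The last point can be illustrated with a toy model in which overall capacity is the minimum of the per-tier capacities. The tier names and requests-per-minute figures are hypothetical, and real systems are rarely this clean, but it shows why optimizing a tier that is not the bottleneck changes nothing.

```python
from typing import Dict, Tuple


def system_capacity(tier_capacity: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck tier and the capacity it limits the system to."""
    bottleneck = min(tier_capacity, key=lambda tier: tier_capacity[tier])
    return bottleneck, tier_capacity[bottleneck]


# Hypothetical capacities in requests per minute.
tiers = {"load_balancer": 5000.0, "app_servers": 1200.0, "database": 800.0}
print(system_capacity(tiers))          # ('database', 800.0)

# Doubling app server throughput does not help: the database still limits us.
optimized_app = {**tiers, "app_servers": 2400.0}
print(system_capacity(optimized_app))  # ('database', 800.0)

# Doubling database capacity helps, and moves the bottleneck to the app servers.
scaled_db = {**tiers, "database": 1600.0}
print(system_capacity(scaled_db))      # ('app_servers', 1200.0)
```
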
  10. CAPACITY TESTING: A FRAMEWORK TO IDENTIFY BOTTLENECKS - PICK ONE FROM EACH COLUMN

    Compute resources:
      ▸ CPU
      ▸ Disk IO
      ▸ Network IO

    System tiers:
      ▸ Load balancer
      ▸ App servers
      ▸ Cache servers
      ▸ Database
      ▸ External APIs
      ▸ …etc.
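
One way to read this framework is that every (system tier, compute resource) pair is a candidate bottleneck to check during a load test. The sketch below enumerates those pairs from the two columns on the slide and picks the most saturated one. The utilization numbers are made up for illustration, and `most_saturated` is a hypothetical helper, not part of any monitoring tool.

```python
from itertools import product
from typing import Dict, Tuple

RESOURCES = ["CPU", "Disk IO", "Network IO"]
TIERS = ["Load balancer", "App servers", "Cache servers", "Database", "External APIs"]

# Every (tier, resource) pair is a bottleneck hypothesis worth checking.
hypotheses = list(product(TIERS, RESOURCES))
print(len(hypotheses))  # 15


def most_saturated(utilization: Dict[Tuple[str, str], float]) -> Tuple[str, str]:
    """Return the (tier, resource) pair with the highest observed utilization."""
    return max(utilization, key=lambda pair: utilization[pair])


# Hypothetical utilization (0.0 to 1.0) observed while the load test ran.
observed = {
    ("Database", "CPU"): 0.97,
    ("App servers", "CPU"): 0.71,
    ("Cache servers", "Network IO"): 0.30,
}
print(most_saturated(observed))  # ('Database', 'CPU')
```
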
  11. CAPACITY TESTING:

    1. DEFINE A TARGET
    2. MEASURE YOUR CAPACITY
    3. REMOVE BOTTLENECKS UNTIL YOU MEET THE TARGET
  12. PREPARING FOR PEAK: MORE OF THE PLAN AND HELPFUL TACTICS

    ▸ Hired a familiar contractor to deliver quick-win checkout optimizations
    ▸ Put all the copy changes and promo code behind a feature flag.
    ▸ Good internal communication:
      ▸ Made a dedicated cross-functional Peak Slack channel
      ▸ Special PagerDuty escalation to get Tech attention quickly.
      ▸ Tech had an hour-by-hour on-call roster for the Peak weekend
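
Putting the promotion behind a feature flag lets copy and pricing change together with a single switch and no deploy. The following is a minimal sketch using an in-memory dictionary as the flag store; the flag name, `is_enabled`, `order_total`, and `banner_copy` are all hypothetical, and a production system would more likely use a flag service or database-backed setting. The 20% discount comes from the promotion described on the first slide.

```python
from decimal import Decimal

# Hypothetical in-memory flag store; off by default.
FLAGS = {"black_friday_promo": False}


def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)


def order_total(subtotal: Decimal) -> Decimal:
    """Apply the 20% promotion only while the flag is on."""
    if is_enabled("black_friday_promo"):
        return (subtotal * Decimal("0.80")).quantize(Decimal("0.01"))
    return subtotal


def banner_copy() -> str:
    """Site copy switches with the same flag as the pricing."""
    if is_enabled("black_friday_promo"):
        return "20% off everything this weekend"
    return "Welcome"


print(order_total(Decimal("50.00")), banner_copy())  # 50.00 Welcome
FLAGS["black_friday_promo"] = True                   # flip the flag at launch
print(order_total(Decimal("50.00")), banner_copy())  # 40.00 20% off everything this weekend
```
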
  13. PREPARING FOR PEAK: CAPACITY TESTING RESULTS

    ▸ Exceeded our target by 2-4x
    ▸ Know our bottlenecks at capacity:
      ▸ Checkout bottleneck: DB CPU (from inventory accounting)
      ▸ Page view bottleneck: App CPU
  14. PREPARING FOR PEAK: PREPARED MITIGATION TECHNIQUES

    ▸ Disable inventory accounting per SKU
    ▸ Add more app servers (takes ~20 minutes)
    ▸ Vertically scale DB (takes ~15 minutes, 2-5 min downtime)
    ▸ Various feature flags to disable non-critical features.
  15. FRIDAY, NOV 23: FIXING BROKEN ORDERS

    ▸ Orders placed from ~12-1am were often in an inconsistent state.
    ▸ Symptoms: missing collateral; uncollected payments
    ▸ Conf call with Customer Experience team (2-3am)
    ▸ Workshopped a fix and customer comms together.
    ▸ Created a script to invoke appropriate order callbacks idempotently.
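
The repair script itself is not shown in the slides. As a rough sketch of what "invoke callbacks idempotently" can mean, each callback below checks whether its work is already done before acting, so the script is safe to run more than once on the same order. The `OrderState` fields and callback names are hypothetical, chosen to match the two symptoms listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderState:
    order_id: str
    payment_collected: bool = False
    collateral_added: bool = False
    log: List[str] = field(default_factory=list)


def collect_payment(order: OrderState) -> None:
    if order.payment_collected:  # already done: do nothing
        return
    order.payment_collected = True
    order.log.append("payment collected")


def add_collateral(order: OrderState) -> None:
    if order.collateral_added:   # already done: do nothing
        return
    order.collateral_added = True
    order.log.append("collateral added")


CALLBACKS = [collect_payment, add_collateral]


def repair(order: OrderState) -> None:
    """Run every callback; each one skips work that was already completed."""
    for callback in CALLBACKS:
        callback(order)


# An order whose payment went through but whose collateral is missing.
broken = OrderState("order-123", payment_collected=True)
repair(broken)
repair(broken)     # a second run changes nothing
print(broken.log)  # ['collateral added']
```
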
  16. PREPARING FOR NEXT YEAR: LEARNING REVIEW

    ▸ Midnight behavior was a surprise. Expect it in the future.
    ▸ Should we just pad our target estimates more?
      ▸ NO. This is an anti-pattern. Strive for accuracy.
    ▸ Make architectural improvements that dramatically improve capacity.
  17. PREPARING FOR NEXT YEAR: LEARNINGS INFORMED OUR 2019 TECH ROADMAP

    ▸ Dramatically improve reliability and performance:
      ▸ Pre-generated pages (homepage, PLP, PDP)
      ▸ Async checkout flow optimistically takes orders with little resource contention
      ▸ Connects to biz goals: improve conversion and retention.
    ▸ Deeper debugging and systems-thinking expertise
    ▸ Capacity testing is super helpful! Use it often.
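
The async checkout idea can be sketched with a queue: the customer-facing path only records the order and returns, while a background worker does the contended work (inventory, payment) afterwards. This is an illustrative toy using Python's standard `queue` and `threading` modules, not the architecture Glossier built; `accept_order` and `worker` are hypothetical names, and a real system would use a durable queue rather than an in-process one.

```python
import queue
import threading
from typing import List, Optional

pending: "queue.Queue[Optional[str]]" = queue.Queue()
processed: List[str] = []


def accept_order(order_id: str) -> str:
    """Fast path: enqueue the order and return immediately."""
    pending.put(order_id)
    return "accepted"


def worker() -> None:
    """Slow path: drain the queue in the background."""
    while True:
        order_id = pending.get()
        if order_id is None:        # sentinel: stop the worker
            pending.task_done()
            break
        processed.append(order_id)  # stand-in for inventory and payment work
        pending.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

for i in range(3):
    accept_order(f"order-{i}")

pending.put(None)  # tell the worker to stop once the queue is drained
pending.join()     # wait until every queued item has been handled
thread.join()
print(processed)   # ['order-0', 'order-1', 'order-2']
```

Because a single worker processes orders one at a time, checkout requests no longer compete with each other for the database; the trade-off is that problems such as insufficient inventory surface after the customer has already been told the order was accepted.
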