Surviving Black Friday - Tales from an e-commerce engineer

Black Friday through Cyber Monday is a critical sales period for US retail companies.

At Glossier, a fast-growing e-commerce brand for skincare and beauty products, this peak holiday traffic can represent over a month of typical sales in a few short days.

In this talk, I’ll share the story of how Glossier’s Tech team prepared for Black Friday 2018, with an emphasis on our technical infrastructure and cross-team coordination.

I’ll present how our Marketing, Logistics, Customer Experience, Data, and Tech teams worked closely to plan for the holiday surge. I’ll share how our Tech team used load testing to ensure we met our target capacity, and how we leveled up our debugging and system engineering skills to fix bugs and remove bottlenecks.

I’ll share our successes, surprises, lessons learned, and how we’re preparing for next year.

Aaron Suggs

April 17, 2019
Transcript

  1. SURVIVING BLACK FRIDAY TALES FROM AN E-COMMERCE ENGINEER

  2. AARON SUGGS @KTHEORY Director of Engineering Glossier, Inc

  3. None
  4. SURVIVING BLACK FRIDAY PEAK HOLIDAY TRAFFIC - THE “13TH MONTH”

    ▸ Weekend after Thanksgiving in the US (Fri-Mon)
    ▸ A typical month’s worth of revenue in a few days
    ▸ Glossier runs our sole 20% off promotion
  5. None
  6. “SITE IS UNDER HIGH LOAD, WOULD APPRECIATE ASSISTANCE”

    VP Eng, via PagerDuty at 12:07am, FRIDAY, NOV 23
  7. PREPARING FOR PEAK SECTION 1

  8. 1. MAKE A TEAM 2. MAKE A PLAN 3. EXECUTE THE PLAN
  9. None
  10. None
  11. THE PLAN: CAPACITY TESTING

  12. CAPACITY TESTING:

    1. DEFINE A TARGET
    2. MEASURE YOUR CAPACITY
    3. REMOVE BOTTLENECKS UNTIL YOU MEET THE TARGET
  16. CAPACITY TESTING 1. DEFINE A TARGET

    ▸ Peak orders / minute
    ▸ Peak page views / minute across homepage, product listing pages (PLPs), and product detail pages (PDPs)
  17. CAPACITY TESTING 1. DEFINE A TARGET - BUT HOW?

    ▸ Used prior data from Oct-Nov 2017
    ▸ Assumed same proportional change in 2018 to make targets
    ▸ Discussed expected customer behavior:
      ▸ Steady growth throughout the morning, peaking ~11am
      ▸ Not a countdown to a flash sale
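One way to read the "same proportional change" assumption is as a simple ratio: measure how much last year's peak exceeded last year's ordinary traffic, then apply that multiplier to this year's ordinary traffic. The sketch below is only an illustration of that arithmetic. The function name and all numbers are hypothetical and do not come from the talk.

```python
def peak_target(prior_baseline: float, prior_peak: float, current_baseline: float) -> float:
    """Scale this year's baseline by last year's baseline-to-peak multiplier.

    All three inputs use the same unit, e.g. orders per minute.
    """
    if prior_baseline <= 0:
        raise ValueError("prior_baseline must be positive")
    uplift = prior_peak / prior_baseline  # how many times busier the prior peak was
    return current_baseline * uplift


# Hypothetical example: last year's peak was 8x an ordinary October minute,
# and this year's ordinary October minute sees 25 orders.
orders_per_minute_target = peak_target(prior_baseline=10, prior_peak=80, current_baseline=25)
print(orders_per_minute_target)
```

The same calculation would be repeated for each metric being targeted, such as page views per minute.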
  18. CAPACITY TESTING 2. MEASURE YOUR CAPACITY

    ▸ flood.io was very helpful; we wrote TypeScript files to model our test.
    ▸ Used production env, Stripe test sandbox
    ▸ Ensured orders were not fulfilled by the warehouse
    ▸ Ensured orders were excluded from biz reporting
    ▸ Pro tip: crowdsource capacity estimates, celebrate the winner
    ▸ BONUS: Revealed several bugs; fixing them made order processing more reliable!
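Running a load test against production means synthetic orders sit next to real ones, so something has to keep them away from the warehouse and out of revenue numbers. The talk does not say how this was done. The sketch below assumes a hypothetical `is_load_test` marker on each order and shows the two filters that such a marker would enable.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int
    is_load_test: bool = False  # hypothetical marker set by the load-test script


def orders_to_fulfill(orders: List[Order]) -> List[Order]:
    """Only real orders are handed to the warehouse."""
    return [o for o in orders if not o.is_load_test]


def reported_revenue_cents(orders: List[Order]) -> int:
    """Business reporting sums real orders only."""
    return sum(o.total_cents for o in orders if not o.is_load_test)


orders = [
    Order("real-1", 2500),
    Order("loadtest-1", 9900, is_load_test=True),
    Order("real-2", 4000),
]
print([o.order_id for o in orders_to_fulfill(orders)])
print(reported_revenue_cents(orders))
```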
  19. CAPACITY TESTING 3. REMOVE BOTTLENECKS UNTIL YOU MEET CAPACITY TARGET

    ▸ Scale up (add capacity)
    ▸ Optimize (use existing capacity more efficiently)
    ▸ Performance improvements only increase capacity if they target the bottleneck.
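The last point can be made concrete with a toy model in which every request passes through every tier, so the system can only go as fast as its slowest tier. This is a simplified sketch. The tier names and requests-per-minute figures are hypothetical.

```python
from typing import Dict


def system_capacity(tier_capacity: Dict[str, int]) -> int:
    """Overall capacity is limited by the slowest tier."""
    return min(tier_capacity.values())


def bottleneck(tier_capacity: Dict[str, int]) -> str:
    """Name of the tier that currently limits the system."""
    return min(tier_capacity, key=tier_capacity.get)


# Hypothetical requests-per-minute limits for each tier.
tiers = {"load_balancer": 5000, "app_servers": 1200, "database": 900}

# Doubling a tier that is not the bottleneck changes nothing.
faster_app = {**tiers, "app_servers": 2400}

# Doubling the bottleneck raises capacity until the next tier becomes the limit.
faster_db = {**tiers, "database": 1800}

print(bottleneck(tiers), system_capacity(tiers))
print(bottleneck(faster_app), system_capacity(faster_app))
print(bottleneck(faster_db), system_capacity(faster_db))
```

In this model the loop of "measure, find the bottleneck, fix it, measure again" stops as soon as `system_capacity` reaches the target.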
  20. CAPACITY TESTING A FRAMEWORK TO IDENTIFY BOTTLENECKS: PICK ONE OF EACH COLUMN

    Compute resources:
      ▸ CPU
      ▸ Disk IO
      ▸ Network IO
    System tiers:
      ▸ Load balancer
      ▸ App servers
      ▸ Cache servers
      ▸ Database
      ▸ External APIs
      ▸ …etc.
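Picking one item from each column amounts to enumerating (tier, resource) pairs and asking which pair is saturated. The sketch below builds that checklist from the two columns on the slide and picks the most saturated pair from a set of measurements. The helper names and utilization figures are hypothetical, and the open-ended "…etc." entry is left out.

```python
from itertools import product
from typing import Dict, List, Tuple

COMPUTE_RESOURCES = ["CPU", "Disk IO", "Network IO"]
SYSTEM_TIERS = ["Load balancer", "App servers", "Cache servers", "Database", "External APIs"]

Pair = Tuple[str, str]


def bottleneck_candidates() -> List[Pair]:
    """Every (tier, resource) combination worth checking during a load test."""
    return list(product(SYSTEM_TIERS, COMPUTE_RESOURCES))


def most_saturated(utilization: Dict[Pair, float]) -> Pair:
    """Return the measured pair with the highest utilization (0.0 to 1.0)."""
    measured = {p: utilization[p] for p in bottleneck_candidates() if p in utilization}
    if not measured:
        raise ValueError("no measurements for any known (tier, resource) pair")
    return max(measured, key=measured.get)


# Hypothetical readings taken while the load test runs at capacity.
readings = {
    ("App servers", "CPU"): 0.60,
    ("Database", "CPU"): 0.95,
    ("Database", "Disk IO"): 0.40,
}
print(len(bottleneck_candidates()))
print(most_saturated(readings))
```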
  21. None
  22. CAPACITY TESTING:

    1. DEFINE A TARGET
    2. MEASURE YOUR CAPACITY
    3. REMOVE BOTTLENECKS UNTIL YOU MEET THE TARGET
  23. PREPARING FOR PEAK MORE OF THE PLAN AND HELPFUL TACTICS

    ▸ Hired a familiar contractor for quick-win checkout optimizations
    ▸ Put all the copy changes and promo code behind a feature flag.
    ▸ Good internal communication:
      ▸ Made a dedicated cross-functional Peak Slack channel
      ▸ Special PagerDuty escalation to get Tech attention quickly.
      ▸ Tech had an hour-by-hour on-call roster for the Peak weekend
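Putting the promotion behind a feature flag means the discount and the matching copy can ship ahead of time and be switched on together at launch. The sketch below shows the general shape of that idea with an in-memory flag store. The flag name, class, and banner strings are hypothetical, and only the 20% discount comes from the talk.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlags:
    """Minimal in-memory flag store; a real one would be backed by shared config."""
    enabled: Set[str] = field(default_factory=set)

    def is_on(self, name: str) -> bool:
        return name in self.enabled


PEAK_PROMO = "peak_promo"  # hypothetical flag name


def price_cents(list_price_cents: int, flags: FeatureFlags) -> int:
    """Apply the 20% promotion only while the flag is on."""
    if flags.is_on(PEAK_PROMO):
        return list_price_cents * 80 // 100
    return list_price_cents


def banner_copy(flags: FeatureFlags) -> str:
    """Promo copy is gated by the same flag so price and messaging stay in sync."""
    return "20% off everything" if flags.is_on(PEAK_PROMO) else "Welcome"


flags = FeatureFlags()
print(price_cents(2000, flags), banner_copy(flags))
flags.enabled.add(PEAK_PROMO)  # flipping the flag launches the promo
print(price_cents(2000, flags), banner_copy(flags))
```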
  24. PREPARING FOR PEAK CAPACITY TESTING RESULTS

    ▸ Exceeded our target by 2-4x
    ▸ Know our bottlenecks at capacity:
      ▸ Checkout bottleneck: DB CPU (from inventory accounting)
      ▸ Page view bottleneck: App CPU
  25. PREPARING FOR PEAK PREPARED MITIGATION TECHNIQUES

    ▸ Disable inventory accounting per SKU
    ▸ Add more app servers (takes ~20 minutes)
    ▸ Vertically scale DB (takes ~15 minutes, 2-5 min downtime)
    ▸ Various feature flags to disable non-critical features.
  26. FRIDAY, NOV 23 SECTION 2

  27. “SITE IS UNDER HIGH LOAD, WOULD APPRECIATE ASSISTANCE”

    VP Eng, via PagerDuty at 12:07am, FRIDAY, NOV 23
  28. 10pm: “T-Minus 2 hours” email

  30. 12:00am: 20% off promo goes live

  31. SYMPTOMS [BIZ LEVEL]: SITE IS SLUGGISH, FREQUENT ERRORS

  32. SYMPTOMS [APP LEVEL]: VERY HIGH PAGE VIEWS, CHECKOUT RATE, TIMEOUT ERRORS
  33. SYMPTOMS [SYS LEVEL]: APP AND DB CPU PEGGED AT 90%+

  34. REMEDIATION:

    1. DISABLE INVENTORY TRACKING
    2. SCALE UP APP SERVERS
    3. SCALE UP DATABASE
  35. KEY LEARNING: PREPARE MITIGATION SCRIPTS AHEAD OF TIME

  36. KEY LEARNING: SCALING RDS POSTGRES UNDER LOAD TAKES LONGER (~20MIN)

  37. 12am-1am: Site barely works

  38. FRIDAY, NOV 23 FIXING BROKEN ORDERS

    ▸ Orders placed from ~12-1am were often in an inconsistent state.
    ▸ Symptoms: missing collateral; uncollected payments
    ▸ Conf call with Customer Experience team (2-3am)
    ▸ Workshopped a fix and customer comms together.
    ▸ Created a script to invoke appropriate order callbacks idempotently.
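Invoking callbacks idempotently means the repair script can be run over the same order any number of times and the end state is the same as running it once. That matters when nobody is sure which steps already succeeded. The sketch below illustrates the pattern with two steps named after the symptoms on the slide. The class, field, and function names are hypothetical and this is not the script from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeakOrder:
    order_id: str
    payment_collected: bool = False
    collateral_added: bool = False


def collect_payment(order: PeakOrder, charge_log: List[str]) -> None:
    """Charge once: skip if the payment was already collected."""
    if order.payment_collected:
        return
    charge_log.append(order.order_id)  # stand-in for a call to the payment provider
    order.payment_collected = True


def add_collateral(order: PeakOrder) -> None:
    """Setting a flag that is already set is harmless, so no guard is needed."""
    order.collateral_added = True


def repair(order: PeakOrder, charge_log: List[str]) -> None:
    """Run every callback; each one is safe to repeat."""
    collect_payment(order, charge_log)
    add_collateral(order)


charges: List[str] = []
broken = PeakOrder("order-42", payment_collected=False, collateral_added=False)
repair(broken, charges)
repair(broken, charges)  # second run is a no-op
print(broken, charges)
```

The guard in `collect_payment` is the important part: the step with an external side effect checks recorded state before acting.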
  39. Rest of Friday:

  40. OVERALL: SUCCESS*

  41. PREPARING FOR NEXT YEAR SECTION 3

  42. PREPARING FOR NEXT YEAR LEARNING REVIEW

    ▸ Midnight behavior was a surprise. Expect it in the future.
    ▸ Should we just pad our target estimates more?
      ▸ NO. This is an anti-pattern. Strive for accuracy.
    ▸ Make architectural improvements that dramatically improve capacity.
  43. PREPARING FOR NEXT YEAR LEARNINGS INFORMED OUR 2019 TECH ROADMAP

    ▸ Dramatically improve reliability and performance:
      ▸ Pre-generated pages (homepage, PLP, PDP)
      ▸ Async checkout flow optimistically takes orders with little resource contention
    ▸ Connects to biz goals: improve conversion and retention.
    ▸ Deeper debugging and systems-thinking expertise
    ▸ Capacity testing is super helpful! Use it often.
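The async checkout idea can be pictured as splitting checkout in two: accepting an order is a cheap append to a queue, and the contended work (here, inventory accounting) happens later in a single worker. The sketch below is a single-process illustration of that split under assumed behavior. The class and method names are hypothetical and the talk does not describe the actual design.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class AsyncCheckout:
    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self.pending: Deque[Tuple[str, str, int]] = deque()
        self.confirmed: List[str] = []
        self.needs_follow_up: List[str] = []

    def accept_order(self, order_id: str, sku: str, qty: int) -> str:
        """Optimistically take the order without touching inventory."""
        self.pending.append((order_id, sku, qty))
        return "accepted"

    def drain(self) -> None:
        """One worker settles orders in arrival order, so stock updates never race."""
        while self.pending:
            order_id, sku, qty = self.pending.popleft()
            if self.stock.get(sku, 0) >= qty:
                self.stock[sku] -= qty
                self.confirmed.append(order_id)
            else:
                # Optimism has a cost: oversold orders need customer follow-up.
                self.needs_follow_up.append(order_id)


checkout = AsyncCheckout({"sku-1": 2})
for oid in ["a", "b", "c"]:
    checkout.accept_order(oid, "sku-1", 1)
checkout.drain()
print(checkout.confirmed, checkout.needs_follow_up, checkout.stock)
```

The trade-off in this sketch is that an accepted order is not yet a guaranteed one, so the follow-up path has to exist before the fast path is turned on.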
  44. SUCCESS! NEW BRAND LAUNCH

  45. THANK YOU! @KTHEORY