Edmund Lam
February 28, 2025

Zero downtime migration of our distributed task queue system

Set sail with us as we navigate the complex waters of migrating a critical task queue system, all without causing a single ripple of downtime.

This session explores why we switched systems and how feature flags and robust observability helped ensure a seamless transition.

We share the lessons we learned to help guide your own system migrations.

Transcript

  1. Why this talk matters: Systems evolve - Change is inevitable

    • Build it, and it works
    • Requirements change
    • System struggles
    • Change needed
  2. The ASDI metric: Alexandre Sleep Deprivation Index

    Our on-call team: • Alexandre • Adrien • Arthur • Aube • sAm
  3. Setting Non-Functional Requirements

    • Availability: A backlog of tasks should not bring down the system.
    • Observability: It should be easy to see the number of tasks in the queue.
    • Scalability: The system must scale automatically based on task queue usage.
    • Maintainability: It should be easy to manage and extend.
  4. Our Flag Structure

    • Feature flags: Permanent; turn parts of the app on and off
    • Trunk flags: Transient; deploy changes/fixes
    • Ops flags: Long lived; configurations
  5. Problematic Tasks

    The new system brought in new constraints:
    1. No time limit --> 3 minute time limit
    2. Undefined max payload size --> 256 KiB max payload size
    Use observability to detect issues beforehand
  6. What if we need to switch back?

    Actually, this isn’t working, we need to run this in huey
  7. Check the flag and requeue if needed

    Flag Check - Tasks in flight redirected to huey
  8. Horizontal Rollout

    Stages: Before, Phase 1, Phase 2, After
    • Phase 1: Both systems deployed, tasks routed mostly to huey
    • Phase 2: Both systems deployed, tasks migrated to Celery
  9. Phased Rollout

    Stages: Before, Phase 1, Phase 2, After
    • Phase 1: Vertical Rollout to a subset of large instances
    • Phase 2: Horizontal Rollout to Remaining instances
  10. ADRs: Architectural Decision Records

    1. Capture the context of a decision
    2. Provide a decision snapshot
    3. Help teams move forward, even without full consensus
    4. Save time in the long run
    5. Help prepare talks
  11. Wrap up our journey

    • The First System wasn’t a Mistake - it reached its limits (The old ship served us well)
    • Migrations are Risk Management, not just a Tech upgrade (Navigating choppy waters takes careful planning)
    • Feature Flags Give you Control over Change (Adjust the sails to steer the winds)
    • Observability is a Superpower (A captain’s best friend is a good map and compass)
    • You Can’t Predict every Edge Case - So build for Adaptability (Prepare for rogue waves)
    • Documentation helps you make (and defend) decisions (A captain’s log keeps the course accountable)
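
Slide 5 suggests using observability to find problematic tasks before the new limits apply. The deck does not show how this was done, so the following is only a minimal sketch under stated assumptions: the two limits come from the slide, while the `TaskSample` record, the JSON serializer, and the function names are hypothetical and not taken from the talk, huey, or Celery.

```python
import json
from dataclasses import dataclass

# Limits taken from slide 5; everything else in this sketch is hypothetical.
TIME_LIMIT_SECONDS = 3 * 60
MAX_PAYLOAD_BYTES = 256 * 1024  # 256 KiB


@dataclass
class TaskSample:
    """One observed task execution (hypothetical record shape)."""
    name: str
    duration_seconds: float
    payload: dict


def payload_size(payload: dict) -> int:
    """Bytes in the JSON-serialized payload (the serializer is an assumption)."""
    return len(json.dumps(payload).encode("utf-8"))


def find_violations(samples):
    """Return (task name, reason) pairs for samples that would break a limit."""
    violations = []
    for sample in samples:
        if sample.duration_seconds > TIME_LIMIT_SECONDS:
            violations.append((sample.name, "time_limit"))
        if payload_size(sample.payload) > MAX_PAYLOAD_BYTES:
            violations.append((sample.name, "payload_size"))
    return violations


# Made-up samples standing in for metrics collected from the old system.
samples = [
    TaskSample("send_email", 1.2, {"to": "user@example.com"}),
    TaskSample("export_report", 400.0, {"report_id": 7}),
    TaskSample("import_csv", 30.0, {"rows": "x" * (300 * 1024)}),
]

for name, reason in find_violations(samples):
    print(f"{name}: would violate {reason}")
```

Running a check like this against metrics from the old system, before any traffic is moved, gives a list of tasks to split, shorten, or slim down ahead of the migration.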
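
The flag check on slide 7 ("check the flag and requeue if needed") can be illustrated with a small simulation. This is a sketch only: the queues are plain in-memory deques, and the flag name `route_to_celery` and all function names are hypothetical. It does not use the real huey or Celery APIs, and the deck does not describe the actual implementation.

```python
from collections import deque

# Hypothetical in-memory stand-ins; these are not huey or Celery objects.
flags = {"route_to_celery": True}  # flag that can be flipped at runtime
huey_queue = deque()
celery_queue = deque()
executed = []


def enqueue(task_name, payload):
    """Route a new task to a queue according to the current flag value."""
    target = celery_queue if flags["route_to_celery"] else huey_queue
    target.append((task_name, payload))


def run_next_on_celery():
    """Worker step on the new system: re-check the flag before executing."""
    task_name, payload = celery_queue.popleft()
    if not flags["route_to_celery"]:
        # The flag was switched off after this task was enqueued,
        # so hand the in-flight task back to the old system.
        huey_queue.append((task_name, payload))
        return "requeued"
    executed.append(task_name)
    return "executed"


enqueue("resize_image", {"id": 1})
enqueue("send_email", {"id": 2})
print(run_next_on_celery())       # flag is still on
flags["route_to_celery"] = False  # simulate switching back
print(run_next_on_celery())       # in-flight task is redirected
enqueue("export_report", {"id": 3})
print(list(huey_queue))
```

Checking the flag at execution time as well as at enqueue time is what lets tasks already sitting in the new queue follow a rollback, instead of only affecting tasks created after the flag changes.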