Upgrade to Pro — share decks privately, control downloads, hide ads and more …

DB Migrations equal Pain

Sponsored · Ship Features Fearlessly Turn features on and off without deploys. Used by thousands of Ruby developers.

DB Migrations equal Pain

Avatar for Eugene Oskin

Eugene Oskin

January 20, 2018
Tweet

More Decks by Eugene Oskin

Other Decks in Programming

Transcript

  1. Context • Look is an application for live video streaming

    • Backend, iOS and Android client, Admin page, frontend for customers • Good management • Good architecture
  2. Context • 3 environments: develop, qa, production (and local) •

    3 core services: ◦ web (aka api) ◦ rtmp (video streaming) ◦ cent (realtime messaging)
  3. Context • There are 2 backend developers • We think

    about code quality: ◦ very strict linter ◦ tests: unit and behave ◦ deploy in 1 command
  4. Story • Deployment after 3 monthes of development • DB

    redesign: changed one of the core models to fit business logic ◦ Schema migration ◦ Data migration • Statistics on the admin page • Successfully deployed to dev and qa
  5. Story • Data migrations was running during 40 minutes: ◦

    I was ready for it • Production was down during 5 hours ◦ Kernel Panic! • I deployed the previous version and restore DB from snapshot – lost last 3 hours of data
  6. What was the symptoms? • Django was not responding to

    request at all • Memory usage was fine • CPU was fine • Network was fine • Actually, Django was responding with HUGE latency ◦ the best case was 5 minutes, to the simplest request!
  7. How did we investigate? • Find bottlenecks: ◦ analyze latencies

    locally – django-silk is the best • Fix them one by one • Test the fixes on the develop environment
  8. How did we fix it? • Speed up data migrations:

    40 minutes → 7 minutes ◦ select_related • Move all long running tasks to celery tasks • To prevent race between celery and django we run them on separate instances
  9. How did we fix it? • Simplify admin page ◦

    Calculate metrics in periodic celery task ▪ each 10 minutes, with timeout 1 hour ◦ Keep in DB ◦ Join with the metric table
  10. Our way • Use 2 web instances: ◦ Current ◦

    Staging • Use 2 DB instances: ◦ Current ◦ Staging
  11. Our way • Deployment steps: ◦ Deploy to staging ◦

    Run migrations ◦ Wait ◦ Swap the DNS
  12. The guess • Look at whole stack: ◦ DB flood

    the disk space ◦ The free disk space metric has reverse sawtooth form • Super hot fix: turn off metric task ◦ The free disk space metric have the same period as the periodic task for calculating metrics
  13. Investigation • Use the production DB clone • Run the

    raw query that collects metrics ◦ It was running 1 hour! • This is the reason!
  14. How did we fix it? • The raw query looks

    like: − SELECT DISTINCT − 8 LEFT OUTER JOINs − 5 COUNTs − 3 CASEs − GROUP BY user.id • Use EXPLAIN
  15. How did we fix it? • We were not trying

    to use the raw query in django ◦ There is no reasons to do so • Attempts: ◦ Remove metrics that requires CASEs ◦ Reduce amount of COUNTs and JOINs ◦ Remove DISTINCT – Fetch row by row ◦ Use one query for each metric
  16. How did we fix it? • The fix is: ◦

    Use one query for each metric ▪ The best performance in the production case
  17. The lesson • Good management and good architecture are matter

    • Deploy more frequently • Do not use data migrations as is – Use commands • Django admin is not efficient for aggregation queries • Analyze and synthesize are matter
  18. A proof • I have refactored another core model: ◦

    A schema migration ◦ A command for data migration • I have deployed it without downtime • Look production environment is still alive