Upgrade to Pro — share decks privately, control downloads, hide ads and more …

DB Migrations equal Pain

DB Migrations equal Pain

Eugene Oskin

January 20, 2018
Tweet

More Decks by Eugene Oskin

Other Decks in Programming

Transcript

  1. Context • Look is an application for live video streaming

    • Backend, iOS and Android client, Admin page, frontend for customers • Good management • Good architecture
  2. Context • 3 environments: develop, qa, production (and local) •

    3 core services: ◦ web (aka api) ◦ rtmp (video streaming) ◦ cent (realtime messaging)
  3. Context • There are 2 backend developers • We think

    about code quality: ◦ very strict linter ◦ tests: unit and behave ◦ deploy in 1 command
  4. Story • Deployment after 3 monthes of development • DB

    redesign: changed one of the core models to fit business logic ◦ Schema migration ◦ Data migration • Statistics on the admin page • Successfully deployed to dev and qa
  5. Story • Data migrations was running during 40 minutes: ◦

    I was ready for it • Production was down during 5 hours ◦ Kernel Panic! • I deployed the previous version and restore DB from snapshot – lost last 3 hours of data
  6. What was the symptoms? • Django was not responding to

    request at all • Memory usage was fine • CPU was fine • Network was fine • Actually, Django was responding with HUGE latency ◦ the best case was 5 minutes, to the simplest request!
  7. How did we investigate? • Find bottlenecks: ◦ analyze latencies

    locally – django-silk is the best • Fix them one by one • Test the fixes on the develop environment
  8. How did we fix it? • Speed up data migrations:

    40 minutes → 7 minutes ◦ select_related • Move all long running tasks to celery tasks • To prevent race between celery and django we run them on separate instances
  9. How did we fix it? • Simplify admin page ◦

    Calculate metrics in periodic celery task ▪ each 10 minutes, with timeout 1 hour ◦ Keep in DB ◦ Join with the metric table
  10. Our way • Use 2 web instances: ◦ Current ◦

    Staging • Use 2 DB instances: ◦ Current ◦ Staging
  11. Our way • Deployment steps: ◦ Deploy to staging ◦

    Run migrations ◦ Wait ◦ Swap the DNS
  12. The guess • Look at whole stack: ◦ DB flood

    the disk space ◦ The free disk space metric has reverse sawtooth form • Super hot fix: turn off metric task ◦ The free disk space metric have the same period as the periodic task for calculating metrics
  13. Investigation • Use the production DB clone • Run the

    raw query that collects metrics ◦ It was running 1 hour! • This is the reason!
  14. How did we fix it? • The raw query looks

    like: − SELECT DISTINCT − 8 LEFT OUTER JOINs − 5 COUNTs − 3 CASEs − GROUP BY user.id • Use EXPLAIN
  15. How did we fix it? • We were not trying

    to use the raw query in django ◦ There is no reasons to do so • Attempts: ◦ Remove metrics that requires CASEs ◦ Reduce amount of COUNTs and JOINs ◦ Remove DISTINCT – Fetch row by row ◦ Use one query for each metric
  16. How did we fix it? • The fix is: ◦

    Use one query for each metric ▪ The best performance in the production case
  17. The lesson • Good management and good architecture are matter

    • Deploy more frequently • Do not use data migrations as is – Use commands • Django admin is not efficient for aggregation queries • Analyze and synthesize are matter
  18. A proof • I have refactored another core model: ◦

    A schema migration ◦ A command for data migration • I have deployed it without downtime • Look production environment is still alive