DB Migrations equal Pain

Context
• Look is an application for live video streaming
• Backend, iOS and Android clients, admin page, frontend for customers
• Good management
• Good architecture

Context
• 3 environments: develop, qa, production (and local)
• 3 core services:
  ◦ web (aka api)
  ◦ rtmp (video streaming)
  ◦ cent (realtime messaging)

Context
• There are 2 backend developers
• We think about code quality:
  ◦ very strict linter
  ◦ tests: unit and behave
  ◦ deploy in 1 command

Story
• Deployment after 3 months of development
• DB redesign: changed one of the core models to fit business logic
  ◦ Schema migration
  ◦ Data migration
• Statistics on the admin page
• Successfully deployed to dev and qa

Story
• Data migrations ran for 40 minutes:
  ◦ I was ready for it
• Production was down for 5 hours
  ◦ Kernel Panic!
• I deployed the previous version and restored the DB from a snapshot, losing the last 3 hours of data

Plan
• Analyze
• Fix
• Learn the lesson

What were the symptoms?
• Django was not responding to requests at all
• Memory usage was fine
• CPU was fine
• Network was fine
• Actually, Django was responding with HUGE latency
  ◦ the best case was 5 minutes, for the simplest request!

How did we investigate?
• Find bottlenecks:
  ◦ analyze latencies locally (django-silk is the best)
• Fix them one by one
• Test the fixes on the develop environment

How did we fix it?
• Speed up data migrations: 40 minutes → 7 minutes
  ◦ select_related
• Move all long-running tasks to celery tasks
• To prevent a race between celery and django, we run them on separate instances
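
The select_related speed-up comes from replacing one query per row with a single JOIN. The sketch below illustrates that idea with the standard-library sqlite3 module so it runs without a Django project. The stream and author tables, the CountingConnection wrapper and both functions are hypothetical names made up for this illustration, not taken from the Look code base.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many SQL statements are issued."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def build_demo_db():
    """Create an in-memory DB with 5 authors and 50 streams (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE stream (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            author_id INTEGER NOT NULL REFERENCES author(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(i, f"user{i}") for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO stream VALUES (?, ?, ?)",
        [(i, f"stream{i}", (i % 5) + 1) for i in range(1, 51)],
    )
    conn.commit()
    return conn


def titles_naive(db):
    """One query for the streams, then one extra query per stream (N+1)."""
    rows = db.execute(
        "SELECT title, author_id FROM stream ORDER BY id"
    ).fetchall()
    result = []
    for title, author_id in rows:
        (name,) = db.execute(
            "SELECT name FROM author WHERE id = ?", (author_id,)
        ).fetchone()
        result.append((title, name))
    return result


def titles_joined(db):
    """Single JOIN, roughly what select_related makes the ORM emit."""
    return db.execute(
        "SELECT s.title, a.name FROM stream s "
        "JOIN author a ON a.id = s.author_id ORDER BY s.id"
    ).fetchall()
```

In Django terms, the naive loop corresponds to iterating a queryset and touching a foreign key on each object, while the joined version corresponds to calling select_related on that foreign key before iterating. Both functions return the same rows; only the number of round trips to the database differs.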

How did we fix it?
• Simplify admin page
  ◦ Calculate metrics in a periodic celery task
    ▪ every 10 minutes, with a timeout of 1 hour
  ◦ Keep them in the DB
  ◦ Join with the metric table

What do we need to do?
• Zero downtime deployment aka Continuous Deployment

Continuous Deployment
• Blue Green Deployment

Our way
• Use 2 web instances:
  ◦ Current
  ◦ Staging
• Use 2 DB instances:
  ◦ Current
  ◦ Staging

Our way
• Deployment steps:
  ◦ Deploy to staging
  ◦ Run migrations
  ◦ Wait
  ◦ Swap the DNS
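
The four deployment steps can be sketched as a small orchestration function. This is only an illustration of the ordering: the four callables are hypothetical hooks that a real setup would implement with its own tooling, and reading "Wait" as "poll until staging reports ready" is an assumption, since the slide does not say what is being waited for.

```python
import time
from typing import Callable


def blue_green_deploy(
    deploy_to_staging: Callable[[], None],
    run_migrations: Callable[[], None],
    staging_is_ready: Callable[[], bool],
    swap_dns: Callable[[], None],
    max_checks: int = 5,
    delay_seconds: float = 0.0,
) -> bool:
    """Run the steps in order; swap DNS only if staging becomes ready.

    Any exception raised by the deploy or migration step propagates,
    so the swap never happens after a failed migration.
    """
    deploy_to_staging()
    run_migrations()
    for _ in range(max_checks):
        if staging_is_ready():
            swap_dns()
            return True
        time.sleep(delay_seconds)  # the "Wait" step between readiness checks
    return False  # staging never became ready: current keeps serving traffic
```

The point of the ordering is that the current instances keep serving traffic until the very last step, so a slow or failing migration only affects staging.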

The fixes deployment
• Production was down for 4 hours
  ◦ Panic!
• The same symptoms!

The guess
• Look at the whole stack:
  ◦ The DB floods the disk space
  ◦ The free disk space metric has a reverse sawtooth form
• Super hot fix: turn off the metric task
  ◦ The free disk space metric has the same period as the periodic task for calculating metrics
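
The period comparison behind this guess can also be checked numerically instead of by eye. The sketch below is a hypothetical helper, not something from the talk: it takes (minute, free space) samples, finds the moments where free space jumps back up (the vertical edge of the reverse sawtooth) and compares the spacing of those jumps with the task period. The sample format and the thresholds are assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (time in minutes, free disk space)


def jump_times(samples: Sequence[Sample], min_jump: float) -> List[float]:
    """Times at which free space rises by at least min_jump between samples."""
    times = []
    for (_, prev_value), (t, value) in zip(samples, samples[1:]):
        if value - prev_value >= min_jump:
            times.append(t)
    return times


def intervals(times: Sequence[float]) -> List[float]:
    """Distances between consecutive jump times."""
    return [b - a for a, b in zip(times, times[1:])]


def matches_period(
    gaps: Sequence[float], period: float, tolerance: float
) -> bool:
    """True if there is at least one gap and every gap is close to period."""
    return bool(gaps) and all(abs(g - period) <= tolerance for g in gaps)
```

For example, feeding it a synthetic series that loses space steadily and recovers every 10 minutes yields gaps of 10 minutes, matching the 10 minute schedule of the metrics task.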

Investigation
• Use a clone of the production DB
• Run the raw query that collects metrics
  ◦ It ran for 1 hour!
• This is the reason!

How did we fix it?
• The raw query looks like:
  − SELECT DISTINCT
  − 8 LEFT OUTER JOINs
  − 5 COUNTs
  − 3 CASEs
  − GROUP BY user.id
• Use EXPLAIN

How did we fix it?
• We did not try to use the raw query in django
  ◦ There is no reason to do so
• Attempts:
  ◦ Remove metrics that require CASEs
  ◦ Reduce the number of COUNTs and JOINs
  ◦ Remove DISTINCT and fetch row by row
  ◦ Use one query for each metric

How did we fix it?
• The fix is:
  ◦ Use one query for each metric
    ▪ The best performance in the production case
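
Why one query per metric can win is easier to see on a toy schema. In the sketch below, which uses sqlite3 and invented tables (app_user, stream, chat_message), a single query with two LEFT OUTER JOINs multiplies the rows per user and needs DISTINCT inside each COUNT, while two small queries each touch only the rows they need. It is a reduced illustration under those assumptions, not the production query from the talk.

```python
import sqlite3

COMBINED = """
SELECT u.id, COUNT(DISTINCT s.id), COUNT(DISTINCT m.id)
FROM app_user u
LEFT OUTER JOIN stream s ON s.user_id = u.id
LEFT OUTER JOIN chat_message m ON m.user_id = u.id
GROUP BY u.id
"""

PER_METRIC = {
    "streams": """
        SELECT u.id, COUNT(s.id) FROM app_user u
        LEFT OUTER JOIN stream s ON s.user_id = u.id GROUP BY u.id
    """,
    "messages": """
        SELECT u.id, COUNT(m.id) FROM app_user u
        LEFT OUTER JOIN chat_message m ON m.user_id = u.id GROUP BY u.id
    """,
}

JOINED_ROWS = """
SELECT COUNT(*) FROM app_user u
LEFT OUTER JOIN stream s ON s.user_id = u.id
LEFT OUTER JOIN chat_message m ON m.user_id = u.id
"""


def build_demo_db():
    """User 1: 4 streams, 5 messages. User 2: 2 streams. User 3: nothing."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE app_user (id INTEGER PRIMARY KEY);
        CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE chat_message (id INTEGER PRIMARY KEY, user_id INTEGER);
        """
    )
    conn.executemany("INSERT INTO app_user VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO stream (user_id) VALUES (?)", [(1,)] * 4 + [(2,)] * 2
    )
    conn.executemany(
        "INSERT INTO chat_message (user_id) VALUES (?)", [(1,)] * 5
    )
    conn.commit()
    return conn


def metrics_combined(conn):
    """All metrics in one multi-JOIN query."""
    return {
        uid: {"streams": s, "messages": m}
        for uid, s, m in conn.execute(COMBINED)
    }


def metrics_per_query(conn):
    """One small query per metric, merged in Python."""
    result = {}
    for name, sql in PER_METRIC.items():
        for uid, value in conn.execute(sql):
            result.setdefault(uid, {})[name] = value
    return result
```

Both functions return the same numbers. The difference is the intermediate result: running JOINED_ROWS shows how many rows the combined query has to group, and that count grows with the product of related rows per user, which gets worse with every additional JOIN.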

Did it help? Yes

The lesson
• Good management and good architecture matter
• Deploy more frequently
• Do not use data migrations as is; use commands
• Django admin is not efficient for aggregation queries
• Analysis and synthesis matter
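
One way to read "use commands" is to move the data change out of the migration into a routine that can run while the site is up, in small batches, and can be re-run safely. The sketch below shows that shape with sqlite3; the profile table, the display_name column and the upper-casing rule are invented for the example. In a Django project the same loop would typically live in the handle method of a BaseCommand subclass and use the ORM.

```python
import sqlite3


def make_demo_db(rows: int) -> sqlite3.Connection:
    """Hypothetical table where display_name was just added by a schema migration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profile ("
        "id INTEGER PRIMARY KEY, "
        "username TEXT NOT NULL, "
        "display_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO profile (username) VALUES (?)",
        [(f"user{i}",) for i in range(rows)],
    )
    conn.commit()
    return conn


def backfill_display_name(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill display_name in batches; returns the number of rows updated.

    Each batch is committed separately, so locks are held only briefly and
    an interrupted run can simply be started again: rows that are already
    filled are skipped by the IS NULL filter.
    """
    updated = 0
    while True:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM profile WHERE display_name IS NULL "
                "ORDER BY id LIMIT ?",
                (batch_size,),
            )
        ]
        if not ids:
            return updated
        conn.executemany(
            "UPDATE profile SET display_name = upper(username) WHERE id = ?",
            [(i,) for i in ids],
        )
        conn.commit()
        updated += len(ids)
```

Because the schema migration stays small and the backfill is a separate step, the deploy itself no longer has to wait for the data to be rewritten.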

A proof
• I have refactored another core model:
  ◦ A schema migration
  ◦ A command for data migration
• I have deployed it without downtime
• The Look production environment is still alive

Summary
• Analyze
• Fix
• Learn the lesson

References
• https://crystalnix.com/works/look/
• http://martinfowler.com/bliki/BlueGreenDeployment.html
• https://gist.github.com/EvgeneOskin/99880b7b7e0cd2d0115f87b7eeb5ae57

Eugene Oskin
