MongoDB - Operations Best Practices

Tales from the field

Today’s Plan
•  Review
•  Analyze
•  Distill
•  Summarize

Scenario 1
•  Fire, it is on fire!
•  Users notice response times of 1-3 sec
•  App logs show timeouts
•  Server logs show socket exceptions

Scenario 1 - Diagnostics
•  Logs
•  Understanding the timeouts
    •  Client read timeout set
    •  Connection closed/discarded
    •  Symptom not cause
•  Server connection exceptions
    •  Match timing of client timeouts
    •  Symptom not cause
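One way to check that the server-side socket exceptions match the timing of the client timeouts is to pair the two event streams by timestamp. The sketch below is only an illustration: `correlateEvents`, the 200 ms window, and the sample timestamps are all hypothetical and would have to be replaced by values parsed from your own application and server logs.

```javascript
// Hypothetical helper: pairs each client timeout with the server-side socket
// exceptions logged within `windowMs` of it. Timestamps are epoch milliseconds.
function correlateEvents(clientTimeouts, serverExceptions, windowMs) {
  return clientTimeouts.map((t) => ({
    clientTimeout: t,
    serverExceptions: serverExceptions.filter((s) => Math.abs(s - t) <= windowMs),
  }));
}

// Made-up sample timestamps, for illustration only.
const clientTimeouts = [1000, 5000, 9000];
const serverExceptions = [1020, 5100, 20000];

const matches = correlateEvents(clientTimeouts, serverExceptions, 200);
const matchedCount = matches.filter((m) => m.serverExceptions.length > 0).length;
console.log(`${matchedCount} of ${clientTimeouts.length} client timeouts have a nearby server exception`);
```

A high match rate supports the reading on the slide: the server exceptions are the other side of the client giving up, so both are symptoms and the cause lies elsewhere.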

Scenario 1 - Monitoring
•  Graphs speak a thousand words

Scenario 1 - Takeaways
•  Monitor Logs
    •  Alert, escalate
    •  Correlate
•  Disk
    •  Monitor
    •  Moved to RAID (10)
•  Instrument/Monitor App
    •  Know your application and its (write) characteristics

Scenario 2
•  Alerts warn that server is running hot
•  Random (small) slowdowns
•  Increased traffic/queries

Scenario 2 - Symptoms
•  High CPU use
•  Similar query pattern

Scenario 2 - Diagnostics
•  Turn on DB Profiling
•  Look at logs
•  Identify query patterns
    •  taking longest
    •  highest frequency
•  Run: <query>.explain()
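Identifying the patterns that take longest and occur most often can be sketched as a small grouping step over profiler entries. This is an assumption-laden illustration: `summarizeProfile` is a hypothetical helper, the sample entries are made up, and the `ns`, `query`, and `millis` field names follow the older profiler document layout, which may differ in your MongoDB version. In the mongo shell the entries might come from `db.system.profile.find().toArray()` after enabling profiling with `db.setProfilingLevel(1, 100)`.

```javascript
// Groups profiler-style entries by namespace and query shape (the sorted
// field names of the query), then totals count and time for each group.
function summarizeProfile(entries) {
  const groups = new Map();
  for (const e of entries) {
    const shape = Object.keys(e.query || {}).sort().join(",");
    const key = `${e.ns} {${shape}}`;
    const g = groups.get(key) || { pattern: key, count: 0, totalMillis: 0, maxMillis: 0 };
    g.count += 1;
    g.totalMillis += e.millis;
    g.maxMillis = Math.max(g.maxMillis, e.millis);
    groups.set(key, g);
  }
  return [...groups.values()];
}

// Made-up entries, for illustration only.
const sample = [
  { ns: "app.s2", query: { a: 1, b: 2 }, millis: 99 },
  { ns: "app.s2", query: { b: 5, a: 7 }, millis: 120 },
  { ns: "app.users", query: { email: "x" }, millis: 300 },
];

const summary = summarizeProfile(sample);
const slowest = [...summary].sort((x, y) => y.maxMillis - x.maxMillis)[0];
const mostFrequent = [...summary].sort((x, y) => y.count - x.count)[0];
console.log("slowest:", slowest.pattern);
console.log("most frequent:", mostFrequent.pattern);
```

The patterns at the top of either ranking are the ones worth running through `.explain()`.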

Scenario 2 - Explain
db.s2.find({...}).sort({...}).explain()
{
    "cursor" : "BtreeCursor ABC",
    "nscanned" : 160677,
    "nscannedObjects" : 12015,
    "n" : 55,
    "millis" : 99,
    "scanAndOrder" : true,
    "indexBounds" : {...}
}
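To read the explain output above, it can help to compare how much work was done against how many documents were returned. The sketch below uses the numbers from the slide; `assessExplain` is a hypothetical helper, not part of MongoDB, and the field names are those of the legacy explain format shown here.

```javascript
// Hypothetical helper: derives simple ratios from a legacy-format explain document.
function assessExplain(plan) {
  return {
    // Index entries examined per document returned.
    scannedPerReturned: plan.n > 0 ? plan.nscanned / plan.n : Infinity,
    // Documents fetched per document returned.
    fetchedPerReturned: plan.n > 0 ? plan.nscannedObjects / plan.n : Infinity,
    // scanAndOrder: true means the results were sorted after retrieval,
    // not returned in index order.
    sortedAfterRetrieval: plan.scanAndOrder === true,
  };
}

// Values copied from the slide's explain output.
const plan = {
  cursor: "BtreeCursor ABC",
  nscanned: 160677,
  nscannedObjects: 12015,
  n: 55,
  millis: 99,
  scanAndOrder: true,
};

const report = assessExplain(plan);
console.log(Math.round(report.scannedPerReturned)); // 2921
console.log(Math.round(report.fetchedPerReturned)); // 218
console.log(report.sortedAfterRetrieval);           // true
```

Roughly 2,900 index entries examined per returned document, plus a sort outside the index, suggests the existing index serves neither the criteria nor the sort well.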

Scenario 2 - Diagnostics
•  Create a compound index
    •  Used for criteria and sort
•  Reduced CPU dramatically
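A compound index that serves both the criteria and the sort typically lists the equality fields first, followed by the sort fields in the sort's own order and direction. The slide elides the actual query, so the field names below (`status`, `created`) are purely hypothetical, and `compoundIndexSpec` is an illustrative helper rather than a MongoDB API.

```javascript
// Builds an index key spec: criteria (equality) fields first, then the sort
// fields in the same order and direction as the sort document.
function compoundIndexSpec(criteriaFields, sortSpec) {
  const spec = {};
  for (const field of criteriaFields) spec[field] = 1;
  for (const [field, direction] of Object.entries(sortSpec)) {
    if (!(field in spec)) spec[field] = direction;
  }
  return spec;
}

// Hypothetical query shape: find({ status: ... }).sort({ created: -1 })
const spec = compoundIndexSpec(["status"], { created: -1 });
console.log(JSON.stringify(spec)); // {"status":1,"created":-1}

// In the mongo shell the spec would be passed to the index builder, e.g.
// db.s2.createIndex(spec)   (ensureIndex on older shells)
```

After building the index, re-running `.explain()` on the same query is the way to confirm that the scan counts dropped and `scanAndOrder` is no longer reported.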

Scenario 2 - Takeaways
•  Performance test/analyze system behavior
•  Load test before deployment
•  Alert on abnormal states
•  High CPU may be a sign of poorly indexed data
•  Perform a rolling upgrade for indexes

Scenario 3
•  Application: General slowdown on login
•  Mongo: High disk utilization

Scenario 3 - Diagnostics
•  iostat

Device:  rrqm/s  wrqm/s  r/s   w/s   rsec/s  wsec/s
sdp      0.00    0.00    0.50  0.00  27.86   0.00

avgrq-sz  avgqu-sz  await     svctm    %util
56.00     149.58    20320.00  2010.00  100.00
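The iostat figures above can be turned into two more readable numbers: the average request size in KiB and the time a request spends waiting in the queue. This sketch assumes the usual sysstat meanings (`avgrq-sz` in 512-byte sectors, `await` and `svctm` in milliseconds, with `await` covering queue time plus service time); `summarizeIostat` is a hypothetical helper.

```javascript
// Hypothetical helper: derives request size and queue wait from one iostat row.
function summarizeIostat(row) {
  return {
    requestKiB: (row.avgrqSz * 512) / 1024, // sectors -> KiB
    queueWaitMs: row.await - row.svctm,     // time spent queued before service
    saturated: row.util >= 100,
  };
}

// Values copied from the slide's iostat output for sdp.
const sdp = { avgrqSz: 56.0, avgquSz: 149.58, await: 20320.0, svctm: 2010.0, util: 100.0 };

const io = summarizeIostat(sdp);
console.log(io.requestKiB);  // 28
console.log(io.queueWaitMs); // 18310
console.log(io.saturated);   // true
```

Half a read per second at 100% utilization, with requests queued for about 18 seconds, is the kind of disproportionate effect that points at the disk configuration and not at the query load.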

Scenario 3 - More Diagnostics
$ blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8096   512  4048          0   1099494850560   /dev/sdp
•  Huge read-ahead of 4MB
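The "4MB" figure follows from the RA column, which `blockdev` reports in 512-byte sectors. A quick check of the arithmetic, with `readAheadBytes` as an illustrative helper name:

```javascript
const SECTOR_BYTES = 512; // blockdev reports RA in 512-byte sectors

// Converts a read-ahead value in sectors to bytes.
function readAheadBytes(raSectors) {
  return raSectors * SECTOR_BYTES;
}

const raBytes = readAheadBytes(8096); // RA value from the slide
const raMiB = raBytes / (1024 * 1024);
console.log(raBytes);          // 4145152
console.log(raMiB.toFixed(2)); // "3.95"
```

With a read-ahead that large, a read of a small document can pull close to 4 MB from disk, which is consistent with a saturated device at a very low request rate.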

Scenario 3 - Takeaways
•  Pay attention to disk configurations
•  Load testing would have found this early
•  MongoDB depends on the OS a lot
•  Connect the dots from disproportionate effects
•  Using blockdev, be aware of layering!

Best Practices Covered - 1
•  System provisioning
    •  Configuration
    •  Performance
    •  Capacity
•  Logs
    •  Review
    •  Alert
    •  Rotate and collect (per cluster)

Best Practices Covered - 2
•  Query/Index Analysis
    •  Instrument app code, generate metrics
    •  Run .explain()
    •  Database Profiler
•  Plan/test rollouts
    •  Rolling upgrade for Replica Sets
    •  Generate indexes on Secondaries first
•  Use DNS, not IPs

Mike Fiedler
