The Plan
• Review support cases
o Taken from real issues
o Names/ips/dates changed to protect identities
• Analyze reported issues
• Distill best practices
• Summarize takeaways
• Repeat...
Scenario 1
• Fire, it is on fire!
• Users notice response times of 1-3 sec
• App logs show timeouts
• Server logs show socket exceptions
Scenario 1 - Diagnostics
• Logs
• Understanding the timeouts
o Client read timeout set
o Connection closed/discarded
o Symptom not cause
• Server connection exceptions
o Match timing of client timeouts
o Symptom not cause
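Matching the timing of server connection exceptions against client timeouts can be done mechanically once both logs are parsed into timestamps. The following is a minimal sketch, assuming the timestamps have already been extracted; the function name `correlate`, the 5-second window, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(client_timeouts, server_exceptions, window=timedelta(seconds=5)):
    """Pair each client timeout with the server exceptions within `window` of it."""
    pairs = []
    for t in client_timeouts:
        nearby = [s for s in server_exceptions if abs(s - t) <= window]
        pairs.append((t, nearby))
    return pairs


# Made-up timestamps, for illustration only.
FMT = "%Y-%m-%d %H:%M:%S"
client = [datetime.strptime(x, FMT) for x in (
    "2013-01-01 10:00:03",
    "2013-01-01 10:05:00",
)]
server = [datetime.strptime(x, FMT) for x in (
    "2013-01-01 10:00:01",
    "2013-01-01 10:00:04",
    "2013-01-01 10:30:00",
)]

matches = correlate(client, server)
for timeout, exceptions in matches:
    print(timeout, "->", len(exceptions), "server exception(s) nearby")
```

A close match in timing supports the reading on the slide: both log entries are symptoms of the same underlying event, not causes of each other.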
Scenario 1 - Monitoring
Graphs speak a thousand words
Scenario 1 - Takeaways
• Monitor Logs
o Alert, escalate
o Correlate
• Disk
o Monitor
o Moved to RAID 10
• Instrument/Monitor App
• Know your application and its (write) characteristics
Scenario 2
• Alerts warn that server is running hot
• Random (small) slowdowns
• Increased traffic/queries
Scenario 2 - Symptoms
• High CPU usage
• Similar query pattern
Scenario 2 - Diagnostics
• Turn on DB Profiling
• Look at logs
• Identify the query patterns that take longest or run most frequently, then run explain on them
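Ranking query patterns by total time and by frequency can be sketched as a small aggregation over profiler output. This is only an illustration: the documents below are made up, and the field names `ns`, `query`, and `millis` are modeled on profiler entries as an assumption, since the exact layout varies by server version. The helper `summarize` is hypothetical.

```python
from collections import defaultdict


def summarize(profile_docs):
    """Group profiler entries by (namespace, queried field names) and rank by total time."""
    stats = defaultdict(lambda: {"count": 0, "total_millis": 0})
    for doc in profile_docs:
        # The "shape" ignores the queried values and keeps only the field names.
        shape = (doc["ns"], tuple(sorted(doc.get("query", {}).keys())))
        stats[shape]["count"] += 1
        stats[shape]["total_millis"] += doc.get("millis", 0)
    return sorted(stats.items(), key=lambda kv: kv[1]["total_millis"], reverse=True)


# Made-up profiler entries, for illustration only.
docs = [
    {"ns": "app.users", "query": {"status": "A", "age": 30}, "millis": 120},
    {"ns": "app.users", "query": {"age": 41, "status": "B"}, "millis": 200},
    {"ns": "app.orders", "query": {"_id": 1}, "millis": 2},
]

ranked = summarize(docs)
for shape, s in ranked:
    print(shape, s["count"], "calls,", s["total_millis"], "ms total")
```

The shapes at the top of the ranking are the candidates to run explain on.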
Scenario 2 - Diagnostics
• Create a compound index
o Used for criteria and sort
o Reduced CPU dramatically
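A compound index that serves both the criteria and the sort lists the filter fields and the sort fields in one key specification. The sketch below builds such a specification using a common rule of thumb (equality fields first, then sort fields) that is an assumption here and not taken from the slides. The field names `user_id` and `created_at` and the helper `compound_index_keys` are hypothetical.

```python
ASCENDING, DESCENDING = 1, -1


def compound_index_keys(equality_fields, sort_spec):
    """Build an ordered key list: equality fields first, then sort fields."""
    keys = [(field, ASCENDING) for field in equality_fields]
    seen = {field for field, _ in keys}
    for field, direction in sort_spec:
        if field not in seen:  # do not repeat a field already in the index
            keys.append((field, direction))
            seen.add(field)
    return keys


# Hypothetical query: filter on user_id, sort by created_at descending.
keys = compound_index_keys(["user_id"], [("created_at", DESCENDING)])
print(keys)

# With a driver such as pymongo (not imported here), a key list in this
# form could be passed to collection.create_index(keys).
```

Running explain before and after creating the index shows whether the query and its sort are actually using it.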
Scenario 2 - Takeaways
• Performance test/analyze system behavior
• Load test before deployment
• Alert on abnormal states
• High CPU is often a sign of poorly indexed queries
• Rolling upgrade for indexes
Scenario 3
• General slowdown on login
• High disk utilization
Scenario 3 - Takeaways
• Pay attention to disk configurations
• Load testing would have found this early
• MongoDB depends on the OS a lot
• Connect the dots from disproportionate effects
Best Practices Learned
• System provisioning
o Capacity
o Performance
o Scale
o Configuration
• Logs
o Review
o Alert
o Rotate and collect (per cluster)
Best Practices Learned
• Query/Index Analysis
o Database Profiler
o Run explain periodically (sampled)
o Instrument code, generate metrics
• Plan/test rollouts
o Rolling upgrade for Replica Set
o Generate indexes on secondaries first
o Name services, use redirection
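The ordering behind a rolling rollout (secondaries first, primary last) can be written down as a simple plan. This is a rough sketch of the ordering only, not an operational procedure: the member list, host names, and the helper `rolling_order` are hypothetical, and the actual build steps on each member are left out.

```python
def rolling_order(members):
    """Return planned steps: build on each secondary, then step down and build on the primary."""
    secondaries = [m["host"] for m in members if m["state"] == "SECONDARY"]
    primaries = [m["host"] for m in members if m["state"] == "PRIMARY"]
    steps = [("build index", host) for host in secondaries]
    for host in primaries:
        steps.append(("step down", host))   # hand off the primary role first
        steps.append(("build index", host))
    return steps


# Hypothetical three-member replica set.
members = [
    {"host": "db1.example.net", "state": "PRIMARY"},
    {"host": "db2.example.net", "state": "SECONDARY"},
    {"host": "db3.example.net", "state": "SECONDARY"},
]

plan = rolling_order(members)
for action, host in plan:
    print(action, host)
```

Planning the order up front keeps the primary serving traffic until every secondary already has the index.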
Thanks, more refs
• Please take a look at http://mongodb.org (docs)
• Ask on mongodb-user group
• Use MMS or historic monitoring
o Watch for trends
o Create alerts
o Forecast capacity for provisioning
• logrotate Unix command
• Monitor disk: Munin or the like
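Watching trends and forecasting capacity can start with a straight-line fit over historic disk usage. The sketch below is one simple way to do that, assuming usage grows roughly linearly; the sample numbers and the helper `days_until_full` are made up for illustration.

```python
def days_until_full(samples, capacity_gb):
    """Fit a least-squares line to (day, used_gb) samples and estimate
    how many days after the last sample usage reaches capacity_gb.
    Returns None when usage is flat or shrinking."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # no growth trend to extrapolate
    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    return (capacity_gb - intercept) / slope - last_day


# Made-up daily disk usage samples: (day number, GB used).
usage = [(0, 100), (1, 110), (2, 120), (3, 130)]
remaining = days_until_full(usage, capacity_gb=200)
print("estimated days until full:", remaining)
```

An estimate like this is only as good as the linear assumption, so it works best as an early warning that feeds an alert, with provisioning decisions checked against the actual graphs.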