Slide 1

Slide 1 text

Operational Best Practices: Tales from the Field

Slide 2

Slide 2 text

The Plan
•  Review support cases
   o  Taken from real issues
   o  Names/IPs/dates changed to protect identities
•  Analyze reported issues
•  Distill best practices
•  Summarize takeaways
•  Repeat...

Slide 3

Slide 3 text

Scenario 1
•  Fire, it is on fire!
•  Users notice response times of 1-3 sec
•  App logs show timeouts
•  Server logs show socket exceptions

Slide 4

Slide 4 text

Scenario 1 - Diagnostics
•  Logs
•  Understanding the timeouts
   o  Client read timeout set
   o  Connection closed/discarded
   o  Symptom, not cause
•  Server connection exceptions
   o  Match timing of client timeouts
   o  Symptom, not cause
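One way to check that the server exceptions "match timing" with the client timeouts is to pair up events from both logs by timestamp. The following is only a minimal sketch: it assumes the timestamps have already been parsed into epoch milliseconds, and the helper name `correlateEvents`, the 500 ms window, and the sample values are all made up for illustration.

```javascript
// Hypothetical helper: for each client timeout, look for a server-side
// socket exception logged within `windowMs` of it.
function correlateEvents(clientTimeouts, serverExceptions, windowMs) {
  const sorted = [...serverExceptions].sort((a, b) => a - b);
  return clientTimeouts.map((t) => {
    // Earliest server exception that falls inside the window (not necessarily the nearest).
    const match = sorted.find((s) => Math.abs(s - t) <= windowMs);
    return { clientTs: t, serverTs: match === undefined ? null : match };
  });
}

// Made-up timestamps (epoch milliseconds), for illustration only.
const clientTimeouts = [1000, 5000, 9000];
const serverExceptions = [1200, 9100, 20000];

const pairs = correlateEvents(clientTimeouts, serverExceptions, 500);
const matched = pairs.filter((p) => p.serverTs !== null).length;
console.log(`${matched} of ${pairs.length} client timeouts have a nearby server exception`);
```

A high match rate supports the reading on the slide: both log entries are two views of the same event, so neither one explains the root cause.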

Slide 5

Slide 5 text

Scenario 1 - Monitoring
Graphs speak a thousand words

Slide 6

Slide 6 text

Scenario 1 - Takeaways
•  Monitor logs
   o  Alert, escalate
   o  Correlate
•  Disk
   o  Monitor
   o  Moved to RAID (10)
•  Instrument/monitor the app
•  Know your application and its (write) characteristics

Slide 7

Slide 7 text

Scenario 2
•  Alerts warn that server is running hot
•  Random (small) slowdowns
•  Increased traffic/queries

Slide 8

Slide 8 text

Scenario 2 - Symptoms
High CPU usage
Similar query pattern

Slide 9

Slide 9 text

Scenario 2 - Diagnostics
•  Turn on DB profiling
•  Look at logs
Identify the query patterns that take longest or run most frequently, and run explain on them
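Finding the patterns that "take longest or run most frequently" amounts to grouping slow operations by the shape of their query and ranking the groups. The sketch below works on plain objects; the field names `ns`, `query`, and `millis` are modeled on profiler-style documents, but the sample entries, the namespaces, and the helpers `queryShape` and `rankPatterns` are assumptions for illustration, and real profiler output varies by server version.

```javascript
// Reduce a query document to its "shape": the sorted field names, ignoring values.
function queryShape(query) {
  return Object.keys(query).sort().join(",");
}

// Group profiler-style entries by namespace + shape, then rank by total time spent.
function rankPatterns(entries) {
  const stats = new Map();
  for (const e of entries) {
    const key = `${e.ns} {${queryShape(e.query)}}`;
    const s = stats.get(key) || { pattern: key, count: 0, totalMillis: 0, maxMillis: 0 };
    s.count += 1;
    s.totalMillis += e.millis;
    s.maxMillis = Math.max(s.maxMillis, e.millis);
    stats.set(key, s);
  }
  return [...stats.values()].sort((a, b) => b.totalMillis - a.totalMillis);
}

// Invented sample entries; not taken from the support case.
const sampleEntries = [
  { ns: "app.scenario2", query: { userId: 1, status: "A" }, millis: 99 },
  { ns: "app.scenario2", query: { status: "B", userId: 7 }, millis: 120 },
  { ns: "app.sessions", query: { token: "x" }, millis: 3 },
];

const ranked = rankPatterns(sampleEntries);
console.log(ranked);
```

The top entries of `ranked` are the candidates to run explain on. Sorting by `count` instead of `totalMillis` surfaces the most frequent patterns rather than the most expensive ones.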

Slide 10

Slide 10 text

Scenario 2 - Explain
db.scenario2.find({...}).sort({...}).explain()
{
    "cursor" : "BtreeCursor ABC",
    "nscanned" : 160677,
    "nscannedObjects" : 12015,
    "n" : 55,
    "millis" : 99,
    "scanAndOrder" : true,
    "indexBounds" : {...}
}
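The explain output is easier to judge as ratios: how many index entries and documents were examined for each document returned, and whether the sort happened in memory. The sketch below applies that reading to the numbers from the slide; the helper name `summarizeExplain` is hypothetical and the elided `indexBounds` field is simply left out.

```javascript
// Summarize the legacy explain() fields shown on the slide.
function summarizeExplain(plan) {
  return {
    // Index entries examined per document returned; values near 1 are ideal.
    scannedPerReturned: plan.n > 0 ? plan.nscanned / plan.n : Infinity,
    // Documents fetched per document returned.
    docsPerReturned: plan.n > 0 ? plan.nscannedObjects / plan.n : Infinity,
    // scanAndOrder: true means the results were sorted in memory
    // rather than being read in index order.
    inMemorySort: plan.scanAndOrder === true,
  };
}

// Values copied from the slide (indexBounds omitted because it is elided there).
const plan = {
  cursor: "BtreeCursor ABC",
  nscanned: 160677,
  nscannedObjects: 12015,
  n: 55,
  millis: 99,
  scanAndOrder: true,
};

const summary = summarizeExplain(plan);
console.log(summary);
```

With these inputs the query examines thousands of index entries and a couple of hundred documents for every document it returns, and then sorts in memory. Both are CPU-heavy, which fits the symptoms on the earlier slide.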

Slide 11

Slide 11 text

Scenario 2 - Diagnostics
•  Create a compound index
   o  Used for criteria and sort
   o  Reduced CPU dramatically
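A compound index that serves both the criteria and the sort is commonly laid out with the equality-matched fields first and the sort fields after them. Because the slides elide the actual query, the field names below (`userId`, `status`, `createdAt`) and the helper `buildCompoundIndexSpec` are purely hypothetical; this is a sketch of the ordering guideline, not the index used in the case.

```javascript
// Build a compound index key spec: equality fields first (ascending),
// followed by the sort fields with their sort directions.
function buildCompoundIndexSpec(equalityFields, sortSpec) {
  const spec = {};
  for (const field of equalityFields) {
    spec[field] = 1;
  }
  for (const [field, direction] of Object.entries(sortSpec)) {
    spec[field] = direction;
  }
  return spec;
}

// Hypothetical fields: filter on userId and status, sort by createdAt descending.
const indexSpec = buildCompoundIndexSpec(["userId", "status"], { createdAt: -1 });
console.log(JSON.stringify(indexSpec));

// In the mongo shell, a spec like this would be passed to createIndex, e.g.
//   db.scenario2.createIndex(indexSpec)
```

Once such an index exists, rerunning explain should show `scanAndOrder` as false and `nscanned` much closer to `n`, assuming the query planner picks the new index.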

Slide 12

Slide 12 text

Scenario 2 - Takeaways
•  Performance test/analyze system behavior
•  Load test before deployment
•  Alert on abnormal states
•  High CPU is often a sign of poorly indexed queries
•  Use rolling upgrades to build indexes

Slide 13

Slide 13 text

Scenario 3
•  General slowdown on login
•  High disk utilization

Slide 14

Slide 14 text

Scenario 3 - Diagnostics
iostat
Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz     await    svctm   %util
sdp        0.00    0.00  0.50  0.00   27.86    0.00     56.00    149.58  20320.00  2010.00  100.00
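The iostat line above can be turned into an alert condition by parsing it and checking utilization together with wait time. The sketch below does that for the exact header and row on the slide; the helper names and the thresholds (90% utilization, 100 ms await) are assumptions chosen for illustration, not recommended values.

```javascript
// Parse one iostat device row into a map of column name -> number.
function parseIostatRow(headerLine, rowLine) {
  const headers = headerLine.trim().split(/\s+/).slice(1); // drop the "Device:" label
  const [device, ...values] = rowLine.trim().split(/\s+/);
  const metrics = {};
  headers.forEach((name, i) => {
    metrics[name] = parseFloat(values[i]);
  });
  return { device, metrics };
}

// Flag a disk as saturated when it is both busy and slow to answer.
// Thresholds are illustrative assumptions.
function diskLooksSaturated(metrics, utilThreshold = 90, awaitThresholdMs = 100) {
  return metrics["%util"] >= utilThreshold && metrics["await"] >= awaitThresholdMs;
}

const header = "Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util";
const row = "sdp 0.00 0.00 0.50 0.00 27.86 0.00 56.00 149.58 20320.00 2010.00 100.00";

const parsed = parseIostatRow(header, row);
console.log(parsed.device, diskLooksSaturated(parsed.metrics));
```

The striking part of this sample is the combination: the device is 100% utilized and requests wait about 20 seconds, yet it completes only 0.5 reads per second.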

Slide 15

Slide 15 text

Scenario 3
$ blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8096   512  4048          0   1099494850560   /dev/sdp
Huge read-ahead of 4MB
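The "4MB" figure follows from the RA column, which blockdev reports in 512-byte sectors. A small sketch of the arithmetic, with hypothetical helper names, and a read-ahead of 32 sectors included only as a comparison point:

```javascript
// blockdev reports read-ahead (RA) in 512-byte sectors.
const SECTOR_BYTES = 512;

function readAheadBytes(raSectors) {
  return raSectors * SECTOR_BYTES;
}

function toMiB(bytes) {
  return bytes / (1024 * 1024);
}

// RA value from the slide.
const raBytes = readAheadBytes(8096);
console.log(`${raBytes} bytes = ${toMiB(raBytes).toFixed(2)} MiB`);

// Comparison point (an assumption, not a value from the case): 32 sectors.
const smallRaBytes = readAheadBytes(32);
console.log(`${smallRaBytes} bytes = ${smallRaBytes / 1024} KiB`);
```

With a read-ahead this large, a random read of one small document can pull roughly 4 MB from disk, which would explain a saturated device that serves very few requests.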

Slide 16

Slide 16 text

Scenario 3 - Takeaways
•  Pay attention to disk configurations
•  Load testing would have found this early
•  MongoDB depends on the OS a lot
•  Connect the dots from disproportionate effects

Slide 17

Slide 17 text

Best Practices Learned
•  System provisioning
   o  Capacity
   o  Performance
   o  Scale
   o  Configuration
•  Logs
   o  Review
   o  Alert
   o  Rotate and collect (per cluster)

Slide 18

Slide 18 text

Best Practices Learned
•  Query/Index Analysis
   o  Database Profiler
   o  Run explain periodically (sampled)
   o  Instrument code, generate metrics
•  Plan/test rollouts
   o  Rolling upgrade for Replica Set
   o  Generate indexes on secondaries first
   o  Name services, use redirection

Slide 19

Slide 19 text

Thanks, more refs
Please take a look at http://mongodb.org (docs)
•  Ask on mongodb-user group
•  Use MMS or historic monitoring
   o  Watch for trends
   o  Create alerts
   o  Forecast capacity for provisioning
•  logrotate Unix command
•  Monitor disk with Munin or the like

Slide 20

Slide 20 text

Questions