@mrtazz
Syslog-Ng
• Web, Search, Gearman, Photos, Nagios, Network, VPN
• 1.2GB written/minute
• Chef role attribute based config
• Rule ordering!
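For a sense of scale, the 1.2GB per minute figure can be converted to other units. This is a back-of-the-envelope sketch only: it assumes decimal units (1 GB = 1000 MB) and a constant write rate around the clock, neither of which the slide states.

```python
# Rough unit conversion for the syslog-ng write rate quoted on the slide.
# Assumptions (not from the slide): decimal units, constant rate all day.
GB_PER_MINUTE = 1.2

def write_rates(gb_per_minute):
    """Return (MB per second, TB per day) for a GB-per-minute rate."""
    mb_per_second = gb_per_minute * 1000 / 60
    tb_per_day = gb_per_minute * 60 * 24 / 1000
    return mb_per_second, tb_per_day

mb_s, tb_day = write_rates(GB_PER_MINUTE)
print(f"{mb_s:.0f} MB/s, {tb_day:.2f} TB/day")
```

Under those assumptions the rate works out to roughly 20 MB/s, or about 1.7 TB per day.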
Slide 19
Slide 20
github.com/etsy/logster
• Extract metrics from log files
• Written in Python
• Runs every minute via cron
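The idea of extracting metrics from log files can be sketched in a few lines of Python. This is not logster's own parser API; the log format, the regular expression and the function names `count_status_classes` and `per_second` are assumptions made up for illustration, and the sample lines are invented.

```python
import re
from collections import Counter

# Hypothetical log format: an access-log style line whose status code
# follows the quoted request, e.g. '... "GET / HTTP/1.1" 200 512'.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def count_status_classes(lines):
    """Count responses per status class (2xx, 4xx, ...) in the given lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts

def per_second(counts, interval_seconds=60):
    """Turn raw counts into rates, assuming one run covers one cron interval."""
    return {name: n / interval_seconds for name, n in counts.items()}

# Invented sample lines standing in for one minute of log output.
sample = [
    '10.0.0.1 - - [01/Jan/2014:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2014:00:00:02 +0000] "GET /x HTTP/1.1" 404 0',
    '10.0.0.3 - - [01/Jan/2014:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
]
counts = count_status_classes(sample)
rates = per_second(counts)
```

Dividing by the 60-second interval matches the once-a-minute cron schedule: each run only needs to look at the lines written since the previous run.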
Slide 21
Splunk
• Indexes all of our log files
• Easy search for patterns
• Saved searches for interesting ones
• Basically using it as a glorified grep
Slide 22
Logstash
• Experimental status
• Makes it easier to integrate different sources
• Easy to set up in dev environment
• Trying to figure out where/how it fits into our infrastructure
Slide 23
Eventinator
• Tracks all events in our infrastructure
• Chef runs and changes
• DNS changes
• Network
• Deploys
• Server provisioning and decommissioning
• ~ 12 million events in the last 2 years
Slide 24
Slide 25
Chef
• Rules everything around me
• Same cookbooks on prod and dev
• Every node runs Chef every 10 minutes
• A ton of knife plugins and handlers
Slide 26
Slide 27
> 120 recipes
Slide 28
Slide 29
Nagios
Slide 30
Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call
Slide 31
github.com/lozzd/nagdash
Slide 32
Slide 33
Slide 34
Slide 35
Nagios Herald
• Add context to Nagios alerts
• What are the first 5 things you do when you get paged?
• You already have the phone in your hand
• Nagios notification handler
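What "adding context" in a notification handler might look like can be sketched in Python. This is not taken from Nagios Herald; it only assumes that Nagios exports its standard macros to the handler as `NAGIOS_*` environment variables (which requires environment macros to be enabled). The function name `build_notification`, the example values and the context line are made up for illustration.

```python
def build_notification(env, context):
    """Format an alert body from Nagios macros plus extra context lines."""
    header = "{state}: {service} on {host}".format(
        state=env.get("NAGIOS_SERVICESTATE", "UNKNOWN"),
        service=env.get("NAGIOS_SERVICEDESC", "?"),
        host=env.get("NAGIOS_HOSTNAME", "?"),
    )
    lines = [header, env.get("NAGIOS_SERVICEOUTPUT", "")]
    if context:
        # Append whatever the on-call person would otherwise look up by hand.
        lines.append("")
        lines.append("Context:")
        lines.extend("  " + item for item in context)
    return "\n".join(lines)

# Invented example values; a real handler would pass os.environ instead.
example_env = {
    "NAGIOS_HOSTNAME": "web01",
    "NAGIOS_SERVICEDESC": "Disk Space",
    "NAGIOS_SERVICESTATE": "CRITICAL",
    "NAGIOS_SERVICEOUTPUT": "DISK CRITICAL - free space: /var 3%",
}
body = build_notification(example_env, ["df -h: /var 97% used"])
print(body)
```

The context lines stand in for the "first 5 things" from the slide: output the handler collects up front so it arrives with the page.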
Slide 36
Slide 37
The Toys are real
Slide 38
There’s another side of heaven
Slide 39
Ops Weekly
Slide 40
Ops Weekly
Slide 41
Summary
• Set of trusted tools
• Enhance them where they fall short
• Try out new things
• Write tools where applicable
• Continuous monitoring and adaptation