ChatOps for Incidents

A culmination of many previous presentations, given at Rx Savings Solutions in Overland Park, a company currently going through explosive growth.

Aaron Blythe

October 05, 2018

Transcript

  1. ChatOps for Incidents

  2. Aaron Blythe (@ablythe) • Lead Organizer • @devopskc @devopsdayskc http://aaronblythe.com

  3. http://devopsdayskc.org Use code DEVOPSCOMMUNITY to get $50 off

  4. Outline • Just Chat • Just Incident Ops • ChatOps • Advanced things I have picked up

  5. Outline • Just Chat • Just Incident Ops • ChatOps • Advanced things I have picked up

  6. talking

  7. shouting

  8. Email

  9. Slack, the Email killer?

  10. Best Practices • Say who you are and what team you are on • Threaded Conversations • Set Away Settings • Star your favorite Channels • @ mention people • Manage Notifications - Preferences

  11. Simple Features • Star vs. Pinned • View Members • Reactions • Activity (the @ sign in upper right) • Snooze notifications • All Threads

  12. Neat Features • Simple Slack Bot - Wifi, Apples -> How do you like them apples? • Integrations - Pipeline Gitlab • I would like to add the Jira Integration and Dynatrace Integrations soon • Alexa connect in for PagerDuty and New Relic

  13. Slack Integrations

  14. Maturity with Chat 1. Some teams start getting accustomed to chat 2. Design and/or long term technical topic channels form 3. Useful apps start to be installed that behave like command line apps 4. Incidents are being run on public channels

  15. Outline • Just Chat • Just Incident Ops • ChatOps • Advanced things I have picked up

  16. Things that annoy me in Incidents

  17. Things that annoy me in Incidents

  18. Things that annoy me in Incidents • Trying to figure out what has been done so far (if I don’t know yet) • Re-hashing what we know (if I am the one that knows)

  19. OTHER Things that annoy me in incidents • Fear-based Ops • Hero Culture

  20. Matty Stratton - PagerDuty, Arrested DevOps Podcast https://noti.st/mattstratton/rZ8NCv Matty Stratton - Incidents and Accidents • Have clearly defined roles • Rules change when you go from Normal to Emergency • Post incident criteria widely -> Do not litigate during the call

  21. PagerDuty Incident Response Docs • PagerDuty Documentation: https://github.com/PagerDuty/incident-response-docs • Rendered: https://github.com/PagerDuty/incident-response-docs

  22. Service Level (from Google SRE Book) • SLI - Service Level Indicator • Example: Status code is 200 • SLO - Service Level Objective • Example: Service is available 99.9% of the time • SLA - Service Level Agreement • Example: Partial subscription fee refunded if 99% availability is not met • NOTE: The SLA threshold should be lower than the SLO so you have room before the agreement is breached • See: https://twitter.com/rakyll/status/974826146343788544?lang=en
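
As a rough illustration of how the three terms on slide 22 relate, the sketch below computes an availability SLI from HTTP status codes and compares it with an SLO and an SLA threshold. The 99.9% and 99% figures come from the slide; the function names and the sample data are made up for illustration and are not from the talk.

```python
def availability_sli(status_codes):
    """SLI: fraction of requests that returned HTTP 200."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if code == 200)
    return good / len(status_codes)


def evaluate(sli, slo=0.999, sla=0.99):
    """Classify an SLI against the SLO (internal target) and SLA (contract).

    The SLA threshold is deliberately looser than the SLO, so missing the
    SLO is a warning that arrives before any refund is owed.
    """
    if sla > slo:
        raise ValueError("SLA should not be stricter than the SLO")
    if sli >= slo:
        return "within SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical sample: 1000 requests, 5 of them failed.
codes = [200] * 995 + [500] * 5
sli = availability_sli(codes)
print(f"SLI = {sli:.3%} -> {evaluate(sli)}")
```

The gap between the two thresholds is the "room" the slide's note refers to: the team can react to a missed SLO while the SLA is still intact.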
  25. Outline • Just Chat • Just Incident Ops • ChatOps • Advanced things I have picked up

  26. https://victorops.com/chatops-for-dummies/ Free Download By: Jason Hand https://victorops.com/chatops/

  27. Chat Client + Bot Err

  29. Slack Integrations

  30. Slack Integrations • Benefits • Quick Setup (minimal configuration) • Often managed by company that owns integration • Drawbacks • Often simplistic workflow

  31. Hubot

  32. Hubot Brain

  33. Hubot Plugins • hubot-pager-me • hubot-confluence • hubot-leankit • hubot-newrelic2 • hubot-sumologic

  34. hubot-newrelic2

  36. http://devopsreactions.tumblr.com/post/127777547677/when-you-see-the-outage-starting-and-you-cant-do

  37. Slack Hubot PagerDuty Architecture

  38. Webhooks • PagerDuty Outgoing • Slack Incoming

  39. Slack Hubot PagerDuty Webhook Webhook Architecture
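
The webhook flow on slides 38 and 39 (an outgoing webhook from PagerDuty, relayed to a Slack incoming webhook) could look roughly like the sketch below. It assumes a simplified, flattened event dict with hypothetical `status`, `title`, and `url` fields; a real PagerDuty webhook body is structured differently and would need unpacking first. The only Slack-side detail relied on is that an incoming webhook accepts a JSON POST with a `text` field. `post_to_slack` is defined but not called here, since it needs a real webhook URL.

```python
import json
import urllib.request


def to_slack_message(event):
    """Turn a simplified incident event into a Slack incoming-webhook payload.

    `event` is a hypothetical, flattened dict used only for illustration.
    """
    text = f"[{event['status'].upper()}] {event['title']}"
    if event.get("url"):
        # Slack link syntax: <url|label>
        text += f" <{event['url']}|details>"
    return {"text": text}


def post_to_slack(webhook_url, message):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical event, as a relay in the middle might see it.
example = {"status": "triggered", "title": "Checkout latency high"}
print(to_slack_message(example))
```

In the architecture on the slide, Hubot sits in the relay position, so the same translation step would live in a bot script rather than a standalone function.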

  40. hubot-incident Start/Triggered Acknowledged Resolved Closed

  41. hubot-incident Start/Triggered Acknowledged Resolved Closed

  42. hubot-incident Start/Triggered Acknowledged Resolved Closed

  43. hubot-incident Start/Triggered Acknowledged Resolved Closed

  44. hubot-incident Start/Triggered Acknowledged Resolved Closed After hold Post Mortem
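
The lifecycle shown on slides 40 to 44 can be read as a small state machine. The sketch below is one way to model it, assuming a strictly linear flow taken from the order of the states on the slides; it is not the hubot-incident plugin's actual implementation, and the class and variable names are invented for illustration.

```python
from enum import Enum


class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    CLOSED = "closed"
    POST_MORTEM = "post_mortem"


# Assumed linear flow, read off the slide order; the real plugin may differ.
NEXT = {
    State.TRIGGERED: State.ACKNOWLEDGED,
    State.ACKNOWLEDGED: State.RESOLVED,
    State.RESOLVED: State.CLOSED,
    State.CLOSED: State.POST_MORTEM,
}


class Incident:
    def __init__(self, title):
        self.title = title
        self.state = State.TRIGGERED
        self.history = [State.TRIGGERED]

    def advance(self):
        """Move to the next state, or raise if the incident is finished."""
        if self.state not in NEXT:
            raise ValueError(f"no transition out of {self.state.value}")
        self.state = NEXT[self.state]
        self.history.append(self.state)
        return self.state


incident = Incident("Checkout latency high")
incident.advance()  # triggered -> acknowledged
incident.advance()  # acknowledged -> resolved
print([s.value for s in incident.history])
```

Keeping the history alongside the current state gives the post-mortem a ready-made timeline of when each transition happened, once timestamps are added.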

  45. Blameless Post Mortem

  46. Trust, Just Culture and Blameless Post-Mortem http://aaronblythe.com/presentations/

  48. http://devopsreactions.tumblr.com/post/145902399369/manager-on-a-call-during-an-outage

  49. Notes in PagerDuty (For Post Mortem)

  50. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  51. https://github.com/HearstAT/hubot-incident

  52. SLAPI Bot (Slack API) • Why? • Take advantage of Slack API (Hubot is the least common denominator) • Language agnostic plugins • Docker as packaging system • https://github.com/ImperialLabs/slapi

  53. Outline • Just Chat • Just Incident Ops • ChatOps • Advanced things I have picked up

  54. • Practice in a non-stressful situation • In an incident - don’t use “can someone…?” • In a post-mortem - stop on each “could have”, “should have” or “would have” • Continuous Improvement, just like any other part of Software Engineering • Automate the Mundane • Spending time saves time
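
In the spirit of "Automate the Mundane", the post-mortem tip on slide 54 can be partly mechanized. The sketch below scans written post-mortem notes for the three counterfactual phrases named on the slide so a facilitator can stop on them; the function name, the regular expression, and the sample notes are illustrative assumptions, not something from the talk.

```python
import re

# Matches "could have", "should have", "would have" and the 've contractions.
COUNTERFACTUAL = re.compile(
    r"\b(?:could|should|would)(?:\s+have|['’]ve)\b", re.IGNORECASE
)


def flag_counterfactuals(notes):
    """Return (line_number, line) pairs that contain a counterfactual phrase."""
    return [
        (number, line.strip())
        for number, line in enumerate(notes.splitlines(), start=1)
        if COUNTERFACTUAL.search(line)
    ]


# Hypothetical notes from a post-mortem document.
sample = """We should have paged the database team earlier.
The disk filled up at 02:14.
Someone could've noticed the alert."""

for number, line in flag_counterfactuals(sample):
    print(f"line {number}: {line}")
```

A match is only a prompt for discussion: the point of stopping on these phrases is to ask what made the action seem reasonable at the time, not to assign blame.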