ChatOps for Incidents

A culmination of many previous presentations, given at Rx Savings Solutions in Overland Park, a company currently going through explosive growth.

Aaron Blythe

October 05, 2018

Transcript

  1. ChatOps for Incidents

  2. Aaron Blythe (@ablythe)
    • Lead Organizer

    @devopskc
    @devopsdayskc
    http://aaronblythe.com

  3. http://devopsdayskc.org
    Use code DEVOPSCOMMUNITY to get $50 off

  4. Outline
    •Just Chat

    •Just Incident Ops

    •ChatOps

    •Advanced things I have picked up

  5. Outline
    •Just Chat

    •Just Incident Ops

    •ChatOps

    •Advanced things I have picked up

  6. talking

  7. shouting

  8. Email

  9. Slack, the Email killer?

  10. Best Practices
    • Say who you are and what team you are on
    • Threaded Conversations
    • Set Away Settings
    • Star your favorite Channels
    • @ mention people
    • Manage Notifications - Preferences

  11. Simple Features
    • Star vs. Pinned
    • View Members
    • Reactions
    • Activity (the @ sign in upper right)
    • Snooze notifications
    • All Threads

  12. Neat Features
    • Simple Slack Bot - Wifi, Apples -> How do you like them apples?
    • Integrations - Pipeline Gitlab
    • I would like to add the Jira and Dynatrace integrations soon
    • Alexa connect in for PagerDuty and New Relic
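
The "Simple Slack Bot" bullet above describes a keyword-triggered canned reply. A minimal sketch of that idea in plain Python follows. It does not call the Slack API; the `RESPONSES` table and the `respond` helper are hypothetical names, and only the "apples" reply comes from the slide (the wifi reply is a placeholder).

```python
from typing import Optional

# Hypothetical trigger -> reply table. Only the "apples" reply is from the slide;
# the wifi text is a placeholder, not something stated in the talk.
RESPONSES = {
    "apples": "How do you like them apples?",
    "wifi": "<guest wifi details go here>",
}


def respond(message: str) -> Optional[str]:
    """Return the canned reply for the first trigger word found, else None."""
    for word in message.lower().split():
        key = word.strip(".,!?")  # ignore trailing punctuation
        if key in RESPONSES:
            return RESPONSES[key]
    return None


if __name__ == "__main__":
    print(respond("Anyone want apples?"))
```

A real bot would wire `respond` into whatever message event the chat client delivers; that wiring is left out here on purpose.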

  13. Slack Integrations

  14. Maturity with Chat
    1. Some teams start getting accustomed to chat

    2. Design and/or long term technical topic channels form

    3. Useful apps start to be installed that behave like command line apps

    4. Incidents are being run in public channels

  15. Outline
    •Just Chat

    •Just Incident Ops

    •ChatOps

    •Advanced things I have picked up

  16. Things that annoy me in Incidents

  17. Things that annoy me in Incidents

  18. Things that annoy me in Incidents
    • Trying to figure out what has been done so far (if I don’t know yet)

    • Re-hashing what we know (if I am the one that knows)

  19. Other things that annoy me in Incidents
    •Fear-based Ops

    •Hero Culture

  20. Matty Stratton
    - PagerDuty, Arrested DevOps Podcast
    https://noti.st/mattstratton/rZ8NCv
    Matty Stratton - Incidents and Accidents
    • Have clearly defined roles
    • Rules change when you go from Normal to Emergency

    • Post incident criteria widely -> Do not litigate during the call

  21. PagerDuty Incident Response Docs
    • PagerDuty Documentation: https://github.com/PagerDuty/incident-response-docs

    • Rendered: https://github.com/PagerDuty/incident-response-docs

  22. Service Level (from Google SRE Book)
    • SLI - Service Level Indicator

    • Example: Status code is 200

    • SLO - Service Level Objective

    • Example: Service is available 99.9% of the time

    • SLA - Service Level Agreement

    • Example: Partial subscription fee refunded if 99% availability is not met

    • NOTE: The SLA threshold should be less strict than the SLO so you have room to spare

    • Click here: https://twitter.com/rakyll/status/974826146343788544?lang=en
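
The three levels above can be worked through with numbers. This is a small sketch, assuming the SLI is the fraction of requests returning status 200 and using the 99.9% SLO and 99% SLA from the slide's examples. The request counts are made up for illustration, and `availability` and `evaluate` are hypothetical helper names.

```python
SLO = 0.999  # internal objective, from the slide's example
SLA = 0.99   # contractual threshold, from the slide's example


def availability(status_codes):
    """SLI: fraction of requests that returned HTTP 200."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if code == 200)
    return good / len(status_codes)


def evaluate(status_codes):
    """Compare the measured SLI against the SLO and the SLA."""
    sli = availability(status_codes)
    return {"sli": sli, "meets_slo": sli >= SLO, "meets_sla": sli >= SLA}


# Made-up sample: 10,000 requests, 50 of which failed.
sample = [200] * 9950 + [500] * 50
result = evaluate(sample)

if __name__ == "__main__":
    print(result)
```

With this sample the SLI is 99.5%: the SLO is missed while the SLA still holds, which is the room the gap between the two is meant to provide.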

  25. Outline
    •Just Chat

    •Just Incident Ops

    •ChatOps

    •Advanced things I have picked up

  26. ChatOps for Dummies
    By: Jason Hand
    Free Download: https://victorops.com/chatops-for-dummies/
    https://victorops.com/chatops/

  27. Chat Client + Bot
    Err

  29. Slack Integrations

  30. Slack Integrations
    • Benefits
    • Quick Setup (minimal configuration)
    • Often managed by company that owns integration
    • Drawbacks
    • Often simplistic workflow

  31. Hubot

  32. Hubot Brain

  33. Hubot Plugins
    • hubot-pager-me
    • hubot-confluence
    • hubot-leankit
    • hubot-newrelic2
    • hubot-sumologic

  34. hubot-newrelic2

  36. http://devopsreactions.tumblr.com/post/127777547677/when-you-see-the-outage-starting-and-you-cant-do

  37. Slack
    Hubot
    PagerDuty
    Architecture

  38. Webhooks
    PagerDuty Outgoing
    Slack Incoming
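
The webhook pairing above (PagerDuty outgoing, Slack incoming) comes down to reshaping one JSON body into another. Below is a sketch of that translation step only, with no network calls. The input dictionary is a simplified, hypothetical shape and not PagerDuty's actual webhook schema. The output uses the `text` field that Slack incoming webhooks accept, and `to_slack_payload` is a hypothetical helper name.

```python
import json


def to_slack_payload(event: dict) -> dict:
    """Turn a simplified incident event into a Slack incoming-webhook body.

    `event` is an assumed, flattened shape: {"status", "title", "url"}.
    A real PagerDuty payload is nested differently and would need mapping first.
    """
    status = event.get("status", "unknown")
    title = event.get("title", "(no title)")
    text = f"[{status.upper()}] {title}"
    url = event.get("url")
    if url:
        text += f" <{url}|details>"  # Slack's <url|label> link syntax
    return {"text": text}


if __name__ == "__main__":
    example = {
        "status": "triggered",
        "title": "API latency high",
        "url": "https://example.com/incidents/1",
    }
    # This JSON string is what would be POSTed to the Slack webhook URL.
    print(json.dumps(to_slack_payload(example)))
```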

  39. Slack
    Hubot
    PagerDuty
    Webhook
    Webhook
    Architecture

  40. hubot-incident
    Start/Triggered
    Acknowledged
    Resolved
    Closed

  41. hubot-incident
    Start/Triggered
    Acknowledged
    Resolved
    Closed

  42. hubot-incident
    Start/Triggered
    Acknowledged
    Resolved
    Closed

  43. hubot-incident
    Start/Triggered
    Acknowledged
    Resolved
    Closed

  44. hubot-incident
    Start/Triggered
    Acknowledged
    Resolved
    Closed
    Afterwards, hold a Post Mortem
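
The four states listed on these slides can be read as a small state machine. The sketch below assumes a strictly linear order (triggered, acknowledged, resolved, closed). That ordering is an assumption for illustration; this is not the hubot-incident plugin's code, and `Incident` and `advance` are hypothetical names.

```python
# Assumed linear lifecycle; each state lists the only state allowed next.
NEXT_STATE = {
    "triggered": "acknowledged",
    "acknowledged": "resolved",
    "resolved": "closed",
    "closed": None,
}


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.state = "triggered"
        self.history = ["triggered"]  # kept so the post mortem has a timeline

    def advance(self, new_state: str) -> None:
        """Move to new_state, rejecting anything that skips or reverses a step."""
        if NEXT_STATE[self.state] != new_state:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.history.append(new_state)


if __name__ == "__main__":
    incident = Incident("checkout errors")
    for step in ("acknowledged", "resolved", "closed"):
        incident.advance(step)
    print(incident.history)
```

Recording every transition in `history` is the part that pays off later, since the post mortem needs to know what happened and in what order.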

  45. Blameless Post Mortem

  46. Trust, Just Culture and Blameless Post-Mortem
    http://aaronblythe.com/presentations/

  48. http://devopsreactions.tumblr.com/post/145902399369/manager-on-a-call-during-an-outage

  49. Notes in PagerDuty
    (For Post Mortem)

  50. http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

  51. https://github.com/HearstAT/hubot-incident

  52. SLAPI Bot (Slack API)
    • Why?
    • Take advantage of Slack API (Hubot is the least common denominator)
    • Language agnostic plugins
    • Docker as packaging system
    • https://github.com/ImperialLabs/slapi

  53. Outline
    •Just Chat

    •Just Incident Ops

    •ChatOps

    •Advanced things I have picked up

  54. • Practice in a non-stressful situation

    • In an incident - don’t use “can someone…?”

    • In a post-mortem - stop on each “could have”, “should have” or “would have”

    • Continuous Improvement just like any other part of Software Engineering

    • Automate the Mundane

    • Spending time saves time
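
Two of the habits above are about wording, so they can be checked mechanically as one small example of automating the mundane. This sketch flags "can someone" in incident chat and "could have", "should have" or "would have" in post-mortem notes. The patterns and the `flag_lines` helper are hypothetical and do not catch contractions such as "should've".

```python
import re

# Phrases called out in the talk; the regexes themselves are an assumption.
INCIDENT_PATTERN = re.compile(r"\bcan someone\b", re.IGNORECASE)
POSTMORTEM_PATTERN = re.compile(r"\b(could|should|would) have\b", re.IGNORECASE)


def flag_lines(lines, pattern):
    """Return (line_number, line) pairs for every line matching pattern."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    notes = [
        "We should have paged sooner.",
        "Deploy started at 09:00.",
        "Alice could have rolled back.",
    ]
    for number, line in flag_lines(notes, POSTMORTEM_PATTERN):
        print(f"line {number}: {line}")
```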
