Self-Healing Code: A Journey Through Auto-Remediation

Managing modern product infrastructure and applications is daunting: the bigger the infrastructure, the more complicated the operational challenges you face. Things break, daemons die, services stop, clusters fall, and on top of that you have to write root cause analysis (RCA) documents and runbooks on how to fix the same problem in the future. If you keep adding monitoring, you end up with a huge pile of alerts and failures every day.

To ensure availability of the product as you scale, you either automate or you die. This is where auto-remediation comes into the picture.

Auto-Remediation, or Self-Healing, is a workflow which triggers and responds to alerts or events by executing actions that can prevent or fix the problem.

The simplest example of auto-remediation is restarting a service (let’s say apache) when it’s down. Imagine an automated action that is triggered by a monitoring system to restart the service and prevent the application outage. In addition, it creates a task and sends a notification so that the engineer can find the root cause during business hours, and there is no need to do it in the middle of the night. Furthermore, the event-driven automation can be used for assisted troubleshooting, so when you get an alert it includes related logs, monitoring metrics/graphs, and so on.
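
The restart-and-ticket flow described above could be sketched roughly as follows. This is a minimal illustration, not the implementation used in the talk: `Alert`, `handle_service_down`, and the injected `restart`, `create_ticket`, and `notify` callables are hypothetical names standing in for whatever monitoring, ticketing, and chat integrations are actually in place.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    host: str
    service: str
    message: str


def handle_service_down(
    alert: Alert,
    restart: Callable[[str, str], bool],
    create_ticket: Callable[[str], None],
    notify: Callable[[str], None],
) -> bool:
    """Try to restart the failed service, then record what happened."""
    fixed = restart(alert.host, alert.service)
    outcome = "restarted" if fixed else "restart FAILED"
    summary = f"{alert.service} on {alert.host}: {outcome} ({alert.message})"
    # A ticket is created either way, so the root cause can be
    # investigated during business hours instead of at night.
    create_ticket(summary)
    notify(summary)
    return fixed


# Example wiring with in-memory stand-ins (assumptions, not real integrations).
demo_tickets: List[str] = []
demo_messages: List[str] = []
handle_service_down(
    Alert("web01", "apache", "process not running"),
    restart=lambda host, service: True,  # pretend the restart worked
    create_ticket=demo_tickets.append,
    notify=demo_messages.append,
)
```

Passing the side effects in as callables keeps the decision logic testable without touching a real host or ticketing system.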

Arun Kumar Singh

September 08, 2017

Transcript

  1. Agenda
     - How SRE thinks?
     - Problem Statement
     - Prerequisites
     - What is auto-remediation?
     - Pillars of auto-remediation
     - General auto-remediation workflow
     - What is SaltStack?
     - Architecture
     - Steps to auto-remediate
     - Achievements
     - Future
  2. How SRE Thinks?
     - I have 30k servers. If I get just one alert per server in a week, I will have to handle 30k alerts in a week.
     - Oh shit, I have to take on-call tonight. It would be so nice to have a bot that takes care of alerts tonight and creates a ticket for me with all the details if investigation is required.
     - Why am I doing the same task every day? Why don't I train my servers to do it for me?
  3. Prerequisites
     - Basic knowledge of Python
     - Basic knowledge of SaltStack (Python-based open source configuration management tool)
     - Knowledge of Nagios or any other monitoring tool
  4. What is Auto-Remediation?
     - Auto-remediation, or self-healing, is a workflow which triggers and responds to alerts or events by executing actions that can prevent or fix the problem.
     - The simplest example is to restart a service when it's down and create a ticket so that root cause analysis can be done in business hours.
  5. Pillars of Auto-remediation
     - Detect failure
     - Notify "the system" of the failure event
     - "The system" listens for failure events and triggers an auto-remediation response
     - Log all actions and attempts to remediate, in order to monitor and improve "the system"
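
The fourth pillar, logging every action and attempt, might look something like the sketch below. It is an illustration only: `remediate_with_audit`, `is_healthy`, `fix`, and the record fields are hypothetical names, and the retry limit of three is an arbitrary assumption.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger("auto_remediation")


def remediate_with_audit(
    is_healthy: Callable[[], bool],
    fix: Callable[[], None],
    max_attempts: int = 3,
) -> List[Dict[str, object]]:
    """Apply `fix` until `is_healthy` passes, recording every attempt."""
    audit: List[Dict[str, object]] = []
    for attempt in range(1, max_attempts + 1):
        if is_healthy():  # detect: nothing (more) to do once healthy
            break
        fix()  # respond to the failure
        record = {"attempt": attempt, "healthy_after": is_healthy()}
        audit.append(record)  # keep a trail of every attempt
        log.info("remediation attempt: %s", record)
    return audit


# Example: a fake service that comes back after the first fix.
state = {"up": False}
demo_audit = remediate_with_audit(
    lambda: state["up"], lambda: state.update(up=True)
)
```

The returned audit list is the kind of data that can later be reviewed to see which remediations actually work.
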
  6. What is SaltStack? Why SaltStack?
     - SaltStack is a Python-based open source configuration management and remote execution tool.
     - SaltStack is a revolutionary approach to infrastructure management that replaces complexity with speed. SaltStack is simple enough to get running in minutes, scalable enough to manage tens of thousands of servers, and fast enough to communicate with each system in seconds.
  7. Steps to Auto-Remediate
     - Step 1: The Salt Minion recognizes an alert via any monitoring tool.
     - Step 2: The Salt Minion sends an event to the Salt Master with the required details, such as minion id, event tag, process id, job id, service name, and data from the alert.
     - Step 3: The Salt Master responds to the event via the Salt Reactor.
     - Step 4: The Salt Reactor runs a Salt state, Salt module, Python script, etc. on the Salt Minion to auto-remediate the alert.
     - Step 5: The Salt Reactor also runs a Salt state which notifies Slack, or any other notification channel, about the execution of the auto-remediation.
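
The five steps above can be pictured with a small plain-Python simulation. This is not Salt code: in Salt the mapping from event tags to reactions would live in the master's reactor configuration, which matches tags with glob patterns. The `Reactor` class, the two reaction functions, and the `id`/`service` keys below are hypothetical stand-ins for illustration.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Reaction = Callable[[Event], str]


class Reactor:
    """Toy stand-in for the master-side reactor: tag pattern -> reactions."""

    def __init__(self) -> None:
        self._routes: List[Tuple[str, Tuple[Reaction, ...]]] = []

    def register(self, tag_pattern: str, *reactions: Reaction) -> None:
        self._routes.append((tag_pattern, reactions))

    def fire(self, tag: str, data: Event) -> List[str]:
        """Steps 3-5: run every reaction whose pattern matches the tag."""
        results: List[str] = []
        for pattern, reactions in self._routes:
            if fnmatchcase(tag, pattern):
                results.extend(reaction(data) for reaction in reactions)
        return results


def restart_service(data: Event) -> str:  # step 4 (hypothetical action)
    return f"restart {data['service']} on {data['id']}"


def notify_channel(data: Event) -> str:  # step 5 (hypothetical action)
    return f"notify: auto-remediated {data['service']} on {data['id']}"


reactor = Reactor()
reactor.register("company/product/*/*/fail", restart_service, notify_channel)

# Steps 1-2: the minion sees an alert and sends an event to the master.
actions = reactor.fire(
    "company/product/application/service_name/fail",
    {"id": "ab1234.hostname.com", "service": "service_name"},
)
```

Here the reactions only return strings describing what they would do; a real reaction would run a state or module on the minion named in the event.
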
  8. Salt event example
     - Event tag: company/product/application/service_name/fail
     - Hostname: ab1234.hostname.com
     - Service Name: service_name
     - Message: [service_name > FAIL]
     - Job ID: jid
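
As a rough sketch, the tag and payload from this slide could be assembled like this before being handed to Salt (on a minion, custom events are typically sent with the `event.send` execution module). The function name `build_failure_event` and the dictionary keys are assumptions made for illustration; the slide only lists the fields, not their exact names.

```python
from typing import Dict, Tuple


def build_failure_event(
    hostname: str,
    service_name: str,
    jid: str,
    prefix: str = "company/product/application",
) -> Tuple[str, Dict[str, str]]:
    """Return (tag, data) for a service-failure event."""
    tag = f"{prefix}/{service_name}/fail"
    data = {
        "hostname": hostname,
        "service_name": service_name,
        "message": f"[{service_name} > FAIL]",
        "jid": jid,
    }
    return tag, data


tag, data = build_failure_event("ab1234.hostname.com", "service_name", "jid")
```

Keeping the service name in the tag lets the master route events by pattern without inspecting the payload.
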
  9. Achievements
     - ~40% of alerts get resolved by themselves (using auto-remediation)
     - Reduced Customer Service Outages (CSO) by approx. 20%
     - Each SRE (Site Reliability Engineer) saves almost 2 hours every day and uses that time for other automation tasks
     - Reduced MTTR (Mean Time To Resolve) for alerts
  10. Future
      - Add better logging and tracking of failure events
      - Use logs for machine learning to make auto-remediation more intelligent
      - Use auto-remediation data for analysis of pain points in the system
      - Log other data with the event (ps list, free -m, df -h, etc. of the host) to get a view of what may be happening with the host/service at the time of the alert
      - Use RabbitMQ or Kafka for better resilience of event messages
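
The idea of attaching host diagnostics to an event could be sketched as below. This is an assumption-laden illustration: `collect_diagnostics` is a hypothetical helper, `ps aux` is one possible reading of "ps list", and commands missing on the host are recorded with a placeholder instead of raising an error.

```python
import shutil
import subprocess
from typing import Dict, Sequence

DEFAULT_COMMANDS = (("ps", "aux"), ("free", "-m"), ("df", "-h"))


def collect_diagnostics(
    commands: Sequence[Sequence[str]] = DEFAULT_COMMANDS,
    timeout: float = 10.0,
) -> Dict[str, str]:
    """Run each command and return its output keyed by the command line."""
    snapshot: Dict[str, str] = {}
    for argv in commands:
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            snapshot[key] = "<command not available>"
            continue
        try:
            done = subprocess.run(
                list(argv),
                capture_output=True,
                text=True,
                timeout=timeout,
                check=False,
            )
            # Fall back to stderr when the command printed nothing on stdout.
            snapshot[key] = done.stdout or done.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            snapshot[key] = f"<failed: {exc}>"
    return snapshot
```

The resulting dictionary could be merged into the event data so that whoever picks up the ticket sees the state of the host at the time of the alert.
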
  11. Questions?
      - Office Email: [email protected]
      - Personal Email: [email protected]
      - Phone: +91 8802332064
      - Blog Link: https://medium.com/adobe-io/self-healing-code-a-journey-through-auto-remediation-60367eea312
      - PPT Link: https://speakerdeck.com/arusing/self-healing-code-a-journey-through-auto-remediation
      - LinkedIn: https://www.linkedin.com/in/arun-kumar-singh-17119b40/