Self-Healing Code: A Journey Through Auto-Remediation

Arun Singh (Site Reliability Engineer), Adobe

Agenda
- How SRE thinks?
- Problem Statement
- Prerequisites
- What is auto-remediation?
- Pillars of auto-remediation
- General auto-remediation workflow
- What is SaltStack?
- Architecture
- Steps to auto-remediate
- Achievements
- Future

How SRE Thinks?
- "I have 30k servers. If I get just one alert per server per week, I will have to handle 30k alerts a week."
- "Oh shit, I am on call tonight. It would be so nice to have a bot that takes care of tonight's alerts and creates a ticket for me with all the details if investigation is required."
- "Why am I doing the same task every day? Why don't I train my servers to do it for me?"

Problem Statement
- Sleepless nights
- Human error
- Productivity downgrade
- Alert fatigue
- Runbooks and root cause analysis documentation

Prerequisites
- Basic knowledge of Python
- Basic knowledge of SaltStack (a Python-based open source configuration management tool)
- Knowledge of Nagios or any other monitoring tool

What is Auto-Remediation?
- Auto-remediation, or self-healing, is a workflow that is triggered by alerts or events and responds by executing actions that can prevent or fix the problem.
- The simplest example is restarting a service when it is down and creating a ticket so that root cause analysis can be done during business hours.
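The "restart and open a ticket" example can be sketched in a few lines of Python. This is only an illustration: `Remediator`, `Ticket`, and the `is_running` / `restart` callables are hypothetical names, standing in for whatever health check, service manager, and ticketing system are actually in use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    service: str
    summary: str


@dataclass
class Remediator:
    # Both callables are assumptions: plug in a real health check and restart action.
    is_running: Callable[[str], bool]
    restart: Callable[[str], None]
    tickets: List[Ticket] = field(default_factory=list)

    def handle(self, service: str) -> bool:
        """Restart the service if it is down. Return True if a restart was attempted."""
        if self.is_running(service):
            return False  # nothing to fix, no ticket needed
        self.restart(service)
        recovered = self.is_running(service)
        status = "recovered after restart" if recovered else "still down after restart"
        # Always leave a ticket so root cause analysis can happen in business hours.
        self.tickets.append(
            Ticket(service, f"{service} was down; {status}. RCA needed.")
        )
        return True
```

Passing the check and restart actions in as callables keeps the decision logic testable without touching a real service.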

Pillars of Auto-remediation
- Detect the failure
- Notify “the system” of the failure event
- “The system” listens for failure events and triggers the auto-remediation response
- Log all actions and remediation attempts, to monitor and improve “the system”
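The four pillars map onto a very small event loop. The sketch below is a toy, in-process version, assuming a plain `queue.Queue` as the notification channel; the names `detect`, `listen_once`, and `handlers` are made up for illustration and are not part of any tool named in this deck.

```python
import logging
import queue
from typing import Callable, Dict, List, Tuple

log = logging.getLogger("auto_remediation")

# The notification channel between detectors and "the system" (an assumption:
# a real setup would use a monitoring tool and an event bus instead).
events: "queue.Queue[dict]" = queue.Queue()


def detect(service: str, healthy: bool) -> None:
    """Pillars 1 and 2: detect a failure and notify the system by queuing an event."""
    if not healthy:
        events.put({"service": service, "status": "FAIL"})


def listen_once(handlers: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Pillars 3 and 4: drain queued events, trigger remediation, log every attempt."""
    audit = []
    while not events.empty():
        event = events.get()
        handler = handlers.get(event["service"])
        if handler is None:
            outcome = "no_handler"
        else:
            outcome = "fixed" if handler() else "failed"
        log.info("service=%s outcome=%s", event["service"], outcome)
        audit.append((event["service"], outcome))
    return audit
```

Returning the audit trail alongside the log lines makes it easy to see which alerts had no handler, which is useful input for improving the system.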

General Auto-Remediation Workflow

What is SaltStack? Why SaltStack?
- SaltStack is a Python-based open source configuration management and remote execution tool.
- SaltStack is a revolutionary approach to infrastructure management that replaces complexity with speed. It is simple enough to get running in minutes, scalable enough to manage tens of thousands of servers, and fast enough to communicate with each system in seconds.

Auto-Remediation Architecture

Steps to Auto-Remediate
- Step 1: The Salt Minion recognizes an alert via any monitoring tool.
- Step 2: The Salt Minion sends an event to the Salt Master with the required details, such as minion ID, event tag, process ID, job ID, service name, and data from the alert.
- Step 3: The Salt Master responds to the event via the Salt Reactor.
- Step 4: The Salt Reactor runs a Salt state, Salt module, Python script, etc. on the Salt Minion to auto-remediate the alert.
- Step 5: The Salt Reactor also runs a Salt state that notifies Slack or any other notification channel about the execution of the auto-remediation.
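Steps 3 to 5 boil down to routing an event tag to a list of reactions. The sketch below simulates that routing in plain Python; it is not Salt's Reactor, and the `REACTIONS` table, `remediate`, and `notify` are hypothetical stand-ins. It assumes tags are matched with glob patterns, as in Salt's reactor configuration.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List

Reaction = Callable[[dict], str]


def remediate(event: dict) -> str:
    # Step 4 stand-in: a real reactor would run a state or module on the minion.
    return f"restart {event['service_name']} on {event['id']}"


def notify(event: dict) -> str:
    # Step 5 stand-in: a real reactor would post to Slack or a similar channel.
    return f"notify: auto-remediation ran for {event['service_name']} on {event['id']}"


# Hypothetical routing table: tag glob -> reactions to run, in order.
REACTIONS: Dict[str, List[Reaction]] = {
    "company/product/*/*/fail": [remediate, notify],
}


def react(tag: str, event: dict) -> List[str]:
    """Step 3: run every reaction whose glob pattern matches the event tag."""
    results: List[str] = []
    for pattern, reactions in REACTIONS.items():
        if fnmatchcase(tag, pattern):
            results.extend(reaction(event) for reaction in reactions)
    return results
```

An event tagged `company/product/application/service_name/fail` triggers both reactions, while a tag that does not end in `/fail` triggers none.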

Salt event example
- Event tag: company/product/application/service_name/fail
- Hostname: ab1234.hostname.com
- Service name: service_name
- Message: [service_name > FAIL]
- Job ID: jid
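An event like the one above could be assembled and fired from a minion roughly as follows. The helper names `build_fail_event` and `send_event` and the payload keys are illustrative, not taken from the talk. `send_event` assumes Salt's Python client (`salt.client.Caller`) and the `event.send` execution function are available on the host; the import is deferred so the builder can be used without Salt installed.

```python
from typing import Any, Dict, Tuple


def build_fail_event(
    company: str,
    product: str,
    application: str,
    service_name: str,
    hostname: str,
    jid: str,
) -> Tuple[str, Dict[str, Any]]:
    """Build the event tag and data payload for a failed service."""
    tag = f"{company}/{product}/{application}/{service_name}/fail"
    data = {
        "hostname": hostname,
        "service_name": service_name,
        "message": f"[{service_name} > FAIL]",
        "jid": jid,
    }
    return tag, data


def send_event(tag: str, data: Dict[str, Any]) -> Any:
    """Fire the event from a minion to the master. Requires Salt on this host."""
    import salt.client  # imported lazily (assumption: Salt is installed here)

    caller = salt.client.Caller()
    return caller.cmd("event.send", tag, data)
```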

Example of salt modules

Monitoring and Analytics

Monitoring and Analytics Pain Point

Achievements
- ~40% of alerts are resolved automatically (using auto-remediation)
- Reduced Customer Service Outages (CSO) by approx. 20%
- Each SRE (Site Reliability Engineer) saves almost 2 hours every day and uses that time for other automation tasks
- Reduced MTTR (Mean Time To Resolve) for alerts

Future
- Add better logging and tracking of failure events
- Use logs for machine learning to make auto-remediation more intelligent
- Use auto-remediation data to analyze pain points in the system
- Log additional host data with each event (process list from ps, free -m, df -h, etc.) to get a view of what may be happening with the host or service at the time of the alert
- Use RabbitMQ or Kafka for better resilience of event messages
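Capturing host state alongside an event could look something like the sketch below. It assumes a Linux host where `ps`, `free`, and `df` are on the PATH; `collect_diagnostics` and the exact command flags (for example `ps aux`) are illustrative choices, not something specified in the talk.

```python
import subprocess
from typing import Dict, List, Optional

# Assumed default commands; adjust the flags to whatever the team actually wants.
DEFAULT_COMMANDS: Dict[str, List[str]] = {
    "processes": ["ps", "aux"],
    "memory": ["free", "-m"],
    "disk": ["df", "-h"],
}


def collect_diagnostics(
    commands: Optional[Dict[str, List[str]]] = None,
    timeout: float = 10.0,
) -> Dict[str, str]:
    """Run each command and return its output, keyed by a short name."""
    snapshot: Dict[str, str] = {}
    for name, argv in (commands or DEFAULT_COMMANDS).items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout, check=False
            )
            snapshot[name] = result.stdout or result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung command should never block the remediation itself.
            snapshot[name] = f"unavailable: {exc}"
    return snapshot
```

The returned dictionary can be attached to the event data or to the ticket, so whoever does the root cause analysis sees the host as it was at alert time.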

Questions?
- Phone: +91 8802332064
- Blog: https://medium.com/adobe-io/self-healing-code-a-journey-through-auto-remediation-60367eea312
- Slides: https://speakerdeck.com/arusing/self-healing-code-a-journey-through-auto-remediation
- LinkedIn: https://www.linkedin.com/in/arun-kumar-singh-17119b40/


Arun Kumar Singh
