Tulp: Integrating Machine Learning and Chaos Engineering

Yury Nino

November 22, 2020

Transcript

  1. Agenda: • Managing Incidents • Postmortems • Chaos Engineering • Approach with Machine Learning • Tulp: Classifying postmortems • Next steps
  2. After a storm, life tends to return to a state of normality. "We have to write what?" "We are so tired, let me write it tomorrow!" "A friendly reminder: we have SLAs." It is time for the postmortem.
  3. A postmortem is an artifact with a detailed description of exactly what went wrong in an incident. A postmortem is a written record of an incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident from recurring.
  4. Chaos Engineering: the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience. (www.principleofchaos.com)
  5. History of Chaos Engineering: 2008: Chaos Engineering began at Netflix. 2010: Chaos Monkey & Simian Army were launched. 2016: Gremlin born. 2017: SRE USENIX, Chaos IQ born, ChaosConf. 2018: 1 Book, Chaos Monkey for Spring Boot. 2019: 1 Book, Chaos massification. 2020: 1 Book was published.
  6. Experiment hypothesis: validate that there is no interruption in computing metrics when the different Spark components fail. To simulate such failures, we employed a whack-a-mole approach and killed the various Spark components.
  7. Artificial Intelligence: intelligence demonstrated by machines. The area of computer science that studies how machines can perform tasks that would normally require a sentient agent. (From Artificial Intelligence with Python)
  8. Chaos Engineering could be improved with Artificial Intelligence!
     • Handle large amounts of data in an efficient way.
     • Ingest data from multiple sources without any lag.
     • Learn from new data and update constantly using the right learning algorithms.
     • Continue with tasks without getting tired or needing breaks.
     Postmortems
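
The point on slide 8 about learning from new data and updating constantly can be illustrated with a small incremental text classifier for postmortems. This is only a minimal sketch under stated assumptions: it is not Tulp's actual implementation, the `IncrementalNaiveBayes` class is a hypothetical name, the choice of a multinomial Naive Bayes model is an assumption made for illustration, and the postmortem snippets and the `database` / `network` labels are made-up toy data.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes that is updated one labeled document at a time."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # word frequencies per label
        self.vocabulary = set()                   # all words seen so far

    @staticmethod
    def tokenize(text):
        # Deliberately naive: lowercase and split on whitespace.
        return text.lower().split()

    def learn(self, text, label):
        """Update the counts with a single labeled postmortem."""
        words = self.tokenize(text)
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def predict(self, text):
        """Return the most probable label for the given text."""
        if not self.doc_counts:
            raise ValueError("the model has not seen any training data yet")
        total_docs = sum(self.doc_counts.values())
        # One extra slot is reserved for words never seen during training.
        vocab_size = len(self.vocabulary) + 1
        words = self.tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                count = self.word_counts[label][word]
                # Laplace (add-one) smoothing avoids log(0) for unseen words.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up postmortem summaries used only to exercise the sketch.
model = IncrementalNaiveBayes()
model.learn("database connection pool exhausted after failover", "database")
model.learn("replica lag caused stale reads in the database", "database")
model.learn("dns resolution timeout between network zones", "network")
model.learn("packet loss on the network load balancer", "network")

print(model.predict("stale reads after database failover"))  # → database

# New postmortems can be folded in later without retraining from scratch.
model.learn("expired tls certificate broke the network gateway", "network")
```

Because the model only stores counts, each new postmortem updates it in constant time per word, which is one simple way to "update constantly" as incidents accumulate. A real system would likely need better tokenization and far more labeled data.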