The Next Wave of Reliability Engineering (Interop ITX 2018)

In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google coined the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does is now well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?

This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.


May 02, 2018

  1. The Next Wave of Reliability Engineering. Michael Kehoe, Staff Site Reliability Engineer
  2. Today’s agenda: 1. Introductions 2. Where have we come from 3. What is Reliability Engineering 4. Where are we going 5. The Future of Reliability Engineering 6. Key Takeaways 7. Q&A
  3. Introduction

  4. $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  5. $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery – Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  6. Where have we come from

  7. Where have we come from: Traditional Development/Operations Bottlenecks • Department silos • Slow release cycles • High-toil workloads • Poor operational visibility
  8. What is Reliability Engineering

  9. “What happens when a software engineer is tasked with what used to be called operations” (Ben Treynor Sloss)
  10. “Helping Product and Engineering deliver the best experience possible for the end user from an operations perspective”
  11. What is Reliability Engineering

  12. DevOps Concepts • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual changes • Leverage tooling and automation
  13. DevOps Concepts: Reduce Operational Silos • Shared ownership of code & infrastructure • Sharing of tools • Expectation of collaboration
  14. DevOps Concepts: Accept Failure as Normal • Expect & embrace risk • Quantify failure via SLOs • Blameless postmortems
  15. DevOps Concepts: Implement Gradual Change • Encourage the organization to move quickly • Lower the cost of failure • Manage risk
  16. DevOps Concepts: Leverage Tooling and Automation • Automate toil away • Reduce ‘Human Touch’
  17. DevOps Concepts: Measure Everything • Measure all aspects of systems • Availability • Errors • Incident statistics
  18. Where are we going?

  19. Where are we going? • Increased agility • Measure everything • Failure is the new normal • Automation is ubiquitous • Observe in depth
  20. The Next Wave of Reliability Engineering

  21. The Future of Reliability Engineering • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation as a Service • Cloud is king
  22. Dawn of the Network Reliability Engineer: Making the network follow SRE practices https://forums.juniper.net/t5/SDN-and-NFV-Era/2018-and-the-Dawn-of-Network-Reliability-Engineering-NRE/ba-p/316915
  23. Evolution of Network Automation: 1. Manual Operations 2. Automation 3. Visibility & Visualization 4. Data Analysis & realization 5. Reactive, Predictive Self Operation. Credit: Greg Ferro (Packet Pushers) http://packetpushers.net/taxonomy-five-levels-intent-based-networking-beta/
  24. Failure is the New Normal: Downgrade failures from exceptional to expected https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
  25. Failure is the new normal • Accept failure as normal • Test for failure: • Application • Local Infrastructure • Global Infrastructure • Continuous experimentation
  26. Automation as a Service: Automation & Orchestration will be a part of all systems
  27. Automation is ubiquitous • Automation is expected • Automation is unified • No more one-off scripts • Automation extends to monitoring, triage & automation • Automation drives down: • Time to Detect • Time to Resolve
  28. Cloud is King: Applications are built for the cloud https://woodby.com/pricing-plans

  29. Cloud is King • Adoption of Private & Public Clouds will continue • Most infrastructure will be ephemeral • Applications will be engineered to be ‘Cloud Native’ • Engineering agility will continue to increase
  30. Observe & Measure: Making the most of operational data https://www.acronis.com/en-us/blog/posts/web-application-monitoring-basic-framework

  31. Observe and measure • Machine-driven triaging using tracing and advanced learning • Advanced analytics on performance to drive infrastructure optimization • Use of incident data to drive feedback loops
  32. Key Takeaways

  33. Key Takeaways: DevOps Concepts • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual change • Leverage tooling and automation

  34. Key Takeaways • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation is ubiquitous • Cloud is king
  35. Q&A

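
Slide 14 suggests quantifying failure via SLOs. One common way to do that is an error budget: the amount of failure an SLO target permits over a measurement window. The following is a minimal Python sketch, assuming a request-based availability SLO. The function names, the 99.9% target and the request counts are illustrative assumptions and do not come from the talk.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Return how many requests may fail in the window without breaching the SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent; a negative value means the SLO is breached.
    """
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 of them failed, 99.9% target.
    allowed = error_budget(0.999, 1_000_000)
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

With these example numbers the budget is about 1,000 failed requests, and about 75% of it is still unspent. A team could use the remaining fraction as one input when deciding how much release risk to take on, which connects to the "Manage risk" point on slide 15.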