The Next Wave of Reliability Engineering (Interop ITX 2018)

In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google coined the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does now is well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?

This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.

Michael

May 02, 2018
Transcript

  1. 2.

    Today’s agenda: 1. Introductions 2. Where have we come from 3. What is Reliability Engineering 4. Where are we going 5. The Future of Reliability Engineering 6. Key Takeaways 7. Q&A
  2. 4.

    $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  3. 5.

    $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery – Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  4. 7.

    Where have we come from: Traditional Development/Operations Bottlenecks • Department silos • Slow release cycles • High-toil workloads • Poor operational visibility
  5. 9.

    “What happens when a software engineer is tasked with what used to be called operations” Ben Treynor Sloss
  6. 10.

    “Helping Product and Engineering deliver the best experience possible for the end user from an operations perspective”
  7. 12.

    DevOps Concepts • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual changes • Leverage tooling and automation
  8. 13.

    DevOps Concepts: Reduce Operational Silos • Shared ownership of code & infrastructure • Sharing of tools • Expectation of collaboration
  9. 14.

    DevOps Concepts: Accept Failure as Normal • Expect & embrace risk • Quantify failure via SLOs • Blameless postmortems
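
One common way to quantify failure via an SLO is an error budget: the number of failures the objective permits over a window. The sketch below is illustrative only and is not taken from the talk; the function name `error_budget`, its parameters, and the example numbers are all assumptions.

```python
def error_budget(slo_target, total_events, failed_events):
    """Summarize an error budget for one measurement window.

    slo_target    -- hypothetical success objective as a fraction, e.g. 0.999
    total_events  -- number of requests (or other events) in the window
    failed_events -- number of those events that failed
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # Failures the SLO tolerates in this window.
    allowed = (1.0 - slo_target) * total_events
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_events,
        "consumed_fraction": failed_events / allowed,
    }


if __name__ == "__main__":
    # Assumed example: a 99.9% objective over one million requests.
    print(error_budget(0.999, 1_000_000, 250))
```

A team might treat a negative `remaining` value as a signal to slow releases until reliability recovers, though that policy is a choice for each organization.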
  10. 15.

    DevOps Concepts: Implement Gradual Change • Encourage organization to move quickly • Lower the cost of failure • Manage risk
  11. 17.

    DevOps Concepts: Measure Everything • Measure all aspects of systems • Availability • Errors • Incident statistics
  12. 19.

    Where are we going? • Increased agility • Measure everything • Failure is the new normal • Automation is ubiquitous • Observe in depth
  13. 21.

    The Future of Reliability Engineering • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation as a Service • Cloud is king
  14. 22.

    Dawn of the Network Reliability Engineer: Making the network follow SRE practices https://forums.juniper.net/t5/SDN-and-NFV-Era/2018-and-the-Dawn-of-Network-Reliability-Engineering-NRE/ba-p/316915
  15. 23.

    Evolution of Network Automation: 1. Manual Operations 2. Automation 3. Visibility & Visualization 4. Data Analysis & realization 5. Reactive, Predictive Self Operation Credit: Greg Ferro (Packet Pushers) http://packetpushers.net/taxonomy-five-levels-intent-based-networking-beta/
  16. 24.

    Failure is the new normal: Downgrade failures from exceptional to expected https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
  17. 25.

    Failure is the new normal • Accept failure as normal • Test for failure: • Application • Local Infrastructure • Global Infrastructure • Continuous experimentation
  18. 27.

    Automation is ubiquitous • Automation is expected • Automation is unified • No more one-off scripts • Automation extends to monitoring, triage & automation • Automation drives down: • Time to Detect • Time to Resolve
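
To see whether automation is driving down Time to Detect and Time to Resolve, both have to be measured from incident records. The following is a minimal sketch, not something shown in the talk: the `Incident` record and its field names are hypothetical, and it assumes both durations are measured from the moment the incident started.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime   # when the problem began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents):
    """Average delay between an incident starting and being detected."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents):
    """Average delay between an incident starting and being resolved."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        Incident(datetime(2018, 5, 1, 10, 0), datetime(2018, 5, 1, 10, 5),
                 datetime(2018, 5, 1, 10, 35)),
        Incident(datetime(2018, 5, 1, 12, 0), datetime(2018, 5, 1, 12, 15),
                 datetime(2018, 5, 1, 13, 25)),
    ]
    print("MTTD:", mean_time_to_detect(sample))
    print("MTTR:", mean_time_to_resolve(sample))
```

Tracking these two averages per quarter, before and after an automation effort, is one simple way to check the claim on this slide against real data.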
  19. 29.

    Cloud is king • Adoption of Private & Public Clouds will continue • Most infrastructure will be ephemeral • Applications will be engineered to be ‘Cloud Native’ • Engineering agility will continue to increase
  20. 31.

    Observe and measure • Machine-driven triaging using tracing and advanced learning • Advanced analytics on performance to drive infrastructure optimization • Use of incident data to drive feedback loops
  21. 33.

    Key Takeaways: DEVOPS CONCEPTS • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual change • Leverage tooling and automation
  22. 34.

    Key Takeaways: THE FUTURE OF RELIABILITY ENGINEERING • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation is ubiquitous • Cloud is king
  23. 35.

    Q&A

  24. 36.