The Next Wave of Reliability Engineering (Interop ITX 2018)

In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google coined the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does now is well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?

This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.

Michael

May 02, 2018

Transcript

  1. The Next Wave of Reliability Engineering
    Michael Kehoe
    Staff Site Reliability Engineer

  2. Today’s agenda
    1 Introductions
    2 Where have we come from
    3 What is Reliability Engineering
    4 Where are we going
    5 The Future of Reliability Engineering
    6 Key Takeaways
    7 Q&A

  3. Introduction

  4. Michael Kehoe
    $ WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of Queensland

  5. Production-SRE Team @ LinkedIn
    $ WHOAMI
    • Disaster Recovery – Planning & Automation
    • Incident Response – Process & Automation
    • Visibility Engineering – Making use of operational data
    • Reliability Principles – Defining best practice & automating it

  6. Where have we come from

  7. Traditional Development/Operations Bottlenecks
    • Department silos
    • Slow release cycles
    • High-toil workloads
    • Poor operational visibility
    Where have we come from

  8. What is Reliability Engineering

  9. “What happens when a software engineer is tasked with what used to be called operations”
    Ben Treynor Sloss

  10. “Helping Product and Engineering deliver the best experience possible for the end user from an operations perspective”

  11. What is Reliability Engineering

  12. DevOps Concepts
    • Reduce operational silos
    • Measure everything
    • Accept failure as normal
    • Implement gradual changes
    • Leverage tooling and automation

  13. Reduce Operational Silos
    • Shared ownership of code & infrastructure
    • Sharing of tools
    • Expectation of collaboration
    DevOps Concepts

  14. Accept Failure as Normal
    • Expect & embrace risk
    • Quantify failure via SLOs
    • Blameless postmortems
    DevOps Concepts
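As a hedged illustration of the SLO bullet above (not part of the original slides): an availability SLO implies an error budget, which turns "how much failure is acceptable" into a number. The sketch below is a minimal example under stated assumptions: the 99.9% target, the 30-day window, and the function names are all hypothetical choices made for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (assumed example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means more downtime occurred than the SLO allows.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))    # 43.2
    # After 21.6 minutes of downtime, about half the budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

A team could use the remaining fraction as one input when deciding how much release risk to take, which ties this bullet to "Expect & embrace risk".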

  15. Implement Gradual Change
    • Encourage organization to move quickly
    • Lower the cost of failure
    • Manage Risk
    DevOps Concepts

  16. Leverage Tooling and Automation
    • Automate toil away
    • Reduce ‘Human Touch’
    DevOps Concepts

  17. Measure Everything
    • Measure all aspects of systems
    • Availability
    • Errors
    • Incident statistics
    DevOps Concepts

  18. Where are we going?

  19. Where are we going?
    • Increased agility
    • Measure everything
    • Failure is the new normal
    • Automation is ubiquitous
    • Observe in depth

  20. The Next Wave of Reliability Engineering

  21. The Future of Reliability Engineering
    • Evolution of the Network Engineer
    • Observe and measure
    • Failure is the new normal
    • Automation as a Service
    • Cloud is king

  22. Dawn of the Network Reliability Engineer
    Making the network follow SRE practices
    https://forums.juniper.net/t5/SDN-and-NFV-Era/2018-and-the-Dawn-of-Network-Reliability-Engineering-NRE/ba-p/316915

  23. Evolution of Network Automation
    1. Manual Operations
    2. Automation
    3. Visibility & Visualization
    4. Data Analysis & realization
    5. Reactive, Predictive Self Operation
    Credit: Greg Ferro (Packet Pushers)
    http://packetpushers.net/taxonomy-five-levels-intent-based-networking-beta/

  24. Failure is the new Normal
    Downgrade failures from exceptional to expected
    https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/

  25. Failure is the new normal
    • Accept failure as normal
    • Test for failure:
      • Application
      • Local Infrastructure
      • Global Infrastructure
    • Continuous experimentation
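To make "test for failure" at the application level concrete, here is a minimal fault-injection sketch. It is not from the talk: `fetch_profile`, the fallback shape, and the 20% failure rate are hypothetical stand-ins. The idea is to wrap a dependency so that a fraction of calls fail on purpose, then check that the calling code still returns a usable answer.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so a fraction of calls raise InjectedFailure instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id):
    # Hypothetical dependency call; stands in for a real service request.
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(fetch, user_id):
    """Code under test: must degrade gracefully when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return {"id": user_id, "name": None, "degraded": True}


def run_experiment(trials=1000, failure_rate=0.2, seed=42):
    """Call the code under test repeatedly and count degraded responses."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_fault_injection(fetch_profile, failure_rate, rng)
    degraded = 0
    for user_id in range(trials):
        result = fetch_profile_with_fallback(flaky_fetch, user_id)
        # Every call must still return a usable answer for the right user.
        assert result["id"] == user_id
        if result.get("degraded"):
            degraded += 1
    return degraded


if __name__ == "__main__":
    print("degraded responses:", run_experiment())
```

Running an experiment like this on a schedule, rather than once, is one way to read "continuous experimentation"; infrastructure-level tests would need real tooling rather than an in-process wrapper.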

  26. Automation as a Service
    Automation & Orchestration will be a part of all systems

  27. Automation is ubiquitous
    • Automation is expected
    • Automation is unified
    • No more one-off scripts
    • Automation extends to monitoring, triage & automation
    • Automation drives down:
      • Time to Detect
      • Time to Resolve
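Whether automation actually drives these two numbers down can only be judged if they are measured. The following is a small sketch, not taken from the slides, of computing mean time to detect and mean time to resolve from incident records. The `Incident` fields, the choice to measure both from the start of impact, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when user impact began (assumed to be known)
    detected: datetime   # when monitoring or a person noticed
    resolved: datetime   # when impact ended


def mean_time_to_detect(incidents):
    """Average gap between start of impact and detection."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents):
    """Average gap between start of impact and resolution."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data, for illustration only.
SAMPLE = [
    Incident(datetime(2018, 5, 1, 10, 0), datetime(2018, 5, 1, 10, 10),
             datetime(2018, 5, 1, 11, 0)),
    Incident(datetime(2018, 5, 1, 12, 0), datetime(2018, 5, 1, 12, 20),
             datetime(2018, 5, 1, 12, 40)),
]

if __name__ == "__main__":
    print(mean_time_to_detect(SAMPLE))   # 0:15:00
    print(mean_time_to_resolve(SAMPLE))  # 0:50:00
```

Tracking these values per quarter, before and after an automation project, gives a simple before/after comparison; some teams measure time to resolve from detection instead, so the definition should be stated alongside the number.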

  28. Cloud is King
    Applications are built for the cloud
    https://woodby.com/pricing-plans

  29. Cloud is King
    • Adoption of Private & Public Clouds will continue
    • Most infrastructure will be ephemeral
    • Applications will be engineered to be ‘Cloud Native’
    • Engineering agility will continue to increase

  30. Observe & Measure
    Making the most of operational data
    https://www.acronis.com/en-us/blog/posts/web-application-monitoring-basic-framework

  31. Observe and measure
    • Machine driven triaging using tracing and advanced learning
    • Advanced analytics on performance to drive infrastructure optimization
    • Use of incident data to drive feedback loops

  32. Key Takeaways

  33. Key Takeaways
    DEVOPS CONCEPTS
    • Reduce operational silos
    • Measure everything
    • Accept failure as normal
    • Implement gradual change
    • Leverage tooling and automation

  34. Key Takeaways
    THE FUTURE OF RELIABILITY ENGINEERING
    • Evolution of the Network Engineer
    • Observe and measure
    • Failure is the new normal
    • Automation is ubiquitous
    • Cloud is king

  35. Q&A
