Building Disaster Recovery via Resilience Engineering (SV Chaos Engineering Meetup 2018)

Michael
March 28, 2018

How often have you heard stories where someone thought they had a disaster strategy, never tested it, and then saw it fail when it was needed most? LinkedIn has evolved from serving live traffic out of one data center to serving it out of four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.

As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine whether individual services had the capacity to handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture did not provide enough confidence in a data center’s capacity. To solve this problem, LinkedIn shifts live traffic between data centers every business day, which load-tests services site-wide by simulating a disaster.

Michael Kehoe will discuss how LinkedIn shifts traffic between its data centers to chaos/disaster-test site-wide services for improved disaster recovery preparation.

Transcript

  1. Building Disaster Recovery via Resilience Engineering
    Michael Kehoe
    Staff SRE - LinkedIn

  2. Tonight’s agenda
    1 Introductions
    2 What is Resilience Engineering
    3 The Problem Statement
    4 Project Overview
    5 Testing Process
    6 Project Outcomes
    7 Key Takeaways
    8 Q&A

  3. Introduction

  4. Michael Kehoe
    /USR/BIN/WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of
    Queensland

  5. Who are we?
    PRODUCTION-SRE TEAM AT LINKEDIN
    • Disaster Recovery Planning and Automation
    • Incident Response and Automation
    • Visibility Engineering
    • Reliability Principles

  6. LinkedIn
    EVOLUTION OF THE INFRASTRUCTURE
    Timeline: 2003, 2010, 2011, 2013, 2014, 2015
    Stages, in order:
    • Active & Passive
    • Active & Active
    • Multi-colo 3-way Active & Active
    • Multi-colo n-way Active & Active

  7. LinkedIn
    2018
    • 4 Data Centers
    • 21 PoPs
    • 1000+ services

  8. What is Resilience
    Engineering?

  9. What is Resilience Engineering?
    • Projects that directly demand increased
    resilience from our applications and
    infrastructure.
    • Application Injection Failure
    • Infrastructure Injection Failure
    • Full Disaster-Recovery Tests

  10. Problem Statement

  11. How often have you heard stories where someone thought
    they had a disaster strategy, never tested it, and then saw it
    fail when it was needed most?

  12. Problem Statement
    • How do we ensure that we always have
    disaster recovery ability without incident?
    • How do we consistently test for disaster
    recovery ability without disrupting the
    company?

  13. Project Overview

  14. Project Overview
    • Build a process (with Automation) to facilitate disaster recovery
    • Operate the process on regular cadence
    • Report the outcomes of tests to engineering executives

  15. Testing Process

  16. What is Load Testing?
    • 5x a week
    • Peak hour traffic
    • Fixed SLA

  17. LinkedIn Traffic-Tier
    Border Router, IPVS, ATS, ATS, Frontend
    EDGE FABRIC
    Stickyrouting

  18. LinkedIn Traffic-Tier
    Fabric Buckets
    [Figure: grid of numbered buckets; visible labels 1, 2, 3, 10, 91, 92, 93, 100]
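
The slide shows buckets numbered up to 100 inside a fabric. As a rough illustration of how bucketing can make partial traffic shifts possible, the sketch below hashes a member ID into one of 100 buckets and picks a subset of a fabric's buckets to move. The hashing scheme and the names `bucket_for` and `buckets_to_move` are hypothetical and are not taken from the talk.

```python
import hashlib

NUM_BUCKETS = 100  # the slide shows bucket labels up to 100


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in 1..NUM_BUCKETS.

    Hypothetical scheme: the talk does not say how members are hashed.
    """
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def buckets_to_move(online_buckets: list[int], fraction: float) -> list[int]:
    """Pick the lowest-numbered buckets so that roughly `fraction` of them move."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = round(len(online_buckets) * fraction)
    return sorted(online_buckets)[:count]
```

Because every member always lands in the same bucket, moving a bucket moves a fixed slice of members, which lets traffic be shifted in small, repeatable steps.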

  19. LinkedIn Traffic-Tier
    EDGE FABRIC
    DC1
    DC2
    DC1 in Cookie
    Got DC2 as secondary fabric
    Gets secondary fabric for user
    Stickyrouting
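
The labels on this slide suggest that a request carrying DC1 in its cookie can be given DC2 as a secondary fabric. A minimal sketch of that fallback decision follows, assuming the edge knows which fabrics are currently online. The `Assignment` type and `choose_fabric` function are hypothetical names for illustration and do not describe the real Stickyrouting API.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Assignment:
    """Primary and secondary fabric for a member (hypothetical shape)."""
    primary: str
    secondary: str


def choose_fabric(cookie_fabric: Optional[str],
                  assignment: Assignment,
                  online: Set[str]) -> str:
    """Return the fabric that should serve this request."""
    # Honor the fabric recorded in the cookie while it is still online.
    if cookie_fabric is not None and cookie_fabric in online:
        return cookie_fabric
    # Otherwise fall back to the primary, then the secondary fabric.
    for fabric in (assignment.primary, assignment.secondary):
        if fabric in online:
            return fabric
    raise RuntimeError("no online fabric available for this member")
```

With DC1 offline, `choose_fabric("DC1", Assignment("DC1", "DC2"), {"DC2"})` returns `"DC2"`, matching the flow drawn on the slide.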

  20. TrafficShift Architecture
    • Web application
    • Salt master
    • Stickyrouting Service
    • Couchbase
    • Backend Worker Processes
    • FABRIC BUCKETS

  21. Load Testing
    FABRIC: DC1, DC2, DC3
    Traffic Percentage: 60%
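
The slide pairs three fabrics with a 60% traffic percentage. Assuming this means one fabric is driven to 60% of site traffic while the rest is spread over the others, the arithmetic can be sketched as below. That reading of the slide, the function name `shift_to_target`, and the proportional split are all assumptions made for illustration.

```python
def shift_to_target(shares: dict[str, float],
                    target: str,
                    target_share: float) -> dict[str, float]:
    """Give `target` the requested share and scale the other fabrics proportionally."""
    if target not in shares:
        raise KeyError(target)
    if not 0.0 <= target_share <= 1.0:
        raise ValueError("target_share must be between 0 and 1")
    others_total = sum(v for k, v in shares.items() if k != target)
    if others_total <= 0:
        raise ValueError("no other fabric carries traffic to rebalance")
    remaining = 1.0 - target_share
    return {
        name: target_share if name == target else share / others_total * remaining
        for name, share in shares.items()
    }
```

Starting from three equal shares, `shift_to_target({"DC1": 1/3, "DC2": 1/3, "DC3": 1/3}, "DC3", 0.6)` leaves DC1 and DC2 with about 20% each while DC3 carries 60%.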

  22. Load Testing

  23. Project Outcomes

  24. Benefits of Load-testing
    • Capacity Planning
    • Identify Bugs
    • Confidence

  25. Benefits of Load-testing
    CAPACITY PLANNING
    • Through this process, we continuously validate our infrastructure
    capacity
    • This is the best signal we can possibly get since we’re simulating a real
    disaster

  26. Benefits of Load-testing
    IDENTIFY BUGS
    • Some bugs are only found at high load (under duress)
    • Helps find inefficiencies that otherwise may not be found until it’s too late
    • Gives us clues on how to make our code more resilient to potential failure

  27. Benefits of Load-testing
    CONFIDENCE
    • Through load-testing, we’ve built confidence in our disaster recovery
    strategy
    • We understand exactly:
    • What process to follow
    • How long it takes to avert disaster
    • What risks are associated with a disaster incident

  28. Key Takeaways

  29. Key Takeaways
    • Resilience Engineering is a must for LinkedIn
    • Design infrastructure to facilitate disaster
    recovery
    • Disaster-test regularly to avoid surprises
    • Automate your testing/process to reduce
    engagement time

  30. Q&A
