Building Disaster Recovery via Resilience Engineering (SV Chaos Engineering Meetup 2018)

Michael
March 28, 2018

How often have you heard stories where someone thought they had a disaster strategy, never tested it, and saw it fail when they needed it the most? LinkedIn has evolved from serving live traffic out of one data center to serving it out of four geographically distributed data centers. Serving live traffic from all four at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.
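
A minimal Python sketch of the redistribution step described above. It is not LinkedIn's implementation: the function name `redistribute`, the data center names, the equal 25% starting shares, and the rule of spreading traffic in proportion to existing shares are all assumptions made for illustration.

```python
def redistribute(shares, unhealthy):
    """Return new traffic shares after taking `unhealthy` out of rotation.

    `shares` maps a data center name to its fraction of site traffic.
    The removed share is spread over the remaining data centers in
    proportion to the traffic each one already serves (an assumed policy).
    """
    if unhealthy not in shares:
        raise KeyError(f"unknown data center: {unhealthy}")
    remaining = {dc: s for dc, s in shares.items() if dc != unhealthy}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    # Rescale the survivors so that their shares sum to 1 again.
    result = {dc: s / total for dc, s in remaining.items()}
    result[unhealthy] = 0.0
    return result


# Hypothetical starting point: four data centers with equal shares.
shares = {"DC1": 0.25, "DC2": 0.25, "DC3": 0.25, "DC4": 0.25}
after = redistribute(shares, "DC2")
```

Under these assumptions, removing one of four equally loaded data centers raises each survivor from 25% to about 33% of site traffic, which is the headroom each healthy data center would need to hold in reserve.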

As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services and whether they could handle extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn’t sufficient to provide confidence in a data center’s capacity. To solve this problem, LinkedIn drives live traffic to services site-wide by shifting traffic between data centers to simulate a disaster every business day!
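
As a rough illustration of such a load test, the Python sketch below computes the traffic split needed to push one data center to a target share of live traffic. The 60% figure echoes the load-testing slide in the transcript; which data center it applies to, the three-way equal baseline, the function name `shift_to_target`, and the proportional scaling of the other data centers are assumptions, not details from the talk.

```python
def shift_to_target(shares, target_dc, target_share):
    """Return a traffic split that gives `target_dc` exactly `target_share`.

    The other data centers keep their relative proportions and are scaled
    down (or up) so that all shares still sum to 1.
    """
    if target_dc not in shares:
        raise KeyError(f"unknown data center: {target_dc}")
    if not 0.0 <= target_share <= 1.0:
        raise ValueError("target_share must be between 0 and 1")
    others = {dc: s for dc, s in shares.items() if dc != target_dc}
    others_total = sum(others.values())
    if others_total <= 0:
        raise ValueError("no other data center to take traffic from")
    scale = (1.0 - target_share) / others_total
    plan = {dc: s * scale for dc, s in others.items()}
    plan[target_dc] = target_share
    return plan


# Hypothetical baseline: three data centers sharing traffic equally.
baseline = {"DC1": 1 / 3, "DC2": 1 / 3, "DC3": 1 / 3}
load_test = shift_to_target(baseline, "DC3", 0.60)
```

In this example the data center under test carries 60% of traffic while the other two drop to 20% each, so the test exercises real user traffic rather than artificial load.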

Michael Kehoe will discuss how LinkedIn shifts traffic between its data centers to chaos/disaster-test site-wide services for improved disaster recovery preparation.

Transcript

  1. Building Disaster Recovery via Resilience Engineering. Michael Kehoe, Staff SRE, LinkedIn
  2. Tonight’s agenda
    1. Introductions
    2. What is Resilience Engineering
    3. The Problem Statement
    4. Project Overview
    5. Testing Process
    6. Project Outcomes
    7. Key Takeaways
    8. Q&A
  3. Introduction

  4. Michael Kehoe: /USR/BIN/WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of Queensland
  5. Who are we? PRODUCTION-SRE TEAM AT LINKEDIN
    • Disaster Recovery Planning and Automation
    • Incident Response and Automation
    • Visibility Engineering
    • Reliability Principles
  6. LinkedIn: EVOLUTION OF THE INFRASTRUCTURE
    • Timeline years: 2003, 2010, 2011, 2013, 2014, 2015
    • Stages: Active & Passive; Active & Active; Multi-colo 3-way Active & Active; Multi-colo n-way Active & Active
  7. LinkedIn 2018: 4 Data Centers, 21 PoPs, 1000+ services

  8. What is Resilience Engineering?

  9. What is Resilience Engineering?
    • Projects that directly demand increased resilience from our applications and infrastructure.
    • Application Injection Failure
    • Infrastructure Injection Failure
    • Full Disaster-Recovery Tests
  10. Problem Statement

  11. How often have you heard stories where someone thought they had a disaster strategy, never tested it, and saw it fail when they needed it the most?
  12. Problem Statement
    • How do we ensure that we always have disaster recovery ability without incident?
    • How do we consistently test for disaster recovery ability without disrupting the company?
  13. Project Overview

  14. Project Overview
    • Build a process (with automation) to facilitate disaster recovery
    • Operate the process on a regular cadence
    • Provide reporting on test outcomes to engineering executives
  15. Testing Process

  16. What is Load Testing?
    • 5x a week
    • Peak-hour traffic
    • Fixed SLA
  17. LinkedIn Traffic-Tier (diagram labels: Border Router, IPVS, ATS, ATS, Frontend, EDGE, FABRIC, Stickyrouting)
  18. LinkedIn Traffic-Tier: Fabric Buckets (diagram shows buckets labeled 1, 2, 3, 10, 91, 92, 93, 100)
  19. LinkedIn Traffic-Tier (diagram labels: EDGE, FABRIC, DC1, DC2, Stickyrouting)
    • DC1 in Cookie
    • Got DC2 as secondary fabric
    • Gets secondary fabric for user
  20. TrafficShift Architecture (diagram labels: Web application, Salt master, Stickyrouting Service, Couchbase Backend, Worker Processes, FABRIC BUCKETS)
  21. Load Testing (diagram labels: FABRIC, DC3, DC1, DC2, 60%, Traffic Percentage)

  22. Load Testing

  23. Project Outcomes

  24. Benefits of Load-testing: Capacity Planning, Identify Bugs, Confidence

  25. Benefits of Load-testing: CAPACITY PLANNING
    • Through this process, we continuously validate our infrastructure capacity
    • This is the best signal we can possibly get since we’re simulating a real disaster
  26. Benefits of Load-testing: IDENTIFY BUGS
    • Some bugs are only found at high load (under duress)
    • Helps find inefficiencies that otherwise may not be found until it’s too late
    • Gives us clues on how to make our code more resilient to potential failure
  27. Benefits of Load-testing: CONFIDENCE
    • Through load-testing, we’ve built confidence in our disaster recovery strategy
    • We understand exactly:
      • What process to follow
      • How long it takes to avert disaster
      • What the risks associated with a disaster incident are
  28. Key Takeaways

  29. Key Takeaways
    • Resilience Engineering is a must for LinkedIn
    • Design infrastructure to facilitate disaster recovery
    • Disaster-test regularly to avoid surprises
    • Automate your testing/process to reduce engagement time
  30. Q&A

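
The bucket-based routing hinted at on slides 18 and 19 could look roughly like the Python sketch below. The slides only show bucket labels up to 100, a primary fabric carried in a cookie, and a secondary fabric returned by Stickyrouting. Everything else here is an assumption made for illustration: the total of 100 buckets, the SHA-256 hashing scheme, the function names `bucket_for` and `route`, and the way offline buckets are represented.

```python
import hashlib

NUM_BUCKETS = 100  # assumption: the slide shows bucket labels from 1 up to 100


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in the range 1..NUM_BUCKETS."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def route(member_id: str, primary: str, secondary: str, offline: dict) -> str:
    """Pick the fabric that should serve this member.

    `offline` maps a fabric name to the set of buckets currently shifted
    away from it. A member whose bucket is offline in the primary fabric
    is served from the secondary fabric instead.
    """
    bucket = bucket_for(member_id)
    if bucket in offline.get(primary, set()):
        return secondary
    return primary


# Hypothetical member whose primary fabric is DC1 and secondary is DC2.
member = "member-42"
normal = route(member, "DC1", "DC2", offline={})
shifted = route(member, "DC1", "DC2", offline={"DC1": {bucket_for(member)}})
```

Moving traffic in bucket-sized steps, as sketched here, would let a shift proceed gradually: marking more buckets offline in a fabric drains a larger percentage of its members to their secondary fabrics.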