Building Disaster Recovery via Resilience Engineering (SV Chaos Engineering Meetup 2018)

Michael
March 28, 2018

How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it fails when you need it the most? LinkedIn has evolved from serving live traffic out of one data center to four geographically distributed data centers. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users.

As LinkedIn transitioned from big monolithic applications to microservices, it became difficult to determine the capacity constraints of individual services when handling extra load during disaster scenarios. Stress testing individual services with artificial load in a complex microservices architecture wasn’t sufficient to provide confidence in a data center’s capacity. To solve this problem, LinkedIn moves live traffic to services site-wide by shifting traffic between data centers, simulating a disaster every business day!
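
The arithmetic behind shifting traffic away from one data center can be sketched as follows. This is a minimal illustration, not LinkedIn's implementation: the data center names, the starting shares, and the `redistribute` helper are all hypothetical, and it assumes the drained site's traffic is spread over the remaining sites in proportion to what each already serves.

```python
def redistribute(shares: dict[str, float], drained: str) -> dict[str, float]:
    """Return new traffic shares after taking `drained` out of rotation.

    Assumption (not from the talk): the drained site's traffic is spread
    over the remaining sites in proportion to the share each already serves.
    """
    if drained not in shares:
        raise KeyError(f"unknown data center: {drained}")
    remaining = {dc: s for dc, s in shares.items() if dc != drained}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no healthy data center left to absorb traffic")
    # Renormalize so the remaining shares sum to 1 again.
    return {dc: s / total for dc, s in remaining.items()}


# Hypothetical starting point: four sites, each serving a quarter of traffic.
shares = {"dc1": 0.25, "dc2": 0.25, "dc3": 0.25, "dc4": 0.25}
after = redistribute(shares, "dc3")
```

With four equally loaded sites, draining one raises each remaining site from a quarter to a third of total traffic, which is the extra load each service has to be able to absorb.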

Michael Kehoe will discuss how LinkedIn shifts traffic between its data centers to chaos-test and disaster-test site-wide services for improved disaster recovery preparation.


Transcript

  1. 2.

    Tonight’s agenda
    1 Introductions
    2 What is Resilience Engineering
    3 The Problem Statement
    4 Project Overview
    5 Testing Process
    6 Project Outcomes
    7 Key Takeaways
    8 Q&A
  2. 4.

    Michael Kehoe
    /USR/BIN/WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of Queensland
  3. 5.

    Who are we?
    PRODUCTION-SRE TEAM AT LINKEDIN
    • Disaster Recovery Planning and Automation
    • Incident Response and Automation
    • Visibility Engineering
    • Reliability Principles
  4. 6.

    LinkedIn
    EVOLUTION OF THE INFRASTRUCTURE
    [Timeline. Years shown: 2003, 2010, 2011, 2013, 2014, 2015. Stages shown, in order: Active & Passive; Active & Active; Multi-colo 3-way Active & Active; Multi-colo n-way Active & Active.]
  5. 9.

    What is Resilience Engineering?
    • Projects that directly demand increased resilience from our applications and infrastructure.
    • Application Injection Failure
    • Infrastructure Injection Failure
    • Full Disaster-Recovery Tests
  6. 11.

    How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it fails when you need it the most?
  7. 12.

    Problem Statement
    • How do we ensure that we always have disaster recovery ability without incident?
    • How do we consistently test for disaster recovery ability without disrupting the company?
  8. 14.

    Project Overview
    • Build a process (with automation) to facilitate disaster recovery
    • Operate the process on a regular cadence
    • Report the outcomes of tests to engineering executives
  9. 19.

    LinkedIn Traffic-Tier
    [Diagram: EDGE and FABRIC tiers with DC1 and DC2. Labels: "DC1 in Cookie", "Got DC2 as secondary fabric", "Gets secondary fabric for user", "Stickyrouting".]
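
One possible reading of the labels on this slide is that the edge first honors the fabric named in the user's cookie and falls back to a secondary fabric when the primary is out of rotation. The sketch below illustrates only that reading. The function name `pick_fabric`, its parameters, and the lookup flow are assumptions made for illustration, and nothing here is taken from LinkedIn's Stickyrouting service.

```python
def pick_fabric(cookie_fabric: str, secondary_fabric: str, online: set[str]) -> str:
    """Choose the fabric (data center) that should serve a request.

    Hypothetical logic: prefer the fabric stored in the user's cookie,
    fall back to the user's secondary fabric if the primary is offline.
    """
    if cookie_fabric in online:
        return cookie_fabric
    if secondary_fabric in online:
        return secondary_fabric
    raise RuntimeError("neither primary nor secondary fabric is online")


# Normal operation: the cookie says DC1 and DC1 is serving traffic.
normal = pick_fabric("DC1", "DC2", online={"DC1", "DC2"})

# Traffic shift: DC1 is taken out of rotation, so the user lands on DC2.
shifted = pick_fabric("DC1", "DC2", online={"DC2"})
```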
  10. 25.

    Benefits of Load-testing
    CAPACITY PLANNING
    • Through this process, we continuously validate our infrastructure capacity
    • This is the best signal we can possibly get since we’re simulating a real disaster
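
A capacity check of the kind this slide describes could look like the sketch below. It is an illustration under stated assumptions: the service names, the queries-per-second figures, the 80% utilization limit, and the `over_threshold` helper are all hypothetical and do not come from the talk.

```python
def over_threshold(
    peak_qps: dict[str, float],
    capacity_qps: dict[str, float],
    limit: float = 0.8,
) -> list[str]:
    """List services whose peak load during a traffic shift exceeded
    `limit` (a fraction) of their known capacity."""
    flagged = []
    for service, peak in peak_qps.items():
        utilization = peak / capacity_qps[service]
        if utilization > limit:
            flagged.append(service)
    return sorted(flagged)


# Hypothetical peak load observed while one data center was drained.
peak = {"feed": 900.0, "search": 450.0, "profile": 300.0}
# Hypothetical per-service capacity in the data centers absorbing the load.
capacity = {"feed": 1000.0, "search": 1000.0, "profile": 500.0}

flagged = over_threshold(peak, capacity)
```

Here only `feed` runs above the limit (90% utilization), so it would be the service to investigate before the next shift.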
  11. 26.

    Benefits of Load-testing
    IDENTIFY BUGS
    • Some bugs are only found at high load (under duress)
    • Helps find inefficiencies that otherwise may not be found until it’s too late
    • Gives us clues on how to make our code more resilient to potential failure
  12. 27.

    Benefits of Load-testing
    CONFIDENCE
    • Through load-testing, we’ve built confidence in our disaster recovery strategy
    • We understand exactly:
        • What process to follow
        • How long it takes to avert disaster
        • What are the risks associated with a disaster incident
  13. 29.

    Key Takeaways
    • Resilience Engineering is a must for LinkedIn
    • Design infrastructure to facilitate disaster recovery
    • Disaster-test regularly to avoid surprises
    • Automate your testing/process to reduce engagement time
  14. 30.

    Q&A

  15. 31.