Building Disaster Recovery via Resilience Engineering (SV Chaos Engineering Meetup 2018)

Building Disaster Recovery via Resilience Engineering
Michael Kehoe, Staff SRE, LinkedIn

Tonight’s agenda
1 Introductions
2 What is Resilience Engineering
3 The Problem Statement
4 Project Overview
5 Testing Process
6 Project Outcomes
7 Key Takeaways
8 Q&A

Introduction

Michael Kehoe /USR/BIN/WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland

Who are we? PRODUCTION-SRE TEAM AT LINKEDIN
• Disaster Recovery Planning and Automation
• Incident Response and Automation
• Visibility Engineering
• Reliability Principles

LinkedIn: EVOLUTION OF THE INFRASTRUCTURE
[Timeline: 2003, 2010, 2011, 2013, 2014, 2015]
• Active & Passive
• Active & Active
• Multi-colo 3-way Active & Active
• Multi-colo n-way Active & Active

LinkedIn 2018
• 4 Data Centers
• 21 PoPs
• 1000+ services

What is Resilience Engineering?

What is Resilience Engineering?
• Projects that directly demand increased resilience from our applications and infrastructure.
• Application Failure Injection
• Infrastructure Failure Injection
• Full Disaster-Recovery Tests

Problem Statement

How often have you heard stories where someone thought they had a disaster strategy,
never tested it, and it failed when they needed it the most?

Problem Statement
• How do we ensure that we always have disaster recovery ability without incident?
• How do we consistently test for disaster recovery ability without disrupting the company?

Project Overview

Project Overview
• Build a process (with Automation) to facilitate disaster recovery
• Operate the process on a regular cadence
• Provide reporting on outcomes of tests to engineering executives

Testing Process

What is Load Testing?
• 5x a week
• Peak hour traffic
• Fixed SLA

LinkedIn Traffic-Tier
[Diagram of the request path across EDGE and FABRIC: Border Router, IPVS, ATS, ATS, Frontend, with Stickyrouting]

LinkedIn Traffic-Tier: Fabric Buckets
[Diagram: buckets numbered 1, 2, 3 ... 10 through 91, 92, 93 ... 100]
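
The bucket diagram suggests that members are divided into 100 numbered buckets, which can then be assigned to a fabric. A minimal sketch of that idea follows. The use of a SHA-256 hash of the member ID and the names `NUM_BUCKETS` and `bucket_for_member` are illustrative assumptions; the slides do not say how LinkedIn computes a member's bucket.

```python
import hashlib

NUM_BUCKETS = 100  # the slide shows buckets numbered 1 through 100


def bucket_for_member(member_id: str) -> int:
    """Map a member ID to a stable bucket number in 1..NUM_BUCKETS.

    Hypothetical helper: the hash function is an assumption, chosen only
    because it gives the same bucket for the same member on every call.
    """
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into 1..100.
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1
```

Because the mapping is deterministic, moving a bucket between fabrics moves the same set of members every time, which is what makes a bucket a convenient unit for shifting traffic.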

LinkedIn Traffic-Tier
[Diagram, EDGE and FABRIC with DC1 and DC2: DC1 in Cookie; Gets secondary fabric for user from Stickyrouting; Got DC2 as secondary fabric]
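
The lookup in this diagram can be sketched as a small routing decision: use the fabric named in the cookie if it is available, otherwise fall back to the member's secondary fabric. This is a sketch under stated assumptions, not the Stickyrouting API: `choose_fabric` is a hypothetical function, and the `secondary_by_member` dictionary is a stand-in for whatever Stickyrouting actually returns.

```python
from typing import Dict, Optional, Set


def choose_fabric(
    member_id: str,
    cookie_fabric: Optional[str],
    online_fabrics: Set[str],
    secondary_by_member: Dict[str, str],
) -> Optional[str]:
    """Pick the fabric that should serve a request.

    cookie_fabric: primary fabric recorded in the member's cookie, if any.
    online_fabrics: fabrics currently accepting traffic.
    secondary_by_member: hypothetical stand-in for a Stickyrouting lookup,
        mapping a member ID to that member's secondary fabric.
    """
    # Prefer the fabric from the cookie when it is accepting traffic.
    if cookie_fabric in online_fabrics:
        return cookie_fabric
    # Otherwise ask for the member's secondary fabric.
    fallback = secondary_by_member.get(member_id)
    if fallback in online_fabrics:
        return fallback
    return None  # no routable fabric known; the caller decides what to do
```

For example, a member whose cookie names DC1 while DC1 is offline would be routed to DC2 if the lookup returns DC2 as the secondary fabric.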

TrafficShift Architecture
[Diagram components: Web application, Salt master, Stickyrouting Service, Couchbase Backend, Worker Processes, FABRIC BUCKETS]

Load Testing
[Diagram: FABRIC DC1, DC2, DC3; Traffic Percentage: 60%]
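
One way to picture a load test built on fabric buckets is as a reassignment plan: move buckets onto the target fabric until it holds the desired share. The sketch below assumes every bucket carries roughly equal traffic, and `plan_load_test` is a hypothetical helper; the slides do not describe how TrafficShift chooses which buckets to move.

```python
import math
from typing import Dict


def plan_load_test(
    assignment: Dict[int, str], target: str, target_pct: float
) -> Dict[int, str]:
    """Return a new bucket -> fabric map with at least target_pct% of buckets on target.

    Assumes each bucket carries about the same traffic (an assumption, not
    something the slides state). The input mapping is left unmodified.
    """
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    wanted = math.ceil(len(assignment) * target_pct / 100)
    plan = dict(assignment)
    current = sum(1 for fabric in plan.values() if fabric == target)
    # Move buckets in numeric order until the target holds enough of them.
    for bucket in sorted(plan):
        if current >= wanted:
            break
        if plan[bucket] != target:
            plan[bucket] = target
            current += 1
    return plan
```

Returning a new mapping rather than mutating the current one keeps the original assignment available for restoring traffic once the test ends.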

Project Outcomes

Benefits of Load-testing
• Capacity Planning
• Identify Bugs
• Confidence

Benefits of Load-testing: CAPACITY PLANNING
• Through this process, we continuously validate our infrastructure capacity
• This is the best signal we can possibly get since we’re simulating a real disaster
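
The capacity argument can be made concrete with a little arithmetic: when sites fail, each surviving site must absorb a larger share of total traffic, and a load test shows whether it can. The sketch below assumes traffic is split evenly across surviving sites, which the slides do not state; both function names are hypothetical.

```python
def share_per_surviving_site(total_sites: int, failed_sites: int) -> float:
    """Percent of total traffic each surviving site carries, assuming an even split."""
    if total_sites <= 0 or not 0 <= failed_sites < total_sites:
        raise ValueError("need total_sites > 0 and 0 <= failed_sites < total_sites")
    return 100.0 / (total_sites - failed_sites)


def has_headroom(tested_pct: float, total_sites: int, failed_sites: int) -> bool:
    """True if a site validated at tested_pct% of traffic covers the failure scenario."""
    return tested_pct >= share_per_surviving_site(total_sites, failed_sites)
```

With four data centers, losing one leaves each survivor with about a third of total traffic, and losing two leaves each with half. Under the even-split assumption, a site that has been load-tested at 60% of total traffic would cover both cases.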

Benefits of Load-testing: IDENTIFY BUGS
• Some bugs are only found at high load (under duress)
• Helps find inefficiencies that otherwise may not be found until it’s too late
• Gives us clues on how to make our code more resilient to potential failure

Benefits of Load-testing: CONFIDENCE
• Through load-testing, we’ve built confidence in our disaster recovery strategy
• We understand exactly:
  • What process to follow
  • How long it takes to avert disaster
  • What risks are associated with a disaster incident

Key Takeaways

Key Takeaways
• Resilience Engineering is a must for LinkedIn
• Design infrastructure to facilitate disaster recovery
• Disaster-test regularly to avoid surprises
• Automate your testing process to reduce engagement time

Q&A