Patterns for Resilient Architecture

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

About availability

Availability          Downtime per year
99% (2-nines)         3 days 15 hours
99.9% (3-nines)       8 hours 45 minutes
99.99% (4-nines)      52 minutes
99.999% (5-nines)     5 minutes
99.9999% (6-nines)    31 seconds

Slide 10

Slide 10 text

System availability

$$\text{Availability} = \frac{\text{Normal Operation Time}}{\text{Total Time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures
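The availability formula can be checked with a few lines of Python. This is a minimal sketch: the function names and the 990-hour / 10-hour figures are made up for illustration, and a year is assumed to be 365 days.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - avail) * hours_per_year


# Hypothetical component: fails on average every 990 hours, takes 10 hours to repair.
a = availability(mtbf_hours=990, mttr_hours=10)
print(f"availability: {a:.2%}")                                        # availability: 99.00%
print(f"downtime per year: {downtime_hours_per_year(a):.1f} hours")    # 87.6 hours
```

87.6 hours is the "3 days 15 hours" row of the 2-nines line in the availability table.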

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Availability in series

Component             Availability        Downtime
X                     99% (2-nines)       3 days 15 hours
Y                     99.99% (4-nines)    52 minutes
X and Y combined      98.99%              3 days 16 hours 33 minutes
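The combined figure comes from multiplying the availabilities of the components a request must pass through. A small sketch, assuming the components fail independently (the function name is illustrative):

```python
from math import prod


def series_availability(*components: float) -> float:
    """A request succeeds only if every component in the chain is up."""
    return prod(components)


combined = series_availability(0.99, 0.9999)
print(f"{combined:.4%}")  # 98.9901%
```

The chain is always less available than its weakest component, which is why adding hard dependencies lowers overall availability.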

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Availability in parallel

Component               Availability          Downtime
X                       99% (2-nines)         3 days 15 hours
Two X in parallel       99.99% (4-nines)      52 minutes
Three X in parallel     99.9999% (6-nines)    31 seconds
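For redundant copies, the system is down only when every copy is down at once. A sketch under the assumption that copies fail independently, which real deployments only approximate (the function name is illustrative):

```python
def parallel_availability(component: float, copies: int) -> float:
    """1 minus the probability that all redundant copies are down together."""
    return 1.0 - (1.0 - component) ** copies


for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6%}")
```

Each extra copy of a 99% component adds roughly two nines, matching the table.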

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

AWS Region and availability zones
[Diagram: one Region containing availability zones a, b, and c; each availability zone is made up of several data centers]

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Combined with disaster recovery
[Diagram: Service 1, Service 2, Service 3, and Service 4 deployed in both US-WEST-2 and US-EAST-1]

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Auto-scaling
[Diagram: within an AWS Region, Service A and Service B each run in an Auto Scaling group spanning Availability zone 1 and Availability zone 2]

Slide 33

Slide 33 text

Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong

Slide 34

Slide 34 text

Slide 35

Slide 35 text

CAP Theorem
[Diagram: a distributed system with three properties]
• Consistency: Data is consistent. All nodes see the same state.
• Availability: Every request is non-failing.
• Partition Tolerance: Service still responds as expected if some nodes crash.

In the presence of a network partition, you must choose between consistency and availability!

Slide 36

Slide 36 text

Embrace eventual consistency

… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
https://en.wikipedia.org/wiki/Eventual_consistency

[Diagram: distributed system]
Availability: Every request is non-failing.

An eventually consistent system can return any value before it converges!

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

[Diagram: API instances, a queue, worker instances, and a cache node, exchanging these messages]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
{JobID: 0001, Result: bar}
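One way to read the messages in this diagram is as an asynchronous job flow: the API enqueues the task and immediately returns a job ID, a worker picks the job up later, and the result is stored where the API can find it. The sketch below is an in-process stand-in, not the deck's implementation; `job_queue`, `result_cache`, and the three functions are hypothetical names, and a real system would use a managed queue and cache.

```python
import queue
import uuid

job_queue = queue.Queue()   # stands in for the Queue in the diagram
result_cache = {}           # stands in for the Cache node


def api_submit(task):
    """API instance: enqueue the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def worker_step(handler):
    """Worker instance: take one job, run it, store the result in the cache."""
    job = job_queue.get()
    result_cache[job["JobID"]] = handler(job["Task"])
    job_queue.task_done()


def api_result(job_id):
    """API instance: return the result if ready, else report it as pending."""
    if job_id in result_cache:
        return {"JobID": job_id, "Result": result_cache[job_id]}
    return {"JobID": job_id, "Status": "pending"}


ticket = api_submit("DO foo")
print(api_result(ticket["JobID"]))   # pending: no worker has run yet
worker_step(lambda task: "bar")      # hypothetical handler that returns "bar"
print(api_result(ticket["JobID"]))   # now carries Result: bar
```

Because the API never waits on the worker, a slow or failed worker delays results but does not block or fail incoming requests.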

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Degrade & prioritize traffic with queues
[Diagram: API instances write to a High Priority Queue and a Low Priority Queue; worker instances consume from both]
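A minimal sketch of the idea, using two in-process queues. The names `submit`, `next_job`, and `max_low_backlog` are assumptions for illustration; the slide does not prescribe a shedding policy, so the backlog limit here is just one possible way to degrade.

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue()


def submit(job, priority="low", max_low_backlog=100):
    """Enqueue a job; shed low priority work once its backlog is too deep."""
    if priority == "high":
        high_priority.put(job)
        return True
    if low_priority.qsize() >= max_low_backlog:
        return False  # degrade: reject instead of queueing more low priority work
    low_priority.put(job)
    return True


def next_job():
    """Worker side: always drain the high priority queue before the low one."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None
```

Under load, workers keep serving high priority traffic while low priority work waits or is rejected, so the system degrades instead of failing outright.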

Slide 42

Slide 42 text

Read / Write Sharding
[Diagram: application instances connect to one DB instance and three DB instance read replicas]
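A sketch of how an application might route traffic in this layout: writes go to the primary DB instance, reads rotate across the replicas. The class name, the endpoint strings, and the "starts with SELECT" test are simplifying assumptions, not part of the deck.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas in round-robin order and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
print(router.endpoint_for("SELECT * FROM users"))       # replica-1
print(router.endpoint_for("INSERT INTO users VALUES (1)"))  # primary
```

Replicas usually lag the primary slightly, so reads routed this way are eventually consistent.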

Slide 43

Slide 43 text

Database Federation
[Diagram: application instances connect to a separate Users DB and Products DB; each is a DB instance with a DB instance read replica]

Slide 44

Slide 44 text

Database Sharding

User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A

[Diagram: application instances route to shards A, B, and C; each shard is a DB instance with a DB instance read replica]
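The lookup table on this slide can be sketched as a directory that maps each user to a shard, and each shard to a database endpoint. The user IDs and shard letters come from the slide; the endpoint hostnames and the function name are hypothetical.

```python
# User -> shard mapping, copied from the slide's table.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical endpoints, one per shard.
SHARD_ENDPOINTS = {
    "A": "db-shard-a.example.internal",
    "B": "db-shard-b.example.internal",
    "C": "db-shard-c.example.internal",
}


def endpoint_for_user(user_id):
    """Resolve which database a user's rows live on."""
    return SHARD_ENDPOINTS[SHARD_DIRECTORY[user_id]]


print(endpoint_for_user("002346"))  # db-shard-b.example.internal
```

With this layout, losing one shard affects only the users mapped to it rather than the whole user base.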

Slide 45

Slide 45 text

Slide 46

Slide 46 text

Slide 47

Slide 47 text

Slide 48

Slide 48 text

Slide 49

Slide 49 text

Slide 50

Slide 50 text

What happens if the DB “slows down”?
[Diagram: Users, App, connection pool, and DB, with several INSERTs in flight]
Timeout client side = ?
Timeout backend side = ?

Slide 51

Slide 51 text

[Diagram: User 1, App, connection pool, and DB]
Timeout client side = 10s
Timeout backend side = Not implemented
INSERT, Retry INSERT, Retry INSERT, Retry
ERROR: Failed to get connection from pool

Slide 52

Slide 52 text

http://docs.python-requests.org/en/master/user/advanced/#timeouts

Slide 53

Slide 53 text

http://docs.python-requests.org/en/master/user/advanced/#timeouts

Slide 54

Slide 54 text

    import requests
    import timeout_decorator

    # Raise StopIteration if the call takes longer than 5 seconds.
    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)

https://pypi.org/project/timeout-decorator/
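The same idea, bounding how long a caller waits on a slow call, can be sketched with only the standard library. This is an alternative technique, not how timeout-decorator works internally; `call_with_timeout` is a hypothetical helper name. Note the limitation in the comments: the worker thread is not cancelled, the caller simply stops waiting for it.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and stop waiting after timeout_s seconds.

    Raises concurrent.futures.TimeoutError when the deadline passes. The
    worker thread keeps running in the background until func returns.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on the worker, so the caller gets control back at the deadline.
        executor.shutdown(wait=False)


try:
    call_with_timeout(time.sleep, 0.1, 1.0)   # a 1 s call with a 0.1 s budget
except concurrent.futures.TimeoutError:
    print("gave up after 0.1 s")
```

Whichever mechanism is used, the point is that every remote call has an upper bound, so a slow dependency cannot hold the caller's resources indefinitely.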

Slide 55

Slide 55 text

Slide 56

Slide 56 text

How else could we have prevented the error?
[Diagram: User 1, connection pool, and DB]
INSERT, Retry INSERT, Retry INSERT, Retry
ERROR: Failed to get connection from pool

Slide 57

Slide 57 text

Backoff
[Diagram: User 1, connection pool, and DB]
Timeout client side = 10s
Timeout backend side = 10s
INSERT
Wait 2s before Retry INSERT
Wait 4s before Retry
Wait 8s before Retry
Wait 16s before Retry
• Backing off between retries
• Releasing connections

Slide 58

Slide 58 text

Simple Exponential Backoff is not enough: Add Jitter
[Charts: No jitter vs. With jitter]
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Slide 59

Slide 59 text

Example: add jitter 0-1000ms

    import random
    import time

    import requests

    def get_item(self, url, n=1):
        MAX_TRIES = 12
        try:
            res = requests.get(url)
        except requests.exceptions.RequestException:
            # Give up once the retry budget is exhausted.
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential backoff (2**n seconds) plus 0-1000 ms of random jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res
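A variant worth comparing is "full jitter", where the whole delay is randomized rather than adding a small random offset to a fixed exponential delay. The sketch below is an iterative, standard-library version; the function name, the default `base_s` and `cap_s` values, and the injectable `sleep` and `rng` parameters are assumptions made so the example is testable. Catching `Exception` is a placeholder: real code should retry only on errors known to be transient.

```python
import random
import time


def retry_with_full_jitter(operation, max_tries=5, base_s=1.0, cap_s=30.0,
                           sleep=time.sleep, rng=random.uniform):
    """Call operation(), retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(max_tries):
        try:
            return operation()
        except Exception:  # placeholder: narrow to transient errors in real code
            if attempt == max_tries - 1:
                raise  # retry budget exhausted, surface the last error
            # Full jitter: sleep a random time in [0, min(cap, base * 2**attempt)].
            sleep(rng(0.0, min(cap_s, base_s * 2 ** attempt)))
```

Randomizing the full delay spreads retries from many clients across the whole window, instead of having them arrive in synchronized waves.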

Slide 60

Slide 60 text

    import backoff

    # Exception is a placeholder (the slide does not name one); narrow it to
    # the error your queue client raises.
    @backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
    def poll_for_message(queue):
        return queue.get()

https://pypi.org/project/backoff/

As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.

Slide 61

Slide 61 text

Slide 62

Slide 62 text

Slide 63

Slide 63 text

Slide 64

Slide 64 text

Slide 65

Slide 65 text

Slide 66

Slide 66 text

Slide 67

Slide 67 text

Slide 68

Slide 68 text

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
[Diagram: Producer, Circuit Breaker, Consumer. Labels as extracted: Connection Monitoring Timeouts Breaking Circuit]
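The two bullets above can be sketched as a small class. This is one common shape of the pattern, not a specific library's API: the class name, the threshold and reset defaults, the injectable `clock`, and the "allow one trial call after the reset timeout" behaviour are all assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of waiting on timeouts, which gives the struggling dependency room to recover.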

Slide 69

Slide 69 text

Slide 70

Slide 70 text

Slide 71

Slide 71 text

Slide 72

Slide 72 text

Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram: User, UI Team, Application Team, DBA Team. Siloed Teams lead to Siloed Applications]

Slide 73

Slide 73 text

Conway’s Law
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.”
http://www.melconway.com/Home/Conways_Law.html
[Diagram: Cross-Functional Teams lead to Services]

Slide 74

Slide 74 text

Slide 75

Slide 75 text

Slide 76

Slide 76 text

Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?

Slide 77

Slide 77 text

Slide 78

Slide 78 text

Slide 79

Slide 79 text

Slide 80

Slide 80 text

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org

Slide 81

Slide 81 text

Slide 82

Slide 82 text

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• “Paul” attack
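Application-level injection, the smallest step on the list above, can be sketched as a decorator that adds latency or raises an error before calling the real function. The decorator name and its parameters are hypothetical; `rng` and `sleep` are injectable only to keep the example deterministic in tests.

```python
import functools
import random
import time


def inject_failure(latency_s=0.0, error_rate=0.0, exception=RuntimeError,
                   rng=random.random, sleep=time.sleep):
    """Wrap a function so calls are delayed and fail with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)                      # simulate a slow dependency
            if rng() < error_rate:
                raise exception("injected failure")   # simulate a failing dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(latency_s=0.3, error_rate=0.1)
def get_recommendations(user_id):
    # Hypothetical downstream call used only for illustration.
    return ["item-1", "item-2"]
```

Starting with a low error rate on a single call path keeps the blast radius small while showing whether timeouts, retries, and fallbacks behave as intended.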

Slide 83

Slide 83 text

Slide 84

Slide 84 text

Slide 85

Slide 85 text

Slide 86

Slide 86 text

Slide 87

Slide 87 text

Slide 88

Slide 88 text

Slide 89

Slide 89 text

Slide 90

Slide 90 text

What is steady state?
• “Normal” behavior of your system
• Business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Slide 91

Slide 91 text

Business metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Slide 92

Slide 92 text

Slide 93

Slide 93 text

What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!

Slide 94

Slide 94 text

Slide 95

Slide 95 text

Slide 96

Slide 96 text

Slide 97

Slide 97 text

Rules of thumb
• Start very small.
• Run as close as possible to production.
• Minimize the blast radius.
• Have an emergency STOP!

Slide 98

Slide 98 text

Slide 99

Slide 99 text

Slide 100

Slide 100 text

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery, partial and full?
• Time to all-clear and stable?

Slide 101

Slide 101 text

Slide 102

Slide 102 text

Slide 103

Slide 103 text

More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there? Are there places to dig deeper that could shine more light on this?
• Did others step in to help, to advise, or to intercede?

Slide 104

Slide 104 text

Rules to remember!
1. Failure requires multiple faults.
2. There is no isolated ‘cause’ of an accident.
3. There are multiple contributors to accidents.

Slide 105

Slide 105 text

Slide 106

Slide 106 text

Slide 107

Slide 107 text

Slide 108

Slide 108 text

Big challenges to chaos engineering
Mostly cultural:
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• It can be very political.
• It might force deep conversations.
• The organization may be deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.