Slide 1

Slide 1 text

Resilience: Cloud providers' promise. YURY NIÑO ROA, SRE at ADL, @yurynino

Slide 2

Slide 2 text

What is a https://www.yurynino.dev/

Slide 3

Slide 3 text

People with good intentions make promises but people with good character keep them.

Slide 4

Slide 4 text

Promises are made to be broken, even before they are made.

Slide 5

Slide 5 text

AGENDA
1. Cloud providers' promises.
2. Well-Architected Framework.
3. Resilience: the promise.
4. Resilience on AWS, Azure & GCP.
5. Keeping promises with Chaos Engineering.

Slide 6

Slide 6 text

Cloud providers' promises.

Slide 7

Slide 7 text

We offer computing, storage, database, content delivery and many other features that help organizations scale, grow and transform. We offer hybrid solutions to help you meet your customers' business needs. Millions of customers, including startups, the largest enterprises, and government agencies, are using the cloud to lower costs, become more agile, and innovate faster.

Slide 8

Slide 8 text

We are a global team helping large corporations achieve business goals using cloud computing technology. We promise:
● Operational Excellence.
● Security.
● Reliability.
● Performance Efficiency.
● Cost Optimization.
We enable you to thrive in the digital services market to ensure your success.

Slide 9

Slide 9 text

If it is real, why do these disasters happen?

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

https://www.yurynino.dev/

Slide 12

Slide 12 text

Netflix, Twitter
The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure!

Slide 13

Slide 13 text

Antipatterns & Patterns!

Slide 14

Slide 14 text

https://www.yurynino.dev/

Slide 15

Slide 15 text

Integration Points: Butterfly, Spiderman

Slide 16

Slide 16 text

Slow Responses

Slide 17

Slide 17 text

Bulkhead, Fail Fast
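The two patterns named on this slide can be combined in one small sketch: a bulkhead caps how many callers may use a dependency at once, and failing fast means rejecting the extra callers immediately instead of queueing them. The names `Bulkhead`, `BulkheadFullError`, and `slot` are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(RuntimeError):
    """Raised when every slot of a compartment is taken (hypothetical name)."""


class Bulkhead:
    """Limits concurrent use of one dependency so it cannot drain shared capacity."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast: a non-blocking acquire rejects the caller at once
        # instead of letting it wait behind a saturated dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            # Always free the slot, even if the guarded call raised.
            self._slots.release()
```

A caller would wrap each request to the dependency in `with bulkhead.slot(): ...` and treat `BulkheadFullError` as a signal to shed load or serve a degraded response.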

Slide 18

Slide 18 text

Circuit Breaker. Taken from Release It!
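As a rough illustration of the Circuit Breaker pattern, the sketch below tracks consecutive failures, opens the circuit after a threshold, and lets one trial call through after a cooldown. The class name, the default threshold and timeout, and the injectable `clock` parameter are assumptions made for this example and are not taken from the slides or the book.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow a trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # opens (or re-opens) the circuit and restarts the cooldown.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Passing the clock in as a parameter is a design choice that keeps the breaker testable without sleeping; production code would normally keep the default `time.monotonic`.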

Slide 19

Slide 19 text

The promise: Well-Architected Framework

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

A fine-grained look: Reliability

Slide 23

Slide 23 text

Reliability: the ability to operate and test a workload through its total lifecycle. According to Google, it is the most important feature!

Slide 24

Slide 24 text

Resilience means that the critical parts of an electrical supply system can mitigate and recover from high-impact threats. Reliability means that the light always comes on when you throw the switch.

Slide 25

Slide 25 text

A resilient system can maintain an acceptable level of service in the face of failure. A resilient system can weather a storm, such as a large-scale natural disaster or a controlled chaos engineering experiment.

Slide 26

Slide 26 text

The promise for reliability

Slide 27

Slide 27 text

Reliability is defined by the user: for user-facing workloads, measure the user experience, for example, the query success ratio or the rows being scanned per time window.
Use sufficient reliability: systems should be reliable enough that users are happy, but not so reliable that the investment is unjustified.
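To make "query success ratio" and "sufficient reliability" concrete, here is a minimal sketch that computes the ratio from request counts and compares it with a target. The function names and the idea of a single numeric `target` are assumptions for illustration; the slides do not prescribe a formula.

```python
def success_ratio(successful: int, total: int) -> float:
    """Fraction of user queries that succeeded in a time window."""
    if total == 0:
        # No traffic means no user saw a failure; treat the window as healthy.
        return 1.0
    return successful / total


def is_sufficiently_reliable(successful: int, total: int, target: float) -> bool:
    """True when the measured ratio meets the target agreed with users."""
    return success_ratio(successful, total) >= target
```

For example, 999 successful queries out of 1000 give a ratio of 0.999, which meets a 0.995 target; whether a stricter target is worth the extra investment is the judgment call the slide describes.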

Slide 28

Slide 28 text

Create redundancy: systems must have no single points of failure, and their resources must be replicated across multiple failure domains.
Include horizontal scalability: ensure that every component of your system can accommodate growth in traffic or data by adding more resources.

Slide 29

Slide 29 text

Include rollback capability: any change an operator makes to a service must have a well-defined method to undo or roll back the change.
Ensure overload tolerance: design services to degrade gracefully under load.
Prevent traffic spikes: too many clients sending traffic at the same instant cause traffic spikes!
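One common way to keep clients from retrying at the same instant is exponential backoff with random jitter. The sketch below is an assumption about how that could look, not something shown on the slide; `backoff_delays` and its parameters are hypothetical names.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized wait time (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the ceiling spreads clients out,
        # so they do not all retry at the same moment.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn between retries; because every client draws different random values, the retries arrive spread over time rather than as a single spike.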

Slide 30

Slide 30 text

Detect failure: there is a tradeoff between alerting too soon and burning out the operations team versus alerting too late and having extended service outages.
Make incremental changes: you should roll out changes gradually, with "canary testing" to detect bugs in the early stages of a rollout where their impact on users is minimal.
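A canary rollout needs some rule for deciding whether to continue. The sketch below compares the canary's error rate with the baseline's; the `Stats` type, the `canary_decision` function, and the `min_requests` and `tolerance` defaults are all illustrative assumptions rather than values from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 100,
    tolerance: float = 0.01,
) -> str:
    """Return "wait", "rollback", or "promote" for the current rollout step."""
    if canary.requests < min_requests:
        # Too little traffic to judge the new version yet.
        return "wait"
    if canary.error_rate > baseline.error_rate + tolerance:
        # The canary is clearly worse than the current version.
        return "rollback"
    return "promote"
```

Keeping the canary's share of traffic small is what limits user impact: a bad release is caught while only a small fraction of requests reach it.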

Slide 31

Slide 31 text

Coordinate emergency response: design operational practices to minimize the duration of outages and formalize response procedures with well-defined roles and communication channels.
Instrument for observability: systems must be instrumented to enable rapid triaging, troubleshooting, and diagnosis of problems to minimize time to mitigate (TTM).

Slide 32

Slide 32 text

Automate emergency responses: in an emergency, people have difficulty performing complex tasks. Therefore, preplan emergency actions, document them, and ideally automate them.
Perform capacity management: forecast traffic and provision resources in advance of peak traffic events.

Slide 33

Slide 33 text

Test failure recovery: if you haven't recently tested your operational procedures to recover from failures, the procedures probably won't work when you need them.
Reduce toil: toil is manual and repetitive work with no enduring value, and it increases as the service grows. Continually aim to reduce toil.

Slide 34

Slide 34 text

The framework is a working framework for managing security, infrastructure, and cloud administration. It serves as a reference for the services included in the ADL Digital Labs portfolio. It provides references, guidelines, policies, best practices, and protocols that are managed centrally. A distributed system in production needs to be resilient in order to be reliable, and this is precisely the target that we software engineers, systems engineers, site reliability engineers, and chaos engineers always aim for!

Slide 35

Slide 35 text

Chaos Engineering

Slide 36

Slide 36 text

What is Chaos Engineering? It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
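The shape of a chaos experiment can be sketched in a few lines: state a steady-state hypothesis, inject a failure, and check whether the hypothesis still holds. Everything here (`inject_failures`, `with_fallback`, `steady_state_holds`, and the simulated `ConnectionError`) is a toy assumption for illustration and does not represent any particular chaos engineering tool.

```python
import random
from typing import Callable


def inject_failures(func: Callable[[], str], failure_rate: float,
                    rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that a fraction of invocations fail (simulated fault)."""
    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func()
    return wrapper


def with_fallback(primary: Callable[[], str],
                  fallback: Callable[[], str]) -> Callable[[], str]:
    """Serve a degraded answer when the primary dependency is unreachable."""
    def wrapper() -> str:
        try:
            return primary()
        except ConnectionError:
            return fallback()
    return wrapper


def steady_state_holds(service: Callable[[], str], requests: int,
                       min_success_ratio: float) -> bool:
    """Hypothesis: at least min_success_ratio of requests succeed."""
    succeeded = 0
    for _ in range(requests):
        try:
            service()
            succeeded += 1
        except Exception:
            pass  # a failed request counts against the hypothesis
    return succeeded / requests >= min_success_ratio
```

Running the check against a dependency wrapped with `inject_failures` shows whether the service has a weakness; running it again with `with_fallback` in place shows whether the mitigation restores the steady state.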

Slide 37

Slide 37 text

2008: Chaos Engineering began at Netflix
2010: Chaos Monkey & Simian Army were launched
2016: Gremlin born
2017: SRE USENIX, Chaos IQ born, ChaosConf
2018: 1 Book, Chaos Monkey for Spring Boot
2019: 1 Book, Chaos massification
2020: 1 Book was published

Slide 38

Slide 38 text

Who is a Chaos Engineer? What my mom thinks I do. What my friends think I do. What software engineers think I do. What I really do.

Slide 39

Slide 39 text

Resilience on AWS, Azure & GCP.

Slide 40

Slide 40 text

https://cloud.google.com/architecture/framework

Slide 41

Slide 41 text

https://cloud.google.com/architecture/framework

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

https://www.wellarchitectedlabs.com/

Slide 46

Slide 46 text

https://www.wellarchitectedlabs.com/

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

https://docs.microsoft.com/en-us/azure/architecture/framework/

Slide 49

Slide 49 text

https://docs.microsoft.com/en-us/azure/architecture/framework/

Slide 50

Slide 50 text

https://docs.microsoft.com/en-us/azure/architecture/framework/

Slide 51

Slide 51 text

Configuration

Slide 52

Slide 52 text

Configuration

Slide 53

Slide 53 text

Configuration

Slide 54

Slide 54 text

Configuration

Slide 55

Slide 55 text

Configuration

Slide 56

Slide 56 text

Configuration

Slide 57

Slide 57 text

Configuration

Slide 58

Slide 58 text

When we make promises, we assume: That we would beat the tides of time. That we would escape change. That our feelings for the person we made the promise to would always be the same. If that were true: Marriages would indeed have been Happily Ever After. Friendships would have lasted forever. There would have been no bankruptcies or defaults. Ever. And this world would be a much better place.

Slide 59

Slide 59 text

https://www.yurynino.dev/ yurynino Thank you!