
Resilience: Cloud Providers Promises

Yury Nino
October 17, 2020

Transcript

  1. People with good intentions make promises but people with good

    character keep them. https://www.yurynino.dev/
  2. AGENDA: 1. Cloud providers' promises. 2. Well-Architected Framework. 3. Resilience: the promise.

    4. Resilience on AWS, Azure & GCP. 5. Keeping promises with Chaos Engineering.
  3. We offer computing, storage, database, content delivery and many other

    features that help organizations scale, grow and transform. We offer hybrid solutions to help you meet your customers' business needs. Millions of customers—including startups, largest enterprises, and government agencies—are using cloud to lower costs, become more agile, and innovate faster. https://www.yurynino.dev/
  4. We are a global team supporting large corporations to achieve

    business goals using cloud computing technology. We promise: • Operational Excellence. • Security. • Reliability. • Performance Efficiency. • Cost Optimization. We enable you to thrive in the digital services market to ensure your success. https://www.yurynino.dev/
  5. Netflix Twitter The infrastructure required by a software system can

    be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure! https://www.yurynino.dev/
  8. Reliability: the ability to operate and test a

    workload through its total lifecycle. According to Google, it is the most important feature! https://www.yurynino.dev/
  9. Resilience means that the critical parts of an electrical supply

    system can mitigate and recover from high-impact threats. Reliability means that the lights always come on when you throw the switch. https://www.yurynino.dev/
  10. A resilient system can maintain an acceptable level of service

    in the face of failure. A resilient system can weather the storm, such as a large-scale natural disaster or a controlled chaos engineering experiment. https://www.yurynino.dev/
  11. Reliability is defined by the user. For user-facing workloads, measure

    the user experience, for example the query success ratio or the rows scanned per time window. Use sufficient reliability: systems should be reliable enough that users are happy, but not so reliable that the investment is unjustified.
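A query success ratio like the one mentioned on this slide can be computed from two counters. The sketch below is only an illustration: the function names, the idea of comparing the ratio against a target, and the 99% target in the usage note are assumptions, not something the deck specifies.

```python
def success_ratio(good_events: int, total_events: int) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if total_events < 0 or good_events < 0 or good_events > total_events:
        raise ValueError("counts must satisfy 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(measured_ratio: float, target_ratio: float) -> float:
    """Share of the allowed failures that is still unspent.

    1.0 means no failures were observed, 0.0 means the target was hit exactly,
    and a negative value means the target was missed.
    The target (hypothetical here) must be below 1.0, otherwise no failure
    at all would be allowed and the budget is undefined.
    """
    if not 0.0 <= target_ratio < 1.0:
        raise ValueError("target_ratio must be in [0.0, 1.0)")
    allowed_failure = 1.0 - target_ratio
    actual_failure = 1.0 - measured_ratio
    return 1.0 - actual_failure / allowed_failure
```

For example, 999 successful queries out of 1000 against an assumed 99% target leave most of the budget unspent, which is one way to read "sufficient reliability": a large unspent budget suggests the target could be met with less investment.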
  12. Create redundancy: Systems must have no single points of

    failure, and their resources must be replicated across multiple failure domains. Include horizontal scalability: Ensure that every component of your system can accommodate growth in traffic or data by adding more resources.
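The effect of replicating a resource across failure domains can be estimated with a simple formula. The sketch below assumes that replicas fail independently and that the service is up as long as at least one replica is up; both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a group of replicas where any single replica suffices.

    Assumes independent failures: the group is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0.0 and 1.0")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - replica_availability) ** replicas
```

Under these assumptions, two replicas that are each available 99% of the time give roughly 99.99% combined availability. Correlated failures (a shared region, a shared deployment) reduce that gain, which is why the slide asks for multiple failure domains.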
  13. Include rollback capability: Any change an operator makes to a

    service must have a well-defined method to undo or roll back the change. Ensure overload tolerance: Design services to degrade gracefully under load. Prevent traffic spikes: Too many clients sending traffic at the same instant causes traffic spikes!
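One common way to keep clients from sending traffic at the same instant is to randomize their retry delays. The sketch below shows exponential backoff with "full jitter"; the deck does not name this technique, so treat it as one possible approach, with `backoff_delays` and its parameters being illustrative names.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    up to `cap`, and the actual delay is drawn uniformly between 0 and
    that bound so that clients retrying together spread out over time.
    """
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, min(cap, base * 2 ** attempt))
        for attempt in range(attempts)
    ]
```

A client would sleep for `delays[i]` before retry `i`. Passing a seeded `random.Random` makes the delays reproducible in tests.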
  14. Detect failure: There is a tradeoff between alerting too soon

    and burning out the operations team versus alerting too late and having extended service outages. Make incremental changes: You should roll out changes gradually, with "canary testing" to detect bugs in the early stages of a rollout, where their impact on users is minimal.
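A canary rollout needs some rule for deciding whether the new version is safe to promote. The sketch below compares the canary's error rate with the baseline's; the function name, the 1% tolerance, and the minimum sample size are all assumptions made for illustration, and real rollouts typically look at more signals than error rate.

```python
def can_promote(
    canary_errors: int,
    canary_requests: int,
    baseline_errors: int,
    baseline_requests: int,
    max_extra_error_rate: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Decide whether a canary looks healthy enough to widen the rollout.

    Returns False when the canary has not served enough traffic to judge,
    or when its error rate exceeds the baseline's by more than the
    allowed margin.
    """
    if canary_requests < min_requests:
        return False  # not enough data yet, keep the canary small
    canary_rate = canary_errors / canary_requests
    # With no baseline traffic, compare against an error rate of zero.
    baseline_rate = (
        baseline_errors / baseline_requests if baseline_requests else 0.0
    )
    return canary_rate - baseline_rate <= max_extra_error_rate
```

Requiring a minimum number of canary requests reflects the alerting tradeoff on the same slide: deciding on too little data is the rollout equivalent of alerting too soon.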
  15. Coordinate emergency response: Design operational practices to minimize the duration

    of outages and formalize response procedures with well-defined roles and communication channels. Instrument for observability: Systems must be instrumented to enable rapid triaging, troubleshooting, and diagnosis of problems to minimize time to mitigate (TTM).
  16. Automate emergency responses: In an emergency, people have difficulty performing

    complex tasks. Therefore, preplan emergency actions, document them, and ideally automate them. Perform capacity management: Forecast traffic and provision resources in advance of peak traffic events.
  17. Test failure recovery: If you haven't recently tested your operational

    procedures to recover from failures, the procedures probably won't work when you need them. Reduce toil: Toil is manual and repetitive work with no enduring value, and it increases as the service grows. Continually aim to reduce toil.
  18. The framework is a structure for managing

    security, infrastructure, and cloud administration. It serves as a reference for the services included in the ADL Digital Labs portfolio. It provides references, guidelines, policies, best practices, and protocols that are managed centrally. A distributed system in production needs to be resilient in order to be reliable, and this is precisely the target that we Software Engineers, Systems Engineers, Site Reliability Engineers and Chaos Engineers always aim for! https://www.yurynino.dev/
  19. What is Chaos Engineering? It is the discipline of experimenting

    with failures in production systems in order to reveal their weaknesses and to build confidence in their resilience capabilities. https://principlesofchaos.org/
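The shape of a chaos experiment (state a hypothesis about normal behavior, inject a failure, measure what happens) can be shown with a small in-process simulation. This is a toy sketch, not a production tool: every name in it is hypothetical, the "dependency" is a stand-in function, and real experiments inject faults into real infrastructure.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the fault injector instead of calling the real dependency."""


def with_faults(
    call: Callable[[], str], failure_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap a dependency so that a fraction of calls fail on purpose."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], str], max_attempts: int) -> bool:
    """Return True if any of the attempts succeeds."""
    for _ in range(max_attempts):
        try:
            call()
            return True
        except InjectedFault:
            continue
    return False


def run_experiment(
    failure_rate: float, max_attempts: int, requests: int = 1000, seed: int = 7
) -> float:
    """Measure the success ratio of a client while faults are injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = with_faults(lambda: "ok", failure_rate, rng)
    successes = sum(
        call_with_retries(dependency, max_attempts) for _ in range(requests)
    )
    return successes / requests
```

Comparing `run_experiment(0.3, max_attempts=1)` with `run_experiment(0.3, max_attempts=3)` shows how a client with retries holds up better under the same injected failure rate, which is the kind of confidence the slide describes.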
  20. 2008: Chaos Engineering began at Netflix; 2010: Chaos Monkey &

    Simian Army were launched; 2016: Gremlin born; 2019: 1 Book, Chaos massification; 2017: SRE USENIX, Chaos IQ born, ChaosConf; 2018: 1 Book, Chaos Monkey for Spring Boot; 2020: 1 Book was published
  21. Who is a Chaos Engineer? What my mom thinks I do. What my friends think

    I do. What software engineers think I do. What I really do.
  22. When we make promises, we assume: That we would beat

    the tides of time. That we would escape change. That our feelings for the person we made the promise to, would always be the same. If that were true: Marriages would indeed have been Happily Ever After. Friendships would have lasted forever. There would have been no bankruptcies or defaults. Ever. And this world would be a much better place. https://www.yurynino.dev/