SRE Heroes and Villains

Yury Nino

March 04, 2021
Transcript

  1. Witold Pilecki

    In the spring of 1940, Pilecki volunteered to sneak into Auschwitz. His mission: he would allow himself to get arrested, and once there, he would organize with other Polish soldiers, coordinate a mutiny, and break out of the prison camp. www.yurynino.com
  2. Witold Pilecki

    Pilecki, the comic book superhero motherfucker, had still somehow set up an espionage operation.
  3. Witold Pilecki

    In 1943, Pilecki realized that his plans were never going to happen! One night, he cut an alarm wire and escaped! Pilecki’s story is the single most heroic thing I’ve ever come across in my life.
  4. Witold Pilecki

    Eventually, realizing they could get no information from him, the Communists decided to make an example of him. In 1948, they held a show trial and charged Pilecki with everything from falsifying documents and violating curfew to engaging in espionage and treason. A month later, he was found guilty and sentenced to death. On the final day of the trial, Pilecki was allowed to speak. He stated that his allegiance had always been to Poland and its people, that he had never harmed or betrayed any Polish citizen, and that he regretted nothing. He concluded his statement with “I have tried to live my life such that in the hour of my death I would feel joy rather than fear.”

    Unlike Pilecki, we who work in DevOps are seen as villains! However, we shouldn’t be superheroes.
  5. Agenda

    • Why are we the villains?
    • SRE foundations
    • We don’t need heroes
    • SRE principles
    • SRE practices
  6. Why are we the Villains?

    Developers package an application with documentation. The QA teams install and test the application. Operations teams are then responsible for deploying and managing the software with little-to-no direct interaction with the development teams.
  7. Bravery is common. Resilience is common. But heroism has a philosophical component to it.

    We are so desperate for a hero today: not because things are necessarily so bad, but because we’ve lost the capacity to take responsibility.
  8. DEVOPS

    DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes.

    https://aws.amazon.com/devops/what-is-devops/
  9. SRE is an implementation of DevOps

    DevOps is a mindset for enhanced collaboration between operations teams and developer teams. SRE is considered a specific technical expertise held by engineers.
  10. SRE History

    • 2003 - 2008: Ben Treynor coined SRE; DevOps is born
    • 2014: First conference about SRE: SRECon
    • 2016-2018: SRE books are released
    • 2019: SRE mass adoption
  11. Heroes are a problem, even if they solve everything!

    • They are expensive
    • They are not sustainable
    • They create dependencies
    • They hide the problems!
    • They live with frustration
  12. SRE Principles
  13. 7 principles

    • Embracing Risk
    • Service Level Objectives
    • Eliminating Toil
    • Monitoring
    • Automation
    • Release Engineering
    • Simplicity
  14. Service Level Objectives

    • Service Level Indicator: a key measurement of the availability of a system.
    • Service Level Objective: a goal we set for how much availability we expect out of a system.
    • Service Level Agreement: a contract that specifies what happens if the system doesn’t meet its SLOs.
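
The difference between an indicator and an objective can be sketched in a few lines of Python. This is only an illustration under assumptions: the `RequestWindow` type, the function names, the request counts, and the 99.9% target are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts observed over one measurement window (hypothetical data)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if window.total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return window.successful / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the goal that the measured SLI is compared against."""
    return sli >= slo_target


# Hypothetical window: 800 failed requests out of one million.
window = RequestWindow(total=1_000_000, successful=999_200)
sli = availability_sli(window)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would sit on top of this check as a contractual consequence (for example a refund) when the objective is missed, which is a business decision rather than code.
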
  15. Monitoring

    Collecting, processing, aggregating, and displaying real-time quantitative data about a system: query counts and types, error counts, and processing times.

    A Google SRE team with 10–12 members typically has one member whose primary assignment is to build and maintain monitoring systems for their service.
  16. Embracing the Risk

    You might expect to build 100% reliable services, ones that never fail. However, past a certain point, increasing reliability is worse for a service rather than better! Extreme reliability comes at a cost! Embrace the Risk!

    I don’t know, I like the adrenaline
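
A little arithmetic shows why extreme reliability comes at a cost: every extra nine shrinks the downtime a service is allowed. The sketch below assumes a 30-day period and a hypothetical helper name, `allowed_downtime_minutes`; neither appears on the slide.

```python
def allowed_downtime_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of full downtime that an availability target leaves room for."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    period_minutes = period_days * 24 * 60
    return (1.0 - slo_target) * period_minutes


# Compare how much room each additional nine leaves over 30 days.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability -> {minutes:.2f} minutes of downtime")
```

Going from 99% to 99.99% cuts the allowance from hours to a few minutes per month, so each step typically demands more redundancy, slower releases, or both.
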
  17. Automation

    Automate Yourself Out of a Job: Automate ALL the Things!

    • User account creation.
    • Software or hardware installation preparation and decommissioning.
    • Rollouts of new software versions.
    • Runtime configuration changes.

    Automate the Resilience!
  18. Release Engineering

    Continuous Build and Deployment!

    Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers.

    4 principles: Self-Service Model, High Velocity, Hermetic Builds, Enforcement of Policies and Procedures.
  19. SRE PRACTICES
  20. 18 PRACTICES

    • Effective Troubleshooting
    • Practical Alerting
    • Emergency Response
    • Being On-Call
    • Postmortems
    • Managing Incidents
    • Tracking Outages
    • Testing for Reliability
  21. 18 PRACTICES

    • Handling Overload
    • Software Engineering
    • Cascading Failures
    • Load Balancing
    • Distributed Consensus
    • Data Pipelines
    • Data Integrity
    • Distributed Scheduling
  22. Practical Alerting

    • Multi-User Alerting: Notify multiple responders at once to orchestrate a real-time, cross-functional response.
    • Alert Noise Reduction: Group related alerts into a single incident, minimizing alert fatigue while centralizing critical context to accelerate triage.
    • Enriched Incident Context: Include graphs, images, runbook links, or links to conference calls directly in the incident details.
    • Multiple Alert Types: Send automated notifications via SMS message, mobile app push notification, phone call, or email.

    Everything is fine!
  23. Practical Alerting

    • Rich HTML Email Notifications: See critical details, monitoring graphs, images, and more directly in email notifications, enabling your team to shave time off the response workflow.
    • Dynamic Notifications: Customize notification channels and behavior based on event payloads, service, or time of day.
    • Incident History Audit: Keep an audit trail of all notifications and status updates directly in the incident, including confirmation of notification delivery to devices.

    Everything is fine!
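
The alert noise reduction idea from the Practical Alerting slides, grouping related alerts into a single incident, can be sketched as follows. This is a minimal illustration, not how any particular alerting product works: the `Alert` and `Incident` types, the rule "same service within five minutes", and the sample alerts are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    message: str
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window_seconds=300.0):
    """Fold alerts for the same service into one incident when each alert
    arrives within window_seconds of the previous alert in that incident."""
    incidents = []
    latest = {}  # service name -> most recent Incident for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if (current is not None
                and alert.timestamp - current.alerts[-1].timestamp <= window_seconds):
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest[alert.service] = current
    return incidents


# Hypothetical alerts: two related checkout alerts, one search alert,
# and a much later checkout alert that starts a new incident.
sample = [
    Alert("checkout", "5xx spike", 0.0),
    Alert("checkout", "latency high", 60.0),
    Alert("search", "disk full", 90.0),
    Alert("checkout", "5xx spike", 1000.0),
]
for incident in group_alerts(sample):
    print(incident.service, [a.message for a in incident.alerts])
```

Responders are then paged once per incident rather than once per alert, which is the point of the noise reduction bullet.
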
  24. Being On-Call

    • When on-call, an engineer is available to perform operations on production within 5 minutes for user-facing or otherwise highly time-critical services, and within 30 minutes for less time-sensitive systems.
    • The company provides the page-receiving device, which is typically a phone.
    • Google has flexible alert delivery systems that can dispatch pages via multiple mechanisms (email, SMS, robot call, app) across multiple devices.
  25. Effective Troubleshooting

    “Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work.” (Brian Redman)

    “Ways in which things go right are special cases of the ways in which things go wrong.” (John Allspaw)
  26. Emergency Response

    • An incident response process is an outline of the steps that need to be executed in the event of an IT incident.
    • An incident response process is something you hope to never need.

    Response:
    • ASSESS: Collaborate with subject matter experts and work with your incident commander.
    • RESOLVE: Once a plan of attack has been formulated, incident resolution begins.
    • LEARN: Incident post-mortems are a great way for teams to continuously learn.
  27. Incident Management

    Process Overview & Challenges

    Preparation
    • Training and support, general knowledge management
    • Needs constant iteration

    Detection & Alerting
    • Monitoring systems send alerts for issues which need to be reviewed and escalated
    • Alerting systems generate noise which is lost in email, unclear where to escalate

    Containment
    • Data must be reviewed and the current situation assessed
    • Damage has to be contained quickly and efficiently to minimize the impact.

    Remediation
    • Resolve the issues
    • Multiple tools, people and processes need to be coordinated for the implementation of long-term solutions for incidents.

    Analysis & Learning
    • Understand what happened and build new action plans and training
    • Siloed apps for reporting and analysis make it hard to summarize incidents, delaying post mortems
  28. Postmortem

    What went wrong, and how do we learn from it?

    A postmortem is an artifact with a detailed description of exactly what went wrong in an incident. It is a written record of the incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident from recurring. www.yurynino.dev
  29. 18 PRACTICES

    • Handling Overload
    • Software Engineering
    • Cascading Failures
    • Load Balancing
    • Distributed Consensus
    • Data Pipelines
    • Data Integrity
    • Distributed Scheduling
  30. Cascading Failures

    A cascading failure occurs when cracks jump from one system to another, for example when a failing dependency leaves its callers’ threads blocked forever.

    Defend with:
    • Defensive programming.
    • Circuit breakers.
    • Timeouts.
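
One of the defenses listed for cascading failures, the circuit breaker, can be sketched in Python. This is a simplified illustration under stated assumptions: the class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, the breaker is not thread-safe, and the wrapped function is expected to enforce its own timeout so that a hung dependency surfaces as an exception.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up a thread on a broken dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_dependency():
    """Hypothetical downstream call that always times out."""
    raise TimeoutError("dependency timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for attempt in range(3):
    try:
        breaker.call(flaky_dependency)
    except (TimeoutError, CircuitOpenError) as exc:
        print(f"attempt {attempt}: {type(exc).__name__}")
```

After the threshold is reached, callers get an immediate `CircuitOpenError` rather than waiting on the dependency, which keeps the failure from spreading upstream.
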
  31. Thank you very much! @yurynino www.yurynino.com