SRE Heroes and Villains

@yurynino www.yurynino.com

Witold Pilecki
In the spring of 1940, Pilecki volunteered to sneak into Auschwitz. His mission: he would allow himself to get arrested, and once there, he would organize with other Polish soldiers, coordinate a mutiny, and break out of the prison camp.

Witold Pilecki
Pilecki, the comic book superhero motherfucker, had still somehow set up an espionage operation.

Witold Pilecki
In 1943, Pilecki realized that his plans were never going to happen! One night, he cut an alarm wire and escaped! Pilecki’s story is the single most heroic thing I’ve ever come across in my life.

Eventually, realizing they could get no information from him, the Communists decided to make an example of him. In 1948, they held a show trial and charged Pilecki with everything from falsifying documents and violating curfew to engaging in espionage and treason. A month later, he was found guilty and sentenced to death. On the final day of the trial, Pilecki was allowed to speak. He stated that his allegiance had always been to Poland and its people, that he had never harmed or betrayed any Polish citizen, and that he regretted nothing. He concluded his statement with “I have tried to live my life such that in the hour of my death I would feel joy rather than fear.”

Like Pilecki, we who work in DevOps are seen as villains! However, we shouldn’t be superheroes.

Agenda
• Why are we the villains?
• SRE foundations
• We don’t need heroes
• SRE principles
• SRE practices

Why are we the Villains?
Developers package an application with documentation. The QA teams install and test the application. Operations teams are then responsible for deploying and managing the software with little-to-no direct interaction with the development teams.

Bravery is common. Resilience is common. But heroism has a philosophical component to it. We are so desperate for a hero today: not because things are necessarily so bad, but because we’ve lost the capacity to take responsibility.

DevOps
DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes.
Source: https://aws.amazon.com/devops/what-is-devops/

DevOps is a mindset for enhanced collaboration between operations teams and developer teams. SRE is considered specific technical expertise that is held by engineers. SRE is an implementation of DevOps.

SRE History
2003: Ben Treynor coined SRE
2008: DevOps is born
2014: First conference about SRE: SREcon
2016-2018: SRE books are released
2019: SRE mass adoption


It looks like we need heroes!

Heroes are a problem, even if they solve everything!
• They are expensive
• They are not sustainable
• They create dependencies
• They hide the problems!
• They live with frustration

SRE Principles

7 principles
• Embracing Risk
• Service Level Objectives
• Eliminating Toil
• Monitoring
• Automation
• Release Engineering
• Simplicity

Service Level Objectives
• Service Level Indicator (SLI): a key measurement of the availability of a system.
• Service Level Objective (SLO): a goal we set for how much availability we expect out of a system.
• Service Level Agreement (SLA): a contract that states what happens if the system doesn’t meet its SLOs.
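How an SLI, an SLO, and an SLA relate to each other can be sketched in a few lines. This is only an illustration: the names `availability_sli`, `meets_slo`, `sla_credit_percent`, and the `ServiceLevelAgreement` class are hypothetical, and treating a window with no traffic as fully available is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Hypothetical SLA: a contractual threshold plus a consequence."""
    threshold: float             # e.g. 0.995 availability promised to customers
    service_credit_percent: int  # illustrative penalty if the threshold is missed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the internal goal the measured SLI is compared against."""
    return sli >= slo_target


def sla_credit_percent(sli: float, sla: ServiceLevelAgreement) -> int:
    """SLA: what the contract says happens when the threshold is missed."""
    return sla.service_credit_percent if sli < sla.threshold else 0
```

The SLO target would normally be stricter than the SLA threshold, so the team notices a problem before the contract is breached.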

Service Level Indicators

Service Level Objectives

My Recommendation
Observability with Anthos: https://www.qwiklabs.com/focuses/13393?parent=catalog

Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system: query counts and types, error counts, and processing times. A Google SRE team with 10–12 members typically has one member whose primary assignment is to build and maintain monitoring systems for their service.
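A minimal sketch of the "collect, process, aggregate" part of monitoring is shown here. The `RequestSample` record and the `summarize` function are hypothetical names, and counting only HTTP 5xx responses as errors is an assumption made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RequestSample:
    """Hypothetical record of one handled request."""
    endpoint: str
    status: int        # HTTP status code
    latency_ms: float  # processing time in milliseconds


def summarize(samples):
    """Aggregate raw samples into counts by type, error count, and latency."""
    latencies = [s.latency_ms for s in samples]
    return {
        "request_count": len(samples),
        "counts_by_endpoint": dict(Counter(s.endpoint for s in samples)),
        # Assumption: only server-side (5xx) responses count as errors.
        "error_count": sum(1 for s in samples if s.status >= 500),
        "median_latency_ms": median(latencies) if latencies else None,
        "max_latency_ms": max(latencies) if latencies else None,
    }
```

A real monitoring system would also store these aggregates over time and display them; the sketch covers only the aggregation step.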

Monitoring

Observability & Telemetry

My Recommendation
• New Relic: https://learn.newrelic.com/
• Datadog: https://learn.datadoghq.com/login/index.php

Embracing the Risk
You might expect to build 100% reliable services, ones that never fail. However, past a certain point, increasing reliability is worse for a service rather than better! Extreme reliability comes at a cost! Embrace the Risk!
I don’t know, I like the adrenaline.
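One way to make the cost of extreme reliability concrete is to convert an availability target into the downtime it allows. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return how many minutes of downtime a target allows within the window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    # The unavailable fraction (1 - target) is the budget for failure.
    return (1.0 - availability_target) * window_minutes
```

Over 30 days, a 99.9% target allows about 43.2 minutes of downtime, a 99.99% target about 4.3 minutes, and a 100% target allows none, which leaves no room for releases, experiments, or failure of any kind.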

Automation
Automate Yourself Out of a Job: Automate ALL the Things!
• User account creation.
• Software or hardware installation preparation and decommissioning.
• Rollouts of new software versions.
• Runtime configuration changes.
Automate the Resilience!

My Recommendation
DevOps on AWS: https://www.qwiklabs.com/focuses/16240?catalog_rank=%7B%22rank%22%3A3%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&parent=catalog&search_id=9148791

Release Engineering
Continuous Build and Deployment! Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers.
4 principles:
• Self-Service Model
• High Velocity
• Hermetic Builds
• Enforcement of Policies and Procedures

Release Engineering

My Recommendation
GitOps:
• https://www.gitops.tech/
• https://cloud.google.com/kubernetes-engine/docs/tutorials/gitops-cloud-build
• https://www.gitops.tech/tutorial.html

SRE Practices

18 PRACTICES
• Effective Troubleshooting
• Practical Alerting
• Emergency Response
• Being On-Call
• Postmortems
• Managing Incidents
• Tracking Outages
• Testing for Reliability

18 PRACTICES
• Handling Overload
• Software Engineering
• Cascading Failures
• Load Balancing
• Distributed Consensus
• Data Pipelines
• Data Integrity
• Distributed Scheduling

Practical Alerting
• Multi-User Alerting: Notify multiple responders at once to orchestrate a real-time, cross-functional response.
• Alert Noise Reduction: Group related alerts into a single incident, minimizing alert fatigue while centralizing critical context to accelerate triage.
• Enriched Incident Context: Include graphs, images, runbook links, or links to conference calls directly in the incident details.
• Multiple Alert Types: Send automated notifications via SMS message, mobile app push notification, phone call, or email.
Everything is fine!
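Alert noise reduction, grouping related alerts into a single incident, can be sketched as follows. This is not how PagerDuty implements it: the `Alert` and `Incident` classes, the `group_alerts` function, and the rule "same service, at most five minutes after the previous alert" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert emitted by a monitoring system."""
    service: str
    summary: str
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    """Hypothetical incident that collects related alerts."""
    service: str
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window_seconds: float = 300.0):
    """Group alerts per service when they arrive close together in time."""
    incidents = []
    latest_by_service = {}  # service name -> most recent incident for it
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_service.get(alert.service)
        if current is not None and (
            alert.timestamp - current.alerts[-1].timestamp <= window_seconds
        ):
            # Close enough to the previous alert: same incident, no new page.
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest_by_service[alert.service] = current
    return incidents
```

With this rule, a burst of alerts from one failing service pages the responder once instead of once per alert, which is the point of reducing alert fatigue.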

Practical Alerting
• Rich HTML Email Notifications: See critical details, monitoring graphs, images, and more directly in email notifications, enabling your team to shave time off the response workflow.
• Dynamic Notifications: Customize notification channels and behavior based on event payloads, service, or time of day.
• Incident History Audit: Keep an audit trail of all notifications and status updates directly in the incident, including confirmation of notification delivery to devices.
Everything is fine!

My Recommendation
Alerting in PagerDuty: https://support.pagerduty.com/docs/alerts

Being On-Call

Being On-Call
• When on-call, an engineer is available to perform operations on production systems within minutes: typically 5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.
• The company provides the page-receiving device, which is typically a phone.
• Google has flexible alert delivery systems that can dispatch pages via multiple mechanisms (email, SMS, robot call, app) across multiple devices.

Avoid Operational Overload
My Recommendation
Alerting in PagerDuty: https://support.pagerduty.com/docs/alerts

Effective Troubleshooting
“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.” Brian Redman
“Ways in which things go right are special cases of the ways in which things go wrong.” John Allspaw

Effective Troubleshooting

Emergency Response

Emergency Response
• An incident response plan is an outline of processes that need to be executed in the event of an IT incident.
• An incident response process is something you hope to never need.
ASSESS: Collaborate with subject matter experts and work with your incident commander.
RESOLVE: Once a plan of attack has been formulated, incident resolution begins.
LEARN: Incident post-mortems are a great way for teams to continuously learn.

Incident Management
Process Overview & Challenges
Preparation
• Training and support, general knowledge management
• Needs constant iteration
Detection & Alerting
• Monitoring systems send alerts for issues which need to be reviewed and escalated
• Alerting systems generate noise which is lost in email, unclear where to escalate
Containment
• Data must be reviewed and the current situation assessed
• Damage has to be contained quickly and efficiently to minimize the impact
Remediation
• Resolve the issues
• Multiple tools, people and processes need to be coordinated for the implementation of long-term solutions for incidents
Analysis & Learning
• Understand what happened and build new action plans and training
• Siloed apps for reporting and analysis make it hard to summarize incidents, delaying post mortems

Postmortem
What went wrong, and how do we learn from it? A postmortem is an artifact with a detailed description of exactly what went wrong in an incident. It is a written record of the incident, its impact, the actions taken to mitigate it, the root cause, and the follow-up actions to prevent the incident from recurring.
www.yurynino.dev

Postmortems

Testing Reliability

18 PRACTICES
• Handling Overload
• Software Engineering
• Cascading Failures
• Load Balancing
• Distributed Consensus
• Data Pipelines
• Data Integrity
• Distributed Scheduling

Software Engineering
Lifecycle stages: Plan, Requirements, Design, Code, Deploy, Maintain
Highlighted stages: Deploy, Maintain

Load Balancing

Handling Overload: Throttling
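Throttling can be implemented in several ways; a token bucket is one common choice and is used here only as an illustration, since the slide does not name an algorithm. The `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Illustrative throttle: admit a request only if a token is available."""

    def __init__(self, rate_per_second: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock           # injectable so the bucket can be tested
        self._tokens = float(capacity)
        self._last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should reject or delay the request
```

A server would call `allow()` once per incoming request and answer with an overload error when it returns `False`, so excess load is shed early instead of queuing up.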

Handling Overload: Priority Queue

Handling Overload: Circuit Breaker

Cascading Failures
A cascading failure occurs when cracks jump from one system to another, for example when threads end up blocked forever waiting on a failed dependency. Defend with:
• Defensive programming.
• Circuit breakers.
• Timeouts.
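Of the defenses listed, the circuit breaker is the easiest to show in a few lines. The sketch below is a simplified, single-threaded version: the `CircuitBreaker` and `CircuitOpenError` names, the default threshold, and the reset timeout are assumptions, and the wrapped call is expected to enforce its own timeout.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker: fail fast after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a trial call is allowed
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # one trial call may go through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of blocking on the failed dependency, which keeps the failure from spreading to the calling system.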

Thank you very much! @yurynino
