10-SRE, DevOps, Google and You by Björn Rabenstein

SRE? DevOps? Google and you? DevOps Gathering, Bochum – 2019-03-13
Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.

What is SRE? How does it relate to DevOps? But you are not Google, so how does it apply to you?

FAILING AT SRE BEFORE IT WAS COOL

Also episode #260 of the Changelog: https://changelog.com/podcast/260

Scale and culture, side by side:
• Scale: ~1B users | ~100M users
• Scale: ~10k engineers | ~100 engineers
• Scale: engineers > services | engineers < services
• Culture: it’s complicated | it’s complicated

“SRE is what happens when you ask a software engineer to design an operations team.”
Ben Treynor, Google Inc.

Everything you always wanted to know about SRE but were afraid to ask
• Symptom-based alerting
• Operational underload
• DevOps vs. SRE
• Dickerson’s hierarchy of service reliability
• Error budgets
• On-call
• ProdEng at SoundCloud
• Less than 50% ops work
• Postmortems

https://github.com/beorn7/talks

Bonus slides

Less than 50% ops work.

“Solve production problems with software. It’s all just software.”
[Diagram labels: traffic * complexity; operational load (e.g. pages)]

Error budgets
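The slide only names the concept. As a minimal sketch, assuming the common definition from the SRE book (error budget = 1 minus the SLO), the arithmetic could look like the following. The function names and the request-based accounting are illustrative assumptions, not something taken from the talk.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests allowed to fail in the window: (1 - SLO) * total requests."""
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Budget left after the failures observed so far (negative means overspent)."""
    return error_budget(slo, total_requests) - failed_requests


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A negative remainder is the usual signal to slow down releases until reliability recovers.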

Dickerson’s hierarchy of service reliability
Source: Site Reliability Engineering – How Google Runs Production Systems, B. Beyer et al. (ed.), O’Reilly 2016, p104

DevOps vs. SRE

“True” DevOps is if there are no separate dev and ops teams anymore, and not even designated dev or ops roles within a team.
Björn Rabenstein & Matthias Rampke, SoundCloud Ltd.

DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality.
Len Bass, Ingo Weber, Liming Zhu: DevOps: A Software Architect's Perspective.

DevOps is if you have CI/CD and run containers.
J. Random Manager: During the last strategy meeting.

CALMS (John Willis, Damon Edwards, Jez Humble)
• Culture
• Automation
• Lean (management or continuous improvement)
• Metrics
• Sharing

[The first SRE book] explicitly references Culture, Automation, Metrics, and Sharing alongside anecdotes about Google’s journey to continuously improve.
Andrew Clay Shafer: The Site Reliability Workbook.

I cringe when I hear someone say “SRE versus DevOps.”
Andrew Clay Shafer: Foreword II.

Ultimately, I know DevOps when I see it and I see SRE at Google, in theory and practice, as one of the most advanced implementations.
Andrew Clay Shafer: Foreword II.

The principles from the first SRE book align so well with what I always imagined DevOps to be, and the practices are insightful, even when they aren’t 100% applicable outside of Google.
Andrew Clay Shafer: Foreword II.

On-call

• Minimal size of an on-call rotation:
  ◦ 6 if following the sun.
  ◦ 8 otherwise.
• Minimal size of a dedicated SRE team: 8.
• Feasible percentage of all engineers in SRE: 5%? 10%?
• Number of SRE teams SoundCloud could afford: 1.
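The arithmetic behind the last bullet can be sketched as follows. This is a rough illustration, assuming the "~100 engineers" figure from the scale slide refers to SoundCloud; the function name and the integer-percent parameter are hypothetical.

```python
def affordable_sre_teams(total_engineers: int, sre_percent: int, min_team_size: int = 8) -> int:
    """Count how many full SRE teams fit into the feasible SRE headcount."""
    # Integer arithmetic avoids floating-point rounding surprises.
    sre_headcount = total_engineers * sre_percent // 100
    return sre_headcount // min_team_size


if __name__ == "__main__":
    # Assumed inputs: ~100 engineers, 5% or 10% of them in SRE, teams of at least 8.
    for percent in (5, 10):
        print(percent, affordable_sre_teams(100, percent))
```

Under these assumptions, 5% yields too few people for even one team of 8, while 10% yields exactly one, which matches the slide's conclusion.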

DevOps is the result of 5 principles:
DevOps is ...
• … if every person uses the same tool for the same job
• … codified knowledge - everybody contributing their part to common automation
• … if all people have the same privileges in their tooling
• … if human error is equally possible for Dev and Ops
• … replacing people interfaces by automated decisions and processes
• … the result
Schlomo Schapiro

Take your own pager! It’s still “SRE in spirit”…

In Site Reliability Engineering, we did not make it sufficiently clear that product development teams in Google own their service by default. SRE is neither available nor warranted for the bulk of services, although SRE principles still inform how services are managed throughout Google.
Chapter 1: How SRE relates to DevOps

Production Engineering (ProdEng)

I have been waiting for this book ever since I left Google’s enchanted castle. It is the gospel I am preaching to my peers at work.
Beorn’s praise for the 1st SRE book

Finally, this volume and its predecessor are not intended to be gospel. Please don’t treat them that way. Even after all these years, we’re still finding conditions and cases that cause us to tweak (or in some cases, replace) previously firmly held beliefs.
Preface of the Site Reliability Workbook
