Slide 1

Slide 1 text

The Two Sides of Google Infrastructure for Everyone Else
Gareth Rushgrove, Puppet

Slide 2

Slide 2 text

@garethr

Slide 3

Slide 3 text

Gareth Rushgrove

Slide 4

Slide 4 text

Introduction: A strange format for a talk

Slide 5

Slide 5 text

This is a debate

Slide 6

Slide 6 text

I’ll be debating both sides

Slide 7

Slide 7 text

Taking opposing viewpoints on the same issue, as a way of exploring it in depth

Slide 8

Slide 8 text

The talk is split into two parts: a For part and an Against part

Slide 9

Slide 9 text

I’d like to explore:
- Technical practice evolution
- How we adopt software
- The organisational context

Slide 10

Slide 10 text

This house believes…

Slide 11

Slide 11 text

Successful companies will look like Google in the future, so we should adopt Google-like software and practices today

Slide 12

Slide 12 text

Important disclaimer: I’ve never worked for Google

Slide 13

Slide 13 text

For

Slide 14

Slide 14 text

You’re probably:
1. Struggling with distributed systems
2. Missing out on machine learning
3. Wondering how to scale operations

Slide 15

Slide 15 text

Google have a 10+ year head start

Slide 16

Slide 16 text

Google publish research that influences our industry

Slide 17

Slide 17 text

MapReduce

Slide 18

Slide 18 text

Chubby

Slide 19

Slide 19 text

Borg

Slide 20

Slide 20 text

Google releases (and inspires) software we use

Slide 21

Slide 21 text


Slide 22

Slide 22 text

Go

Slide 23

Slide 23 text


Slide 24

Slide 24 text

- GFS = HDFS
- BigTable = HBase
- Protocol Buffers = Thrift or Avro (serialization)
- Stubby = Thrift or Avro (RPC)
- ColumnIO = Parquet
- Dremel = Impala
- Omega = Mesos
- Blaze = Pants or Buck
- FlumeJava = Crunch
- Logsaver = Scribe or Flume
- Millwheel = Storm or Samza?
- Borgmon/Monarch = Graphite
- Dapper = Zipkin

2014, from @avibryant, @joshwills, @skamille, @marius, @wickman

Slide 25

Slide 25 text

We have a term for this: #GIFEE

Slide 26

Slide 26 text

Google Infrastructure for Everyone Else

Slide 27

Slide 27 text

Distributed systems are hard

Slide 28

Slide 28 text

Building your own in-house framework is likely a waste of time

Slide 29

Slide 29 text

From Adrian Colyer, Accel,

Slide 30

Slide 30 text

Kubernetes is the 3rd generation of Google’s cluster management software

Slide 31

Slide 31 text

The Kubernetes API provides primitives that make doing the right thing easier

Slide 32

Slide 32 text

- Orchestration
- Logging
- Configuration
- Self-healing
- Storage
- Load balancing
- Service discovery
- Scaling
- Batch workloads
- Lots more

Slide 33

Slide 33 text

Exposed via a modern API

Slide 34

Slide 34 text

Machine learning is going to be massive

Slide 35

Slide 35 text

“Soon We Won’t Program Computers. We’ll Train Them Like Dogs”

Slide 36

Slide 36 text

TensorFlow is an open source software library for numerical computation

Slide 37

Slide 37 text

…

Slide 38

Slide 38 text

- Nearest neighbour
- Linear regression
- Recurrent neural networks
- Multilayer perceptron
- Lots more

Slide 39

Slide 39 text

Introductory ML docs

Slide 40

Slide 40 text

“How do I do devops?”
Everyone ever

Slide 41

Slide 41 text

Google explain how they work too

Slide 42

Slide 42 text


Slide 43

Slide 43 text

“SRE: Have software engineers do operations”
Dan Luu, ex Google

Slide 44

Slide 44 text

Dev SRE Ops From by Matthew Skelton

Slide 45

Slide 45 text

The familiar:
- Capacity planning
- Performance
- Change management
- Monitoring

Slide 46

Slide 46 text

The unfamiliar:
- Error budget
- Strong software engineering skills
- 50% operations work cap

Slide 47

Slide 47 text

A growing ecosystem

Slide 48

Slide 48 text

Friendly vendors

Slide 49

Slide 49 text

More friendly vendors

Slide 50

Slide 50 text

Even more nice vendors

Slide 51

Slide 51 text

Summing up: For

Slide 52

Slide 52 text

“infrastructure” is shifting to a higher level of abstraction

Slide 53

Slide 53 text

It’s fine to just be a consumer

Slide 54

Slide 54 text

You should be standing on the shoulders of giants

Slide 55

Slide 55 text

You should be standing on the shoulders of Google

Slide 56

Slide 56 text

Against

Slide 57

Slide 57 text

Your organisation doesn’t look like Google

Slide 58

Slide 58 text


Slide 59

Slide 59 text

Could your organisation look like Google?

Slide 60

Slide 60 text

How many employees do you have? Google have about 60,000

Slide 61

Slide 61 text

What proportion of your organisation are software engineers or operations?

Slide 62

Slide 62 text

50 percent? Based on the Google annual report, December 2014

Slide 63

Slide 63 text

How much do you pay software engineers?

Slide 64

Slide 64 text

Data from Glassdoor, June 2016, based on 14k salaries

Slide 65

Slide 65 text

The $3 million engineer?

Slide 66

Slide 66 text


Slide 67

Slide 67 text

Build your own chips?

Slide 68

Slide 68 text

Could your organisation really look like Google?

Slide 69

Slide 69 text

“So much of the information in the SRE book makes PERFECT sense if you’re Google”
John Vincent, Ops Hero

Slide 70

Slide 70 text

The reality outside Google

Slide 71

Slide 71 text

<1% of US workers are software engineers or programmers
US Bureau of Labor Statistics, 2002: 1,069,000 jobs in a working-age population of 185 million
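The figure on this slide can be checked with one line of arithmetic. The sketch below uses only the two numbers quoted on the slide; the variable names are illustrative.

```python
# Numbers as quoted on the slide (US Bureau of Labor Statistics, 2002).
programmer_jobs = 1_069_000
working_age_population = 185_000_000

# Share of the working-age population in software engineering or programming jobs.
share = programmer_jobs / working_age_population

print(f"{share:.2%}")  # 0.58%
```

That is a little under 0.6%, which is consistent with the "<1%" headline.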

Slide 72

Slide 72 text

Strategic vendor relationships

Slide 73

Slide 73 text

Different application constraints as well as different organisational constraints

Slide 74

Slide 74 text

“Goal of SRE team isn’t zero outages – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity”
Dan Luu, ex Google
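To make "spending the error budget" concrete, here is a minimal sketch of how such a budget might be calculated from an availability target. The function names, the 30-day window and the release rule are assumptions for illustration, not something taken from the talk or from Google's tooling.

```python
# Illustrative sketch only: the names and the 30-day window are assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def can_release(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True while some of the budget for the window is still unspent."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(can_release(0.999, downtime_minutes=10))
    # A 100% target leaves no budget to spend at all.
    print(can_release(1.0, downtime_minutes=0))
```

Under these assumptions a team with a 99.9% target and ten minutes of downtime so far can keep shipping, while a team whose target is effectively 100% never has any budget to spend on feature velocity.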

Slide 75

Slide 75 text

What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages

Slide 76

Slide 76 text

John Vincent SRE review

Slide 77

Slide 77 text

“bringing a software engineering perspective to a problem isn’t always the best or right solution”
John Vincent, Ops Hero

Slide 78

Slide 78 text

Many of Google’s conclusions to operations problems are not unique

Slide 79

Slide 79 text


Slide 80

Slide 80 text


Slide 81

Slide 81 text

“Innovation happens elsewhere” applies as much to Google as to other organisations

Slide 82

Slide 82 text

Summing up: Against

Slide 83

Slide 83 text

“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow”
Carla Geisser, Google SRE

Slide 84

Slide 84 text

What is normal for Google may not be suitable for your organisation

Slide 85

Slide 85 text

“Your startup with a single-purpose application does not have the luxury of having your operations team say ‘I’m sorry you’re over your error budget’”
John Vincent, Ops Hero

Slide 86

Slide 86 text


Slide 87

Slide 87 text

Conclusions: If all you take away is…

Slide 88

Slide 88 text

Who votes… For

Slide 89

Slide 89 text

Who votes… Against

Slide 90

Slide 90 text

Who thinks it’s the wrong question?

Slide 91

Slide 91 text

Context is king

Slide 92

Slide 92 text


Slide 93

Slide 93 text

“The overwhelming power of context”
Charity Majors, Ops Person Extraordinaire

Slide 94

Slide 94 text

The technology we run, and how we run it, are interlinked

Slide 95

Slide 95 text

The field of Sociotechnical Systems suggests that all human systems include both a technical system and a social system

Slide 96

Slide 96 text

Better outcomes are usually obtained by a reciprocal process of joint optimization, through which both the technical system and the social system change

Slide 97

Slide 97 text

“Containers will not fix your broken culture”
Bridget Kromhout, World’s nicest Ops Person

Slide 98

Slide 98 text

“Awesome culture will not fix your broken containers”
Me, paraphrasing Bridget

Slide 99

Slide 99 text

We are all collectively evolving the practice of operations

Slide 100

Slide 100 text

Keep sharing, because it’s a pretty amazing ride

Slide 101

Slide 101 text

Questions
And thanks for listening