Slide 1

Slide 1 text

high reliability infrastructure migrations JULIA EVANS @b0rk

Slide 2

Slide 2 text

about me infrastructure engineer stripe payments company billions of dollars 1 year

Slide 3

Slide 3 text

our challenges: 10 millions; sub-millisecond latency; RELIABILITY; SECURITY

Slide 4

Slide 4 text

99.99%: 1 minute / 1 week
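The figures on this slide fit together as a downtime budget: a 99.99% availability target leaves roughly one minute of downtime per week. A minimal sketch of that arithmetic, assuming a 7-day week and availability measured as the fraction of time the service is up (the function name is illustrative and not from the talk):

```python
def downtime_budget_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime allowed in a period for a given availability target."""
    return period_seconds * (1 - availability_percent / 100)


WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds in a week

budget = downtime_budget_seconds(99.99, WEEK_SECONDS)
print(f"{budget:.1f} seconds per week")  # about 60.5 seconds, i.e. roughly 1 minute
```

The same function works for other targets or periods, for example `downtime_budget_seconds(99.9, WEEK_SECONDS)` for three nines.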

Slide 5

Slide 5 text

we made 2 changes: move some workloads to Kubernetes; use Envoy for all service-to-service networking

Slide 6

Slide 6 text


Slide 7

Slide 7 text

reality: this was me

Slide 8

Slide 8 text

on this

Slide 9

Slide 9 text

normal; the problem is WHAT

Slide 10

Slide 10 text

what could go wrong: 99.99%; WHAT; normal

Slide 11

Slide 11 text

the goal: FIXED

Slide 12

Slide 12 text

how to get there: understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback plan

Slide 13

Slide 13 text

Understand Kubernetes design

Slide 14

Slide 14 text

understand the design. K8s design: etcd; everything else

Slide 15

Slide 15 text

understand the design ignore most new software Kubernetes Envoy that's

Slide 16

Slide 16 text

theory isn't enough

Slide 17

Slide 17 text

learn how 1 system breaks

Slide 18

Slide 18 text

cause problems on purpose

Slide 19

Slide 19 text

Run gamedays. Game days: test how your system behaves under known failures; let you learn without duress; share knowledge

Slide 20

Slide 20 text

Run gamedays: terminate an etcd instance; push invalid configuration; destroy all apiserver instances (or just 1); container registry outage; take down Envoy control plane. Run these in QA but also in

Slide 21

Slide 21 text

Run gamedays: Kubernetes terminated every running pod in the cluster (pod eviction). We fixed it, then tested the fix

Slide 22

Slide 22 text

classify your failure modes

Slide 23

Slide 23 text

classify your failure modes: at the beginning

Slide 24

Slide 24 text

all our failure modes: containers don't start; permissions errors; networking issues

Slide 25

Slide 25 text

learn your failure mode. Reasons pods don't start: IAM rate limiting; scheduler bug; etcd is down; lots more (so many reasons)

Slide 26

Slide 26 text

classification; monitoring; heartbeat jobs
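One way to read this slide: once a failure class is known (for example, "containers don't start"), a small job that exercises that path on a schedule and reports in turns the class into a monitoring signal. The sketch below is an assumption about how such a check could look, not the setup described in the talk; `HeartbeatMonitor`, the job names, and the 120-second window are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks when each heartbeat job last reported success (hypothetical design)."""

    max_silence_seconds: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, job_name: str, timestamp: float) -> None:
        # Called by a heartbeat job each time it completes successfully.
        self.last_seen[job_name] = timestamp

    def stale_jobs(self, now: float) -> List[str]:
        # A job is stale if it has not reported within the allowed window.
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
        )


monitor = HeartbeatMonitor(max_silence_seconds=120)
monitor.record("pod-start-heartbeat", timestamp=1000.0)
monitor.record("network-heartbeat", timestamp=1100.0)
print(monitor.stale_jobs(now=1150.0))  # ['pod-start-heartbeat']
```

Naming each heartbeat after a failure class means a stale entry points directly at the class that needs attention.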

Slide 27

Slide 27 text

Have every incident only once

Slide 28

Slide 28 text

Have incidents only once: if you don't understand your incidents, they will happen again

Slide 29

Slide 29 text

Your problem space: normal; WHAT

Slide 30

Slide 30 text

Your problem space: normal; WHAT

Slide 31

Slide 31 text

Have incidents only once: find a problem; find causes; implement remediations; problem never comes back (usually)

Slide 32

Slide 32 text

Fix categories of incidents

Slide 33

Slide 33 text

some Envoy issues: thundering

Slide 34

Slide 34 text

all HTTP/1.1 connection pool issues

Slide 35

Slide 35 text

solution: use HTTP/2. Envoy is designed for HTTP/2

Slide 36

Slide 36 text


Slide 37

Slide 37 text

Have incidents only once: tell your coworkers what you learned (incident reports). Example: etcd; EBS issue; leader elections

Slide 38

Slide 38 text

Have incidents only once

Slide 39

Slide 39 text

Have incidents only once: incidents teach you how to build a reliable system

Slide 40

Slide 40 text

make incremental changes

Slide 41

Slide 41 text

make incremental changes: 5% of traffic; 1 host; a non-critical service
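Sending a small percentage of traffic through a new path, such as 5%, can be sketched with deterministic hash bucketing, so a given request ID always gets the same decision. This is an illustrative assumption about one way to do it, not the mechanism used in the talk; `in_rollout` and the `req-…` IDs are hypothetical.

```python
import hashlib


def in_rollout(request_id: str, percent: float) -> bool:
    """Deterministically place a request in the first `percent` of 10,000 hash buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Measure what share of a synthetic sample lands on the new path at 5%.
sample = [f"req-{i}" for i in range(100_000)]
share = sum(in_rollout(r, 5) for r in sample) / len(sample)
print(f"{share:.1%} of sample requests take the new path")
```

Because the decision depends only on the ID and the percentage, raising `percent` keeps every request that was already on the new path there, and setting it to 0 acts as a rollback.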

Slide 42

Slide 42 text

make incremental changes: establish an interface boundary

Slide 43

Slide 43 text

make incremental changes

Slide 44

Slide 44 text

make incremental changes: client

Slide 45

Slide 45 text

make incremental changes

Slide 46

Slide 46 text

make incremental changes: no haunted forests

Slide 47

Slide 47 text

make incremental changes: don't expose Kubernetes to developers

Slide 48

Slide 48 text

reduce cognitive load reduce support burden

Slide 49

Slide 49 text

escape from YAML: skycfg

Slide 50

Slide 50 text

YAML: what other attributes are supported? what k8s config does it generate?

name: missing-review-finder
owner: risk
schedule: 30 0 * * *
disabled: false
command:
  - ruby
  - scripts/cron/risk-missing-review-finder

Slide 51

Slide 51 text

code:

return stripe_service(
    image = default_image,
    command = einhorn(
        henson_service = "home-srv",
        script = "home/srv.rb",
        workers = 8,
        port = 9768,
    ),
    iam_role = "homesrv.kube.%s.%s" % (
        ctx.vars["stripe.cluster"],
        ctx.vars["stripe.environment"],
    ),
    replicas = 3,
    cpu = kube.cores(4),
    mem = kube.gigabytes(16),
    block_egress = False,
)

Slide 52

Slide 52 text

subset of Python; typechecked; sandboxed

Slide 53

Slide 53 text

github.com/stripe/skycfg

Slide 54

Slide 54 text

always have a rollback plan

Slide 55

Slide 55 text

have a rollback plan

Slide 56

Slide 56 text


Slide 57

Slide 57 text

playbook: understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback plan

Slide 58

Slide 58 text

culture / leadership: it's ok to start out not being an expert, but you need to become one; build an engine of learning; building that expertise takes time

Slide 59

Slide 59 text

managers: make space for your team to learn

Slide 60

Slide 60 text

thanks a lot! ps: we're hiring in Seattle: stripe.com/jobs