Berlin 2013 - Session - Reza Spagnolo

Slide 1

Slide 1 text

Adaptive Application Architecture
Reza Spagnolo @rmspagnolo

Slide 2

Slide 2 text

Hey there ! Who am I ?
•  A student
•  An engineer, for 9 years now
•  Interested in building systems
•  Dev & Ops since the beginning

Slide 3

Slide 3 text

#monitoringsucks but never sucked for real

Slide 4

Slide 4 text

Monitoring is an architecture component

Slide 5

Slide 5 text

Infrastructure is code

Slide 6

Slide 6 text

Monitoring is code
•  Development process
•  Testing
•  Deployment

Slide 7

Slide 7 text

Monitoring is service
•  Metrics
•  Alerts

Slide 8

Slide 8 text

Namespaces
There are only two hard things in Computer Science: cache invalidation and naming things.

Slide 9

Slide 9 text

#softwaresucks without namespaces

Slide 10

Slide 10 text

Metrics namespaces
•  Helps your mental model
•  Helps identifying things
•  Dimensions: location, versions, etc.
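A minimal sketch of a namespaced metric name, assuming a dotted naming scheme. The function names, the dimension order, and the example values are hypothetical and not taken from the talk.

```python
# Hypothetical, fixed order of dimensions so that every metric path has the same shape.
DIMENSION_ORDER = ("environment", "location", "version")


def sanitize(value):
    # Dots separate namespace levels, so they must not appear inside a level.
    return str(value).replace(".", "_").replace(" ", "_")


def metric_name(service, metric, **dimensions):
    """Build a dotted metric path: dimensions first, then service and metric."""
    unknown = set(dimensions) - set(DIMENSION_ORDER)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    levels = [sanitize(dimensions[d]) for d in DIMENSION_ORDER if d in dimensions]
    return ".".join(levels + [sanitize(service), sanitize(metric)])


name = metric_name(
    "checkout", "latency_ms",
    location="eu-west", environment="prod", version="1.4.2",
)
print(name)  # prod.eu-west.1_4_2.checkout.latency_ms
```

Keeping the dimension order fixed means two metrics can be compared level by level, which is what makes the namespace useful for identifying things.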

Slide 11

Slide 11 text

Monitoring-based promotion
Environments: Acceptance, Development, Production
•  Production configuration
•  Comparison
•  Log analysis
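One way to read the "Comparison" bullet is as a gate: a build is promoted only if its metrics do not regress against the production baseline. The sketch below assumes lower-is-better metrics and a hypothetical 10% tolerance; the function and metric names are illustrative only.

```python
def can_promote(baseline, candidate, tolerance=0.10):
    """Compare candidate metrics with the baseline.

    Assumes every metric is lower-is-better (latency, error rate).
    Returns (ok, regressions), where regressions maps a metric name
    to a short explanation.
    """
    regressions = {}
    for name, base_value in baseline.items():
        value = candidate.get(name)
        if value is None:
            regressions[name] = "missing in candidate"
        elif value > base_value * (1 + tolerance):
            regressions[name] = f"{value} exceeds baseline {base_value}"
    return (not regressions, regressions)


production = {"p95_latency_ms": 200.0, "error_rate": 0.01}
acceptance = {"p95_latency_ms": 210.0, "error_rate": 0.02}
ok, regressions = can_promote(production, acceptance)
print(ok, sorted(regressions))
```

A metric missing from the candidate counts as a regression, so a build that stopped reporting cannot pass the gate by accident.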

Slide 12

Slide 12 text

Monitoring deployment
•  Push changes
•  Keep correspondence
•  Automate
•  Namespaces

Slide 13

Slide 13 text

Synthetic traffic

Slide 14

Slide 14 text

Canaries

Slide 15

Slide 15 text

Miner’s canary
•  If a customer lets you know about a problem, then you have already failed at least twice
•  The right quantity
•  Filtering – see the right picture
•  Document changes to your baselines
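A canary check only helps when it is compared with a known baseline. The sketch below flags a canary reading that sits too many standard deviations away from baseline samples. The three-sigma threshold and all names are assumptions for illustration, not the speaker's method.

```python
from statistics import mean, pstdev


def deviates_from_baseline(baseline_samples, canary_value, max_sigmas=3.0):
    """Return True when the canary value is far from the baseline distribution."""
    mu = mean(baseline_samples)
    sigma = pstdev(baseline_samples)
    if sigma == 0:
        # A perfectly flat baseline: any difference counts as a deviation.
        return canary_value != mu
    return abs(canary_value - mu) / sigma > max_sigmas


baseline_latency_ms = [100, 102, 98, 101, 99]
print(deviates_from_baseline(baseline_latency_ms, 101))  # within normal variation
print(deviates_from_baseline(baseline_latency_ms, 120))  # far outside it
```

When the baseline changes on purpose (new hardware, new release), the stored samples need to be replaced and the change written down, which is the point of the last bullet.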

Slide 16

Slide 16 text

Other types of birds

Slide 17

Slide 17 text

The pretty ones we just saw

Slide 18

Slide 18 text

The Angry ones

Slide 19

Slide 19 text

And monkeys !

Slide 20

Slide 20 text

Auditing
Events timeline
•  Changes
•  Deployments
•  Rollbacks
•  Alarms
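An events timeline becomes useful when an alarm can be lined up with the changes that preceded it. The following is a minimal sketch with a hypothetical `Event` record and a hypothetical 30-minute look-back window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str    # "change", "deployment", "rollback", or "alarm"
    detail: str


def events_preceding(timeline, alarm, window=timedelta(minutes=30)):
    """Return non-alarm events that happened within `window` before the alarm, oldest first."""
    return sorted(
        (e for e in timeline
         if e.kind != "alarm" and alarm.at - window <= e.at <= alarm.at),
        key=lambda e: e.at,
    )


start = datetime(2000, 1, 1, 12, 0)  # arbitrary example timestamp
timeline = [
    Event(start, "deployment", "release 42"),
    Event(start + timedelta(minutes=50), "change", "raise pool size"),
    Event(start + timedelta(minutes=60), "alarm", "error rate high"),
]
for event in events_preceding(timeline, timeline[-1]):
    print(event.kind, event.detail)
```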

Slide 21

Slide 21 text

Architecture
•  Single responsibility principle
•  Orchestration or Choreography
•  Dynamic configuration
•  Failover and feedback cycles
•  Rate limiting
•  Integration patterns

Slide 22

Slide 22 text

Single responsibility principle
•  (Micro-)Services
•  Components
•  Small number of dependencies
•  Predictable failure modes
•  Easier adaptation
•  Expectation on metrics

Slide 23

Slide 23 text

Orchestration or Choreography
•  Orchestration
   – May be simpler to reason about
   – Coupling with the director
•  Choreography
   – Possibly more flexible
   – Beware of corruption of state

Slide 24

Slide 24 text

Dynamic configuration
•  Reconfigurable at runtime
•  Fast reaction
•  Beware of snowflakes
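A minimal sketch of runtime-reconfigurable settings, assuming an in-process store guarded by a lock. The class name and keys are hypothetical. The `overrides` method is one possible way to keep an eye on "snowflake" instances whose settings have drifted from the defaults.

```python
import threading


class DynamicConfig:
    """Settings that can be changed while the process is running."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._defaults = dict(defaults)
        self._values = dict(defaults)

    def get(self, key):
        with self._lock:
            return self._values[key]

    def update(self, key, value):
        with self._lock:
            if key not in self._defaults:
                # Reject unknown keys so typos do not silently create new settings.
                raise KeyError(key)
            self._values[key] = value

    def overrides(self):
        """Settings that differ from the defaults, i.e. potential snowflake drift."""
        with self._lock:
            return {k: v for k, v in self._values.items() if v != self._defaults[k]}


config = DynamicConfig({"pool_size": 10, "timeout_s": 2.0})
config.update("pool_size", 25)   # fast reaction, no restart needed
print(config.get("pool_size"), config.overrides())
```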

Slide 25

Slide 25 text

Failover and feedback cycles
•  Automated failover
•  Failover stress
•  Beware of amplifying effects
•  Break cycles
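Retries are a common amplifying effect: when a component slows down, every client retrying at once adds more load to it. One possible way to break that cycle, shown here as an assumption rather than the speaker's recommendation, is a retry budget that caps retries at a percentage of the requests seen. All names are hypothetical.

```python
class RetryBudget:
    """Allow retries only up to a fixed percentage of the requests seen so far."""

    def __init__(self, retry_percent=10, min_retries=1):
        self.retry_percent = retry_percent
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Integer arithmetic keeps the allowance exact.
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False  # budget spent: fail instead of adding load


budget = RetryBudget(retry_percent=10)
for _ in range(20):
    budget.record_request()
print([budget.can_retry() for _ in range(3)])  # [True, True, False]
```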

Slide 26

Slide 26 text

Rate limiting
•  Degraded is better than nothing
•  Not only at the top level
•  Component rate limiting
•  Rate limiting should be dynamic
•  Rate limiting can be partitioned
•  Clients should be part of the contract
•  Rate limiting is after all handshaking
•  Handshaking: within the protocol or out of band
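The "dynamic" and "partitioned" bullets can be illustrated with a token bucket per client. This is a sketch under assumptions: the talk does not name an algorithm, and the class names, rates, and the injectable clock are illustrative choices.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate can be changed at runtime."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def set_rate(self, rate):
        self._refill()                # settle tokens earned under the old rate first
        self.rate = rate

    def allow(self):
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PartitionedLimiter:
    """One bucket per client, so a noisy client cannot starve the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()


limiter = PartitionedLimiter(rate=5.0, capacity=2)
print([limiter.allow("client-a") for _ in range(2)], limiter.allow("client-b"))
```

A rejected call is the signal the client is expected to act on, which is one reading of rate limiting as handshaking.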

Slide 27

Slide 27 text

Integration and component patterns
•  Timeouts
•  Circuit breakers
•  Resource pools
•  Fail fast
•  Queue and retry
•  Application pings and sanity checks
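As an illustration of two of these patterns together, here is a minimal circuit breaker that fails fast while open and lets one trial call through after a cool-down. The thresholds, the names, and the injectable clock are assumptions made for this sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is known to be failing."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after    # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None                    # success closes the circuit
        return result


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
print(breaker.call(lambda: "ok"))
```

Failing fast keeps callers from piling up behind a dependency that is already in trouble.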

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Additional practices
•  Quarantine
•  Regenerative infrastructure
•  Rollback and monitoring
•  Automation of SOP – Runbook

Slide 30

Slide 30 text

Automated runbooks and checklists
•  Automate your SOP
•  Respond to failure with a checklist
•  Automate checklists too
•  Helps to avoid the cognitive bias and other nasty stuff your brain does
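A checklist can be automated as an ordered list of named checks that always run to the end. The sketch below uses hypothetical step descriptions and a hypothetical runner function. A real runbook would call actual health checks instead of the lambdas shown.

```python
def run_checklist(steps):
    """Run every step in order and record pass/fail; a failing step never skips the rest."""
    results = []
    for description, check in steps:
        try:
            ok = bool(check())
        except Exception as exc:
            # A crashing check is recorded as a failure, with the reason attached.
            ok = False
            description = f"{description} (error: {exc})"
        results.append((description, ok))
    return results


def broken_check():
    raise RuntimeError("probe unreachable")


steps = [
    ("Disk usage below threshold", lambda: True),
    ("Replica lag acceptable", lambda: False),
    ("Health probe answers", broken_check),
]
for description, ok in run_checklist(steps):
    print("PASS" if ok else "FAIL", description)
```

Running every step regardless of earlier results gives the responder the full picture, instead of stopping at the first finding that confirms a hunch.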

Slide 31

Slide 31 text

Discipline !

Slide 32

Slide 32 text

Sources
•  Recovery Oriented Computing papers
•  James Hamilton LISA paper
•  Release It!
•  Scalable Internet Architectures
•  A ton of other great books and papers

Slide 33

Slide 33 text

The value
Among the kinds of overhead:
•  The operational one
•  The customers' one
No matter how sophisticated our monitoring infrastructure is, issues reported by customers are in the end the most important ones, as they impact the customers' experience directly and often uncover unknown bugs. Freeing the team as much as possible from the first kind of overhead gives it more time to focus on the issues of the product itself.

Slide 34

Slide 34 text

Thank you !