Miner’s canary
• If a customer lets you know about a problem, then you have already failed at least twice
• The right quantity
• Filtering – see the right picture
• Document changes to your baselines
Architecture
• Single responsibility principle
• Orchestration or Choreography
• Dynamic configuration
• Failover and feedback cycles
• Rate limiting
• Integration patterns
Slide 22
Single responsibility principle
• (Micro-)Services
• Components
• Small number of dependencies
• Predictable failure modes
• Easier adaptation
• Expectation on metrics
Slide 23
Orchestration or Choreography
• Orchestration
– May be simpler to reason about
– Coupling with the director
• Choreography
– Possibly more flexible
– Beware of corruption of state
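The contrast above can be sketched in a few lines. This is a minimal illustration, not a production design: the order workflow, the step functions and the `EventBus` class are hypothetical names chosen for the example. In the orchestrated version a director owns the sequence. In the choreographed version each component reacts to an event and emits the next one, so nobody holds the whole flow.

```python
from collections import defaultdict


def reserve_stock(order):
    # Hypothetical step used only for illustration.
    return f"stock reserved for {order}"


def charge_payment(order):
    # Hypothetical step used only for illustration.
    return f"payment charged for {order}"


def orchestrate_order(order, steps):
    """Orchestration: one director calls every step, in order."""
    return [step(order) for step in steps]


class EventBus:
    """Choreography: components subscribe to events and react on their own."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        self._handlers[event].append(handler)

    def publish(self, event, payload):
        for handler in list(self._handlers[event]):
            handler(payload)


def wire_choreography(bus, trail):
    """Each handler does its own work, then announces what happened."""

    def on_order_placed(order):
        trail.append(reserve_stock(order))
        bus.publish("stock_reserved", order)

    def on_stock_reserved(order):
        trail.append(charge_payment(order))

    bus.subscribe("order_placed", on_order_placed)
    bus.subscribe("stock_reserved", on_stock_reserved)


if __name__ == "__main__":
    print(orchestrate_order("order-1", [reserve_stock, charge_payment]))
    bus, trail = EventBus(), []
    wire_choreography(bus, trail)
    bus.publish("order_placed", "order-1")
    print(trail)
```

Both variants do the same work here. The difference is where the knowledge of the sequence lives: in `orchestrate_order`, or spread across the subscriptions.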
Slide 24
Dynamic configuration
• Reconfigurable at runtime
• Fast reaction
• Beware of snowflakes
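One possible shape for runtime reconfiguration is sketched below. It assumes an in-process store. The `DynamicConfig` class and the `max_connections` key are hypothetical, and recording a reason for every change is only one way to keep ad hoc tweaks from turning a node into a snowflake.

```python
import threading
from types import MappingProxyType


class DynamicConfig:
    """In-process configuration that can be changed while the service runs."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        # Readers always see a complete, read-only snapshot.
        self._snapshot = MappingProxyType(dict(initial))
        self._history = []

    def get(self, key):
        return self._snapshot[key]

    def update(self, changes, reason):
        """Apply changes by swapping in a new snapshot and log why."""
        with self._lock:
            merged = dict(self._snapshot)
            merged.update(changes)
            self._snapshot = MappingProxyType(merged)
            self._history.append((dict(changes), reason))

    def history(self):
        """Return the recorded changes, oldest first."""
        with self._lock:
            return list(self._history)


if __name__ == "__main__":
    config = DynamicConfig({"max_connections": 100})
    config.update({"max_connections": 50}, reason="shed load during incident")
    print(config.get("max_connections"), config.history())
```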
Slide 25
Failover and feedback cycles
• Automated failover
• Failover stress
• Beware of amplifying effects
• Break cycles
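Retries are a common amplifier: when a dependency slows down, every caller retrying at once multiplies the load on it. The slides do not name a specific remedy. The sketch below uses a retry budget, a technique swapped in here for illustration, and the `RetryBudget` class and its parameters are hypothetical.

```python
class RetryBudget:
    """Allow retries only as a fraction of regular traffic.

    Each regular request earns `ratio` tokens and each retry spends one,
    so retries cannot grow faster than the traffic that funds them.
    """

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def record_request(self):
        """Call once per regular (non-retry) request."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self):
        """Spend one token if available; otherwise the caller should give up."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    budget = RetryBudget(ratio=0.5, max_tokens=2.0)
    print([budget.can_retry() for _ in range(3)])
```

Once the budget is empty, failures are surfaced instead of retried, which breaks the cycle of load causing failures causing more load.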
Slide 26
Rate limiting
• Degraded is better than nothing
• Not only at the top level
• Component rate limiting
• Rate limiting should be dynamic
• Rate limiting can be partitioned
• Clients should be part of the contract
• Rate limiting is, after all, handshaking
• Handshaking: within the protocol or out of band
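Two of the points above, dynamic and partitioned rate limiting, can be illustrated with a token bucket. The slides do not prescribe an algorithm, so the token bucket is an assumption, as are the class names and the per-client partitioning key. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate can be changed at runtime."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self):
        now = self._clock()
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def set_rate(self, rate):
        """Dynamic limit: settle tokens at the old rate, then switch."""
        self._refill()
        self.rate = rate

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits the limit."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PartitionedLimiter:
    """One bucket per client, so a noisy client only exhausts its own share."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = PartitionedLimiter(rate=1.0, capacity=2.0)
    print([limiter.allow("client-a") for _ in range(3)])
    print(limiter.allow("client-b"))
```

A rejected request is where the handshaking comes in: the caller can be told to back off, whether inside the protocol or through a separate channel.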
Slide 27
Integration and component patterns
• Timeouts
• Circuit breakers
• Resource pools
• Fail fast
• Queue and retry
• Application pings and sanity checks
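As an illustration of how a circuit breaker and fail fast fit together, here is a minimal sketch. The thresholds, the `CircuitBreaker` class and the `CircuitOpenError` exception are assumptions made for the example, not taken from any particular library. After a number of consecutive failures the breaker opens and rejects calls immediately. After a cool-down it lets one trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to stay open
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, allow this one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after=5.0)

    def flaky():
        raise ConnectionError("dependency down")

    for _ in range(3):
        try:
            breaker.call(flaky)
        except (ConnectionError, CircuitOpenError) as exc:
            print(type(exc).__name__)
```

The wrapped call should carry its own timeout, since a breaker only counts failures and a call that hangs never reports one.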
Slide 28
No content
Slide 29
Additional practices
• Quarantine
• Regenerative infrastructure
• Rollback and monitoring
• Automation of SOP – Runbook
Slide 30
Automated runbooks and checklists
• Automate your SOP
• Respond to failure with a checklist
• Automate checklists too
• Helps to avoid the cognitive bias and other nasty stuff your brain does
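A checklist can be automated with very little machinery. The sketch below assumes each step is a description plus a function that returns True or False. The `Step` type, `run_checklist` and the example checks are hypothetical. Every step runs and is recorded even if an earlier one fails, so the responder sees the full picture instead of stopping at the first surprise.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str
    check: Callable[[], bool]


def run_checklist(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run every step in order and record whether it passed."""
    results = []
    for step in steps:
        try:
            ok = bool(step.check())
        except Exception:
            # A check that crashes counts as failed, not as skipped.
            ok = False
        results.append((step.description, ok))
    return results


if __name__ == "__main__":
    # Placeholder checks; real ones would query the systems involved.
    checklist = [
        Step("Primary database reachable", lambda: True),
        Step("Replication lag under threshold", lambda: False),
    ]
    for description, ok in run_checklist(checklist):
        print(("PASS" if ok else "FAIL"), description)
```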
Slide 31
Discipline!
Slide 32
Sources
• Recovery Oriented Computing Papers
• James Hamilton LISA paper
• Release It!
• Scalable Internet Architectures
• A ton of other great books and papers
Slide 33
The value
Among the kinds of overhead:
• The operational one
• The customers' one
No matter how sophisticated our monitoring infrastructure is, issues reported by customers are in the end the most important ones, as they impact the customers' experience directly and often reveal unknown bugs. Freeing the team as much as possible from the first kind of overhead gives it more time to focus on the issues of the product itself.