Miner’s canary
• If a customer lets you know about a problem then you have already failed at least twice
• The right quantity
• Filtering – see the right picture
• Document changes to your baselines

Single responsibility principle
• (Micro-)Services
• Components
• Small number of dependencies
• Predictable failure modes
• Easier adaptation
• Expectation on metrics

Orchestration or Choreography
• Orchestration
  – May be simpler to reason about
  – Coupling with the director
• Choreography
  – Possibly more flexible
  – Beware of corruption of state
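
To make the contrast concrete, here is a minimal sketch of the same three-step flow written both ways. The step names (`reserve`, `charge`, `ship`), the event names, and the `EventBus` class are hypothetical and only serve as illustration; they are not taken from any particular system.

```python
from collections import defaultdict

# Hypothetical business steps; each one mutates the shared order state.
def reserve(order): order["reserved"] = True
def charge(order): order["charged"] = True
def ship(order): order["shipped"] = True

def orchestrate(order):
    """Orchestration: one director knows every step and the order of the steps."""
    for step in (reserve, charge, ship):
        step(order)
    return order

class EventBus:
    """Tiny in-process publish/subscribe bus (illustrative only)."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, order):
        for handler in list(self.handlers[event]):
            handler(order)

def choreograph(order):
    """Choreography: each component reacts to an event and emits the next one."""
    bus = EventBus()

    def on_placed(o):
        reserve(o)
        bus.publish("reserved", o)

    def on_reserved(o):
        charge(o)
        bus.publish("charged", o)

    def on_charged(o):
        ship(o)

    bus.subscribe("placed", on_placed)
    bus.subscribe("reserved", on_reserved)
    bus.subscribe("charged", on_charged)
    bus.publish("placed", order)
    return order
```

In the orchestrated version every component is coupled to the director, but the whole flow is readable in one place. In the choreographed version a new subscriber can be added without touching the others, at the price that no single place owns the state, so a handler failing halfway can leave the order partially updated.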

Rate limiting
• Degraded is better than nothing
• Not only at the top level
• Component rate limiting
• Rate limiting should be dynamic
• Rate limiting can be partitioned
• Clients should be part of the contract
• Rate limiting is after all handshaking
• Handshaking: within the protocol or out of band
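
One possible way to get rate limiting that is both dynamic and partitioned is a token bucket per client whose refill rate can be changed at runtime. The sketch below is an assumption, not the method the slides prescribe: the class names, the `set_rate` method, and the per-client partition key are all hypothetical. Time is passed in explicitly so the behavior is easy to check; a real caller would typically pass `time.monotonic()`.

```python
class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/second."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class PartitionedLimiter:
    """One bucket per partition key (here: a client id), with an adjustable rate."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client, now):
        if client not in self.buckets:
            self.buckets[client] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[client].allow(now)

    def set_rate(self, rate):
        # Dynamic limiting: e.g. lower the rate while a dependency is degraded.
        self.rate = rate
        for bucket in self.buckets.values():
            bucket.rate = rate
```

A refused request is the degraded answer: the caller is told to back off instead of the whole service falling over. Partitioning keeps one noisy client from exhausting the budget of the others.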

Automated runbooks and checklists
• Automate your SOP
• Respond to failure with a checklist
• Automate checklists too
• Helps to avoid the cognitive bias and other nasty stuff your brain does
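
An automated checklist can be as small as a list of named checks that are always run in full, in the same order. The following is a minimal sketch under that assumption; `run_checklist` and the example step names in the usage note are hypothetical.

```python
def run_checklist(steps):
    """Run every (name, check) pair in order and record the outcome.

    A failing or crashing check does not stop the run, so the responder
    always gets the complete picture instead of the first symptom only.
    """
    results = []
    for name, check in steps:
        try:
            ok = bool(check())
            detail = ""
        except Exception as exc:  # a crashing check counts as a failed item
            ok, detail = False, repr(exc)
        results.append((name, ok, detail))
    return results
```

For instance, `run_checklist([("disk space", check_disk), ("replica lag", check_lag)])`, where `check_disk` and `check_lag` are placeholder callables returning a truthy value on success, yields one `(name, ok, detail)` tuple per item. Because every item is evaluated every time, the result does not depend on what the person on call happens to suspect first.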

Sources
• Recovery Oriented Computing Papers
• James Hamilton LISA paper
• Release It!
• Scalable Internet Architectures
• A ton of other great books and papers

The value
Among the kinds of overhead:
• The operational one
• The customer one
No matter how sophisticated our monitoring infrastructure is, the issues reported by customers are in the end the most important ones, as they impact the customer experience directly and often uncover unknown bugs. Freeing up the team as much as possible from the first kind of overhead gives it more time to focus on the issues of the product itself.