Failure Friday (Ignite Talk)

:( Your server has disappeared for no adequately explained reason.
Happy debugging. If you’d like to know more, you can search online later for this error: ERR_DEAL_WITH_IT

Embrace Failure
"I come from a long line of quitters. My father was a quitter, my grandfather was a quitter… I was raised to give up."

Simian Army
"We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient."
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
The Netflix Simian Army is available on GitHub: https://github.com/Netflix/SimianArmy

“But We’re Not There Yet…”

Failure Friday
• 1 hour every Friday.
• Operations and Development teams attend.
• Manually cause failures (No automation required).
• Uncovers issues.
• Builds team culture.
• Tests incident response…

“Good Lord! I'm getting a reading of over 40 Mega-Fonzies!”

Getting Started
• Notify team(s).
• Set the meeting.
• Make an agenda.

Get Ready to Rumble
• Announce the start.
• Disable cron jobs.
• Leave alerts on!
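The slide does not say which cron jobs are disabled or how. One possible approach, sketched here under the assumption that the jobs live in the current user's crontab, is to save the entries and remove them for the duration of the exercise. The function names and the backup path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: park the current user's crontab during the exercise.
# Assumption: the jobs to pause are in a user crontab, not /etc/cron.d.
CRONTAB="${CRONTAB:-crontab}"
BACKUP="${BACKUP:-${TMPDIR:-/tmp}/crontab.failure-friday.bak}"   # hypothetical path

disable_cron() {
  "$CRONTAB" -l > "$BACKUP" || return 1   # save the current entries first
  "$CRONTAB" -r                           # then remove the installed crontab
}

restore_cron() {
  "$CRONTAB" "$BACKUP"                    # reinstall the saved entries
}
```

Call `disable_cron` after announcing the start and `restore_cron` once the session ends. If saving the entries fails, nothing is removed.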

1 break <thing>
2 wait 5mins
3 fix <thing>
4 wait until !broken
5 repeat
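The talk stresses that no automation is required, so the steps above can simply be followed by hand. For teams that want a checklist in script form, a minimal sketch follows. `break_thing`, `fix_thing` and `is_broken` are hypothetical placeholders; here they only toggle a marker file so that the loop is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Sketch of one round: break, wait, fix, wait for recovery.
MARKER="${MARKER:-${TMPDIR:-/tmp}/failure-friday.broken}"   # hypothetical stand-in
WAIT_SECONDS="${WAIT_SECONDS:-300}"   # "wait 5mins" from the slide
POLL_SECONDS="${POLL_SECONDS:-5}"     # assumed polling interval

break_thing() { touch "$MARKER"; }    # replace with the real failure injection
fix_thing()   { rm -f "$MARKER"; }    # replace with the real remediation
is_broken()   { [ -e "$MARKER" ]; }   # replace with a real health check

run_round() {
  local thing="$1"
  echo "breaking ${thing}"
  break_thing
  sleep "$WAIT_SECONDS"               # leave it broken and watch the alerts
  echo "fixing ${thing}"
  fix_thing
  while is_broken; do                 # wait until !broken
    sleep "$POLL_SECONDS"
  done
  echo "${thing} recovered"
}
```

The "repeat" step is then a loop over the agenda, for example `for t in cache queue db; do run_round "$t"; done`, where the names are illustrative.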

• Suspend/Stop Process.
• Reboot Host.
• Network Isolation.
• Network latency.
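The slides do not show commands for the first three failure types. The sketch below is one plausible mapping to standard Linux tools, assuming a systemd host with `iptables` available. The function names are hypothetical, and by default the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

suspend_process() { run kill -STOP "$1"; }      # pause a process by PID
resume_process()  { run kill -CONT "$1"; }      # let it continue
stop_service()    { run systemctl stop "$1"; }  # assumes systemd
start_service()   { run systemctl start "$1"; }
reboot_host()     { run shutdown -r now; }

# Crude isolation: drop inbound TCP traffic to one port, then undo it.
isolate_port()    { run iptables -I INPUT -p tcp --dport "$1" -j DROP; }
restore_port()    { run iptables -D INPUT -p tcp --dport "$1" -j DROP; }
```

With the default dry run, `isolate_port 5432` prints the `iptables` command so it can be reviewed or pasted by hand, which fits the manual style of the exercise. Setting `DRY_RUN=0` executes the commands and requires root.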

> tc qdisc add dev eth0 root netem \
    delay 500ms 100ms \
    loss 5%

> tc qdisc del dev eth0 root netem
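In the first command, `delay 500ms 100ms` adds 500 ms of latency with up to 100 ms of variation, and `loss 5%` randomly drops 5% of packets. The second command removes the rule again. A small wrapper, sketched below, pairs the two so the cleanup is not forgotten. The function names and the `IFACE` variable are hypothetical, and commands are only printed unless `DRY_RUN=0` is set (running them for real requires root).

```shell
#!/usr/bin/env bash
# Sketch only: apply netem latency and loss for a fixed time, then remove it.
IFACE="${IFACE:-eth0}"   # interface name taken from the slide

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"            # dry run: show the command
  else
    "$@"
  fi
}

add_latency()   { run tc qdisc add dev "$IFACE" root netem delay "$1" "$2" loss "$3"; }
clear_latency() { run tc qdisc del dev "$IFACE" root netem; }

with_latency() {
  local seconds="$1"
  add_latency 500ms 100ms 5% || return 1   # same values as the slide
  sleep "$seconds"                         # observe the system while degraded
  clear_latency
}
```

For example, `DRY_RUN=0 with_latency 300` would degrade the interface for five minutes and then restore it.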

• Found bugs (and fixed them).
• Knowledge sharing between dev and ops.
• Highlighted untestable systems.
• Incident training.
• Excellent for onboarding.


Rich Adams
