Slide 1

Slide 1 text

Mistakes were made Selena Deckelmann [email protected] Twitter/IRC: @selenamarie http://chesnok.com/daily

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

[Diagram: database tables and the jobs that update them.] Tables: build_adu, product_adu, reports_user_info, reports_clean, reports, raw_adu, reports_duplicates, addresses, crash_types, domains, flash_versions, os_names, os_versions, windows_versions, plugins, process_types, products, product_versions, product_version_builds, product_release_channels, release_channels, signatures, uptime_levels, reason, extensions, bug_associations, releases_raw, os_name_matches. Update jobs: update_reports_clean, update_lookup_new_reports, update_os_versions_new_reports, update_product_versions, add_new_product, update_build_adu, update_signatures, update_adu, update_reports_duplicates. Schedules: hourly, daily. Sources: processor, FTP, Metrics, updated by hand.

Slide 6

Slide 6 text

There are only two hard problems in computer science: cache invalidation and naming things. -Phil Karlton

Slide 7

Slide 7 text

PostgreSQL *ahem*

Slide 8

Slide 8 text

In the cloud, availability is a hard problem.

Slide 9

Slide 9 text

• Config Management • Continuous Integration • Distributed Systems • 5-Nines uptime • Sharding

Slide 10

Slide 10 text

DevOps

Slide 11

Slide 11 text

DevOps is a: (a) conspiracy to put developers on-call, (b) conspiracy to get sysadmins to code, (c) response to how bad software is, (d) recognition of how fast networked software evolves and breaks, (e) all of the above.

Slide 12

Slide 12 text

bit.ly/1141ZQH Daniel Dennett’s seven tools for thinking

Slide 13

Slide 13 text

#1 Use your mistakes

Slide 14

Slide 14 text

We are obsessed with failure.

Slide 15

Slide 15 text

Just not our own.

Slide 16

Slide 16 text

Every 1000 lines of code contains 2 to 75 bugs. T.J. Ostrand and E.J. Weyuker, The Distribution of Faults in a Large Industrial Software System, Proc. Int'l Symp. Software Testing and Analysis, ACM Press, 2002, pp. 55-64.
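
To make that range concrete, here is a minimal sketch that scales the quoted figure to a codebase of a given size. The 50,000-line project is a hypothetical example, not a number from the talk, and the estimate assumes the defect density applies uniformly.

```python
# Range quoted on the slide: 2 to 75 bugs per 1000 lines of code.
BUGS_PER_KLOC_LOW = 2
BUGS_PER_KLOC_HIGH = 75


def expected_bug_range(lines_of_code: int) -> tuple[float, float]:
    """Return the (low, high) bug estimate for a codebase of the given size."""
    kloc = lines_of_code / 1000
    return (kloc * BUGS_PER_KLOC_LOW, kloc * BUGS_PER_KLOC_HIGH)


if __name__ == "__main__":
    # Hypothetical 50,000-line project.
    low, high = expected_bug_range(50_000)
    print(f"{low:.0f} to {high:.0f} bugs")
```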

Slide 17

Slide 17 text

“We don’t need a risk management plan,” he emphatically stated, “because this project can’t be allowed to fail.” - Jim Highsmith, http://jimhighsmith.com/2012/01/09/can-do-thinking-makes-risk-management-impossible/

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Failure is an option.

Slide 20

Slide 20 text

http://thisisindexed.com/2010/03/boys-do-cry/ Honesty is hard.

Slide 21

Slide 21 text

“Ratio between success and failure is pretty stable.” Tina Seelig Stanford Technology Ventures Program http://ecorner.stanford.edu/authorMaterialInfo.html?mid=2270

Slide 22

Slide 22 text

Free and open source projects are learning communities.

Slide 23

Slide 23 text

(We fail a lot. Publicly.) http://images.t-nation.com/forum_images/2/c/2cb85_ORIG-I_LIKE_WHERE_THIS_THREAD_IS_GOING.jpg

Slide 24

Slide 24 text

We are experts in studying failure, collaboratively.

Slide 25

Slide 25 text

Teach the world to fail ✓ Plan for the worst. ✓ Minimize risk. ✓ Fail. ✓ Recover, gracefully.

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

"I think getting two accidents of this type at the same time is a freak occurrence." -David Cunliffe, NZ Communications Minister

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

“Further damage was incurred on Tuesday afternoon and our engineers returned to repair the damage,” said Virgin Media.

Slide 30

Slide 30 text

Plan for when things fail.

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Tales of failure to... • Document • Test • Verify • Imagine • Implement

Slide 34

Slide 34 text

Failure to document.

Slide 35

Slide 35 text

Moving Day Thanks, David Prior!

Slide 36

Slide 36 text

Prevent documentation failures. ✓ Write documentation. ✓ Update documentation. ✓ Make documenting a step in your written process. ✓ Assign a fixed amount of time to that step.

Slide 37

Slide 37 text

Documentation tools • Our baby is ugly. We need graphic designers. • Make and keep timelines for updates. • Use bug tracking. • Ordered todo lists.

Slide 38

Slide 38 text

Failure to test.

Slide 39

Slide 39 text

“My first day posing as a sysadmin (~1990, no previous training....) I deleted all zero length files on a Sun workstation.”

Slide 40

Slide 40 text

Prevent testing failures. ✓ Verify success criteria. ✓ Write tests. ✓ Test with a buddy. ✓ Have a plan.

Slide 41

Slide 41 text

Testing tools • All-pairs testing: http://1.usa.gov/dfwu4h • Your favorite test framework • Repeatable shell scripts • Staging environments
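
As a rough illustration of all-pairs testing, the sketch below greedily picks test cases until every pair of parameter values appears together at least once. It is a simple greedy heuristic, not the algorithm behind any particular tool, and it enumerates every full combination, so it only suits small inputs. The parameter names in the usage example are hypothetical.

```python
from itertools import combinations, product


def all_pairs(parameters: dict[str, list]) -> list[dict]:
    """Greedily choose test cases that cover every pair of parameter values.

    Values must be hashable. The result is not guaranteed to be minimal.
    """
    names = list(parameters)

    def pairs_in(case: dict) -> set:
        # All (parameter, value) pairs that one test case exercises together.
        return {((a, case[a]), (b, case[b])) for a, b in combinations(names, 2)}

    # Every full combination is a candidate test case.
    candidates = [dict(zip(names, values))
                  for values in product(*parameters.values())]

    # Every pair that some chosen test case must still cover.
    uncovered = set()
    for case in candidates:
        uncovered |= pairs_in(case)

    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_in(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_in(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical test matrix: 2 x 2 x 2 = 8 exhaustive combinations.
    matrix = {
        "os": ["linux", "windows"],
        "browser": ["firefox", "chrome"],
        "locale": ["en", "de"],
    }
    for case in all_pairs(matrix):
        print(case)
```

The point of the technique is that the chosen set is usually much smaller than the exhaustive one while still exercising every two-way interaction.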

Slide 42

Slide 42 text

Failure to verify.

Slide 43

Slide 43 text

“What does ‘-d’ actually do?”

Slide 44

Slide 44 text

Prevent verification failures. ✓ Have a plan for things going wrong. ✓ Have a staging environment. ✓ Test your rollback plan, not just your implementation plan.
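
One way to test a rollback plan is to apply the change and the rollback to a copy of the starting state and check that you end up where you began. The sketch below assumes a change that can be modelled as edits to a configuration dictionary; the setting names and values are hypothetical.

```python
import copy


def apply_change(config: dict) -> dict:
    """Hypothetical change: raise a connection limit and enable a new flag."""
    updated = copy.deepcopy(config)
    updated["max_connections"] = 200
    updated["new_feature"] = True
    return updated


def rollback_change(config: dict) -> dict:
    """Hypothetical rollback, written on the assumption the old limit was 100."""
    updated = copy.deepcopy(config)
    updated["max_connections"] = 100
    updated.pop("new_feature", None)
    return updated


def rollback_restores(original: dict) -> bool:
    """Return True if applying the change and rolling back gives the original."""
    return rollback_change(apply_change(original)) == original


if __name__ == "__main__":
    print(rollback_restores({"max_connections": 100}))  # staging matches the assumption
    print(rollback_restores({"max_connections": 50}))   # production does not
```

The second call shows the kind of gap this check is meant to catch: the rollback was written against an assumed starting state, and it silently leaves a different value behind when the real state differs.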

Slide 45

Slide 45 text

Verification tools • Staging environments • Your buddy

Slide 46

Slide 46 text

Failure to imagine.

Slide 47

Slide 47 text

For my group the bottom line was "don't trust anyone". Thanks, Maggie!

Slide 48

Slide 48 text

Recover from failures to imagine. ✓ Share your stories of failure. ✓ Talk with people who are different from you. ✓ Act out implementation scenarios.

Slide 49

Slide 49 text

Failure to implement.

Slide 50

Slide 50 text

Re-implement ✓ Fail fast and often. ✓ Learn from mistakes. ✓ Try again.

Slide 51

Slide 51 text

Making the change

Slide 52

Slide 52 text

Who is affected? ✓ Customers ✓ People making the change ✓ Others

Slide 53

Slide 53 text

Before a change ✓ Plan to do a post-mortem. ✓ Document the plan with numbered steps and a timeline. ✓ Test the plan and the rollback plan. ✓ Identify a “point of no return”.

Slide 54

Slide 54 text

During a change ✓ Share screens: UNIX screen, VNC ✓ Use a Chatroom: IRC, AIM, bots, logs ✓ Use Voice: Campfire, Skype, VOIP, POTS ✓ Have Headsets! ✓ Designate a time-keeper ✓ Update documentation

Slide 55

Slide 55 text

Knowing when you’ve failed • Know when the “point of no return” is • Decide how to decide (“3 strikes”) • Decide who will make the call
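
The three decisions above can be written down before the change starts. This is a minimal sketch of one possible "3 strikes" rule, assuming the team counts each failed step as a strike; the class name, field names, and messages are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeWindow:
    """Records strikes during a change and says what to do after each one."""
    decider: str                           # who will make the call
    max_strikes: int = 3                   # how to decide
    past_point_of_no_return: bool = False  # set once rollback is off the table
    strikes: list = field(default_factory=list)

    def record_strike(self, what_went_wrong: str) -> str:
        """Log a failed step and return the agreed next action."""
        self.strikes.append(what_went_wrong)
        if len(self.strikes) < self.max_strikes:
            return "continue"
        if self.past_point_of_no_return:
            return f"escalate to {self.decider}: rollback is no longer possible"
        return f"{self.decider} calls the rollback"


if __name__ == "__main__":
    window = ChangeWindow(decider="on-call lead")
    print(window.record_strike("migration script timed out"))
    print(window.record_strike("replica lag above threshold"))
    print(window.record_strike("health check failed"))
```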

Slide 56

Slide 56 text

After a change • Use “5 whys” to explore failures. • Hold a post-mortem to identify areas of success and areas for improvement. • Limit improvements to 1-2 things.

Slide 57

Slide 57 text

Succeed with a Post-Mortem ✓ Set expectation for 100% participation ✓ Designate a note keeper & time keeper ✓ Everyone shares a success, failure, something to do better ✓ Vote anonymously on what to do next ✓ Communicate meeting notes out

Slide 58

Slide 58 text

Failure is an Option: Failure Barriers and New Firm Performance -by Robert Eberhart, Charles Eesley, Kathleen Eisenhardt January 10, 2012 http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1982819 When you change the institutional expectation for failure, people take more and better risks.

Slide 59

Slide 59 text

Examples of how to lower failure barriers • Prioritize documentation • Fund staging environments • Schedule maintenance during normal working hours

Slide 60

Slide 60 text

Lower the barriers to failure.

Slide 61

Slide 61 text

Things to read • Checklist Manifesto, Atul Gawande • Liespotting: Proven Techniques to Detect Deception, Pam Meyer • Everything is Obvious, Duncan Watts • Ops presentations by Etsy.com • DailyWTF, Full Disclosure, Bruce Schneier

Slide 62

Slide 62 text

Thanks! Selena Deckelmann @selenamarie [email protected]

Slide 63

Slide 63 text

Photo credits • Flickr: sheepguardingllama • (thereifixedit link)