What Could Go Wrong? [ParisRB Conf 2020]

This is a simple talk. I’m going to start and finish with a single question. If you forget everything in between, that’s fine. But as people who make things, I want to help us ask one simple question…

WHAT COULD GO WRONG?
What could go wrong? That’s it. Every LOC, every test. Every gem. Every migration, deploy, rake task. Every time you ssh into production (without telling anyone). What. could. go. wrong?

WHAT COULD GO WRONG? ¯\_(ツ)_/¯
Now obviously there’s a way to ask this rhetorically… [dismissively] what could go wrong? ¯\_(ツ)_/¯ We know what this looks like…

Whether you’re about to set your friend on fire…

- or you need to cross a bridge
- with just a bit more river than usual.
- And there’s a familiar conversation happening in this car, I think
- We can make it!
- No we can’t!
- And I think we have this same conversation again and again
- in professional and personal contexts
- whether you’re parenting or pushing to production

WHAT COULD GO WRONG?
So my goal today is to help us ask that question a little more critically: what could go wrong? Not only so we can make better decisions, but also so we can disagree more rigorously about crossing the bridge.

DANGER
A long time ago, in another lifetime, I studied environmental chemistry. I don’t remember much from my chemistry degree, but I remember a couple of things: 1. everything is toxic if you have enough of it, and 2…

RISK × HAZARD = DANGER
Any danger has two factors: risk and hazard. Now, chemistry was a long time ago, and as soon as I say that, I’m terrified someone’s going to come up afterwards and say…

Well, actually…
“Well, acktually: ISO 31000 defines risk as…” You are probably right. So let’s be simple…

DANGER (NOUN): THE POSSIBILITY OF SUFFERING HARM OR INJURY
One dictionary describes danger as [read]. Again, there are two factors we can quantify: “possibility” and “harm”.

DANGER = RISK × HAZARD
Just like risk and hazard: two factors. The terminology doesn’t matter. But, just to appease the ISO gods, let’s choose some different words…

DANGEROUSNESS = LIKELINESS × BADNESS
The dangerousness of something is determined by its likeliness and badness.

WHAT COULD GO WRONG? HOW LIKELY IS IT? HOW BAD IS IT?
Ultimately I want us to ask “what could go wrong”
- and then
- how likely is it?
- how bad is it?
This is it. This is the talk.
1. What could go wrong?
2. How likely is it?
3. How bad is it?
So let’s break this down a little bit.

EXAMPLES ~[3:00]
If I tried to walk across a balance beam a meter or so off the ground, I’d be fine. BUT… if we put that beam at the same height as a famous towering local landmark… perhaps

La tour Montparnasse
No amount of money would convince me to walk across it! And I’ve thought hard about this. I think that’s intuitive for most people. But let’s think about what’s really changed here:

DANGEROUSNESS = LIKELINESS × BADNESS
* the likeliness of falling stays the same:
* I’m just as capable
* if anything I’m far more motivated to step accurately

DANGEROUSNESS = LIKELINESS × BADNESS
But the badness changes dramatically. From that height I’d hit the ground at ~200 km/h.

DANGEROUSNESS = LIKELINESS × BADNESS ☠
* so the scenario is more dangerous,
* not because falling is more likely,
* but because the “how bad would it be” changed from
* a sore ankle
* to je suis splat
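
As a rough sketch of the beam comparison above, holding likeliness constant and changing only badness looks like this. The numbers are invented purely for illustration, and the `Scenario` struct is a hypothetical helper, not something from the talk:

```ruby
# Hypothetical scoring: likeliness is a probability (0..1) and badness is an
# arbitrary "harm" score. Both values are made up for illustration.
Scenario = Struct.new(:name, :likeliness, :badness) do
  # dangerousness = likeliness x badness
  def dangerousness
    likeliness * badness
  end
end

low_beam  = Scenario.new("beam 1 m off the ground", 0.05, 1)         # sore ankle
high_beam = Scenario.new("beam at tower height",    0.05, 1_000_000) # fatal fall

[low_beam, high_beam].each do |scenario|
  puts format("%-25s dangerousness = %.2f", scenario.name, scenario.dangerousness)
end
```

The likeliness column is identical in both rows; only the badness moves the result.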

- Now let’s return to our disagreement at the bridge
- What are we really disagreeing about here?

BRIDGE COLLAPSE? LIKELINESS × BADNESS
- Sure, everyone agrees a bridge collapsing would be very bad
- One thinks: it’s definitely going to go
- The other thinks: it’s probably strong enough
- and straight away we get to the heart of the disagreement
- We disagree about likelihood.

WAVE? LIKELINESS × BADNESS
- And maybe everyone can see, 100%, that water is going to hit the car
- But one thinks: that won’t be too bad
- And the other thinks: that’ll break the car
- I have no idea how cars/bridges/rivers work
- But as we actually identify what could go wrong
- and distinguish the badness and the likeliness
- we can start to disagree more robustly about the danger
Everyone agrees: very bad if we died. We just disagree about how likely that is. Which brings me to…

MICROMORTS
* Who’s heard of a micromort?
* A micromort = a defined quantum/amount of dangerousness

DANGEROUSNESS = LIKELINESS × BADNESS
1 µmort = 1/million × 1 death
* a one in a million chance of death
* the likeliness is 1 in a million
* the badness is a single death

1 MICROMORT
Walking    27 km
Cycling    32 km
Driving    400 km
Flying     1,600 km
And so you get micromort data for travel, suggesting that, per km, flying is far safer than anything else. It would be interesting to see this per trip. This was sourced from Wikipedia, so take it with a grain of salt.
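
The micromort arithmetic above is simple enough to sketch in Ruby. The helper names below are assumptions made for this example; the distance used is the driving figure from the slide, and the trip length is hypothetical:

```ruby
MICROMORT = 1.0 / 1_000_000 # a one-in-a-million chance of death

# Convert a raw fatality rate into micromorts per event.
def micromorts(deaths:, events:)
  (deaths.to_f / events) / MICROMORT
end

# Micromorts for a trip, given how many km "cost" one micromort.
def trip_micromorts(distance_km:, km_per_micromort:)
  distance_km.to_f / km_per_micromort
end

one_in_a_million = micromorts(deaths: 1, events: 1_000_000)
road_trip = trip_micromorts(distance_km: 800, km_per_micromort: 400) # hypothetical 800 km drive

puts "1 death per million events = #{one_in_a_million} micromort(s)"
puts "800 km by car = #{road_trip} micromorts"
```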

Scuba Diving     6 µmort per dive
Marathon         7 µmort per event
Skydiving        8 µmort per jump
Base-jumping     430 µmort per jump
Climb Everest    39,427 µmort per ascent
Similarly for recreational risks… Scuba, marathon, and skydiving are all quite similar, which surprises me because I’ve done both marathons and skydiving and they did not seem equally dangerous. … Then: crazy dangerous sports. Base jumping is notoriously risky. Everest has 223 fatalities / 5.5k ascents. To put these risks in perspective, though:

Scuba Diving          6 µmort per dive
Marathon              7 µmort per event
Skydiving             8 µmort per jump
Childbirth (mother)   80 µmort per birth
Base-jumping          430 µmort per jump
Climb Everest         39,427 µmort per ascent
According to the WHO figures for France in 2017, the maternal mortality rate makes childbirth an order of magnitude riskier than skydiving, and that’s pretty good by world standards.
- which surprised me in terms of relative risk
- doesn’t really match the stories we tell ourselves about these things
- And the more I’ve researched these statistics, the more I’m surprised how wrong my stories were
- And it’s at this point that I start to realise how often my perception of danger
- isn’t driven by knowledge
- but is driven by my discourses and social narratives.
https://www.who.int/gho/maternal_health/countries/fra.pdf

SOCIAL BRAINS ~[7:00]
Because underneath all this logic we have to remember we have the brains of social primates. [ Pause for new section ]

“A human is by nature a social animal”
2,500 years ago, Aristotle said that a “human is by nature a social animal”, and modern neuroscientists like Matthew Lieberman seem to agree. He argues that the human brain has been primed by evolution to view the world in predominantly social terms. And so it seems our neural biology is wired overwhelmingly for social cognition and much less for the statistical cognition needed for risk assessment.

SINGLE FACTOR FIXATION
One of the most common mistakes I’ve noticed is the tendency to fixate on a single factor when thinking about danger. For example…

“THE KIDS NEVER PLAY ON THE DRIVEWAY”
FIXATE ON LOW LIKELINESS
* We can easily fixate on a low likeliness
* and completely discount the high badness
* “complacency”
We have a minivan, and even with a camera at the back there are a worrying number of dents in the rear bumper.

CAMERA, MIRRORS, LINE OF SIGHT
FIXATE ON LOW LIKELINESS
So now when we’re reversing, the kids recite a little safety jingle, because it’s too easy to think it’ll never happen to you and forget how unimaginably bad it’d be if something did happen. According to child injury prevention group SafeKids, a child is hospitalised with serious injuries after being struck by a car in a private driveway every two weeks in New Zealand. At least five children are killed each year in the same way.

ALL THE PLANES ARE CRASHING
FIXATE ON HIGH BADNESS
* Alternatively, we fixate on high badness
* completely ignoring the risk factor
* “hysteria”
Gerd Gigerenzer [gieger enzer] is a psychologist who studies decision making.
* He talks about how
* in the year following 9/11
* an estimated 1,600 Americans died on the road
* because they drove long distance instead of flying

WE’D BETTER DRIVE
FIXATE ON HIGH BADNESS
So we see it’s easy to focus on how bad a plane crash is and ignore how likely a car crash is. And none of us are immune! We can see from our collective behaviour that these fixations, these stories we tell ourselves, are simply part of our social inheritance.

This is just how our brains work
Complacency, hysteria, or whatever… This is how our brains work
- and so many of our instincts are governed by social factors:
- your family of origin, your ideology
- the beliefs we’ve built around our identity and belonging
- And the rational layers of the brain are really only a recent evolutionary add-on

- But while we certainly have some built-in hurdles
- to thinking clearly about danger
I believe that if we can be mindful of our biases, and think about danger in a structured way, experience shows we can actually be pretty good at this. We might be social primates, but we socialised our way to the moon!

CODE ~[12:00]
When we code, the answer to the question “what could go wrong” depends on all of the broad context for the particular thing you’re doing. So it’s kind of artificial to demonstrate in a conference talk. That said… let’s start at the LOC level.

    require 'net/http'

    def fetch_file(url)
      uri = URI(url)
      response = Net::HTTP.get_response(uri)
      response.body
    end

Here’s a simple, working method: fetch a “file” from a URL. (If you can’t read it, don’t worry, you’ll get the gist.) What could go wrong? Let’s start with `nil`: what if url is nil? How likely is that? I’m a Rubyist, it’s extremely likely. How bad is it? An ArgumentError: pretty good. Not dangerous enough to fix. What else? What about blank or junk strings? Turns out those are valid URIs, so you get a really obscure error from Net::HTTP. That’s bad, and I figure it’s just as likely to happen.

    def fetch_file(url)
      uri = URI(url)
      raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

      response = Net::HTTP.get_response(uri)
      response.body
    end

This is worth fixing. That’ll raise a clear error now.
- What could go wrong next?
- Class of connectivity errors: DNS/host/timeout
- highly likely (it’s the internet)
- not that bad (pretty good exception; maybe reraise)
- again: it could go wrong, but not dangerous enough to fix
- We’ve got a response from the server. What’s wrong next?
- What if it’s a 404/500?
- How likely: it’s the internet
- How bad: worse than an exception, because you think it’s success!

    def fetch_file(url)
      uri = URI(url)
      raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

      response = Net::HTTP.get_response(uri)
      response.value # raise if response isn't successful
      response.body
    end

- Turns out Ruby unintuitively gives us a `value` method which raises on a non-2xx response
- (that definitely gets a comment)
- What else could go wrong on the response?
- What if it’s a 3XX?
- How bad? Raises an exception
- How likely? I only realised it because it happened when testing!

    def fetch_file(url)
      uri = URI(url)
      raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

      response = Net::HTTP.get_response(uri)
      return fetch_file(response['location']) if response.is_a?(Net::HTTPRedirection)
      response.value # raise if response isn't successful
      response.body
    end

- No problem, we’ll just recurse on redirect.
- What next?
- I think it’s very unlikely that we’d get a redirect loop
- Bad? Worse than failure
- not a misleading exception
- not a silent failure
- stuck for ages -> SystemStackError

    def fetch_file(url, max_redirects: 3)
      uri = URI(url)
      raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

      response = Net::HTTP.get_response(uri)
      if response.is_a?(Net::HTTPRedirection)
        raise "Max redirects reached" if max_redirects == 0
        return fetch_file(response['location'], max_redirects: max_redirects - 1)
      end
      response.value # raise if response isn't successful
      response.body
    end

- Phew, getting a little more complex with the max_redirects now.
- We’ve got a robust way of fetching a response.
- What else could go wrong?
- We could keep thinking about fetching GB files on a 512 MB dyno
- etc
- But I think that’s enough for now.
Compare this to our original implementation…
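
The `max_redirects` guard can be exercised without touching the internet. This is a minimal sketch, assuming a throwaway local server (not part of the talk) that redirects every request back to itself; the method body is repeated from the slide so the snippet runs on its own:

```ruby
require 'net/http'
require 'socket'

# The final method from the talk, repeated so this sketch is self-contained.
def fetch_file(url, max_redirects: 3)
  uri = URI(url)
  raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

  response = Net::HTTP.get_response(uri)
  if response.is_a?(Net::HTTPRedirection)
    raise "Max redirects reached" if max_redirects == 0
    return fetch_file(response['location'], max_redirects: max_redirects - 1)
  end
  response.value # raise if response isn't successful
  response.body
end

# Hypothetical stand-in server: answers every request with a redirect to itself.
server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
port = server.addr[1]

Thread.new do
  loop do
    client = server.accept
    # Consume the request headers up to the blank line.
    while (line = client.gets) && line != "\r\n"; end
    client.write("HTTP/1.1 302 Found\r\n" \
                 "Location: http://127.0.0.1:#{port}/\r\n" \
                 "Content-Length: 0\r\n" \
                 "Connection: close\r\n\r\n")
    client.close
  end
end

# A redirect loop now fails fast instead of recursing until the stack overflows.
loop_error = begin
  fetch_file("http://127.0.0.1:#{port}/")
rescue RuntimeError => e
  e.message
end

# A junk string is rejected before any network call is made.
junk_error = begin
  fetch_file("junk")
rescue ArgumentError => e
  e.message
end

puts loop_error
puts junk_error
```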

    def fetch_file(url)
      uri = URI(url)
      response = Net::HTTP.get_response(uri)
      response.body
    end

The happy path is roughly the same…

    def fetch_file(url, max_redirects: 3)
      uri = URI(url)
      raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

      response = Net::HTTP.get_response(uri)
      if response.is_a?(Net::HTTPRedirection)
        raise "Max redirects reached" if max_redirects == 0
        return fetch_file(response['location'], max_redirects: max_redirects - 1)
      end
      response.value # raise if response isn't successful
      response.body
    end

- But now we’ve asked… what could go wrong?
- We haven’t mitigated everything
- but the real dangers are handled

UUIDS
So, we can ask that question for each LOC. Here’s a counter-example, though. Years ago, I needed a unique, non-guessable key for a record. Pretty common and straightforward.

    class Order < ActiveRecord::Base
      before_create :generate_token

      private

      def generate_token
        self.token ||= SecureRandom.uuid
      end
    end

Back then, I just added a callback to generate a UUID. Worked perfectly. What could go wrong?

- Now I had previously written a blog post about this,
- specifically noting…

“You’d have to generate 112 terabytes of UUIDs before you’d even have a one in a billion chance of a collision.”
[read slide]
- I know, in part of my brain, the likeliness
- And a collision isn’t even that bad
- we have a unique constraint, it’s just an exception
But somehow…

There’s no way I should’ve bothered, but the part of my brain that lets me sleep at night isn’t the part that quantifies danger! And that’s just how it is.
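
For a rough feel for that likeliness, here is a back-of-the-envelope sketch using the standard birthday-bound approximation. It assumes random (version 4) UUIDs, which is what `SecureRandom.uuid` generates, with 122 random bits each; the table sizes and the `collision_probability` helper are hypothetical:

```ruby
# Random (version 4) UUIDs carry 122 random bits.
UUID_SPACE = 2**122

# Birthday-bound approximation: p ~ n^2 / (2 * N), valid while p is small.
def collision_probability(count)
  (count.to_f**2) / (2 * UUID_SPACE)
end

[1_000_000, 1_000_000_000].each do |tokens| # hypothetical table sizes
  puts format("%d tokens: ~%.1e chance of any collision", tokens, collision_probability(tokens))
end
```

Even at a billion rows the likeliness stays vanishingly small, and with a unique constraint in place the badness is just an exception.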

TESTS
Let’s take another angle. Recently: streaming Salesforce data. It’s complex, fragile, mission-critical code, very likely to break badly. Now the default is to just write automated tests: rspec and done. But we can’t stop asking the question yet! What could go wrong with conventional tests? They’ll be very brittle to mock. They’re very likely to change. They’re very difficult to read.

We’re going to sink a lot of maintenance time into them. So if no tests is dangerous and automated tests are dangerous too, what next?

    ❯ bin/rspec -t integration
    …
    Executing: rake salesforce:stream
    ⚠ Waiting for new event in Salesforce… (go and trigger one!)

We actually dropped automated testing and wrote an interactive test script. It’s an unconventional tradeoff, but it reduces the danger without introducing lots of new ones like automated testing would. It’s easy to get stuck in patterns like 100% automated test coverage, but if you keep asking the question, you can move beyond “best practice” and mitigate your specific dangers instead of the universal ones.

You can innovate around danger.

One last example before I wrap… A while back I was working on some authentication and I noticed:
* even though the response was identical for incorrect email or password
* the response time was measurably different (not noticeably)…
So you could perform user enumeration via a timing attack!
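
One common way to close this kind of timing gap, sketched here under stated assumptions, is to do the same expensive work whether or not the email exists, by checking unknown emails against a throwaway digest. This is not the code from the project: the in-memory `USERS` store, the `digest` helper, and the PBKDF2 parameters are all hypothetical stand-ins built on Ruby’s standard `openssl` library.

```ruby
require 'openssl'
require 'securerandom'

# Hypothetical password hashing; a real app would use its own scheme.
def digest(password, salt)
  OpenSSL::KDF.pbkdf2_hmac(password, salt: salt, iterations: 20_000, length: 32, hash: 'sha256')
end

# Hypothetical in-memory user store: email => { salt:, digest: }
known_salt = SecureRandom.bytes(16)
USERS = {
  "known@example.com" => { salt: known_salt, digest: digest("s3cret", known_salt) }
}.freeze

# A throwaway record, so unknown emails still pay for one digest computation.
dummy_salt = SecureRandom.bytes(16)
DUMMY = { salt: dummy_salt, digest: digest(SecureRandom.uuid, dummy_salt) }.freeze

def valid_login?(email, password)
  record = USERS.fetch(email, DUMMY)
  candidate = digest(password, record[:salt])
  # Constant-time comparison of two equal-length strings.
  matches = OpenSSL.fixed_length_secure_compare(candidate, record[:digest])
  matches && USERS.key?(email)
end

puts valid_login?("known@example.com", "s3cret")
puts valid_login?("known@example.com", "wrong")
puts valid_login?("nobody@example.com", "s3cret")
```

Both failure paths now run one digest and one comparison, so the remaining timing difference should be much harder to measure, though this sketch makes no claim to be fully constant-time.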

    RANDOM_DIGEST = User.new(password: SecureRandom.uuid).password_digest

So I spent considerable effort writing constant-time password auth. What could go wrong?
- The code is less intuitive
- and small tweaks affect timing.
- very likely someone breaks timing safety, so…

    it 'fails in constant time' do
      results = compare_ips do |tests|
        tests[:email] = -> { validate(email: 'wrong') }
        tests[:password] = -> { validate(password: 'wrong') }
      end

      expect(results[:email]).to be_within(10).percent_of(results[:password])
    end

I spent considerable effort writing a spec to compare the timing. What could go wrong?
- well, we now have a non-deterministic test
- We’ve got to mitigate false positives and false negatives

    require 'benchmark'

    module IPSHelper
      # Compare execution times across multiple supplied blocks and return a hash of
      # iterations-per-second for each block.
      #
      # The IPSs are inferred by running each block `sample_size` times - in an
      # alternating manner to avoid timing jitters - and taking the median
      # realtime benchmark. The median is used to diminish the effect of outlying
      # execution times. For very fast operations, `iterations` can be increased
      # to make the timing measurement more reliable.
      #
      def compare_ips(sample_size: 50, iterations: 1, &block)
        procs = {}.tap(&block) # yield a hash for procs to be added to
        results = Hash.new { |h, k| h[k] = [].extend(DescriptiveStatistics) }

        procs.each_value(&:call) # Pre-run all the blocks once for warmup

        alternating_each_sample(procs, sample_size) do |name, proc|
          results[name] << benchmark_block(iterations: iterations, &proc)
        end

        results.transform_values { |runtimes| iterations / runtimes.median }
      end

      private

      def benchmark_block(iterations: 1)
        Benchmark.realtime { iterations.times { yield } }
      end

      def disable_gc
        # Get as clean a GC state as we can before benchmarking
        GC.start
        GC.disable
        yield
      ensure
        GC.enable
      end

      def alternating_each_sample(procs, sample_size, &block)
        disable_gc do
          sample_size.times do
            procs.each(&block)
          end
        end
      end
    end

So I spent considerable effort
- avoiding timing jitters [click]
- taking medians [click]
- interleaving iterations [click]
- ensuring a statistically significant sample size [click]
- even wrangling garbage collection… [click]
all to write a reliable timing comparison spec helper. Finally, nothing else could go wrong.

The thing is, I never asked the question in the first place! In the context of the project, user enumeration wasn’t even a big deal.
- Realistically it’s low badness / low likeliness / low danger
You know the real reason I wrote those hundreds of lines of quite complex code?

EGO
Because I wanted to prove I could. My patterns of social cognition overpowered my rational cognition, and you never notice it at the time! Only when you’re writing a conf talk months later. FWIW, all that code is gone now.

THE END
So… the subtle science of coding for failure. Next time you’re writing code, or designing some system, or backing out of the driveway, I want you to ask 3 questions:
1. What could go wrong?
2. How likely is it?
3. How bad is it?

- So when we’re sitting in the car together at the bridge (figuratively speaking)
- we can answer those questions as rationally as possible
- but also remember we’re not perfectly rational people
- so let’s also use these big social brains
- to at least be kind to each other.
Ok, let’s see… [play]


More Decks by Daniel Fone
