
What Could Go Wrong? [ParisRB Conf 2020]

Daniel Fone
February 19, 2020

Software development is the anticipation of a thousand dangers: regressions, edge-cases, exceptions, attacks, downtime, the list is endless. Drawing from a wide range of disciplines, we’ll build a simple model for quantifying danger, explore why the human brain is so bad at it, and examine some practical applications for our development processes. There will be at least one science.

Transcript

  1. This is a simple talk. I’m going to start and

    finish with a single question. If you forget everything in between, that’s fine. But as people who make things, I want to help us ask one simple question…
  2. WHAT COULD GO WRONG? What could go wrong? That’s it.

    Every LOC, every test. Every gem. Every migration, deploy, rake task. Every time you ssh into production (without telling anyone). What. Could. Go. Wrong?
  3. WHAT COULD GO WRONG? ¯\_(ツ)_/¯

    Now obviously there’s a way to ask this rhetorically… [dismissively] what could go wrong? ¯\_(ツ)_/¯ We know what this looks like…
  4. - or you need to cross a bridge - with

    just a bit more river than usual. - And there’s a familiar conversation happening in this car I think - We can make it! - No we can’t! - And I think we have this same conversation again and again - in professional and personal contexts - whether you’re parenting or pushing to production
  5. WHAT COULD GO WRONG?

    So my goal today is to help us ask that question a little more critically: what could go wrong? Not only so we can make better decisions, but also so we can disagree more rigorously about crossing the bridge.
  6. DANGER A long time ago, in another lifetime, I

    studied environmental chemistry. I don’t remember much from my Chemistry degree, but I remember a couple of things: 1. everything is toxic if you have enough of it and 2…
  7. RISK × HAZARD = DANGER any danger has two factors:

    risk and hazard Now, chemistry was a long time ago and as soon as I say that, I’m terrified someone’s going to come up after and say
  8. DANGER NOUN THE POSSIBILITY OF SUFFERING HARM OR INJURY One

    dictionary describes danger as [read] again, there’s two factors we can quantify “possibility” and “harm”
  9. DANGER = RISK × HAZARD just like risk and hazard,

    two factors The terminology doesn’t matter. But, just to appease the ISO gods, let’s choose some different words…
  10. WHAT COULD GO WRONG? HOW LIKELY IS IT? HOW BAD

    IS IT? Ultimately I want us to ask “what could go wrong” - and then - how likely is it? - how bad is it? This is it. This is the talk. 1. What could go wrong? 2. How likely is it? 3. How bad is it? So let’s break this down a little bit
  11. EXAMPLES ~[3:00] If I tried to walk across a balance

    beam a meter or so off the ground, I’d be fine. BUT… if we put that beam at the same height as a famous towering local landmark… perhaps
  12. La tour Montparnasse no amount of money would convince me

    to walk across it! and I’ve thought hard about this I think that’s intuitive for most people. But let’s think about what’s really changed here:
  13. DANGEROUSNESS = LIKELINESS × BADNESS

    - The likeliness of falling stays the same:
    - I’m just as capable
    - if anything I’m far more motivated to step accurately
  14. DANGEROUSNESS = LIKELINESS × BADNESS

    But the badness changes dramatically. From that height I’d hit the ground at ~200 km/h.
  15. DANGEROUSNESS = LIKELINESS × BADNESS

    So the scenario is more dangerous, not because falling is more likely, but because the “how bad would it be” changed from a sore ankle to je suis splat.
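The balance-beam comparison can be written down as a tiny model. This is only a sketch: the `Danger` struct and every number in it are invented for illustration, and the only thing taken from the talk is the formula dangerousness = likeliness × badness.

```ruby
# A minimal sketch of the talk's model. All numbers are made up for illustration.
Danger = Struct.new(:name, :likeliness, :badness) do
  # dangerousness = likeliness × badness
  def dangerousness
    likeliness * badness
  end
end

# Same chance of slipping in both cases; only the badness differs.
low_beam  = Danger.new("beam 1 m off the ground", 0.01, 1)         # badness 1: a sore ankle
high_beam = Danger.new("beam at tower height",    0.01, 1_000_000) # badness: fatal

ratio = high_beam.dangerousness / low_beam.dangerousness
puts "#{high_beam.name} is #{ratio.round}x as dangerous as #{low_beam.name}"
```

Holding likeliness fixed makes it obvious that the whole difference in danger comes from the badness factor.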
  16. - Now let’s return to our disagreement at the bridge

    - What are we really disagreeing about here?
  17. BRIDGE COLLAPSE? LIKELINESS × BADNESS

    - Sure, everyone agrees a bridge collapsing would be very bad
    - One thinks: it’s definitely going to go
    - The other thinks: it’s probably strong enough
    - and straight away we get to the heart of the disagreement
    - We disagree about likelihood.
  18. WAVE? LIKELINESS × BADNESS

    - And maybe everyone can see that, 100%, water is going to hit the car
    - But one thinks: that won’t be too bad
    - And the other thinks: that’ll break the car
    - I have no idea how cars/bridges/rivers work
    - But as we actually identify what could go wrong
    - and distinguish the badness and the likeliness
    - we can start to disagree more robustly about the danger

    Everyone agrees: very bad if we died. We just disagree how likely that is. Which brings me to…
  19. MICROMORTS * Who’s heard of a micromort? * A micromort

    = defined quantum/amount of dangerousness
  20. DANGEROUSNESS = LIKELINESS × BADNESS 1 µmort 1/million 1 death

    * a one in a million chance of death * the likeliness is 1 in a million * the badness is a single death
  21. 1 MICROMORT: Walking 27 km · Cycling 32 km · Driving 400 km · Flying 1,600 km

    And so you get micromort data for travel, suggesting that, per km, flying is far safer than anything else. It would be interesting to see per trip. This was sourced from wiki, so take it with a grain of salt.
  22. Scuba Diving 6 µmort per dive · Marathon 7 µmort per event · Skydiving 8 µmort per jump · Base-jumping 430 µmort per jump · Climb Everest 39,427 µmort per ascent

    Similarly for recreational risks… Scuba, marathon, and skydiving are all quite similar, which surprises me because I’ve done both marathons and skydiving and they did not seem equally dangerous. … Then: crazy dangerous sports. Base jumping is notoriously risky. Everest has 223 fatalities / 5.5k ascents. To put these risks in perspective, though:
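As a rough check on how a micromort figure is derived, here is a small sketch using the numbers quoted in the notes. The `micromorts` helper is hypothetical. With the rounded 5.5k ascents it lands a little above the slide's 39,427, which presumably comes from a more exact ascent count.

```ruby
MICRO = 1_000_000

# Micromorts per event: deaths per million events.
def micromorts(deaths, events)
  deaths.to_f / events * MICRO
end

# 223 fatalities over roughly 5,500 ascents, as quoted in the notes.
everest = micromorts(223, 5_500)
puts "Everest: ~#{everest.round} µmort per ascent"

# Slide figures, in µmort per event, for a relative comparison.
SLIDE_FIGURES = { scuba: 6, marathon: 7, skydiving: 8, base_jump: 430 }.freeze
skydives_per_base_jump = SLIDE_FIGURES[:base_jump].to_f / SLIDE_FIGURES[:skydiving]
puts "One base jump is roughly #{skydives_per_base_jump.round} skydives"
```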
  23. Scuba Diving 6 µmort per dive · Marathon 7 µmort per event · Skydiving 8 µmort per jump · Childbirth (mother) 80 µmort per birth · Base-jumping 430 µmort per jump · Climb Everest 39,427 µmort per ascent

    According to the WHO (France, 2017), the maternal mortality rate is an order of magnitude riskier than skydiving, and that’s pretty good by world standards.
    - which surprised me in terms of relative risk
    - doesn’t really match the stories we tell ourselves about these things
    - And the more I’ve researched these statistics the more I’m surprised how wrong my stories were
    - And it’s at this point that I start to realise how often my perception of danger
    - isn’t driven by knowledge
    - but is driven by my discourses and social narratives.

    https://www.who.int/gho/maternal_health/countries/fra.pdf
  24. SOCIAL BRAINS ~[7:00] Because underneath all this logic we have

    to remember we have the brains of social primates. [ Pause for new section ]
  25. “A human is by nature a social animal”

    Some 2,500 years ago, Aristotle said that a “human is by nature a social animal”, and modern neuroscientists like Matthew Lieberman seem to agree. He argues that the human brain has been primed by evolution to view the world in predominantly social terms. And so it seems our neural biology is wired overwhelmingly for social cognition and much less for the statistical cognition needed for risk assessment.
  26. SINGLE FACTOR FIXATION One of the most common mistakes I’ve

    noticed is the tendency to fixate on a single factor when thinking about danger For example…
  27. “THE KIDS NEVER PLAY ON THE DRIVEWAY” FIXATE ON LOW LIKELINESS

    - We can easily fixate on a low likeliness
    - and completely discount the high badness
    - “complacency”

    We have a minivan, and even with a camera at the back there are a worrying number of dents in the rear bumper.
  28. CAMERA, MIRRORS, LINE OF SIGHT. FIXATE ON LOW LIKELINESS

    So now when we’re reversing, the kids recite a little safety jingle, because it’s too easy to think it’ll never happen to you and forget how unimaginably bad it’d be if something did happen. According to child injury prevention group SafeKids, a child is hospitalised with serious injuries after being struck by a car in a private driveway every two weeks in New Zealand. At least five children are killed each year in the same way.
  29. “ALL THE PLANES ARE CRASHING” FIXATE ON HIGH BADNESS

    - Alternatively, we fixate on high badness
    - completely ignoring the risk factor
    - “Hysteria”

    Gerd Gigerenzer [gieger enzer] is a psychologist who studies decision making. He talks about how, in the year following 9/11, an estimated 1,600 Americans died on the road because they drove long distance instead of flying.
  30. “WE’D BETTER DRIVE” FIXATE ON HIGH BADNESS

    So we see it’s easy to focus on how bad a plane crash is and ignore how likely a car crash is. And none of us are immune! We can see from our collective behaviour that these fixations, these stories we tell ourselves, are simply part of our social inheritance.
  31. This is just how our brains work

    Complacency, hysteria, or whatever… this is how our brains work:
    - so many of our instincts are governed by social factors
    - your family of origin, your ideology
    - the beliefs we’ve built around our identity and belonging
    - and the rational layers of the brain are really only a recent evolutionary add-on
  32. - But while we certainly have some built-in hurdles -

    to think clearly about danger I believe if we can be mindful of our biases, and think about danger in a structured way experience shows we can actually be pretty good at this We might be social primates, but we socialised our way to the moon!
  33. CODE ~[12:00] When we code, the answer to the question

    “what could go wrong” depends on all of the broad context for the particular thing you’re doing So it’s kind of artificial to demonstrate in a conference talk. That said… let’s start at the LOC level.
  34. require 'net/http'

      def fetch_file(url)
        uri = URI(url)
        response = Net::HTTP.get_response(uri)
        response.body
      end

    Here’s a simple, working method: fetch a “file” from a URL. (If you can’t read it, don’t worry, you’ll get the gist.) What could go wrong?
    - Let’s start with `nil`: what if url is nil?
    - How likely is that? I’m a rubyist, it’s extremely likely
    - How bad is it? An ArgumentError, which is pretty good
    - Not dangerous enough to fix
    - What else? What about blank or junk strings?
    - Turns out that’s a valid URI, so you get a really obscure error from Net::HTTP
    - that’s bad, and I figure it’s just as likely to happen
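To see the `nil` and blank-string cases concretely, here is a small sketch that only exercises `URI()` and never touches the network. The `describe_uri_input` helper and the example URL are hypothetical.

```ruby
require 'uri'

# Report what URI() does with a given input (no network access needed).
def describe_uri_input(input)
  uri = URI(input)
  "parsed OK, scheme=#{uri.scheme.inspect}"
rescue ArgumentError, URI::InvalidURIError => e
  "#{e.class}: #{e.message}"
end

[nil, "", "junk", "https://example.com/file.txt"].each do |input|
  puts "#{input.inspect} -> #{describe_uri_input(input)}"
end
```

The blank and junk strings parse without complaint but come back with no scheme, which is why the failure only surfaces later, inside Net::HTTP.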
  35. def fetch_file(url)
        uri = URI(url)
        raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

        response = Net::HTTP.get_response(uri)
        response.body
      end

    This is worth fixing. That’ll raise a clear error now.
    - What could go wrong next?
    - Class of connectivity errors: DNS/host/timeout
    - highly likely (it’s the internet)
    - not that bad (pretty good exception; maybe reraise)
    - again: it could go wrong, but not dangerous enough to fix
    - We’ve got a response from the server. What’s wrong next?
    - What if it’s a 404/500?
    - How likely: it’s the internet
    - How bad: worse than an exception, because you think it’s a success!
  36. def fetch_file(url)
        uri = URI(url)
        raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

        response = Net::HTTP.get_response(uri)
        response.value # raise if response isn't successful
        response.body
      end

    - Turns out ruby unintuitively gives us a `value` method which raises on any non-2xx response
    - (that definitely gets a comment)
    - What else could go wrong on the response?
    - What if it’s a 3XX?
    - How bad? Raises an exception
    - How likely? I only realised it because it happened when testing!
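The behaviour of `value` can be checked without a server by building bare response objects by hand. This is a sketch, not part of the talk: `value_outcome` is a hypothetical helper, and it rescues `Net::ProtocolError`, the common ancestor of the errors `value` raises.

```ruby
require 'net/http'

# Bare response objects (HTTP version, status code, message); no network involved.
ok        = Net::HTTPOK.new('1.1', '200', 'OK')
not_found = Net::HTTPNotFound.new('1.1', '404', 'Not Found')
redirect  = Net::HTTPFound.new('1.1', '302', 'Found')

# Returns :ok when `value` passes, otherwise the class of the error it raised.
def value_outcome(response)
  response.value # returns nil for 2xx, raises otherwise
  :ok
rescue Net::ProtocolError => e
  e.class
end

[ok, not_found, redirect].each do |response|
  puts "#{response.code}: #{value_outcome(response)}"
end
```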
  37. def fetch_file(url)
        uri = URI(url)
        raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

        response = Net::HTTP.get_response(uri)
        return fetch_file(response['location']) if response.is_a?(Net::HTTPRedirection)

        response.value # raise if response isn't successful
        response.body
      end

    - No problem, we’ll just recurse on redirect.
    - What next?
    - I think it’s very unlikely that we’d get a redirect loop
    - Bad? Worse than failure
    - not a misleading exception
    - not a silent failure
    - stuck for ages -> SystemStackError
  38. def fetch_file(url, max_redirects: 3)
        uri = URI(url)
        raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

        response = Net::HTTP.get_response(uri)
        if response.is_a?(Net::HTTPRedirection)
          raise "Max redirects reached" if max_redirects == 0
          return fetch_file(response['location'], max_redirects: max_redirects - 1)
        end

        response.value # raise if response isn't successful
        response.body
      end

    - Phew, getting a little more complex with the max_redirects now.
    - We’ve got a robust way of fetching a response.
    - What else could go wrong?
    - We could keep thinking about fetching GB files on a 512 MB dyno
    - etc
    - But I think that’s enough for now.

    Compare this to our original implementation…
  39. def fetch_file(url, max_redirects: 3)
        uri = URI(url)
        raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

        response = Net::HTTP.get_response(uri)
        if response.is_a?(Net::HTTPRedirection)
          raise "Max redirects reached" if max_redirects == 0
          return fetch_file(response['location'], max_redirects: max_redirects - 1)
        end

        response.value # raise if response isn't successful
        response.body
      end

    - But now we’ve asked… what could go wrong?
    - We haven’t mitigated everything
    - but the real dangers are handled
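One way to exercise the redirect limit without a network is to make the HTTP call injectable. The following is a sketch of that idea, not the talk's code: `fetch_file_iteratively` and its `http:` parameter are hypothetical, and it loops rather than recursing, but it keeps the same checks and the same limit of three followed redirects.

```ruby
require 'net/http'

# Iterative variant of the redirect-following method. `http` is a hypothetical
# injection point (not in the original) so the logic can run without a network.
def fetch_file_iteratively(url, max_redirects: 3, http: ->(uri) { Net::HTTP.get_response(uri) })
  (max_redirects + 1).times do
    uri = URI(url)
    raise ArgumentError, "invalid HTTP url" unless uri.scheme&.start_with?('http')

    response = http.call(uri)
    unless response.is_a?(Net::HTTPRedirection)
      response.value # raise if response isn't successful
      return response.body
    end
    url = response['location'] # follow the redirect on the next pass
  end
  raise "Max redirects reached"
end

# A stub "server" that redirects forever.
redirect = Net::HTTPFound.new('1.1', '302', 'Found')
redirect['location'] = 'http://example.com/again'
calls = 0
looping = ->(_uri) { calls += 1; redirect }

begin
  fetch_file_iteratively('http://example.com/start', http: looping)
rescue RuntimeError => e
  puts "#{e.message} after #{calls} requests"
end
```

With the default limit, the stub is hit once for the original URL and once per followed redirect before the method gives up.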
  40. UUIDS So, we can ask that question for each LOC

    here’s a counter-example though Years ago, I needed a unique non-guessable key for a record. pretty common and straight forward
  41. class Order < ActiveRecord::Base
        before_create :generate_token

        private

        def generate_token
          self.token ||= SecureRandom.uuid
        end
      end

    Back then, I just added a callback to generate a UUID. Worked perfectly. What could go wrong?
  42. - Now I had previously written a blog post about

    this, - specifically noting…
  43. “You’d have to generate 112 terabytes of UUIDs before you’d even have a one in a billion chance of a collision.”

    [read slide]
    - I know, in part of my brain, the likeliness
    - And a collision isn’t even that bad
    - we have a unique constraint, it’s just an exception

    But somehow…
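For a feel of the likeliness involved, here is a sketch using the standard birthday-bound approximation, assuming version 4 UUIDs with 122 random bits. The helper names are hypothetical. Converting a UUID count into terabytes depends on how many bytes you count per UUID, so this is not meant to reproduce the quoted figure exactly.

```ruby
# Birthday-bound approximation: with n random values drawn from a space of
# size N, the chance of at least one collision is roughly n^2 / (2N) when small.
UUID_V4_SPACE = 2**122 # a version 4 UUID carries 122 random bits

def collision_probability(count, space: UUID_V4_SPACE)
  count.to_f**2 / (2.0 * space)
end

# Inverse of the approximation: how many values until the chance reaches `probability`.
def count_for_probability(probability, space: UUID_V4_SPACE)
  Math.sqrt(2.0 * space * probability)
end

n = count_for_probability(1e-9)
puts format("about %.2e UUIDs for a one in a billion chance", n)
puts format("a billion UUIDs: collision chance about %.1e", collision_probability(1_000_000_000))
```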
  44. There’s no way I should’ve bothered but the part of

    my brain that lets me sleep at night isn’t the part that quantifies danger! And that’s just how it is
  45. TESTS

    Let’s take another angle. Recently: streaming Salesforce data. It’s complex, fragile, mission-critical code, very likely to break badly. Now the default is to just write automated tests. rspec and done. But we can’t stop asking the question yet! What could go wrong with conventional tests?
    - They’ll be very brittle to mock
    - They’re very likely to change
    - They’re very difficult to read
  46. We’re going to sink a lot of maintenance time.

    So if no tests is dangerous and automated tests are dangerous, what next?
  47. ❯ bin/rspec -t integration
      …
      Executing: rake salesforce:stream
      ⚠ Waiting for new event in Salesforce… (go and trigger one!)

    We actually dropped automated testing and wrote an interactive test script. It’s an unconventional tradeoff, but it reduces the danger without introducing lots of new ones like automated testing would. It’s easy to get stuck in patterns like 100% automated test coverage, but if you keep asking the question, you can move beyond “best practice” and mitigate your specific dangers instead of the universal ones.
  48. One last example before I wrap… A while back I

    was working on some authentication and I noticed: * even though the response was identical for incorrect email or password * the response time was measurably different (not noticeably)… So you could perform user enumeration via a timing attack!
  49. RANDOM_DIGEST = User.new(password: SecureRandom.uuid).password_digest

    So I spent considerable effort writing constant time password auth. What could go wrong?
    - The code is less intuitive
    - and small tweaks affect timing
    - very likely someone breaks timing safety

    so…
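The general idea behind a dummy digest is to do the same hashing work whether or not the email exists. The sketch below illustrates that idea with the standard library only. It is not the talk's implementation: it uses PBKDF2 from OpenSSL as a stand-in for whatever `password_digest` does in the real app, and `build_record`, `secure_compare`, `authenticate`, the in-memory `users` hash, and the iteration count are all hypothetical.

```ruby
require 'openssl'
require 'securerandom'

ITERATIONS = 10_000 # hypothetical value, kept low so the sketch runs quickly

def password_digest(password, salt)
  OpenSSL::PKCS5.pbkdf2_hmac(password, salt, ITERATIONS, 32, OpenSSL::Digest.new('SHA256'))
end

def build_record(password)
  salt = SecureRandom.random_bytes(16)
  { salt: salt, digest: password_digest(password, salt) }
end

# Compare two strings without returning early on the first differing byte.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize

  a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

# Stand-in record used when the email is unknown, so the same hashing work happens.
DUMMY_RECORD = build_record(SecureRandom.uuid)

def authenticate(users, email, password)
  record = users.fetch(email, DUMMY_RECORD)
  matches = secure_compare(password_digest(password, record[:salt]), record[:digest])
  matches && users.key?(email)
end

users = { 'a@example.com' => build_record('s3cret') }
puts authenticate(users, 'a@example.com', 's3cret')
puts authenticate(users, 'a@example.com', 'wrong')
puts authenticate(users, 'nobody@example.com', 's3cret')
```

Even in a sketch this small, the intent is easy to break with an innocent refactor, for example by returning early when the email is missing.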
  50. it 'fails in constant time' do
        results = compare_ips do |tests|
          tests[:email] = -> { validate(email: 'wrong') }
          tests[:password] = -> { validate(password: 'wrong') }
        end

        expect(results[:email]).to be_within(10).percent_of(results[:password])
      end

    I spent considerable effort writing a spec to compare the timing. What could go wrong?
    - well, we now have a non-deterministic test
    - We’ve got to mitigate false positives and false negatives
  51. require 'benchmark'
      require 'descriptive_statistics' # gem that provides DescriptiveStatistics

      module IPSHelper
        # Compare execution times across multiple supplied blocks and return a hash of
        # iterations-per-second for each block.
        #
        # The IPSs are inferred by running each block `sample_size` times - in an
        # alternating manner to avoid timing jitters - and taking the median
        # realtime benchmark. The median is used to diminish the effect of outlying
        # execution times. For very fast operations, `iterations` can be increased
        # to make the timing measurement more reliable.
        #
        def compare_ips(sample_size: 50, iterations: 1, &block)
          procs = {}.tap(&block) # yield a hash for procs to be added to
          results = Hash.new { |h, k| h[k] = [].extend(DescriptiveStatistics) }
          procs.each_value(&:call) # Pre-run all the blocks once for warmup

          alternating_each_sample(procs, sample_size) do |name, proc|
            results[name] << benchmark_block(iterations: iterations, &proc)
          end

          results.transform_values { |runtimes| iterations / runtimes.median }
        end

        private

        def benchmark_block(iterations: 1)
          Benchmark.realtime { iterations.times { yield } }
        end

        def disable_gc
          # Get as clean a GC state as we can before benchmarking
          GC.start
          GC.disable
          yield
        ensure
          GC.enable
        end

        def alternating_each_sample(procs, sample_size, &block)
          disable_gc do
            sample_size.times do
              procs.each(&block)
            end
          end
        end
      end

    So I spent considerable effort
    - avoiding timing jitters [click]
    - taking medians [click]
    - interleaving iterations [click]
    - ensuring a statistically significant sample size [click]
    - even wrangling garbage collection… [click]

    all to write a reliable timing comparison spec helper. Finally, nothing else could go wrong.
  52. The thing is, I never asked the question in the first place!

    In the context of the project, user enumeration wasn’t even a big deal. Realistically it’s low badness / low likeliness / low danger. You know the real reason I wrote those hundreds of lines of quite complex code?
  53. EGO

    Because I wanted to prove I could. My patterns of social cognition overpowered my rational cognition, and you never notice it at the time! Only when you’re writing a conf talk months later. FWIW, all that code is gone now.
  54. THE END So… the subtle science of coding for failure.

    Next time you’re writing code, or designing some system, or backing out of the driveway, I want you to ask 3 questions: 1. What could go wrong? 2. How likely is it? 3. How bad is it?
  55. - So when we’re sitting in the car together at

    the bridge (figuratively speaking) - we can answer those questions as rationally as possible - but also remember we’re not perfectly rational people - so let’s also use these big social brains - to at least be kind to each other. Ok, let’s see… [play]