Failure: Or the Unexpected Virtue of Functional Programming

Even correct software fails.

So what happens if we shift the focus of functional programming to reliable systems? Let’s attack the hard and ugly set of programming problems, the solutions that don’t naturally fall out from building a neat library. Let’s apply our functional programming toolkit to delivering systems end-to-end.

This is a talk on reliable systems: what it takes to build them, how functional programming can be (and is being) leveraged, and, perhaps of more interest, where current approaches are letting us down.

*This talk was presented as the keynote at YOW Lambdajam 2015 in Brisbane*

Mark Hibberd

May 22, 2015

Transcript

  1. Failure
    or
    (The Unexpected Virtue of
    Functional Programming)
    @markhibberd

  3. Act I
    Working Software

  6. “Why do we continue in
    this miserable condition”
    - George Orwell, Animal Farm

  7. Reliability

  8. Correctness Reliability

  9. Correctness Reliability
    (the correct answer)

  10. Correctness
    Correctness
    (the correct answer)

  11. Correctness Reliability
    Correctness
    (the correct answer)

  12. Correctness: (the correct answer)
    Reliability: Correctness (whenever I need it)

  18. “Several of them would have
    protested if they could have
    found the right arguments.”
    - George Orwell, Animal Farm

  19. Act II
    Post Functional

  22. Data

  23. Decisions

  24. Outcomes

  25. Measurement

  26. λx.f x

  27. 120+ code bases
    Pure, Typed FP
    Haskell, Scala & Stuff

  29. Stats and Reliability

  30. bad things can happen…

  31. P(failure) = 0.1

  32. P(failure) = 0.1

  33. redundancy

  34. redundancy

  35. P(individual failure) = 0.1

  36. P(system failure) = 0.1^10
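
Written out, the redundancy figure follows from the product rule. This assumes ten redundant copies (the count is read off the exponent on the slide) and, crucially, that the copies fail independently, which is exactly what the following slides go on to question:

```latex
P(\text{system failure})
  = \prod_{i=1}^{10} P(\text{copy } i \text{ fails})
  = 0.1^{10}
  = 10^{-10}
```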

  37. are failures really independent?

  38. P(mutually assured destruction) = 1

  39. redundancy

  40. but if one goes…

  41. they all do

  42. P(individual failure) = 0.1

  43. P(individual success) = 1 - 0.1 = 0.9

  44. P(all successes) = 0.9^10

  45. P(system failure) = 1 - 0.9^10

  46. P(system failure) = 1 - 0.9^10 = 0.65

  48. P(system failure) = 1 - 0.9^10 = 0.65
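
The two calculations can be compared side by side. This is a minimal sketch, assuming ten components that each fail independently with probability 0.1; the function names are illustrative and not from the talk. The first function models a system that goes down when any one component goes down, the second models ideal redundancy where all copies must fail together.

```python
def p_any_component_fails(p_individual: float, components: int) -> float:
    """P(at least one component fails), assuming independent failures."""
    p_all_succeed = (1.0 - p_individual) ** components
    return 1.0 - p_all_succeed


def p_every_replica_fails(p_individual: float, replicas: int) -> float:
    """P(all replicas fail at once), assuming independent failures."""
    return p_individual ** replicas


if __name__ == "__main__":
    print(round(p_any_component_fails(0.1, 10), 2))  # 0.65
    print(f"{p_every_replica_fails(0.1, 10):.0e}")   # 1e-10
```

The same ten parts give a failure probability of roughly 0.65 or roughly 0.0000000001, depending only on whether one failure takes the rest down with it.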

  49. Correctness: (the correct answer)
    Reliability: Correctness (whenever I need it)

  50. Correctness: (the best set of measurable
    decisions for today)
    Reliability: Correctness (produce the decisions
    by X o’clock using the last vetted dataset)

  52. Separation of
    Data and
    Computation

  53. If we can achieve reliable data,
    reliable computation should be
    pretty straightforward

  54. Can you restart your system at
    any point?

  55. Could you turn your long
    running daemon into a cron job?
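
One way to read the restart and cron-job questions: if every piece of state the job needs is an explicit value, a long-running loop and a series of short invocations compute the same thing. The sketch below is an illustration under that assumption; `State`, `step` and `run_to_completion` are hypothetical names, and in a real system the state would be loaded from and saved to storage between invocations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class State:
    """Everything the job needs to resume; in practice this lives in storage."""
    next_index: int = 0
    total: int = 0


def step(state: State, inputs: Sequence[int], batch_size: int = 2) -> State:
    """Process one batch and return the new state. No hidden in-memory state."""
    batch = inputs[state.next_index:state.next_index + batch_size]
    return State(state.next_index + len(batch), state.total + sum(batch))


def run_to_completion(state: State, inputs: Sequence[int]) -> State:
    """The 'daemon' view: keep stepping until the input is exhausted."""
    while state.next_index < len(inputs):
        state = step(state, inputs)
    return state


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    uninterrupted = run_to_completion(State(), data)
    # Simulate a restart: stop after one step, then resume from the saved state.
    saved = step(State(), data)
    resumed = run_to_completion(saved, data)
    print(uninterrupted == resumed)  # True
```

A cron job would call `step` once per invocation; the daemon is just the loop around it.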

  56. Reliable Data

  57. If you have untangled your
    computation from your data,
    someone has probably solved
    your data storage requirements

  58. But…
    Failure is never clean. One of the
    most difficult challenges is ensuring
    that we only have known good states;
    failure must not corrupt.
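
A common way to keep a file in a known good state is to write the new contents to a temporary file and then rename it over the old one. The sketch below is one such approach, not something shown in the talk; `write_atomically` is a hypothetical helper. It relies on `os.replace`, which is atomic on POSIX when both paths are on the same filesystem, and it does not cover syncing the containing directory, which full durability across power loss would also need.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # the switch from old to new contents
    except BaseException:
        # A failure before the rename leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies before the rename, the old contents are still in place; if it dies after, the new contents are complete.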

  59. Do you know the provenance of each
    piece of data in your system?

  60. If you detected a failure, would you be
    able to identify the downstream
    effects?

  61. Are there multiple paths to build a
    dataset? Could we rebuild from an
    alternate source if we needed to?
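
The provenance questions become answerable once each dataset records what it was built from. The following is a small sketch under that assumption; the dataset names and the `DERIVED_FROM` table are invented for illustration. Given a dataset known to be bad, it walks the recorded lineage to find everything derived from it.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical provenance records: each dataset lists the datasets it was built from.
DERIVED_FROM: Dict[str, List[str]] = {
    "raw_events": [],
    "reference_data": [],
    "cleaned_events": ["raw_events"],
    "daily_totals": ["cleaned_events"],
    "decisions": ["daily_totals", "reference_data"],
}


def downstream_of(dataset: str, derived_from: Dict[str, List[str]]) -> Set[str]:
    """Return every dataset built, directly or indirectly, from `dataset`."""
    # Invert the "built from" edges into "consumed by" edges.
    consumers: Dict[str, List[str]] = {}
    for name, sources in derived_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(name)

    affected: Set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


if __name__ == "__main__":
    print(sorted(downstream_of("cleaned_events", DERIVED_FROM)))
    # ['daily_totals', 'decisions']
```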

  62. Fail Hard or
    Monitor

  63. “fault isolation advocates that the
    process software be fail-fast, it
    should either function correctly or
    it should detect the fault, signal
    failure and stop operating”
    - Jim Gray, Why Do Computers Stop and What Can Be Done About It?
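
In code, fail-fast usually means refusing an input at the boundary rather than guessing what it meant. A minimal sketch, assuming the input is supposed to be a non-negative integer count; `parse_count` is a hypothetical function, not part of the talk:

```python
def parse_count(raw: str) -> int:
    """Parse a non-negative integer count, refusing anything else."""
    text = raw.strip()
    # Detect the fault and signal failure instead of producing a plausible value.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a non-negative integer: {raw!r}")
    return int(text)
```

Here `parse_count(" 42 ")` returns `42`, while `parse_count("4 2")` raises `ValueError` instead of silently becoming `4`, `2` or `42`.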

  64. But…
    often it is the sorta close, kinda
    reasonable, inputs that will hurt

  65. garbage in, garbage out
    9134

  66. garbage in, garbage out
    9134 42

  68. Fail Fast & Hard,
    otherwise
    Monitor Heavily

  69. monitor data in context
    9134
    4
    3
    3

  70. monitor data in context
    9134
    4
    3
    3
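
One reading of these slides is that 9134 is a perfectly valid number on its own and only looks wrong next to 4, 3 and 3. The sketch below takes that reading and flags a value that is far from the median of recent values. The threshold, the function name and the restriction to positive values are all assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def is_suspicious(value: float, recent: Sequence[float], max_ratio: float = 10.0) -> bool:
    """Flag a positive value that is more than `max_ratio` times above or below
    the median of recent positive values."""
    if not recent:
        return False  # no context yet, nothing to compare against
    typical = median(recent)
    if typical == 0:
        return value != 0
    ratio = value / typical
    return ratio > max_ratio or ratio < 1.0 / max_ratio


if __name__ == "__main__":
    print(is_suspicious(9134, [4, 3, 3]))  # True
    print(is_suspicious(4, [4, 3, 3]))     # False
```

A type check or range check alone would accept 9134; the comparison with its neighbours is what makes it stand out.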

  71. Reliable
    Sub-Systems

  72. P(failure) =
    0.1

  73. P(failure) =
    0.01

  74. P(failure) =
    0.001

  75. P(failure) =
    $$$

  76. P(failure) =
    sleep

  79. “All animals are equal, but
    some animals are more
    equal than others.”
    - George Orwell, Animal Farm

  80. überblock

  81. 0x00bab10c

  82. 0x00bab10c

  83. 0x00bab10c

  84. 0x00bab10c

  85. 0x00bab10c

  92. Ditto Blocks

  93. More Important,
    More Replication

  94. *bonus*

  95. *bonus*
    Built in data
    verification &
    self healing

  96. *bonus*
    Each block
    maintains
    integrity of
    children

  97. *bonus*
    Merkle Tree
    hash(b1, b2)
    hash(g1, g2) hash(g3, g4)
    hash(data)
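
The tree on the slide can be sketched generically: leaves hash the data, and each parent hashes its two children, so a single root hash vouches for everything beneath it. This is a plain Merkle tree over SHA-256 for illustration only, not the ZFS on-disk format or its checksum algorithms. The `leaf:` and `node:` prefixes and the duplicate-last-node rule for odd levels are choices made for this sketch.

```python
import hashlib
from typing import List, Sequence


def hash_leaf(data: bytes) -> bytes:
    """Hash of a data block (the bottom level of the tree)."""
    return hashlib.sha256(b"leaf:" + data).digest()


def hash_node(left: bytes, right: bytes) -> bytes:
    """Hash of a parent, computed from the hashes of its two children."""
    return hashlib.sha256(b"node:" + left + right).digest()


def merkle_root(blocks: Sequence[bytes]) -> bytes:
    """Fold the block hashes pairwise until a single root hash remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level: List[bytes] = [hash_leaf(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pair an odd node with itself
        level = [hash_node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    good = merkle_root([b"g1", b"g2", b"g3", b"g4"])
    damaged = merkle_root([b"g1", b"g2", b"gX", b"g4"])
    print(good != damaged)  # True: one changed block changes the root
```

Because each parent covers its children, a mismatch at the root can be traced down to the damaged block by comparing hashes level by level.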

  98. “ZFS has been subjected to
    over a million forced, violent
    crashes without losing data
    integrity or leaking a single
    block.”
    - Bonwick & Moore, ZFS: The Last Word in File Systems

  99. Isolation
    End-to-End

  100. code

  101. build & test

  102. fail

  103. fail

  104. fail

  105. Almost everything that happens
    after a build undermines the
    isolation we have worked hard
    to achieve

  107. If I can’t run multiple versions of
    the same code in parallel, one
    programming error can bring
    everything down
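
One possible illustration of why parallel versions help: if an older version is still deployed next to a newer one, a bug in the newer one need not take the whole system down. The two handlers and `run_with_fallback` below are hypothetical and only sketch the idea; the talk does not prescribe this mechanism.

```python
from typing import Callable, List, Tuple

Handler = Callable[[int], float]


def handler_v1(x: int) -> float:
    """Older version: handles zero explicitly."""
    return 0.0 if x == 0 else 100.0 / x


def handler_v2(x: int) -> float:
    """Newer version with a programming error: divides by zero when x == 0."""
    return 100.0 / x


def run_with_fallback(versions: List[Tuple[str, Handler]], x: int) -> Tuple[str, float]:
    """Try each deployed version in order; return the first one that succeeds."""
    failures = []
    for name, handler in versions:
        try:
            return name, handler(x)
        except Exception as exc:  # a bug in one version must not take down the rest
            failures.append(f"{name}: {exc!r}")
    raise RuntimeError("all versions failed: " + "; ".join(failures))


if __name__ == "__main__":
    deployed = [("v2", handler_v2), ("v1", handler_v1)]
    print(run_with_fallback(deployed, 4))  # ('v2', 25.0)
    print(run_with_fallback(deployed, 0))  # ('v1', 0.0)
```

With only `v2` deployed, the request with `x == 0` would have had nowhere else to go.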

  109. remember these?

  115. If I can run multiple versions of
    my code, but only one version of
    my infrastructure…

  122. Act III
    Building Systems

  123. “construct reliable systems from
    unreliable parts … from the
    knowledge that any component
    in the system might fail”
    - Holzmann & Joshi, Reliable Software Systems Design

  124. the library
    worst library ever…

  126. the library
    worst library ever…

  127. the library
    worst library ever…

  128. the library
    P(failure) = 0.8

  129. the library
    P(failure) = 0.8
    No Separation of
    Computation and Data

  130. the library
    P(failure) = 0.8

  131. the library
    P(failure) = 0.8
    Crashes Corrupt
    The Data Store

  133. the library
    P(failure) = 0.8
    proxy

  134. the library
    P(failure) = 0.8
    proxy
    journal Reliable data storage

  135. the library
    P(failure) = 0.8
    proxy
    journal On failure replay journal

  136. the library
    P(failure) = 0.8
    proxy
    journal We have isolated failures
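
The proxy-and-journal arrangement can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: `FlakyCounter` is a made-up stand-in for "the library" that crashes on a schedule the caller controls, and the in-memory list stands in for a journal kept in reliable storage. The proxy records each request before forwarding it; when the library crashes, the proxy discards the corrupt instance and replays the journal into a fresh one.

```python
from typing import Callable, Iterator, List


class FlakyCounter:
    """Hypothetical stand-in for 'the library': stateful, crashes on a schedule."""

    def __init__(self, crash_schedule: Iterator[bool]) -> None:
        self.total = 0
        self._crash_schedule = crash_schedule  # True means "crash on this call"

    def add(self, amount: int) -> int:
        if next(self._crash_schedule, False):
            self.total = -1  # a crash leaves the library's own state corrupt
            raise RuntimeError("library crashed")
        self.total += amount
        return self.total


class JournalingProxy:
    """Records every request, then forwards it; rebuilds the library on failure."""

    def __init__(self, make_library: Callable[[], FlakyCounter], max_replays: int = 5) -> None:
        self._make_library = make_library
        self._max_replays = max_replays
        self._journal: List[int] = []  # assumption: backed by reliable storage
        self._library = make_library()

    def add(self, amount: int) -> int:
        self._journal.append(amount)  # journal first, so the request survives a crash
        try:
            return self._library.add(amount)
        except RuntimeError:
            return self._replay()

    def _replay(self) -> int:
        for _ in range(self._max_replays):
            library = self._make_library()  # discard the corrupt instance
            try:
                result = 0
                for amount in self._journal:
                    result = library.add(amount)
            except RuntimeError:
                continue  # crashed again mid-replay: start over with a fresh instance
            self._library = library
            return result
        raise RuntimeError("library kept crashing during replay")


if __name__ == "__main__":
    schedule = iter([False, True, True])  # second and third library calls crash
    proxy = JournalingProxy(lambda: FlakyCounter(schedule))
    print(proxy.add(1), proxy.add(2), proxy.add(3))  # 1 3 6
```

Callers of the proxy see correct totals even though two calls into the library crashed along the way.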

  137. the library
    P(failure) = 0.8^n
    proxy
    journal the library

  138. the library
    P(failure) = 0.8^2 = 0.64
    proxy
    journal the library

  139. the library
    P(failure) = 0.8^10 ≈ 0.11
    proxy
    journal the library

  140. the library
    P(failure) = 0.8^20 ≈ 0.01
    proxy
    journal the library
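
The exponents on these slides can be checked directly, and the question can be inverted: how many instances of the library are needed to reach a target failure probability? This is a small sketch assuming each instance fails independently with probability 0.8; the function names are illustrative.

```python
import math


def p_all_fail(p_failure: float, n: int) -> float:
    """P(all n instances fail), assuming independent failures."""
    return p_failure ** n


def instances_needed(p_failure: float, target: float) -> int:
    """Smallest n with p_failure ** n <= target (for 0 < p_failure < 1, 0 < target < 1)."""
    return math.ceil(math.log(target) / math.log(p_failure))


if __name__ == "__main__":
    for n in (2, 10, 20):
        print(n, round(p_all_fail(0.8, n), 3))  # 2 0.64, then 10 0.107, then 20 0.012
    print(instances_needed(0.8, 0.01))  # 21
```

Twenty instances leave the failure probability just above 0.01 (about 0.0115); the twenty-first takes it below.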

  141. “A beach house isn’t just real
    estate. It’s a state of mind.”
    - Douglas Adams, Mostly Harmless
