Failure: Or the Unexpected Virtue of Functional Programming

Even correct software fails.

So what happens if we shift the focus of functional programming to reliable systems? Let’s attack the hard and ugly set of programming problems, the solutions that don’t naturally fall out from building a neat library. Let’s apply our functional programming toolkit to delivering systems end-to-end.

This is a talk on reliable systems: what it takes to build them, how functional programming can be, and already is being, leveraged, and, perhaps of more interest, where current approaches are letting us down.

*This talk was presented as the keynote at YOW Lambdajam 2015 in Brisbane*

Mark Hibberd

May 22, 2015
Transcript

  1. Failure or (The Unexpected Virtue of Functional Programming) @markhibberd

  2. None
  3. Act I Working Software

  4. None
  5. None
  6. “Why do we continue in this miserable condition” - George Orwell, Animal Farm
  7. Reliability

  8. Correctness Reliability

  9. Correctness Reliability (the correct answer)

  10. Correctness Correctness (the correct answer)

  11. Correctness Reliability Correctness (the correct answer)

  12. Correctness Reliability (whenever i need it) (the correct answer) Correctness

  13. None
  14. None
  15. None
  16. None
  17. None
  18. “Several of them would have protested if they could have found the right arguments.” - George Orwell, Animal Farm
  19. Act II Post Functional

  20. None
  21. None
  22. Data

  23. Decisions

  24. Outcomes

  25. Measurement

  26. λx.f x

  27. 120+ code bases Pure, Typed FP Haskell, Scala & Stuff

  28. None
  29. Stats and Reliability

  30. bad things can happen…

  31. P(failure) = 0.1

  32. P(failure) = 0.1

  33. redundancy

  34. redundancy

  35. P(individual failure) = 0.1

  36. P(system failure) = 0.1^10

  37. are failures really independent?

  38. P(mutually assured destruction) = 1

  39. redundancy

  40. but if one goes…

  41. they all do

  42. P(individual failure) = 0.1

  43. P(individual success) = 1 - 0.1 = 0.9

  44. P(all successes) = 0.9^10

  45. P(system failure) = 1 - 0.9^10

  46. P(system failure) = 1 - 0.9^10 = 0.65

  47. None
  48. P(system failure) = 1 - 0.9^10 = 0.65
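
The arithmetic on slides 35 to 48 can be checked with a short sketch. It assumes, as the slides do, ten components that each fail with probability 0.1 and that fail independently of one another. The function names are hypothetical and chosen only for illustration.

```python
def p_all_fail(p_fail: float, n: int) -> float:
    """Independent redundancy: the system fails only if every component fails."""
    return p_fail ** n


def p_any_fail(p_fail: float, n: int) -> float:
    """Coupled components: the system fails as soon as any one component fails."""
    p_all_succeed = (1 - p_fail) ** n
    return 1 - p_all_succeed


if __name__ == "__main__":
    # Ten redundant, truly independent components (slide 36).
    print(f"all ten fail:       {p_all_fail(0.1, 10):.0e}")
    # Ten components where one failure takes the system down (slide 46).
    print(f"at least one fails: {p_any_fail(0.1, 10):.2f}")
```

The same ten components give a failure probability of about one in ten billion or about 0.65, depending only on whether a single failure is contained or propagates.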

  49. Correctness Reliability (whenever i need it) (the correct answer) Correctness

  50. Correctness Reliability (produce the decisions by X o’clock using the last vetted dataset) (the best set of measurable decisions for today) Correctness
  51. None
  52. Separation of Data and Computation

  53. If we can achieve reliable data, reliable computation should be pretty straightforward
  54. Can you restart your system at any point?

  55. Could you turn your long running daemon into a cron job?
  56. Reliable Data

  57. If you have untangled your computation from your data, someone has probably solved your data storage requirements
  58. But… Failure is never clean. One of the most difficult challenges is ensuring that we only have known good states, failure must not corrupt.
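
One common way to keep only known good states on disk is to write to a temporary file and rename it over the target once it is complete. The following is a minimal sketch of that idea, not something shown in the talk; `write_atomically` is a hypothetical helper, and it assumes the temporary file and the target live on the same filesystem so that `os.replace` is atomic.

```python
import json
import os
import tempfile


def write_atomically(path: str, payload: dict) -> None:
    """Replace `path` with `payload` as JSON, or leave the old file untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a mix
    except BaseException:
        # A failure part-way through must not leave debris or touch the good copy.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash before the rename leaves the previous version in place, which is the "failure must not corrupt" property in its smallest form.
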
  59. Do you know the provenance of each piece of data in your system?
  60. If you detected a failure, would you be able to identify the downstream effects?
  61. Are there multiple paths to build a dataset? Could we rebuild from an alternate source if we needed to?
  62. Fail Hard or Monitor

  63. “fault isolation advocates that the process software be fail-fast, it should either function correctly or it should detect the fault, signal failure and stop operating” - Jim Gray, Why Do Computers Stop and What Can Be Done About It?
  64. But… often it is the sorta close, kinda reasonable, inputs that will hurt
  65. garbage in, garbage out 9134

  66. garbage in, garbage out 9134 42

  67. None
  68. Fail Fast & Hard, otherwise Monitor Heavily

  69. monitor data in context 9134 4 3 3

  70. monitor data in context 9134 4 3 3
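
Slides 69 and 70 show a value of 9134 sitting next to neighbours of 4, 3 and 3. One way to monitor data in context is to compare each new value with the median of recent ones. This is a rough sketch under assumptions not in the talk: values are positive counts, and the hypothetical `max_ratio` threshold of 10 is arbitrary.

```python
from statistics import median
from typing import Sequence


def out_of_context(value: float, recent: Sequence[float], max_ratio: float = 10.0) -> bool:
    """Return True when `value` is far outside the range suggested by `recent`."""
    if not recent:
        return False  # no context yet, so nothing to compare against
    baseline = median(recent)
    if baseline == 0:
        return value != 0
    ratio = value / baseline
    # Flag values that are much larger or much smaller than the recent baseline.
    return ratio > max_ratio or ratio < 1 / max_ratio


if __name__ == "__main__":
    print(out_of_context(9134, [4, 3, 3]))  # the value from the slide, against its neighbours
    print(out_of_context(4, [3, 3]))
```

Each value on its own is a perfectly valid number; only the comparison with its neighbours shows that 9134 deserves a closer look.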

  71. Reliable Sub-Systems

  72. P(failure) = 0.1

  73. P(failure) = 0.01

  74. P(failure) = 0.001

  75. P(failure) = $$$

  76. P(failure) = sleep

  77. None
  78. None
  79. “All animals are equal, but some animals are more equal than others.” - George Orwell, Animal Farm
  80. überblock

  81. 0x00bab10c

  82. 0x00bab10c

  83. 0x00bab10c

  84. 0x00bab10c

  85. 0x00bab10c

  86. None
  87. None
  88. None
  89. None
  90. None
  91. None
  92. Ditto Blocks

  93. More Important, More Replication

  94. *bonus*

  95. *bonus* Built in data verification & self healing

  96. *bonus* Each block maintains integrity of children

  97. *bonus* Merkle Tree hash(b1, b2) hash(g1, g2) hash(g3, g4) hash(data)
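
The Merkle tree on slide 97 can be sketched in a few lines. This illustrates the general technique only and is not ZFS's on-disk format; SHA-256 and the helper names `h` and `merkle_root` are assumptions made for the example.

```python
import hashlib
from typing import List, Sequence


def h(*parts: bytes) -> bytes:
    """Hash the concatenation of `parts` with SHA-256."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part)
    return digest.digest()


def merkle_root(blocks: Sequence[bytes]) -> bytes:
    """Hash each data block, then hash pairs of hashes upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level: List[bytes] = [h(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    good = merkle_root([b"g1", b"g2", b"g3", b"g4"])
    damaged = merkle_root([b"g1", b"g2", b"g3", b"g4 with a flipped bit"])
    print(good != damaged)  # a change in any leaf changes the root
```

Because each parent holds the hashes of its children, checking the root is enough to detect corruption anywhere below it, and walking down the mismatching branch locates the damaged block.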

  98. “ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block.” - Bonwick & Moore, ZFS The Last Word in File Systems
  99. Isolation End-to-End

  100. code

  101. build & test

  102. fail

  103. fail

  104. fail

  105. Almost everything that happens after a build undermines the isolation we have worked hard to achieve
  106. None
  107. If I can’t run multiple versions of the same code in parallel, one programming error can bring everything down
  108. None
  109. remember these?

  110. None
  111. None
  112. None
  113. None
  114. None
  115. If I can run multiple versions of my code, but only one version of my infrastructure…
  116. None
  117. None
  118. None
  119. None
  120. None
  121. None
  122. Act III Building Systems

  123. “construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design
  124. the library worst library ever…

  125. None
  126. the library worst library ever…

  127. the library worst library ever…

  128. the library P(failure) = 0.8

  129. the library P(failure) = 0.8 No Separation of Computation and Data
  130. the library P(failure) = 0.8

  131. the library P(failure) = 0.8 Crashes Corrupt The Data Store

  132. None
  133. the library P(failure) = 0.8 proxy

  134. the library P(failure) = 0.8 proxy journal Reliable data storage

  135. the library P(failure) = 0.8 proxy journal On failure replay journal
  136. the library P(failure) = 0.8 proxy journal We have isolated failures
  137. the library P(failure) = 0.8^n proxy journal the library

  138. the library P(failure) = 0.8^2 = 0.64 proxy journal the library

  139. the library P(failure) = 0.8^10 = 0.10 proxy journal the library

  140. the library P(failure) = 0.8^20 = 0.01 proxy journal the library
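
The proxy-and-journal idea on slides 133 to 140 can be sketched as follows. This is an illustration under assumptions, not the implementation from the talk: `CrashyStore` is a hypothetical stand-in for the unreliable library, `should_crash` is an injected function that decides when it fails, and `max_restarts` is an arbitrary limit. If failures are independent with probability 0.8, n attempts all fail with probability 0.8^n, which is 0.64 for n = 2, roughly 0.107 for n = 10 and roughly 0.0115 for n = 20.

```python
from typing import Any, Callable, List, Tuple


class CrashyStore:
    """Hypothetical unreliable library: keeps its data in memory and may crash on any call."""

    def __init__(self, should_crash: Callable[[], bool]) -> None:
        self.state: dict = {}
        self._should_crash = should_crash

    def put(self, key: str, value: Any) -> None:
        if self._should_crash():
            raise RuntimeError("library crashed")
        self.state[key] = value


class JournalingProxy:
    """Records every request in a journal, then rebuilds a fresh store by replay after a crash."""

    def __init__(self, should_crash: Callable[[], bool], max_restarts: int = 20) -> None:
        self._should_crash = should_crash
        self._max_restarts = max_restarts
        self.journal: List[Tuple[str, Any]] = []
        self._store = CrashyStore(should_crash)

    def put(self, key: str, value: Any) -> None:
        self.journal.append((key, value))  # journal first, so the request survives a crash
        try:
            self._store.put(key, value)
        except RuntimeError:
            self._recover()

    def _recover(self) -> None:
        for _ in range(self._max_restarts):
            fresh = CrashyStore(self._should_crash)  # never reuse a possibly corrupt instance
            try:
                for key, value in self.journal:
                    fresh.put(key, value)
            except RuntimeError:
                continue  # this replay crashed too, so start again with another instance
            self._store = fresh
            return
        raise RuntimeError("could not recover within max_restarts")

    def get(self, key: str) -> Any:
        return self._store.state[key]
```

The journal is the only state that has to be stored reliably; every instance of the library is disposable, which is what isolates its failures from the rest of the system.
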
  141. “A beach house isn’t just real estate. It’s a state of mind.” - Douglas Adams, Mostly Harmless
  142. None