Failure: Or the Unexpected Virtue of Functional Programming

Failure or (The Unexpected Virtue of Functional Programming) @markhibberd

Act I Working Software

“Why do we continue in this miserable condition” - George Orwell, Animal Farm

Reliability

Correctness (the correct answer)

Reliability (whenever I need it)

“Several of them would have protested if they could have found the right arguments.” - George Orwell, Animal Farm

Act II Post Functional

Decisions

Outcomes

Measurement

λx.f x

120+ code bases: Pure, Typed FP (Haskell, Scala & Stuff)

Stats and Reliability

bad things can happen…

P(failure) = 0.1

redundancy

P(individual failure) = 0.1

P(system failure) = 0.1^10

are failures really independent?

P(mutually assured destruction) = 1

redundancy

but if one goes…

they all do

P(individual failure) = 0.1

P(individual success) = 1 - 0.1 = 0.9

P(all successes) = 0.9^10

P(system failure) = 1 - 0.9^10 ≈ 0.65
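The two calculations on these slides can be checked with a short sketch. It assumes ten components, as the exponents suggest, and the function names are made up for illustration. The first case is redundancy with independent failures, where the system is lost only if every replica fails. The second is the coupled case, where any single failure takes everything down.

```python
def p_all_fail(p_fail: float, n: int) -> float:
    """Independent replicas: the system fails only if all n replicas fail."""
    return p_fail ** n


def p_any_fail(p_fail: float, n: int) -> float:
    """Coupled components: the system fails if at least one of n fails."""
    return 1 - (1 - p_fail) ** n


if __name__ == "__main__":
    print(p_all_fail(0.1, 10))  # vanishingly small if failures are independent
    print(p_any_fail(0.1, 10))  # roughly two in three if one failure sinks all
```

The same ten components give very different answers depending only on whether their failures are independent.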

Correctness (the correct answer)

Reliability (whenever I need it)

Correctness (the best set of measurable decisions for today)

Reliability (produce the decisions by X o’clock using the last vetted dataset)

Separation of Data and Computation

If we can achieve reliable data, reliable computation should be pretty straightforward

Can you restart your system at any point?

Could you turn your long-running daemon into a cron job?
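One way to read these two questions: if all progress lives in durable storage rather than in process memory, a run can be killed at any point and simply started again. The sketch below is an illustration under assumptions, not the speaker's implementation. The checkpoint file, `run_once`, and the `process` callback are hypothetical, and `process` is assumed to be safe to repeat for an item that was interrupted.

```python
import json
import os


def load_done(checkpoint_path: str) -> set:
    """Return the ids already processed, or an empty set on the first run."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def save_done(checkpoint_path: str, done: set) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(done), f)
    # os.replace swaps the file in one step, so a reader sees either the
    # old checkpoint or the new one, never a half-written file.
    os.replace(tmp_path, checkpoint_path)


def run_once(items, checkpoint_path, process) -> None:
    """Process every item not yet recorded as done; safe to re-run after a crash."""
    done = load_done(checkpoint_path)
    for item in items:
        if item in done:
            continue
        process(item)
        done.add(item)
        save_done(checkpoint_path, done)
```

Because `run_once` keeps nothing between invocations except the checkpoint, it could be scheduled from cron as readily as run in a loop.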

Reliable Data

If you have untangled your computation from your data, someone has probably solved your data storage requirements

But… Failure is never clean. One of the most difficult challenges is ensuring that we only have known good states; failure must not corrupt.

Do you know the provenance of each piece of data in your system?

If you detected a failure, would you be able to identify the downstream effects?

Are there multiple paths to build a dataset? Could we rebuild from an alternate source if we needed to?
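If every dataset records what it was built from, finding the downstream effects of a bad input becomes a graph traversal. A minimal sketch, assuming provenance is available as a simple mapping; the function name and the dataset names in the usage note are hypothetical.

```python
from collections import deque


def downstream(derived_from: dict, bad: str) -> set:
    """Return every dataset that depends, directly or transitively, on `bad`.

    `derived_from` maps a dataset name to the list of datasets it was built from.
    """
    # Invert the provenance records: source -> datasets built from it.
    consumers = {}
    for dataset, sources in derived_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(dataset)

    # Breadth-first walk from the bad dataset through its consumers.
    affected = set()
    queue = deque([bad])
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected
```

For example, with `{"features": ["raw_events"], "model": ["features"]}`, a fault in `raw_events` marks both `features` and `model` as suspect.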

Fail Hard or Monitor

“fault isolation advocates that the process software be fail-fast, it should either function correctly or it should detect the fault, signal failure and stop operating” - Jim Gray, Why Do Computers Stop and What Can Be Done About It?
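A small sketch of fail-fast in the sense of the quote: detect the fault, signal it, and stop, rather than carrying on with a guessed value. The exception type, the functions, and the rule that quantities are non-negative integers are all assumptions made for illustration.

```python
class FaultDetected(Exception):
    """Raised to signal failure and stop processing."""


def parse_quantity(raw: str) -> int:
    """Parse a non-negative integer quantity or fail immediately."""
    try:
        value = int(raw)
    except ValueError as err:
        raise FaultDetected(f"not an integer: {raw!r}") from err
    if value < 0:
        raise FaultDetected(f"negative quantity: {value}")
    return value


def total(raw_values) -> int:
    """Sum quantities; the first bad input aborts the whole computation."""
    return sum(parse_quantity(raw) for raw in raw_values)
```

A well-formed but implausible value still passes these checks; fail-fast only catches the faults it was written to detect.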

But… often it is the sorta close, kinda reasonable inputs that will hurt

garbage in, garbage out (slide figure values: 9134, 42)

Fail Fast & Hard, otherwise Monitor Heavily

monitor data in context (slide figure values: 9134, 4, 3, 3)
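The numbers on this slide suggest a value that looks plausible alone but stands out next to its neighbours. That reading of the figure is an assumption, as are the function name and the threshold in the sketch below, which compares a new value with the median of recent ones.

```python
from statistics import median


def looks_anomalous(value: float, recent: list, factor: float = 10.0) -> bool:
    """Flag `value` if its magnitude exceeds `factor` times the recent median.

    An empty history gives no context, so nothing is flagged.
    """
    if not recent:
        return False
    baseline = median(recent)
    # Floor the baseline at 1.0 so a median of zero does not flag everything.
    return abs(value) > factor * max(abs(baseline), 1.0)
```

With recent values 4, 3, 3 the median is 3, so 9134 is flagged while 5 is not. A real monitor would need a threshold tuned to the data rather than a fixed factor.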

Reliable Sub-Systems

P(failure) = 0.1

P(failure) = 0.01

P(failure) = 0.001

P(failure) = $$$

P(failure) = sleep

“All animals are equal, but some animals are more equal than others.” - George Orwell, Animal Farm

überblock

0x00bab10c

Ditto Blocks

More Important, More Replication

*bonus* Built-in data verification & self healing

*bonus* Each block maintains integrity of children

*bonus* Merkle Tree: hash(b1, b2) over hash(g1, g2) and hash(g3, g4), each over hash(data)
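The idea that each block vouches for its children can be sketched as a generic binary Merkle tree. This is not ZFS's on-disk format; SHA-256, the helper names, and the handling of odd levels are choices made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list) -> bytes:
    """Compute the root hash of a binary Merkle tree over data blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(block) for block in blocks]  # leaves: hash(data)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        # Each parent is the hash of its two children's hashes.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Changing any data block changes every hash on the path to the root, so a mismatch at the top can be traced down to the damaged block.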

“ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block.” - Bonwick & Moore, ZFS: The Last Word in File Systems

Isolation End-to-End

build & test

Almost everything that happens after a build undermines the isolation we have worked hard to achieve

If I can’t run multiple versions of the same code in parallel, one programming error can bring everything down

remember these?

If I can run multiple versions of my code, but only one version of my infrastructure…

Act III Building Systems

“construct reliable systems from unreliable parts … from the knowledge that any component in the system might fail” - Holzmann & Joshi, Reliable Software Systems Design

the library: worst library ever…

the library: P(failure) = 0.8

No Separation of Computation and Data

Crashes Corrupt The Data Store

proxy in front of the library

proxy with a journal: Reliable data storage

On failure replay journal

We have isolated failures

proxy, journal, multiple copies of the library: P(failure) = 0.8^n

proxy, journal, two copies of the library: P(failure) = 0.8^2 = 0.64
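The proxy and journal arrangement on these slides can be sketched as follows. This is one reading of the diagram, not the speaker's code. `Proxy`, `make_library`, and the `apply`/`state` methods are hypothetical names, and the in-memory list stands in for reliable, durable storage.

```python
class Proxy:
    """Wraps an unreliable, stateful library behind a journal.

    `make_library` is a hypothetical factory returning a fresh instance with
    an `apply(command)` method that may raise and a `state()` method.
    """

    def __init__(self, make_library, max_attempts: int = 5):
        self._make_library = make_library
        self._max_attempts = max_attempts
        self._journal = []  # stand-in for reliable data storage
        self._library = make_library()

    def apply(self, command) -> None:
        # Journal first, so an accepted command survives any later crash.
        self._journal.append(command)
        try:
            self._library.apply(command)
        except Exception:
            # Assume the crash corrupted the instance: discard it and rebuild.
            self._library = self._rebuild()

    def _rebuild(self):
        """Replay the whole journal into a fresh instance, retrying on failure."""
        for _ in range(self._max_attempts):
            fresh = self._make_library()
            try:
                for command in self._journal:
                    fresh.apply(command)
                return fresh
            except Exception:
                continue  # this instance is suspect too; throw it away
        raise RuntimeError("could not rebuild from journal")

    def state(self):
        return self._library.state()
```

The library's own state is treated as disposable; the journal is the only thing that has to be reliable. If attempts fail independently with probability 0.8, two attempts both fail with probability 0.8^2 = 0.64, and each further attempt multiplies that by 0.8 again. The sketch gives up once `max_attempts` rebuilds have failed.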

“A beach house isn’t just real estate. It’s a state of mind.” - Douglas Adams, Mostly Harmless
