
Zebras all the way down: The engineering challenges of the data path

My talk at the inaugural #UptimeConf in 2017. Video: https://www.youtube.com/watch?v=fE2KDzZaxvE

Bryan Cantrill

August 25, 2017

Transcript

  1. The luxury of statelessness
     • In service-oriented software systems, we love statelessness
     • And for good reason: stateless components — like finite state machines — lend systems many desirable properties!
     • Stateless components can be easily made immutable, scalable, re-deployable, restartable, upgradeable, etc. etc.
     • Of course, persistent state still very much exists — we just use separation of concerns to confine the management of state to those services that do it explicitly and exclusively…

  2. The data path
     • The data path consists of the software, hardware, and firmware components between a service endpoint that offers persistence and the implementation of that persistence
     • The data path always ends in non-volatile storage, which (for now, anyway) means either flash or magnetic media
     • The data path traverses many subsystems and components — and nearly always is a distributed system itself
     • We place great demands upon the data path…

  3. The demands of the data path
     • A data path that merely works much of the time is insufficient
     • We (rightfully) expect perfection from the data path: we expect it to be consistent, available and partition-tolerant!
     • Of course, Brewer’s CAP theorem tells us that this isn’t actually possible — we must make tradeoffs
     • Even a well-engineered system can’t beat CAP — but a poorly engineered one will be flailed by it, becoming pathologically unavailable or inconsistent
     • Zebras are the difference

  4. Zebras?
     • In American medical slang, a zebra is a rare and exotic condition that can be conflated with more common ailments
     • Medical students and residents are cautioned against diagnosing them, to the point of aphorism: “when you hear hoofbeats, think of horses not zebras”
     • But — as anyone who has been afflicted by one will affirm — zebras emphatically exist!

  5. Zebras in the data path?
     • Even though the data path runs on and ends with hardware, it consists of many disjoint and unseen software components
     • The paradox of software (especially that of the data path!) is that software is both information and machine
     • When software works correctly, it survives as information does: namely, in perpetuity
     • Especially where software is expensive to write and difficult to fix, there is an overwhelming bias towards extant software
     • Over time, the horses are found; only the zebras are left

  6. Hunting zebra
     • We must assume that unusual pathologies — especially in a distributed system — will not be readily reproducible!
     • When we are culturally afflicted with “bias for action”, it becomes tempting to immediately change the system to fix it
     • This is the wrong first motion: the choice between restoring service and understanding the problem is often a false dichotomy!
     • We must not change the system but rather observe it — we must focus not on snap hypotheses, but rather on initial questions
     • The observability of the system is paramount!

  7. Observability at Joyent
     • Observability is an organizing principle at Joyent — it is a primary reason that we run SmartOS, our illumos derivative
     • Manta — our (open source, container-centric) object storage service — has SmartOS and ZFS at its core
     • Manta uses sharded PostgreSQL for metadata (+ ZooKeeper for leader election), with services primarily in node.js
     • We invested heavily in the observability and debuggability of node.js — and it is a (the?) reason we still use node.js

  8. Observability at Joyent Samsung!
     • Out of desire to build their own cloud based on Manta and Triton (our open source cloud management system), Samsung bought Joyent in June 2016
     • While Manta has been in production for several years, Samsung’s level of scale has brought new-found challenges
     • Good news: between several years of production + observability (logging, DTrace, mdb) + hyperscale post-Samsung, we have nailed many thorny problems in Manta
     • Bad news: our stack — and that of every data path — has components that we still struggle to observe and debug…

  9. Zebra sanctuary
     • Unfortunately, the data path is laced with proprietary software that can’t be observed, audited, verified, or debugged
     • This is the software that interacts so directly with the hardware as to create the illusion of hardware to higher-level software
     • This is firmware, and it runs so dark and deep in the data path that much of it is impossible to see or catalogue
     • Firmware that operates silently will also fail implicitly — it is hardware failing with software’s failure modes

  10. Zebras in the spindle
     • Rotating magnetic media is a modern mechanical marvel
     • With sealed enclosures and helium-based drives, densities continue to increase — the disk will be with us for a long time!
     • Disks are vulnerable to vibration, temperature, particulates, aspersions, wear, etc. — magnetic media will fail!
     • But the disk knows this, and sophisticated on-head/on-controller firmware steers around failed media…
     • …leaving much nastier failure modes

  11. Zebras in the spindle
     • Disks can (emphatically!) read or write the wrong data
     • Seeing this coming reality in the early 2000s, ZFS was designed around total data path integrity via indirect checksums
     • ZFS has discovered all manner of data corruption in storage systems putatively too expensive to suffer such problems…
     • And yet even ZFS oversimplified the failure modes of disks: in 15+ years of deploying ZFS, we have seen disks fail in much more exotic ways than we thought possible

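     To make the point about indirect checksums concrete, the following is a minimal sketch in C (invented blkptr_t layout and toy checksum; not ZFS source). Because the checksum of each block is stored in the parent block pointer rather than alongside the data, a write that is silently dropped or misdirected yields data that is internally consistent yet fails verification on read, which is exactly the class of failure a self-contained, block-embedded checksum cannot catch.

        /* Illustrative sketch only: invented names, not ZFS source. */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define BLOCKSIZE 4096

        /* The parent holds the location *and* the expected checksum of its child. */
        typedef struct blkptr {
            uint64_t bp_offset;    /* where the child block should live on media */
            uint64_t bp_checksum;  /* checksum of the child's contents, recorded at write time */
        } blkptr_t;

        /* Toy stand-in for a real checksum such as fletcher4 or SHA-256. */
        static uint64_t
        checksum(const uint8_t *buf, size_t len)
        {
            uint64_t sum = 0;
            for (size_t i = 0; i < len; i++)
                sum = sum * 31 + buf[i];
            return sum;
        }

        /*
         * On read, the block is verified against the checksum its parent recorded.
         * A disk that dropped the write or wrote it to the wrong location returns
         * stale but internally consistent data: a checksum embedded in the block
         * itself would still match, but the parent's checksum will not.
         */
        static int
        read_and_verify(const blkptr_t *bp, const uint8_t *media, uint8_t *out)
        {
            memcpy(out, media + bp->bp_offset, BLOCKSIZE);
            if (checksum(out, BLOCKSIZE) != bp->bp_checksum) {
                fprintf(stderr, "checksum mismatch at offset %llu\n",
                    (unsigned long long)bp->bp_offset);
                return -1;  /* a real system would now try a mirror or RAID-Z copy */
            }
            return 0;
        }

        int
        main(void)
        {
            static uint8_t media[2 * BLOCKSIZE];
            uint8_t buf[BLOCKSIZE];

            /* "Write" a block at offset 0 and record its checksum in the parent. */
            memset(media, 0xab, BLOCKSIZE);
            blkptr_t bp = { .bp_offset = 0, .bp_checksum = checksum(media, BLOCKSIZE) };

            /* Simulate a misdirected write: the data actually landed one block over. */
            memmove(media + BLOCKSIZE, media, BLOCKSIZE);
            memset(media, 0, BLOCKSIZE);

            printf("read: %s\n", read_and_verify(&bp, media, buf) == 0 ? "ok" : "corrupt");
            return 0;
        }
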
  12. Zebras in the SSD
     • Flash wears out so frequently and quickly that much of an SSD is managing wear and mapping operations to functional flash
     • There are entire universes of system software in every SSD!
     • SSDs have incredible variety in their operating envelopes — and can accordingly fail in wildly divergent ways
     • This can represent systemic risk in that many SSDs can fail in the same way at the same time…
     • Confession: We’ve been so concerned about a flashtastrophe that we have always grossly over-engineered our own SSDs

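     As a hint at those universes of system software, here is a minimal sketch in C of the core job of a flash translation layer (structure, names, and over-provisioning ratio are invented for illustration, not any vendor's firmware): flash pages cannot be rewritten in place, so every host write is remapped to a fresh, least-worn page while the firmware tracks wear and leaves garbage behind to collect later.

        /* Illustrative sketch of a flash translation layer; not any vendor's firmware. */
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_LOGICAL   1024      /* logical blocks exposed to the host */
        #define NUM_PHYSICAL  1280      /* physical pages: ~25% over-provisioning */
        #define UNMAPPED      UINT32_MAX

        typedef struct ftl {
            uint32_t l2p[NUM_LOGICAL];          /* logical -> physical mapping */
            uint32_t erase_count[NUM_PHYSICAL]; /* wear accumulated per physical page */
            uint8_t  in_use[NUM_PHYSICAL];      /* 1 if a logical block currently maps here */
        } ftl_t;

        static void
        ftl_init(ftl_t *ftl)
        {
            for (uint32_t l = 0; l < NUM_LOGICAL; l++)
                ftl->l2p[l] = UNMAPPED;
            for (uint32_t p = 0; p < NUM_PHYSICAL; p++) {
                ftl->erase_count[p] = 0;
                ftl->in_use[p] = 0;
            }
        }

        /* Pick the least-worn free page: a (greatly simplified) wear leveler. */
        static uint32_t
        alloc_page(const ftl_t *ftl)
        {
            uint32_t best = UNMAPPED;
            for (uint32_t p = 0; p < NUM_PHYSICAL; p++)
                if (!ftl->in_use[p] &&
                    (best == UNMAPPED || ftl->erase_count[p] < ftl->erase_count[best]))
                    best = p;
            return best;  /* UNMAPPED means no free page: time to garbage-collect */
        }

        /* A host write never lands where the previous copy of the block lived. */
        static int
        ftl_write(ftl_t *ftl, uint32_t lba)
        {
            uint32_t newpage = alloc_page(ftl);
            if (newpage == UNMAPPED)
                return -1;                      /* real firmware would run GC here */

            uint32_t oldpage = ftl->l2p[lba];
            if (oldpage != UNMAPPED) {
                ftl->in_use[oldpage] = 0;       /* the old copy becomes garbage */
                ftl->erase_count[oldpage]++;    /* it will be erased before reuse */
            }
            ftl->in_use[newpage] = 1;
            ftl->l2p[lba] = newpage;
            return 0;
        }

        int
        main(void)
        {
            static ftl_t ftl;
            ftl_init(&ftl);
            for (uint32_t i = 0; i < 5000; i++)
                ftl_write(&ftl, i % NUM_LOGICAL);
            printf("wear on physical page 0 after 5000 writes: %u erases\n",
                ftl.erase_count[0]);
            return 0;
        }
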
  13. Zebras in the HBA
     • The host bus adapter is responsible for brokering I/O from the operating system to the physical devices
     • This is more complicated than it might seem — and in particular, HBA firmware is infamous for losing I/O under load
     • From the perspective of system software this will be an I/O that never returns — which means it will be timed out and retried
     • While the system will maintain liveness, this will induce a latency outlier — which can manifest itself far up the stack (e.g., TCP resets!)

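     A small, self-contained simulation in C of why a lost command becomes a latency outlier rather than an error (the 30-second timeout, retry count, and drop rate are invented for illustration): the operating system times the I/O out and reissues it, so the caller ultimately sees success, just tens of seconds late.

        /* Toy simulation: an HBA that occasionally loses commands under load. */
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define IO_TIMEOUT_SECS 30   /* SCSI command timeouts are typically tens of seconds */
        #define MAX_RETRIES     3

        /* Stand-in for the HBA: most commands complete, some simply never return. */
        static bool
        hba_completes(void)
        {
            return rand() % 10 != 0;   /* 1 in 10 commands is silently lost */
        }

        /* Returns the timeout latency the caller observes, in seconds, or -1 on hard failure. */
        static int
        do_io(void)
        {
            int latency = 0;
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                if (hba_completes())
                    return latency;             /* success, but possibly very late */
                latency += IO_TIMEOUT_SECS;     /* lost command: wait out the timeout, retry */
            }
            return -1;  /* only after every retry times out does the failure become explicit */
        }

        int
        main(void)
        {
            srand(12345);
            for (int i = 0; i < 20; i++) {
                int secs = do_io();
                if (secs < 0)
                    printf("I/O %2d: failed after all retries timed out\n", i);
                else
                    printf("I/O %2d: %d second(s) of added timeout latency\n", i, secs);
            }
            return 0;
        }
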
  14. Zebras in the DIMM
     • DRAM stores each bit in a capacitor that must be periodically refreshed
     • DRAM is susceptible to fatal failures (e.g., corrosion due to humidity, temperature or other environmental factors)
     • As the speed and density of DRAM have increased (and the voltage has dropped), DRAM has become more susceptible to transient bit failure not due to any hardware malfunction
     • The “Firmware First” (!) model of error handling in x86 (and the demise of CMCI) is leading to a silent epidemic of DIMM failure!

  15. Zebras in the chassis
     • Even the chassis itself is not immune from software failure
     • For example, software and firmware control fan speed — and failures in that software can result in fans stuck running at their highest speed
     • Fans are not designed to run at full power for extended periods of time; they wear out or (worse) induce vibration in the chassis
     • The effects of (say) vibration will be felt far from the source — and again, may manifest only as latency, not as explicit failure

  16. Zebras in the NIC
     • Failure in the network interface card can be due to NIC firmware failure or hardware failure (e.g., the optical transceiver)
     • Networking failure should be entirely survivable by a distributed system, but that doesn’t mean it’s without consequence!
     • Use of the link aggregation control protocol (LACP) seems tempting — but can require more sophisticated software in the switch (i.e., MLAG)…
     • …which itself can lead to new failure modes!

  17. Zebras in the top-of-rack switch
     • As their own complicated ecosystem of software and firmware, top-of-rack switches are prone to software failure
     • Failure in the top-of-rack switch (or worse, the L3 core) can have an enormous blast radius in a distributed system…
     • For example, a switch that drops its ARP tables can result in a distributed system going massively split-brain…
     • Or a switch that gets stuck broadcasting traffic can easily DDoS an entire distributed system — revealing that there is a single point of failure after all!

  18. Zebras all the way up
     • These problems do not manifest themselves cleanly at the point of origin for reasons both pragmatic and economic
     • Hardware vendors don’t want gear shipped back for RCCA!
     • Arguably, unreliable components allow (force?) upstack software to discover its novel failure modes
     • But that is an argument for debugging and resolving those (additional) problems upstack, not for unreliable components!

  19. Don’t fear the zebra
     • The data path is not to be undertaken lightly
     • Do not assume that testing and monitoring can substitute for system understanding; enshrine observability
     • Reward complete understanding, not merely resolution!
     • As long as it’s unobservable, firmware is the enemy — and trends toward sophisticated firmware are especially troubling!
     • Open source software affords us a quality ratchet: we shouldn’t spend our careers re-solving the same problems!

  20. Further reading and viewing
     • For an enlightening (and more positive) take on firmware, check out the amazing videos of Micah Elizabeth Scott (@scanlime)
     • For a snapshot of what we’re currently working on and thinking about with respect to Manta/Triton, see the Joyent Requests for Discussion (RFDs) — especially RFD 89 (“Project Tiresias”)
     • For more on node.js debuggability, see Dave Pacheco’s talk on “Industrial-grade node.js”
     • Also, thank you to Amanda Lundberg of White Coat Captioning for the superhuman real-time captioning!