Zebras all the way down: The engineering challenges of the data path

My talk at the inaugural #UptimeConf in 2017. Video: https://www.youtube.com/watch?v=fE2KDzZaxvE

Bryan Cantrill

August 25, 2017

Transcript

  1. Zebras all the way down
    The engineering challenges of the data path
    CTO
    [email protected]
    Bryan Cantrill
    @bcantrill

  2. The luxury of statelessness
    • In service-oriented software systems, we love statelessness
    • And for good reason: stateless components — like finite state
    machines — lend systems many desirable properties!
    • Stateless components can be easily made immutable, scalable,
    re-deployable, restartable, upgradeable, etc. etc.
    • Of course, persistent state still very much exists — we just
    use separation of concerns to confine the management of state
    to those services that do it explicitly and exclusively…

  3. The data path
    • The data path consists of the software, hardware, and firmware
    components between a service endpoint that offers persistence
    and the implementation of that persistence
    • The data path always ends in non-volatile storage, which (for
    now, anyway) means either flash or magnetic media
    • The data path traverses many subsystems and components —
    and nearly always is a distributed system itself
    • We place great demands upon the data path…

  4. The demands of the data path
    • A data path that merely works much of the time is insufficient
    • We (rightfully) expect perfection from the data path: we expect it
    to be consistent, available and partition-tolerant!
    • Of course, Brewer’s CAP theorem tells us that this isn’t actually
    possible — we must make tradeoffs
    • Even a well-engineered system can’t beat CAP — but a poorly
    engineered one will be flailed by it, becoming pathologically
    unavailable or inconsistent
    • Zebras are the difference

  5. Zebras?
    • In American medical slang, a zebra is a rare and exotic
    condition that can be conflated with more common ailments
    • Medical students and residents are cautioned against
    diagnosing them, to the point of aphorism: “when you hear
    hoofbeats, think of horses not zebras”
    • But — as anyone who has been afflicted by one will affirm —
    zebras emphatically exist!

  6. A zebra close to home

  7. Zebras in the data path?
    • Even though the data path runs on and ends with hardware, it
    consists of many disjoint and unseen software components
    • The paradox of software (especially that of the data path!) is
    that software is both information and machine
    • When software works correctly, it survives as information does:
    namely, in perpetuity
    • Especially where software is expensive to write and difficult to
    fix, there is an overwhelming bias towards extant software
    • Over time, the horses are found; only the zebras are left

  8. Hunting zebra
    • We must assume that unusual pathologies — especially in a
    distributed system — will not be readily reproducible!
    • When we are culturally afflicted with “bias for action”, it
    becomes tempting to immediately change the system to fix it
    • This is the wrong first motion: the choice between restoring
    service and understanding the problem is often a false dichotomy!
    • We must not change the system but rather observe it — we
    must focus not on snap hypotheses, but rather initial questions
    • The observability of the system is paramount!

  9. Observability at Joyent
    • Observability is an organizing principle at Joyent — it is a
    primary reason that we run SmartOS, our illumos derivative
    • Manta — our (open source, container-centric) object storage
    service — has SmartOS and ZFS at its core
    • Manta uses sharded PostgreSQL for metadata (+ ZooKeeper
    for leader election), with services primarily in node.js
    • We invested heavily in the observability and debuggability of
    node.js — and it is a (the?) reason we still use node.js

  10. Observability at Joyent Samsung!
    • Out of desire to build their own cloud based on Manta and Triton
    (our open source cloud management system), Samsung bought
    Joyent in June 2016
    • While Manta has been in production for several years,
    Samsung’s level of scale has brought new-found challenges
    • Good news: between several years of production + observability
    (logging, DTrace, mdb) + hyperscale post-Samsung, we have
    nailed many thorny problems in Manta
    • Bad news: our stack — and that of every data path — has
    components that we still struggle to observe and debug…

  11. Zebra sanctuary
    • Unfortunately, the data path is laced with proprietary software
    that can’t be observed, audited, verified, or debugged
    • This is the software that interacts so directly with the hardware
    as to create the illusion of hardware to higher-level software
    • This is firmware, and it runs so dark and deep in the data path
    that much of it is impossible to see or catalogue
    • Firmware that operates silently will also fail implicitly — it is
    hardware failing with software’s failure modes

  12. Zebras in the spindle
    • Rotating magnetic media is a modern mechanical marvel
    • With sealed enclosures and helium-based drives, densities
    continue to increase — the disk will be with us for a long time!
    • Disks are vulnerable to vibe, temperature, particulates,
    aspersions, wear, etc. — magnetic media will fail!
    • But the disk knows this, and sophisticated on-head/on-controller
    firmware steers around failed media…
    • …leaving much nastier failure modes

  13. Zebras in the spindle
    • Disks can (emphatically!) read or write the wrong data
    • Seeing this coming reality in the early 2000s, ZFS was designed
    around total data path integrity via indirect checksums
    • ZFS has discovered all manner of data corruption in storage
    systems putatively too expensive to suffer such problems…
    • And yet even ZFS oversimplified the failure modes of disks: in
    15+ years of deploying ZFS, we have seen disks fail in much more
    exotic ways than we thought possible
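
A minimal sketch of why it matters that the checksum is held by the parent rather than stored next to the data. The `Disk`, `BlockPointer`, `write_block` and `read_block` names are hypothetical, SHA-256 is only a stand-in checksum, and nothing here reflects ZFS's actual on-disk format. Because the expected checksum travels with the pointer, a disk that hands back well-formed but wrong data is caught on read.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    """Hypothetical parent-held pointer: where the child lives and what it must hash to."""
    address: int
    checksum: bytes


class Disk:
    """Toy block device; a dict stands in for the media."""

    def __init__(self):
        self.blocks = {}

    def write(self, address, data):
        self.blocks[address] = data

    def read(self, address):
        return self.blocks[address]


def write_block(disk, address, data):
    """Write data and return the pointer that the parent block would store."""
    disk.write(address, data)
    return BlockPointer(address, hashlib.sha256(data).digest())


def read_block(disk, pointer):
    """Read through the pointer and verify against the parent-held checksum."""
    data = disk.read(pointer.address)
    if hashlib.sha256(data).digest() != pointer.checksum:
        raise IOError(f"checksum mismatch at block {pointer.address}")
    return data


def demo():
    disk = Disk()
    ptr = write_block(disk, 7, b"customer object")
    ok = read_block(disk, ptr) == b"customer object"

    # Simulate a misdirected write: data meant for another block lands on block 7.
    disk.write(7, b"someone else's data")
    try:
        read_block(disk, ptr)
        detected = False
    except IOError:
        detected = True
    return ok, detected
```

A checksum stored inside block 7 itself would have been overwritten along with the data and would still validate; keeping it one level up is what exposes the wrong block.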

  14. Zebras in the SSD
    • Flash wears out so frequently and quickly that much of an SSD
    is managing wear and mapping operations to functional flash
    • There are entire universes of system software in every SSD!
    • SSDs have incredible variety in their operating envelopes —
    and can accordingly fail in wildly divergent ways
    • This can represent systemic risk in that many SSDs can fail in
    the same way at the same time…
    • Confession: We’ve been so concerned about a flashtastrophy
    that we have always grossly over-engineered our own SSDs
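
To make "managing wear and mapping operations to functional flash" concrete, here is a deliberately tiny sketch of a flash translation layer. `ToyFTL`, its parameters, and the erase limit are all hypothetical; real SSD firmware is vastly more involved (garbage collection, over-provisioning, power-loss handling) and none of that is modeled.

```python
class ToyFTL:
    """Hypothetical flash translation layer: each write of a logical block
    goes to the least-worn free physical block, and worn-out blocks are retired."""

    def __init__(self, physical_blocks, erase_limit):
        self.erase_limit = erase_limit
        self.erase_counts = [0] * physical_blocks
        self.free = set(range(physical_blocks))
        self.mapping = {}  # logical block -> physical block
        self.data = {}     # physical block -> contents

    def write(self, logical, payload):
        # Flash is not overwritten in place: write to a fresh block, then erase the old one.
        if not self.free:
            raise IOError("no functional flash left")
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        self.data[target] = payload
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            self._erase(old)

    def _erase(self, physical):
        del self.data[physical]
        self.erase_counts[physical] += 1
        # A block that has reached its erase limit is never handed out again.
        if self.erase_counts[physical] < self.erase_limit:
            self.free.add(physical)

    def read(self, logical):
        return self.data[self.mapping[logical]]


def writes_until_failure(ftl, logical=0):
    """Rewrite one logical block until the device runs out of usable flash."""
    count = 0
    try:
        while True:
            ftl.write(logical, f"v{count}")
            count += 1
    except IOError:
        return count
```

Two `ToyFTL` instances built with the same parameters and given the same workload run out of flash on exactly the same write, which is a toy version of the systemic risk above: identical firmware under identical load fails identically.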

  15. Zebras in the HBA
    • The host bus adapter is responsible for brokering I/O from the
    operating system to the physical devices
    • This is more complicated than it might seem — and in particular,
    HBA firmware is infamous for losing I/O under load
    • From the perspective of system software this will be an I/O that
    never returns — which means it will be timed out and retried
    • While the system will maintain liveness, this will induce a
    latency outlier — which can manifest itself far up the stack (e.g.,
    TCP resets!)
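
The timeout-and-retry behavior can be sketched on a simulated clock. Everything below is an assumption made for illustration: `lossy_device` is a hypothetical device that loses exactly one request, and the 5 ms service time and 5000 ms timeout are arbitrary values, not the defaults of any real driver.

```python
import statistics


def lossy_device(request_id, attempt):
    """Hypothetical device: returns service time in ms, or None if the I/O is lost."""
    if request_id == 500 and attempt == 0:
        return None  # this one I/O never returns
    return 5


def submit_with_retry(device, request_id, timeout_ms, max_attempts=3):
    """Return the total latency in ms seen by the caller, retrying lost I/Os."""
    elapsed_ms = 0
    for attempt in range(max_attempts):
        service_ms = device(request_id, attempt)
        if service_ms is not None and service_ms <= timeout_ms:
            return elapsed_ms + service_ms
        # The caller waits out the full timeout before giving up on this attempt.
        elapsed_ms += timeout_ms
    raise IOError(f"request {request_id} failed after {max_attempts} attempts")


def run_workload(requests=1000, timeout_ms=5000):
    """Latency of every request in a workload where a single I/O is lost once."""
    return [submit_with_retry(lossy_device, i, timeout_ms) for i in range(requests)]


def summarize(latencies):
    """Median and worst-case latency in ms."""
    return statistics.median(latencies), max(latencies)
```

Every request succeeds, so nothing looks like a failure, yet one caller waits roughly a thousand times longer than the median. That single outlier is what surfaces further up the stack.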

  16. Zebras in the DIMM
    • A DRAM cell is a capacitor that must be periodically refreshed
    • DRAM is susceptible to fatal failures (e.g., corrosion due to
    humidity, temperature or other environmental failures)
    • As the speed and density of DRAM have increased (and the
    voltage has dropped), DRAM has become more susceptible to
    transient bit failure not due to any hardware malfunction
    • The “Firmware First” (!) model of error handling in x86 (and the
    demise of CMCI) is leading to a silent epidemic of DIMM failure!

  17. Zebras in the chassis
    • Even the chassis itself is not immune from software failure
    • For example, software and firmware control fan speed — and
    failures in that software can result in fans stuck running at their
    highest speed
    • Fans are not designed to run at full power for extended periods
    of time; they wear out or (worse) induce vibration in the chassis
    • The effects of (say) vibration will be felt far from the source
    and, again, may manifest only as latency rather than explicit failure

  18. Zebras in the NIC
    • Failure in the network interface card can be due to NIC firmware
    failure or hardware failure (e.g., the optical transceiver)
    • Networking failure should be entirely survivable by a distributed
    system, but that doesn’t mean it’s without consequence!
    • Use of the link aggregation control protocol (LACP) seems
    tempting, but can require more sophisticated software in the
    switch (i.e., MLAG)…
    • …which itself can lead to new failure modes!

  19. Zebras in the top-of-rack switch
    • As their own complicated ecosystem of software and firmware,
    top-of-rack switches are prone to software failure
    • Failure in the top-of-rack (or worse, the L3 core) can have an
    enormous blast radius in a distributed system…
    • For example, a switch that drops its ARP tables can result in a
    distributed system going massively split brain…
    • Or a switch that gets stuck broadcasting traffic can easily DDoS
    an entire distributed system — revealing that there is a single-
    point-of-failure after all!

  20. Zebras all the way up
    • These problems do not manifest themselves cleanly at the point
    of origin for reasons both pragmatic and economic
    • Hardware vendors don’t want gear shipped back for RCCA!
    • Arguably, unreliable components allow (force?) upstack
    software to discover its novel failure modes
    • But that is an argument for debugging and resolving those
    (additional) problems upstack, not for unreliable components!

  21. Don’t fear the zebra
    • The data path is not to be undertaken lightly
    • Do not assume that testing and monitoring can substitute for
    system understanding; enshrine observability
    • Reward complete understanding, not merely resolution!
    • As long as it’s unobservable, firmware is the enemy — and
    trends toward sophisticated firmware are especially troubling!
    • Open source software affords us a quality ratchet: we
    shouldn’t spend our careers re-solving the same problems!

  22. Further reading and viewing
    • For an enlightening (and more positive) take on firmware, check
    out the amazing videos of Micah Elizabeth Scott (@scanlime)
    • For a snapshot of what we’re currently working on and thinking
    about with respect to Manta/Triton, see the Joyent Requests for
    Discussion (RFDs) — especially RFD 89 (“Project Tiresias”)
    • For more on node.js debuggability, see Dave Pacheco’s talk on
    “Industrial-grade node.js”
    • Also, thank you to Amanda Lundberg of White Coat Captioning
    for the superhuman real-time captioning!
