was not just a new system, but a new way of thinking about systems • Instead of a sealed monolith, the operating system was a collection of small, easily understood programs • First Edition Unix (1971) contained many programs that we still use today (ls, rm, cat, mv) • Its very name conveyed this minimalist aesthetic: Unix is a homophone of “eunuchs” — a castrated Multics We were a bit oppressed by the big system mentality. Ken wanted to do something simple. — Dennis Ritchie
had the idea of connecting different components: At the same time that Thompson and Ritchie were sketching out a file system, I was sketching out how to do data processing on the blackboard by connecting together cascades of processes • This was the primordial pipe, but it took three years to persuade Thompson to adopt it: And one day I came up with a syntax for the shell that went along with the piping, and Ken said, “I’m going to do it!”
small-system aesthetic — gave rise to the Unix philosophy, as articulated by Doug McIlroy: • Write programs that do one thing and do it well • Write programs to work together • Write programs that handle text streams, because that is a universal interface • Four decades later, this philosophy remains the single most important revolution in software systems thinking!
the Epic Rap Battle of computer science history: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies. • Don Knuth’s solution: an elaborate program in WEB, a Pascal-like literate programming system of his own invention, using a purpose-built algorithm • Doug McIlroy’s solution shows the power of the Unix philosophy: tr -cs A-Za-z '\n' | tr A-Z a-z | \ sort | uniq -c | sort -rn | sed ${1}q Doug McIlroy v. Don Knuth: FIGHT!
paper (Dean et al., OSDI ’04) poses a problem disturbingly similar to Bentley’s challenge nearly two decades prior: Count of URL Access Frequency: The function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair • But the solutions do not adhere to the Unix philosophy... • ...and nor do they make use of the substantial Unix foundation for data processing • e.g., Appendix A of the OSDI ’04 paper has a 71 line word count in C++ — with nary a wc in sight
to allow for “big data” — quantities of data that dwarf a single machine • Must allow for massively parallel execution • Must allow for multi-tenancy • To make use of both the Unix philosophy and its toolset, must be able to virtualize the operating system
storage: block, file and object • Block (i.e., a SAN) is far too low an abstraction — and notoriously expensive to scale • File (i.e., NAS) is too permissive an abstraction — it implies a coherent store for arbitrary (partial) writes, trying (and failing) to be both C and A in CAP • Object (e.g., S3) is similar “enough” to a file-based abstraction, but by not allowing partial writes, allows for proper CAP tradeoffs
partial updates • For both durability and availability, objects are generally erasure encoded across spindles on different nodes • A different approach is to have a highly reliable local file system that erasure encodes across local spindles — with entire objects duplicated across nodes for availability • ZFS pioneered both reliability and efficiency of this model with RAID-Z — and has refined it over the past decade of production use • ZFS is one of the four foundational technologies in Joyent’s open source SmartOS
— systems have been virtualized at the level of hardware • Hardware virtualization has its advantages, but it’s heavyweight: operating systems are not designed to share resources like DRAM, CPU, I/O devices, etc. • One can instead virtualize at the level of the operating system: a single OS kernel that creates lightweight containers — on the metal, but securely partitioned • Pioneered by BSD’s jails; taken to a logical extreme by zones found in Joyent’s SmartOS
with the abstraction provided by zones to develop an object store that has compute as a first-class citizen? • ZFS rollback allows for zones to be trashed — simply rollback the zone after compute completes on an object • Add a job scheduling system that allows for both map and reduce phases of distributed work • Would allow for the Unix toolset to be used on arbitrary large amounts of data — unlocking big data one-liners • If it perhaps seems obvious now, it wasn’t at the time... Idea: ZFS + Zones?
and zones, we have built Manta, an internet-facing object storage system offering in situ compute • That is, the description of compute can be brought to where objects reside instead of having to backhaul objects to transient compute • The abstractions made available for computation are anything that can run on the OS... • ...and as a reminder, the OS — Unix — was built around the notion of ad hoc unstructured data processing, and allows for remarkably terse expressions of computation Manta: ZFS + Zones!
solution to Bentley’s challenge: mfind -t o /bcantrill/public/v7/usr/man | \ mjob create -o -m "tr -cs A-Za-z '\n' | \ tr A-Z a-z | sort | uniq -c" -r \ "awk '{ x[\$2] += \$1 } END { for (w in x) { print x[w] \" \" w } }' | \ sort -rn | sed ${1}q" • This description not only terse, it is high performing: data is left at rest — with the “map” phase doing heavy reduction of the data stream • As such, Manta — like Unix — is not merely syntactic sugar; it converges compute and data in a new way Manta: Unix for Big Data
we prefer consistency over availability for writes (but still availability for reads) • Many more details: http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/ • Celebrity endorsement: Manta: CAP tradeoffs
implements proper directories, delimited with a forward slash • Manta implements a snapshot/link hybrid dubbed a snaplink; can be used to effect versioning • Manta has full support for CORS headers • Manta uses SSH-based HTTP auth for client-side tooling (IETF draft-cavage-http-signatures-00) • Manta SDKs exist for node.js, Java, Ruby, Python • “npm install manta” for command line interface Manta: Other design principles
big data: stores of record must support computation as a first-class, in situ operation • We believe that Unix is a natural way of expressing this computation — and that the OS is the right level at which to virtualize to support this securely • We believe that ZFS is the only sane storage substrate underpinning for such a system • Manta will surely not be the only system to represent the confluence of these — but it is the first • We are actively retooling our software stack in terms of Manta — Manta is changing the way we develop software! Manta and the future of big data
documentation: http://apidocs.joyent.com/manta/ • IRC, e-mail, Twitter, etc.: #manta on freenode, [email protected], @mcavage, @dapsays, @yunongx, @joyent • Here’s to the orgy of big data one-liners! Manta: More information