My Story
  C64 Wardialer (~1990)
  Inventory Management System (MS Access + VBA) (~1998)
  Large network deployment tools (1999)
  Content storage system (2000)
  IP Overlay Network (2005)
  Distributed File System (2006)
  Riak (2007-)
  Riak CS (2011-)
Thursday, May 16, 13
The Free Lunch is Over
  2005 article by Herb Sutter
  “A Fundamental Turn Toward Concurrency in Software”
  Today: A fundamental turn towards distributed systems
The Distributed Systems Renaissance
  We’re all distributed systems people now
  Reasons:
    Larger problems
    Increased expectations
  Problems:
    This stuff is hard
Revival and Renewal
  Of interest in formal specification and verification
  Of interest in consensus protocols
  Of new programming languages and paradigms to deal with the complexity of distributed systems
  Of databases!
On Abstractions
  They’re the means by which we reason about complicated things
  They’re the means by which we make progress in software
  But...
Where’s my libPaxos?
  Modern operating systems should have consensus capabilities in the OS
  VMS had DLM (Distributed Lock Manager)
  We’ve regressed!
  If Linux can have 50 toy filesystems, why can’t we have Paxos?
Where’s my libARIES?
  Write-ahead logging should also be a reusable primitive
  Historically hard to implement in a layered fashion
  Stasis (http://code.google.com/p/stasis) is a good start
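To make "write-ahead logging as a primitive" concrete, here is a minimal redo-only sketch. It is not ARIES and not the Stasis API: the `WalStore` class, the JSON record format, and the file name are all assumptions made for illustration. It also leaves out checkpoints, undo, log sequence numbers, and truncation of a torn log tail.

```python
import json
import os
import tempfile


class WalStore:
    """Toy key-value store: every change is logged durably before it is applied."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()                      # recover state from any existing log
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break                   # torn final record from a crash: stop here
                self._apply(record)

    def _apply(self, record):
        if record["op"] == "put":
            self.data[record["key"]] = record["value"]
        elif record["op"] == "delete":
            self.data.pop(record["key"], None)

    def _log_then_apply(self, record):
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())         # force the record to disk first
        self._apply(record)                 # only then change in-memory state

    def put(self, key, value):
        self._log_then_apply({"op": "put", "key": key, "value": value})

    def delete(self, key):
        self._log_then_apply({"op": "delete", "key": key})

    def close(self):
        self.log.close()


def demo():
    """Write some records, drop the in-memory state, and recover it by replay."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.wal")
        store = WalStore(path)
        store.put("a", 1)
        store.put("b", 2)
        store.delete("a")
        store.close()
        recovered = WalStore(path)          # state is rebuilt purely from the log
        state = dict(recovered.data)
        recovered.close()
        return state
```

The ordering in `_log_then_apply` is the whole idea: the log record reaches disk before the state changes, so replaying the log after a crash reproduces every acknowledged write.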
Riak Core
  Dynamo Abstracted
  Not specific to databases
  Reasonably successful, despite sparse documentation:
    Multiple large production deployments (Yahoo, OpenX, StackMob)
    Used in a few university systems classes
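As a rough illustration of what "Dynamo, abstracted" covers, here is a sketch of a Dynamo-style partitioned hash ring with preference lists. This is not Riak Core's API: the `Ring` class, the round-robin ownership, and the partition count are assumptions chosen to keep the example small.

```python
import hashlib

RING_TOP = 2 ** 160  # size of the SHA-1 output space


class Ring:
    """Hypothetical ring: the hash space is cut into equal partitions, each owned by a node."""

    def __init__(self, nodes, num_partitions=8):
        self.num_partitions = num_partitions
        self.partition_size = RING_TOP // num_partitions
        # Assumed claim strategy: hand partitions out round-robin.
        self.owners = [nodes[i % len(nodes)] for i in range(num_partitions)]

    def partition_for(self, key):
        h = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
        # Clamp in case num_partitions does not divide the hash space evenly.
        return min(h // self.partition_size, self.num_partitions - 1)

    def preference_list(self, key, n=3):
        """Return n consecutive (partition, owner) pairs starting at the key's partition."""
        first = self.partition_for(key)
        indexes = [(first + i) % self.num_partitions for i in range(n)]
        return [(idx, self.owners[idx]) for idx in indexes]


ring = Ring(["node_a", "node_b", "node_c"], num_partitions=8)
replicas = ring.preference_list("bucket/key", n=3)
```

Nothing here knows about databases: the ring only answers "which partitions, on which nodes, are responsible for this key", which is why the same machinery can sit under other kinds of distributed services.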
On Testing
  We can prove correctness of distributed algorithms; can we prove correctness of existing distributed systems?
  Unit tests grossly insufficient for large distributed systems
  QuickCheck is an improvement
  “testing only shows the presence, not the absence of bugs” - Dijkstra
QuickCheck
  Write high-level assertions (“properties”) that a function should fulfill
  QuickCheck generates millions of test cases to try to falsify the property
  Code coverage vs. quality of coverage
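The idea can be sketched in a few lines of Python. This is a hand-rolled stand-in, not QuickCheck itself: `for_all`, `int_list`, and the deliberately broken `buggy_sort` are hypothetical names, and the sketch omits the shrinking of counterexamples that real QuickCheck performs.

```python
import random


def for_all(generate, prop, trials=1000, seed=0):
    """Try to falsify prop on random inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = generate(rng)
        if not prop(value):
            return value
    return None


def int_list(rng, max_len=20):
    """Generator: a random-length list of small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]


def buggy_sort(xs):
    # Deliberate bug for the demo: duplicates are silently dropped.
    return sorted(set(xs))


def prop_sorted_is_ordered(xs):
    ys = sorted(xs)
    return all(a <= b for a, b in zip(ys, ys[1:]))


def prop_sort_keeps_length(sort_fn):
    # Build a property for a given sort function.
    return lambda xs: len(sort_fn(xs)) == len(xs)


ok = for_all(int_list, prop_sorted_is_ordered)
counterexample = for_all(int_list, prop_sort_keeps_length(buggy_sort))
```

A unit test with hand-picked inputs such as `[3, 1, 2]` would pass `buggy_sort`; the random generator stumbles onto a list with duplicates without anyone having to think of that case in advance.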
Case Study: Poolboy
  Poolboy: Erlang connection pool library
  Seemed to work fine: unit tests passed, Riak integration tests passed
  A day’s worth of QuickCheck testing revealed bugs in every major piece of functionality
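For stateful code like a pool, the usual approach is to generate random command sequences and compare the real thing against a trivial model. The sketch below shows the shape of such a test in Python. It does not reproduce Poolboy, its API, or the bugs actually found: `BuggyPool`, `FixedPool`, and the planted double check-in bug are invented for illustration.

```python
import random


class BuggyPool:
    """Hypothetical fixed-size worker pool (not Poolboy's API)."""

    def __init__(self, size):
        self.size = size
        self.free = list(range(size))

    def checkout(self):
        return self.free.pop() if self.free else None

    def checkin(self, worker):
        # Deliberate bug: a worker checked in twice lands in the free list twice.
        self.free.append(worker)


class FixedPool(BuggyPool):
    def checkin(self, worker):
        if worker not in self.free:
            self.free.append(worker)


def agrees_with_model(pool, commands):
    """Run commands against the pool and a trivial model; False on any mismatch."""
    model_free = set(range(pool.size))   # the model: which workers should be free
    held = []                            # workers the simulated client holds
    for command in commands:
        if command == "checkout":
            worker = pool.checkout()
            if worker is None:
                if model_free:
                    return False         # pool claims empty, model disagrees
            elif worker not in model_free:
                return False             # pool handed out a busy worker
            else:
                model_free.discard(worker)
                held.append(worker)
        elif command in ("checkin", "double_checkin") and held:
            worker = held.pop()
            pool.checkin(worker)
            if command == "double_checkin":
                pool.checkin(worker)     # misbehaving client returns it twice
            model_free.add(worker)
        if sorted(pool.free) != sorted(model_free):
            return False
    return True


def find_counterexample(make_pool, trials=500, seed=1):
    """Generate random command sequences; return the first one that breaks the pool."""
    rng = random.Random(seed)
    ops = ["checkout", "checkin", "double_checkin"]
    for _ in range(trials):
        commands = [rng.choice(ops) for _ in range(rng.randint(1, 15))]
        if not agrees_with_model(make_pool(), commands):
            return commands
    return None


failing = find_counterexample(lambda: BuggyPool(3))
passing = find_counterexample(lambda: FixedPool(3))
```

The buggy pool passes any test that only uses well-behaved clients, which is the situation the slide describes: the ordinary tests pass, and the bug appears only once the inputs include sequences nobody wrote by hand.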
Problems Remain
  QuickCheck is complex, requires training and practice
  Code evolves separately from tests
  Large up-front effort, tests decay over time
  See: “Hansei: Property-based Development of Concurrent Systems” by Joe Blomstedt of Basho
    Unify model and production code with annotations
  McErlang does exhaustive state-space exploration
Testing vs. Verification
  How do we narrow the conceptual gap between a formal specification (“what”) and its implementation (“how”)?
  Languages to the rescue?
On Languages
  Resurgence of functional, declarative programming
  C++ has closures now!
  Dings in the armor of OO
  Explosion of new languages
  Let’s use existing tools like compilers, static analysis to verify our programs
On Monitoring
  What should we monitor and how?
  We know little about the emergent properties of networks
  Current open-source options all mostly suck
Akamai “Query” System
    SELECT sys.ip ip, procname, rss, pid
    FROM sys, processes
    WHERE sys.ip = processes.ip
      AND (rss*100)/sys.memtotal > 75
      AND sys.ip IN (SELECT ip FROM machinerole WHERE role='dns');
  * Keeping Track of 70,000+ Servers: The Akamai Query System
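To show what the query returns, it can be run against toy data, with SQLite standing in for Akamai's system. The table layouts and the sample rows below are assumptions inferred from the column names in the query, not Akamai's real schema.

```python
import sqlite3


def overloaded_dns_processes():
    """Find processes using more than 75% of memory on machines with the 'dns' role."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema and made-up rows, just enough to exercise the query.
    conn.executescript("""
        CREATE TABLE sys (ip TEXT PRIMARY KEY, memtotal INTEGER);
        CREATE TABLE processes (ip TEXT, procname TEXT, rss INTEGER, pid INTEGER);
        CREATE TABLE machinerole (ip TEXT, role TEXT);
        INSERT INTO sys VALUES ('10.0.0.1', 1000), ('10.0.0.2', 1000);
        INSERT INTO processes VALUES
            ('10.0.0.1', 'named', 800, 101),
            ('10.0.0.1', 'sshd', 10, 102),
            ('10.0.0.2', 'httpd', 900, 201);
        INSERT INTO machinerole VALUES ('10.0.0.1', 'dns'), ('10.0.0.2', 'web');
    """)
    rows = conn.execute("""
        SELECT sys.ip ip, procname, rss, pid
        FROM sys, processes
        WHERE sys.ip = processes.ip
          AND (rss*100)/sys.memtotal > 75
          AND sys.ip IN (SELECT ip FROM machinerole WHERE role='dns')
    """).fetchall()
    conn.close()
    return rows


rows = overloaded_dns_processes()
```

With this data only `named` on 10.0.0.1 is returned: `sshd` is under the memory threshold, and `httpd` is over it but runs on a machine whose role is not `dns`.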
Emergent Property: TCP Incast
  “You can’t pour two buckets of manure into one bucket” - Scott Fritchie’s Grandfather
  “Microbursts” of traffic sent to one cluster member
    Coordinator sends request to three replicas
    All respond with large-ish result at roughly the same time
    Switch has to either buffer or drop packets
  Result: throughput collapse
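The buffer-or-drop step can be illustrated with a toy queue model of one switch port. Every number here (buffer size, drain rate, burst size) is an assumption picked for the example, and the model stops at counting drops: it does not simulate the TCP retransmission timeouts that turn those drops into throughput collapse.

```python
from math import ceil


def simulate_burst(senders, packets_each, buffer_size, drain_per_tick, spread_ticks=1):
    """Return (delivered, dropped) for senders sharing one switch output port.

    Each sender spreads its packets evenly over spread_ticks ticks. The port can
    hold buffer_size packets and forwards drain_per_tick packets per tick.
    """
    queue = delivered = dropped = 0
    remaining = [packets_each] * senders
    per_tick = ceil(packets_each / spread_ticks)
    while queue > 0 or any(remaining):
        for i in range(senders):
            send = min(per_tick, remaining[i])
            remaining[i] -= send
            accepted = min(send, buffer_size - queue)  # whatever fits in the buffer
            queue += accepted
            dropped += send - accepted                 # the rest is dropped
        drained = min(queue, drain_per_tick)
        queue -= drained
        delivered += drained
    return delivered, dropped


one_sender = simulate_burst(1, 40, buffer_size=64, drain_per_tick=10)
three_at_once = simulate_burst(3, 40, buffer_size=64, drain_per_tick=10)
three_spread = simulate_burst(3, 40, buffer_size=64, drain_per_tick=10, spread_ticks=10)
```

In this model a single 40-packet response fits in the buffer, three simultaneous responses overflow it, and the same three responses spread over ten ticks get through without loss. The total traffic is identical in the last two cases; only the timing differs, which is what makes this an emergent property rather than a capacity problem.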
On Teaching
  Are there better ways to explain complicated things, like Paxos?
  Or are they just fundamentally complex and we need to deal with it?
  Do other disciplines have anything to teach us about new/richer models?
In Summary
  The free lunch is over, again!
  It’s an amazing time to be part of this community
  Let’s sharpen our tools and build new ones
  I love all of you! <3 <3 <3