Lessons Learned and Questions Raised From Building Distributed Systems

Keynote from RICON East 2013

Andy Gross

May 14, 2013

Transcript

  1. Lessons Learned and Questions Raised (from building distributed systems)

    Andy Gross <@argv0>, Basho Technologies. Thursday, May 16, 13
  2. “Andy, I studied with Eric Brewer. I know Eric Brewer. Eric Brewer is a friend of mine. Andy, you're no Eric Brewer.”
  3. My Story

    - C64 Wardialer (~1990)
    - Inventory Management System (MS Access + VBA) (~1998)
    - Large network deployment tools (1999)
    - Content storage system (2000)
    - IP Overlay Network (2005)
    - Distributed File System (2006)
    - Riak (2007-)
    - Riak CS (2011-)
  4. The Free Lunch is Over

    - 2005 paper by Herb Sutter: “A fundamental turn towards concurrency in software”
    - Today: A fundamental turn towards distributed systems
  5. The Distributed Systems Renaissance

    - We’re all distributed systems people now
    - Reasons: larger problems, increased expectations
    - Problems: this stuff is hard
  6. “Thank goodness we don't have only serious problems, but ridiculous ones as well.” - Dijkstra
  7. Revival and Renewal

    - Of interest in formal specification and verification
    - Of interest in consensus protocols
    - Of new programming languages and paradigms to deal with the complexity of distributed systems
    - Of databases!
  8. On Abstractions

    - They’re the means by which we reason about complicated things
    - They’re the means by which we make progress in software
    - But...
  9. Where’s my libPaxos?

    - Modern operating systems should have consensus capabilities in the OS
    - VMS had DLM (Distributed Lock Manager)
    - We’ve regressed!
    - If Linux can have 50 toy filesystems, why can’t we have Paxos?
  10. Where’s my libARIES?

    - Write-ahead logging should also be a reusable primitive
    - Historically hard to implement in a layered fashion
    - Stasis (http://code.google.com/p/stasis) is a good start
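To make the write-ahead logging idea on slide 10 concrete, here is a minimal sketch of the core rule: record the change durably before applying it, and rebuild state from the log on restart. The `TinyWAL` class, its JSON record format, and the file name are hypothetical illustrations. This is redo-only and leaves out nearly everything that makes ARIES what it is (undo, checkpoints, log sequence numbers), and it is not related to how Stasis is implemented.

```python
import json
import os
import tempfile


class TinyWAL:
    """Hypothetical append-only log: record the intent durably, then apply it."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        self._replay()
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        # Recovery: re-apply every complete record found in the log.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final write: ignore the incomplete tail
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        # Log first and force it to disk; only then touch in-memory state.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.state[key] = value

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = TinyWAL(path)
    wal.put("a", 1)
    wal.put("b", 2)
    wal.close()  # pretend the process died here
    recovered = TinyWAL(path)
    state = dict(recovered.state)
    recovered.close()
    return state
```

Calling `demo()` writes two records, reopens the log, and returns the state rebuilt purely from the file.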
  11. Riak Core

    - Dynamo Abstracted
    - Not specific to databases
    - Reasonably successful, despite sparse documentation:
      - Multiple large production deployments (Yahoo, OpenX, StackMob)
      - Used in a few university systems classes
  12. Riak Architecture

    [Architecture diagram; labels recovered from the slide]
    - Client APIs: HTTP, Protocol Buffers, Erlang local client
    - Request Coordination: get, put, delete, map-reduce
    - Riak Core: membership, consistent hashing, handoff, node-liveness, gossip, buckets
    - Other labeled components: Erlang/OTP Runtime, Riak KV, vnode master, vnodes, storage backend, JS Runtime
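The "consistent hashing" and "vnodes" labels in the architecture slide can be illustrated with a small sketch: hash a bucket/key pair onto a fixed ring of partitions and take the next few partitions as the replica set. This is a simplification written for illustration, not Riak Core's code. The `Ring` class, the round-robin ownership, the way bucket and key are joined before hashing, and the node names are all assumptions.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the SHA-1 output space


def key_hash(bucket, key):
    # Assumption: bucket and key are simply joined with "/" before hashing.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Hypothetical ring: a fixed number of equal partitions, one vnode each."""

    def __init__(self, nodes, partitions=64):
        self.partitions = partitions
        # Simplified claim: partitions are dealt to nodes round-robin.
        self.owners = [nodes[i % len(nodes)] for i in range(partitions)]

    def partition_for(self, bucket, key):
        # Scale the hash down to a partition index in [0, partitions).
        return key_hash(bucket, key) * self.partitions // RING_SIZE

    def preference_list(self, bucket, key, n=3):
        # Walk clockwise from the key's partition and take n vnodes.
        first = self.partition_for(bucket, key)
        return [
            ((first + i) % self.partitions,
             self.owners[(first + i) % self.partitions])
            for i in range(n)
        ]


ring = Ring(["node_a", "node_b", "node_c", "node_d"])
replicas = ring.preference_list("users", "andy")
```

Because the partition count is fixed, adding a node only changes which node owns some partitions; the key-to-partition mapping stays put, which is what makes handoff a matter of moving whole vnodes.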
  13. On Testing

    - We can prove correctness of distributed algorithms, can we prove correctness of existing distributed systems?
    - Unit tests grossly insufficient for large distributed systems
    - QuickCheck is an improvement
    - “testing only shows the presence, not the absence of bugs” - Dijkstra
  14. QuickCheck

    - Write high-level assertions (“properties”) that a function should fulfill
    - QuickCheck generates millions of test cases to try to falsify the property
    - Code coverage vs. quality of coverage
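The property-based approach on slide 14 can be sketched without any library: state a property, generate random inputs, and report the first input that falsifies it. The loop below is a hand-rolled stand-in, not QuickCheck itself, and it omits the shrinking step that real QuickCheck uses to minimize counterexamples. The function `buggy_dedupe` and its bug are invented for the example.

```python
import random


def buggy_dedupe(items):
    """Meant to remove all duplicates; the bug is that it only drops adjacent ones."""
    out = []
    for x in items:
        if not out or out[-1] != x:
            out.append(x)
    return out


def prop_no_duplicates(items):
    # Property: the result never contains the same value twice.
    result = buggy_dedupe(items)
    return len(result) == len(set(result))


def check(prop, trials=1000, seed=0):
    """Try random small lists; return the first counterexample, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        items = [rng.randint(0, 3) for _ in range(rng.randint(0, 8))]
        if not prop(items):
            return items
    return None


# A typical hand-written unit test passes and hides the bug.
assert buggy_dedupe([1, 1, 2]) == [1, 2]

counterexample = check(prop_no_duplicates)
```

The unit test exercises only the case the author thought of, while the random search quickly lands on an input with a non-adjacent repeat. That is the "code coverage vs. quality of coverage" distinction in miniature.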
  15. Case Study: Poolboy

    - Poolboy: Erlang connection pool library
    - Seemed to work fine: unit tests passed, Riak integration tests passed
    - A day’s worth of QuickCheck testing revealed bugs in every major piece of functionality
  16. Problems Remain

    - QuickCheck is complex, requires training and practice
    - Code evolves separately from tests
    - Large up-front effort, tests decay over time
    - See: “Hansei: Property-based Development of Concurrent Systems” by Joe Blomstedt of Basho
    - Unify model and production code with annotations
    - McErlang does exhaustive state-space exploration
  17. Testing vs. Verification

    - How do we narrow the conceptual gap between a formal specification (“what”) and its implementation (“how”)?
    - Languages to the rescue?
  18. On Languages

    - Resurgence of functional, declarative programming
    - C++ has closures now!
    - Dings in the armor of OO
    - Explosion of new languages
    - Let’s use existing tools like compilers, static analysis to verify our programs
  19. On Monitoring

    - What should we monitor and how?
    - We know little about the emergent properties of networks
    - Current open-source options all mostly suck
  20. Akamai “Query” System

        SELECT sys.ip ip, procname, rss, pid
        FROM sys, processes
        WHERE sys.ip = processes.ip
          AND (rss*100)/sys.memtotal > 75
          AND sys.ip in (SELECT ip FROM machinerole WHERE role='dns');

    *Keeping Track of 70,000+ Servers: The Akamai Query System
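To show what the query on slide 20 selects (processes using more than 75% of memory on machines with the `dns` role), here is a sketch that runs it against an in-memory SQLite database. SQLite only stands in for Akamai's system here. The table schemas are inferred from the column names in the query, and the sample rows are made up.

```python
import sqlite3

QUERY = """
SELECT sys.ip ip, procname, rss, pid
FROM sys, processes
WHERE sys.ip = processes.ip
  AND (rss * 100) / sys.memtotal > 75
  AND sys.ip IN (SELECT ip FROM machinerole WHERE role = 'dns')
"""


def run_demo():
    db = sqlite3.connect(":memory:")
    # Assumed schemas: only the columns the query mentions.
    db.executescript("""
        CREATE TABLE sys (ip TEXT, memtotal INTEGER);
        CREATE TABLE processes (ip TEXT, procname TEXT, rss INTEGER, pid INTEGER);
        CREATE TABLE machinerole (ip TEXT, role TEXT);
    """)
    db.executemany("INSERT INTO sys VALUES (?, ?)", [
        ("10.0.0.1", 1000), ("10.0.0.2", 1000), ("10.0.0.3", 1000),
    ])
    db.executemany("INSERT INTO machinerole VALUES (?, ?)", [
        ("10.0.0.1", "dns"), ("10.0.0.2", "dns"), ("10.0.0.3", "web"),
    ])
    db.executemany("INSERT INTO processes VALUES (?, ?, ?, ?)", [
        ("10.0.0.1", "named", 800, 101),  # dns host, 80% of memory: selected
        ("10.0.0.1", "cron", 10, 102),    # dns host, small process: skipped
        ("10.0.0.2", "named", 500, 201),  # dns host, 50% of memory: skipped
        ("10.0.0.3", "httpd", 900, 301),  # 90% of memory but not a dns host
    ])
    rows = db.execute(QUERY).fetchall()
    db.close()
    return rows
```

The appeal of the approach is that fleet state becomes ordinary relational tables, so an ad hoc operational question is a join and a filter rather than a custom script.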
  21. Emergent Property: TCP Incast

    - “You can’t pour two buckets of manure into one bucket” - Scott Fritchie’s Grandfather
    - “microbursts” of traffic sent to one cluster member
    - Coordinator sends request to three replicas
    - All respond with large-ish result at roughly the same time
    - Switch has to either buffer or drop packets
    - Result: throughput collapse
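The buffering problem behind incast on slide 21 can be sketched as a toy queue model: several senders burst into one switch port that drains at a single sender's rate, and anything that does not fit in the buffer is dropped. The function and all parameters are hypothetical. The model counts drops only and does not simulate the TCP retransmission timeouts that turn those drops into throughput collapse.

```python
def simulate_port(senders, burst_packets, buffer_packets, drain_per_tick):
    """Return (delivered, dropped) for synchronized bursts into one egress port."""
    queue = delivered = dropped = 0
    remaining = [burst_packets] * senders
    while queue or any(remaining):
        # Every sender with data left offers one packet per tick.
        for i in range(senders):
            if remaining[i]:
                remaining[i] -= 1
                if queue < buffer_packets:
                    queue += 1
                else:
                    dropped += 1  # buffer full: the switch drops the packet
        # The port forwards packets at the receiver's line rate.
        sent = min(queue, drain_per_tick)
        queue -= sent
        delivered += sent
    return delivered, dropped


one_replica = simulate_port(1, 100, 16, 1)
three_replicas = simulate_port(3, 100, 16, 1)
```

With one sender the port keeps up and nothing is dropped. With three replicas answering at once, arrivals outpace the drain rate, the buffer fills within a few ticks, and most of the excess traffic is dropped.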
  22. On Teaching

    - Are there better ways to explain complicated things, like Paxos?
    - Or are they just fundamentally complex and we need to deal with it?
    - Do other disciplines have anything to teach us about new/richer models?
  23. q: do i even know how vector clocks work?

    a: kinda, but should i have to?
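For readers who want the short version of how vector clocks work, here is a minimal sketch. Each actor keeps a counter, a clock is a map from actor to counter, and two clocks are concurrent when neither has seen everything the other has. The function names and dict-based representation are choices made for this illustration, not Riak's implementation.

```python
def increment(clock, actor):
    """Return a copy of clock with actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every event that clock b has."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def merge(a, b):
    """Pointwise maximum: the smallest clock that descends from both."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"  # neither saw the other's update: a conflict


base = increment({}, "x")        # written once via actor x
left = increment(base, "y")      # updated via actor y
right = increment(base, "z")     # updated independently via actor z
resolved = merge(left, right)    # clock for a value that reconciles both
```

Here `left` and `right` both descend from `base` but not from each other, so a store would keep both as siblings until a client writes a value carrying the merged clock.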
  24. In Summary

    - The free lunch is over, again!
    - It’s an amazing time to be part of this community
    - Let’s sharpen our tools and build new ones
    - I love all of you! <3 <3 <3