Silence is Golden: Coordination-Avoiding Systems Design

SILENCE IS GOLDEN: COORDINATION-AVOIDING SYSTEMS DESIGN
Peter Bailis, @pbailis
MesosCon 2015 Keynote, 21 August, Seattle, WA

Reasoning about Distribution is Hard

Attendee Login, Room Reservations, Social Media Monitoring, Database
• Should you and I be able to simultaneously reserve rooms?
• Can you reserve a room while I log in?
• Can you tweet while I change my username?

Simple, classic strategy: Hide concurrency by coordinating
Abstraction: Serial access to state
Mechanisms: Replicated State Machines; Consensus (Paxos, VR, Raft); Zookeeper, etcd, Doozer; ACID transactions

Coordination is expensive: Processes cannot make progress independently
This limits: 1.) Scalability 2.) Throughput 3.) Low Latency 4.) Availability

[Figure: DISTRIBUTED TRANSACTIONS (EC2). Throughput (txns/s, LOG SCALE!) vs. Number of Servers (Items) Accessed per Transaction, 1 to 7. Labels: IN-MEMORY LOCKING, COORDINATED. Annotation: -398x.]

133.7+ ms RTT, 85.1+ ms RTT
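A rough back-of-the-envelope sketch of why these round-trip times matter. It assumes (this is not stated on the slide) that every coordinated operation on one contended item must wait at least one full round trip before the next can start. The function name `max_serial_ops_per_second` is hypothetical.

```python
def max_serial_ops_per_second(rtt_ms: float) -> float:
    """Upper bound on back-to-back coordinated operations.

    Assumption: each operation costs at least one round trip and
    operations on the same item cannot overlap.
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


# RTT figures taken from the slide; everything else is illustrative.
for rtt_ms in (133.7, 85.1):
    bound = max_serial_ops_per_second(rtt_ms)
    print(f"{rtt_ms} ms RTT -> at most {bound:.2f} ops/s")
```

Under that assumption the bound is about 7.48 operations per second at 133.7 ms and about 11.75 at 85.1 ms, regardless of how fast the servers are.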

Simple, classic strategy: Hide concurrency by coordinating
Abstraction: Serial access to state
High cost! Fundamental penalties to: Scalability, Throughput, Latency, Availability

Surely there’s a better way to build systems!

Why do we feel it's necessary to yak in order
to be comfortable? That's when you know you've found somebody really special: when you can just shut up for a minute and comfortably share silence.

Scalable systems can just shut up and comfortably share silence

This talk: 1.) Why is shutting up good for systems? 2.) When can systems comfortably share silence?

Why is shutting up good? Coordination-free systems: 1.) Enable infinite scale-out

[Figure: DISTRIBUTED TRANSACTIONS (EC2). Throughput (txns/s) vs. Number of Servers (Items) Accessed per Transaction, 1 to 7. Series: COORDINATED, COORDINATION-FREE. Annotation: -398x.]

Why is shutting up good? Coordination-free systems: 1.) Enable infinite scale-out 2.) Improve throughput 3.) Ensure low latency 4.) Improve availability

“Always on” Availability: any replica can respond to any request

Why is shutting up good? Coordination-free systems: 1.) Enable infinite scale-out 2.) Improve throughput 3.) Ensure low latency 4.) Guarantee “always on” response

Silence is key to scalability!

This talk: 1.) Why is shutting up good for systems? 2.) When can systems comfortably share silence?

THOSE LIGHT CONES

If operations happen concurrently… …ensure their side-effects can be COMPOSED (“merged”) IN A WAY THAT MAKES “SENSE” (invariants over state will hold)

COMPOSED (“merged”): 1+1=2, {“a”}+{“b”}={“a”, “b”}

Invariants over state:
• Counters are positive
• No two talks share a timeslot
• No NULL values
• Usernames are unique

Key question: Can invariants be violated by merging independent operations?

ICT: Invariant Confluence Test [VLDB 2015]

INVARIANT: User IDs are positive
OPERATION: Save new user
MERGE: Add both records to DB
Start: {}
Concurrent operations: add {Stu,ID=1}; add {Ann,ID=1}
After MERGE: {{Stu,ID=1}, {Ann,ID=1}}
Invariant holds!

INVARIANT: User IDs are unique
OPERATION: Save new user
MERGE: Add both records to DB
Start: {}
Concurrent operations: add {Stu,ID=1}; add {Ann,ID=1}
After MERGE: {{Stu,ID=1}, {Ann,ID=1}}
Invariant broken!

ICT passes? Coordination not required
ICT fails? Coordination required
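The two Stu/Ann walkthroughs can be replayed as a small brute-force check. This is only a sketch: the real test is a property over all reachable states, whereas the code below tries one start state and a hand-picked list of operations. The names `find_violation`, `add_user`, `ids_positive`, and `ids_unique` are hypothetical and not from the talk or paper.

```python
from itertools import product


def merge(left, right):
    """MERGE from the slide: add both sets of records to the DB."""
    return left | right


def ids_positive(state):
    return all(user_id > 0 for _, user_id in state)


def ids_unique(state):
    ids = [user_id for _, user_id in state]
    return len(ids) == len(set(ids))


def add_user(name, user_id):
    """OPERATION from the slide: save a new user record."""
    return lambda state: state | {(name, user_id)}


def find_violation(invariant, start, operations):
    """Run every pair of operations independently from `start`.

    Returns (left, right, merged) if both sides are valid on their own
    but the merged state breaks the invariant, otherwise None.
    """
    for op_a, op_b in product(operations, repeat=2):
        left, right = op_a(start), op_b(start)
        if not (invariant(left) and invariant(right)):
            continue  # only locally valid states are interesting
        merged = merge(left, right)
        if not invariant(merged):
            return left, right, merged
    return None


start = frozenset()
ops = [add_user("Stu", 1), add_user("Ann", 1)]

positive_result = find_violation(ids_positive, start, ops)
unique_result = find_violation(ids_unique, start, ops)
```

Here `positive_result` is `None` (no counterexample found among these operations), while `unique_result` holds the merged state containing both Stu and Ann with ID 1. A returned counterexample shows the test fails; an absent one on a small sample is only weak evidence that it passes.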

If operations happen concurrently… …ensure their side-effects can be COMPOSED IN A WAY THAT MAKES “SENSE” (formalized by ICT)

When can we comfortably share silence?

Attendee Login, Room Reservations, Social Media Monitoring, Database
• Can we simultaneously reserve rooms?
• Can I log in while you reserve a room?
• Can I tweet while you change your username?

When operations are composable

Typical database constraints and operations (SQL) [VLDB 2015]

Constraint            Operation   Passes ICT?
Equality, Inequality  Any         Y
Generate unique ID    Any         Y
Specify unique ID     Insert      N
>                     Increment   Y
>                     Decrement   N
<                     Decrement   Y
<                     Increment   N
Foreign Key           Insert      Y
Foreign Key           Delete      Y*
Secondary Indexing    Any         Y
Materialized Views    Any         Y
AUTO_INCREMENT        Insert      N
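The ">" rows can be illustrated with a counter that must stay above a lower bound. This is a minimal sketch that assumes merging means applying every replica's delta to the shared starting value; the function names are hypothetical, and a single passing instance does not prove that an operation passes ICT in general.

```python
def merge_deltas(start, deltas):
    """Assumed merge: apply every replica's delta to the common start."""
    return start + sum(deltas)


def merge_preserves_lower_bound(start, deltas, lower_bound):
    """Check one instance of the invariant `counter > lower_bound`.

    Each replica applies exactly one delta on its own. If every local
    result is valid, the merged result must be valid too.
    """
    locally_valid = all(start + delta > lower_bound for delta in deltas)
    if not locally_valid:
        return True  # nothing to check: a replica would reject locally
    return merge_deltas(start, deltas) > lower_bound


# Two concurrent increments: each local value is 3, merged value is 4.
increments_ok = merge_preserves_lower_bound(start=2, deltas=[1, 1], lower_bound=0)

# Two concurrent decrements: each local value is 1, merged value is 0.
decrements_ok = merge_preserves_lower_bound(start=2, deltas=[-1, -1], lower_bound=0)
```

Here `increments_ok` is `True` and `decrements_ok` is `False`, matching the Y and N entries for ">" with Increment and Decrement. The "<" rows follow by symmetry.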

adopt-a-hydrant alchemy_cms amahi bostonrb boxroom brevidy browsercms bucketwise calagator canvas-lms
carter chiliproject citizenry comas comfortable-mexican-sofa communityengine copycopter-server danbooru diaspora discourse enki fat_free_crm fedena forem fulcrum gitlab-ci gitlabhq govsgo heaven inkwell insoshi jobsworth juvia kandan linuxfr.org lobsters lovd-by-less nimbleshop obtvse onebody opal opencongress opengovernment openproject piggybak publify radiant railscollab redmine refinerycms ror_ecommerce rucksack saasy salor-retail selfstarter sharetribe skyline spot-us spree sprintapp squaresquash sugar teambox tracks tryshoppe wallgig zena

Always coordinating is inefficient!

67 projects; 1.77M LoC; 1957 tables; 9986 total, avg. 5.1 per table

86.9% PASS ICT [SIGMOD 2015]

Everything Happens At Once
Legacy Implementations Overcoordinate

Read Committed RDBMS
Users never read intermediate data
w(name=“peter”); w(name=“pbailis”); commit;
r(name=“peter”); commit;

Classic implementation: lock records during access
Better implementation: use multi-versioning, commit tag
name record: “peter”, “pbailis” OK
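A minimal single-process sketch of the "multi-versioning, commit tag" idea: writers append versions without blocking readers, and readers only ever see the version the commit tag points at. The class `VersionedRecord`, its methods, and the initial value `"old"` are hypothetical. The sketch also assumes a single writer at a time, which a real database would not.

```python
class VersionedRecord:
    """One record that keeps every written version plus a commit tag."""

    def __init__(self, initial):
        self._versions = [initial]   # all versions, committed or not
        self._committed = 0          # commit tag: index readers may see

    def write(self, value):
        """Append an uncommitted version; readers are not blocked."""
        self._versions.append(value)
        return len(self._versions) - 1

    def commit(self, version):
        """Move the commit tag so readers start seeing `version`."""
        self._committed = version

    def read(self):
        """Return the last committed value, never an intermediate one."""
        return self._versions[self._committed]


name = VersionedRecord("old")      # "old" is a placeholder initial value

name.write("peter")                # intermediate write
read_during = name.read()          # reader runs before the commit

final = name.write("pbailis")      # last write of the transaction
name.commit(final)
read_after = name.read()
```

`read_during` is `"old"` and `read_after` is `"pbailis"`. The intermediate value `"peter"` is never returned, and no lock is taken on the read path.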

Everything Happens At Once
Next Level Technique: RAMP Transactions

Desired property: see all updates, or see none
w(status=“talking”); w(loc=“seattle”); commit;
used in indexing, materialized views, foreign keys

Classic implementation: lock records
Result: typically implemented incorrectly at scale

RAMP: multi-versioning with intention metadata
status record: “talking” (@t=10, also loc) OK
loc record: “seattle” (@t=10, also status) OK

Key: Prevent read stalls; Compact metadata [SIGMOD 2014]
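The sketch below shows how intention metadata can let a reader repair a partially visible write without stalling. It is a simplified, single-process illustration loosely modeled on the two-round read described for RAMP, not the published algorithm. The `Partition` class, `atomic_read`, and all method names are hypothetical.

```python
class Partition:
    """Holds versions for the keys it owns (one key per partition here)."""

    def __init__(self):
        self.versions = {}      # (key, ts) -> (value, sibling keys)
        self.last_commit = {}   # key -> timestamp of latest committed version

    def prepare(self, key, ts, value, siblings):
        """Phase 1 of a write: store the version with its metadata."""
        self.versions[(key, ts)] = (value, frozenset(siblings))

    def commit(self, key, ts):
        """Phase 2 of a write: make the version visible to readers."""
        self.last_commit[key] = max(self.last_commit.get(key, 0), ts)

    def get_latest(self, key):
        ts = self.last_commit.get(key, 0)
        value, siblings = self.versions.get((key, ts), (None, frozenset()))
        return ts, value, siblings

    def get_version(self, key, ts):
        # Safe because every partition prepares before any partition commits.
        return self.versions[(key, ts)][0]


def atomic_read(partitions, keys):
    """Round 1: latest committed versions. Round 2: fetch missing siblings."""
    first = {key: partitions[key].get_latest(key) for key in keys}
    required = {key: first[key][0] for key in keys}
    for ts, _, siblings in first.values():
        for sibling in siblings:
            if sibling in required:
                required[sibling] = max(required[sibling], ts)
    result = {}
    for key in keys:
        ts, value, _ = first[key]
        if required[key] != ts:
            value = partitions[key].get_version(key, required[key])
        result[key] = value
    return result


parts = {"status": Partition(), "loc": Partition()}
before = atomic_read(parts, ["status", "loc"])

# Writer at t=10 prepares on both partitions, then commits only on one so far.
parts["status"].prepare("status", 10, "talking", siblings={"loc"})
parts["loc"].prepare("loc", 10, "seattle", siblings={"status"})
parts["status"].commit("status", 10)

during = atomic_read(parts, ["status", "loc"])
```

`before` contains `None` for both keys (the reader sees none of the updates), and `during` contains both `"talking"` and `"seattle"` even though the `loc` commit has not arrived yet. The reader used the "also loc" metadata on `status` to fetch the matching `loc` version instead of waiting.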

TPC-C: 14/16 INVARIANTS PASS ICT

scale to over 25x best listed result
6-11x faster than ACID/serializability

[Figures: Total Throughput (txn/s) and Throughput (txn/s/server) vs. Number of Servers, 0 to 200; Throughput (txns/s) vs. Number of Warehouses, 8 to 64, for Coordination-Avoiding and Serializable (2PL).]

Everything Happens At Once
Key Design Patterns
• Datatype libraries can automatically merge operations e.g., Bloom^L, CRDTs
• Multi-versioning can prevent stalls during partial updates e.g., RAMP, COPS, SwiftCloud
• When you must coordinate, distribute as little as possible e.g., Transaction Chopping
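As one concrete instance of the first pattern, here is a minimal sketch of a grow-only counter, a well-known CRDT. It is an illustration written for this transcript, not code from Bloom^L or any particular CRDT library; the class and method names are hypothetical.

```python
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        current = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = current + amount

    def merge(self, other):
        """Take the per-replica maximum: commutative, associative, idempotent."""
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment()      # replica a counts one event without talking to b
b.increment()      # replica b does the same, concurrently

a.merge(b)
b.merge(a)
a.merge(b)         # merging again changes nothing
```

After the merges both replicas report a value of 2, the same 1+1=2 outcome as the merge example earlier in the talk, and neither replica had to coordinate before incrementing.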

Rethink The API

Read/Write Transaction, Distributed Log, Consensus Object: Are too low level!

The Far Side, Gary Larson

WHAT THE APPLICATION SAYS: “post on timeline” “accept friend request”

WHAT THE SYSTEM HEARS: write read write read write write read write write write read write read read read read read read write write write read read write read write write

The Good Stuff (Papers)

ICT in theory and practice              SIGMOD 2015, VLDB 2015
Coordination-avoiding analytics         CIDR 2015
Index, graph, and view maintenance      SIGMOD 2014
Transaction isolation                   VLDB 2014
Upgrading existing stores               SIGMOD 2013
Quantifying visibility                  VLDB 2012, VLDBJ 2014

To avoid coordination, maximize composability of operations
Scalable systems can comfortably share silence

Joint work with Ali Ghodsi, Alan Fekete, Joe Hellerstein, Ion Stoica, and many others (see bailis.org)

@pbailis

Many illustrations by the Noun Project (CC-Attribution): surprised by Julian Derveaux; world by Wayne Tyler Sall; database by Austin Condiff; earth by Martin Vanco; Woman by Simon Child; Man by Simon Child; Doctor by Simon Child; David-Hockney by Simon Child; Server by Simon Child; clock by christoph robausch
