CALM and Disorderly Programming in Bloom

Joe Hellerstein, UC Berkeley
Joint work with Peter Alvaro, Neil Conway, David Maier, and William Marczak

Programming Distributed Systems: It’s Time to Talk
Distributed programming is
➔ UBIQUITOUS
➔ HARD
An academic imperative! Minimal activity in industry.
Today: one academic group’s take
➔ Lessons in theory and practice
➔ Initial impact in industry

Outline
Software Mismatch: Order and State in the Cloud
An Ideal: Disorderly Programming for Distributed Systems
A Realization: <~ bloom
Implications: CALM Theorem

The State of Programming Is in Disorder

Von Neumann “Physics”
ORDER
➔ a list of instructions
➔ an array of memory
THE STATE
➔ mutation in time

Cloud “Physics”
DISORDERED TIME
➔ multiple clocks
➔ parallel computation
➔ unordered and interleaved
SHATTERED STATE
➔ local variables
➔ sharded tables
➔ message passing

What if…
…our programming model fit our physical reality? Perhaps we could…
➔ scale up easily
➔ ignore race conditions
➔ tolerate faults reliably
➔ debug naturally
➔ test intelligently
…and better understand our fundamentals.

Outline: Software Mismatch (Order and State in the Cloud); An Ideal (Disorderly Programming for Distributed Systems); A Realization (<~ bloom); Implications (CALM Theorem)

Background: Kitchen Wisdom
Let’s write code that commutes!
➔ atomic updates can be disaggregated
➔ replicas can update in different orders
➔ add idempotence: get retry fault tolerance
OK. How?
➔ Appealing, but never actionable.
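The kitchen wisdom above can be made concrete with a small sketch. It is written in plain Python rather than Bloom, and the names merge, replay, and updates are illustrative assumptions, not part of the talk. Set union stands in for any merge that commutes and is idempotent, so replicas that see the same updates in different orders, with every update retried once, should end in the same state.

```python
import itertools

def merge(state: frozenset, update: frozenset) -> frozenset:
    """Set union: an associative, commutative, idempotent merge."""
    return state | update

def replay(deliveries):
    """Fold a sequence of delivered updates into one replica state."""
    state = frozenset()
    for update in deliveries:
        state = merge(state, update)
    return state

# Three hypothetical updates, delivered to replicas in every possible order.
updates = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]

# Each replica also receives every update twice, as a retrying sender would do.
final_states = {
    replay(list(order) + list(order))
    for order in itertools.permutations(updates)
}
```

If the merge were an overwrite or an append to a list, final_states would hold more than one element; with union it collapses to a single state.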

Disorderly-by-Default Programming
STATE
✔ mergeable types: e.g. sets, relations
❌ mutable variables, ordered structures (lists, dense arrays)
TIME
✔ logical clocks with application semantics
❌ instruction ordering
DISTRIBUTION (STATE + TIME)
✔ unification of storage and communication
❌ messaging libraries

Let Me Show You What I Mean
➔ Textbook example: 2-party communication
➔ The Bloom language: model and syntax
➔ More sophisticated examples

High Noon in  the Land of Two Mountains

Let’s implement that in Bloom
Interfaces akin to data tables.
Speakers write messages into the speak interface.
Listeners insert themselves into the listen interface.
The hear interface is dumped to stdio for debugging.

Let’s implement that in Bloom
We include the RendezvousAPI verbatim.
Rendezvous is a join of speak and listen.
We have turned communication into query processing.
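As a rough illustration of rendezvous as a join, here is a plain Python sketch (not the Bloom code from the slide, which was not captured in this transcript). The tuple layouts (subject, val) for speak and (subject, listener) for listen are assumptions made for the example.

```python
def rendezvous(speak, listen):
    """Join speak(subject, val) with listen(subject, listener) on subject."""
    return {
        (listener, subject, val)
        for (subject, val) in speak
        for (listen_subject, listener) in listen
        if subject == listen_subject
    }

# Both collections hold only what is present "right now" (one synchronous instant).
speak = {("weather", "storm coming")}
listen = {("weather", "east_mountain"), ("harvest", "west_mountain")}

hear = rendezvous(speak, listen)
```

Only the listener whose subject matches a message hears anything; the join has no notion of who sent first, which is why the synchronous version relies on both sides being present at the same instant.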

Outline: Software Mismatch (Order and State in the Cloud); An Ideal (Disorderly Programming for Distributed Systems); A Realization (<~ bloom); Implications (CALM Theorem)

A Disorderly Language <~ bloom
STATE
✔ mergeable types: tables and (other) lattices
TIME
✔ fixpoint-per-tick logical clocks
DISTRIBUTION (STATE + TIME)
✔ async “shuffled” tables
Encourages asynchronous monotonic programming, merging in new information opportunistically over time.

Operational Model <~ bloom
• Independent agents (“nodes”)
• Local state & logic (can be SPMD or MIMD)
• Event-driven loop: one clock “tick” per iteration
One Bloom “Tick”
• inputs: local updates, NW/OS events
• bloom rules: atomic local fixpoint
• instantaneous merge (now <=)
• deferred local updates (next <+)
• async NW/OS msgs (async <~)
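The tick structure can be approximated in a few lines of plain Python. This is a loose sketch under stated assumptions, not how the Bloom runtime is implemented: the Node class, its edges/deferred/outbox fields, and the path-finding rule are all hypothetical, chosen only to show where the three kinds of merge land in time.

```python
class Node:
    """One agent: local state plus an event-driven tick loop (hypothetical sketch)."""

    def __init__(self):
        self.edges = set()     # persistent local state
        self.deferred = set()  # updates scheduled for the next tick (like <+)
        self.outbox = []       # messages handed to the network (like <~)

    def defer(self, edge):
        self.deferred.add(edge)

    def tick(self, inbound=()):
        # Deferred updates and network events become visible at the start of a tick.
        self.edges |= self.deferred | set(inbound)
        self.deferred = set()
        # Instantaneous merges (like <=) run to a fixpoint inside the tick.
        paths = set(self.edges)
        while True:
            new = {(a, d) for (a, b) in paths for (c, d) in self.edges if b == c} - paths
            if not new:
                break
            paths |= new
        # Asynchronous output leaves the node; it is delivered at some later time.
        self.outbox.extend(sorted(paths))
        return paths

node = Node()
first = node.tick(inbound=[(1, 2)])
node.defer((2, 3))
second = node.tick()
```

The deferred edge (2, 3) is invisible during the first tick and only contributes the derived path (1, 3) in the second.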

Statements <~ bloom
<mergeable> <merge op> <mergeable expression>
Merge ops:
<= now
<+ next
<~ async
<- del_next
Mergeable collections:
persistent: table
lattices: lmax, lbool, lmap…
transient: scratch
transient: interface
networked transient: channel
Mergeable expressions (relational operations and lattice functions):
map, flat_map
reduce, group, argmin/max
(r * s).pairs
empty? include?
count, max, min, …
>, <, >=, <=

The Land of Two Mountains

Recall Synchronous Rendezvous
Rendezvous is a join of speak and listen. We have turned communication into query processing.
But what of time? This depends on perfect synchrony (luck).
Asynchronous communication requires persistence.

Persistence Transience + gap-free sequential refresh.

Sender Persists (Signal Fire)

Let’s implement that in Bloom
The spoken table stores all messages.
When a listen message arrives, it can rendezvous with all prior spoken messages.

Receiver Persists (Watchtower)

Let’s implement that in Bloom
The listening table records all the agents who want notifications.
When a speak message arrives, it can rendezvous with all prior listeners.

Both Persist (Signal Fire & Watchtower)

Let’s implement that in Bloom
Each rule for hear joins a channel (events) with a table (state).
Computation is driven by channel arrival.
Either channel can “arrive first” and hear will be populated.
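A plain Python sketch of the both-persist pattern follows; it is not the Bloom program from the slide. The BothPersist class, its method names, and the tuple layouts are assumptions for illustration. Each arrival is stored, then joined against the table kept by the other side.

```python
class BothPersist:
    """Persist both speakers' messages and listeners' registrations (sketch)."""

    def __init__(self):
        self.spoken = set()     # (subject, val): log of messages
        self.listening = set()  # (subject, listener): registry of listeners
        self.hear = set()       # (listener, subject, val): rendezvous results

    def on_speak(self, subject, val):
        self.spoken.add((subject, val))
        # Join the arriving message against stored listeners.
        self.hear |= {(l, subject, val) for (s, l) in self.listening if s == subject}

    def on_listen(self, subject, listener):
        self.listening.add((subject, listener))
        # Join the arriving registration against stored messages.
        self.hear |= {(listener, subject, v) for (s, v) in self.spoken if s == subject}

speaker_first = BothPersist()
speaker_first.on_speak("weather", "storm coming")
speaker_first.on_listen("weather", "east_mountain")

listener_first = BothPersist()
listener_first.on_listen("weather", "east_mountain")
listener_first.on_speak("weather", "storm coming")
```

Both arrival orders populate hear with the same tuple, which is the property the slide describes.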

Distributing this
➔ Up to now, rendezvous in time
➔ What about space?
➔ listener, sender on the same node?!
➔ What if listener and sender on their own nodes?
➔ need a “Join Server”
➔ and proxy logic to reroute interfaces
➔ Good news: once you solve time, space is easy!

Asynchronous merge (<~) into channels.
Like “shuffle” or “exchange”: routed based on values in a demarcated field.

JoinServer simply wires up an   imported Rendezvous

Distributed JoinServer
➔ Can hash-partition JoinServer state on subject for scaling
➔ do this in proxy code; clients unchanged
➔ Can replicate JoinServer state for fault-tolerance
➔ Many possible lattice-based consistency models
➔ Indy KVS project
spoken_map: MapLattice (key: any_type, val: Version Lattice)
Version Lattice: (val: any_type, vc: MapLattice)
vc: MapLattice (node: any_type, val: MaxLattice)
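One way to picture hash-partitioning in the proxy is the plain Python sketch below. It is an assumption-laden illustration, not the talk's proxy code: shard_for, Proxy, and the per-shard dictionaries are hypothetical names, and zlib.crc32 is used only because it gives a hash that is stable across runs.

```python
import zlib

def shard_for(subject: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the subject."""
    return zlib.crc32(subject.encode("utf-8")) % num_shards

class Proxy:
    """Routes speak/listen calls to the shard that owns the subject (sketch)."""

    def __init__(self, num_shards: int):
        self.shards = [{"spoken": set(), "listening": set()} for _ in range(num_shards)]

    def _shard(self, subject):
        return self.shards[shard_for(subject, len(self.shards))]

    def speak(self, subject, val):
        self._shard(subject)["spoken"].add((subject, val))

    def listen(self, subject, listener):
        self._shard(subject)["listening"].add((subject, listener))

    def hear(self):
        # Each shard joins only its own state; matching subjects always share a shard.
        return {
            (listener, subject, val)
            for shard in self.shards
            for (subject, val) in shard["spoken"]
            for (listen_subject, listener) in shard["listening"]
            if subject == listen_subject
        }

proxy = Proxy(num_shards=4)
proxy.speak("weather", "storm coming")
proxy.listen("weather", "east_mountain")
proxy.listen("harvest", "west_mountain")
heard = proxy.hear()
```

Because routing depends only on the subject, callers use the same speak and listen calls they would use against a single server.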

Reflecting Back: Disorderly, Data-Centric Programming
➔ We now have a distributed rendezvous protocol
➔ What is all this?
Speaker Persists: log of messages; Key-Value Store (Database)
Listener Persists: registry of listeners; Publish/Subscribe
Duality of storage and communication! Rendezvous over time.
Choice of “system type” becomes minor code change. Hybrids naturally emerge.
Reduced a hard programming problem to a well-understood database problem!

And it gets better!
Post-hoc distribution, even of server logic
Any table/channel/interface can be treated like a DB table:
➔ scale-out: shard
➔ fault tolerance: replicate
It’s easy to distribute centralized code, post-hoc!
What about the hard parts of distributed databases/systems? Consistency.

What of Consistency?
➔ Easy! We can statically check Bloom code.
➔ budplot looks for order-sensitive dataflows
➔ async communication causes disorder (yellow)
➔ order-sensitive op downstream of disorder? non-deterministic! (red)
➔ Q: What operations are order-sensitive? A: The non-monotone ones
➔ monotone: output grows with input
➔ non-monotone: must base (partial) results on their full input (prefix)

Statements <~ bloom (recap)
Merge ops:
<+ next
<~ async
<- del_next
Mergeable collections:
persistent: table
lattices: lmax, lbool, lmap…
transient: scratch
transient: interface
networked transient: channel
scheduled transient: periodic
Mergeable expressions (relational operations and lattice functions):
map, flat_map
reduce, group, argmin/max
(r * s).pairs
empty? include?
count, max, min, …
>, <, >=, <=

What of Consistency?
➔ Easy! We can statically check Bloom code.
➔ Yes, but our SpeakerPersist code is odd:
➔ All messages in an unordered set
➔ Never deletes or overwrites

A More Typical Speaker Persist
The spoken table is now mutable: one value per subject.
Arrival order of network messages should require us to think about consistency. What does static analysis say?
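To see why the mutable table raises a consistency question, compare it with the append-only version in a small plain Python sketch. The function names and the message tuples are illustrative assumptions; the point is only that overwriting makes the final state depend on arrival order.

```python
def apply_appends(messages):
    """Append-only spoken table: an unordered set of every message."""
    spoken = set()
    for message in messages:
        spoken.add(message)
    return spoken

def apply_overwrites(messages):
    """Mutable spoken table: one value per subject, last arrival wins."""
    spoken = {}
    for subject, val in messages:
        spoken[subject] = val
    return spoken

# The same two network messages, arriving in two different orders.
order_a = [("weather", "sunny"), ("weather", "storm coming")]
order_b = list(reversed(order_a))

appends_agree = apply_appends(order_a) == apply_appends(order_b)
overwrites_agree = apply_overwrites(order_a) == apply_overwrites(order_b)
```

Two replicas that receive these messages in different orders agree under appends and disagree under overwrites, which is the kind of order sensitivity a static check is meant to flag.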

Consistency?

Consistency?
Now what? Two options:
1. avoid non-monotonicity, using vector clocks
2. impose global ordering
Both natural in Bloom. (But one is better, as we’ll discuss :-)

Monotonic Structures: Lattices
(Join Semi-) Lattice: an object class with
- a merge operator (<=) that is Associative, Commutative and Idempotent
- a largest value
See “ACID 2.0” [Campbell/Helland CIDR ’09], CRDTs [Shapiro, et al. INRIA TR 2011]
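The three merge properties can be checked mechanically on small samples. The plain Python sketch below is an illustration only: its MaxLattice and SetLattice classes are hypothetical stand-ins written for this example and are not the Bud classes of similar name, and is_aci tests a handful of values rather than proving anything.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MaxLattice:
    value: int

    def merge(self, other: "MaxLattice") -> "MaxLattice":
        return MaxLattice(max(self.value, other.value))

@dataclass(frozen=True)
class SetLattice:
    items: frozenset

    def merge(self, other: "SetLattice") -> "SetLattice":
        return SetLattice(self.items | other.items)

def is_aci(samples) -> bool:
    """Check associativity, commutativity, and idempotence on the given samples."""
    for a, b, c in product(samples, repeat=3):
        if a.merge(b.merge(c)) != a.merge(b).merge(c):
            return False
        if a.merge(b) != b.merge(a):
            return False
        if a.merge(a) != a:
            return False
    return True

max_ok = is_aci([MaxLattice(n) for n in (0, 3, 7)])
set_ok = is_aci([SetLattice(frozenset(s)) for s in ({"x"}, {"y"}, {"x", "z"})])
```

A merge such as list concatenation would fail the commutativity and idempotence checks, which is why ordered structures are excluded from the mergeable types.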

Vector Clocks in Bloom Lattices
my_vc: MapLattice (key: any_type, val: MaxLattice)
[Figure: example merge (<=) of vector clocks with entries for Joe, Peter, and Phokion]
state do
  lmap :my_vc
end
bootstrap do
  my_vc <= {ip_port => Bud::MaxLattice.new(0)}
end
Bloom lets us compose these lattices just as we compose relational tables/expressions, using merge rules, morphisms, and monotone functions.

Vector Clocks: bloom v. wikipedia
Wikipedia:
• Initially all clocks are zero.
• Each time a process experiences an internal event, it increments its own logical clock in the vector by one.
• Each time a process prepares to send a message, it increments its own logical clock in the vector by one and then sends its entire vector along with the message being sent.
• Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).
Bloom:
bootstrap do
  my_vc <= {ip_port => Bud::MaxLattice.new(0)}
end
bloom do
  next_vc <= out_msg { {ip_port => my_vc.at(ip_port) + 1} }
  out_msg_vc <= out_msg {|m| [m.addr, m.payload, next_vc]}
  next_vc <= in_msg { {ip_port => my_vc.at(ip_port) + 1} }
  next_vc <= my_vc
  next_vc <= in_msg {|m| m.clock}
  my_vc <+ next_vc
end
[“Logic and Lattices”, Conway, et al. SOCC 2012]
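For comparison with the rules quoted from Wikipedia, here is a plain Python sketch of a vector clock. It follows those four rules directly and is not a translation of the Bloom program; the VectorClock class and the node names are assumptions for illustration.

```python
class VectorClock:
    """Per-node vector clock following the four rules listed above (sketch)."""

    def __init__(self, me: str):
        self.me = me
        self.clock = {me: 0}  # initially all clocks are zero

    def _bump(self):
        self.clock[self.me] += 1

    def local_event(self):
        self._bump()

    def send(self) -> dict:
        # Increment own entry, then ship a copy of the whole vector.
        self._bump()
        return dict(self.clock)

    def receive(self, other: dict):
        # Increment own entry, then take the element-wise maximum.
        self._bump()
        for node, count in other.items():
            self.clock[node] = max(self.clock.get(node, 0), count)

joe = VectorClock("joe")
peter = VectorClock("peter")

stamp = joe.send()
peter.receive(stamp)
reply = peter.send()
joe.receive(reply)
```

The element-wise maximum in receive is the same merge a map of max-lattices performs, which is why the Bloom version can express it as a lattice merge.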

Consistency?
Now what? Two options:
1. avoid non-monotonicity
2. impose global ordering, using Paxos
Both natural in Bloom. (But one is better, as we’ll discuss :-)

Paxos: pseudocode v. bloom
Pseudocode:
1. Priest p chooses a new ballot number b greater than lastTried [p], sets lastTried [p] to b, and sends a NextBallot (b) message to some set of priests.
2. Upon receipt of a NextBallot (b) message from p with b > nextBal [q], priest q sets nextBal [q] to b and sends a LastVote (b, v) message to p, where v equals prevVote [q]. (A NextBallot (b) message is ignored if b ≤ nextBal [q].)
3. After receiving a LastVote (b, v) message from every priest in some majority set Q, where b = lastTried [p], priest p initiates a new ballot with number b, quorum Q, and decree d, where d is chosen to satisfy B3. He then sends a BeginBallot (b, d) message to every priest in Q.
4. Upon receipt of a BeginBallot (b, d) message with b = nextBal [q], priest q casts his vote in ballot number b, sets prevVote [q] to this vote, and sends a Voted (b, q) message to p. (A BeginBallot (b, d) message is ignored if b ≠ nextBal [q].)
5. If p has received a Voted (b, q) message from every priest q in Q (the quorum for ballot number b), where b = lastTried [p], then he writes d (the decree of that ballot) in his ledger and sends a Success (d) message to every priest.
Bloom:
lastTried <= (lastTried*nextBallot).pairs(priest=>priest) { |l,n| [l.priest, n.bnum] if n.bnum >= l.old }
nextBallot <= (decreeRequest*lastTried*priestCnt).combos(decreeRequest.priest => lastTried.priest, lastTried.priest => priestCnt.priest) { |d,l,p| [d.priest, l.old+p.cnt, d.decree] }
sendNextBallot <~ (nextBallot*parliament).pairs(n.priest=>p.priest) { |n,p| [p.peer, n.ballot, n.decree, n.priest] }
nextBal <= (nextBal*lastVote).pairs(priest=>priest) { |n,l| [n.priest, l.ballot] if l.ballot >= n.old }
lastVote <= (sendNextBallot*prevVote).pairs(priest=>priest) { |s,p| [s.priest, s.ballot, p.oldBallot, p.oldDecree, s.peer] if s.ballot >= p.oldBallot }
sendLastVote <~ lastVote { |l| [priest, ballot, oldBallot, decree, lord] }
priestCnt <= parliament.group([lord], count)
lastVoteCnt <= sendLastVote.group([lord, ballot], count(Priest))
maxPrevBallot <= sendLastVote.group([lord], max(oldBallot))
quorum(Lord,Ballot) <= (priestCnt*lastVoteCnt).pairs(lord=>lord) { |p,l| [p.lord, p.pcnt] if vnct > (pcnt / 2) }
beginBallot <= (quorum*maxPrevBallot*nextBallot*sendLastVote).combos(quorum.ballot => nextBallot.ballot, nextBallot.ballot => sendLastVote.ballot, quorum.lord => maxPrevBallot.lord, maxPrevBallot.lord => nextBallot.lord, nextBallot.lord => sendLastVote.lord, maxPrevBallot.maxB => sendLastVote.maxB) { |q,m,n,s| m.maxB == -1 ? [q.lord, q.ballot, n.decree] : [q.lord, q.ballot, l.oldDecree] }
sendBeginBallot <~ (beginBallot*parliament).pairs(lord=>lord) { |b,p| [l.priest, b.ballot, b.decree, b.lord] }
vote <= (sendBeginBallot*nextBal).pairs(priest=>priest, ballot=>oldB) { |s,n| [s.priest, s.ballot, s.decree] }
prevVote <= (prevVote*lastVote*vote).combos(prevVote.priest => lastVote.priest, lastVote.priest => vote.priest, lastVote.ballot => vote.ballot, l.decree => v.decree) { |p,l,v| [p.priest, l.ballot, l.decree] if l.ballot >= p.old }
sendVote <~ (vote*sendBeginBallot).pairs(priest=>priest, ballot=>ballot, decree=>decree) { |v,s| [s.lord, v.ballot, v.decree, v.priest] }
voteCnt <= sendVote.group([lord, ballot], count(priest))
decree <= (lastTried*voteCnt*lastVoteCnt*beginBallot).combos(lastTried.lord=>voteCnt.lord, voteCnt.lord=>lastVoteCnt.lord, lastVoteCnt.lord=>beginBallot.lord, lastTried.ballot=>voteCnt.ballot, voteCnt.ballot=>lastVoteCnt.ballot, voteCnt.ballot=>beginBallot.ballot, voteCnt.votes => lastVoteCnt.votes) { |lt,v,lv,b| [lt.lord, lt.ballot, b.decree] }
[“BOOM Analytics”, Alvaro, et al. Eurosys 2010]

How did we do?
➔ scale up easily
➔ ignore race conditions
➔ tolerate faults reliably
➔ debug naturally
➔ test intelligently

Testing Distributed Systems for Fault Tolerance
➔ How to test end-to-end Fault Tolerance?
➔ Lineage-Driven Fault Injection (LDFI)
➔ Molly: an LDFI system [Alvaro, et al. SIGMOD 2015]
➔ Deployment at Netflix [Alvaro, et al. SOCC 2016]

How Real is All This?
➔ Proof Point at Scale: BOOM Analytics [Alvaro, et al. Eurosys 2010]
➔ HDFS and Hadoop scale-out
➔ Industry Adoption of LDFI @ Netflix [Alvaro et al. 2016]
➔ Full-featured Ruby Interpreter for Bloom: (https://github.com/bloom-lang/bud)
➔ Current work:
➔ Fluent: C++ compilation-based Bloom (https://github.com/ucbrise/fluent)
➔ Indy: A dense, elastic Key-Value Store in Fluent

Indy: A Key-Value Store [C. Wu, et al. 2017]
➔ Multicore-performant
➔ Smooth elastic scaling across nodes
➔ Smooth scaling across datacenters
➔ Implements all known coordination-avoiding consistency models from [Bailis, et al. VLDB ’14]

Outline
Software Mismatch: Order and State in the Cloud
An Ideal: Disorderly Programming for Distributed Systems
A Realization: <~ bloom
Implications: CALM Theorem

Consistency is Good! But at what cost?
Two options:
1. avoid non-monotonicity
2. impose global ordering via coordination
But coordination is expensive!
Two-Phase Commit, Paxos, Virtual Synchrony, Raft, etc.
Require nodes to send messages and wait for responses

Distributed Systems Poetry
“The first principle of successful scalability is
to batter the consistency mechanisms down to a minimum
move them off the critical path
hide them in a rarely visited corner of the system
and then make it as hard as possible for application developers
to get permission to use them”
James Hamilton (IBM, MS, Amazon), in Birman, Chockler: “Toward a Cloud Computing Research Agenda”, LADIS 2009

Questions Deserving Answers
➔ What computations require coordination for consistency?
➔ What computations can avoid coordination consistently?

The CALM Theorem: A Bright Line
Consistency As Logical Monotonicity
{coordination-free consistent} <=> {monotonically expressible}
➔ Avoid coordination for monotonic programs!
➔ no waiting!
➔ Monotonic programs are CAP-busters
➔ Consistent and Available during Partitions

Intuition from Rendezvous
Happy case: monotonicity. It “streams”!
➔ At any time, output ⊆ final result
➔ After all messages, output is maximal
➔ Implication: deterministic outcome!
Problem: non-monotonicity. Can’t “stream”.
➔ Intermediate result ⊄ final result
➔ New input refutes previous output
➔ No output until you get entire input.
➔ Ensuring entire input? Coordination.
Only works for BothPersist!!!
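The streaming intuition can be tested on prefixes of an input. The plain Python sketch below uses two hypothetical queries over listen registrations, chosen for this example: a monotone one (subjects with at least one listener) and a non-monotone one (subjects with no listener). It checks whether every prefix output is contained in the final result.

```python
def monotone_query(listens):
    """Subjects with at least one listener: output only grows with input."""
    return {subject for (subject, _listener) in listens}

def non_monotone_query(subjects, listens):
    """Subjects with no listener: a later registration can refute an earlier answer."""
    return subjects - monotone_query(listens)

subjects = {"weather", "harvest"}
stream = [("weather", "east_mountain"), ("harvest", "west_mountain")]

final_monotone = monotone_query(stream)
final_non_monotone = non_monotone_query(subjects, stream)

# Does every prefix output sit inside the final result?
monotone_streams = all(
    monotone_query(stream[:i]) <= final_monotone for i in range(len(stream) + 1)
)
non_monotone_streams = all(
    non_monotone_query(subjects, stream[:i]) <= final_non_monotone
    for i in range(len(stream) + 1)
)
```

The monotone query can emit results as messages arrive. The non-monotone one reports both subjects as unheard on the empty prefix and must retract them later, so a correct answer needs to know the input is complete, which is where coordination comes in.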

CALM History
➔ CALM Conjecture [Hellerstein, PODS ’10, SIGMOD Record 2010]
[Figure: first page of “The Declarative Imperative: Experiences and Conjectures in Distributed Logic”, Joseph M. Hellerstein, University of California, Berkeley; SIGMOD Record, March 2010 (Vol. 39, No. 1)]

M. Hellerstein University of California, Berkeley [email protected] ABSTRACT The rise of multicore processors and cloud computing is putting enormous pressure on the software community to find solu- tions to the difficulty of parallel and distributed programming. At the same time, there is more—and more varied—interest in data-centric programming languages than at any time in computing history, in part because these languages parallelize naturally. This juxtaposition raises the possibility that the theory of declarative database query languages can provide a foun- dation for the next generation of parallel and distributed programming languages. In this paper I reflect on my group’s experience over seven years using Datalog extensions to build networking protocols and distributed systems. Based on that experience, I present a number of theoretical conjectures that may both interest the database community, and clarify important practical issues in distributed computing. Most importantly, I make a case for database researchers to take a leadership role in addressing the impending programming crisis. This is an extended version of an invited lecture at the ACM PODS 2010 conference [32]. 1. INTRODUCTION This year marks the forty-fifth anniversary of Gordon Moore’s paper laying down the Law: exponential growth in the density of transistors on a chip. Of course Moore’s Law has served more loosely to predict the doubling of computing efficiency every eighteen months. This year is a watershed: by the loose accounting, computers should be 1 Billion times faster than they were when Moore’s paper appeared in 1965. Technology forecasters appear cautiously optimistic that Moore’s Law will hold steady over the coming decade, in its strict in- terpretation. But they also predict a future in which continued exponentiation in hardware performance will only be available via parallelism. 
Given the difficulty of parallel programming, this prediction has led to an unusually gloomy outlook for computing in the coming years. At the same time that these storm clouds have been brew- ing, there has been a budding resurgence of interest across the software disciplines in data-centric computation, including declarative programming and Datalog. There is more— and more varied—applied activity in these areas than at any point in memory. The juxtaposition of these trends presents stark alternatives. Will the forecasts of doom and gloom materialize in a storm that drowns out progress in computing? Or is this the long- delayed catharsis that will wash away today’s thicket of imperative languages, preparing the ground for a more fertile declarative future? And what role might the database community play in shaping this future, having sowed the seeds of Datalog over the last quarter century? Before addressing these issues directly, a few more words about both crisis and opportunity are in order. 1.1 Urgency: Parallelism I would be panicked if I were in industry. — John Hennessy, President, Stanford University [35] The need for parallelism is visible at micro and macro scales. In microprocessor development, the connection between the “strict” and “loose” definitions of Moore’s Law has been sev- ered: while transistor density is continuing to grow exponen- tially, it is no longer improving processor speeds. Instead, chip manufacturers are packing increasing numbers of processor cores onto each chip, in reaction to challenges of power con- sumption and heat dissipation. Hence Moore’s Law no longer predicts the clock speed of a chip, but rather its offered degree of parallelism. And as a result, traditional sequential programs will get no faster over time. For the first time since Moore’s paper was published, the hardware community is at the mercy of software: only programmers can deliver the benefits of the Law to the people. 
At the same time, Cloud Computing promises to commodi- tize access to large compute clusters: it is now within the bud- get of individual developers to rent massive resources in the worlds’ largest computing centers. But again, this computing potential will go untapped unless those developers can write programs that harness parallelism, while managing the hetero- geneity and component failures endemic to very large clusters of distributed computers. Unfortunately, parallel and distributed programming today is challenging even for the best programmers, and unwork- able for the majority. In his Turing lecture, Jim Gray pointed to discouraging trends in the cost of software development, and presented Automatic Programming as the twelfth of his dozen grand challenges for computing [26]: develop methods to build software with orders of magnitude less code and effort. As presented in the Turing lecture, Gray’s challenge con- cerned sequential programming. The urgency and difficulty of his twelfth challenge has grown markedly with the technology SIGMOD Record, March 2010 (Vol. 39, No. 1) 5 Declarative Networking: Language, Execution and Optimization Boon Thau Loo∗ Tyson Condie∗ Minos Garofalakis† David E. Gay† Joseph M. Hellerstein∗ Petros Maniatis† Raghu Ramakrishnan‡ Timothy Roscoe† Ion Stoica∗ ∗UC Berkeley, †Intel Research Berkeley and ‡University of Wisconsin-Madison ABSTRACT The networking and distributed systems communities have recently explored a variety of new network architectures, both for application- level overlay networks, and as prototypes for a next-generation In- ternet architecture. In this context, we have investigated declarative networking: the use of a distributed recursive query engine as a powerful vehicle for accelerating innovation in network architectures [23, 24, 33]. Declarative networking represents a significant new application area for database research on recursive query processing. 
In this paper, we address fundamental database issues in this domain. First, we motivate and formally define the Network Datalog (NDlog) language for declarative network specifications. Second, we introduce and prove correct relaxed versions of the traditional semi-naïve query evaluation technique, to overcome fundamental problems of the traditional technique in an asynchronous distributed setting. Third, we consider the dynamics of network state, and formalize the “eventual consistency” of our programs even when bursts of updates can arrive in the midst of query execution. Fourth, we present a number of query optimization opportunities that arise in the declarative networking context, including applications of traditional techniques as well as new optimizations. Last, we present evaluation results of the above ideas implemented in our P2 declarative networking system, running on 100 machines over the Emulab network testbed.
1. INTRODUCTION
The database literature has a rich tradition of research on recursive query languages and processing. This work has influenced commercial database systems to a certain extent. However, recursion is still considered an esoteric feature by most practitioners, and research in the area has had limited practical impact. Even within the database research community, there is longstanding controversy over the practical relevance of recursive queries, going back at least to the Laguna Beach Report [7], and continuing into relatively recent textbooks [35]. In more recent work, we have made the case that recursive query technology has a natural application in the design of Internet infrastructure. We presented an approach called declarative networking that enables declarative specification and deployment of distributed protocols and algorithms via distributed recursive queries over network graphs [23, 24, 33]. We recently described how we implemented and deployed this concept in a system called P2 [23, 33]. Our high-level goal is to provide a software environment that can accelerate the process of specifying, implementing, experimenting with and evolving designs for network architectures. Declarative networking is part of a larger effort to revisit the current Internet Architecture, which is considered by many researchers to be fundamentally ill-suited to handle today’s network uses and abuses [13]. While radical new architectures are being proposed for a “clean slate” design, there are also many efforts to develop application-level “overlay” networks on top of the current Internet, to prototype and roll out new network services in an evolutionary fashion [26]. Whether one is a proponent of revolution or evolution in this context, there is agreement that we are entering a period of significant flux in network services, protocols and architectures. In such an environment, innovation can be better focused and accelerated by having the right software tools at hand. Declarative query approaches appear to be one of the most promising avenues for dealing with the complexity of prototyping, deploying and evolving new network architectures.
∗UC Berkeley authors funded by NSF grants 0205647, 0209108, and 0225660, and a gift from Microsoft.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.
The forwarding tables in network routing nodes can be regarded as a view over changing ground state (network links, nodes, load, operator policies, etc.), and this view is kept correct by the maintenance of distributed queries over this state. These queries are necessarily recursive, maintaining facts about arbitrarily long multi-hop paths over a network of single-hop links. Our initial forays into declarative networking have been promising. First, in declarative routing [24], we demonstrated that recursive queries can be used to express a variety of well-known wired and wireless routing protocols in a compact and clean fashion, typically in a handful of lines of program code. We also showed that the declarative approach can expose fundamental connections: for example, the query specifications for two well-known protocols – one for wired networks and one for wireless – differ only in the order of two predicates in a single rule body. Moreover, higher-level routing concepts (e.g., QoS constraints) can be achieved via simple modifications to the queries. Second, in declarative overlays [23], we extended our framework to support more complex application-level overlay networks such as multicast overlays and distributed hash tables (DHTs). We demonstrated a working implementation of the Chord [34] overlay lookup network specified in 47 Datalog-like rules, versus thousands of lines of C++ for the original version. Our declarative approach to networking promises not only flexibility and compactness of specification, but also the potential to statically check network protocols for security and correctness properties [11]. In addition, dynamic runtime checks to test distributed properties of the network can easily be expressed as declarative queries, providing a uniform framework for network specification, monitoring and debugging [33].
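The excerpt above describes routing state as a recursive query over single-hop links. As a rough illustration only, the following Python sketch evaluates a two-rule path query with a simple semi-naive loop on a single machine. The function name `transitive_paths`, the sample links, and the Datalog-like rules in the docstring are assumptions made for this example; this is not NDlog syntax and not the P2 implementation.

```python
def transitive_paths(links):
    """Single-node, semi-naive style evaluation of the (illustrative) rules:
         path(X, Y) :- link(X, Y).
         path(X, Z) :- link(X, Y), path(Y, Z).
    """
    paths = set(links)   # every path fact derived so far
    delta = set(links)   # facts that were new in the previous round
    while delta:
        # Join links only against the newly derived facts, not all of paths.
        derived = {(x, z) for (x, y) in links for (y2, z) in delta if y == y2}
        delta = derived - paths   # keep only facts not seen before
        paths |= delta
    return paths


links = {("a", "b"), ("b", "c"), ("c", "d")}
for src, dst in sorted(transitive_paths(links)):
    print(src, "->", dst)
```

The loop stops once a round derives nothing new, so cyclic link sets also terminate. Each round only ever adds facts, which is the monotone behavior the later CALM results build on.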
➔ CALM Conjecture  [Hellerstein, PODS ’10, SIGMOD Record 2010] ➔ Monotonicity => Consistency   [Abiteboul PODS 2011, Loo et al., SIGMOD 2006] CALM History

M. Hellerstein, University of California, Berkeley
ABSTRACT The rise of multicore processors and cloud computing is putting enormous pressure on the software community to find solutions to the difficulty of parallel and distributed programming. At the same time, there is more—and more varied—interest in data-centric programming languages than at any time in computing history, in part because these languages parallelize naturally. This juxtaposition raises the possibility that the theory of declarative database query languages can provide a foundation for the next generation of parallel and distributed programming languages. In this paper I reflect on my group’s experience over seven years using Datalog extensions to build networking protocols and distributed systems. Based on that experience, I present a number of theoretical conjectures that may both interest the database community, and clarify important practical issues in distributed computing. Most importantly, I make a case for database researchers to take a leadership role in addressing the impending programming crisis. This is an extended version of an invited lecture at the ACM PODS 2010 conference [32].
1. INTRODUCTION
This year marks the forty-fifth anniversary of Gordon Moore’s paper laying down the Law: exponential growth in the density of transistors on a chip. Of course Moore’s Law has served more loosely to predict the doubling of computing efficiency every eighteen months. This year is a watershed: by the loose accounting, computers should be 1 Billion times faster than they were when Moore’s paper appeared in 1965. Technology forecasters appear cautiously optimistic that Moore’s Law will hold steady over the coming decade, in its strict interpretation. But they also predict a future in which continued exponentiation in hardware performance will only be available via parallelism. Given the difficulty of parallel programming, this prediction has led to an unusually gloomy outlook for computing in the coming years. At the same time that these storm clouds have been brewing, there has been a budding resurgence of interest across the software disciplines in data-centric computation, including declarative programming and Datalog. There is more—and more varied—applied activity in these areas than at any point in memory. The juxtaposition of these trends presents stark alternatives. Will the forecasts of doom and gloom materialize in a storm that drowns out progress in computing? Or is this the long-delayed catharsis that will wash away today’s thicket of imperative languages, preparing the ground for a more fertile declarative future? And what role might the database community play in shaping this future, having sowed the seeds of Datalog over the last quarter century? Before addressing these issues directly, a few more words about both crisis and opportunity are in order.
1.1 Urgency: Parallelism
I would be panicked if I were in industry. — John Hennessy, President, Stanford University [35]
The need for parallelism is visible at micro and macro scales. In microprocessor development, the connection between the “strict” and “loose” definitions of Moore’s Law has been severed: while transistor density is continuing to grow exponentially, it is no longer improving processor speeds. Instead, chip manufacturers are packing increasing numbers of processor cores onto each chip, in reaction to challenges of power consumption and heat dissipation. Hence Moore’s Law no longer predicts the clock speed of a chip, but rather its offered degree of parallelism. And as a result, traditional sequential programs will get no faster over time. For the first time since Moore’s paper was published, the hardware community is at the mercy of software: only programmers can deliver the benefits of the Law to the people.
Relational Transducers for Declarative Networking
TOM J. AMELOOT, Hasselt University & Transnational University of Limburg
FRANK NEVEN, Hasselt University & Transnational University of Limburg
JAN VAN DEN BUSSCHE, Hasselt University & Transnational University of Limburg
Motivated by a recent conjecture concerning the expressiveness of declarative networking, we propose a formal computation model for “eventually consistent” distributed querying, based on relational transducers. A tight link has been conjectured between coordination-freeness of computations, and monotonicity of the queries expressed by such computations. Indeed, we propose a formal definition of coordination-freeness and confirm that the class of monotone queries is captured by coordination-free transducer networks. Coordination-freeness is a semantic property, but the syntactic class of “oblivious” transducers we define also captures the same class of monotone queries. Transducer networks that are not coordination-free are much more powerful.
Categories and Subject Descriptors: H.2 [Database Management]: Languages; H.2 [Database Management]: Systems—Distributed databases; F.1 [Computation by Abstract Devices]: Models of Computation
General Terms: languages, theory
Additional Key Words and Phrases: distributed database, relational transducer, monotonicity, expressive power, cloud programming
ACM Reference Format: AMELOOT, T. J., NEVEN, F. and VAN DEN BUSSCHE, J. 2011. Relational Transducers for Declarative Networking. J. ACM V, N, Article A (January YYYY), 37 pages.
1. INTRODUCTION
Declarative networking [Loo et al. 2009] is a recent approach by which distributed computations and networking protocols are modeled and programmed using formalisms based on Datalog. In his keynote speech at PODS 2010 [Hellerstein 2010a; Hellerstein 2010b], Hellerstein made a number of intriguing conjectures concerning the expressiveness of declarative networking. In the present paper, we are focusing on the CALM conjecture (Consistency And Logical Monotonicity). This conjecture suggests a strong link between, on the one hand, “eventually consistent” and “coordination-free” distributed computations, and on the other hand, expressibility in monotonic Datalog (without negation or aggregate functions). The conjecture was not fully formalized, however; indeed, as Hellerstein notes himself, a proper treatment of this conjecture requires crisp definitions of eventual consistency and coordination, which have been lacking so far. Moreover, it also requires a formal model of distributed computation.
Tom J. Ameloot is a PhD Fellow of the Fund for Scientific Research, Flanders (FWO).
➔ CALM Conjecture [Hellerstein, PODS ’10, SIGMOD Record 2010] ➔ Monotonicity => Consistency [Abiteboul PODS 2011, Loo et al., SIGMOD 2006] ➔ Relational Transducer Proofs [Ameloot, et al. PODS 2012, JACM 2013] CALM History

In the present paper, we are focusing on the CALM conjecture (Consistency And Logical Monotonicity). This conjecture suggests a strong link between, on the one hand, “eventually consistent” and “coordination-free” distributed computations, and on the other hand, expressibility in monotonic Datalog (without negation or aggregate functions). The conjecture was not fully formalized, however; indeed, as Hellerstein notes himself, a proper treatment of this conjecture requires crisp definitions of eventual consistency and coordination, which have been lacking so far. Moreover, it also requires a formal model of distributed computation. Tom J. Ameloot is a PhD Fellow of the Fund for Scientific Research, Flanders (FWO). Author’s email addresses: [email protected], [email protected], [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © YYYY ACM 0004-5411/YYYY/01-ARTA $15.00 DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000 Journal of the ACM, Vol. V, No. N, Article A, Publication date: January YYYY. ➔ CALM Conjecture [Hellerstein, PODS ’10, SIGMOD Record 2010] ➔ Monotonicity => Consistency [Abiteboul PODS 2011, Loo et al., SIGMOD 2006] ➔ Relational Transducer Proofs [Ameloot, et al.
PODS 2012, JACM 2013] [Ameloot et al. PODS 2014] CALM History Weaker Forms of Monotonicity for Declarative Networking: a More Fine-grained Answer to the CALM-conjecture Tom J. Ameloot∗ Hasselt University & transnational University of Limburg [email protected] Bas Ketsman Hasselt University & transnational University of Limburg [email protected] Frank Neven Hasselt University & transnational University of Limburg [email protected] Daniel Zinn LogicBlox, Inc [email protected] ABSTRACT The CALM-conjecture, first stated by Hellerstein [23] and proved in its revised form by Ameloot et al. [13] within the framework of relational transducer networks, asserts that a query has a coordination-free execution strategy if and only if the query is monotone. Zinn et al. [32] extended the framework of relational transducer networks to allow for specific data distribution strategies and showed that the non-monotone win-move query is coordination-free for domain-guided data distributions. In this paper, we complete the story by equating increasingly larger classes of coordination-free computations with increasingly weaker forms of monotonicity and make Datalog variants explicit that capture each of these classes. One such fragment is based on stratified Datalog where rules are required to be connected with the exception of the last stratum. In addition, we characterize coordination-freeness as those computations that do not require knowledge about all other nodes in the network, and therefore, can not globally coordinate. The results in this paper can be interpreted as a more fine-grained answer to the CALM-conjecture.
Categories and Subject Descriptors H.2 [Database Management]: Languages; H.2 [Database Management]: Systems—Distributed databases; F.1 [Computation by Abstract Devices]: Models of Computation Keywords Distributed database, relational transducer, consistency, coordination, expressive power, cloud programming ∗ PhD Fellow of the Fund for Scientific Research, Flanders (FWO). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. PODS’14, June 22–27, 2014, Snowbird, UT, USA. Copyright 2014 ACM 978-1-4503-2375-8/14/06 ...$15.00. http://dx.doi.org/10.1145/2594538.2594541. 1. INTRODUCTION Declarative networking is an approach where distributed computations are modeled and programmed using declarative formalisms based on extensions of Datalog. On a logical level, programs (queries) are specified over a global schema and are computed by multiple computing nodes over which the input database is distributed. These nodes can perform local computations and communicate asynchronously with each other via messages. The model operates under the assumption that messages can never be lost but can be arbitrarily delayed. An inherent source of inefficiency in such systems are the global barriers raised by the need for synchronization in computing the result of queries.
This source of inefficiency inspired Hellerstein [11] to formulate the CALM-principle which suggests a link between logical monotonicity on the one hand and distributed consistency without the need for coordination on the other hand.¹ A crucial property of monotone programs is that derived facts must never be retracted when new data arrives. The latter implies a simple coordination-free execution strategy: every node sends all relevant data to every other node in the network and outputs new facts from the moment they can be derived. No coordination is needed and the output of all computing nodes is consistent. This observation motivated Hellerstein [23] to formulate the CALM-conjecture which, in its revised form², states “A query has a coordination-free execution strategy iff the query is monotone.” Ameloot, Neven, and Van den Bussche [13] formalized the conjecture in terms of relational transducer networks and provided a proof. Zinn, Green, and Ludäscher [32] subsequently showed that there is more to this story. In particular, they obtained that when computing nodes are increasingly more knowledgeable on how facts are distributed, increasingly more queries can be computed in a coordination-free manner. Zinn et al. [32] considered two extensions of the original transducer model introduced in [13]. In the first extension, here referred to as the policy-aware model, every computing node is aware of the facts that should be assigned to it and can consequently evaluate negation over schema relations. In the second extension, referred to as the ¹ CALM stands for Consistency And Logical Monotonicity. ² The original conjecture replaced monotone by Datalog [13].
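The coordination-free strategy described in the abstract above (every node receives all facts and emits output as soon as it can be derived) can be illustrated with a small simulation. This is a minimal sketch in Python, not code from Bloom or from the cited papers; the names `reachable`, `run_node`, and `simulate` and the sample edge list are hypothetical. It delivers the same edge facts to a node in many shuffled orders, with duplicates, and checks that the output only ever grows.

```python
import random


def reachable(edges):
    """Monotone query: transitive closure of a set of (src, dst) edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def run_node(messages):
    """Consume facts one at a time and emit output as soon as it is derivable.

    The assertion checks the monotone property: output is never retracted.
    """
    known, output = set(), set()
    for fact in messages:
        known.add(fact)
        new_output = reachable(known)
        assert output <= new_output  # earlier answers stay valid
        output = new_output
    return output


# Hypothetical input: two disconnected chains of single-hop links.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]


def simulate(seed):
    """Deliver the edges in a seed-dependent order, with some duplicates."""
    rng = random.Random(seed)
    messages = EDGES + EDGES[:2]  # duplicates model at-least-once delivery
    rng.shuffle(messages)
    return run_node(messages)


results = [simulate(seed) for seed in range(20)]
all_agree = all(r == results[0] for r in results)
```

Every delivery order ends with the same set of reachable pairs, so no barrier or agreement protocol is needed to decide when an emitted fact is safe.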

M. Hellerstein University of California, Berkeley [email protected] ABSTRACT The rise of multicore processors and cloud computing is putting enormous pressure on the software community to find solutions to the difficulty of parallel and distributed programming. At the same time, there is more, and more varied, interest in data-centric programming languages than at any time in computing history, in part because these languages parallelize naturally. This juxtaposition raises the possibility that the theory of declarative database query languages can provide a foundation for the next generation of parallel and distributed programming languages. In this paper I reflect on my group’s experience over seven years using Datalog extensions to build networking protocols and distributed systems. Based on that experience, I present a number of theoretical conjectures that may both interest the database community, and clarify important practical issues in distributed computing. Most importantly, I make a case for database researchers to take a leadership role in addressing the impending programming crisis. This is an extended version of an invited lecture at the ACM PODS 2010 conference [32]. 1. INTRODUCTION This year marks the forty-fifth anniversary of Gordon Moore’s paper laying down the Law: exponential growth in the density of transistors on a chip. Of course Moore’s Law has served more loosely to predict the doubling of computing efficiency every eighteen months. This year is a watershed: by the loose accounting, computers should be 1 Billion times faster than they were when Moore’s paper appeared in 1965. Technology forecasters appear cautiously optimistic that Moore’s Law will hold steady over the coming decade, in its strict interpretation. But they also predict a future in which continued exponentiation in hardware performance will only be available via parallelism.
Given the difficulty of parallel programming, this prediction has led to an unusually gloomy outlook for computing in the coming years. At the same time that these storm clouds have been brewing, there has been a budding resurgence of interest across the software disciplines in data-centric computation, including declarative programming and Datalog. There is more, and more varied, applied activity in these areas than at any point in memory. The juxtaposition of these trends presents stark alternatives. Will the forecasts of doom and gloom materialize in a storm that drowns out progress in computing? Or is this the long-delayed catharsis that will wash away today’s thicket of imperative languages, preparing the ground for a more fertile declarative future? And what role might the database community play in shaping this future, having sowed the seeds of Datalog over the last quarter century? Before addressing these issues directly, a few more words about both crisis and opportunity are in order. 1.1 Urgency: Parallelism “I would be panicked if I were in industry.” (John Hennessy, President, Stanford University [35]) The need for parallelism is visible at micro and macro scales. In microprocessor development, the connection between the “strict” and “loose” definitions of Moore’s Law has been severed: while transistor density is continuing to grow exponentially, it is no longer improving processor speeds. Instead, chip manufacturers are packing increasing numbers of processor cores onto each chip, in reaction to challenges of power consumption and heat dissipation. Hence Moore’s Law no longer predicts the clock speed of a chip, but rather its offered degree of parallelism. And as a result, traditional sequential programs will get no faster over time. For the first time since Moore’s paper was published, the hardware community is at the mercy of software: only programmers can deliver the benefits of the Law to the people.
At the same time, Cloud Computing promises to commoditize access to large compute clusters: it is now within the budget of individual developers to rent massive resources in the world’s largest computing centers. But again, this computing potential will go untapped unless those developers can write programs that harness parallelism, while managing the heterogeneity and component failures endemic to very large clusters of distributed computers. Unfortunately, parallel and distributed programming today is challenging even for the best programmers, and unworkable for the majority. In his Turing lecture, Jim Gray pointed to discouraging trends in the cost of software development, and presented Automatic Programming as the twelfth of his dozen grand challenges for computing [26]: develop methods to build software with orders of magnitude less code and effort. As presented in the Turing lecture, Gray’s challenge concerned sequential programming. The urgency and difficulty of his twelfth challenge has grown markedly with the technology SIGMOD Record, March 2010 (Vol. 39, No. 1) Declarative Networking: Language, Execution and Optimization Boon Thau Loo∗ Tyson Condie∗ Minos Garofalakis† David E. Gay† Joseph M. Hellerstein∗ Petros Maniatis† Raghu Ramakrishnan‡ Timothy Roscoe† Ion Stoica∗ ∗UC Berkeley, †Intel Research Berkeley and ‡University of Wisconsin-Madison ABSTRACT The networking and distributed systems communities have recently explored a variety of new network architectures, both for application-level overlay networks, and as prototypes for a next-generation Internet architecture. In this context, we have investigated declarative networking: the use of a distributed recursive query engine as a powerful vehicle for accelerating innovation in network architectures [23, 24, 33]. Declarative networking represents a significant new application area for database research on recursive query processing.
In this paper, we address fundamental database issues in this domain. First, we motivate and formally define the Network Datalog (NDlog) language for declarative network specifications. Second, we introduce and prove correct relaxed versions of the traditional semi-naïve query evaluation technique, to overcome fundamental problems of the traditional technique in an asynchronous distributed setting. Third, we consider the dynamics of network state, and formalize the “eventual consistency” of our programs even when bursts of updates can arrive in the midst of query execution. Fourth, we present a number of query optimization opportunities that arise in the declarative networking context, including applications of traditional techniques as well as new optimizations. Last, we present evaluation results of the above ideas implemented in our P2 declarative networking system, running on 100 machines over the Emulab network testbed. 1. INTRODUCTION The database literature has a rich tradition of research on recursive query languages and processing. This work has influenced commercial database systems to a certain extent. However, recursion is still considered an esoteric feature by most practitioners, and research in the area has had limited practical impact. Even within the database research community, there is longstanding controversy over the practical relevance of recursive queries, going back at least to the Laguna Beach Report [7], and continuing into relatively recent textbooks [35]. In more recent work, we have made the case that recursive query technology has a natural application in the design of Internet infrastructure. We presented an approach called declarative networking ∗UC Berkeley authors funded by NSF grants 0205647, 0209108, and 0225660, and a gift from Microsoft.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00. that enables declarative specification and deployment of distributed protocols and algorithms via distributed recursive queries over network graphs [23, 24, 33]. We recently described how we implemented and deployed this concept in a system called P2 [23, 33]. Our high-level goal is to provide a software environment that can accelerate the process of specifying, implementing, experimenting with and evolving designs for network architectures. Declarative networking is part of a larger effort to revisit the current Internet Architecture, which is considered by many researchers to be fundamentally ill-suited to handle today’s network uses and abuses [13]. While radical new architectures are being proposed for a “clean slate” design, there are also many efforts to develop application-level “overlay” networks on top of the current Internet, to prototype and roll out new network services in an evolutionary fashion [26]. Whether one is a proponent of revolution or evolution in this context, there is agreement that we are entering a period of significant flux in network services, protocols and architectures. In such an environment, innovation can be better focused and accelerated by having the right software tools at hand. Declarative query approaches appear to be one of the most promising avenues for dealing with the complexity of prototyping, deploying and evolving new network architectures.
➔ CALM Conjecture [Hellerstein, PODS ’10, SIGMOD Record 2010] ➔ Monotonicity => Consistency [Abiteboul PODS 2011, Loo et al., SIGMOD 2006] ➔ Relational Transducer Proofs [Ameloot, et al. PODS 2012, JACM 2013] [Ameloot et al. PODS 2014] ➔ Napkin-sized proof [Hellerstein & Alvaro 2017?] CALM History
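The Loo et al. abstract on this slide mentions semi-naïve evaluation of recursive queries over single-hop links. The sketch below shows the classic single-node, textbook form of that technique for a two-rule path query. It is an illustration under stated assumptions, not the relaxed distributed variant the paper introduces and not NDlog or P2 code; the function name `semi_naive_paths` and the sample links are hypothetical.

```python
def semi_naive_paths(links):
    """Evaluate the recursive query

        path(x, y) :- link(x, y).
        path(x, z) :- link(x, y), path(y, z).

    Each round joins the links only against the facts that were new in
    the previous round (the delta), instead of against every known path.
    """
    links = set(links)
    paths = set(links)   # base rule: every link is a path
    delta = set(links)   # facts discovered in the previous round
    rounds = 0
    while delta:
        rounds += 1
        derived = {
            (x, z)
            for (x, y) in links
            for (y2, z) in delta
            if y == y2
        }
        delta = derived - paths   # keep only genuinely new facts
        paths |= delta
    return paths, rounds


# Hypothetical three-hop chain of single-hop links.
chain_paths, chain_rounds = semi_naive_paths([("a", "b"), ("b", "c"), ("c", "d")])

# A cycle still terminates, because a round that derives nothing new
# leaves the delta empty.
cycle_paths, _ = semi_naive_paths([("a", "b"), ("b", "a")])
```

Subtracting the known paths from each round's derivations is what bounds the work on cyclic link graphs.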

➔ Immerman-Vardi Theorem ➔ Same monotonicity as CALM?! ➔ Consistency
<=> Monotonicity <=> PTIME! ➔ Can avoid coordination for all polynomial-time computations?! An Intriguing Connection

1. Fluent: disorderly programming toolkit ➔ C++ Libraries ➔ Lattices, Relational Algebra
➔ Rule Registry/Execution ➔ Static C++ typechecking via template metaprogramming ➔ Fluent Debugger ➔ Distributed data lineage ➔ Distributed tracing 2. Familiar programming models ➔ Can we skin Fluent with: RPC, State Machines, Actors, Futures, etc? Current and Future Work 3. Dense Clouds ➔ High-performance, coordination-free code at multiple scales ➔ Cores to servers to the globe 4. Fundamentals ➔ Constructions for coordination-free polynomial-time programs? ➔ General? ➔ Code synthesis? ➔ Abiding theoretical questions ➔ Stochastic CALM? 5. Applications ➔ RL and Robotics?
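The slide lists lattices among Fluent's building blocks. As a rough illustration of why lattice types suit disorderly programming, here is a minimal sketch in Python (Fluent itself is a C++ library, and nothing below is its API). The classes `MaxInt` and `GrowSet` and the helper `merge_all` are hypothetical names. The point shown is that a merge that is associative, commutative, and idempotent yields the same state whatever the order or duplication of updates.

```python
from dataclasses import dataclass
from functools import reduce
from itertools import permutations


@dataclass(frozen=True)
class MaxInt:
    """Integers ordered by <=, merged by taking the maximum."""
    value: int

    def merge(self, other):
        return MaxInt(max(self.value, other.value))


@dataclass(frozen=True)
class GrowSet:
    """Sets ordered by inclusion, merged by union."""
    items: frozenset

    def merge(self, other):
        return GrowSet(self.items | other.items)


def merge_all(updates):
    """Fold a non-empty sequence of lattice values into one state."""
    return reduce(lambda a, b: a.merge(b), updates)


updates = [
    GrowSet(frozenset({"a"})),
    GrowSet(frozenset({"b"})),
    GrowSet(frozenset({"a", "c"})),
]

# Try every delivery order, re-delivering the first update at the end
# to model a retry; collect the distinct final states.
outcomes = {merge_all(list(p) + [p[0]]) for p in permutations(updates)}

high_water_mark = merge_all([MaxInt(3), MaxInt(7), MaxInt(5)])
```

Because `outcomes` collapses to a single value, replicas applying these updates need no agreement on delivery order.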

CALM ➔ Seek monotonicity, avoid coordination. ➔ Move up the
stack! ➔ Historic focus on Read/Write API distracts from what’s possible in application logic Bloom ➔ Disorderly programming can radically simplify distributed programming ➔ Data-centric: be it “declarative”, “reactive”, “dataflow”, etc. ➔ Revolution (Bloom vs. Java?) or Evolution (Bloom vs. LLVM IR?) Ambitious systems work in an era of maturity? ➔ If it’s doable, somebody is already doing it ➔ Green field problems are the ones with high switching costs ➔ “That will never work!” ➔ Be patient, seek lessons along the way Takeaways
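"Seek monotonicity, avoid coordination" can be made concrete by contrasting two queries over the same stream of facts. The following is a small Python sketch under stated assumptions, not Bloom code; the functions `targets`, `sources_only`, and `grows_along` and the sample stream are hypothetical. The first query only ever adds answers as input arrives; the second uses set difference, so a late-arriving fact can invalidate an answer that was already emitted.

```python
def targets(edges):
    """Monotone: nodes that appear as the destination of some edge."""
    return {dst for (_, dst) in edges}


def sources_only(edges):
    """Non-monotone: nodes that never appear as a destination."""
    nodes = {n for edge in edges for n in edge}
    return nodes - targets(edges)


def grows_along(query, stream):
    """Return True if the query's output never shrinks along this stream."""
    seen, previous = set(), set()
    for fact in stream:
        seen.add(fact)
        current = query(seen)
        if not previous <= current:
            return False  # an earlier answer had to be retracted
        previous = current
    return True


# Hypothetical stream: the last edge closes a cycle back to "a".
stream = [("a", "b"), ("b", "c"), ("c", "a")]

targets_ok = grows_along(targets, stream)
sources_ok = grows_along(sources_only, stream)
```

A node evaluating `sources_only` cannot safely report "a" until it knows no further edges will arrive, and establishing that is exactly the kind of coordination the monotone query avoids.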
