M. Hellerstein University of California, Berkeley
[email protected] ABSTRACT The rise of multicore processors and cloud computing is putting enormous pressure on the software community to find solu- tions to the difficulty of parallel and distributed programming. At the same time, there is more—and more varied—interest in data-centric programming languages than at any time in com- puting history, in part because these languages parallelize nat- urally. This juxtaposition raises the possibility that the theory of declarative database query languages can provide a foun- dation for the next generation of parallel and distributed pro- gramming languages. In this paper I reflect on my group’s experience over seven years using Datalog extensions to build networking protocols and distributed systems. Based on that experience, I present a number of theoretical conjectures that may both interest the database community, and clarify important practical issues in distributed computing. Most importantly, I make a case for database researchers to take a leadership role in addressing the impending programming crisis. This is an extended version of an invited lecture at the ACM PODS 2010 conference [32]. 1. INTRODUCTION This year marks the forty-fifth anniversary of Gordon Moore’s paper laying down the Law: exponential growth in the density of transistors on a chip. Of course Moore’s Law has served more loosely to predict the doubling of computing efficiency every eighteen months. This year is a watershed: by the loose accounting, computers should be 1 Billion times faster than they were when Moore’s paper appeared in 1965. Technology forecasters appear cautiously optimistic that Moore’s Law will hold steady over the coming decade, in its strict in- terpretation. But they also predict a future in which continued exponentiation in hardware performance will only be avail- able via parallelism. Given the difficulty of parallel program- ming, this prediction has led to an unusually gloomy outlook for computing in the coming years. At the same time that these storm clouds have been brew- ing, there has been a budding resurgence of interest across the software disciplines in data-centric computation, includ- ing declarative programming and Datalog. There is more— and more varied—applied activity in these areas than at any point in memory. The juxtaposition of these trends presents stark alternatives. Will the forecasts of doom and gloom materialize in a storm that drowns out progress in computing? Or is this the long- delayed catharsis that will wash away today’s thicket of im- perative languages, preparing the ground for a more fertile declarative future? And what role might the database com- munity play in shaping this future, having sowed the seeds of Datalog over the last quarter century? Before addressing these issues directly, a few more words about both crisis and opportunity are in order. 1.1 Urgency: Parallelism I would be panicked if I were in industry. — John Hennessy, President, Stanford University [35] The need for parallelism is visible at micro and macro scales. In microprocessor development, the connection between the “strict” and “loose” definitions of Moore’s Law has been sev- ered: while transistor density is continuing to grow exponen- tially, it is no longer improving processor speeds. Instead, chip manufacturers are packing increasing numbers of processor cores onto each chip, in reaction to challenges of power con- sumption and heat dissipation. Hence Moore’s Law no longer predicts the clock speed of a chip, but rather its offered degree of parallelism. And as a result, traditional sequential programs will get no faster over time. For the first time since Moore’s paper was published, the hardware community is at the mercy of software: only programmers can deliver the benefits of the Law to the people. At the same time, Cloud Computing promises to commodi- tize access to large compute clusters: it is now within the bud- get of individual developers to rent massive resources in the worlds’ largest computing centers. But again, this computing potential will go untapped unless those developers can write programs that harness parallelism, while managing the hetero- geneity and component failures endemic to very large clusters of distributed computers. Unfortunately, parallel and distributed programming today is challenging even for the best programmers, and unwork- able for the majority. In his Turing lecture, Jim Gray pointed to discouraging trends in the cost of software development, and presented Automatic Programming as the twelfth of his dozen grand challenges for computing [26]: develop methods to build software with orders of magnitude less code and ef- fort. As presented in the Turing lecture, Gray’s challenge con- cerned sequential programming. The urgency and difficulty of his twelfth challenge has grown markedly with the technology SIGMOD Record, March 2010 (Vol. 39, No. 1) 5 Declarative Networking: Language, Execution and Optimization Boon Thau Loo∗ Tyson Condie∗ Minos Garofalakis† David E. Gay† Joseph M. Hellerstein∗ Petros Maniatis† Raghu Ramakrishnan‡ Timothy Roscoe† Ion Stoica∗ ∗UC Berkeley, †Intel Research Berkeley and ‡University of Wisconsin-Madison ABSTRACT The networking and distributed systems communities have recently explored a variety of new network architectures, both for application- level overlay networks, and as prototypes for a next-generation In- ternet architecture. In this context, we have investigated declara- tive networking: the use of a distributed recursive query engine as a powerful vehicle for accelerating innovation in network architec- tures [23, 24, 33]. Declarative networking represents a significant new application area for database research on recursive query pro- cessing. In this paper, we address fundamental database issues in this domain. First, we motivate and formally define the Network Datalog (NDlog) language for declarative network specifications. Second, we introduce and prove correct relaxed versions of the tra- ditional semi-na¨ ıve query evaluation technique, to overcome fun- damental problems of the traditional technique in an asynchronous distributed setting. Third, we consider the dynamics of network state, and formalize the “eventual consistency” of our programs even when bursts of updates can arrive in the midst of query execution. Fourth, we present a number of query optimization opportunities that arise in the declarative networking context, including applica- tions of traditional techniques as well as new optimizations. Last, we present evaluation results of the above ideas implemented in our P2 declarative networking system, running on 100 machines over the Emulab network testbed. 1. INTRODUCTION The database literature has a rich tradition of research on recursive query languages and processing. This work has influenced com- mercial database systems to a certain extent. However, recursion is still considered an esoteric feature by most practitioners, and re- search in the area has had limited practical impact. Even within the database research community, there is longstanding controversy over the practical relevance of recursive queries, going back at least to the Laguna Beach Report [7], and continuing into relatively re- cent textbooks [35]. In more recent work, we have made the case that recursive query technology has a natural application in the design of Internet infras- tructure. We presented an approach called declarative networking ∗UC Berkeley authors funded by NSF grants 0205647, 0209108, and 0225660, and a gift from Microsoft. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00. that enables declarative specification and deployment of distributed protocols and algorithms via distributed recursive queries over net- work graphs [23, 24, 33]. We recently described how we imple- mented and deployed this concept in a system called P2 [23, 33]. Our high-level goal is to provide a software environment that can accelerate the process of specifying, implementing, experimenting with and evolving designs for network architectures. Declarative networking is part of a larger effort to revisit the cur- rent Internet Architecture, which is considered by many researchers to be fundamentally ill-suited to handle today’s network uses and abuses [13]. While radical new architectures are being proposed for a “clean slate” design, there are also many efforts to develop application-level “overlay” networks on top of the current Internet, to prototype and roll out new network services in an evolutionary fashion [26]. Whether one is a proponent of revolution or evolution in this context, there is agreement that we are entering a period of significant flux in network services, protocols and architectures. In such an environment, innovation can be better focused and ac- celerated by having the right software tools at hand. Declarative query approaches appear to be one of the most promising avenues for dealing with the complexity of prototyping, deploying and evolv- ing new network architectures. The forwarding tables in network routing nodes can be regarded as a view over changing ground state (network links, nodes, load, operator policies, etc.), and this view is kept correct by the maintenance of distributed queries over this state. These queries are necessarily recursive, maintaining facts about ar- bitrarily long multi-hop paths over a network of single-hop links. Our initial forays into declarative networking have been promis- ing. First, in declarative routing [24], we demonstrated that recur- sive queries can be used to express a variety of well-known wired and wireless routing protocols in a compact and clean fashion, typ- ically in a handful of lines of program code. We also showed that the declarative approach can expose fundamental connections: for example, the query specifications for two well-known protocols – one for wired networks and one for wireless – differ only in the or- der of two predicates in a single rule body. Moreover, higher-level routing concepts (e.g., QoS constraints) can be achieved via simple modifications to the queries. Second, in declarative overlays [23], we extended our framework to support more complex application- level overlay networks such as multicast overlays and distributed hash tables (DHTs). We demonstrated a working implementation of the Chord [34] overlay lookup network specified in 47 Datalog-like rules, versus thousands of lines of C++ for the original version. Our declarative approach to networking promises not only flexibil- ity and compactness of specification, but also the potential to stat- ically check network protocols for security and correctness proper- ties [11]. In addition, dynamic runtime checks to test distributed properties of the network can easily be expressed as declarative queries, providing a uniform framework for network specification, monitoring and debugging [33]. A Relational Transducers for Declarative Networking TOM J. AMELOOT, Hasselt University & Transnational University of Limburg FRANK NEVEN, Hasselt University & Transnational University of Limburg JAN VAN DEN BUSSCHE, Hasselt University & Transnational University of Limburg Motivated by a recent conjecture concerning the expressiveness of declarative networking, we propose a formal computation model for “eventually consistent” distributed querying, based on relational transduc- ers. A tight link has been conjectured between coordination-freeness of computations, and monotonicity of the queries expressed by such computations. Indeed, we propose a formal definition of coordination- freeness and confirm that the class of monotone queries is captured by coordination-free transducer net- works. Coordination-freeness is a semantic property, but the syntactic class of “oblivious” transducers we define also captures the same class of monotone queries. Transducer networks that are not coordination-free are much more powerful. Categories and Subject Descriptors: H.2 [ Database Management ]: Languages; H.2 [ Database Manage- ment ]: Systems—Distributed databases; F.1 [ Computation by Abstract Devices ]: Models of Compu- tation General Terms: languages, theory Additional Key Words and Phrases: distributed database, relational transducer, monotonicity, expressive power, cloud programming ACM Reference Format: AMELOOT, T. J., NEVEN, F. and VAN DEN BUSSCHE, J. 2011. Relational Transducers for Declarative Networking. J. ACM V, N, Article A (January YYYY), 37 pages. DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000 1. INTRODUCTION Declarative networking [Loo et al. 2009] is a recent approach by which distributed compu- tations and networking protocols are modeled and programmed using formalisms based on Datalog. In his keynote speech at PODS 2010 [Hellerstein 2010a; Hellerstein 2010b], Heller- stein made a number of intriguing conjectures concerning the expressiveness of declarative networking. In the present paper, we are focusing on the CALM conjecture (Consistency And Logical Monotonicity). This conjecture suggests a strong link between, on the one hand, “eventually consistent” and “coordination-free” distributed computations, and on the other hand, expressibility in monotonic Datalog (without negation or aggregate functions). The conjecture was not fully formalized, however; indeed, as Hellerstein notes himself, a proper treatment of this conjecture requires crisp definitions of eventual consistency and coordina- tion, which have been lacking so far. Moreover, it also requires a formal model of distributed computation. Tom J. Ameloot is a PhD Fellow of the Fund for Scientific Research, Flanders (FWO). Author’s email addresses:
[email protected],
[email protected],
[email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or
[email protected]. c • YYYY ACM 0004-5411/YYYY/01-ARTA $15.00 DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000 Journal of the ACM, Vol. V, No. N, Article A, Publication date: January YYYY. ➔ CALM Conjecture
[Hellerstein, PODS ’10, SIGMOD Record 2010] ➔ Monotonicity => Consistency
[Abiteboul PODS 2011, Loo et al., SIGMOD 2006] ➔ Relational Transducer Proofs
[Ameloot, et al. PODS 2012, JACM 2013]
[Ameloot et al. PODS 2014] ➔ Napkin-sized proof
[Hellerstein & Alvaro 2017?] CALM History Weaker Forms of Monotonicity for Declarative Networking: a More Fine-grained Answer to the CALM-conjecture Tom J. Ameloot ⇤ Hasselt University & transnational University of Limburg
[email protected] Bas Ketsman Hasselt University & transnational University of Limburg
[email protected] Frank Neven Hasselt University & transnational University of Limburg
[email protected] Daniel Zinn LogicBlox, Inc
[email protected] ABSTRACT The CALM-conjecture, first stated by Hellerstein [23] and proved in its revised form by Ameloot et al. [13] within the framework of relational transducer networks, asserts that a query has a coordination-free execution strategy if and only if the query is monotone. Zinn et al. [32] extended the framework of relational transducer networks to allow for spe- cific data distribution strategies and showed that the non- monotone win-move query is coordination-free for domain- guided data distributions. In this paper, we complete the story by equating increasingly larger classes of coordination- free computations with increasingly weaker forms of mono- tonicity and make Datalog variants explicit that capture each of these classes. One such fragment is based on strati- fied Datalog where rules are required to be connected with the exception of the last stratum. In addition, we charac- terize coordination-freeness as those computations that do not require knowledge about all other nodes in the network, and therefore, can not globally coordinate. The results in this paper can be interpreted as a more fine-grained answer to the CALM-conjecture. Categories and Subject Descriptors H.2 [Database Management]: Languages; H.2 [Database Management]: Systems—Distributed databases; F.1 [Com- putation by Abstract Devices]: Models of Computation Keywords Distributed database, relational transducer, consistency, co- ordination, expressive power, cloud programming ⇤ PhD Fellow of the Fund for Scientific Research, Flanders (FWO). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected]. PODS’14, June 22–27, 2014, Snowbird, UT, USA. Copyright 2014 ACM 978-1-4503-2375-8/14/06 ...$15.00. http://dx.doi.org/10.1145/2594538.2594541. 1. INTRODUCTION Declarative networking is an approach where distributed computations are modeled and programmed using declara- tive formalisms based on extensions of Datalog. On a logical level, programs (queries) are specified over a global schema and are computed by multiple computing nodes over which the input database is distributed. These nodes can perform local computations and communicate asynchronously with each other via messages. The model operates under the assumption that messages can never be lost but can be ar- bitrarily delayed. An inherent source of ine ciency in such systems are the global barriers raised by the need for syn- chronization in computing the result of queries. This source of ine ciency inspired Hellerstein [11] to for- mulate the CALM-principle which suggests a link between logical monotonicity on the one hand and distributed consis- tency without the need for coordination on the other hand.1 A crucial property of monotone programs is that derived facts must never be retracted when new data arrives. The latter implies a simple coordination-free execution strategy: every node sends all relevant data to every other node in the network and outputs new facts from the moment they can be derived. No coordination is needed and the output of all computing nodes is consistent. This observation motivated Hellerstein [23] to formulate the CALM-conjecture which, in its revised form2, states “A query has a coordination-free execution strat- egy i↵ the query is monotone.” Ameloot, Neven, and Van den Bussche [13] formalized the conjecture in terms of relational transducer networks and provided a proof. Zinn, Green, and Lud¨ ascher [32] subse- quently showed that there is more to this story. In particu- lar, they obtained that when computing nodes are increas- ingly more knowledgeable on how facts are distributed, in- creasingly more queries can be computed in a coordination- free manner. Zinn et al. [32] considered two extensions of the original transducer model introduced in [13]. In the first extension, here referred to as the policy-aware model, ev- ery computing node is aware of the facts that should be assigned to it and can consequently evaluate negation over schema relations. In the second extension, referred to as the 1CALM stands for Consistency And Logical Monotonicity. 2The original conjecture replaced monotone by Datalog [13].