Archaeology,” a phrase I designed to be a catchy talk title before actually thinking through what it meant. - The more I thought about it, however, the more I convinced myself that it made sense, and so here we are.
to apply an artifact-driven approach to understanding the history of the ﬁeld of Distributed Systems - So I’m going to try to tell the story of the ﬁeld primarily through its artifacts. Instead of giving a talk and showing the sources at the end, I will use the sources to drive the story.
on the job.” Hodges, 2013 - I’ve been thinking a lot about that too - In this paper Hodges clearly states that new engineers will ﬁnd literature, but not applicable lessons, so he provides some excellent applicable lessons - That delta, between the literature and the lessons, is the space I am trying to understand - So I saturated myself in the literature in order to get some clarity around what the ﬁeld was all about, and why it is so challenging to interact with distributed systems precisely at a time when we need the tools to do things well and easily
reﬂecting, I divided the history of distributed systems up into three rough categories: The Mind, The Proof, and The Market - Each of these categories represents a common thread that I have traced through the literature - And each of these threads contained information and wisdom that helped me understand where we started, and how we got to where we are today
Mind - The first thread is “The Mind: The roots of distributed systems research in Artificial Intelligence” - At the time I pitched this talk, I was deeply influenced by having recently trawled through the archives of the AI group at MIT, particularly the famed “AI Memos” - Amazing papers by amazing people, a wide range of topics - I started to read through papers about understanding and modeling the mind as a distributed system, and got very excited
work, not all of it very comprehensible, what I found was that AI researchers had a freedom to dream about the future - This freedom allowed them to create something special, at a time when anything seemed possible - I chose the work of three individuals: JCR Licklider, Marvin Minsky, and Carl Hewitt to represent the origins of distributed systems in artiﬁcial intelligence
Licklider, 1968 - JCR Licklider worked for the United States government in various capacities, and his story is way too big to be told here - He is the spiritual figurehead of the internet, one of the people responsible for popularizing Vannevar Bush’s work - He also helped to create the AI lab at MIT that employed Minsky, who taught Hewitt - According to legend, Licklider was quite the eccentric. He drank Coke for breakfast. He dreamed about connecting humanity with a large network of computers. And he named that network ‘the intergalactic computer network’ because he knew that his audience, scientists, would find it amusing
communication device” where Licklider is trying to work out how information could travel through networks - We see terms like “nodes,” “ports,” “message processor” - He is using biology as an inspiration for modeling the ﬂow of information
Project MAC at MIT that employed people like Marvin Minsky, allowing him to publish amazing works and teach a generation of students - This progress report covers the depth of the work going on at the MIT lab at the time. - Work began to transition from purely theoretical, to implementation based, and back - The primary concern was understanding how computers and humans could interact - Again, this one slide does Minsky no justice - please read about him to discover more
- One of Minsky’s students was Carl Hewitt, whose work is an inflection point for me in the history of distributed systems - In 1973 he published “A Universal Modular ACTOR Formalism for Artificial Intelligence” that defined the Actor model of computation and discussed his language PLANNER - The actor model describes computation as being performed by a single kind of object, known as an actor, which interacts with other actors only by sending messages
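To make the idea of "a single kind of object" a little more concrete, here is a minimal sketch of actors in Python. It is an illustration only, not Hewitt's formalism or PLANNER: the names `Actor`, `System`, `Counter`, `Collector`, and `demo` are all hypothetical, and the single-threaded scheduler is an assumption made to keep the example deterministic. The only things it tries to show are that each actor keeps private state, handles one message at a time, and talks to other actors solely by sending messages.

```python
from collections import deque


class System:
    """A toy single-threaded scheduler (hypothetical, for illustration only)."""

    def __init__(self):
        self.ready = deque()

    def schedule(self, actor):
        # One entry is queued per pending message.
        self.ready.append(actor)

    def run(self):
        # Deliver messages one at a time until every mailbox is empty.
        while self.ready:
            actor = self.ready.popleft()
            if actor.mailbox:
                actor.receive(actor.mailbox.popleft())


class Actor:
    """An actor owns private state and reacts to one message at a time."""

    def __init__(self, system):
        self.system = system
        self.mailbox = deque()

    def send(self, message):
        # Sending never blocks: the message is queued for later delivery.
        self.mailbox.append(message)
        self.system.schedule(self)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Keeps a count that no other actor can touch directly."""

    def __init__(self, system):
        super().__init__(system)
        self.count = 0

    def receive(self, message):
        kind, reply_to = message
        if kind == "increment":
            self.count += 1
        elif kind == "get":
            # The only way to expose state is to send it as a message.
            reply_to.send(("value", self.count))


class Collector(Actor):
    """Records every message it receives."""

    def __init__(self, system):
        super().__init__(system)
        self.seen = []

    def receive(self, message):
        self.seen.append(message)


def demo():
    system = System()
    counter = Counter(system)
    collector = Collector(system)
    counter.send(("increment", None))
    counter.send(("increment", None))
    counter.send(("get", collector))
    system.run()
    return collector.seen
```

Calling `demo()` returns the messages the collector received. A real actor runtime would deliver messages concurrently and across machines; the sketch leaves that out on purpose.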
- In 1976, Hewitt published “Viewing Control Structures” which discussed his work with the actor model and message passing. - Hewitt’s work, as I mentioned, is an inflection point for me because of how implementation based it is - there is formalism, there is code, there is hardware - Even though it is technical, the humanity is not lost - “Modelling an intelligent person” and “Modeling a society of experts” are stated goals - The actor model is one of the most enduring ideas produced at the MIT AI lab, and it is everywhere - in the semantics of Erlang and in the libraries of many other major programming languages - To recap - the work of Licklider, Minsky, and Hewitt was inherently concerned with the distributed nature of knowledge and how technology could be used to solve a wide variety of problems - The origins of the field of Distributed Systems can be found here - at the very intersection of technology and humanity
distributed systems research - So that covers the Mind. - In parallel with the work on AI at prominent universities around the world, computer science was growing as a field, and the lines between mathematics and its application to technology were being teased out - As a backdrop, it’s important to state that at the time I started researching this work, I had a hard time grasping how you could formalize something as complex as a distributed system - It seemed like too many moving parts, like it was too abstract - I didn’t understand the techniques used to apply something like a mathematical proof, wherein you have to “control the entire universe,” to something like distributed systems - The work of three individuals helped me understand these techniques
Lamport’s names are all likely well known to people at this conference - In this section I’m going to cover Dijkstra and Lynch’s work and brieﬂy mention Lamport’s involvement, and we’ll talk again about Lamport in a little while
- The most well known name in the field - The ACM Symposium on Principles of Distributed Computing has a prize called the Dijkstra Prize in Distributed Computing - One of the strongest voices in Computer Science history on the importance of formalism - this thread starts with him - This paper describes a mutual exclusion algorithm in one gorgeous page
In 1974 Dijkstra published this paper, which was also ahead of its time, rigorous, and elegant. - Self-stabilization is akin to the more modern idea of “fault tolerance,” where systems will eventually settle into an acceptable state - This paper went on to inspire many computer scientists who followed, chief amongst them being Leslie Lamport and Nancy Lynch who both went on to produce an incredible stream of impressive theoretical work
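To give a feel for what "eventually settle into an acceptable state" can mean, here is a small sketch of a K-state token ring of the kind usually associated with Dijkstra's self-stabilization work. Treat it as an illustration under stated assumptions rather than a transcription of the paper: the function names `privileged`, `step`, and `stabilize`, the random choice of which machine moves next, and the `max_steps` cutoff are all hypothetical choices of mine. Each machine holds a number from 0 to K-1 (with K at least the number of machines), a machine is "privileged" when a simple local test holds, and the acceptable states are those in which exactly one machine is privileged.

```python
import random


def privileged(states):
    """Return the indices of machines that currently hold a privilege."""
    n = len(states)
    holders = []
    for i in range(n):
        if i == 0:
            # Machine 0 is privileged when it matches the last machine.
            if states[0] == states[n - 1]:
                holders.append(0)
        elif states[i] != states[i - 1]:
            # Every other machine is privileged when it differs from its left neighbor.
            holders.append(i)
    return holders


def step(states, k, i):
    """Let privileged machine i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, rng, max_steps=10_000):
    """Move randomly chosen privileged machines until exactly one privilege remains.

    Returns the number of moves taken. The random scheduler and the
    max_steps cutoff are assumptions of this sketch.
    """
    for steps in range(max_steps):
        holders = privileged(states)
        if len(holders) == 1:
            return steps
        step(states, k, rng.choice(holders))
    raise RuntimeError("did not stabilize within max_steps")


# Usage: start from an arbitrary (possibly corrupted) configuration.
ring = [3, 1, 4, 1, 2]
moves = stabilize(ring, k=5, rng=random.Random(0))
```

The point of the exercise is that no machine needs a global view: each one looks only at a neighbor, yet the ring as a whole recovers from any starting configuration and then keeps circulating a single privilege.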
- The first paper of Nancy Lynch’s that I would like to discuss (actually Lynch et al.) is “Impossibility...” - One of the field’s most influential papers - The result (known as the “FLP impossibility result”) states that in an *asynchronous network,* distributed consensus is *impossible* to guarantee if even one process may fail - The type of proof Dr. Lynch offered is also known as a “negative result,” one that proves that something is NOT possible, not that something else IS - It is interesting to consider what role this *type* of proof has, how it is perceived, and what impact it has had on the field. I’ll discuss this again a little bit later
‘solved’ in practice; rather, they point up the need for more refined models of distributed computing.” Lynch, 1982 - In discussing the results of the paper in its conclusion, Lynch says the following - A big “aha!” moment for me - I had some experience studying the formal principles underlying programming languages - Reasoning about a language’s operational semantics (if it has one) and using the language in practice are very different things, for a variety of reasons - So it clicked: in order to reason about formal properties of systems, assumptions have to be made - All such theorems are based on these kinds of assumptions - This technique is essential and it is absolutely brilliant, but it is a point often missed by practitioners
In 1989, Lynch published this fantastic paper that collects the work in the intervening years since her 1982 paper - Stumbled upon this in Dr. Lynch’s list of works on her web page - these pages are invaluable to any budding archaeologists in the room - What is so great about this paper is how Lynch playfully collects, distills, and reports on the work in a ﬁeld she helped to pioneer
Lynch, 1989 - Amongst the hundred proofs that Lynch surveys, she finds that they all have but one thing in common - In other words, asynchronous networks and the potential for failure in other nodes make certain assumptions impossible - By 1989, over 100 papers had been published based on these assumptions - What kind of impact did that have on the researchers who followed Dr. Lynch? On practice?
services” Lynch, 2002 - Fast forward 13 years and Lynch publishes this landmark work, which starts with Dr. Brewer’s CAP “conjecture” and turns it into the CAP “theorem” through the application of a so-called “negative result” - Many speakers at this conference will address this theorem; Lindsey Kuper did a much better job yesterday than I could hope to in explaining it - By the time I looked into this paper, I was already searching for those assumptions - what are those frozen FACTS about the system that allow the result to be produced? - The thing to focus on for the purposes of this talk is that the C, A, and P in CAP are *formalized* to be *very specific definitions* of Consistency, Availability, and Partition-Tolerance - Just as the term “impossible” has its own specific meaning - Just as ALL formalisms contain tradeoffs between correctness and expressivity - It’s easy to get confused - It is an irony that those who actually produce rigorous proofs are often those that are the most misunderstood - The *proof* gets divorced from the work that went into it - the WHY gets lost - A game of telephone
- So far we’ve discussed the philosophical and humanity-based origins of distributed systems research in the work of Licklider, Minsky and Hewitt, and the formal origins in the work of Dijkstra and Lynch - My job motivated me to be professionally competent in distributed systems programming - I know many others are in the same position - Clearly there is a great commercial interest in distributed systems - The existence of this conference helps validate this idea - It made me wonder - has this always been the case? - The researchers we have covered up until this point have mostly impacted the ﬁeld in terms of theory - who had actual industrial involvement?
Control” Lamport, 1978 - Well, Peter Bailis’s #1, Leslie Lamport, of course - He wrote this 1978 paper, which I discovered on Lamport’s page on the Microsoft Research site - It is worth noting that it is one of the greatest web pages of all time, ripe for many archaeological explorations - For example, the entry for this paper in Lamport’s page contains the following quote
flying commercial aircraft, NASA began funding research to figure out how to make them reliable enough for the task.” Lamport, 1978 - Wow. Let’s stare at this for a second. - Part of the NASA funding included the SIFT (Software Implemented Fault Tolerance) project - Lamport helped to theorize and prove the system’s reliability even in the face of malicious (also known as “Byzantine”) faults - Software was now responsible for safely coordinating hardware that was responsible for HUMAN LIFE. Not advertising revenue, HUMAN LIFE. - Lamport could clearly have been covered in the last section on Formalism, as he published many works in response to Dijkstra’s - But his impact, to me, has been felt in other ways
with an Intel engineer he worked with to formally verify multiprocessor memory designs - Lamport has applied techniques of formal verification to a variety of industrial applications - This is why he straddles the section on the Market and the section on Formalism - Lamport claims that high-level specifications, such as those written with the tools provided by his TLA+ language, are essential to verifying industrial systems, concurrent algorithms, and more - TLA+ allows you to write specifications that can then be checked mechanically, in effect getting “compiled” into proofs - I feel that in the long run, his work on TLA+, which makes proving systems more accessible, will be of great importance - I’ve seen mention of it being used at Amazon, for example - It shows that Lamport has made the connection between the theory of distributed systems and one form of its practice - A form of practice that is very different from what most of us do
- Another luminary in the field of distributed systems, Ken Birman, has had quite a bit to say over the years about the mixture of commerce and research, and its impact on practitioners - Birman is noted for his work on Virtual Synchrony and the Isis Toolkit, which is very well covered by his own bit of auto-archeology in 2010’s “A History...” - Virtual Synchrony is a framework for considering work in distributed systems and has had various formulations over the years - Birman’s flirtations with industrial applications of distributed systems are storied. - The New York Stock Exchange, the French Air Traffic Control System, and more were powered by Isis - He is also an outspoken, reflective writer who has participated in workshops and produced papers about the history and impact of distributed systems research.
and Skeen, 1993 “A Response To Cheriton and Skeen’s Criticism Of Causal and Totally Ordered Communication” Birman, 1993 - A famous exchange in the form of two academic papers from 1993 between Birman and two other authors in the ﬁeld, Cheriton and Skeen, can and should be consumed by any fellow obsessives - Cheriton & Skeen published “Understanding...” wherein they critique what they see as the primary thrust of Birman’s work: network level ordered communication - they say it is inefﬁcient and hard to reason about - Birman ﬁred back, claiming that their work was a thinly veiled attack on Isis, and revealing that all three authors had “skin in the game” with respect to trying to sell systems to industrial clients at the time their work was being developed - The papers are a fascinating read and they remind us that researchers are living, breathing human beings who have to survive and want to advance their ideas
“Towards a Cloud-Computing Research Agenda” Birman, 2008 Birman, 2006 - Fast forward more than 10 years, and Birman has some interesting perspectives on the interactions between money, research, and practice, specifically as they pertain to advancements in the field of distributed systems - In these papers, where he is not coming off as a bit bitter about his history, Birman urges his fellow researchers to pursue practical and thus humane solutions to the problems that actual people face. - He has many interesting things to say, from the impact of the “impossibility” idea I discussed previously to the blow that the applicability of transactions and database theory dealt to the field of software reliability. - As a takeaway, Birman’s main idea seems to be that we need to be aware of the impact that the market has on our work, and thus our lives, both as researchers and practitioners
on the market, I wanted to briefly touch on a phenomenon that has had a profound impact on the theory and practice of distributed systems - publications from researchers in “Industrial” settings - Google’s papers in particular have been crucial to the field, and many practitioners I spoke to in preparing for this talk point directly to these papers as the initial sources of interest and access for them. - Here you have companies at a scale that most people will never see actually publishing the techniques they use to do the seemingly impossible. - This feedback loop between large companies and academia is seen by some as a mixed blessing, however. - If the deep academic work pursued by some is considered inapplicable, how could the average practitioner ever hope to leverage the lessons of a company that counts synchronizing atomic clocks between international data centers among its problem areas? - The answer is that sometimes they are applicable, and sometimes they aren’t, and as usual it is up to the practitioner, who often has no training, to make this determination. - This leaves us in an awkward position of wanting to learn from our predecessors while being stung by the impedance mismatch with what we actually _need_
of the main threads that I introduced at the beginning of the talk, and hopefully I have exposed some obvious sources of inspiration and tension for the current state of distributed systems theory and practice - To wrap things up, I am going to cover two areas of pursuit that I think can help usher us towards where we need to be, in order to look back 10 years from now and feel conﬁdent that we have, as a community, successfully conquered, or at least intelligently considered, the issues at play in designing, implementing, and maintaining distributed systems
was surprised to see references to “Distributed languages” throughout the texts - I briefly mentioned PLANNER, but there are literally hundreds of these languages in the literature, and most of them have AWESOME NAMES - However, many developers do not know that a “language for distributed computing” other than Erlang could possibly exist, and I think it is high time to destroy the myth that none does. - I’ll briefly discuss two books that are directly applicable to why I feel that it is important for researchers and practitioners to pursue the advancement of languages for distributed computation
Haridi, 2004 - Concepts, Techniques and Models is a revolutionary Computer Science textbook that completely changed my brain and finally got me to understand the connection between computer programming and computer science, no easy task to be sure - just ask Dijkstra, or anyone unfortunate enough to have worked with me. - CTM is an important book for many reasons, chief amongst them being that it makes the reader realize that small, simple, understandable languages that can be evolved into more complex ones are very powerful for forming intuitions about problems in computer science. - In the book you are exposed to a basic language with a simple underlying formal model that is made more and less advanced over time as various subjects are treated - state is added here and taken away there, distribution is included when it is needed, etc. - Helps you get a sense for the impedance mismatch found between theory and practice
Varela is an author who is clearly inspired by Van Roy and Haridi’s work, and his excellent book on distributed computation takes the position that *understanding concurrent computation is essential to understanding distributed computation,* and proceeds to elucidate various modern formal process calculi that he argues should be the basis for future languages. - Varela describes the terms *distribution* and *mobility* as essential properties for distributed models. *Distribution* is the idea that computation can occur in different locations and *mobility* is the idea that computation can move between these locations. The combination of distribution and mobility is what most modern developers are actually dealing with, but they simply do not have these tools. - Adding distribution and mobility to a formal model can help bridge the gap between theory and practice
in the ﬁeld of Computer Science is a direct result of the application of resources towards the ends of furthering and better understanding humanity. It is a simple fact that the longer we ignore this reality, the more it is to our peril. - The last two sources I’ll mention are deeply thoughtful, humane works.
and Progress in Mathematics”* by the mathematician William Thurston. I came across this paper when Thurston died - I wasn’t sure what to expect but I certainly wasn’t prepared. This paper is an absolute brain-breaking work of painful beauty, and I won’t say much about it besides the fact that everyone here should read it, and that it contains keys to the questions I’m trying to bring to your attention in this talk. - As a short summary, however, Thurston deals with the idea that “progress” in mathematics is often measured by proof, and attempts to understand the impact that this has. - Section 4, “What is a proof,” in particular has direct applicability to this talk - Thurston is discussing his recollection of trying to understand what a proof is and how it works.
documents were not really primary.” Thurston, 1994 - The human factor - how we communicate our proofs is more important than the proofs themselves - How our work gets used matters - Community counts. - From this we can extrapolate that the more we consider humanity, the further we will go
on many of the notes that I’ve brought up here in a much deeper and more intelligent way. - Discusses the impact of the definition of the field by negative proof - Compares Kuhn’s theory of revolutions in natural sciences to Computer Science - definitely a worthwhile read. - Papa’s attempts to understand and contextualize the work done by researchers at various points in innovation cycles are a poignant reminder that our place in time impacts what we do and how effectively we do it. - He also identifies and destroys the notion of the normative impact of describing a proof as “positive” or “negative” - Computer Science is a field which has a unique interaction with mathematical formalisms, and his descriptions of that relationship are gorgeous - It’s essential to ask the big questions.
takeaways - Distributed Systems has a fascinating history - Humanity, Formalism, and Commerce have each had deep effects on the field and those who study and practice it - Each of the groups represented in the triangle on the screen has things to gain and things to lose based on how we pursue theory and practice going forward - To end things, I’ll give one recommendation to each group
systems is an incredibly deep and rich ﬁeld. Studying it has been absolutely thrilling and in addition to a fascinating body of artifacts that are ripe for more archaeological work, the community is generous, motivated, and forward-thinking. - Together we will do amazing things.