Distributed Systems Archaeology

My talk at RICON West on the history of the theory and practice of Distributed Systems.

Michael Bernstein

October 30, 2013

Transcript

  1. Distributed Systems Archaeology Michael R. Bernstein October 2013 - Thanks

    for having me, it is truly an honor.
  2. Obsessed - Hello, my name is Michael R. Bernstein, and

    I’m Obsessed. - I want you to be obsessed too.
  3. Distributed Systems Archaeology - This talk is titled “Distributed Systems

    Archaeology,” a phrase I designed to be a catchy talk title before actually thinking through what it meant. - The more I thought about it, however, the more I convinced myself that it made sense, and so here we are.
  4. Distributed Systems Archaeology - Archaeology is defined as “the study

    of human history through its artifacts”
  5. Distributed Systems Archaeology - “Distributed Systems Archaeology” is an attempt

    to apply an artifact-driven approach to understanding the history of the field of Distributed Systems - So I’m going to try to tell the story of the field primarily through its artifacts. Instead of giving a talk and showing the sources at the end, I will use the sources to drive the story.
  6. But Why? - I was a developer, working at a

    startup - The startup started to grow - Suddenly I was a distributed systems programmer - Or maybe I was one all along
  7. Ubiquity - Distributed Systems are *everywhere*

  8. Difficulty - And not only are they everywhere, they are

    *hard* - I found reasoning about them difficult - I didn’t know how the tools worked or what tools existed - No one around me seemed to really know what was going on either
  9. A Lack of Resources - And finally, I just couldn’t

    seem to find the resources I needed to put all of these pieces together
  10. “Notes on Distributed Systems for Young Bloods” Hodges, 2013 -

    A merciful exception to the lack of resources applicable to me as a practitioner at the time was this wonderful piece by Jeff Hodges, which starts with this quote
  11. “I’ve been thinking about the lessons distributed systems engineers learn

    on the job.” Hodges, 2013 - I’ve been thinking a lot about that too - In this piece Hodges clearly states that new engineers will find literature, but not applicable lessons, so he provides some excellent applicable lessons - That delta, between the literature and the lessons, is the space I am trying to understand - So I saturated myself in the literature in order to get some clarity around what the field was all about, and why it is so challenging to interact with distributed systems precisely at a time when we need the tools to do things well and easily
  12. The Mind The Proof The Market - In reading and

    reflecting, I divided the history of distributed systems up into three rough categories: The Mind, The Proof, and The Market - Each of these categories represents a common thread that I have traced through the literature - And each of these threads contained information and wisdom that helped me understand where we started, and how we got to where we are today
  13. The roots of distributed systems research in artificial intelligence The

    Mind - The first thread is “The Mind: The roots of distributed systems research in Artificial Intelligence” - At the time I pitched this talk, I was deeply influenced by having recently trawled through the archives at the AI group at MIT, particularly the famed “AI Memos” - Amazing papers by amazing people, on a wide range of topics - I started to read through papers about understanding and modeling the mind as a distributed system, and got very excited
  14. Licklider Minsky Hewitt - After reading through a lot of

    work, not all of it very comprehensible, what I found was that AI researchers had a freedom to dream about the future - This freedom allowed them to create something special, at a time when anything seemed possible - I chose the work of three individuals: JCR Licklider, Marvin Minsky, and Carl Hewitt to represent the origins of distributed systems in artificial intelligence
  15. “Memorandum For Members and Affiliates of the Intergalactic Computer Network”

    Licklider, 1963 - JCR Licklider worked for the United States government in various capacities, and his story is way too big to be told here - He is the spiritual figurehead of the internet, and one of the people responsible for popularizing Vannevar Bush’s work - He also helped to create the AI lab at MIT that employed Minsky, who taught Hewitt - According to legend, Licklider was quite the eccentric. He drank Coke for breakfast. He dreamed about connecting humanity with a large network of computers. And he named that network ‘the Intergalactic Computer Network’ because he knew that his audience, scientists, would find it amusing
  16. “Man-Computer Symbiosis” Licklider, 1960 - He spoke of ideas like

    “Man-Computer Symbiosis” as early as 1960
  17. “The Computer as a Communication Device” Licklider, 1968 - And

    he continued to think deeply for years about how humans could be connected with technology
  18. - Here is an image from “The computer as a

    communication device” where Licklider is trying to work out how information could travel through networks - We see terms like “nodes,” “ports,” “message processor” - He is using biology as an inspiration for modeling the flow of information
  19. “1968-1969 Progress Report” Minsky, 1970 - Licklider helped to create

    Project MAC at MIT that employed people like Marvin Minsky, allowing him to publish amazing works and teach a generation of students - This progress report covers the depth of the work going on at the MIT lab at the time. - Work began to transition from purely theoretical, to implementation based, and back - The primary concern was understanding how computers and humans could interact - Again, this one slide does Minsky no justice - please read about him to discover more
  20. “A Universal Modular ACTOR Formalism for Artificial Intelligence” Hewitt, 1973

    - One of Minsky’s students was Carl Hewitt, whose work is an inflection point for me in the history of distributed systems - In 1973 he published “A Universal Modular ACTOR Formalism for Artificial Intelligence,” which defined the Actor model of computation and discussed his language PLANNER - The actor model describes computation as being performed by a single kind of object, known as an actor
  21. “Viewing Control Structures as Patterns of Passing Messages” Hewitt, 1976

    - In 1976, Hewitt published “Viewing Control Structures,” which discussed his work with the actor model and message passing. - Hewitt’s work, as I mentioned, is an inflection point for me because of how implementation-based it is - there is formalism, there is code, there is hardware - Even though it is technical, the humanity is not lost - “Modelling an intelligent person” and “Modeling a society of experts” are stated goals - The actor model is one of the most enduring ideas produced at the MIT AI lab, it is everywhere - in the semantics of Erlang and in the libraries of many other major programming languages - To recap - the work of Licklider, Minsky, and Hewitt was inherently concerned with the distributed nature of knowledge and how technology could be used to solve a wide variety of problems - The origins of the field of Distributed Systems can be found here - at the very intersection of technology and humanity
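
Hewitt's core idea is compact enough to sketch. Below is a minimal, illustrative actor in Python, not Hewitt's formalism and not any particular library's API: each actor owns private state and a mailbox, processes one message at a time, and can only be reached by sending it a message. The `Counter` example and its message shapes are my own invention.

```python
import queue
import threading

class Actor:
    """A minimal actor: private state, a mailbox, and behavior invoked
    one message at a time. Actors share nothing; they only send messages."""
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)      # asynchronous: the sender never blocks

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:        # conventional "stop" message
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

class Counter(Actor):
    """Example actor: a private count, mutated only by its own handler."""
    def __init__(self):
        self.count = 0
        super().__init__()

    def receive(self, message):
        kind, reply_to = message
        if kind == "increment":
            self.count += 1
        elif kind == "get":
            reply_to.put(self.count)   # reply on a channel the asker provided

counter = Counter()
for _ in range(3):
    counter.send(("increment", None))
reply = queue.Queue()
counter.send(("get", reply))
result = reply.get()                   # blocks until the actor replies
counter.send(None)
print(result)  # 3
```

Because the mailbox serializes messages, the count is never touched by two messages at once; that "one message at a time" discipline is the heart of the model.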
  22. Formalism The role of the proof in the history of

    distributed systems research - So that covers the Mind. - In parallel with the work on AI at prominent universities around the world, computer science was growing as a field, and the lines between mathematics and its application to technology were being teased out - As a backdrop, it’s important to state that at the time I started researching this work, I had a hard time grasping how you could formalize something as complex as a distributed system - It seemed like too many moving parts, like it was too abstract - I didn’t understand the techniques used to apply something like a mathematical proof (wherein you have to “control the entire universe”) to something as messy as a distributed system - The work of three individuals helped me understand these techniques
  23. Dijkstra Lynch Lamport - Edsger Dijkstra, Nancy Lynch and Leslie

    Lamport’s names are all likely well known to people at this conference - In this section I’m going to cover Dijkstra and Lynch’s work and briefly mention Lamport’s involvement, and we’ll talk again about Lamport in a little while
  24. “Solution of a Problem in Concurrent Programming Control” Dijkstra, 1965

    - The most well-known name in the field - The ACM Symposium on Principles of Distributed Computing has a prize called the Dijkstra Prize in Distributed Computing - One of the strongest voices in Computer Science history on the importance of formalism - this thread starts with him - This paper describes a mutual exclusion algorithm in one gorgeous page
  25. “To begin, consider N computers, each engaged in a process...”

    Dijkstra, 1965 - Begins with this quote, which is telling - In 1965 Dijkstra is already proving small but important properties of concurrent computation
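
To give a concrete flavor of what such an algorithm looks like, here is a sketch of Peterson's two-process mutual exclusion algorithm (1981), a simpler descendant of the N-process problem Dijkstra solved; it is emphatically not the algorithm from the 1965 paper. Two threads coordinate through shared variables alone, with no lock primitive, and no increment of the shared counter is lost.

```python
import threading
import time

# Shared state: two intent flags, a turn variable, and a counter that the
# critical section protects. No lock primitive is used anywhere.
flag = [False, False]
turn = 0
counter = 0
ITERS = 1000

def process(i):
    global turn, counter
    other = 1 - i
    for _ in range(ITERS):
        flag[i] = True               # announce intent to enter
        turn = other                 # politely defer to the other process
        while flag[other] and turn == other:
            time.sleep(0)            # busy-wait, yielding to the other thread
        counter += 1                 # critical section: a non-atomic increment
        flag[i] = False              # exit protocol

threads = [threading.Thread(target=process, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000: no increment was lost
```

Dijkstra's own solution handles N processes and is subtler; this two-process version simply shows the shape of the entry and exit protocols that such proofs reason about.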
  26. “Self-stabilizing Systems in Spite of Distributed Control” Dijkstra, 1974 -

    In 1974 Dijkstra published this paper, which was also ahead of its time, rigorous, and elegant. - Self-stabilization is akin to the more modern idea of “fault tolerance,” where systems will eventually settle into an acceptable state - This paper went on to inspire many computer scientists who followed, chief amongst them being Leslie Lamport and Nancy Lynch, who both went on to produce an incredible stream of impressive theoretical work
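
Dijkstra's smallest self-stabilizing protocol, the K-state token ring, is simple enough to simulate. The sketch below is a loose rendering of it, with the paper's nondeterministic scheduler played by random choice: start the ring in an arbitrary, even corrupted, state, and it converges on its own to exactly one circulating "privilege" (token).

```python
import random

# N machines in a ring; each holds a value in 0..K-1. Choosing K >= N
# guarantees convergence. Machine 0 is "privileged" when it equals its
# left neighbour (the last machine); every other machine is privileged
# when it differs from its left neighbour.
N, K = 5, 6
state = [random.randrange(K) for _ in range(N)]   # arbitrary, even corrupt

def privileged(i):
    return state[0] == state[N - 1] if i == 0 else state[i] != state[i - 1]

def fire(i):
    """Machine i exercises its privilege, passing the 'token' along."""
    if i == 0:
        state[0] = (state[N - 1] + 1) % K
    else:
        state[i] = state[i - 1]

# A "demon" repeatedly picks any privileged machine and lets it fire.
# At least one machine is always privileged, so the choice never fails.
for _ in range(500):
    fire(random.choice([i for i in range(N) if privileged(i)]))

# Regardless of the initial state, the ring has stabilized: exactly one
# privilege remains, circulating like a token.
print(sum(privileged(i) for i in range(N)))  # 1
```

The striking part is what is absent: no initialization, no failure detection, no recovery code. Legitimate behavior is simply the attractor of the system.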
  27. “Impossibility of distributed consensus with one faulty process” Lynch, 1982

    - The first paper of Nancy Lynch’s that I would like to discuss (actually Lynch et al.) is “Impossibility...” - One of the field’s most influential papers - The result (known as the “FLP impossibility result”) states that in an *asynchronous network,* distributed consensus is *impossible* with even one faulty process - The type of proof Dr. Lynch offered is also known as a “negative result,” one that proves that something is NOT possible, not that something else IS - It is interesting to consider what role this *type* of proof has, and how it is perceived, and what impact it has had on the field. I’ll discuss this again a little bit later
  28. “These results do not show that such problems cannot be

    ‘solved’ in practice; rather, they point up the need for more refined models of distributed computing.” Lynch, 1982 - In discussing the results of the paper in its conclusion, Lynch says the following - A big a-ha moment for me - I had some experience studying the formal principles underlying programming languages - The connection between a language’s operational semantics (if it has one) and the experience of using the language in practice can be very loose, for a variety of reasons - So it clicked: in order to reason about formal properties of systems, assumptions have to be made - All such theorems are based on these kinds of assumptions - This technique is essential and it is absolutely brilliant, but it is a point often missed by practitioners
  29. “A Hundred Impossibility Proofs for Distributed Computing” Lynch, 1989 -

    In 1989, Lynch published this fantastic paper that collects the work in the intervening years since her 1982 paper - Stumbled upon this in Dr. Lynch’s list of works on her web page - these pages are invaluable to any budding archaeologists in the room - What is so great about this paper is how Lynch playfully collects, distills, and reports on the work in a field she helped to pioneer
  30. “The limitation imposed by local knowledge in a distributed system”

    Lynch, 1989 - Amongst the hundred proofs that Lynch surveys, she finds that they all have but one thing in common - In other words, asynchronous networks and the potential for failure in other nodes make certain assumptions impossible - What kind of impact did that have on the researchers who followed Dr. Lynch? On practice?
  31. “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web

    services” Lynch, 2002 - Fast forward 13 years and Lynch publishes this landmark work, which starts with Dr. Brewer’s CAP “conjecture” and turns it into the CAP “theorem” through the application of a so-called “negative result” - Many speakers at this conference will address this theorem; Lindsey Kuper did a much better job yesterday than I could hope to in explaining it - By the time I looked into this paper, I was already searching for those assumptions - what are those frozen FACTS about the system that allow the result to be produced? - The thing to focus on for the purposes of this talk is that the C, A, and P in CAP are *formalized* to be *very specific definitions* of Consistency, Availability, and Partition-Tolerance - Just as the term “impossible” has its own specific meaning - Just as ALL formalisms contain tradeoffs between correctness and expressivity - It’s easy to get confused - It is an irony that those who actually produce rigorous proofs are often those who are the most misunderstood - The *proof* gets divorced from the work that went into it - the WHY gets lost - A game of telephone
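
As a cartoon of the tradeoff, and emphatically not the formal definitions the paper proves the theorem with, consider a toy two-replica register: while a partition is in effect, every write must either be refused (preserving consistency at the cost of availability) or accepted on one side (preserving availability while the replicas diverge). All names below are my own invention.

```python
class Replica:
    def __init__(self):
        self.value = None

class Register:
    """A replicated register with two replicas and a mode flag choosing
    which CAP property to sacrifice when a partition occurs."""
    def __init__(self, mode):
        self.mode = mode                     # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return "unavailable"         # refuse: sacrifice availability
            replica.value = value            # accept: sacrifice consistency
            return "ok"
        replica.value = peer.value = value   # healthy: replicate to both
        return "ok"

cp, ap = Register("CP"), Register("AP")
for r in (cp, ap):
    r.write(r.a, 1)          # both replicas agree on 1
    r.partitioned = True     # the network splits

print(cp.write(cp.a, 2))         # unavailable
print(ap.write(ap.a, 2))         # ok
print(ap.a.value, ap.b.value)    # 2 1: the replicas now disagree
```

The theorem's force comes from making "consistency," "availability," and "partition" precise; this toy only shows why, under a partition, you cannot keep both behaviors at once.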
  32. The Market The impact of commerce on distributed systems research

    - So far we’ve discussed the philosophical and humanity-based origins of distributed systems research in the work of Licklider, Minsky and Hewitt, and the formal origins in the work of Dijkstra and Lynch - My job motivated me to be professionally competent in distributed systems programming - I know many others are in the same position - Clearly there is a great commercial interest in distributed systems - The existence of this conference helps validate this idea - It made me wonder - has this always been the case? - The researchers we have covered up until this point have mostly impacted the field in terms of theory - who had actual industrial involvement?
  33. “SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft

    Control” Lamport, 1978 - Well, Peter Bailis’s #1, Leslie Lamport, of course - He wrote this 1978 paper, which I discovered on Lamport’s page on the Microsoft Research site - It is worth noting that it is one of the greatest web pages of all time, ripe for many archaeological explorations - For example, the entry for this paper on Lamport’s page contains the following quote
  34. “When it became clear that computers were going to be

    flying commercial aircraft, NASA began funding research to figure out how to make them reliable enough for the task.” Lamport, 1978 - Wow. Let’s stare at this for a second. - Part of the NASA funding included the SIFT (Software Implemented Fault Tolerance) project - Lamport helped to theorize and prove the system’s reliability even in the face of malicious (also known as “Byzantine”) faults - Software was now responsible for safely coordinating hardware that was responsible for HUMAN LIFE. Not advertising revenue, HUMAN LIFE. - Lamport clearly could have been covered in the last section on Formalism, as he published many works in response to Dijkstra’s - But his influence, to me, has been felt in other ways
  35. “High-Level Specifications: Lessons from Industry” Lamport, 2003 - A collaboration

    with an Intel engineer he worked with to formally verify multiprocessor memory designs - Lamport has applied techniques of formal verification to a variety of industrial applications - This is why he straddles the section on the Market and the section on Formalism - Lamport claims that high-level specifications, such as those written in his TLA+ language, are essential to verifying industrial systems, concurrent algorithms, and more - TLA+ lets you write precise specifications that tools can then check mechanically - I feel that in the long run, his work on TLA+, which makes proving systems more accessible, will be of great importance - I’ve seen mention of it being used at Amazon, for example - It shows that Lamport has made the connection between the theory of distributed systems and one form of its practice - A form of practice that is very different from what most of us do
  36. “A History of the Virtual Synchrony Replication Model” Birman, 2010

    - Another luminary in the field of distributed systems, Ken Birman, has had quite a bit to say over the years about the mixture of commerce and research, and its impact on practitioners - Birman is noted for his work on Virtual Synchrony and the Isis Toolkit, which is very well covered by his own bit of auto-archaeology in 2010’s “A History...” - Virtual Synchrony is a framework for considering work in distributed systems and has had various formulations over the years - Birman’s flirtations with industrial applications of distributed systems are storied. - The New York Stock Exchange, the French Air Traffic Control System, and more were powered by Isis - He is also an outspoken, reflective writer who has participated in workshops and produced papers about the history and impact of distributed systems research.
  37. “Understanding the Limitations of Causally and Totally Ordered Communication” Cheriton

    and Skeen, 1993 “A Response To Cheriton and Skeen’s Criticism Of Causal and Totally Ordered Communication” Birman, 1993 - A famous exchange in the form of two academic papers from 1993 between Birman and two other authors in the field, Cheriton and Skeen, can and should be consumed by any fellow obsessives - Cheriton & Skeen published “Understanding...” wherein they critique what they see as the primary thrust of Birman’s work: network level ordered communication - they say it is inefficient and hard to reason about - Birman fired back, claiming that their work was a thinly veiled attack on Isis, and revealing that all three authors had “skin in the game” with respect to trying to sell systems to industrial clients at the time their work was being developed - The papers are a fascinating read and they remind us that researchers are living, breathing human beings who have to survive and want to advance their ideas
  38. “How the Hidden Hand Shapes the Market for Software Reliability”

    “Towards a Cloud-Computing Research Agenda” Birman, 2006 and 2008 - Fast forward more than 10 years, and Birman has some interesting perspectives on the interactions between money, research, and practice, specifically as they pertain to advancements in the field of distributed systems - In these papers, when he is not coming off as a bit bitter about his own history, Birman urges his fellow researchers to pursue practical, and thus humane, solutions to the problems that actual people face. - He has many interesting things to say, from the impact of the “impossibility” idea I discussed previously to the blow that the success of transactions and database theory dealt to the field of software reliability. - As a takeaway, Birman’s main idea seems to be that we need to be aware of the impact that the market has on our work, and thus our lives, both as researchers and practitioners
  39. Google Microsoft Facebook LinkedIn etc. - To end this section

    on the market, I want to briefly touch on a phenomenon that has had a prolific impact on the theory and practice of distributed systems - publications from researchers in “industrial” settings - Google’s papers in particular have been crucial to the field, and many practitioners who I spoke to in preparing for this talk point directly to these papers as their initial sources of interest and access. - Here you have companies at a scale that most people will never see actually publishing the techniques they use to do the seemingly impossible. - This feedback loop between large companies and academia is seen by some as a mixed blessing, however. - If the deep academic work pursued by some is considered inapplicable, how could the average practitioner ever hope to leverage the lessons of a company that counts synchronizing atomic clocks between international data centers among its problem areas? - The answer is that sometimes they are applicable, and sometimes they aren’t, and as usual it is up to the practitioner, who often has no training, to make this determination. - This leaves us in the awkward position of wanting to learn from our predecessors while being stung by the impedance mismatch in what we actually _need_
  40. *whew* - That was a lot. - I’ve covered each

    of the main threads that I introduced at the beginning of the talk, and hopefully I have exposed some obvious sources of inspiration and tension for the current state of distributed systems theory and practice - To wrap things up, I am going to cover two areas of pursuit that I think can help usher us towards where we need to be, in order to look back 10 years from now and feel confident that we have, as a community, successfully conquered, or at least intelligently considered, the issues at play in designing, implementing, and maintaining distributed systems
  41. Programming Languages - Since I’m a Programming Language nerd, I

    was surprised to see references to “Distributed languages” throughout the texts - I briefly mentioned PLANNER, but there are literally hundreds of these languages in the literature, and most of them have AWESOME NAMES - However, many developers do not know that a “language for distributed computing” other than Erlang could possibly exist, and I think it is high time to destroy this myth. - I’ll briefly discuss two books that are directly applicable to why I feel it is important for researchers and practitioners to pursue the advancement of languages for distributed computation
  42. “Concepts, Techniques, and Models of Computer Programming” Van Roy and

    Haridi, 2004 - Concepts, Techniques and Models is a revolutionary Computer Science textbook that completely changed my brain and finally got me to understand the connection between computer programming and computer science, no easy task to be sure - just ask Dijkstra, or anyone unfortunate enough to have worked with me. - CTM is an important book for many reasons, chief amongst them being that it makes the reader realize that small, simple, understandable languages that can be evolved into more complex ones are very powerful for forming intuitions of problems in computer science. - In the book you are exposed to a basic language with a simple underlying formal model that is made more and less advanced over time as various subjects are treated - state is added here and taken away, distribution is included when it is needed, etc. - Helps you get a sense for the impedance mismatch found between theory and practice
  43. “Programming Distributed Computing Systems: A Foundational Approach” Varela, 2013 - Carlos

    Varela is an author who is clearly inspired by Van Roy and Haridi’s work, and his excellent book on distributed computation takes the position that *understanding concurrent computation is essential to understanding distributed computation,* and proceeds to elucidate various modern formal process calculi that he argues should be the basis for future languages. - Varela describes the terms *distribution* and *mobility* as essential properties for distributed models. *Distribution* is the idea that computation can occur in different locations and *mobility* is the idea that computation can move between these locations. The combination of distribution and mobility is what most modern developers are actually dealing with, but they simply do not have these tools. - Adding distribution and mobility to a formal model can help bridge the gap between theory and practice
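
These two properties can be sketched in a few lines; everything below (the `Location` and `Agent` classes and their methods) is my own toy construction, not Varela's formalism: distribution is modeled as multiple places where computation can run, and mobility as a computation that moves between them, carrying its state along.

```python
class Location:
    """A place where computation can run: one per machine (distribution)."""
    def __init__(self, name):
        self.name = name
        self.agents = []

class Agent:
    """A mobile computation: its state travels with it when it moves."""
    def __init__(self):
        self.visited = []

    def run(self, location):
        self.visited.append(location.name)   # stand-in for useful work

    def migrate(self, src, dst):
        src.agents.remove(self)              # mobility: leave one location,
        dst.agents.append(self)              # arrive at another, state intact
        self.run(dst)

a, b = Location("node-a"), Location("node-b")
agent = Agent()
a.agents.append(agent)
agent.run(a)
agent.migrate(a, b)
print(agent.visited)  # ['node-a', 'node-b']
```

In a real system the hard parts are exactly what this toy elides: serializing the agent's state, surviving a failed migration, and naming locations across a network, which is why Varela argues these properties belong in the formal model.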
  44. Humanity - The most fruitful work that we have achieved

    in the field of Computer Science is a direct result of the application of resources towards the ends of furthering and better understanding humanity. It is a simple fact that the longer we ignore this reality, the greater our peril. - The last two sources I’ll mention are deeply thoughtful, humane works.
  45. “On Proof and Progress in Mathematics” Thurston, 1994 - “On Proof

    and Progress in Mathematics” by the mathematician William Thurston. I came across this paper when Thurston died - I wasn’t sure what to expect but I certainly wasn’t prepared. This paper is an absolute brain-breaking work of painful beauty, and I won’t say much about it besides the fact that everyone here should read it, and that it contains keys to the questions I’m trying to bring to your attention in this talk. - As a short summary, however, Thurston deals with the idea that “progress” in mathematics is often measured by proof, and attempts to understand the impact that it has. - Section 4, “What is a proof,” in particular also has direct applicability to this talk - Thurston is discussing his recollection of trying to understand what a proof is and how it works.
  46. “...knowledge and understanding were embedded in the minds and in

    the social fabric of the community of people thinking about a particular topic.” Thurston, 1994 - ...
  47. “...this knowledge was supported by written documents, but the written

    documents were not really primary.” Thurston, 1994 - The human factor - how we communicate our proofs is more important than the proofs themselves - How our work gets used matters - Community counts. - From this we can extrapolate that the more we consider humanity, the further we will go
  48. “Database Metatheory: Asking the Big Queries” Papadimitriou, 1995 - Hits

    on many of the notes that I’ve brought up here in a much deeper and more intelligent way. - Discusses the impact of the definition of the field by negative proof - Compares Kuhn’s theory of revolutions in natural sciences to Computer Science - definitely a worthwhile read. - Papadimitriou’s attempts to understand and contextualize the work done by researchers at various points in innovation cycles are a poignant reminder that our place in time impacts what we do and how effectively we do it. - He also identifies and destroys the notion of the normative impact of describing a proof as “positive” or “negative” - Computer Science is a field which has a unique interaction with mathematical formalisms, and his descriptions of that relationship are gorgeous - It’s essential to ask the big questions.
  49. Takeaways Academic Research Industrial Research Practice - So, some final

    takeaways - Distributed Systems has a fascinating history - Humanity, Formalism, and Commerce have each had deep effects on the field and those who study and practice it - Each of the groups represented in the triangle on the screen have things to gain and things to lose based on how we pursue theory and practice going forward - To end things, I’ll give one recommendation to each group
  50. To the best of your ability, recognize the formalisms your

    work is based on, understand the details of the papers you’re reading, and be careful with how you communicate these ideas to your peers. - Practitioners
  51. Guide the community and strike a balance between alleviating current

    pain and making the future path clear. - Academic Researchers
  52. Provide complete implementation details in papers, be generous with your

    Open Source contributions, try to give advice directly to practitioners. - Industrial Researchers
  53. In Conclusion - Finally, for real

  54. The Future Is Bright In Conclusion - In conclusion, distributed

    systems is an incredibly deep and rich field. Studying it has been absolutely thrilling and in addition to a fascinating body of artifacts that are ripe for more archaeological work, the community is generous, motivated, and forward-thinking. - Together we will do amazing things.
  55. Peter Alvaro, Peter Bailis, Camille Fournier, Andy Gross, Jeff Hodges,

    Alex Kahn, Lindsey Kuper, Chris Meiklejohn, Maya Miller, David Nolen, Mark Phillips, Aaron Quint, Tom Santero, and YOU Thank You