
How to Make Distributed Programming Safer using Types

Philipp Haller
November 23, 2023

Transcript

  1. How to Make Distributed Programming Safer using Types 1 KTH

    Royal Institute of Technology Stockholm, Sweden Philipp Haller 34th Nordic Workshop on Programming Theory (NWPT 2023) 
 Västerås, Sweden, November 22, 2023
  2. How to Make Distributed Programming Safer using Types 2 KTH

    Royal Institute of Technology Stockholm, Sweden Philipp Haller 34th Nordic Workshop on Programming Theory (NWPT 2023) 
 Västerås, Sweden, November 22, 2023
  3. How to Make Distributed Programming Safer, Simpler, and more Scalable

    3 KTH Royal Institute of Technology Stockholm, Sweden Philipp Haller 34th Nordic Workshop on Programming Theory (NWPT 2023) 
 Västerås, Sweden, November 22, 2023
  4. ACM Student Research Competition 
 at <Programming> Conference 2024 •

    Unique opportunity for students to present their original research at <Programming> before judges • Two categories: undergraduate and graduate • Poster presentation & short research talk • Three winners receive $500, $300, and $200, respectively • First place winners advance to the ACM SRC Grand Finals • <Programming> Conference 2024: • March 11-14, 2024, in Lund, Sweden • https://2024.programming-conference.org/ 5
  5. Cloud computing continues to grow... Recent past (source: Gartner):

     • Spending in 2020: $257.5 billion • Spending in 2023: $563.6 billion • Spending in 2024 (forecast): $678.8 billion • An increase of ~164% within 4 years! 6
  6. What’s the fastest-growing consumer application in history? Feb 1 (Reuters)

    - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday. 7
  7. “In the next five years, this will change completely. You

    won’t have to use different apps for different tasks. You’ll simply tell your device, in everyday language, what you want to do.” “In the near future, anyone who’s online will be able to have a personal assistant powered by artificial intelligence that’s far beyond today’s technology.” 8
  8. “[…] since 2012, the amount of compute used in the

    largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.” Results of an analysis by OpenAI in 2018: 9
  9. Importance of Large-Scale Distributed Software Systems • Massive impact of

    new generation of digital infrastructure • Cloud computing: rapid growth, de-facto deployment platform • AI: rapid adoption of AI-powered applications, rapidly increasing demand for computing power • Distributed software systems at the core • Cloud computing based on large-scale distributed computing infrastructure • Computing power the bottleneck for training and application of ML models • Rise of specialized hardware architectures 10
  10. Challenges of Building Distributed Systems • Fault tolerance requires sophisticated

    distributed algorithms • Scalability requires concurrency, parallelism, and distribution • Increasingly targeting specialized hardware • High availability often requires weakening consistency • Enforcing data-privacy legislation (GDPR, CCPA, …) automatically is difficult • Dynamic software upgrades are challenging 12
  11. Objective The design and implementation of a programming system that

    • supports emerging applications and workloads; • provides reliability and trust; and • embraces simplicity and accessibility. Scalability, Reliability, and Simplicity 13
  12. Scalability, Reliability, and Simplicity Scalability • Low latency and high

    throughput requires • Asynchronous operations • Task, pipeline and data parallelism • Distribution requires • Fault tolerance • Minimizing coordination whenever possible Reliability & trust • Provable properties Simplicity • Strong guarantees, automatic checking, potential for wide adoption 14
  13. Geo-Distribution • Operating a service in multiple datacenters can improve

    latency and availability for geographically distributed clients • Challenge: round-trip latency • < 2ms between servers within the same datacenter • up to two orders of magnitude higher between datacenters in different countries 15 Naive reuse of single-datacenter application architectures and protocols leads to poor performance!
  14. (Partial) Remedy: Eventual Consistency • Eventual consistency promises better availability

    and performance than strong consistency (= serializing updates in a global total order) • Each update executes at some replica (e.g., geographically closest) without synchronization • Each update is propagated asynchronously to the other replicas • All updates eventually take effect at all replicas, possibly in different orders • Updates required to be commutative 16 Image source: Shapiro, Preguica, Baquero, and Zawirski: Conflict-Free Replicated Data Types. SSS 2011
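As a concrete illustration (not from the slides), the simplest state-based CRDT is a grow-only set: local updates commute, and merging is a join (set union), so all replicas converge regardless of the order in which updates and merges arrive:

     // Grow-only set (G-Set): a minimal state-based CRDT sketch.
     final case class GSet[A](elems: Set[A] = Set.empty[A]) {
       def add(a: A): GSet[A] = GSet(elems + a)      // local update, no synchronization
       def merge(other: GSet[A]): GSet[A] =          // join: commutative, associative, idempotent
         GSet(elems.union(other.elems))
       def contains(a: A): Boolean = elems.contains(a)
     }

     // Two replicas diverge and then converge: merge order does not matter.
     val r1 = GSet[String]().add("Cola")
     val r2 = GSet[String]().add("Tonic Water")
     assert(r1.merge(r2) == r2.merge(r1))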
  15. Replicated Data: Example • Collaborative application: shared grocery list •

    Idea: multiple users can edit the grocery list concurrently on their phones • Key feature: grocery list should support offline editing • Supported operations: • Add item to grocery list • Remove item from grocery list • Mark item as “picked up” 17
  16. Alice and Bob edit the shared grocery list:
     • Alice and Bob are both online; each replica of the grocery list contains: Potatoes, Salad
     • Alice adds “Cola”; both replicas now contain: Potatoes, Salad, Cola
     • Bob’s phone loses reception (offline)
     • Bob removes “Cola” and adds “Tonic Water”; Bob’s replica: Potatoes, Salad, Tonic Water
     • Bob picks up “Tonic Water” while offline
     • Bob’s phone comes back online; Alice picks up “Cola”
     • The replicas merge; Alice’s replica: Potatoes, Salad, Cola, Tonic Water; Bob’s replica: Potatoes, Salad, Tonic Water, Cola
     • Problem: picked up both Cola and Tonic Water! Only one of Cola or Tonic Water should be bought! 18
  17. Problem and Solutions • What is the problem? • Allowing

    pick-ups while offline can lead to double pick-ups • Marking an item as “picked up” is problematic if lists out of sync • Possible disagreement about what should be picked up • Solution 1: • Forbid removing items from grocery list, and • Make “pick up” a blocking operation that only works online • Then, at most one user is going to pick up each item • However: restrictive, since it limits possible changes to list • Solution 2: • When a user tries to mark an item as “picked up”: • Force synchronization of all replicas → block until synchronized • Try to perform “pick-up” on all synchronized replicas 19
  18. Correct solution: before attempting to mark an item as “picked up”, the replicas are synchronized (block until sync possible):
     • Alice and Bob are both online; each replica contains: Potatoes, Salad
     • Alice adds “Cola”; both replicas now contain: Potatoes, Salad, Cola
     • Bob’s phone loses reception (offline)
     • Bob removes “Cola” and adds “Tonic Water”; Bob’s replica: Potatoes, Salad, Tonic Water
     • Bob’s phone comes back online
     • Alice tries to pick up “Cola”; the replicas first sync: remove “Cola” + add “Tonic Water”
     • Error: cannot pick up “Cola”: not on list; Alice presses “OK” and puts the Cola back on the shelf
     • Bob picks up “Tonic Water”; both replicas contain: Potatoes, Salad, Tonic Water 20
  19. Programming Abstraction and Consistency Model Idea: • Synchronization of replicas

    based on Conflict-Free Replicated Data Types (CRDTs) • Extend CRDTs with on-demand sequential consistency • Add support for sequentially consistent operations • Don’t have to be commutative! • Define a consistency model and a distributed protocol that enforces the consistency model 
 → Observable Atomic Consistency Protocol (OACP) 21
  20. Observable Atomic Consistency (OAC) • Basis: a replicated data type

     (RDT) storing values of a lattice (more precisely, a join-semilattice) • Example 1: lattice = natural numbers where join(x, y) = max(x, y) • Example 2: lattice = subset lattice where join(s, q) = s.union(q) • Operations with different consistency levels: • A totally-ordered operation (“TOp”) atomically: • synchronizes the replicas, e.g., using distributed consensus (Paxos, Raft, …); and • applies the operation to the state of each replica. • A convergent operation (“CvOp”) is commutative and processed asynchronously. • Let’s have a look at an example… 22
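The two example lattices can be written down directly; a minimal Scala sketch (the Lattice trait and the instance names are illustrative assumptions):

     // A join-semilattice: join must be commutative, associative, and idempotent.
     trait Lattice[A] {
       def join(x: A, y: A): A
     }

     // Example 1: natural numbers under max.
     val natLattice: Lattice[Int] = (x, y) => x.max(y)

     // Example 2: subset lattice under union.
     def setLattice[A]: Lattice[Set[A]] = (s, q) => s.union(q)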
  21. OACP Applied to the Grocery List Example

     // Based on the two-phase set (2P-Set) CRDT:
     // once removed, an item can never be added again.
     class GroceryList(var added: Set[String], var removed: Set[String]) {
       private var pickedUp = Set[String]()
       def lookup(item: String) = added.contains(item) && !removed.contains(item)
       def add(item: String) = added += item
       def remove(item: String) = removed += item
       def merge(other: GroceryList) =
         new GroceryList(added.union(other.added), removed.union(other.removed))
       def pickup(item: String) = pickedUp += item
     }

     def markAsPickedUp(groceries: Ref[GroceryList], item: String) =
       groceries.tOp(glist => {
         val success = glist.lookup(item)
         if success then glist.pickup(item)
         success
       }) 23
  22. OACP Applied to the Grocery List Example (same code as on the

     previous slide) • Pick-up is implemented using a totally-ordered operation (tOp) 24
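To contrast the two kinds of operations, a hypothetical usage sketch (the cvOp signature is an assumption based on the CvOp/TOp description on slide 20, not the actual OACP API):

     def addThenPickUp(groceries: Ref[GroceryList]): Unit = {
       groceries.cvOp(_.add("Cola"))        // convergent op: applied locally, propagated asynchronously
       markAsPickedUp(groceries, "Cola")    // totally-ordered op: synchronizes all replicas first
     }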
  23. Results • A formal definition of the observable atomic consistency

    (OAC) model • A mechanized model of OACP implemented using Maude with checked properties: • the state of all replicas is made consistent upon executing a totally-ordered operation; • the protocol preserves the order defined by OAC. • An experimental evaluation including latency, throughput, coordination overhead, and scalability Xin Zhao, Philipp Haller: Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020)
 https://doi.org/10.1016/j.jlamp.2020.100561 (open access) 25
  24. Mixing Consistency Models • Supporting multiple consistency models within the

    same application • is important in order to achieve both consistency and availability, as needed; • is prone to catastrophic errors. • Mutating strongly consistent data based on weakly consistent data violates strong consistency • On the other hand, using strongly consistent data as input to a weakly-consistent operation is safe 26
  25. CTRD: Consistency Types for Replicated Data • Type system that

     distinguishes values according to their consistency • Consistency represented as labels attached to types and values • A label can be con (consistent), oac (OAC) or ava (available) • Labels are ordered: the label ordering expresses permitted data flow: con → oac → ava • Labeled types are covariant in their labels: a con-labeled type is a subtype of the corresponding ava-labeled type, but not vice versa 27
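One way to make the covariance concrete is to encode the labels as Scala types whose subtyping mirrors the permitted flow con → oac → ava; this is an illustrative encoding (an assumption), not the CTRD implementation:

     // Labels as types: Con <: Oac <: Ava mirrors the permitted data flow.
     sealed trait Ava
     sealed trait Oac extends Ava
     sealed trait Con extends Oac

     // A labeled value, covariant in its label L.
     final case class Labeled[+L, T](value: T)

     def useAsAvailable[T](x: Labeled[Ava, T]): T = x.value

     val strong: Labeled[Con, Int] = Labeled(1)
     val ok = useAsAvailable(strong)   // con flows to ava: permitted by covariance
     // val bad: Labeled[Con, Int] = Labeled[Ava, Int](0)   // rejected: ava does not flow to con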
  26. Syntax • Essentially: STLC extended with distributed ML-style references and

    labels • Main influence: (Toro et al. TOPLAS 2018) 28
  27. Select Typing Rules (CTRD) • Example 1: t_con := s_ava

     (Illegal: an ava-labeled value must not flow into a con-labeled reference) • Example 2: if x_ava then t_con := 1_con else t_con := 0_con (Illegal: a con-labeled reference must not be mutated under an ava-labeled condition) 29
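Under the hypothetical Labeled encoding sketched above (an illustration, not the CTRD formalism), Example 1 is rejected by ordinary subtyping, while Example 2 is an implicit flow that needs CTRD's typing rules, since subtyping alone cannot see it; LRef is an assumed mutable labeled reference:

     // A mutable reference to a labeled value; invariant in its label L.
     final class LRef[L, T](private var v: Labeled[L, T]) {
       def get: Labeled[L, T] = v
       def assign(x: Labeled[L, T]): Unit = v = x
     }

     def examples(t: LRef[Con, Int], s: Labeled[Ava, Int]): Unit = {
       // Example 1: t.assign(s) does not compile:
       // Labeled[Ava, Int] is not a subtype of Labeled[Con, Int].

       // Example 2: even if both branches assign con-labeled constants,
       // branching on an ava-labeled condition leaks availability into
       // con-labeled state; CTRD's typing rules reject this, while the
       // plain subtyping encoding above cannot catch it.
       ()
     }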
  28. Results • Distributed small-step operational semantics • Formalizes RDTs including

    observable atomic consistency; operations via message passing • Proofs of correctness properties: • Type soundness → no run-time label violations! • Noninterference 
 E.g., mutation of ava-labeled references cannot be observed via con-labeled values • Proofs of consistency properties: • Theorem: For con operations, CTRD ensures sequential consistency • Theorem: For ava operations, CTRD ensures eventual consistency 30
  29. Summary CTRD: Consistency Types for Replicated Data • A distributed,

    higher-order language with replicated data types and consistency labels • Enables safe mixing of strongly consistent and available (weakly consistent) data • Proofs of type soundness, noninterference, and consistency properties • Integrates observable atomic consistency (OAC) 31 Xin Zhao, Philipp Haller: Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021) 
 https://doi.org/10.22152/programming-journal.org/2021/5/6
  30. Scalability, Reliability, and Simplicity Scalability • Low latency and high

    throughput requires • Asynchronous operations • Task, pipeline and data parallelism • Distribution requires • Fault tolerance • Minimizing coordination whenever possible Reliability & trust • Provable properties Simplicity • Strong guarantees, automatic checking, potential for wide adoption 32
  31. Problem: Reliable Composition of Services • Application logic in workflows

    A, B and C — each processing a stream of incoming events • Problems: • Problem 1: Workflows cannot communicate directly with each other → custom dispatching logic needed • Problem 2: How to ensure fault tolerance of communication between the workflows? • Ideally: exactly-once processing of each event • Current practice: • Communicate events via reliable, distributed logs (e.g., Apache Kafka) • Major problem: events are dispatched via unreliable custom logic Typically deployed across many machines in a data center 33
  32. The Portals Project — Goals • Simplify composition of reliable,

    decentralized services • Service = long-running process consuming and producing events • Deployed in the cloud and on the edge → no centralized coordination • Provide strong execution guarantees: • Transactional processing • Exactly-once processing • Formalization and correctness proofs • Open-source implementation enabling experimental evaluation 
 and extensions 34
  33. Workflows • Workflow as the unit of computation = DAG

     of stateful tasks • A workflow consumes and produces streams of atoms • Atom = batch of events [Diagram: a Workflow[T, U] with src and sink tasks consumes an AtomicStream[T] and produces an AtomicStream[U]] 35
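As a rough sketch of the model (hypothetical types; Atom, Workflow, and andThen are assumptions for illustration, not the Portals API), a workflow can be seen as a transformer of atoms that composes into a DAG:

     // An atom is a batch of events.
     type Atom[T] = Vector[T]

     // A workflow transforms one input atom into one output atom.
     final case class Workflow[T, U](step: Atom[T] => Atom[U]) {
       def andThen[V](next: Workflow[U, V]): Workflow[T, V] =
         Workflow(step.andThen(next.step))   // sequential composition: a simple DAG
     }

     val parse    = Workflow[String, Int](_.flatMap(_.toIntOption))
     val square   = Workflow[Int, Int](_.map(n => n * n))
     val pipeline = parse.andThen(square)    // Workflow[String, Int]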
  34. Atomic Processing Contract Data is processed through atomic (transactional) steps:

     • Consume an atom (“batch of events”) from the input stream • Process all events in the atom • Perform all state updates • Produce one atom containing all emitted events [Diagram: a Workflow[T, U] with src and sink tasks consumes an AtomicStream[T] and produces an AtomicStream[U]] 36
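A minimal sketch of one atomic step (a hypothetical helper, an assumption based on the four bullets above; it reuses the Atom alias from the previous sketch):

     // Process one atom: consume, process all events, apply all state
     // updates, and emit exactly one output atom.
     def atomicStep[T, U, S](
         input: Iterator[Atom[T]],
         state: S,
         process: (S, T) => (S, Vector[U])
     ): Option[(S, Atom[U])] =
       if !input.hasNext then None
       else {
         val atom = input.next()              // 1. consume an atom from the input stream
         var s = state
         val out = Vector.newBuilder[U]
         for event <- atom do {               // 2. process all events in the atom
           val (s2, emitted) = process(s, event)
           s = s2                             // 3. perform all state updates
           out ++= emitted
         }
         Some((s, out.result()))              // 4. produce one atom with all emitted events
       }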
  35. Portals A Portal exposes a dataflow as a service via

     bidirectional communication [Diagram: a responding dataflow (src, tasks, sink) exposes a Portal service; a requesting dataflow sends requests to and receives replies from it through a portal service access operator] 37
  36. Portals: End-to-End Exactly-Once Processing • Atomic streams = reliable distributed

    streams with transactional interface • Communication via Portals based on atomic streams • Key property: atomic streams + atomic processing contract → end-to-end exactly-once processing Current practice: 
 No guarantees for dispatcher and workflow composition Portals: 
 Exactly-once processing guaranteed for composition of workflows 38
  37. The Portals Model • Compared to the actor model, Portals:

     • guarantees exactly-once processing, adds data parallelism, and • removes fully-dynamic communication topologies: workflows cannot create workflows. • Compared to previous models for stateful serverless (Durable Functions, …), Portals: • supports communication cycles, • adds dataflow composition, and • introduces serializable state updates (not shown) The Actor Model: • Actors¹ are independent concurrent processes that communicate by asynchronous message passing • In response to a message, an actor can: • send messages to other actors; • change its behavior/state; • create new actors. • In Portals, workflows take the place of actors. ¹ Gul Agha: Concurrent Object-Oriented Programming. Commun. ACM 33(9): 125-141 (1990) For details, see: Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171. https://doi.org/10.1145/3563835.3567664 39
  38. Portals Open Source Project • A prototype implementation under active

    development • Open source, Apache 2.0 License • Written in Scala 3, a high-level language combining functional and object-oriented programming • Repository on GitHub: 
 https://github.com/portals-project/portals 40
  39. Portals Playground • Portals Playground enables running Portals applications in

    the web browser: 
 portals-project.org/playground/ • Made possible by compiling the Portals framework to JavaScript using Scala.js, the Scala-to-JavaScript compiler 41
  40. Portals: Summary • An extension of dataflow streaming with atomic

    streams and portals • Portals enable direct communication between workflows • End-to-end exactly-once processing guaranteed via atomic processing contract in combination with atomic streams • Ongoing work on formalization and correctness proofs • Project website: portals-project.org 42 Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171 https://doi.org/10.1145/3563835.3567664
  41. Conclusion • Key challenge: • The design and implementation of

     a programming system that • supports emerging applications and workloads; • provides reliability and trust; and • embraces simplicity and accessibility. • Realizing this vision requires work on: • consistency models and distributed protocols; • type systems and/or program verification; • programming models that enable scalability, fault tolerance, and simplicity. [Diagram: Observable Atomic Consistency, Consistency Types, and Portals supporting Scalability, Fault Tolerance, and Simplicity] 43
  42. Postdoc Fellowship Opportunity • Are you a PhD student in

    the final year? Are you a postdoc? • Digital Futures Research Center (KTH, Stockholm U, RISE) • Fully-funded 2-year postdoc fellowships • Project defined by postdoc fellow • Calls twice a year, closing dates typically in Nov and Mar • Notification ~ 2 months later • Info (check also closed calls!): https://www.digitalfutures.kth.se/ 44