Slide 1

How to Make Distributed Programming Safer using Types
Philipp Haller
KTH Royal Institute of Technology, Stockholm, Sweden
34th Nordic Workshop on Programming Theory (NWPT 2023)
Västerås, Sweden, November 22, 2023

Slide 3

How to Make Distributed Programming Safer, Simpler, and more Scalable
Philipp Haller
KTH Royal Institute of Technology, Stockholm, Sweden
34th Nordic Workshop on Programming Theory (NWPT 2023)
Västerås, Sweden, November 22, 2023

Slide 4

Collaborators
Paris Carbone, Jonas Spenger, Xin Zhao
Thank you!

Slide 5

ACM Student Research Competition at ‹Programming› 2024
• Unique opportunity for students to present their original research at ‹Programming› before judges
• Two categories: undergraduate and graduate
• Poster presentation & short research talk
• Three winners receive $500, $300, and $200, respectively
• First-place winners advance to the ACM SRC Grand Finals
• ‹Programming› 2024:
  • March 11-14, 2024, in Lund, Sweden
  • https://2024.programming-conference.org/

Slide 6

Cloud computing continues to grow...
Recent past (source: Gartner):
• Spending in 2020: $257.5 billion
• Spending in 2023: $563.6 billion
• Spending in 2024 (forecast): $678.8 billion
An increase of ~164% within 4 years (678.8 / 257.5 ≈ 2.64)!

Slide 7

What’s the fastest-growing consumer application in history?
“Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.”

Slide 8

“In the next five years, this will change completely. You won’t have to use different apps for different tasks. You’ll simply tell your device, in everyday language, what you want to do.”
“In the near future, anyone who’s online will be able to have a personal assistant powered by artificial intelligence that’s far beyond today’s technology.”

Slide 9

Results of an analysis by OpenAI in 2018:
“[…] since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.”

Slide 10

Importance of Large-Scale Distributed Software Systems
• Massive impact of a new generation of digital infrastructure
  • Cloud computing: rapid growth, the de-facto deployment platform
  • AI: rapid adoption of AI-powered applications, rapidly increasing demand for computing power
• Distributed software systems at the core
  • Cloud computing is based on large-scale distributed computing infrastructure
  • Computing power is the bottleneck for training and application of ML models
  • Rise of specialized hardware architectures

Slide 11

Failures in cloud-based distributed systems can be catastrophic.

Slide 12

Challenges of Building Distributed Systems
• Fault tolerance requires sophisticated distributed algorithms
• Scalability requires concurrency, parallelism, and distribution
• Increasingly targeting specialized hardware
• High availability often requires weakening consistency
• Enforcing data-privacy legislation (GDPR, CCPA, …) automatically is difficult
• Dynamic software upgrades are challenging

Slide 13

Objective
The design and implementation of a programming system that
• supports emerging applications and workloads;
• provides reliability and trust; and
• embraces simplicity and accessibility.
Scalability, Reliability, and Simplicity

Slide 14

Scalability, Reliability, and Simplicity
Scalability
• Low latency and high throughput require
  • asynchronous operations
  • task, pipeline, and data parallelism
• Distribution requires
  • fault tolerance
  • minimizing coordination whenever possible
Reliability & trust
• Provable properties
Simplicity
• Strong guarantees, automatic checking, potential for wide adoption

Slide 15

Geo-Distribution
• Operating a service in multiple datacenters can improve latency and availability for geographically distributed clients
• Challenge: round-trip latency
  • < 2 ms between servers within the same datacenter
  • up to two orders of magnitude higher between datacenters in different countries
Naive reuse of single-datacenter application architectures and protocols leads to poor performance!

Slide 16

(Partial) Remedy: Eventual Consistency
• Eventual consistency promises better availability and performance than strong consistency (= serializing updates in a global total order)
• Each update executes at some replica (e.g., the geographically closest one) without synchronization
• Each update is propagated asynchronously to the other replicas
• All updates eventually take effect at all replicas, possibly in different orders
• Updates are required to be commutative
Image source: Shapiro, Preguiça, Baquero, and Zawirski: Conflict-Free Replicated Data Types. SSS 2011
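As a minimal illustration of commutative, asynchronously propagated updates (a sketch of my own, not from the slides), consider a grow-only set (G-Set) CRDT in Scala; merging is set union, so replicas converge regardless of delivery order:

  // Sketch: a grow-only set (G-Set), one of the simplest CRDTs.
  // Updates commute: applying adds in any order, and merging by union,
  // yields the same final state on every replica.
  final case class GSet[A](elements: Set[A]):
    def add(a: A): GSet[A] = GSet(elements + a)    // local update, no synchronization
    def merge(other: GSet[A]): GSet[A] =           // join = set union: commutative,
      GSet(elements union other.elements)          // associative, and idempotent
    def lookup(a: A): Boolean = elements.contains(a)

  @main def gsetDemo(): Unit =
    val replica1 = GSet(Set.empty[String]).add("Potatoes")
    val replica2 = GSet(Set.empty[String]).add("Salad")
    // Both merge orders yield the same state: the replicas converge.
    assert(replica1.merge(replica2) == replica2.merge(replica1))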

Slide 17

Replicated Data: Example
• Collaborative application: a shared grocery list
• Idea: multiple users can edit the grocery list concurrently on their phones
• Key feature: the grocery list should support offline editing
• Supported operations:
  • Add item to grocery list
  • Remove item from grocery list
  • Mark item as “picked up”

Slide 18

Scenario: Alice and Bob, both online, share the grocery list [Potatoes, Salad]
• “Cola” is added to the grocery list → both replicas: [Potatoes, Salad, Cola]
• Bob’s phone loses reception → Bob is offline
• Bob (offline) removes “Cola” and adds “Tonic Water” → Bob’s replica: [Potatoes, Salad, Tonic Water]
• Bob picks up “Tonic Water”; meanwhile Alice (online, replica [Potatoes, Salad, Cola]) picks up “Cola”
• Bob’s phone comes back online → the replicas merge: [Potatoes, Salad, Cola, Tonic Water] (element order may differ per replica)
Problem: picked up both Cola and Tonic Water! Only one of Cola or Tonic Water should be bought!

Slide 19

Problem and Solutions
• What is the problem?
  • Allowing pick-ups while offline can lead to double pick-ups
  • Marking an item as “picked up” is problematic if the lists are out of sync
    • Possible disagreement about what should be picked up
• Solution 1:
  • Forbid removing items from the grocery list, and
  • make “pick up” a blocking operation that only works online
  • Then at most one user is going to pick up each item
  • However: restrictive, since it limits the possible changes to the list
• Solution 2:
  • When a user tries to mark an item as “picked up”:
    • Force synchronization of all replicas → block until synchronized
    • Try to perform the “pick-up” on all synchronized replicas

Slide 20

Correct solution: before attempting to mark an item as “picked up”, the replicas are synchronized (block until synchronization is possible)
• “Cola” is added to the grocery list → both replicas: [Potatoes, Salad, Cola]
• Bob’s phone loses reception → Bob is offline
• Bob (offline) removes “Cola” and adds “Tonic Water” → Bob’s replica: [Potatoes, Salad, Tonic Water]
• Alice tries to pick up “Cola” → the operation blocks until synchronization is possible
• Bob’s phone comes back online → sync: remove “Cola” + add “Tonic Water” → both replicas: [Potatoes, Salad, Tonic Water]
• Error: cannot pick up “Cola”: not on the list → Alice presses “OK” and puts the Cola back on the shelf
• Pick up “Tonic Water”…

Slide 21

Programming Abstraction and Consistency Model
Idea:
• Synchronization of replicas based on Conflict-Free Replicated Data Types (CRDTs)
• Extend CRDTs with on-demand sequential consistency
  • Add support for sequentially consistent operations
  • These don’t have to be commutative!
• Define a consistency model and a distributed protocol that enforces the consistency model
→ Observable Atomic Consistency Protocol (OACP)

Slide 22

Observable Atomic Consistency (OAC)
• Basis: a replicated data type (RDT) storing values of a lattice (more precisely, a join-semilattice)
  • Example 1: lattice = natural numbers, where join(x, y) = max(x, y)
  • Example 2: lattice = subset lattice, where join(s, q) = s.union(q)
• Operations with different consistency levels:
  • A totally-ordered operation (“TOp”) atomically:
    • synchronizes the replicas (e.g., using distributed consensus: Paxos, Raft, …); and
    • applies the operation to the state of each replica.
  • A convergent operation (“CvOp”) is commutative and processed asynchronously.
• Let’s have a look at an example…
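A minimal Scala sketch (my own illustration with illustrative names, not the OACP implementation) of the join-semilattice interface underlying such RDTs, together with the two example instances:

  // Sketch: a join-semilattice has a commutative, associative, idempotent join.
  trait JoinSemilattice[A]:
    def join(x: A, y: A): A

  object Lattices:
    // Example 1: natural numbers, join = max
    val maxNat: JoinSemilattice[Int] = (x, y) => math.max(x, y)
    // Example 2: subset lattice, join = set union
    def setUnion[T]: JoinSemilattice[Set[T]] = (s, q) => s union q

  @main def latticeDemo(): Unit =
    import Lattices.*
    assert(maxNat.join(3, 7) == 7)
    assert(setUnion[String].join(Set("Salad"), Set("Cola")) == Set("Salad", "Cola"))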

Slide 23

OACP Applied to the Grocery List Example

  // Based on the two-phase set (2P-Set) CRDT:
  // once removed, an item can never be added again.
  class GroceryList(var added: Set[String], var removed: Set[String]) {
    private var pickedUp = Set[String]()

    def lookup(item: String) =
      added.contains(item) && !removed.contains(item)

    def add(item: String) = added += item

    def remove(item: String) = removed += item

    def merge(other: GroceryList) =
      new GroceryList(added.union(other.added), removed.union(other.removed))

    def pickup(item: String) = pickedUp += item
  }

  def markAsPickedUp(groceries: Ref[GroceryList], item: String) =
    groceries.tOp(glist => {
      val success = glist.lookup(item)
      if success then glist.pickup(item)
      success
    })

Slide 24

OACP Applied to the Grocery List Example (continued)
Same GroceryList code as on the previous slide; the highlighted point: the pick-up is performed using a totally-ordered operation (tOp), which synchronizes the replicas before applying the operation.
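The slides do not show the Ref[T] abstraction itself; a rough sketch of its interface (the names and signatures are my assumptions, not the actual OACP code) could look as follows:

  // Hypothetical sketch of the Ref[T] interface used by markAsPickedUp above;
  // illustrative only, not the actual OACP implementation.
  trait Ref[T]:
    // Totally-ordered operation: synchronize all replicas (e.g., via
    // consensus), then apply op to the agreed-upon state; blocks until
    // synchronization succeeds and returns op's result.
    def tOp[R](op: T => R): R
    // Convergent operation: apply op locally without synchronization and
    // propagate it asynchronously to the other replicas; must be commutative.
    def cvOp(op: T => Unit): Unit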

Slide 25

Results
• A formal definition of the observable atomic consistency (OAC) model
• A mechanized model of OACP, implemented in Maude, with checked properties:
  • the state of all replicas is made consistent upon executing a totally-ordered operation;
  • the protocol preserves the order defined by OAC.
• An experimental evaluation including latency, throughput, coordination overhead, and scalability
Xin Zhao, Philipp Haller: Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020)
https://doi.org/10.1016/j.jlamp.2020.100561 (open access)

Slide 26

Mixing Consistency Models
• Supporting multiple consistency models within the same application
  • is important in order to achieve both consistency and availability, as needed;
  • is prone to catastrophic errors.
• Mutating strongly consistent data based on weakly consistent data violates strong consistency
• On the other hand, using strongly consistent data as input to a weakly consistent operation is safe

Slide 27

CTRD: Consistency Types for Replicated Data
• A type system that distinguishes values according to their consistency
• Consistency is represented as labels attached to types and values
• A label can be con (consistent), oac (OAC), or ava (available)
• The labels are ordered; the label ordering expresses the permitted data flow: con → oac → ava
• Labeled types are covariant in their labels: a con-labeled value may be used where an ava-labeled one is expected, but not vice versa
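To build intuition for covariant labels, here is a sketch of my own in Scala (an encoding for illustration, not the CTRD formalization): labels become a subtype hierarchy mirroring the permitted flow, and the labeled type is covariant in its label parameter, so the compiler rejects the flow from ava to con:

  // Sketch: labels as a subtype hierarchy, Con <: Oac <: Ava,
  // mirroring the permitted data flow con -> oac -> ava.
  sealed trait Ava
  sealed trait Oac extends Ava
  sealed trait Con extends Oac

  // A labeled value, covariant in its label L.
  final case class Labeled[+L, A](value: A)

  object LabeledDemo:
    def useAvailable(x: Labeled[Ava, Int]): Int = x.value
    def useConsistent(x: Labeled[Con, Int]): Int = x.value

    val strong: Labeled[Con, Int] = Labeled(1)
    val weak: Labeled[Ava, Int] = Labeled(2)

    val ok = useAvailable(strong)  // OK: con data may flow into an ava context
    // val bad = useConsistent(weak) // rejected: ava data must not flow into a con context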

Slide 28

Syntax
• Essentially: the simply-typed lambda calculus (STLC) extended with distributed ML-style references and labels
• Main influence: Toro et al. (TOPLAS 2018)

Slide 29

Select Typing Rules (CTRD)
• Example 1 (illegal): t^con := s^ava
  • A direct flow: an ava-labeled value must not be assigned to a con-labeled reference
• Example 2 (illegal): if x^ava then t^con := 1^con else t^con := 0^con
  • An implicit flow: the con-labeled reference is updated depending on an ava-labeled condition
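Example 2 shows why subtyping alone is not enough and a dedicated information-flow type system is needed. In the illustrative Scala encoding from above (again my own sketch, not CTRD), the implicit flow would slip through, because no ava value is ever assigned directly:

  // A mutable reference holding a labeled value (illustrative only).
  final class LRef[L, A](var contents: Labeled[L, A])

  // Example 2 as an implicit flow: the con-labeled reference ends up
  // depending on the ava-labeled condition even though only con values
  // are written. Plain subtyping accepts this; CTRD's typing rules,
  // which track the label of the control-flow context, reject it.
  def implicitFlow(x: Labeled[Ava, Boolean], t: LRef[Con, Int]): Unit =
    if x.value then t.contents = Labeled(1)
    else t.contents = Labeled(0)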

Slide 30

Results
• A distributed small-step operational semantics
  • Formalizes RDTs including observable atomic consistency; operations via message passing
• Proofs of correctness properties:
  • Type soundness → no run-time label violations!
  • Noninterference: e.g., mutation of ava-labeled references cannot be observed via con-labeled values
• Proofs of consistency properties:
  • Theorem: for con operations, CTRD ensures sequential consistency
  • Theorem: for ava operations, CTRD ensures eventual consistency

Slide 31

Summary
CTRD: Consistency Types for Replicated Data
• A distributed, higher-order language with replicated data types and consistency labels
• Enables safe mixing of strongly consistent and available (weakly consistent) data
• Proofs of type soundness, noninterference, and consistency properties
• Integrates observable atomic consistency (OAC)
Xin Zhao, Philipp Haller: Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021)
https://doi.org/10.22152/programming-journal.org/2021/5/6

Slide 32

Scalability, Reliability, and Simplicity
Scalability
• Low latency and high throughput require
  • asynchronous operations
  • task, pipeline, and data parallelism
• Distribution requires
  • fault tolerance
  • minimizing coordination whenever possible
Reliability & trust
• Provable properties
Simplicity
• Strong guarantees, automatic checking, potential for wide adoption

Slide 33

Problem: Reliable Composition of Services
• Application logic in workflows A, B, and C, each processing a stream of incoming events (typically deployed across many machines in a data center)
• Problems:
  • Problem 1: workflows cannot communicate directly with each other → custom dispatching logic needed
  • Problem 2: how to ensure fault tolerance of the communication between the workflows?
    • Ideally: exactly-once processing of each event
• Current practice:
  • Communicate events via reliable, distributed logs (e.g., Apache Kafka)
  • Major problem: events are dispatched via unreliable custom logic

Slide 34

The Portals Project — Goals
• Simplify the composition of reliable, decentralized services
  • Service = a long-running process consuming and producing events
  • Deployed in the cloud and on the edge → no centralized coordination
• Provide strong execution guarantees:
  • Transactional processing
  • Exactly-once processing
• Formalization and correctness proofs
• Open-source implementation enabling experimental evaluation and extensions

Slide 35

Workflows
• Workflow as the unit of computation = a DAG of stateful tasks
• A workflow consumes and produces streams of atoms
• Atom = a batch of events
[Figure: a Workflow[T, U], from src through its tasks to sink, consuming an AtomicStream[T] and producing an AtomicStream[U]]

Slide 36

Atomic Processing Contract
Data is processed through atomic (transactional) steps:
• Consume an atom (“batch of events”) from the input stream
• Process all events in the atom
• Perform all state updates
• Produce one atom containing all emitted events
[Figure: a Workflow[T, U] consuming an AtomicStream[T] and producing an AtomicStream[U]]
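A schematic sketch of one such step (my own illustration; the names and types are assumptions, not the Portals API):

  // Schematic sketch of the atomic processing contract; illustrative only.
  // A step is all-or-nothing: the state updates and the emitted events
  // become visible together, as a single output atom.
  final case class Atom[E](events: Vector[E])

  def atomicStep[T, U, S](
      atom: Atom[T],
      state: S,
      process: (S, T) => (S, Vector[U])
  ): (S, Atom[U]) = {
    // Process all events of the atom, threading the task state through,
    // and collect every emitted event into one output atom.
    val (finalState, emitted) =
      atom.events.foldLeft((state, Vector.empty[U])) {
        case ((s, out), event) =>
          val (s2, us) = process(s, event)
          (s2, out ++ us)
      }
    // Commit point: the new state and the output atom take effect together.
    (finalState, Atom(emitted))
  }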

Slide 37

Portals
A Portal exposes a dataflow as a service via bidirectional communication
[Figure: a requesting dataflow sends Requests to, and receives Replies from, a responding dataflow (src → tasks → sink) through a Portal Service and its Access Operator]

Slide 38

Portals: End-to-End Exactly-Once Processing
• Atomic streams = reliable, distributed streams with a transactional interface
• Communication via Portals is based on atomic streams
• Key property: atomic streams + atomic processing contract → end-to-end exactly-once processing
• Current practice: no guarantees for the dispatcher and workflow composition
• Portals: exactly-once processing guaranteed for the composition of workflows

Slide 39

The Portals Model
• The actor model¹: actors are independent concurrent processes that communicate by asynchronous message passing; in response to a message, an actor can:
  • send messages to other actors;
  • change its behavior/state;
  • create new actors.
• Compared to the actor model, Portals (workflows):
  • guarantees exactly-once processing, adds data parallelism, and
  • removes fully-dynamic communication topologies: workflows cannot create workflows.
• Compared to previous models for stateful serverless (Durable Functions, …), Portals:
  • supports communication cycles,
  • adds dataflow composition, and
  • introduces serializable state updates (not shown).
For details, see: Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171. https://doi.org/10.1145/3563835.3567664
¹ Gul Agha: Concurrent Object-Oriented Programming. Commun. ACM 33(9): 125-141 (1990)

Slide 40

Portals Open Source Project
• A prototype implementation under active development
• Open source, Apache 2.0 License
• Written in Scala 3, a high-level language combining functional and object-oriented programming
• Repository on GitHub: https://github.com/portals-project/portals

Slide 41

Portals Playground
• The Portals Playground enables running Portals applications in the web browser: portals-project.org/playground/
• Made possible by compiling the Portals framework to JavaScript using Scala.js, the Scala-to-JavaScript compiler

Slide 42

Portals: Summary
• An extension of dataflow streaming with atomic streams and portals
• Portals enable direct communication between workflows
• End-to-end exactly-once processing is guaranteed via the atomic processing contract in combination with atomic streams
• Ongoing work on formalization and correctness proofs
• Project website: portals-project.org
Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171. https://doi.org/10.1145/3563835.3567664

Slide 43

Conclusion
• Key challenge: the design and implementation of a programming system that
  • supports emerging applications and workloads;
  • provides reliability and trust; and
  • embraces simplicity and accessibility.
• Realizing this vision requires work on:
  • consistency models and distributed protocols;
  • type systems and/or program verification;
  • programming models that enable scalability, fault tolerance, and simplicity.
[Figure: Observable Atomic Consistency, Consistency Types, and Portals, together supporting Scalability, Fault Tolerance, and Simplicity]

Slide 44

Postdoc Fellowship Opportunity
• Are you a PhD student in your final year? Are you a postdoc?
• Digital Futures Research Center (KTH, Stockholm University, RISE)
• Fully funded 2-year postdoc fellowships
• Project defined by the postdoc fellow
• Calls twice a year; closing dates typically in November and March
• Notification ~2 months later
• Info (check also the closed calls!): https://www.digitalfutures.kth.se/

Slide 45

Thanks!
Do you have any questions?