Royal Institute of Technology Stockholm, Sweden Philipp Haller 34th Nordic Workshop on Programming Theory (NWPT 2023) Västerås, Sweden, November 22, 2023
Unique opportunity for students to present their original research at <Programming> before judges
• Two categories: undergraduate and graduate
• Poster presentation & short research talk
• Three winners receive $500, $300, and $200, respectively
• First-place winners advance to ACM SRC Grand Finals
• <Programming> Conference 2024:
  • March 11-14, 2024, in Lund, Sweden
  • https://2024.programming-conference.org/
- ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.
won’t have to use different apps for different tasks. You’ll simply tell your device, in everyday language, what you want to do.”
“In the near future, anyone who’s online will be able to have a personal assistant powered by artificial intelligence that’s far beyond today’s technology.”
Results of an analysis by OpenAI in 2018:
largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.”
new generation of digital infrastructure
• Cloud computing: rapid growth, de-facto deployment platform
• AI: rapid adoption of AI-powered applications, rapidly increasing demand for computing power
• Distributed software systems at the core
• Cloud computing based on large-scale distributed computing infrastructure
• Computing power the bottleneck for training and application of ML models
• Rise of specialized hardware architectures
• supports emerging applications and workloads;
• provides reliability and trust; and
• embraces simplicity and accessibility.
Scalability, Reliability, and Simplicity
latency and availability for geographically distributed clients
• Challenge: round-trip latency
  • < 2 ms between servers within the same datacenter
  • up to two orders of magnitude higher between datacenters in different countries
Naive reuse of single-datacenter application architectures and protocols leads to poor performance!
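To make the latency gap concrete, here is a small back-of-the-envelope calculation. It assumes that "two orders of magnitude" means a factor of 100 and uses a hypothetical protocol that needs three sequential round trips per request; neither number comes from a measurement.

```python
# Assumptions: 2 ms is the slide's intra-datacenter upper bound; "two orders of
# magnitude" is taken as a factor of 100; ROUND_TRIPS is a hypothetical value.
INTRA_DC_RTT_MS = 2.0
CROSS_DC_RTT_MS = INTRA_DC_RTT_MS * 100
ROUND_TRIPS = 3


def request_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Latency of a request that needs `round_trips` sequential round trips."""
    return round_trips * rtt_ms


same_dc = request_latency_ms(ROUND_TRIPS, INTRA_DC_RTT_MS)
cross_dc = request_latency_ms(ROUND_TRIPS, CROSS_DC_RTT_MS)
print(f"same datacenter: {same_dc} ms, across countries: {cross_dc} ms")
```

Under these assumptions, a protocol that is unnoticeable inside one datacenter costs several hundred milliseconds per request once its round trips cross country borders.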
and performance than strong consistency (= serializing updates in a global total order)
• Each update executes at some replica (e.g., geographically closest) without synchronization
• Each update is propagated asynchronously to the other replicas
• All updates eventually take effect at all replicas, possibly in different orders
• Updates are required to be commutative
Image source: Shapiro, Preguiça, Baquero, and Zawirski: Conflict-Free Replicated Data Types. SSS 2011
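The requirement that updates commute can be illustrated with a minimal sketch. The replica state (a dictionary of counters) and the update format `(key, delta)` are hypothetical choices made for this example only; the point is that every delivery order of the same updates produces the same final state.

```python
from itertools import permutations


def apply_update(state, update):
    """Apply one (key, delta) update; adding deltas is commutative."""
    key, delta = update
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + delta
    return new_state


def replay(updates):
    """Apply a sequence of updates to an initially empty replica."""
    state = {}
    for update in updates:
        state = apply_update(state, update)
    return state


updates = [("likes", 1), ("views", 3), ("likes", 2), ("views", -1)]

# Each permutation models a replica that received the updates in another order.
final_states = [replay(order) for order in permutations(updates)]
converged = all(state == final_states[0] for state in final_states)
print(converged)
```

If the updates were assignments such as "set likes to 2" instead of deltas, different orders could leave different final values, which is why non-commutative operations need additional coordination.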
Idea: multiple users can edit the grocery list concurrently on their phones
• Key feature: grocery list should support offline editing
• Supported operations:
  • Add item to grocery list
  • Remove item from grocery list
  • Mark item as “picked up”
Diagram (two phones, steps in order):
• Both phones online, both grocery lists: Potatoes, Salad
• Both grocery lists: Potatoes, Salad, Cola
• Phone loses reception… (offline)
• One phone: Remove “Cola”, then Add “Tonic Water”; its list becomes Potatoes, Salad, Tonic Water
• Same phone: Pick up “Tonic Water”
• Other phone (list still Potatoes, Salad, Cola): Pick up “Cola”
• Phone comes back online…
• Grocery lists after synchronization: Potatoes, Salad, Cola, Tonic Water and Potatoes, Salad, Tonic Water, Cola
Problem: picked up both Cola and Tonic Water! Only one of Cola or Tonic Water should be bought!
pick-ups while offline can lead to double pick-ups
• Marking an item as “picked up” is problematic if the lists are out of sync
  • Possible disagreement about what should be picked up
• Solution 1:
  • Forbid removing items from the grocery list, and
  • Make “pick up” a blocking operation that only works online
  • Then, at most one user is going to pick up each item
  • However: restrictive, since it limits possible changes to the list
• Solution 2:
  • When a user tries to mark an item as “picked up”:
    • Force synchronization of all replicas → block until synchronized
    • Try to perform “pick-up” on all synchronized replicas
Diagram (two phones, steps in order):
• Both phones online, both grocery lists: Potatoes, Salad
• Both grocery lists: Potatoes, Salad, Cola
• Phone loses reception… (offline)
• Offline edits on one phone: Remove “Cola”, then Add “Tonic Water”; its list becomes Potatoes, Salad, Tonic Water
• Phone comes back online…
• Try to pick up “Cola”
• Sync: remove “Cola” + add “Tonic Water”
• Error: cannot pick up “Cola”: not on list
• Alice pressed “OK” and puts the Cola back on the shelf
• Pick up “Tonic Water”…
• Final grocery list: Potatoes, Salad, Tonic Water
Correct solution: Before attempting to mark as “picked up”, replicas are synchronized (block until sync possible)
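The synchronize-before-pick-up idea can be sketched as a single-process simulation. Everything here is a hypothetical simplification: the `Replica` class, the `pending` edit log, and the second user "bob" are invented for illustration, conflicting concurrent edits are not handled, and "blocking until synchronized" is modeled as an ordinary function call.

```python
class Replica:
    """One phone's copy of the grocery list (hypothetical, simplified)."""

    def __init__(self, items):
        self.items = set(items)
        self.picked = set()
        self.pending = []  # local edits not yet sent to the other phone

    def add(self, item):
        self.items.add(item)
        self.pending.append(("add", item))

    def remove(self, item):
        self.items.discard(item)
        self.pending.append(("remove", item))


def synchronize(a, b):
    """Exchange pending edits in both directions. Assumes both phones are
    online and that the edits do not conflict with each other."""
    for src, dst in ((a, b), (b, a)):
        for op, item in src.pending:
            if op == "add":
                dst.items.add(item)
            else:
                dst.items.discard(item)
    a.pending.clear()
    b.pending.clear()


def pick_up(item, me, other):
    """Synchronize first, then mark the item as picked up on both replicas."""
    synchronize(me, other)
    if item not in me.items:
        return f"Error: cannot pick up {item}: not on list"
    me.picked.add(item)
    other.picked.add(item)
    return f"Picked up {item}"


alice = Replica(["Potatoes", "Salad", "Cola"])
bob = Replica(["Potatoes", "Salad", "Cola"])

# Offline edits on the second phone.
bob.remove("Cola")
bob.add("Tonic Water")

# Back online: every pick-up synchronizes before it takes effect.
first = pick_up("Cola", alice, bob)
second = pick_up("Tonic Water", alice, bob)
print(first)
print(second)
```

Because the pick-up of "Cola" is only evaluated after the offline edits have arrived, it is rejected, and only "Tonic Water" ends up marked as picked up.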
based on Conflict-Free Replicated Data Types (CRDTs)
• Extend CRDTs with on-demand sequential consistency
  • Add support for sequentially consistent operations
  • Don’t have to be commutative!
• Define a consistency model and a distributed protocol that enforces the consistency model → Observable Atomic Consistency Protocol (OACP)
(RDT) storing values of a lattice (actually, a join-semilattice)
• Example 1: lattice = natural numbers where join(x, y) = max(x, y)
• Example 2: lattice = subset lattice where join(s, q) = s.union(q)
• Operations with different consistency levels:
  • A totally-ordered operation (“TOp”) atomically:
    • synchronizes the replicas (e.g., using distributed consensus: Paxos, Raft, …); and
    • applies the operation to the state of each replica.
  • A convergent operation (“CvOp”) is commutative and processed asynchronously.
• Let’s have a look at an example…
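A minimal sketch of the two operation kinds, under strong simplifying assumptions: all replicas live in one process, `sync` stands in for the consensus-based synchronization step, and the names `Replica`, `cv_op`, `t_op`, and `sync` are hypothetical and not taken from the OACP implementation.

```python
from functools import reduce


def join_max(x, y):
    """Join for the lattice of natural numbers."""
    return max(x, y)


def join_union(s, q):
    """Join for the subset lattice."""
    return s.union(q)


class Replica:
    def __init__(self, bottom, join):
        self.state = bottom
        self.join = join

    def cv_op(self, value):
        """Convergent operation: merge a value locally, without coordination."""
        self.state = self.join(self.state, value)


def sync(replicas):
    """Stand-in for synchronization: every replica adopts the join of all states."""
    merged = reduce(replicas[0].join, (r.state for r in replicas))
    for r in replicas:
        r.state = merged


def t_op(replicas, op):
    """Totally-ordered operation: synchronize, then apply `op` at each replica.
    `op` does not have to be commutative."""
    sync(replicas)
    for r in replicas:
        r.state = op(r.state)


# Subset lattice: concurrent adds converge, then a removal runs as a TOp.
r1, r2 = Replica(set(), join_union), Replica(set(), join_union)
r1.cv_op({"a"})
r2.cv_op({"b"})
t_op([r1, r2], lambda s: s - {"a"})
print(r1.state, r2.state)

# Natural numbers with max as join.
c1, c2 = Replica(0, join_max), Replica(0, join_max)
c1.cv_op(3)
c2.cv_op(5)
sync([c1, c2])
print(c1.state, c2.state)
```

The removal is a useful example of a TOp because it is not a join: applied without prior synchronization, it could be undone by a later merge with a replica that still contains the element.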
(OAC) model
• A mechanized model of OACP implemented using Maude with checked properties:
  • the state of all replicas is made consistent upon executing a totally-ordered operation;
  • the protocol preserves the order defined by OAC.
• An experimental evaluation including latency, throughput, coordination overhead, and scalability
Xin Zhao, Philipp Haller: Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020) https://doi.org/10.1016/j.jlamp.2020.100561 (open access)
same application
  • is important in order to achieve both consistency and availability, as needed;
  • is prone to catastrophic errors.
• Mutating strongly consistent data based on weakly consistent data violates strong consistency
• On the other hand, using strongly consistent data as input to a weakly-consistent operation is safe
distinguishes values according to their consistency
• Consistency represented as labels attached to types and values
• A label can be con (consistent), oac (OAC) or ava (available)
• Labels are ordered:
  • The label ordering expresses permitted data flow: con → oac → ava
• Labeled types are covariant in their labels
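The permitted data flow con → oac → ava can be illustrated with a small dynamic check. This is only a sketch: the type system described on the slides enforces the labels statically, whereas the hypothetical `Labeled`, `may_flow`, and `assign` below check them at run time.

```python
# Position of each label along the permitted flow con → oac → ava.
ORDER = {"con": 0, "oac": 1, "ava": 2}


class LabelError(Exception):
    """Raised when data would flow against the label ordering."""


class Labeled:
    def __init__(self, value, label):
        self.value = value
        self.label = label


def may_flow(src, dst):
    """Data labeled `src` may be used where `dst` is expected if `src` comes
    no later than `dst` in con → oac → ava."""
    return ORDER[src] <= ORDER[dst]


def assign(target_label, source):
    """Return `source` relabeled as `target_label`, or fail if the flow is not permitted."""
    if not may_flow(source.label, target_label):
        raise LabelError(f"{source.label} data cannot flow into {target_label}")
    return Labeled(source.value, target_label)


balance = Labeled(100, "con")
cached_view = assign("ava", balance)  # consistent data used as available data: allowed

try:
    assign("con", Labeled(95, "ava"))  # available data into consistent data: rejected
    rejected = False
except LabelError:
    rejected = True
print(cached_view.label, rejected)
```

The rejected assignment corresponds to mutating strongly consistent data based on weakly consistent data, which is the flow the labels are meant to rule out.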
observable atomic consistency; operations via message passing
• Proofs of correctness properties:
  • Type soundness → no run-time label violations!
  • Noninterference
    E.g., mutation of ava-labelled references cannot be observed via con-labelled values
• Proofs of consistency properties:
  • Theorem: For con operations, CTRD ensures sequential consistency
  • Theorem: For ava operations, CTRD ensures eventual consistency
higher-order language with replicated data types and consistency labels
• Enables safe mixing of strongly consistent and available (weakly consistent) data
• Proofs of type soundness, noninterference, and consistency properties
• Integrates observable atomic consistency (OAC)
Xin Zhao, Philipp Haller: Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021) https://doi.org/10.22152/programming-journal.org/2021/5/6
A, B and C, each processing a stream of incoming events
• Problems:
  • Problem 1: Workflows cannot communicate directly with each other → custom dispatching logic needed
  • Problem 2: How to ensure fault tolerance of communication between the workflows?
    • Ideally: exactly-once processing of each event
• Current practice:
  • Communicate events via reliable, distributed logs (e.g., Apache Kafka)
  • Major problem: events are dispatched via unreliable custom logic
Typically deployed across many machines in a data center
decentralized services
• Service = long-running process consuming and producing events
• Deployed in the cloud and on the edge → no centralized coordination
• Provide strong execution guarantees:
  • Transactional processing
  • Exactly-once processing
• Formalization and correctness proofs
• Open-source implementation enabling experimental evaluation and extensions
of stateful tasks
• A workflow consumes and produces streams of atoms
• Atom = batch of events
Diagram: AtomicStream[T] → Workflow[T, U] (src → tasks → sink) → AtomicStream[U]
• Consume an atom (“batch of events”) from the input stream
• Process all events in the atom
• Perform all state updates
• Produce one atom containing all emitted events
Diagram: AtomicStream[T] → Workflow[T, U] (src → tasks → sink) → AtomicStream[U]
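The four steps above can be sketched as a single function that processes one atom all-or-nothing. This is not the Portals API: `process_atom`, the handler signature, and the word-count task are hypothetical, and atomicity is modeled simply by staging state updates on a copy.

```python
def process_atom(atom, state, handler):
    """Consume one atom, process all its events, and return
    (new_state, output_atom). State updates are staged on a copy, so if any
    event fails, the caller's state is untouched and no output is produced."""
    staged = dict(state)
    emitted = []
    for event in atom:
        emitted.extend(handler(event, staged))
    return staged, emitted


def count_words(event, state):
    """Hypothetical stateful task: count occurrences and emit the running count."""
    if not isinstance(event, str):
        raise ValueError(f"unexpected event: {event!r}")
    state[event] = state.get(event, 0) + 1
    return [(event, state[event])]


state = {}
state, output_atom = process_atom(["a", "b", "a"], state, count_words)
print(state, output_atom)

# An atom containing a bad event is rejected as a whole.
try:
    process_atom(["a", 42], state, count_words)
    failed = False
except ValueError:
    failed = True
print(failed, state)
```

Because the state is replaced only after every event in the atom has been handled, a failed atom can be retried without counting any of its events twice.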
streams with transactional interface
• Communication via Portals based on atomic streams
• Key property: atomic streams + atomic processing contract → end-to-end exactly-once processing
Current practice: No guarantees for dispatcher and workflow composition
Portals: Exactly-once processing guaranteed for composition of workflows
  • guarantees exactly-once processing, adds data parallelism, and
  • removes fully-dynamic communication topologies: workflows cannot create workflows.
• Compared to previous models for stateful serverless (Durable Functions, …), Portals:
  • supports communication cycles,
  • adds dataflow composition, and
  • introduces serializable state updates (not shown)
For details, see: Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171 https://doi.org/10.1145/3563835.3567664

The Actor Model
• Actors [1] are independent concurrent processes that communicate by asynchronous message passing
• In response to a message, an actor can:
  • send messages to other actors;
  • change its behavior/state;
  • create new actors.
Portals: (workflows)
[1] Gul Agha: Concurrent Object-Oriented Programming. Commun. ACM 33(9): 125-141 (1990)
development
• Open source, Apache 2.0 License
• Written in Scala 3, a high-level language combining functional and object-oriented programming
• Repository on GitHub: https://github.com/portals-project/portals
the web browser: portals-project.org/playground/
• Made possible by compiling the Portals framework to JavaScript using Scala.js, the Scala-to-JavaScript compiler
streams and portals
• Portals enable direct communication between workflows
• End-to-end exactly-once processing guaranteed via atomic processing contract in combination with atomic streams
• Ongoing work on formalization and correctness proofs
• Project website: portals-project.org
Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171 https://doi.org/10.1145/3563835.3567664
a programming system that
  • supports emerging applications and workloads;
  • provides reliability and trust; and
  • embraces simplicity and accessibility.
• Realizing this vision requires work on:
  • consistency models and distributed protocols;
  • type systems and/or program verification;
  • program models that enable scalability, fault tolerance, and simplicity.
Scalability, Fault Tolerance, Simplicity
Observable Atomic Consistency, Consistency Types, Portals
the final year? Are you a postdoc?
• Digital Futures Research Center (KTH, Stockholm U, RISE)
• Fully-funded 2-year postdoc fellowships
• Project defined by postdoc fellow
• Calls twice a year, closing dates typically in Nov and Mar
• Notification ~ 2 months later
• Info (check also closed calls!): https://www.digitalfutures.kth.se/