
How to Make Distributed Programming Safer using Types

Philipp Haller
November 23, 2023

Transcript

  1. How to Make Distributed Programming Safer using Types
     Philipp Haller
     KTH Royal Institute of Technology, Stockholm, Sweden
     34th Nordic Workshop on Programming Theory (NWPT 2023)
     Västerås, Sweden, November 22, 2023


  3. How to Make Distributed Programming Safer, Simpler, and more Scalable
     Philipp Haller
     KTH Royal Institute of Technology, Stockholm, Sweden
     34th Nordic Workshop on Programming Theory (NWPT 2023)
     Västerås, Sweden, November 22, 2023

  4. Collaborators
     Paris Carbone, Jonas Spenger, Xin Zhao. Thank you!

  5. ACM Student Research Competition at ‹Programming› 2024
     • Unique opportunity for students to present their original research before a panel of judges
     • Two categories: undergraduate and graduate
     • Poster presentation & short research talk
     • Three winners receive $500, $300, and $200, respectively
     • First-place winners advance to the ACM SRC Grand Finals
     • ‹Programming› 2024: March 11-14, 2024, in Lund, Sweden
     • https://2024.programming-conference.org/

  6. Cloud computing continues to grow...
     Recent past (source: Gartner):
     • Spending in 2020: $257.5 billion
     • Spending in 2023: $563.6 billion
     • Spending in 2024 (forecast): $678.8 billion
     Increase of ~164% within 4 years!

  7. What’s the fastest-growing consumer application in history?
     Feb 1 (Reuters) - ChatGPT, the popular chatbot from OpenAI, is estimated to have reached 100 million monthly active users in January, just two months after launch, making it the fastest-growing consumer application in history, according to a UBS study on Wednesday.

  8. “In the next five years, this will change completely. You won’t have to use different apps for different tasks. You’ll simply tell your device, in everyday language, what you want to do.”
     “In the near future, anyone who’s online will be able to have a personal assistant powered by artificial intelligence that’s far beyond today’s technology.”

  9. Results of an analysis by OpenAI in 2018:
     “[…] since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase).
     Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.”

  10. Importance of Large-Scale Distributed Software Systems
     • Massive impact of new generation of digital infrastructure
       • Cloud computing: rapid growth, de-facto deployment platform
       • AI: rapid adoption of AI-powered applications, rapidly increasing demand for computing power
     • Distributed software systems at the core
       • Cloud computing based on large-scale distributed computing infrastructure
       • Computing power the bottleneck for training and application of ML models
       • Rise of specialized hardware architectures

  11. Failures in cloud-based distributed systems can be catastrophic.

  12. Challenges of Building Distributed Systems
     • Fault tolerance requires sophisticated distributed algorithms
     • Scalability requires concurrency, parallelism, and distribution
     • Increasingly targeting specialized hardware
     • High availability often requires weakening consistency
     • Enforcing data-privacy legislation (GDPR, CCPA, …) automatically is difficult
     • Dynamic software upgrades are challenging

  13. Objective
     Scalability, Reliability, and Simplicity
     The design and implementation of a programming system that
     • supports emerging applications and workloads;
     • provides reliability and trust; and
     • embraces simplicity and accessibility.

  14. Scalability, Reliability, and Simplicity
     Scalability
     • Low latency and high throughput require
       • Asynchronous operations
       • Task, pipeline, and data parallelism
     • Distribution requires
       • Fault tolerance
       • Minimizing coordination whenever possible
     Reliability & trust
     • Provable properties
     Simplicity
     • Strong guarantees, automatic checking, potential for wide adoption

  15. Geo-Distribution
     • Operating a service in multiple datacenters can improve latency and availability for geographically distributed clients
     • Challenge: round-trip latency
       • < 2 ms between servers within the same datacenter
       • up to two orders of magnitude higher between datacenters in different countries
     Naive reuse of single-datacenter application architectures and protocols leads to poor performance!

  16. (Partial) Remedy: Eventual Consistency
     • Eventual consistency promises better availability and performance than strong consistency (= serializing updates in a global total order)
     • Each update executes at some replica (e.g., the geographically closest) without synchronization
     • Each update is propagated asynchronously to the other replicas
     • All updates eventually take effect at all replicas, possibly in different orders
     • Updates are required to be commutative
     Image source: Shapiro, Preguiça, Baquero, and Zawirski: Conflict-Free Replicated Data Types. SSS 2011
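The commutativity requirement above can be made concrete with a minimal CRDT sketch (illustrative only, not code from the talk): a grow-only set whose merge is set union, so replicas converge regardless of the order in which updates and merges are applied.

```scala
// Illustrative sketch: a grow-only set (G-Set) CRDT.
// Local adds need no synchronization, and merge is a join (set union):
// commutative, associative, and idempotent, so replicas converge.
final case class GSet[A](elements: Set[A]) {
  def add(x: A): GSet[A] = GSet(elements + x)    // local update, no coordination
  def merge(other: GSet[A]): GSet[A] =           // join of two replica states
    GSet(elements ++ other.elements)
}

object GSetDemo {
  def main(args: Array[String]): Unit = {
    val replicaA = GSet(Set.empty[String]).add("Potatoes").add("Cola")
    val replicaB = GSet(Set.empty[String]).add("Salad")
    // Both merge orders yield the same state: convergence.
    assert(replicaA.merge(replicaB) == replicaB.merge(replicaA))
    println(replicaA.merge(replicaB).elements)
  }
}
```

Note that a G-Set only supports adding; supporting removal as well is exactly what the two-phase set used later in the talk is for.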

  17. Replicated Data: Example
     • Collaborative application: shared grocery list
     • Idea: multiple users can edit the grocery list concurrently on their phones
     • Key feature: the grocery list should support offline editing
     • Supported operations:
       • Add item to grocery list
       • Remove item from grocery list
       • Mark item as “picked up”

  18. Scenario: Alice and Bob
     Both online; grocery list on both phones: Potatoes, Salad.
     • Alice adds “Cola” → both lists: Potatoes, Salad, Cola
     • Bob’s phone loses reception… (offline)
     • Bob removes “Cola” and adds “Tonic Water” → Bob’s list: Potatoes, Salad, Tonic Water; Alice’s list still: Potatoes, Salad, Cola
     • Bob picks up “Tonic Water”
     • Bob’s phone comes back online…
     • Alice picks up “Cola”
     • After synchronization, both lists contain: Potatoes, Salad, Cola, Tonic Water
     Problem: picked up both Cola and Tonic Water! Only one of Cola or Tonic Water should be bought!

  19. Problem and Solutions
     • What is the problem?
       • Allowing pick-ups while offline can lead to double pick-ups
       • Marking an item as “picked up” is problematic if lists are out of sync
       • Possible disagreement about what should be picked up
     • Solution 1:
       • Forbid removing items from the grocery list, and
       • Make “pick up” a blocking operation that only works online
       • Then at most one user is going to pick up each item
       • However: restrictive, since it limits possible changes to the list
     • Solution 2:
       • When a user tries to mark an item as “picked up”:
         • Force synchronization of all replicas → block until synchronized
         • Try to perform the “pick-up” on all synchronized replicas

  20. Scenario: Alice and Bob, Corrected
     Correct solution: before attempting to mark an item as “picked up”, the replicas are synchronized (blocking until synchronization is possible).
     • Both online; grocery list on both phones: Potatoes, Salad
     • Alice adds “Cola” → both lists: Potatoes, Salad, Cola
     • Bob’s phone loses reception… (offline)
     • Bob removes “Cola” and adds “Tonic Water” → Bob’s list: Potatoes, Salad, Tonic Water
     • Bob’s phone comes back online…
     • Alice tries to pick up “Cola” → sync first: remove “Cola” + add “Tonic Water”
     • Error: cannot pick up “Cola”: not on list
     • Alice presses “OK” and puts the Cola back on the shelf
     • Bob picks up “Tonic Water”… Final list on both phones: Potatoes, Salad, Tonic Water

  21. Programming Abstraction and Consistency Model
     Idea:
     • Synchronization of replicas based on Conflict-Free Replicated Data Types (CRDTs)
     • Extend CRDTs with on-demand sequential consistency
       • Add support for sequentially consistent operations
       • These operations don’t have to be commutative!
     • Define a consistency model and a distributed protocol that enforces the consistency model
     → Observable Atomic Consistency Protocol (OACP)

  22. Observable Atomic Consistency (OAC)
     • Basis: a replicated data type (RDT) storing values of a lattice (more precisely, a join-semilattice)
       • Example 1: lattice = natural numbers where join(x, y) = max(x, y)
       • Example 2: lattice = subset lattice where join(s, q) = s.union(q)
     • Operations with different consistency levels:
       • A totally-ordered operation (“TOp”) atomically:
         • synchronizes the replicas, e.g., using distributed consensus (Paxos, Raft, …); and
         • applies the operation to the state of each replica.
       • A convergent operation (“CvOp”) is commutative and processed asynchronously.
     • Let’s have a look at an example…
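The two example lattices can be written as join-semilattices in Scala (an illustrative sketch, not code from the talk; the trait name JoinSemilattice is made up here). The assertions check the join laws that convergent operations rely on.

```scala
// Illustrative sketch: a join-semilattice is a type with a join operation
// that is commutative, associative, and idempotent. Convergent operations
// on replicated data exploit these laws: merges can happen in any order.
trait JoinSemilattice[A] {
  def join(x: A, y: A): A
}

object JoinSemilattice {
  // Example 1 from the slide: natural numbers where join(x, y) = max(x, y)
  val maxNat: JoinSemilattice[Int] = (x, y) => math.max(x, y)
  // Example 2 from the slide: subset lattice where join(s, q) = s.union(q)
  def union[A]: JoinSemilattice[Set[A]] = (s, q) => s.union(q)
}

object LatticeDemo {
  def main(args: Array[String]): Unit = {
    import JoinSemilattice._
    assert(maxNat.join(3, 7) == maxNat.join(7, 3))   // commutative
    assert(maxNat.join(5, 5) == 5)                   // idempotent
    assert(union[String].join(Set("a"), Set("b")) == Set("a", "b"))
  }
}
```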

  23. OACP Applied to the Grocery List Example
     Based on the two-phase set (2P-Set) CRDT: once removed, an item can never be added again.

     class GroceryList(var added: Set[String], var removed: Set[String]) {
       private var pickedUp = Set[String]()
       def lookup(item: String) = added.contains(item) && !removed.contains(item)
       def add(item: String) = added += item
       def remove(item: String) = removed += item
       def merge(other: GroceryList) =
         new GroceryList(added.union(other.added), removed.union(other.removed))
       def pickup(item: String) = pickedUp += item
     }

     def markAsPickedUp(groceries: Ref[GroceryList], item: String) =
       groceries.tOp(glist => {
         val success = glist.lookup(item)
         if success then glist.pickup(item)
         success
       })
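The slide’s markAsPickedUp relies on a Ref[GroceryList] offering a totally-ordered operation tOp. As a runnable sketch, the following substitutes a hypothetical single-replica Ref whose tOp simply applies the operation to one local copy; a real OACP implementation would first synchronize all replicas (e.g., via consensus). The merge method is omitted since only one replica exists here.

```scala
// Hypothetical single-replica stand-in for the Ref/tOp interface on the
// slide, so the grocery-list example can be run locally. Not the OACP
// implementation: a real tOp synchronizes all replicas before applying op.
final class Ref[A](private var state: A) {
  def tOp[B](op: A => B): B = synchronized { op(state) }
}

class GroceryList(var added: Set[String], var removed: Set[String]) {
  private var pickedUp = Set[String]()
  def lookup(item: String) = added.contains(item) && !removed.contains(item)
  def add(item: String) = added += item
  def remove(item: String) = removed += item
  def pickup(item: String) = pickedUp += item
}

object PickupDemo {
  def markAsPickedUp(groceries: Ref[GroceryList], item: String): Boolean =
    groceries.tOp { glist =>
      val success = glist.lookup(item)     // item present and not removed?
      if (success) glist.pickup(item)      // mark it only if still on the list
      success
    }

  def main(args: Array[String]): Unit = {
    val groceries = new Ref(new GroceryList(Set("Cola", "Salad"), Set.empty))
    println(markAsPickedUp(groceries, "Cola"))         // true: Cola is on the list
    println(markAsPickedUp(groceries, "Tonic Water"))  // false: not on the list
  }
}
```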

  24. OACP Applied to the Grocery List Example (continued)
     Same code as the previous slide; the point highlighted here is that the pick-up is performed using a totally-ordered operation (tOp), which synchronizes all replicas before checking for and marking the item.

  25. Results
     • A formal definition of the observable atomic consistency (OAC) model
     • A mechanized model of OACP implemented using Maude, with checked properties:
       • the state of all replicas is made consistent upon executing a totally-ordered operation;
       • the protocol preserves the order defined by OAC.
     • An experimental evaluation including latency, throughput, coordination overhead, and scalability
     Xin Zhao, Philipp Haller: Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020). https://doi.org/10.1016/j.jlamp.2020.100561 (open access)

  26. Mixing Consistency Models
     • Supporting multiple consistency models within the same application
       • is important in order to achieve both consistency and availability, as needed;
       • is prone to catastrophic errors.
     • Mutating strongly consistent data based on weakly consistent data violates strong consistency
     • On the other hand, using strongly consistent data as input to a weakly consistent operation is safe

  27. CTRD: Consistency Types for Replicated Data
     • Type system that distinguishes values according to their consistency
     • Consistency represented as labels attached to types and values
     • A label can be con (consistent), oac (OAC), or ava (available)
     • Labels are ordered: con ⊑ oac ⊑ ava
     • The label ordering expresses permitted data flow: con → oac → ava
     • Labeled types are covariant in their labels
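The permitted data flow con → oac → ava can be mimicked with a phantom-type encoding in Scala (an illustrative sketch, not the CTRD type system; the names Labeled, Con, Oac, and Ava are made up here): subtyping between the label types, plus covariance of the labeled type, lets consistent data flow into available contexts but not the reverse.

```scala
// Illustrative sketch: consistency labels as phantom types, with subtyping
// Con <: Oac <: Ava mirroring the permitted data flow con -> oac -> ava.
sealed trait Ava
sealed trait Oac extends Ava
sealed trait Con extends Oac

// Covariant in the label L: Labeled[Con, A] <: Labeled[Ava, A].
final case class Labeled[+L, +A](value: A)

object LabelDemo {
  // An operation that accepts any label (ava is the least consistent).
  def weaklyConsistentInput(x: Labeled[Ava, Int]): Int = x.value

  def main(args: Array[String]): Unit = {
    val strong: Labeled[Con, Int] = Labeled(42)
    // OK: con-labeled data may flow into an ava-consuming operation.
    println(weaklyConsistentInput(strong))
    // The reverse (passing a Labeled[Ava, Int] where a Labeled[Con, Int]
    // is required) would be rejected at compile time.
  }
}
```

This is only a flavor of the idea: CTRD additionally tracks labels through references, assignments, and control flow, which a plain subtyping encoding does not capture.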

  28. Syntax
     • Essentially: the simply-typed lambda calculus (STLC) extended with distributed ML-style references and labels
     • Main influence: Toro et al. (TOPLAS 2018)

  29. Select Typing Rules (CTRD)
     • Example 1: t^con := s^ava (illegal: ava-labeled data must not flow into a con-labeled reference)
     • Example 2: if x^ava then t^con := 1^con else t^con := 0^con (illegal: assigning to a con-labeled reference under an ava-labeled condition would let weakly consistent data influence strongly consistent state)

  30. Results
     • Distributed small-step operational semantics
       • Formalizes RDTs including observable atomic consistency; operations via message passing
     • Proofs of correctness properties:
       • Type soundness → no run-time label violations!
       • Noninterference: e.g., mutation of ava-labelled references cannot be observed via con-labelled values
     • Proofs of consistency properties:
       • Theorem: For con operations, CTRD ensures sequential consistency
       • Theorem: For ava operations, CTRD ensures eventual consistency

  31. Summary
     CTRD: Consistency Types for Replicated Data
     • A distributed, higher-order language with replicated data types and consistency labels
     • Enables safe mixing of strongly consistent and available (weakly consistent) data
     • Proofs of type soundness, noninterference, and consistency properties
     • Integrates observable atomic consistency (OAC)
     Xin Zhao, Philipp Haller: Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021). https://doi.org/10.22152/programming-journal.org/2021/5/6

  32. Scalability, Reliability, and Simplicity
     Scalability
     • Low latency and high throughput require
       • Asynchronous operations
       • Task, pipeline, and data parallelism
     • Distribution requires
       • Fault tolerance
       • Minimizing coordination whenever possible
     Reliability & trust
     • Provable properties
     Simplicity
     • Strong guarantees, automatic checking, potential for wide adoption

  33. Problem: Reliable Composition of Services
     (Typically deployed across many machines in a data center)
     • Application logic in workflows A, B, and C, each processing a stream of incoming events
     • Problems:
       • Problem 1: Workflows cannot communicate directly with each other → custom dispatching logic needed
       • Problem 2: How to ensure fault tolerance of communication between the workflows?
         • Ideally: exactly-once processing of each event
     • Current practice:
       • Communicate events via reliable, distributed logs (e.g., Apache Kafka)
       • Major problem: events are dispatched via unreliable custom logic

  34. The Portals Project: Goals
     • Simplify composition of reliable, decentralized services
       • Service = long-running process consuming and producing events
       • Deployed in the cloud and on the edge → no centralized coordination
     • Provide strong execution guarantees:
       • Transactional processing
       • Exactly-once processing
     • Formalization and correctness proofs
     • Open-source implementation enabling experimental evaluation and extensions

  35. Workflows
     • Workflow as the unit of computation = DAG of stateful tasks
     • A workflow consumes and produces streams of atoms
       • Atom = batch of events
     (Diagram: AtomicStream[T] → src → tasks → sink → AtomicStream[U], together forming a Workflow[T, U])

  36. Atomic Processing Contract
     Data is processed through atomic (transactional) steps:
     • Consume an atom (“batch of events”) from the input stream
     • Process all events in the atom
     • Perform all state updates
     • Produce one atom containing all emitted events
     (Diagram: AtomicStream[T] → src → tasks → sink → AtomicStream[U], together forming a Workflow[T, U])
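The four steps above can be sketched schematically (not the actual Portals API; the names Atom and Task are hypothetical): one step consumes an atom, processes all of its events, applies the state updates together, and emits exactly one output atom.

```scala
// Schematic sketch of the atomic processing contract. Each step is
// all-or-nothing over one whole atom: state updates take effect together,
// and exactly one output atom is produced per input atom.
final case class Atom[A](events: Vector[A])  // an atom is a batch of events

final class Task[T, U](var state: Long, onEvent: (Long, T) => (Long, Vector[U])) {
  def step(atom: Atom[T]): Atom[U] = {
    var s = state
    val out = Vector.newBuilder[U]
    for (ev <- atom.events) {          // process all events in the atom
      val (s2, emitted) = onEvent(s, ev)
      s = s2
      out ++= emitted
    }
    state = s                          // perform all state updates
    Atom(out.result())                 // produce one atom with all emitted events
  }
}

object AtomicDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical task: counts events and emits the running count.
    val counter = new Task[String, Long](0L, (n, _) => (n + 1, Vector(n + 1)))
    println(counter.step(Atom(Vector("a", "b", "c"))).events)  // Vector(1, 2, 3)
  }
}
```

In a real system the step would additionally be made transactional against failures (e.g., via checkpointing), which this local sketch leaves out.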

  37. Portals
     A Portal exposes a dataflow as a service via bidirectional communication.
     (Diagram: a requesting dataflow sends requests through the Portal Service to a responding dataflow, which handles them via an access operator (src → tasks → sink) and sends back replies.)

  38. Portals: End-to-End Exactly-Once Processing
     • Atomic streams = reliable distributed streams with a transactional interface
     • Communication via Portals is based on atomic streams
     • Key property: atomic streams + atomic processing contract → end-to-end exactly-once processing
     Current practice: no guarantees for dispatcher and workflow composition.
     Portals: exactly-once processing guaranteed for the composition of workflows.

  39. The Portals Model
     The Actor Model
     • Actors¹ are independent concurrent processes that communicate by asynchronous message passing
     • In response to a message, an actor can:
       • send messages to other actors;
       • change its behavior/state;
       • create new actors.
     Portals (workflows)
     • Compared to the actor model, Portals:
       • guarantees exactly-once processing, adds data parallelism, and
       • removes fully-dynamic communication topologies: workflows cannot create workflows.
     • Compared to previous models for stateful serverless (Durable Functions, …), Portals:
       • supports communication cycles,
       • adds dataflow composition, and
       • introduces serializable state updates (not shown).
     ¹ Gul Agha: Concurrent Object-Oriented Programming. Commun. ACM 33(9): 125-141 (1990)
     For details, see: Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171. https://doi.org/10.1145/3563835.3567664

  40. Portals Open Source Project
     • A prototype implementation under active development
     • Open source, Apache 2.0 License
     • Written in Scala 3, a high-level language combining functional and object-oriented programming
     • Repository on GitHub: https://github.com/portals-project/portals

  41. Portals Playground
     • The Portals Playground enables running Portals applications in the web browser: portals-project.org/playground/
     • Made possible by compiling the Portals framework to JavaScript using Scala.js, the Scala-to-JavaScript compiler

  42. Portals: Summary
     • An extension of dataflow streaming with atomic streams and portals
     • Portals enable direct communication between workflows
     • End-to-end exactly-once processing guaranteed via the atomic processing contract in combination with atomic streams
     • Ongoing work on formalization and correctness proofs
     • Project website: portals-project.org
     Jonas Spenger, Paris Carbone, Philipp Haller: Portals: An Extension of Dataflow Streaming for Stateful Serverless. Onward! 2022: 153-171. https://doi.org/10.1145/3563835.3567664

  43. Conclusion
     • Key challenge: the design and implementation of a programming system that
       • supports emerging applications and workloads;
       • provides reliability and trust; and
       • embraces simplicity and accessibility.
     • Realizing this vision requires work on:
       • consistency models and distributed protocols (Observable Atomic Consistency);
       • type systems and/or program verification (Consistency Types);
       • programming models that enable scalability, fault tolerance, and simplicity (Portals).

  44. Postdoc Fellowship Opportunity
     • Are you a PhD student in your final year? Are you a postdoc?
     • Digital Futures Research Center (KTH, Stockholm University, RISE)
       • Fully-funded 2-year postdoc fellowships
       • Project defined by the postdoc fellow
       • Calls twice a year, closing dates typically in November and March
       • Notification ~2 months later
     • Info (check also closed calls!): https://www.digitalfutures.kth.se/

  45. Thanks!
     Do you have any questions?