Failure Transparency in Dataflow Systems

Philipp Haller
June 05, 2025


Transcript

  1. Failure Transparency in Dataflow Systems
     Philipp Haller, KTH Royal Institute of Technology, Stockholm, Sweden
     Joint work with Jonas Spenger, Aleksey Veresov, and Paris Carbone
     Uppsala, Sweden, June 5, 2025
  2. Dataflow Systems
     • Typical application: processing of streams of events/data
     • Wide adoption in modern cloud infrastructure
     • Example: Apache Flink is used to power thousands of streaming jobs at Uber and ByteDance
     • Other widely used systems: Apache Spark, Google Dataflow, Azure Event Hubs
     • Essential: recovery from failures, as failures are to be expected in any long-running streaming job
     • Problem: failure recovery is difficult!
       • Failure recovery protocols must balance efficiency and reliability
       • As a result, practical failure recovery protocols are complex
     The correctness of failure recovery protocols is a crucial problem for the reliability of stateful dataflow systems!
  3. Dataflow Example
     • The source “integers” transfers a stream of integer events E<i>
     • The task “incremental average” computes the incremental average of the stream of integers
     • The source “reset” transfers a stream of control messages Reset
     • Processing a Reset event resets the current average to zero
     (A task-level sketch of the incremental average follows below.)
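
To make the example concrete, here is a minimal sketch of the incremental average task as a stateful event handler. The event type, state representation, and names are illustrative assumptions, not the paper's formal model.

```scala
// Illustrative sketch of the incremental average task (not the paper's
// formal model): a stateful handler over integer events and Reset messages.
sealed trait Event
final case class E(i: Int) extends Event   // integer event from source "integers"
case object Reset extends Event            // control message from source "reset"

// Task state: running count and sum; the average is derived from them.
final case class AvgState(count: Long, sum: Long) {
  def average: Double = if (count == 0) 0.0 else sum.toDouble / count
}

object IncrementalAverage {
  val initial: AvgState = AvgState(0, 0)

  // One step of the task: consume an event, produce the next state and
  // the average to emit downstream.
  def onEvent(state: AvgState, event: Event): (AvgState, Double) =
    event match {
      case E(i) =>
        val next = AvgState(state.count + 1, state.sum + i)
        (next, next.average)
      case Reset =>
        (initial, 0.0)   // a Reset event resets the current average to zero
    }
}
```

For instance, feeding E(1), E(2), Reset, E(4) would emit the averages 1.0, 1.5, 0.0, 4.0.
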
  4. Assumptions
     • Message channels are FIFO ordered, a common assumption
     • Failures: assumptions common to asynchronous distributed systems [1]
       • Failures are assumed to be crash-recovery failures: a node loses its volatile state when it crashes
       • We assume the existence of an eventually perfect failure detector, which is used for (eventually) triggering recovery
     • System components (all found in production dataflow systems):
       • Failure-free coordinator implemented using a distributed consensus protocol such as Paxos
       • Snapshot storage is assumed to be persistent and durable, e.g., provided by HDFS
       • The input to the dataflow graph is assumed to be logged such that it can be replayed upon failure, using a durable log system such as Kafka
     [1] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer, 2011
  5. Contributions
     • The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
     • A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
     • The first definition of failure transparency for stateful dataflow systems
     • A proof that the provided implementation model is failure transparent and guarantees liveness
     • A mechanization of the definitions, theorems, and models in Coq
     Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
  6. Failure Recovery
     • Process p2 fails
     • Coordinator (not shown) discovers the failure, triggers a recovery step
     • All processes recover to the latest completed snapshot
     • No failures or incomplete epochs are visible to the observer
     • Intuitively: side effects from failed epochs are ignored
     (Figure: execution with failure vs. observed execution)
  7. Recovery Protocol: Asynchronous Barrier Snapshotting
     1. Process up to a barrier
     2. Barriers are aligned
     3. Upload snapshot & propagate barrier
     4. Continue processing
     Idea: periodically,
     • a “barrier” is input to the dataflow graph
     • when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
     Epochs (not shown):
     • Each task maintains its current epoch
     • Barriers increment epochs
     • Least common snapshot = set of local snapshots at the greatest common epoch
     (A sketch of barrier alignment at a single task follows below.)
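
The alignment and snapshot step at one task could look roughly like the following. This is a hedged sketch under assumed types (channels as strings, callbacks for uploading and propagating), not Flink's actual implementation.

```scala
// Hedged sketch of barrier alignment and snapshotting at one task
// (not Flink's implementation; types and callbacks are assumptions).
final case class Barrier(epoch: Long)

final class AbsTask[S](private var state: S, inputs: Set[String]) {
  private var aligned: Set[String] = Set.empty  // inputs whose barrier has arrived

  // Called when a barrier arrives on one input channel. Between the first
  // and the last barrier of an epoch, that channel is blocked (step 2).
  def onBarrier(channel: String, b: Barrier,
                uploadSnapshot: (Long, S) => Unit,   // persist local snapshot
                propagate: Barrier => Unit): Unit = {
    aligned += channel
    if (aligned == inputs) {              // barriers aligned on all inputs
      uploadSnapshot(b.epoch, state)      // step 3: upload snapshot ...
      propagate(Barrier(b.epoch))         // ... and propagate the barrier downstream
      aligned = Set.empty                 // step 4: continue processing all inputs
    }
  }
}
```
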
  8. Recovery Protocol: Asynchronous Barrier Snapshotting
     1. Process up to a barrier
     2. Barriers are aligned
     3. Upload snapshot & propagate barrier
     4. Continue processing
     Idea: periodically,
     • a “barrier” is input to the dataflow graph
     • when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
     Failure recovery:
     • Triggered by an implicit coordinator (not shown)
     • All tasks in the dataflow graph restart from the least common snapshot
     (A sketch of selecting the least common snapshot follows below.)
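
Selecting the least common snapshot, i.e., the local snapshots at the greatest epoch that every task has completed, could be sketched as follows. The map-based archive representation is an illustrative assumption.

```scala
// Hedged sketch: the least common snapshot is the set of local snapshots
// at the greatest epoch completed by every task. The archive representation
// is an illustrative assumption; assumes at least one task.
object Recovery {
  type TaskId   = String
  type Snapshot = Array[Byte]

  def greatestCommonEpoch(archives: Map[TaskId, Map[Long, Snapshot]]): Long =
    archives.values
      .map(_.keySet)            // epochs with a completed local snapshot, per task
      .reduce(_ intersect _)    // epochs completed by every task
      .max                      // greatest common epoch

  // Recovery: every task restarts from its local snapshot at that epoch.
  def leastCommonSnapshot(archives: Map[TaskId, Map[Long, Snapshot]]): Map[TaskId, Snapshot] = {
    val epoch = greatestCommonEpoch(archives)
    archives.map { case (task, snapshots) => task -> snapshots(epoch) }
  }
}
```
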
  9. An Operational Model of Stateful Dataflow
     • Small-step operational semantics
     • Well suited for modeling concurrent systems
     • A compact set of evaluation rules (schematically sketched below):
       • 3 rules describe a failure-free system
       • 2 rules are related to failures
       • 2 rules are auxiliary
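
One way to picture what “explicit failure rules” means is a labelled step relation in which failure steps are distinguishable, so failure-free executions are exactly those using no failure rule. This is purely schematic; the seven actual rules and configurations are defined in the ECOOP 2024 paper.

```scala
// Schematic only: a labelled step relation with explicit failure rules.
// Rule names and the Config placeholder are illustrative assumptions.
object LabelledSteps {
  sealed trait Rule
  final case class FailureFree(name: String) extends Rule  // 3 such rules
  final case class FailureRel(name: String)  extends Rule  // 2 such rules
  final case class Auxiliary(name: String)   extends Rule  // 2 such rules

  type Config = AnyRef                 // placeholder for a system configuration
  type Step   = (Config, Rule, Config) // one small step, labelled with its rule

  // A failure-free execution is one that never takes a failure-related step.
  def isFailureFree(execution: List[Step]): Boolean =
    execution.forall { case (_, rule, _) => !rule.isInstanceOf[FailureRel] }
}
```
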
  10. Recall: incremental average task
      • The source “integers” transfers a stream of integer events E<i>
      • The task “incremental average” computes the incremental average of the stream of integers
      • The source “reset” transfers a stream of control messages Reset
      • Processing a Reset event resets the current average to zero
  11. Failure Transparency: Example
      • Execution of the incremental average task with a failure and subsequent recovery
      • Snapshot archives: a0 = [0 ↦ 0], a1 = a0[1 ↦ 1], a2 = a1[2 ↦ 4]
        • In a0, epoch 0 is mapped to state 0
        • a1 extends a0, mapping epoch 1 to state 1 (etc.)
      • Processing BDs creates a new snapshot and increments the epoch
      • Purpose of failure transparency: provide an abstraction of a system which hides the internals of failures and failure recovery
      (The archives are sketched in code below.)
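
The archives from the example can be read as maps from epoch to task state. A small sketch, using an Int state as a simplification of the task's actual state:

```scala
// The snapshot archives from the example, read as maps from epoch to
// task state (Int state is an illustrative simplification).
object ArchiveExample {
  type Archive = Map[Long, Int]

  val a0: Archive = Map(0L -> 0)     // a0 = [0 ↦ 0]: epoch 0 maps to state 0
  val a1: Archive = a0 + (1L -> 1)   // a1 = a0[1 ↦ 1]: extends a0 with epoch 1
  val a2: Archive = a1 + (2L -> 4)   // a2 = a1[2 ↦ 4]: extends a1 with epoch 2

  // On recovery, the task restarts from the state at the greatest completed epoch.
  def latestState(a: Archive): Int = a(a.keySet.max)
}
```
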
  12. Failure Transparency: Example
      • The observer should be able to reason about the observed execution as if it were an ideal, failure-free execution
      • Intuitively, the observer should find some failure-free execution which “explains” the execution
      • A failure-free execution corresponds to the bottom execution
      • Idea: lift the observed executions, by means of “observability functions”, to a level where failure-related events and states are hidden
      • Example: keep only the snapshot storage (see the sketch below)
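
A minimal sketch of such an observability function, assuming a toy configuration type (all field names are illustrative, not the paper's formal configuration): it keeps only the snapshot storage and discards task states, in-flight messages, and failure-related bookkeeping.

```scala
// Illustrative sketch of an observability function that keeps only the
// snapshot storage; the Config fields are assumptions for illustration.
object Observability {
  final case class Config(
    taskStates: Map[String, Int],            // volatile per-task state
    inFlight: List[String],                  // messages on channels
    snapshots: Map[String, Map[Long, Int]],  // durable snapshot storage
    crashed: Set[String]                     // failure-related bookkeeping
  )

  type Observation = Map[String, Map[Long, Int]]

  // The observability function: project a configuration to its snapshot storage.
  def observe(c: Config): Observation = c.snapshots

  // Lift it pointwise to executions (sequences of configurations).
  def observeExecution(trace: List[Config]): List[Observation] = trace.map(observe)
}
```
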
  13. Observational Explainability
      • Goal: formal definition of failure transparency
      • Our approach: definition based on Observational Explainability
      (Figure: failure-free execution explaining the observed execution)
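
Roughly, the definition has the following shape, written here with plain equality of observations for simplicity; the paper's precise statement uses a more refined relation on observations. O denotes an observability function over executions.

```latex
% Rough shape only; see the ECOOP 2024 paper for the precise definition.
% An observed execution e (possibly containing failures) is explainable
% under O if some failure-free execution e' yields the same observation.
\[
  \mathit{explainable}_O(e) \;\iff\;
  \exists\, e' \in \mathit{FailureFreeExecutions}.\; O(e') = O(e)
\]
```
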
  14. In the paper
      • We prove that the presented implementation model is failure transparent
      • We prove liveness of the implementation model:
        • The implementation model eventually produces outputs for all epochs in its input
      • Discussion of related work on failure transparency, failure transparency proofs, resilient distributed programming models, and failure recovery
      • Most recent work:
        • Extends the results (including the proof) to a failure-transparent actor model
        • Different failure recovery protocol, different proof technique (simulation using prophecy variables)
      Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
      Jonas Spenger, Paris Carbone, Philipp Haller: Semantics of Failure Transparent Actors. GulFest 2025: to appear
  15. Summary (Contributions)
      • The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
      • A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
      • The first definition of failure transparency for stateful dataflow systems
      • A proof that the provided implementation model is failure transparent and guarantees liveness
      • A mechanization of the definitions, theorems, and models in Coq