Slide 1

Failure Transparency in Dataflow Systems
Philipp Haller, KTH Royal Institute of Technology, Stockholm, Sweden
Joint work with Jonas Spenger, Aleksey Veresov, and Paris Carbone
Uppsala, Sweden, June 5, 2025

Slide 2

Dataflow Systems
• Typical application: processing of streams of events/data
• Wide adoption in modern cloud infrastructure
  • Example: Apache Flink powers thousands of streaming jobs at Uber and ByteDance
  • Other widely-used systems: Apache Spark, Google Dataflow, Azure Event Hubs
• Essential: recovery from failures, since failures are to be expected in any long-running streaming job
• Problem: failure recovery is difficult!
  • Failure recovery protocols must balance efficiency and reliability
  • As a result, practical failure recovery protocols are complex
The correctness of failure recovery protocols is crucial for the reliability of stateful dataflow systems!

Slide 3

Dataflow Example
• The source “integers” transfers a stream of integer events E
• The task “incremental average” computes the incremental average of the stream of integers (sketched below)
• The source “reset” transfers a stream of control messages Reset
• Processing a Reset event resets the current average to zero
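To make the example concrete, here is a minimal Scala sketch of the task. The names E, Reset, AvgState, and step are illustrative assumptions for this note, not an API from the paper or from Flink:

```scala
// Minimal sketch of the incremental average task (illustrative names only).
sealed trait Event
final case class E(value: Int) extends Event  // integer event from source "integers"
case object Reset extends Event               // control message from source "reset"

final case class AvgState(sum: Long, count: Long) {
  def average: Double = if (count == 0) 0.0 else sum.toDouble / count
}

object IncrementalAverage {
  // Process one event: update the running average, or reset it to zero.
  def step(s: AvgState, e: Event): (AvgState, Double) = e match {
    case E(v)  => val next = AvgState(s.sum + v, s.count + 1); (next, next.average)
    case Reset => (AvgState(0L, 0L), 0.0)
  }

  def main(args: Array[String]): Unit = {
    val events  = List(E(2), E(4), Reset, E(10))
    val outputs = events.scanLeft((AvgState(0L, 0L), 0.0)) { case ((s, _), e) => step(s, e) }
    outputs.drop(1).foreach { case (_, avg) => println(avg) } // prints 2.0, 3.0, 0.0, 10.0
  }
}
```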

Slide 4

Assumptions
• Message channels are FIFO ordered, a common assumption
• Failures: assumptions common to asynchronous distributed systems [1]
  • Failures are assumed to be crash-recovery failures: a node loses its volatile state upon crashing
  • We assume the existence of an eventually perfect failure detector, which is used to (eventually) trigger recovery
• System components (all found in production dataflow systems; see the interface sketch below):
  • Failure-free coordinator, implemented using a distributed consensus protocol such as Paxos
  • Snapshot storage, assumed to be persistent and durable, e.g., provided by HDFS
  • The input to the dataflow graph is assumed to be logged so that it can be replayed upon failure, using a durable log system such as Kafka

[1] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer, 2011.
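As a rough illustration, the assumed components can be rendered as Scala interfaces. All names and signatures below are placeholders invented for this sketch; they are not APIs of Flink, HDFS, Kafka, or any Paxos library:

```scala
// Illustrative sketch of the assumed system components (placeholder interfaces only).
object SystemModel {
  type ProcessId = Int

  trait SnapshotStorage { // persistent and durable, e.g., provided by HDFS
    def put(epoch: Int, taskState: Array[Byte]): Unit
    def get(epoch: Int): Option[Array[Byte]]
  }

  trait DurableLog { // replayable input log, e.g., provided by Kafka
    def append(event: Array[Byte]): Long // returns the log offset
    def replayFrom(offset: Long): Iterator[Array[Byte]]
  }

  trait FailureDetector { // eventually perfect: eventually suspects exactly the crashed nodes
    def suspects: Set[ProcessId]
  }

  trait Coordinator { // failure-free, e.g., replicated via Paxos
    def triggerRecovery(failed: Set[ProcessId]): Unit
  }
}
```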

Slide 5

Contributions
• The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
• A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
  • The first definition of failure transparency for stateful dataflow systems
• A proof that the provided implementation model is failure transparent and guarantees liveness
• A mechanization of the definitions, theorems, and models in Coq

Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31

Slide 6

Failure Recovery
• Process p2 fails
• Coordinator (not shown) discovers the failure, triggers the recovery step
• All processes recover to the latest completed snapshot
• No failures or incomplete epochs visible to the observer
• Intuitively: side effects from failed epochs are ignored

[Figure: execution with failure vs. observed execution]

Slide 7

Recovery Protocol: Asynchronous Barrier Snapshotting
1. Process up to a barrier
2. Barriers are aligned
3. Upload snapshot & propagate barrier
4. Continue processing
Idea: periodically,
• a “barrier” is input to the dataflow graph
• when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated (see the sketch below)
Epochs (not shown):
• Each task maintains a current epoch
• Barriers increment epochs
• Least common snapshot = set of local snapshots at the greatest common epoch
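The four steps can be illustrated for a single task with the following Scala sketch. The class and parameter names (AlignedTask, uploadSnapshot, propagateBarrier) are hypothetical, and the code is a simplification, not Flink's implementation:

```scala
import scala.collection.mutable

// Sketch of one task executing steps 1-4 of barrier alignment (illustrative only).
final class AlignedTask[S](inputs: Set[Int],
                           var state: S,
                           uploadSnapshot: (Int, S) => Unit, // write (epoch, state) to snapshot storage
                           propagateBarrier: Int => Unit) {  // forward the barrier downstream
  private var epoch = 0
  private val blocked  = mutable.Set.empty[Int]
  private val buffered = mutable.Map.empty[Int, mutable.Queue[S => S]]

  // 1. Process events up to a barrier; events on blocked channels are held back.
  def onEvent(channel: Int, update: S => S): Unit =
    if (blocked(channel)) buffered.getOrElseUpdate(channel, mutable.Queue.empty[S => S]) += update
    else state = update(state)

  def onBarrier(channel: Int): Unit = {
    blocked += channel              // 2. align: stop consuming this channel
    if (blocked == inputs) {        //    barriers received on all input channels
      uploadSnapshot(epoch, state)  // 3. upload the local snapshot for the current epoch...
      propagateBarrier(epoch)       //    ...and propagate the barrier
      epoch += 1                    //    a barrier increments the epoch
      blocked.clear()               // 4. continue processing, starting with held-back events
      val pending = buffered.values.flatten.toList
      buffered.clear()
      pending.foreach(u => state = u(state))
    }
  }
}
```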

Slide 8

Recovery Protocol: Asynchronous Barrier Snapshotting
1. Process up to a barrier
2. Barriers are aligned
3. Upload snapshot & propagate barrier
4. Continue processing
Idea: periodically,
• a “barrier” is input to the dataflow graph
• when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
Failure recovery:
• Triggered by an implicit coordinator (not shown)
• All tasks in the dataflow graph restart from the least common snapshot (see the sketch below)
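Selecting the recovery point follows the definition on the previous slide: the local snapshots at the greatest epoch completed by every task. The helper below is a hypothetical Scala sketch of that selection, not the paper's formal definition:

```scala
// Sketch of selecting the least common snapshot (illustrative helper).
object LeastCommonSnapshot {
  type TaskId = String

  def greatestCommonEpoch(completed: Map[TaskId, Set[Int]]): Option[Int] =
    completed.values.reduceOption(_ intersect _) // epochs completed by all tasks
      .flatMap(_.maxOption)                      // greatest common epoch, if any

  def main(args: Array[String]): Unit = {
    // p2 failed during epoch 2, so its snapshot for epoch 2 was never completed.
    val completed = Map("p1" -> Set(0, 1, 2), "p2" -> Set(0, 1))
    println(greatestCommonEpoch(completed)) // Some(1): all tasks restart from epoch 1
  }
}
```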

Slide 9

An Operational Model of Stateful Dataflow
• Small-step operational semantics
  • Well suited for modeling concurrent systems
• A compact set of evaluation rules:
  • 3 rules describe a failure-free system
  • 2 rules are related to failures
  • 2 rules are auxiliary

Slide 10

Failure-Free Rule 1: Process Input Event

Slide 11

Failure-Free Rule 2: Process Border

Slide 12

Failure-Free Rule 3: Step of Dataflow Graph

Slide 13

Failure-Related Rule 1: F-Fail

Slide 14

Failure-Related Rule 2: F-Recover

Slide 15

Recall: the incremental average task
• The source “integers” transfers a stream of integer events E
• The task “incremental average” computes the incremental average of the stream of integers
• The source “reset” transfers a stream of control messages Reset
• Processing a Reset event resets the current average to zero

Slide 16

Failure Transparency: Example
• Execution of the incremental average task with a failure and subsequent recovery
• Snapshot archives: a0 = [0 ↦ 0], a1 = a0[1 ↦ 1], a2 = a1[2 ↦ 4] (restated in the sketch below)
  • In a0, epoch 0 is mapped to state 0
  • a1 extends a0, mapping epoch 1 to state 1 (etc.)
• Processing borders (BDs) creates a new snapshot and increments the epoch
• Purpose of failure transparency: provide an abstraction of a system that hides the internals of failures and failure recovery
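The slide's archives restate directly as maps from epoch to state; the following Scala 3 sketch shows the same values (takeSnapshot is a hypothetical helper name):

```scala
// The slide's snapshot archives as maps from epoch to task state.
val a0: Map[Int, Int] = Map(0 -> 0)    // epoch 0 ↦ state 0
val a1: Map[Int, Int] = a0 + (1 -> 1)  // a1 extends a0: epoch 1 ↦ state 1
val a2: Map[Int, Int] = a1 + (2 -> 4)  // a2 extends a1: epoch 2 ↦ state 4

// Processing a border (BD) appends the state reached at the next epoch:
def takeSnapshot(archive: Map[Int, Int], nextEpoch: Int, state: Int): Map[Int, Int] =
  archive + (nextEpoch -> state)
```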

Slide 17

Failure Transparency: Example
• The observer should be able to reason about the observed execution as if it were an ideal, failure-free execution
• Intuitively, the observer should find some failure-free execution which “explains” the observed execution
  • A failure-free execution corresponds to the bottom execution
• Idea: lift the observed executions, by means of “observability functions”, to a level where failure-related events and states are hidden
  • Example: keep only the snapshot storage (see the sketch below)
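A loose Scala 3 sketch of such an observability function follows. The Config fields are assumptions made for this illustration, not the paper's formal configurations:

```scala
// Illustrative observability function: keep only the snapshot storage.
final case class Config[S](snapshots: Map[Int, S], // snapshot storage: epoch ↦ state
                           inFlight: List[String], // messages in transit (hidden)
                           crashed: Set[Int])      // failed processes (hidden)

def observe[S](c: Config[S]): Map[Int, S] = c.snapshots

// Lift a trace: apply the observability function pointwise and drop stutter steps,
// so failure and recovery steps that leave the snapshot storage unchanged disappear.
def lift[S](trace: List[Config[S]]): List[Map[Int, S]] = {
  val obs = trace.map(observe)
  obs.headOption.toList ++ obs.zip(obs.drop(1)).collect { case (a, b) if a != b => b }
}
```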

Slide 18

Observational Explainability
• Goal: a formal definition of failure transparency
• Our approach: a definition based on observational explainability

[Figure: failure-free execution]

Slide 19

Failure Transparency, Formally

Slide 20

In the paper
• We prove that the presented implementation model is failure transparent
• We prove liveness of the implementation model
  • The implementation model eventually produces outputs for all epochs in its input
• Discussion of related work on failure transparency, failure transparency proofs, resilient distributed programming models, and failure recovery
• Most recent work:
  • Extends the results (including the proof) to a failure-transparent actor model
  • Different failure recovery protocol, different proof technique (simulation using prophecy variables)

Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
Jonas Spenger, Paris Carbone, Philipp Haller: Semantics of Failure Transparent Actors. GulFest 2025: to appear

Slide 21

Summary (Contributions)
• The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
• A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
  • The first definition of failure transparency for stateful dataflow systems
• A proof that the provided implementation model is failure transparent and guarantees liveness
• A mechanization of the definitions, theorems, and models in Coq