Failure Transparency in Dataflow Systems

Philipp Haller
June 05, 2025


Transcript

  1. Failure Transparency in Dataflow Systems
     Philipp Haller, KTH Royal Institute of Technology, Stockholm, Sweden
     Joint work with Jonas Spenger, Aleksey Veresov, and Paris Carbone
     Uppsala, Sweden, June 5, 2025
  2. Dataflow Systems
     • Typical application: processing of streams of events/data
     • Wide adoption in modern cloud infrastructure
     • Example: Apache Flink is used to power thousands of streaming jobs at Uber and ByteDance
     • Other widely used systems: Apache Spark, Google Dataflow, Azure Event Hubs
     • Essential: recovery from failures, as failures are to be expected in any long-running streaming job
     • Problem: failure recovery is difficult!
       • Failure recovery protocols must balance efficiency and reliability
       • As a result, practical failure recovery protocols are complex
     The correctness of failure recovery protocols is a crucial problem for the reliability of stateful dataflow systems!
  3. Dataflow Example
     • The source “integers” transfers a stream of integer events E<i>
     • The task “incremental average” computes the incremental average of the stream of integers
     • The source “reset” transfers a stream of control messages Reset
     • Processing a Reset event resets the current average to zero
     (A task-level sketch of the incremental average follows below.)
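
To make the example concrete, here is a minimal sketch of the incremental average task as a stateful event handler. The event type, state representation, and names are illustrative assumptions, not the paper's formal model.

```scala
// Illustrative sketch of the incremental average task (not the paper's
// formal model): a stateful handler over integer events and Reset messages.
sealed trait Event
final case class E(i: Int) extends Event   // integer event from source "integers"
case object Reset extends Event            // control message from source "reset"

// Task state: running count and sum; the average is derived from them.
final case class AvgState(count: Long, sum: Long) {
  def average: Double = if (count == 0) 0.0 else sum.toDouble / count
}

object IncrementalAverage {
  val initial: AvgState = AvgState(0, 0)

  // One step of the task: consume an event, produce the next state and
  // the average to emit downstream.
  def onEvent(state: AvgState, event: Event): (AvgState, Double) =
    event match {
      case E(i) =>
        val next = AvgState(state.count + 1, state.sum + i)
        (next, next.average)
      case Reset =>
        (initial, 0.0)   // a Reset event resets the current average to zero
    }
}
```

For instance, feeding E(1), E(2), Reset, E(4) would emit the averages 1.0, 1.5, 0.0, 4.0.
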
  4. Assumptions
     • Message channels are FIFO ordered, a common assumption
     • Failures: assumptions common to asynchronous distributed systems [1]
       • Failures are assumed to be crash-recovery failures: a node loses its volatile state when it crashes
       • We assume the existence of an eventually perfect failure detector, which is used for (eventually) triggering recovery
     • System components (all found in production dataflow systems):
       • Failure-free coordinator implemented using a distributed consensus protocol such as Paxos
       • Snapshot storage is assumed to be persistent and durable, e.g., provided by HDFS
       • The input to the dataflow graph is assumed to be logged such that it can be replayed upon failure, using a durable log system such as Kafka
     [1] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer, 2011
  5. Contributions
     • The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
     • A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
     • The first definition of failure transparency for stateful dataflow systems
     • A proof that the provided implementation model is failure transparent and guarantees liveness
     • A mechanization of the definitions, theorems, and models in Coq
     Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
  6. Failure Recovery
     • Process p2 fails
     • Coordinator (not shown) discovers the failure, triggers a recovery step
     • All processes recover to the latest completed snapshot
     • No failures or incomplete epochs are visible to the observer
     • Intuitively: side effects from failed epochs are ignored
     (Figure: execution with failure vs. observed execution)
  7. Recovery Protocol: Asynchronous Barrier Snapshotting
     1. Process up to a barrier
     2. Barriers are aligned
     3. Upload snapshot & propagate barrier
     4. Continue processing
     Idea: periodically,
     • a “barrier” is input to the dataflow graph
     • when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
     Epochs (not shown):
     • Each task maintains its current epoch
     • Barriers increment epochs
     • Least common snapshot = set of local snapshots at the greatest common epoch
     (A sketch of barrier alignment at a single task follows below.)
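
The alignment and snapshot step at one task could look roughly like the following. This is a hedged sketch under assumed types (channels as strings, callbacks for uploading and propagating), not Flink's actual implementation.

```scala
// Hedged sketch of barrier alignment and snapshotting at one task
// (not Flink's implementation; types and callbacks are assumptions).
final case class Barrier(epoch: Long)

final class AbsTask[S](private var state: S, inputs: Set[String]) {
  private var aligned: Set[String] = Set.empty  // inputs whose barrier has arrived

  // Called when a barrier arrives on one input channel. Between the first
  // and the last barrier of an epoch, that channel is blocked (step 2).
  def onBarrier(channel: String, b: Barrier,
                uploadSnapshot: (Long, S) => Unit,   // persist local snapshot
                propagate: Barrier => Unit): Unit = {
    aligned += channel
    if (aligned == inputs) {              // barriers aligned on all inputs
      uploadSnapshot(b.epoch, state)      // step 3: upload snapshot ...
      propagate(Barrier(b.epoch))         // ... and propagate the barrier downstream
      aligned = Set.empty                 // step 4: continue processing all inputs
    }
  }
}
```
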
  8. Recovery Protocol: Asynchronous Barrier Snapshotting
     1. Process up to a barrier
     2. Barriers are aligned
     3. Upload snapshot & propagate barrier
     4. Continue processing
     Idea: periodically,
     • a “barrier” is input to the dataflow graph
     • when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
     Failure recovery:
     • Triggered by an implicit coordinator (not shown)
     • All tasks in the dataflow graph restart from the least common snapshot
     (A sketch of selecting the least common snapshot follows below.)
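
Selecting the least common snapshot, i.e., the local snapshots at the greatest epoch that every task has completed, could be sketched as follows. The map-based archive representation is an illustrative assumption.

```scala
// Hedged sketch: the least common snapshot is the set of local snapshots
// at the greatest epoch completed by every task. The archive representation
// is an illustrative assumption; assumes at least one task.
object Recovery {
  type TaskId   = String
  type Snapshot = Array[Byte]

  def greatestCommonEpoch(archives: Map[TaskId, Map[Long, Snapshot]]): Long =
    archives.values
      .map(_.keySet)            // epochs with a completed local snapshot, per task
      .reduce(_ intersect _)    // epochs completed by every task
      .max                      // greatest common epoch

  // Recovery: every task restarts from its local snapshot at that epoch.
  def leastCommonSnapshot(archives: Map[TaskId, Map[Long, Snapshot]]): Map[TaskId, Snapshot] = {
    val epoch = greatestCommonEpoch(archives)
    archives.map { case (task, snapshots) => task -> snapshots(epoch) }
  }
}
```
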
  9. An Operational Model of Stateful Dataflow
     • Small-step operational semantics
     • Well suited for modeling concurrent systems
     • A compact set of evaluation rules (schematically sketched below):
       • 3 rules describe a failure-free system
       • 2 rules are related to failures
       • 2 rules are auxiliary
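
One way to picture what “explicit failure rules” means is a labelled step relation in which failure steps are distinguishable, so failure-free executions are exactly those using no failure rule. This is purely schematic; the seven actual rules and configurations are defined in the ECOOP 2024 paper.

```scala
// Schematic only: a labelled step relation with explicit failure rules.
// Rule names and the Config placeholder are illustrative assumptions.
object LabelledSteps {
  sealed trait Rule
  final case class FailureFree(name: String) extends Rule  // 3 such rules
  final case class FailureRel(name: String)  extends Rule  // 2 such rules
  final case class Auxiliary(name: String)   extends Rule  // 2 such rules

  type Config = AnyRef                 // placeholder for a system configuration
  type Step   = (Config, Rule, Config) // one small step, labelled with its rule

  // A failure-free execution is one that never takes a failure-related step.
  def isFailureFree(execution: List[Step]): Boolean =
    execution.forall { case (_, rule, _) => !rule.isInstanceOf[FailureRel] }
}
```
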
  10. Recall: incremental average task
      • The source “integers” transfers a stream of integer events E<i>
      • The task “incremental average” computes the incremental average of the stream of integers
      • The source “reset” transfers a stream of control messages Reset
      • Processing a Reset event resets the current average to zero
  11. Failure Transparency: Example
      • Execution of the incremental average task with a failure and subsequent recovery
      • Snapshot archives: a0 = [0 ↦ 0], a1 = a0[1 ↦ 1], a2 = a1[2 ↦ 4]
        • In a0, epoch 0 is mapped to state 0
        • a1 extends a0, mapping epoch 1 to state 1 (etc.)
      • Processing BDs creates a new snapshot and increments the epoch
      • Purpose of failure transparency: provide an abstraction of a system which hides the internals of failures and failure recovery
      (The archives are sketched in code below.)
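
The archives from the example can be read as maps from epoch to task state. A small sketch, using an Int state as a simplification of the task's actual state:

```scala
// The snapshot archives from the example, read as maps from epoch to
// task state (Int state is an illustrative simplification).
object ArchiveExample {
  type Archive = Map[Long, Int]

  val a0: Archive = Map(0L -> 0)     // a0 = [0 ↦ 0]: epoch 0 maps to state 0
  val a1: Archive = a0 + (1L -> 1)   // a1 = a0[1 ↦ 1]: extends a0 with epoch 1
  val a2: Archive = a1 + (2L -> 4)   // a2 = a1[2 ↦ 4]: extends a1 with epoch 2

  // On recovery, the task restarts from the state at the greatest completed epoch.
  def latestState(a: Archive): Int = a(a.keySet.max)
}
```
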
  12. Failure Transparency: Example
      • The observer should be able to reason about the observed execution as if it were an ideal, failure-free execution
      • Intuitively, the observer should find some failure-free execution which “explains” the execution
      • A failure-free execution corresponds to the bottom execution
      • Idea: lift the observed executions, by means of “observability functions”, to a level where failure-related events and states are hidden
      • Example: keep only the snapshot storage (see the sketch below)
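
A minimal sketch of such an observability function, assuming a toy configuration type (all field names are illustrative, not the paper's formal configuration): it keeps only the snapshot storage and discards task states, in-flight messages, and failure-related bookkeeping.

```scala
// Illustrative sketch of an observability function that keeps only the
// snapshot storage; the Config fields are assumptions for illustration.
object Observability {
  final case class Config(
    taskStates: Map[String, Int],            // volatile per-task state
    inFlight: List[String],                  // messages on channels
    snapshots: Map[String, Map[Long, Int]],  // durable snapshot storage
    crashed: Set[String]                     // failure-related bookkeeping
  )

  type Observation = Map[String, Map[Long, Int]]

  // The observability function: project a configuration to its snapshot storage.
  def observe(c: Config): Observation = c.snapshots

  // Lift it pointwise to executions (sequences of configurations).
  def observeExecution(trace: List[Config]): List[Observation] = trace.map(observe)
}
```
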
  13. Observational Explainability
      • Goal: formal definition of failure transparency
      • Our approach: definition based on Observational Explainability
      (Figure: failure-free execution explaining the observed execution)
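
Roughly, the definition has the following shape, written here with plain equality of observations for simplicity; the paper's precise statement uses a more refined relation on observations. O denotes an observability function over executions.

```latex
% Rough shape only; see the ECOOP 2024 paper for the precise definition.
% An observed execution e (possibly containing failures) is explainable
% under O if some failure-free execution e' yields the same observation.
\[
  \mathit{explainable}_O(e) \;\iff\;
  \exists\, e' \in \mathit{FailureFreeExecutions}.\; O(e') = O(e)
\]
```
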
  14. In the paper
      • We prove that the presented implementation model is failure transparent
      • We prove liveness of the implementation model:
        • The implementation model eventually produces outputs for all epochs in its input
      • Discussion of related work on failure transparency, failure transparency proofs, resilient distributed programming models, and failure recovery
      • Most recent work:
        • Extends the results (including the proof) to a failure-transparent actor model
        • Different failure recovery protocol, different proof technique (simulation using prophecy variables)
      Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
      Jonas Spenger, Paris Carbone, Philipp Haller: Semantics of Failure Transparent Actors. GulFest 2025: to appear
  15. Summary (Contributions)
      • The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
      • A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
      • The first definition of failure transparency for stateful dataflow systems
      • A proof that the provided implementation model is failure transparent and guarantees liveness
      • A mechanization of the definitions, theorems, and models in Coq