Failure Transparency for Stream Processing and Actors

Philipp Haller, KTH Royal Institute of Technology, Stockholm, Sweden
Joint work with Jonas Spenger, Aleksey Veresov, and Paris Carbone
Technical University of Denmark, Copenhagen, Denmark, Nov 7, 2025

Dataflow Systems
• Typical application: processing of streams of events/data
• Wide adoption in modern cloud infrastructure
  • Example: Apache Flink used to power thousands of streaming jobs at Uber and ByteDance
  • Other widely-used systems: Apache Spark, Google Dataflow, Azure Event Hubs
• Essential: recovery from failures, as failures are to be expected in any long-running streaming job running on a large cluster of computers
• Problem: failure recovery is difficult!
  • Failure recovery protocols must balance efficiency and reliability
  • As a result, practical failure recovery protocols are complex
The correctness of failure recovery protocols is a crucial problem for the reliability of stateful dataflow systems!

Dataflow Example
• The “integers” source transfers a stream of integer events E<i>
• The “incremental average” task computes the incremental average of the stream of integers
• The “reset” source transfers a stream of Reset events
• Processing a Reset event resets the current average to zero
• Semantics: as soon as the “incremental average” task receives an event from either the “integers” source or the “reset” source, it will process the corresponding event
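
A minimal sketch of the “incremental average” task described above. The class names E, Reset, and IncrementalAverage are illustrative, and the sketch assumes that a Reset event also resets the count of integers seen; the slides only state that the average is reset to zero.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class E:
    """Integer event from the "integers" source."""
    value: int


@dataclass(frozen=True)
class Reset:
    """Event from the "reset" source."""


@dataclass
class IncrementalAverage:
    """Task state: running average and number of integers seen so far."""
    average: float = 0.0
    count: int = 0

    def process(self, event: Union[E, Reset]) -> float:
        if isinstance(event, Reset):
            # A Reset event sets the current average back to zero.
            # Assumption: the count restarts as well.
            self.average, self.count = 0.0, 0
        else:
            # Incremental update: avg' = avg + (x - avg) / n
            self.count += 1
            self.average += (event.value - self.average) / self.count
        return self.average


task = IncrementalAverage()
# Events are processed in the order in which they arrive at the task.
outputs = [task.process(ev) for ev in [E(2), E(4), Reset(), E(6)]]
```

Here the two sources are merged into one arrival order by hand; in a real dataflow system the interleaving depends on when events reach the task.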

Assumptions
• Message channels are FIFO ordered, a common assumption
• Failures: assumptions common to asynchronous distributed systems [CGR11]
  • Failures are assumed to be crash-recovery failures: a node loses its volatile state when crashing
  • We assume the existence of an eventually perfect failure detector, which is used for (eventually) triggering the recovery
• System components (all found in production dataflow systems):
  • Failure-free coordinator implemented using a distributed consensus protocol such as Paxos
  • Snapshot storage is assumed to be persistent and durable, e.g., provided by HDFS
  • The input to the dataflow graph is assumed to be logged such that it can be replayed upon failure, using a durable log system such as Kafka
[CGR11] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer, 2011

Contributions
• The first small-step operational semantics of the Asynchronous Barrier Snapshotting [CEFHRT17] protocol within a stateful dataflow system, as used in Apache Flink
• A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
• The first definition of failure transparency for stateful dataflow systems
• A proof that the implementation model is failure transparent and guarantees liveness
• A mechanization of the definitions, theorems, and models in Rocq
[CEFHRT17] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, Kostas Tzoumas: State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. Proc. VLDB Endow. 10(12): 1718-1729 (2017)

Failure Recovery
• Process p2 fails
• Coordinator (not shown) discovers failure, triggers recovery step
• All processes recover to the latest completed snapshot (red)
• Subsequently, the next snapshot (blue) is successfully committed
• No failures or non-snapshotted state visible to observer
• Intuitively: side effects from failed epochs are ignored
[Figure: execution with failure; observed execution]
Epoch: “logical time of a snapshot”
Messages not part of some snapshot are not visible

Recovery Protocol: Asynchronous Barrier Snapshotting
1. Process up to a barrier
2. Barriers are aligned
3. Upload snapshot & propagate barrier
4. Continue processing
Idea: periodically,
• a “barrier” is input to the dataflow graph
• when all input streams carry a barrier, a snapshot of the state is created, and the barrier is propagated
Epochs:
• Each task maintains current epoch
• Barriers increment epochs
• Least common snapshot = set of local snapshots at greatest common epoch
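
The four steps above can be sketched for a single task as follows. This is an illustration under stated assumptions, not the formal model: the BARRIER marker, the Task class, and the summing behavior are hypothetical, and the snapshot is "uploaded" to an in-memory dictionary.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker for a barrier event


class Task:
    """A summing task with several FIFO input channels (illustrative only)."""

    def __init__(self, num_inputs):
        self.inputs = [deque() for _ in range(num_inputs)]
        self.state = 0      # volatile state
        self.epoch = 0      # current epoch
        self.archive = {}   # epoch -> snapshotted state
        self.output = []    # single output stream

    def step(self):
        """Perform one step; return False if no step is possible."""
        # 1. Process events on any channel whose head is not a barrier.
        for channel in self.inputs:
            if channel and channel[0] != BARRIER:
                self.state += channel.popleft()
                self.output.append(self.state)
                return True
        # 2. Barriers are aligned once every channel has a barrier at its head.
        if all(channel and channel[0] == BARRIER for channel in self.inputs):
            for channel in self.inputs:
                channel.popleft()
            # 3. Store the snapshot and propagate the barrier downstream.
            self.archive[self.epoch] = self.state
            self.output.append(BARRIER)
            self.epoch += 1
            return True
        # Blocked: waiting for more events or for the remaining barriers.
        return False


task = Task(num_inputs=2)
task.inputs[0].extend([1, BARRIER, 10])
task.inputs[1].extend([2, 3, BARRIER])
while task.step():  # 4. Continue processing after each snapshot.
    pass
```

The event 10 sits behind the barrier on the first channel, so it is only processed after the snapshot for epoch 0 has been taken.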

Recovery Protocol: Asynchronous Barrier Snapshotting (continued)
Failure recovery:
• Triggered by an implicit coordinator (not shown)
• All tasks in dataflow graph restart from least common snapshot
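
A small sketch of how the least common snapshot could be selected, assuming each task's snapshots are available as a dictionary from epoch to state and that at least one task exists. The function name, task names, and state values are hypothetical.

```python
def least_common_snapshot(archives):
    """archives: task name -> {epoch: snapshotted state}.

    Returns (epoch, {task: state}) for the greatest epoch that every task
    has snapshotted, or None if the tasks share no epoch.
    """
    common = set.intersection(*(set(a) for a in archives.values()))
    if not common:
        return None
    epoch = max(common)
    return epoch, {task: a[epoch] for task, a in archives.items()}


archives = {
    "p1": {0: 0, 1: 5, 2: 9},
    "p2": {0: 0, 1: 7},        # p2 failed before completing epoch 2
    "p3": {0: 0, 1: 3, 2: 8},
}
recovered = least_common_snapshot(archives)
```

In this example all tasks restart from their epoch-1 snapshots, and the epoch-2 snapshots of p1 and p3 are ignored.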

An Operational Model of Stateful Dataflow
• Small-step operational semantics
  • Well suited for modeling concurrent systems
• A compact set of evaluation rules:
  • 3 rules describe a failure-free system
  • 2 rules are related to failures
  • 2 rules are auxiliary

Failure-Free Rule 1: Process Input Event
Legend: S = input streams, o = single output stream, e = current epoch, v = current state

Failure-Free Rule 2: Process Border
• Snapshot for epoch e is stored in archive a
• Current epoch is incremented

Failure-Free Rule 3: Step of Dataflow Graph

Failure-Related Rule 1: F-Fail
• Volatile state is lost

Failure-Related Rule 2: F-Recover
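
The formal rules are not reproduced in this transcript. As a rough intuition only, the two failure-related steps can be sketched for a single task as below. The TaskConfig type and the functions fail and recover are hypothetical. The sketch assumes that the archive entry for epoch e holds the state at the end of epoch e, so that a recovered task resumes in epoch e + 1.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class TaskConfig:
    state: Optional[int]                  # volatile state v; None models "failed"
    epoch: int                            # current epoch e
    archive: Tuple[Tuple[int, int], ...]  # persistent (epoch, state) snapshots


def fail(cfg: TaskConfig) -> TaskConfig:
    """Failure step: the volatile state is lost; the archive survives."""
    return replace(cfg, state=None)


def recover(cfg: TaskConfig) -> TaskConfig:
    """Recovery step: restart from the latest snapshot in the archive."""
    epoch, state = max(cfg.archive)  # pair with the greatest epoch
    return replace(cfg, state=state, epoch=epoch + 1)


cfg = TaskConfig(state=7, epoch=2, archive=((0, 0), (1, 4)))
failed = fail(cfg)
recovered = recover(failed)
```

The uncommitted state 7 of epoch 2 is discarded; the task resumes epoch 2 from the snapshotted state 4.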

Recall incremental average task
• The “integers” source transfers a stream of integer events E<i>
• The “incremental average” task computes the incremental average of the stream of integers
• The “reset” source transfers a stream of Reset events
• Processing a Reset event resets the current average to zero
• Semantics: as soon as the “incremental average” task receives an event from either the “integers” source or the “reset” source, it will process the corresponding event

Failure Transparency: Example
• Execution of the incremental average task with a failure and subsequent recovery
• Snapshot archives: a0 = [0 ↦ 0], a1 = a0[1 ↦ 1], a2 = a1[2 ↦ 4]
  • In a0, epoch 0 is mapped to state 0
  • a1 extends a0, mapping epoch 1 to state 1 (etc.)
• Processing BDs creates a new snapshot and increments the epoch
• Purpose of failure transparency: provide an abstraction of a system which hides the internals of failures and failure recovery
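
The archive notation above can be read as a functional map update. A minimal sketch, with dictionaries standing in for archives and a hypothetical helper extend:

```python
def extend(archive, epoch, state):
    """Return a new archive equal to `archive` plus the mapping epoch -> state."""
    return {**archive, epoch: state}


a0 = {0: 0}            # a0 = [0 ↦ 0]
a1 = extend(a0, 1, 1)  # a1 = a0[1 ↦ 1]
a2 = extend(a1, 2, 4)  # a2 = a1[2 ↦ 4]
```

Each update yields a new archive and leaves the earlier ones unchanged, which matches the way a0, a1, and a2 are used as distinct values in the example.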

Failure Transparency: Example
• Observer should be able to reason about the observed execution as if it were an ideal, failure-free execution
• Intuitively, the observer should find some failure-free execution which “explains” the execution
• A failure-free execution corresponds to the bottom execution
• Idea: lift the observed executions by means of “observability functions”, to a level where failure-related events and states are hidden
• Example: keep only the snapshot storage
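
A simplified sketch of the “keep only the snapshot storage” observability function. Executions are modeled as lists of configurations, and an execution with failures counts as explained if its lifted form equals that of a failure-free execution. The intermediate state values are hypothetical, and comparing lifted executions for equality is a simplification of the formal definition.

```python
def observe(execution):
    """Keep only the snapshot storage of each configuration, dropping repeats."""
    observed = []
    for config in execution:
        archive = config["archive"]
        if not observed or observed[-1] != archive:
            observed.append(archive)
    return observed


a0 = {0: 0}
a1 = {0: 0, 1: 1}
a2 = {0: 0, 1: 1, 2: 4}

with_failure = [
    {"state": 0, "archive": a0},
    {"state": 1, "archive": a0},
    {"state": 1, "archive": a1},     # snapshot for epoch 1 committed
    {"state": 3, "archive": a1},     # uncommitted work (hypothetical value)
    {"state": None, "archive": a1},  # failure: volatile state lost
    {"state": 1, "archive": a1},     # recovery from the latest snapshot
    {"state": 4, "archive": a1},
    {"state": 4, "archive": a2},     # snapshot for epoch 2 committed
]

failure_free = [
    {"state": 0, "archive": a0},
    {"state": 1, "archive": a0},
    {"state": 1, "archive": a1},
    {"state": 4, "archive": a1},
    {"state": 4, "archive": a2},
]

explained = observe(with_failure) == observe(failure_free)
```

After lifting, the failed attempt and the recovery step leave no trace: both executions are observed as the archive sequence a0, a1, a2.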

Observational Explainability
• Goal: formal definition of failure transparency
• Our approach: definition based on Observational Explainability
[Figure: execution with failures and failure recovery]

Failure Transparency, Formally

In the paper
• We prove that the presented implementation model is failure transparent
• We prove liveness of the implementation model
  • The implementation model eventually produces outputs for all epochs in its input
• Discussion of related work on failure transparency, failure transparency proofs, resilient distributed programming models, and failure recovery
Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31

Failure-Transparent Actors
• Stateful dataflow streaming enables building scalable and correct services with low effort
• However, restricted in expressiveness: directed, acyclic graphs of tasks
• Actor concurrency model lifts this restriction:
  • Cyclic, dynamic communication graphs
  • Widely used for building reliable services
• Industrial adoption, examples:
  • Telecom services with Erlang at Ericsson [Armstrong96]
  • Messaging services with Erlang at WhatsApp [EST18]
  • Social media with Scala Actors at Twitter/X [Mok09]

Building Reliable Services with Actors
• Tools for building reliable services:
  • Hierarchical actor supervision (Erlang, Akka)
  • Event sourcing, etc.
  • Libraries (Erlang/OTP)
• Handling failures is the user's responsibility (although libraries help)
• Missing: a failure-transparent actor programming model

Failure-Transparent Actors
• Specification:
  • Actors can (i) spawn new actors; (ii) send messages to other actors
  • Actors never fail
  • Actors are failure-transparent
• Implementation:
  • Actors can fail
  • Failure = crash-recovery failure:
    • Actor loses volatile state and restarts
Goal: proof of failure transparency of the implementation with respect to the specification

Syntax and Semantics
• Reducing send or spawn produces an action

Actors as State Machines (1/4)
• State machine of an actor in the implementation model
• States:
  • State 0: Idle
  • State 1: Process a message, collect actions
  • State Fail: Failed, lose all volatile state
  • State 2: Success, perform collected actions
Side-effects only performed with transition S5
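
The idea of collecting actions and performing them only on success can be sketched as follows. This is an illustration, not the prototype's API: the Actor class, the crash flag used to inject a failure, and the counter behavior are all hypothetical.

```python
class Actor:
    """Illustrative actor: side effects are buffered and performed only on success."""

    def __init__(self, behavior, durable_state):
        self.behavior = behavior            # (state, msg) -> (new_state, actions)
        self.durable_state = durable_state  # survives failures
        self.performed = []                 # actions visible to the outside

    def handle(self, msg, crash=False):
        # State 1: process the message on a volatile copy, collecting actions.
        volatile_state, collected = self.behavior(self.durable_state, msg)
        if crash:
            # State Fail: volatile state and collected actions are lost.
            return False
        # State 2: success; commit the state and perform the collected actions.
        self.durable_state = volatile_state
        self.performed.extend(collected)
        return True


def counter(state, msg):
    """Hypothetical behavior: count messages and send an acknowledgement."""
    new_state = state + 1
    return new_state, [("send", "client", f"ack {new_state}")]


actor = Actor(counter, durable_state=0)
actor.handle("a")
actor.handle("b", crash=True)  # failed attempt: no visible effect
actor.handle("b")              # retry after recovery
```

An outside observer sees exactly two acknowledgements, the same as in a run in which the actor never failed.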

Actors as State Machines (2/4)
• Failure-transparency: prove that (b) simulates (a)
• Challenge: how can we distinguish transition sequences (with failures):
  • S2 followed by S3 (ignore / discard S2's side-effects)
  • S2 followed by S5 (execute S2's side-effects)
[Figure: (a) Without failures; (b) With failures]

Actors as State Machines (3/4)
• Introduce “Prophecy variables” [LM22]
• Represent if execution will
  • (.F) Fail/Recover
  • (.S) Succeed
[Figure: (c) With prophecy; (a) Without failures; (b) With failures]
Leslie Lamport, Stephan Merz: Prophecy Made Simple. ACM Trans. Program. Lang. Syst. 44(2): 6:1-6:27 (2022)

Actors as State Machines (4/4)
• Establish a mapping between states and transitions for simulation relation:
  • States 0, 1.F and Fail in (c) are all mapped to state 0 in (a)
  • State 1.S in (c) is mapped to state 1 in (a)
  • State 2 in (c) is mapped to state 2 in (a)
[Figure: (c) With prophecy; (a) Without failures; (b) With failures]
To show: (c) simulates (a) and (b) simulates (c)
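
The state mapping above can be written down directly. The sketch below applies it to one trace of (c) and collapses steps that map to the same state of (a). The trace is a hypothetical example, and it assumes that the actor returns to state 0 both after a failure and after a success.

```python
# Mapping from states of (c), with prophecy, to states of (a), without failures.
ABSTRACTION = {"0": "0", "1.F": "0", "Fail": "0", "1.S": "1", "2": "2"}


def abstract(trace):
    """Map a trace of (c) to a trace of (a), collapsing stuttering steps."""
    result = []
    for state in trace:
        mapped = ABSTRACTION[state]
        if not result or result[-1] != mapped:
            result.append(mapped)
    return result


# A failed attempt (0 -> 1.F -> Fail -> 0) followed by a successful one.
trace_c = ["0", "1.F", "Fail", "0", "1.S", "2", "0"]
trace_a = abstract(trace_c)
```

The failed attempt maps entirely to state 0 of (a), so only the successful attempt remains visible in the abstract trace.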

Failure-Transparent Actors: Summary
• Failure-transparent actors is a proposal to extend failure-transparency from stateful dataflow streaming to the actor model
• We present a proof of failure transparency by showing that the implementation model using prophecy variables simulates the failure-free specification model
• Prototype implementation available at: https://github.com/jspenger/durable-actor
• Paper: Jonas Spenger, Paris Carbone, Philipp Haller: Failure-Transparent Actors. In: Meseguer, J., Varela, C.A., Venkatasubramanian, N. (eds) Concurrent Programming, Open Systems and Formal Methods: Essays Dedicated to Gul Agha to Celebrate His Scientific Career. LNCS, vol. 16120: 81-113. Springer, 2025.

Summary
• The first small-step operational semantics of the Asynchronous Barrier Snapshotting protocol within a stateful dataflow system, as used in Apache Flink
• A novel definition of failure transparency for programming models expressed in small-step operational semantics with explicit failure rules
• The first definition of failure transparency for stateful dataflow systems
• A proof that the provided implementation model is failure transparent and guarantees liveness
• Failure-transparent actors is a proposal to extend failure-transparency from stateful dataflow streaming to the actor model
Jonas Spenger, Paris Carbone, Philipp Haller: Failure-Transparent Actors. In: Meseguer, J., Varela, C.A., Venkatasubramanian, N. (eds) Concurrent Programming, Open Systems and Formal Methods: Essays Dedicated to Gul Agha to Celebrate His Scientific Career. LNCS, vol. 16120: 81-113. Springer, 2025.
Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31

References
[CGR11] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer, 2011
[CEFHRT17] Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, Kostas Tzoumas: State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. Proc. VLDB Endow. 10(12): 1718-1729 (2017)
[VSCH24] Aleksey Veresov, Jonas Spenger, Paris Carbone, Philipp Haller: Failure Transparency in Stateful Dataflow Systems. ECOOP 2024: 42:1-42:31
[LM22] Leslie Lamport, Stephan Merz: Prophecy Made Simple. ACM Trans. Program. Lang. Syst. 44(2): 6:1-6:27 (2022)
[SCH25] Jonas Spenger, Paris Carbone, Philipp Haller: Failure-Transparent Actors. In: Meseguer, J., Varela, C.A., Venkatasubramanian, N. (eds) Concurrent Programming, Open Systems and Formal Methods: Essays Dedicated to Gul Agha to Celebrate His Scientific Career. LNCS, vol. 16120: 81-113. Springer, 2025.
[Armstrong96] Joe Armstrong: Erlang - a Survey of the Language and its Industrial Applications. Proc. INAP. Vol. 96. 1996.
[Mok09] Waiming Mok. "How Twitter is Scaling." https://waimingmok.wordpress.com/2009/06/27/how-twitter-is-scaling/
[EST18] Erlang Solutions Team. "20 Years of Open Source Erlang: OpenErlang Interview with Anton Lavrik from WhatsApp." https://www.erlang-solutions.com/blog/20-years-of-open-source-erlang-openerlang-interview-with-anton-lavrik-from-whatsapp/
