Hydroflow: A Compiler Target for Fast, Correct Distributed Programs

Traditional compilers offer little assistance in ensuring the correctness of distributed programs.

Joe Hellerstein

November 14, 2024

Transcript

  1. Joe Hellerstein, Prof. Alvin Cheung, Prof. Natacha Crooks, Conor Power, Shadaj Laddad, Prof.* Mae Milano, Mingwei Samuel, David Chu, Dr. Tiemo Bang, Lucky Katahanas, Chris Douglas
  2. Sea Changes in Computing: PDP-11, 1970 (minicomputers); Cray-1, 1976 (supercomputers); Macintosh, 1984 (personal computers); iPhone, 2007 (smart phones).
  3. New Platform + New Language = Innovation: PDP-11, 1970 (minicomputers); Cray-1, 1976 (supercomputers); Macintosh, 1984 (personal computers); iPhone, 2007 (smart phones).
  4. The Big Question: Programming the Cloud, a Grand Challenge for Computing. How will folks program the cloud, in a way that fosters unexpected innovation? Distributed programming is hard: parallelism, consistency, partial failure, … Autoscaling makes it harder! Today's compilers don't address distributed concerns.
  5. Declarative Programming for the Cloud. Relational databases were invented to hide how data is laid out and how queries are executed. The cloud was invented to hide how computing resources are laid out and how computations are executed.
  6. Prior Systems Work: the Anna Key-Value Store. A KVS is a Petri dish for distributed systems: algorithmically trivial, with a focus on consistency guarantees and performance, and a wide range of deployment options. [ICDE 18, VLDB 19]
  7. Anna KVS Performance + Consistency: 700x! Hand-written in C++ for a Ph.D. dissertation; implementation correct by assertion. Fast, especially under contention: up to 700x faster than Masstree and Intel TBB on multicore; up to 10x faster than Cassandra in a geo-deployment; 350x the performance of DynamoDB for the same price. Consistency guarantees: pluggable! Causal, read-committed, … via Bloom-style compositions of trivial semi-lattices ("CRDTs").
  8. A New Language Stack: LLVM for the Cloud. Compiler optimizations for elastic distributed computing; compiler guarantees for distributed correctness checks. Goals: fast (low latency and high throughput), flexible (polyglot, easy to adopt, easy to extend), cost-effective (good use of cloud resources). [CIDR 19]
  9. HYDRO Stack: many languages for programmers; a declarative "global" IR; a single-core IR; adaptive deployment. [Stack diagram: cloud services (FaaS, storage, ML frameworks); programmer-facing models (Actors, e.g. Orleans; Functional, e.g. Spark; Logic, e.g. Bloom; Futures, e.g. Ray; new DSLs; sequential code via HYDRAULIC verified lifting) feeding HYDROLOGIC (global), compiled by HYDROLYSIS to HYDROFLOW (local), deployed by HYDRODEPLOY.]
  10. (Team credits, as on slide 1.) Hydroflow: An IR and Compiler.
  11. Body of the Talk: Four Chapters. 1) Hydroflow IR: a shared-nothing dataflow transducer per core; 2) compilation and performance; 3) LatticeFlows for distributed consistency; 4) cross-node optimization.
  12. Hydroflow IR: A Graph Specification Language. A human-programmable IR, in the spirit of LLVM, Halide, etc. Still: a compiler target. Each transducer is to be run on a single core. A dataflow graph specification language (https://hydro.run/docs/hydroflow/):
      my_flow = op1() -> op2();
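A minimal sketch of a complete, runnable transducer, assuming the hydroflow crate's surface syntax as documented at the link above (the pipeline itself is invented for illustration):

    use hydroflow::hydroflow_syntax;

    fn main() {
        // A two-operator pipeline in the spirit of my_flow = op1() -> op2():
        // a source operator feeding a sink operator via the -> edge syntax.
        let mut flow = hydroflow_syntax! {
            my_flow = source_iter(1..=3) -> for_each(|x| println!("got {x}"));
        };
        flow.run_available(); // run until no more input is immediately available
    }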
  13. Simple KVS Dataflow. [Diagram: source -> demux -> Put/Get -> join (with hash table) -> dest.]
      // Demux network inputs
      network_recv = source_stream_serde(inbound)
          -> demux_enum::<KvsMessageWithAddr>();
  14. Simple KVS Dataflow (build continues).
      // Demux network inputs
      network_recv = source_stream_serde(inbound)
          -> demux_enum::<KvsMessageWithAddr>();
      puts = network_recv[Put];
      gets = network_recv[Get];
      puts -> [0]lookup;
      gets -> [1]lookup;
  15. Simple KVS Dataflow (build continues; earlier lines as above).
      // Join PUTs and GETs by key, persisting the PUTs.
      lookup = join();
  16. Simple KVS Dataflow (build completed; earlier lines as above, now with persistence lifetimes on the join).
      // Join PUTs and GETs by key, persisting the PUTs.
      lookup = join::<'static, 'tick>();
      // Send GET responses back to the client.
      lookup -> dest_sink_serde(outbound);
  17. The complete flow, with the glue maps filled in:
      // Demux network inputs
      network_recv = source_stream_serde(inbound)
          -> _upcast(Some(Delta))
          -> map(Result::unwrap)
          -> map(|(msg, addr)| KvsMessageWithAddr::from_message(msg, addr))
          -> demux_enum::<KvsMessageWithAddr>();
      puts = network_recv[Put];
      gets = network_recv[Get];
      // Join PUTs and GETs by key, persisting the PUTs.
      puts -> map(|(key, value, _addr)| (key, value)) -> [0]lookup;
      gets -> [1]lookup;
      lookup = join::<'static, 'tick>();
      // Send GET responses back to the client address.
      lookup -> map(|(key, (value, client_addr))| (KvsResponse { key, value }, client_addr))
          -> dest_sink_serde(outbound);
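For context, a sketch of how such a flow is typically wired to a UDP socket, in the style of the hydroflow examples; bind_udp_bytes and ipv4_resolve are utility names assumed from the crate's util module, so treat the exact signatures as approximate:

    use hydroflow::hydroflow_syntax;
    use hydroflow::util::{bind_udp_bytes, ipv4_resolve};

    #[hydroflow::main]
    async fn main() {
        // Bind a UDP socket; this yields the (outbound, inbound) sink/stream
        // pair that dest_sink_serde / source_stream_serde consume above.
        let addr = ipv4_resolve("localhost:3000").unwrap();
        let (_outbound, inbound, _addr) = bind_udp_bytes(addr).await;
        // (_outbound would feed dest_sink_serde in the full KVS.)
        let mut flow = hydroflow_syntax! {
            // Placeholder body: the KVS pipeline from the slide goes here.
            source_stream_serde(inbound)
                -> map(Result::unwrap)
                -> for_each(|(msg, addr): (String, _)| println!("{addr}: {msg}"));
        };
        flow.run_async().await;
    }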
  18. (The same code as slide 17, now shown with its compiled dataflow graph: source_stream_serde -> _upcast -> map -> map -> demux_enum, with the Put branch mapped into join input 0, the Get branch into join input 1, and the join output mapped into dest_sink_serde.)
  19. Remaining Code / Entire Dataflow. [Figure: the entire compiled dataflow graph of the KVS, including server response, peer join, and gossip.]
      // Join as a peer if peer_server is set.
      source_iter_delta(peer_server)
          -> map(|peer_addr| (KvsMessage::PeerJoin, peer_addr))
          -> network_send;
      // Peers: When a new peer joins, send them all data.
      writes_store -> [0]peer_join;
      peers -> [1]peer_join;
      peer_join = cross_join()
          -> map(|((key, value), peer_addr)| (KvsMessage::PeerGossip { key, value }, peer_addr))
          -> network_send;
      // Outbound gossip. Send updates to peers.
      peers -> peer_store;
      source_iter_delta(peer_server) -> peer_store;
      peer_store = union() -> persist();
      writes -> [0]outbound_gossip;
      peer_store -> [1]outbound_gossip;
      outbound_gossip = cross_join()
          // Don't send gossip back to same sender.
          -> filter(|((_key, _value, writer_addr), peer_addr)| writer_addr != peer_addr)
          -> map(|((key, value, _writer_addr), peer_addr)| (KvsMessage::PeerGossip { key, value }, peer_addr))
          -> network_send;
  20. More on the IR: integration with Rust via staged programming (Hydroflow+); cyclic graphs & recursion as in Datalog (see the reachability sketch below); stratification of blocking operators (negation/∀, fold, etc.); a single clock per transducer, as in Dedalus. See the Hydro Book: https://hydro.run/docs/hydroflow/
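A sketch of the cyclic style this enables, adapted from the graph-reachability example in the Hydro Book (edge data invented; operator spellings assumed from the book):

    use hydroflow::hydroflow_syntax;

    fn main() {
        let mut flow = hydroflow_syntax! {
            // Transitive reachability from vertex 0: the join's output feeds
            // back into the union, a cycle run to fixpoint within a tick.
            origin = source_iter(vec![0]);
            edges = source_iter(vec![(0, 1), (1, 2), (2, 4)]);
            reached = union();
            origin -> [0]reached;
            reached -> map(|v| (v, ())) -> [0]my_join;
            edges -> [1]my_join;
            my_join = join()
                -> flat_map(|(src, ((), dst))| [src, dst])
                -> tee();
            my_join[0] -> [1]reached;
            my_join[1] -> unique() -> for_each(|v| println!("reached: {v}"));
        };
        flow.run_available();
    }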
  21. In-Out Tree Partitioning / Inlining. Hydroflow is a Rust crate that generates Rust code from flow graphs. Partition a flow graph into "in-out trees"; this leads rustc to perform aggressive inlining. [Figure: the graph-reachability dataflow, partitioned into pull and push halves.]
  22. In-Out Tree Partitioning / Inlining. [Figure: the same flow, split at the pivot.] One inlined function per in-out tree!
  23. In-Out Tree Partitioning / Inlining. [Figure: in-out trees connected through a handoff at the pivot.]
  24. Reminder: Anna KVS Performance + Consistency, 700x! Fast, especially under contention: up to 700x faster than Masstree and Intel TBB on multicore; up to 10x faster than Cassandra in a geo-deployment; 350x the performance of DynamoDB for the same price. Consistency guarantees: causal, read-committed, … via Bloom-style compositions of trivial semi-lattices ("CRDTs").
  25. Fast? ✅ Original Anna KVS: C++, 2018, Amazon m4.16xlarge instances (64 vCPU, 256 GB RAM). Anna KVS in Hydro: 2023, GCP n2-standard-64 instances (64 vCPU, 256 GB RAM).
  26. Challenge: Replica Consistency. Ensure that distant agents agree (or will agree) on common knowledge. Classic example: data replication. How do we know if they agree on the value of a mutable variable x? x = ❤
  27. Challenge: Replica Consistency. Ensure that distant agents agree (or will agree) on common knowledge. Classic example: data replication. How do we know if they agree on the value of a mutable variable x? If they disagree now (x = ❤ vs. x = 💩), what could happen later? Split-brain divergence! We want to generalize to program outcomes!
  28. Classical Solution: Coordination. A global total order of operations via atomics, critical sections, distributed protocols like Paxos and 2-phase commit, etc. Expensive at every scale. When can we avoid it?
  29. Big Queries: When? Why? When do I need coordination? Why? No really: why? When is coordination required?
  30. CALM: Consistency As Logical Monotonicity. Theorem (CALM): A distributed program has a consistent, coordination-free distributed implementation if and only if it is monotonic. References: Hellerstein JM. The Declarative Imperative: Experiences and Conjectures in Distributed Logic. ACM PODS keynote, June 2010; ACM SIGMOD Record, Sep 2010. Ameloot TJ, Neven F, Van den Bussche J. Relational Transducers for Declarative Networking. JACM, Apr 2013. Ameloot TJ, Ketsman B, Neven F, Zinn D. Weaker Forms of Monotonicity for Declarative Networking: A More Fine-Grained Answer to the CALM-Conjecture. ACM TODS, Feb 2016. Hellerstein JM, Alvaro P. Keeping CALM: When Distributed Consistency Is Easy. CACM, Sept 2020.
  31. Semi-Lattices: CALM Algebra. A semi-lattice is ⟨S, +⟩ with + associative (x + (y + z) = (x + y) + z), commutative (x + y = y + x), and idempotent (x + x = x). Every semi-lattice corresponds to a partial order: x ≤ y ⇔ x + y = y. Partial orders are compatible with many total orders! CALM connection: monotonicity in the lattice's partial order.
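A self-contained sketch of the algebra in plain Rust: set union as the lattice's +, with the induced partial order defined exactly as above:

    use std::collections::HashSet;

    // Semi-lattice merge for sets: union is associative, commutative,
    // and idempotent by construction.
    fn merge(x: &HashSet<u32>, y: &HashSet<u32>) -> HashSet<u32> {
        x.union(y).copied().collect()
    }

    // Induced partial order: x <= y iff x + y = y.
    fn leq(x: &HashSet<u32>, y: &HashSet<u32>) -> bool {
        &merge(x, y) == y
    }

    fn main() {
        let a: HashSet<u32> = [1, 2].into();
        let b: HashSet<u32> = [2, 3].into();
        let ab = merge(&a, &b); // {1, 2, 3}
        // a and b are incomparable, but both are <= their merge:
        assert!(!leq(&a, &b) && !leq(&b, &a));
        assert!(leq(&a, &ab) && leq(&b, &ab));
        assert_eq!(merge(&a, &a), a); // idempotence: x + x = x
    }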
  32. When is it consistent to read without coordination? … when we can apply monotone functions to get to the top of smaller lattices.
  33. Hydroflow's Approach. Coordination protocols in the standard library (a Paxos implementation, described below, is WIP). Object types: the lattices crate, with composable lattice types (example: causal consistency); clean and amenable to synthesis! A sketch follows. Monotone & non-monotone functions: maps, folds, etc. Dataflow types & properties. [OOPSLA 22]
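A sketch against the lattices crate; the type and method names here (SetUnionHashSet::new_from, Max::new, Merge::merge, into_reveal) are recalled from its docs rather than verified, so treat them as assumptions:

    use lattices::set_union::SetUnionHashSet;
    use lattices::{Max, Merge};

    fn main() {
        // Composable lattice objects: merge() is the only mutator, and it
        // reports whether the value actually grew. (Names assumed, see above.)
        let mut seen = SetUnionHashSet::new_from(["put:a"]);
        seen.merge(SetUnionHashSet::new_from(["put:b"])); // set-union merge
        let mut high_water = Max::new(3u64);
        high_water.merge(Max::new(7)); // max merge
        assert_eq!(high_water.into_reveal(), 7);
    }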
  34. Dataflow Types & Properties (WIP). [Flow: source_stream_serde -> _upcast -> map -> fold -> demux_enum, with a type and property set on each edge.] Type (Seq<T>, 'a), Props {RandomBatch, Dups}: a "happenstance" total order 'a. Type Seq<Lattice<L>>, Props {RandomBatch, Dups}: a latticeflow of points from L, to be monotonically accumulated. After the fold, Type Lattice<L>, Props {NonDeterministic}: a cumulative lattice point of L. A RandomBatch input to a non-monotonic op produces NonDeterministic output!
  35. Dataflow Types & Properties (WIP). [Flow: source_iter -> _upcast -> map -> fold -> demux_enum.] With a deterministic source, the properties are empty throughout: Type (Seq<T>, 'a), Props: None ("happenstance" total order 'a); Type Seq<Lattice<L>>, Props: None (a latticeflow of points from L, to be monotonically accumulated); and the fold's output Type T', Props: None.
  36. Dataflow Types & Properties (WIP). [Flow: source_stream_serde -> _upcast -> map -> lattice_merge -> demux_enum.] The inputs carry Props {RandomBatch, Dups} as in slide 34, but the monotone lattice_merge yields Type Lattice<L>, Props: None: a cumulative lattice point of L, with no nondeterminism despite the random batching.
  37. Eventually consistent, at a glance! [Figure: the entire KVS dataflow graph.]
  38. HYDRO Stack, revisited, with Dedalus as its foundation. [Stack diagram, as on slide 9.]
  39. Hydrologic can be analyzed for many properties! Monotonicity analysis from the type system; functional dependencies from DB theory to co-partition state; data provenance helps us understand how results get generated. No surprise that a data-centric approach helps! Theme: in distributed systems, the hard part is the data (state + messages). [Dedalus]
  40. Challenges in Optimizing Protocols. The compiler cannot understand high-level protocol semantics; many global invariants are implicit in distributed programs. How can we prove that our optimizations are always correct?
  41. Keep It Simple, Preserve Equivalence! Two forms of "compartmentalization". The really hard part: correctly applying transformations automatically! (Hand-written in Scala, correct by assertion.)
  42. Fast? ✅ Beats SOTA Paxos implementations. [Chart comparing: rule-based optimization (Hydroflow); Whittaker's Compartmentalized Paxos (Hydroflow); Whittaker's Compartmentalized Paxos (Scala).]
  43. Halfway there! The rewrite rules are proven correct and provide the desired wins. Still needed: a cost model for an objective function, and search techniques to find optimal rewritings. E-graphs meet query optimizers: very similar technologies!
  44. What should Hydrologic contain? Hydroflow+ is a functional variant of Hydroflow. Maybe Hydrologic starts as single-transducer Hydroflow+? With specs for SLOs: cost, latency, uptime. Other "facets"? Consistency? Security? [Stack diagram, as on slide 9.]
  45. Deployment, Autoscaling, Fault Tolerance. Heterogeneous cloud assignment challenges; dynamics of autoscaling; sensing & monitoring; live autoscaling (deployment + recompilation); declarative fault tolerance: what does a spec look like? Many mechanisms to choose from! [Stack diagram, as on slide 9.]
  46. Handling Multiple Input Languages/Patterns. In general: LLMs arrived just in time! Still need verification. Hypotheses on specific examples: shallow syntactic compilation could work for actors and distributed futures? Early results on sequential code: Katara. [Stack diagram, as on slide 9.] [OOPSLA 22]
  47. Language/Theory Work, 2010-15. Formalism: Dedalus. CALM Theorem: coordination in its place; Consistency ⇔ Monotonicity. Logic + lattices in a functional notation.
  48. FAQ #2: Isn't monotonicity a rare corner case? Actually, most code is mostly monotonic. Even coordination protocols are mostly monotonic! (Michael Whittaker's work on Compartmentalized Paxos, VLDB 2021.) Next challenge: automatically move non-monotonic (sequential/coordinated) code off the critical path and autoscale everything else.
  49. Easy and Hard Questions. Is anyone over 18? ∃x (x > 18). Who is the youngest? ∃x ∀y (x ≤ y).
  50. Easy and Hard Questions. Is anyone over 18? ∃x (x > 18). Who is the person nobody is younger than? ∃x ¬∃y (y < x).
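To see why one question is easy and the other hard when data arrives over time, a small sketch in plain Rust (ages invented):

    fn main() {
        let arriving_ages = [25u32, 16, 19, 12];
        let mut anyone_over_18 = false;       // Boolean or-lattice: only grows
        let mut youngest: Option<u32> = None; // min so far: may be revised

        for age in arriving_ages {
            anyone_over_18 |= age > 18;
            // Monotone: once true, no future arrival can make it false, so
            // each site can announce "yes" the moment it sees one witness.
            youngest = Some(youngest.map_or(age, |y| y.min(age)));
            // Non-monotone: "youngest so far" can change with every arrival;
            // announcing a final answer early requires knowing that no smaller
            // age is still in flight anywhere, i.e. coordination.
        }
        println!("over 18: {anyone_over_18}, youngest: {youngest:?}");
    }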
  51. What is Time for? "Time is what keeps everything from happening at once." (Ray Cummings, The Girl in the Golden Atom, 1922)
  52. Coordination Avoidance (a poem). "the first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them" (James Hamilton (IBM, MS, Amazon), quoted in Birman & Chockler, "Toward a Cloud Computing Research Agenda", LADIS 2009)
  53. Queries are easy to analyze for distributed properties! (Non-)monotonicity is clear in the syntax: NOT EXISTS, EXCEPT, NOT IN, …
  54. Queries are easy to analyze for distributed properties! (Non-)monotonicity is clear in the syntax (NOT EXISTS, EXCEPT, NOT IN, …). Explicit data relationships help us understand how to co-partition state: keys and foreign keys, and more generally, dependency theory.
      Messages: UserID | MessageID               | ChannelID
                27     | "What's for dinner"     | 101
                11     | "Burritos please!"      | 101
                14     | "Monotonicity is cool!" | 102
                27     | "Excelente!"            | 101
      Channels: CID | OwnerID
                101 | 11
                102 | 14
      with the dependency ChannelID → CID.
  55. (Build slide: repeats the content of slide 54.)
  56. (Build slide: the same tables, highlighting the join columns ChannelID and CID and the dependency ChannelID → CID. A co-partitioning sketch follows.)
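A sketch of what the dependency buys: hash both tables on the shared channel key and the messages-channels join becomes node-local (the shard function and node count are invented):

    // Route rows by the shared channel key; rows that join always land together.
    fn shard(channel_id: u64, n_nodes: u64) -> u64 {
        channel_id % n_nodes
    }

    fn main() {
        let n = 4;
        let messages = [(27u64, "What's for dinner", 101u64), (14, "Monotonicity is cool!", 102)];
        let channels = [(101u64, 11u64), (102, 14)];
        for &(user, text, chan) in &messages {
            println!("msg ({user}, {text:?}) -> node {}", shard(chan, n));
        }
        for &(cid, owner) in &channels {
            // Because ChannelID -> CID, each channel row goes to the same node
            // as its messages: the join needs no cross-node traffic.
            println!("chan ({cid}, owner {owner}) -> node {}", shard(cid, n));
        }
    }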
  57. Queries are easy to analyze for distributed properties! (Non-)monotonicity is clear in the syntax (NOT EXISTS, EXCEPT, NOT IN, …). Explicit data relationships help us understand how to co-partition state. Data provenance helps us understand how results get generated, and why they're generated … or even why not!
  58. Two Canonical Examples. Distributed deadlock: once you observe the existence of a waits-for cycle, you can (autonomously) declare deadlock; more information will not change the result. Deadlock! Garbage collection: suspecting garbage (the non-existence of a path from the root) is not enough; more information may change the result. Garbage? Hence you are required to check all nodes for information (under any assignment of objects to nodes!).
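A sketch of the deadlock half in plain Rust (waits-for edges invented): cycle existence is monotone in the edge set, so a positive answer is final even on a partial view:

    use std::collections::HashMap;

    // Does the waits-for graph contain a cycle? Adding edges can only turn
    // "no" into "yes", never the reverse, so "yes" can be declared
    // autonomously, without waiting for more information.
    fn has_cycle(edges: &[(u32, u32)]) -> bool {
        let mut adj: HashMap<u32, Vec<u32>> = HashMap::new();
        for &(a, b) in edges {
            adj.entry(a).or_default().push(b);
        }
        // state: 1 = on the current DFS path, 2 = fully explored
        fn dfs(v: u32, adj: &HashMap<u32, Vec<u32>>, state: &mut HashMap<u32, u8>) -> bool {
            match state.get(&v).copied() {
                Some(1) => return true, // back edge onto the current path
                Some(2) => return false,
                _ => {}
            }
            state.insert(v, 1);
            for &w in adj.get(&v).into_iter().flatten() {
                if dfs(w, adj, state) {
                    return true;
                }
            }
            state.insert(v, 2);
            false
        }
        let mut state = HashMap::new();
        let verts: Vec<u32> = adj.keys().copied().collect();
        verts.into_iter().any(|v| dfs(v, &adj, &mut state))
    }

    fn main() {
        let mut waits_for = vec![(1, 2), (2, 3)];
        assert!(!has_cycle(&waits_for));
        waits_for.push((3, 1)); // more edges can only create cycles
        assert!(has_cycle(&waits_for)); // deadlock: safe to declare, forever
    }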
  59. Extensions to Other Lattices. Set (merge = union): {a}, {b}, {c} merge up through {a,b}, {b,c}, {a,c} to {a,b,c}. Increasing int (merge = max): 5 + 7 = 7, 3 + 7 = 7. Boolean (merge = or): false + true = true. We can use monotone functions to map to other lattices: count(S), then x > 6. E.g. SELECT key FROM input HAVING COUNT(*) > 10.
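The same chain as a sketch in plain Rust, mirroring count(S) and x > 6 above (threshold and data invented):

    use std::collections::HashSet;

    fn main() {
        let mut s: HashSet<u32> = HashSet::new();
        for x in 1..=8 {
            s.insert(x); // the set only grows under union
            let count = s.len();      // SetUnion -> Max: monotone
            let over_six = count > 6; // Max -> Bool(Or): monotone
            println!("|S| = {count}, count(S) > 6: {over_six}");
            // Once over_six flips to true it can never flip back, so the
            // HAVING-style predicate can be emitted without coordination.
        }
    }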
  60. Generational Shift to Reasoning at the App Level. Tired (20th century): reasoning about memory access: read/write, access/store, linearizability, serializability, … worst-case assumptions. Wired (21st century): reasoning about app semantics: immutable state, monotonicity analysis, functional dependencies, data provenance, … app-specific assumptions.
  61. CRDTs: OOP. Semi-lattices as objects: CRDTs, where + is the only method. "Mathematically sound rules to guarantee state convergence" (Shapiro, et al.): guarantees eventual consistency of state in the end times. But what do folks do with them in the mean time? "Multiple active copies present accurate views of the shared datasets at low latencies" (TechBeacon blog, 2022). Hmmm…. [VLDB 23]
  62. CRDTs: Oops! Amazon shopping carts, a.k.a. the 2-phase set: a composite semilattice (SetUnion, SetUnion) of Adds and Deletes.
  63. CRDTs: Oops! Amazon shopping carts, a.k.a. the 2-phase set: a composite semilattice (SetUnion, SetUnion) of Adds and Deletes. What everybody wants to do: "read" the contents, i.e. compute Adds − Deletes. Not part of the object API; non-monotonic, and hence inconsistent, non-deterministic, etc.
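A self-contained sketch of the 2-phase set and its problematic read, in plain Rust rather than any particular CRDT library:

    use std::collections::HashSet;

    // Two grow-only sets composed pairwise: the composite is still a lattice.
    #[derive(Clone, Debug, Default)]
    struct TwoPhaseSet {
        adds: HashSet<String>,
        dels: HashSet<String>,
    }

    impl TwoPhaseSet {
        // Pointwise union: associative, commutative, idempotent.
        fn merge(&mut self, other: &Self) {
            self.adds.extend(other.adds.iter().cloned());
            self.dels.extend(other.dels.iter().cloned());
        }
        // The read everybody wants, Adds − Deletes, is NOT monotone:
        // a delete arriving later shrinks the result, so replicas reading
        // "in the mean time" can observe carts that later vanish.
        fn read(&self) -> HashSet<String> {
            self.adds.difference(&self.dels).cloned().collect()
        }
    }

    fn main() {
        let mut cart = TwoPhaseSet::default();
        cart.adds.insert("book".into());
        let mut remote = TwoPhaseSet::default();
        remote.dels.insert("book".into()); // a concurrent delete elsewhere
        assert_eq!(cart.read().len(), 1);  // before merging: "book" is visible
        cart.merge(&remote);
        assert_eq!(cart.read().len(), 0);  // after merging: it was never safe
    }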