Adaptive Dataflow: A Database/Networking Convergence

Given at the Stanford database seminar 12/7/2001, this talk makes the case that database and networking research are converging, and gives examples from work in the Telegraph group at Berkeley.

Joe Hellerstein

December 07, 2001

Transcript

  1. Road Map
     - How I got started on this
       - The CONTROL project
       - Eddies
     - Tie-ins to networking research
     - Telegraph & ongoing adaptive dataflow research
     - New arenas
       - Sensor networks
       - P2P networks
  2. Background: The CONTROL Project
     - Online/interactive query processing
       - Online aggregation
       - Scalable spreadsheets & refining visualizations
       - Online data cleaning (Potter’s Wheel)
     - Pipelining operators (ripple joins, online reordering) over streaming samples
  3. Goals for Online Processing
     - Performance metric: J
       - Statistical (e.g. confidence intervals; see the sketch below)
       - User-driven (e.g. weighted by widgets)
     - New “greedy” performance regime
       - Maximize the 1st derivative of the “mirth index”
       - Mirth is defined on-the-fly
       - Therefore we need FEEDBACK and CONTROL
     [Figure: J vs. time; the online approach nears 100% long before the traditional approach does]
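    A minimal Python sketch of online aggregation with a statistical metric (illustrative
    only; the estimator and reporting interface are assumptions, not the CONTROL code).
    A running AVG over a randomly ordered stream yields an estimate with a shrinking
    CLT-style confidence interval, so a useful answer arrives long before the scan ends:

        import math, random

        def online_avg(stream, z=1.96, report_every=1000):
            """Running AVG with a CLT-style confidence interval, maintained
            online via Welford's mean/variance updates."""
            n = 0
            mean = 0.0
            m2 = 0.0
            for x in stream:
                n += 1
                delta = x - mean
                mean += delta / n
                m2 += delta * (x - mean)
                if n % report_every == 0:
                    stderr = math.sqrt((m2 / (n - 1)) / n)
                    yield mean, z * stderr      # estimate +/- half-width

        stream = (random.gauss(50, 10) for _ in range(10000))
        for est, half in online_avg(stream):
            print(f"AVG ~= {est:.2f} +/- {half:.2f}")   # interval shrinks over time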
  4. CONTROL ⇒ Volatility
     - Goals and data may change over time
       - User feedback, sample variance
     - Goals and data may be different in different “regions”
       - Group-by, scrollbar position
       - [An aside: dependencies in selectivity estimation]
     - Q: Query optimization in this world?
       - Or in any pipelining, volatile environment??
       - Where else do we see volatility?
  5. Continuous Adaptivity: Eddies
     - A little more state per tuple
       - Ready/done bits (extensible a la Volcano/Starburst)
     - Query processing = dataflow routing!!
       - We’ll come back to this!
     [Figure: an eddy routing tuples among pipelined operators]
  6. Eddies: Two Key Observations
     - Break the set-oriented boundary
       - Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
       - Usual DB implementation: pipelining operators!
         - Subexpressions never materialized
       - The typical implementation is more flexible than the algebra
         - We can reorder in-flight operators
         - Other gains possible by breaking the set-oriented boundary…
     - Don’t rewrite the graph. Impose a router (see the sketch below)
       - Graph edge = absence of a routing constraint
       - Observe operator consumption/production rates
         - Consumption: cost
         - Production: cost × selectivity
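    A minimal Python sketch of the eddy idea from slides 5-6 (the interfaces and
    routing policy here are simplified assumptions, not the Telegraph implementation):
    each tuple carries ready/done bits, and the router favors the eligible operator
    with the lowest observed production/consumption ratio, i.e. selectivity:

        from collections import namedtuple

        Tup = namedtuple("Tup", ["data", "ready", "done"])   # per-tuple routing state

        class Eddy:
            def __init__(self, ops):
                self.ops = ops                    # one predicate per operator
                self.consumed = [1] * len(ops)    # tuples routed to each op
                self.produced = [1] * len(ops)    # tuples each op passed back

            def route(self, t):
                """Pick an eligible op; prefer low observed selectivity."""
                eligible = [i for i in range(len(self.ops))
                            if t.ready[i] and not t.done[i]]
                return min(eligible,
                           key=lambda i: self.produced[i] / self.consumed[i])

            def run(self, source):
                k = len(self.ops)
                for data in source:
                    t = Tup(data, [True] * k, [False] * k)
                    while not all(t.done):
                        i = self.route(t)
                        self.consumed[i] += 1
                        if not self.ops[i](t.data):
                            break                 # dropped; never materialized
                        t.done[i] = True
                        self.produced[i] += 1
                    else:
                        yield t.data              # passed every operator

        # The eddy learns to try the more selective predicate first.
        for out in Eddy([lambda x: x % 2 == 0, lambda x: x < 10]).run(range(100)):
            print(out)                            # -> 0 2 4 6 8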
  7. Road Map
     - How I got started on this
       - The CONTROL project
       - Eddies
     - Tie-ins to networking research
     - Telegraph & ongoing adaptive dataflow research
     - New arenas
       - Sensor networks
       - P2P networks
  8. Coincidence: Eddie Comes to Berkeley
     - CLICK: a network router is a query plan!
       - “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ’99
  9. [Figure: “Figure 3: Example Router Graph”, from the Click paper]
     - Also Scout: paths are the key to a communication-centric OS
       - “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ’96
  10. More Interaction: CS262 Experiment w/ Eric Brewer
      - Merged the OS & DBMS grad class, over a year
        - Eric/Joe, point/counterpoint
      - Some tie-ins were obvious:
        - Memory mgmt, storage, scheduling, concurrency
      - Surprising: QP and networks go well side by side
        - E.g. eddies and TCP congestion control
        - Both use back-pressure and simple control theory to “learn” in an unpredictable dataflow environment
        - Eddies are close to the n-armed bandit problem (see the sketch below)
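    For the bandit connection, a textbook epsilon-greedy n-armed bandit in Python
    (an analogy, not the eddy policy itself): each “arm” stands in for a routing
    choice, each “reward” for how cheaply a tuple was disposed of:

        import random

        def epsilon_greedy(n_arms, pull, rounds=10000, eps=0.1):
            counts = [0] * n_arms
            values = [0.0] * n_arms                 # running mean reward per arm
            for _ in range(rounds):
                if random.random() < eps:
                    arm = random.randrange(n_arms)  # explore
                else:
                    arm = max(range(n_arms), key=values.__getitem__)  # exploit
                r = pull(arm)
                counts[arm] += 1
                values[arm] += (r - values[arm]) / counts[arm]
            return values

        # Arm 1 stands in for the better operator ordering.
        print(epsilon_greedy(2, lambda a: random.gauss(0.7 if a == 1 else 0.3, 0.1)))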
  11. Networking Overview for DB People Like Me
      - Core function of protocols: data transfer
        - Data manipulation (buffering, checksums, encryption, transfer to/from app space, presentation)
        - Transfer control (flow/congestion control, detecting transmission problems, acks, muxing, timestamps, framing)
        -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ’90
      - Basic Internet assumption:
        “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
  12. C & T’s Wacky Ideas
      - Thesis: nets are good at transfer control, not so good at data manipulation
      - Some C&T wacky ideas for better data manipulation:
        - Transfer semantic units, not packets (ALF)
        - Auto-rewrite layers to flatten them (ILP)
        - Minimize cross-layer ordering constraints
        - Control delivery in parallel via packet content
      [Slide callouts map these to DB ideas: Exchange! Data Modeling! Query Opt!]
  13. Wacky New Ideas in QP
      - What if…
        - We had unbounded data producers and consumers? (“streams” … “continuous queries”)
        - We couldn’t know our producers’ behavior or contents? (“federation” … “mediators”)
        - We couldn’t predict user behavior? (“control”)
        - We couldn’t predict the behavior of components in the dataflow? (“networked services”)
        - We had partial failure as a given? (oops, have we ignored this?)
      - Yes … networking people have been here!
        - Remember Van Jacobson’s quote?
  14. The Cosmic Convergence
      [Diagram: NETWORKING RESEARCH (content-based routing, router toolkits, content addressable networks, directed diffusion) contributes Adaptivity, Federated Control, Geo-Scalability; DATABASE RESEARCH (adaptive query processing, continuous queries, approximate/interactive QP, sensor databases) contributes Data Models, Query Opt, Data-Scalability]
  15. The Cosmic Convergence
      [The same diagram, now with Telegraph sitting at the intersection of the two fields]
  16. Road Map
      - How I got started on this
        - The CONTROL project
        - Eddies
      - Tie-ins to networking research
      - Telegraph & ongoing adaptive dataflow research
      - New arenas
        - Sensor networks
        - P2P networks
  17. What’s in the Sweet Spot?
      - Scenarios with:
        - Structured content
        - Volatility
        - Rich queries
      - Clearly:
        - Long-running data analysis a la CONTROL
        - Continuous queries
        - Queries over Internet sources and services
      - Two emerging scenarios:
        - Sensor networks
        - P2P query processing
  18. Telegraph: Engineering the Sweet Spot
      - An adaptive dataflow system
        - Dataflow programming model
          - A la Volcano, CLICK: push and pull. “Fjords”, ICDE02 (see the sketch below)
        - Extensible set of pipelining operators, including relational ops and grouped filters (e.g. XFilter)
        - SQL parser for convenience (looking at XQuery)
        - Adaptivity operators
          - Eddies (+ extensible rules for routing constraints, competition)
          - SteMs (state modules)
          - FLuX (Fault-tolerant Load-balancing eXchange)
        - Bounded and continuous:
          - Data sources
          - Queries
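    A toy sketch of the push-and-pull connector idea behind Fjords; the names and
    interfaces here are assumptions for illustration, not the ICDE02 API:

        from collections import deque

        class FjordQueue:
            """One queue abstraction serving both pull-driven operators
            (consumer asks) and push-driven ones (producer delivers), so
            streaming sources and classic iterators share one plan."""
            def __init__(self):
                self.buf = deque()

            def push(self, item):                 # producer-driven
                self.buf.append(item)

            def pull(self):                       # consumer-driven, non-blocking
                return self.buf.popleft() if self.buf else None

        def select_op(inq, outq, pred):
            """A pipelining operator: drains whatever is available."""
            while (t := inq.pull()) is not None:
                if pred(t):
                    outq.push(t)

        q1, q2 = FjordQueue(), FjordQueue()
        for x in range(10):
            q1.push(x)                            # a source pushes asynchronously
        select_op(q1, q2, lambda x: x % 3 == 0)
        print([q2.pull() for _ in range(3)])      # -> [0, 3, 6]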
  19. State Modules (SteMs)
      - Goal: further adaptivity through competition
        - Multiple mirrored sources
          - Handle rate changes, failures, parallelism
        - Multiple alternate operators
      - Join = Routing + State (see the sketch below)
        - The SteM operator manages the tradeoffs
        - State Module: unifies caches, rendezvous buffers, join state
        - Competitive sources/operators share building/probing SteMs
        - Join algorithm hybridization!
      [Figure: performance of static dataflow vs. eddy vs. eddy + SteMs]
      (Vijayshankar Raman)
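    A minimal sketch of “Join = Routing + State” in Python (interfaces assumed, not
    the Telegraph code): a symmetric hash join decomposes into two SteMs, with each
    arriving tuple built into its own relation’s SteM and probed against the other’s:

        from collections import defaultdict

        class SteM:
            """Half a join's worth of state: caches, rendezvous buffers and
            join state unified behind build/probe."""
            def __init__(self, key):
                self.key = key                    # join-key extractor
                self.table = defaultdict(list)

            def build(self, t):
                self.table[self.key(t)].append(t)

            def probe(self, k):
                return self.table[k]              # stored matches for key k

        # R(a, b) join S(a, c) on a, with tuples arriving interleaved.
        stem_r, stem_s = SteM(lambda t: t[0]), SteM(lambda t: t[0])
        arrivals = [("R", (1, "b1")), ("S", (1, "c1")),
                    ("S", (2, "c2")), ("R", (2, "b2"))]
        for src, t in arrivals:
            mine, other = (stem_r, stem_s) if src == "R" else (stem_s, stem_r)
            mine.build(t)
            for m in other.probe(t[0]):
                print("joined:", t, m)            # each match found once, pipelined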
  20. FLuX: Routing Across a Cluster
      - Fault tolerance, load balancing
        - Continuous/long-running flows need high availability
        - Big flows need parallelism
        - Adaptive load-balancing required
      - FLuX operator: Exchange plus… (see the sketch below)
        - Adaptive flow partitioning (River)
        - Transient state replication & migration
        - “RAID” for SteMs
        - Needs to be extensible to different ops: content-sensitivity, history-sensitivity
      - Dataflow semantics
        - Optimize based on edge semantics
        - Networking tie-in again: at-least-once delivery? exactly-once delivery? in/out of order?
      - Migration policy: the ski-rental analogy
      (Mehul Shah)
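    A toy sketch of the FLuX idea (the policies and names are assumptions; the hard
    part, consistent state migration and replication, is elided): an Exchange that
    hash-partitions tuples into buckets mapped to parallel consumers, with a crude
    rebalancing step that moves a bucket off the hottest consumer:

        class FluxExchange:
            BUCKETS = 16

            def __init__(self, consumers, key):
                self.consumers = consumers        # one callable per partition
                self.key = key
                self.owner = [i % len(consumers) for i in range(self.BUCKETS)]
                self.load = [0] * len(consumers)

            def route(self, t):
                b = hash(self.key(t)) % self.BUCKETS
                c = self.owner[b]
                self.load[c] += 1
                self.consumers[c](t)

            def rebalance(self):
                hot = max(range(len(self.load)), key=self.load.__getitem__)
                cold = min(range(len(self.load)), key=self.load.__getitem__)
                for b, o in enumerate(self.owner):
                    if o == hot:                  # hand one bucket to the coldest
                        self.owner[b] = cold      # consumer; a real FLuX would
                        break                     # also migrate its state

        sinks = [[], []]
        fx = FluxExchange([sinks[0].append, sinks[1].append],
                          key=lambda t: t["uid"])
        for uid in [1, 1, 1, 2, 3, 1]:
            fx.route({"uid": uid})
        fx.rebalance()                            # relieve the skewed consumer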
  21. Continuously Adaptive Continuous Queries (CACQ)
      - Continuous queries clearly need all this stuff! Address adaptivity first.
      - Four ideas in CACQ:
        - Use eddies to allow reordering of ops, but one eddy serves all queries
        - Explicit tuple lineage
          - Mark each tuple with per-op ready/done bits
          - Mark each tuple with per-query completed bits
        - Queries are data: join with a Grouped Filter (see the sketch below)
          - Much like XFilter, but for relational queries
        - Joins via SteMs, shared across all queries
          - Note: mixed-lineage tuples in a SteM, i.e. shared state is not shared algebraic expressions!
          - Delete a tuple from the flow only if it matches no query
      - Next: fault-tolerant CACQ via FLuXen
      (Sam Madden, Mehul Shah, Vijayshankar Raman)
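    A small sketch of a grouped filter over one attribute (the structure is assumed
    for illustration): many queries’ “attr > c” predicates are kept sorted by
    constant, so one binary search finds every query a tuple satisfies, and the
    matches become that tuple’s per-query completed bits:

        import bisect

        class GroupedFilter:
            def __init__(self):
                self.consts = []                  # sorted predicate constants
                self.qids = []                    # parallel array of query ids

            def add(self, qid, const):            # register query: attr > const
                i = bisect.bisect(self.consts, const)
                self.consts.insert(i, const)
                self.qids.insert(i, qid)

            def matches(self, value):
                """Every query whose constant lies below 'value' is satisfied."""
                return set(self.qids[:bisect.bisect_left(self.consts, value)])

        gf = GroupedFilter()
        gf.add("Q1", 10); gf.add("Q2", 50); gf.add("Q3", 20)
        print(gf.matches(25))                     # -> {'Q1', 'Q3'}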
  22. Road Map
      - How I got started on this
        - The CONTROL project
        - Eddies
      - Tie-ins to networking research
      - Telegraph & ongoing adaptive dataflow research
      - New arenas
        - Sensor networks
        - P2P networks
  23. Sensor Nets
      - “Smart Dust” + TinyOS: thousands of “motes”
      - Expensive communication
        - Power constraints
      - Query workload:
        - Aggregation & approximation
        - Queries and continuous queries
      - Challenges:
        - Push the processing into the network (see the sketch below)
        - Deal with volatility & failure
        - CONTROL issues: data variance, user desires
      [Figure: a simple example aggregation query]
      (Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab))
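    A minimal sketch of pushing an aggregation query into the network (the tree
    representation is assumed for illustration): each mote merges its children’s
    partial states with its own reading and radios up a single (count, sum) pair
    instead of forwarding raw readings:

        def merge(a, b):
            """Combine two partial AVG states (count, sum)."""
            return (a[0] + b[0], a[1] + b[1])

        def in_network_avg(node):
            state = (1, node["reading"])          # this mote's own sample
            for child in node.get("children", []):
                state = merge(state, in_network_avg(child))
            return state                          # one tiny record per radio hop

        root = {"reading": 70.0, "children": [
            {"reading": 68.0},
            {"reading": 74.0, "children": [{"reading": 72.0}]},
        ]}
        count, total = in_network_avg(root)
        print("AVG =", total / count)             # -> 71.0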
  24. P2P QP
      - Starting point: P2P as a grassroots phenomenon
        - Outrageous filesharing volume (1.8G files in October 2001)
        - No business case to date
      - Challenge: scale DDBMS QP ideas to P2P (see the sketch below)
        - Motivate why
        - Pick the right parts of DBMS research to focus on
          - Storage: no! QP: yes.
        - Make it work:
          - Scalability well beyond our usual target
          - Admin constraints
          - Unknown data distributions, load
          - Heterogeneous comm/processing
          - Partial failure
      (Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo)
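    One way QP might ride on P2P infrastructure, sketched in Python (the DHT
    interface and join strategy are assumptions for illustration, not the project’s
    design): a CAN-style put/get by key doubles as the Exchange of a distributed
    hash join, letting matching R and S tuples rendezvous at a peer no site chose
    in advance:

        class ToyDHT:
            def __init__(self, n_peers):
                self.peers = [dict() for _ in range(n_peers)]

            def _peer(self, key):                 # key deterministically lands
                return self.peers[hash(key) % len(self.peers)]   # on some peer

            def put(self, key, value):
                self._peer(key).setdefault(key, []).append(value)

            def get(self, key):
                return self._peer(key).get(key, [])

        dht = ToyDHT(n_peers=8)
        for r in [("R", 1, "b1"), ("R", 2, "b2")]:
            dht.put(("q42", r[1]), r)             # publish under hash(join key)
        for s in [("S", 1, "c1")]:
            for r in dht.get(("q42", s[1])):
                print("rendezvous join:", r, s)   # -> R(1,b1) x S(1,c1)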
  25. Themes Throughout
      - Adaptivity
        - Requires clever system design
          - The Exchange model: encapsulate in ops?
        - Interesting adaptive policy problems
          - E.g. eddy routing, FLuX migration
          - Control theory, machine learning
        - Encompasses another CS goal?
          - “No-knobs”, “Autonomic”, etc.
      - New performance regimes
        - Decent performance in the common case
        - Mean/variance more important than MAX
        - Interactive metrics
        - Time to completion often unimportant/irrelevant
  26. More Themes
      - Set-valued thinking as an albatross?
        - E.g. eddies vs. Kabra/DeWitt or Tukwila
        - E.g. SteMs vs. materialized views
        - E.g. CACQ vs. NiagaraCQ
        - Some clean theory here would be nice
          - Current routing correctness proofs are inelegant
      - Extensibility
        - Model/language of choice is not clear
          - SEQ? Relational? XQuery?
        - Extensible operators, edge semantics
        - [A whine about VLDB’s absurd “Specificity Factor”]
  27. Conclusions?
      - Too early for technical conclusions
      - Of this I’m sure:
        - The CS262 experiment is a success
          - Our students are getting a bigger picture than before
          - I’m learning, finding new connections
          - May morph to OS/Nets, Nets/DB
          - Eventually rethink the undergraduate systems software curriculum too
        - Nets folks are coming our way
          - Doing relevant work, eager to collaborate
        - DB community needs to branch out
          - Outbound: better proselytizing in CS
          - Inbound: need new ideas