Adaptive Dataflow: A Database/Networking Convergence

Given at the Stanford database seminar 12/7/2001, this talk makes the case that database and networking research are converging, and gives examples from work in the Telegraph group at Berkeley.

Joe Hellerstein

December 07, 2001

Transcript

  1. Road Map
     - How I got started on this
       - The CONTROL project
       - Eddies
     - Tie-ins to networking research
     - Telegraph & ongoing adaptive dataflow research
     - New arenas
       - Sensor networks
       - P2P networks
  2. Background: The CONTROL Project
     - Online/interactive query processing
       - Online aggregation
       - Scalable spreadsheets & refining visualizations
       - Online data cleaning (Potter’s Wheel)
     - Pipelining operators (ripple joins, online reordering) over streaming samples
  3. Goals for Online Processing
     - Performance metric: J
       - Statistical (e.g. confidence intervals; see the sketch below)
       - User-driven (e.g. weighted by widgets)
     - New “greedy” performance regime
       - Maximize the 1st derivative of the “mirth index”
       - Mirth is defined on-the-fly
       - Therefore we need FEEDBACK and CONTROL
     [Figure: J vs. time; the online approach nears 100% long before the traditional approach does]
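    A minimal Python sketch of online aggregation with a statistical metric (illustrative
    only; the estimator and reporting interface are assumptions, not the CONTROL code).
    A running AVG over a randomly ordered stream yields an estimate with a shrinking
    CLT-style confidence interval, so a useful answer arrives long before the scan ends:

        import math, random

        def online_avg(stream, z=1.96, report_every=1000):
            """Running AVG with a CLT-style confidence interval, maintained
            online via Welford's mean/variance updates."""
            n = 0
            mean = 0.0
            m2 = 0.0
            for x in stream:
                n += 1
                delta = x - mean
                mean += delta / n
                m2 += delta * (x - mean)
                if n % report_every == 0:
                    stderr = math.sqrt((m2 / (n - 1)) / n)
                    yield mean, z * stderr      # estimate +/- half-width

        stream = (random.gauss(50, 10) for _ in range(10000))
        for est, half in online_avg(stream):
            print(f"AVG ~= {est:.2f} +/- {half:.2f}")   # interval shrinks over time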
  4. CONTROL ⇒ Volatility
     - Goals and data may change over time
       - User feedback, sample variance
     - Goals and data may be different in different “regions”
       - Group-by, scrollbar position
       - [An aside: dependencies in selectivity estimation]
     - Q: Query optimization in this world?
       - Or in any pipelining, volatile environment??
       - Where else do we see volatility?
  5. Continuous Adaptivity: Eddies
     - A little more state per tuple
       - Ready/done bits (extensible a la Volcano/Starburst)
     - Query processing = dataflow routing!!
       - We’ll come back to this!
     [Figure: an eddy routing tuples among pipelined operators]
  6. Eddies: Two Key Observations
     - Break the set-oriented boundary
       - Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
       - Usual DB implementation: pipelining operators!
         - Subexpressions never materialized
       - The typical implementation is more flexible than the algebra
         - We can reorder in-flight operators
         - Other gains possible by breaking the set-oriented boundary…
     - Don’t rewrite the graph. Impose a router (see the sketch below)
       - Graph edge = absence of a routing constraint
       - Observe operator consumption/production rates
         - Consumption: cost
         - Production: cost × selectivity
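    A minimal Python sketch of the eddy idea from slides 5-6 (the interfaces and
    routing policy here are simplified assumptions, not the Telegraph implementation):
    each tuple carries ready/done bits, and the router favors the eligible operator
    with the lowest observed production/consumption ratio, i.e. selectivity:

        from collections import namedtuple

        Tup = namedtuple("Tup", ["data", "ready", "done"])   # per-tuple routing state

        class Eddy:
            def __init__(self, ops):
                self.ops = ops                    # one predicate per operator
                self.consumed = [1] * len(ops)    # tuples routed to each op
                self.produced = [1] * len(ops)    # tuples each op passed back

            def route(self, t):
                """Pick an eligible op; prefer low observed selectivity."""
                eligible = [i for i in range(len(self.ops))
                            if t.ready[i] and not t.done[i]]
                return min(eligible,
                           key=lambda i: self.produced[i] / self.consumed[i])

            def run(self, source):
                k = len(self.ops)
                for data in source:
                    t = Tup(data, [True] * k, [False] * k)
                    while not all(t.done):
                        i = self.route(t)
                        self.consumed[i] += 1
                        if not self.ops[i](t.data):
                            break                 # dropped; never materialized
                        t.done[i] = True
                        self.produced[i] += 1
                    else:
                        yield t.data              # passed every operator

        # The eddy learns to try the more selective predicate first.
        for out in Eddy([lambda x: x % 2 == 0, lambda x: x < 10]).run(range(100)):
            print(out)                            # -> 0 2 4 6 8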
  7. Road Map
     - How I got started on this
       - The CONTROL project
       - Eddies
     - Tie-ins to networking research
     - Telegraph & ongoing adaptive dataflow research
     - New arenas
       - Sensor networks
       - P2P networks
  8. Coincidence: Eddie Comes to Berkeley
     - CLICK: a network router is a query plan!
       - “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ’99
  9. [Figure: “Figure 3: Example Router Graph”, from the Click paper]
     - Also Scout: paths are the key to a communication-centric OS
       - “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ’96
  10. More Interaction: CS262 Experiment w/ Eric Brewer
      - Merged the OS & DBMS grad class, over a year
        - Eric/Joe, point/counterpoint
      - Some tie-ins were obvious:
        - Memory mgmt, storage, scheduling, concurrency
      - Surprising: QP and networks go well side by side
        - E.g. eddies and TCP congestion control
        - Both use back-pressure and simple control theory to “learn” in an unpredictable dataflow environment
        - Eddies are close to the n-armed bandit problem (see the sketch below)
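    For the bandit connection, a textbook epsilon-greedy n-armed bandit in Python
    (an analogy, not the eddy policy itself): each “arm” stands in for a routing
    choice, each “reward” for how cheaply a tuple was disposed of:

        import random

        def epsilon_greedy(n_arms, pull, rounds=10000, eps=0.1):
            counts = [0] * n_arms
            values = [0.0] * n_arms                 # running mean reward per arm
            for _ in range(rounds):
                if random.random() < eps:
                    arm = random.randrange(n_arms)  # explore
                else:
                    arm = max(range(n_arms), key=values.__getitem__)  # exploit
                r = pull(arm)
                counts[arm] += 1
                values[arm] += (r - values[arm]) / counts[arm]
            return values

        # Arm 1 stands in for the better operator ordering.
        print(epsilon_greedy(2, lambda a: random.gauss(0.7 if a == 1 else 0.3, 0.1)))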
  11. Networking Overview for DB People Like Me
      - Core function of protocols: data transfer
        - Data manipulation (buffering, checksums, encryption, transfer to/from app space, presentation)
        - Transfer control (flow/congestion control, detecting transmission problems, acks, muxing, timestamps, framing)
        -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ’90
      - Basic Internet assumption:
        “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
  12. C & T’s Wacky Ideas
      - Thesis: nets are good at transfer control, not so good at data manipulation
      - Some C&T wacky ideas for better data manipulation:
        - Transfer semantic units, not packets (ALF)
        - Auto-rewrite layers to flatten them (ILP)
        - Minimize cross-layer ordering constraints
        - Control delivery in parallel via packet content
      [Slide callouts map these to DB ideas: Exchange! Data Modeling! Query Opt!]
  13. Wacky New Ideas in QP
      - What if…
        - We had unbounded data producers and consumers? (“streams” … “continuous queries”)
        - We couldn’t know our producers’ behavior or contents? (“federation” … “mediators”)
        - We couldn’t predict user behavior? (“control”)
        - We couldn’t predict the behavior of components in the dataflow? (“networked services”)
        - We had partial failure as a given? (oops, have we ignored this?)
      - Yes … networking people have been here!
        - Remember Van Jacobson’s quote?
  14. The Cosmic Convergence
      [Diagram: NETWORKING RESEARCH (content-based routing, router toolkits, content addressable networks, directed diffusion) contributes Adaptivity, Federated Control, Geo-Scalability; DATABASE RESEARCH (adaptive query processing, continuous queries, approximate/interactive QP, sensor databases) contributes Data Models, Query Opt, Data-Scalability]
  15. The Cosmic Convergence
      [The same diagram, now with Telegraph sitting at the intersection of the two fields]
  16. Road Map
      - How I got started on this
        - The CONTROL project
        - Eddies
      - Tie-ins to networking research
      - Telegraph & ongoing adaptive dataflow research
      - New arenas
        - Sensor networks
        - P2P networks
  17. What’s in the Sweet Spot?
      - Scenarios with:
        - Structured content
        - Volatility
        - Rich queries
      - Clearly:
        - Long-running data analysis a la CONTROL
        - Continuous queries
        - Queries over Internet sources and services
      - Two emerging scenarios:
        - Sensor networks
        - P2P query processing
  18. Telegraph: Engineering the Sweet Spot
      - An adaptive dataflow system
        - Dataflow programming model
          - A la Volcano, CLICK: push and pull. “Fjords”, ICDE02 (see the sketch below)
        - Extensible set of pipelining operators, including relational ops and grouped filters (e.g. XFilter)
        - SQL parser for convenience (looking at XQuery)
        - Adaptivity operators
          - Eddies (+ extensible rules for routing constraints, competition)
          - SteMs (state modules)
          - FLuX (Fault-tolerant Load-balancing eXchange)
        - Bounded and continuous:
          - Data sources
          - Queries
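    A toy sketch of the push-and-pull connector idea behind Fjords; the names and
    interfaces here are assumptions for illustration, not the ICDE02 API:

        from collections import deque

        class FjordQueue:
            """One queue abstraction serving both pull-driven operators
            (consumer asks) and push-driven ones (producer delivers), so
            streaming sources and classic iterators share one plan."""
            def __init__(self):
                self.buf = deque()

            def push(self, item):                 # producer-driven
                self.buf.append(item)

            def pull(self):                       # consumer-driven, non-blocking
                return self.buf.popleft() if self.buf else None

        def select_op(inq, outq, pred):
            """A pipelining operator: drains whatever is available."""
            while (t := inq.pull()) is not None:
                if pred(t):
                    outq.push(t)

        q1, q2 = FjordQueue(), FjordQueue()
        for x in range(10):
            q1.push(x)                            # a source pushes asynchronously
        select_op(q1, q2, lambda x: x % 3 == 0)
        print([q2.pull() for _ in range(3)])      # -> [0, 3, 6]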
  19. State Modules (SteMs)
      - Goal: further adaptivity through competition
        - Multiple mirrored sources
          - Handle rate changes, failures, parallelism
        - Multiple alternate operators
      - Join = Routing + State (see the sketch below)
        - The SteM operator manages the tradeoffs
        - State Module: unifies caches, rendezvous buffers, join state
        - Competitive sources/operators share building/probing SteMs
        - Join algorithm hybridization!
      [Figure: performance of static dataflow vs. eddy vs. eddy + SteMs]
      (Vijayshankar Raman)
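    A minimal sketch of “Join = Routing + State” in Python (interfaces assumed, not
    the Telegraph code): a symmetric hash join decomposes into two SteMs, with each
    arriving tuple built into its own relation’s SteM and probed against the other’s:

        from collections import defaultdict

        class SteM:
            """Half a join's worth of state: caches, rendezvous buffers and
            join state unified behind build/probe."""
            def __init__(self, key):
                self.key = key                    # join-key extractor
                self.table = defaultdict(list)

            def build(self, t):
                self.table[self.key(t)].append(t)

            def probe(self, k):
                return self.table[k]              # stored matches for key k

        # R(a, b) join S(a, c) on a, with tuples arriving interleaved.
        stem_r, stem_s = SteM(lambda t: t[0]), SteM(lambda t: t[0])
        arrivals = [("R", (1, "b1")), ("S", (1, "c1")),
                    ("S", (2, "c2")), ("R", (2, "b2"))]
        for src, t in arrivals:
            mine, other = (stem_r, stem_s) if src == "R" else (stem_s, stem_r)
            mine.build(t)
            for m in other.probe(t[0]):
                print("joined:", t, m)            # each match found once, pipelined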
  20. FLuX: Routing Across a Cluster
      - Fault tolerance, load balancing
        - Continuous/long-running flows need high availability
        - Big flows need parallelism
        - Adaptive load-balancing required
      - FLuX operator: Exchange plus… (see the sketch below)
        - Adaptive flow partitioning (River)
        - Transient state replication & migration
        - “RAID” for SteMs
        - Needs to be extensible to different ops: content-sensitivity, history-sensitivity
      - Dataflow semantics
        - Optimize based on edge semantics
        - Networking tie-in again: at-least-once delivery? exactly-once delivery? in/out of order?
      - Migration policy: the ski-rental analogy
      (Mehul Shah)
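    A toy sketch of the FLuX idea (the policies and names are assumptions; the hard
    part, consistent state migration and replication, is elided): an Exchange that
    hash-partitions tuples into buckets mapped to parallel consumers, with a crude
    rebalancing step that moves a bucket off the hottest consumer:

        class FluxExchange:
            BUCKETS = 16

            def __init__(self, consumers, key):
                self.consumers = consumers        # one callable per partition
                self.key = key
                self.owner = [i % len(consumers) for i in range(self.BUCKETS)]
                self.load = [0] * len(consumers)

            def route(self, t):
                b = hash(self.key(t)) % self.BUCKETS
                c = self.owner[b]
                self.load[c] += 1
                self.consumers[c](t)

            def rebalance(self):
                hot = max(range(len(self.load)), key=self.load.__getitem__)
                cold = min(range(len(self.load)), key=self.load.__getitem__)
                for b, o in enumerate(self.owner):
                    if o == hot:                  # hand one bucket to the coldest
                        self.owner[b] = cold      # consumer; a real FLuX would
                        break                     # also migrate its state

        sinks = [[], []]
        fx = FluxExchange([sinks[0].append, sinks[1].append],
                          key=lambda t: t["uid"])
        for uid in [1, 1, 1, 2, 3, 1]:
            fx.route({"uid": uid})
        fx.rebalance()                            # relieve the skewed consumer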
  21. Continuously Adaptive Continuous Queries (CACQ)
      - Continuous queries clearly need all this stuff! Address adaptivity first.
      - Four ideas in CACQ:
        - Use eddies to allow reordering of ops, but one eddy serves all queries
        - Explicit tuple lineage
          - Mark each tuple with per-op ready/done bits
          - Mark each tuple with per-query completed bits
        - Queries are data: join with a Grouped Filter (see the sketch below)
          - Much like XFilter, but for relational queries
        - Joins via SteMs, shared across all queries
          - Note: mixed-lineage tuples in a SteM, i.e. shared state is not shared algebraic expressions!
          - Delete a tuple from the flow only if it matches no query
      - Next: fault-tolerant CACQ via FLuXen
      (Sam Madden, Mehul Shah, Vijayshankar Raman)
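    A small sketch of a grouped filter over one attribute (the structure is assumed
    for illustration): many queries’ “attr > c” predicates are kept sorted by
    constant, so one binary search finds every query a tuple satisfies, and the
    matches become that tuple’s per-query completed bits:

        import bisect

        class GroupedFilter:
            def __init__(self):
                self.consts = []                  # sorted predicate constants
                self.qids = []                    # parallel array of query ids

            def add(self, qid, const):            # register query: attr > const
                i = bisect.bisect(self.consts, const)
                self.consts.insert(i, const)
                self.qids.insert(i, qid)

            def matches(self, value):
                """Every query whose constant lies below 'value' is satisfied."""
                return set(self.qids[:bisect.bisect_left(self.consts, value)])

        gf = GroupedFilter()
        gf.add("Q1", 10); gf.add("Q2", 50); gf.add("Q3", 20)
        print(gf.matches(25))                     # -> {'Q1', 'Q3'}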
  22. Road Map
      - How I got started on this
        - The CONTROL project
        - Eddies
      - Tie-ins to networking research
      - Telegraph & ongoing adaptive dataflow research
      - New arenas
        - Sensor networks
        - P2P networks
  23. Sensor Nets
      - “Smart Dust” + TinyOS: thousands of “motes”
      - Expensive communication
        - Power constraints
      - Query workload:
        - Aggregation & approximation
        - Queries and continuous queries
      - Challenges:
        - Push the processing into the network (see the sketch below)
        - Deal with volatility & failure
        - CONTROL issues: data variance, user desires
      [Figure: a simple example aggregation query]
      (Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab))
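    A minimal sketch of pushing an aggregation query into the network (the tree
    representation is assumed for illustration): each mote merges its children’s
    partial states with its own reading and radios up a single (count, sum) pair
    instead of forwarding raw readings:

        def merge(a, b):
            """Combine two partial AVG states (count, sum)."""
            return (a[0] + b[0], a[1] + b[1])

        def in_network_avg(node):
            state = (1, node["reading"])          # this mote's own sample
            for child in node.get("children", []):
                state = merge(state, in_network_avg(child))
            return state                          # one tiny record per radio hop

        root = {"reading": 70.0, "children": [
            {"reading": 68.0},
            {"reading": 74.0, "children": [{"reading": 72.0}]},
        ]}
        count, total = in_network_avg(root)
        print("AVG =", total / count)             # -> 71.0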
  24. P2P QP
      - Starting point: P2P as a grassroots phenomenon
        - Outrageous filesharing volume (1.8G files in October 2001)
        - No business case to date
      - Challenge: scale DDBMS QP ideas to P2P (see the sketch below)
        - Motivate why
        - Pick the right parts of DBMS research to focus on
          - Storage: no! QP: yes.
        - Make it work:
          - Scalability well beyond our usual target
          - Admin constraints
          - Unknown data distributions, load
          - Heterogeneous comm/processing
          - Partial failure
      (Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo)
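    One way QP might ride on P2P infrastructure, sketched in Python (the DHT
    interface and join strategy are assumptions for illustration, not the project’s
    design): a CAN-style put/get by key doubles as the Exchange of a distributed
    hash join, letting matching R and S tuples rendezvous at a peer no site chose
    in advance:

        class ToyDHT:
            def __init__(self, n_peers):
                self.peers = [dict() for _ in range(n_peers)]

            def _peer(self, key):                 # key deterministically lands
                return self.peers[hash(key) % len(self.peers)]   # on some peer

            def put(self, key, value):
                self._peer(key).setdefault(key, []).append(value)

            def get(self, key):
                return self._peer(key).get(key, [])

        dht = ToyDHT(n_peers=8)
        for r in [("R", 1, "b1"), ("R", 2, "b2")]:
            dht.put(("q42", r[1]), r)             # publish under hash(join key)
        for s in [("S", 1, "c1")]:
            for r in dht.get(("q42", s[1])):
                print("rendezvous join:", r, s)   # -> R(1,b1) x S(1,c1)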
  25. Themes Throughout
      - Adaptivity
        - Requires clever system design
          - The Exchange model: encapsulate in ops?
        - Interesting adaptive policy problems
          - E.g. eddy routing, FLuX migration
          - Control theory, machine learning
        - Encompasses another CS goal?
          - “No-knobs”, “Autonomic”, etc.
      - New performance regimes
        - Decent performance in the common case
        - Mean/variance more important than MAX
        - Interactive metrics
        - Time to completion often unimportant/irrelevant
  26. More Themes
      - Set-valued thinking as an albatross?
        - E.g. eddies vs. Kabra/DeWitt or Tukwila
        - E.g. SteMs vs. materialized views
        - E.g. CACQ vs. NiagaraCQ
        - Some clean theory here would be nice
          - Current routing correctness proofs are inelegant
      - Extensibility
        - Model/language of choice is not clear
          - SEQ? Relational? XQuery?
        - Extensible operators, edge semantics
        - [A whine about VLDB’s absurd “Specificity Factor”]
  27. Conclusions?
      - Too early for technical conclusions
      - Of this I’m sure:
        - The CS262 experiment is a success
          - Our students are getting a bigger picture than before
          - I’m learning, finding new connections
          - May morph to OS/Nets, Nets/DB
          - Eventually rethink the undergraduate systems software curriculum too
        - Nets folks are coming our way
          - Doing relevant work, eager to collaborate
        - DB community needs to branch out
          - Outbound: better proselytizing in CS
          - Inbound: need new ideas