Query Processing and Networking Infrastructures

A 2-day tutorial and workshop given at MIT, September 2002. Day 1 was a crash course on traditional query processing in relational databases and text retrieval systems. Day 2 was a more speculative discussion of resonances between query processing and data movement in networks, including an overview of our research at Berkeley on the Telegraph, TinyDB, and PIER projects. Slides are available; please do not copy without permission.

Joe Hellerstein

September 02, 2002

Transcript

  1. Query Processing and Networking Infrastructures Day 1 of 2 Joe

    Hellerstein UC Berkeley September 20, 2002
  2. Two Goals Day 1: Primer on query processing n Targeted

    to networking/OS folk n Bias: systems issues Day 2: Seed some cross-fertilized research n Especially with networking n Thesis: dataflow convergence n query processing and routing n Clearly other resonances here n Dataflow HW architectures n Event-based systems designs n ML and Control Theory n Online Algorithms
  3. (Sub)Space of Possible Topics Traditional Relational QP: Optimization & Execution

    Parallel QP Distributed & Federated QP Boolean Text Search Traditional Text Ranking Hypertext Ranking Indexing Data Model & Query Language Design NFNF Data Models (OO, XML, “Semistructured”) Online and Approximate QP Visual Querying & Data Visualization Statistical Data Analysis (“Mining”) Active DBs (Trigger Systems) Media Queries, Feature Extraction & Similarity Search Data Reduction Transactional Storage & Networking Compression Data Streams & Continuous Queries Adaptive QP
  4. Transactional Storage & Networking Likely Topics Here Traditional Relational QP:

    Optimization & Execution Parallel QP Distributed & Federated QP Boolean Text Search Traditional Text Ranking Hypertext Ranking Indexing Data Model & Query Language Design NFNF Data Models (OO, XML, “Semistructured”) Online and Approximate QP Visual Querying & Data Visualization Statistical Data Analysis (“Mining”) Active DBs (Trigger Systems) Media Queries, Feature Extraction & Similarity Search Data Reduction Compression Data Streams & Continuous Queries Adaptive QP
  5. Plus Some Speculative Ones Traditional Relational QP: Optimization & Execution

    Parallel QP Distributed & Federated QP Boolean Text Search Traditional Text Ranking Indexing Online and Approximate QP Data Streams & Continuous Queries Adaptive QP Peer-to-Peer QP Sensornet QP Content Routing Network Monitoring Indirection Architectures
  6. Outline Day 1: Query Processing Crash Course n Intro n

    Queries as indirection n How do relational databases run queries? n How do search engines run queries? n Scaling up: cluster parallelism and distribution Day 2: Research Synergies w/Networking n Queries as indirection, revisited n Useful (?) analogies to networking research n Some of our recent research at the seams n Some of your research? n Directions and collective discussion
  7. Roots: database and IR research “Top-down” traditions (“applications”) n Usually

    begins with semantics and models. Common Misconceptions n Query processing = Oracle or Google. n Need not be so heavyweight or monolithic! Many reusable lessons within n IR search and DB querying are fundamentally different n Very similar from a query processing perspective n Many similarities in other data models as well n Querying is a synchronous, interactive process. n Triggers, rules and "continuous queries" not so different from plain old queries.
  8. So… we’ll go bottom-up Focus on reusable building blocks Attempt

    to be language- and model- agnostic n illustrate with various querying scenarios
  9. Confession: Two Biases Relational query engines n Most mature and

    general query technology n Best documented in the literature n Conceptually general enough to “capture” most all other models/schemes Everybody does web searches n So it’s both an important app, and an inescapable usage bias we carry around n It will inform our discussion. Shouldn’t skew it Lots of other query systems/languages you can keep in mind as we go n LDAP, DNS, XSL/Xpath/XQuery, Datalog
  10. What Are Queries For? I Obvious answer: search and analysis

    over big data sets n Search: select data of interest n Boolean expressions over content n sometimes with an implicit ordering on results n Analysis: construct new information from base data. n compute functions over each datum n concatenate related records (join) n partition into groups, summarize (aggregates) n aside: “Mining” vs. “Querying”? As a rule of thumb, think of mining as WYGIWIGY. Not the most general, powerful answer…
  11. What Are Queries For? II Queries bridge a (large!) level

    of indirection n Declarative programming: what you want, not how to get it n Easy(er) to express n Allows the “how” to change under the covers n A critical issue! n Not just for querying n Method invocation, data update, etc.
  12. Motivation for this Indirection Critical when rates of change differ

    across layers: n In particular, when dapp/dt << denvironment/dt n E.g. DB apps are used for years, decades (!!) n E.g. networked env: high rates of change (??) DB lit calls this “data independence”
  13. Data Independence: Background Bad Old Days n Hierarchical and “Network”

    (yep!) data models n Nesting & pointers mean that apps explicitly traverse data, become brittle when data layouts change Apps with persistent data have slow dapp/dt n And the database environments change faster! n Logical changes to representation (schema) n Physical changes in storage (indexes, layouts, HW) n DBs often shared by multiple apps! n In B.O.D., all apps had to be rewritten on change
  14. It’s a SW Engineering Thing Analogy: imagine if your C

    structs were to survive for decades n you’d keep them very simple n encapsulation to allow future mods Similar Analogy to NWs n protocol simplicity is good n soft state is good (discourages hardcoded refs to transient resources) But the fun systems part follows directly: n Achieve the goal w/respectable performance over a dynamic execution environment
  15. Codd’s Data Independence Ted Codd, IBM c. 1969 and forward

    n Turing award 1981 Two layers of indirection (Figure: Applications -> Logical Representation (schema) -> Physical Representation (storage); logical independence between apps and schema is spanned by views and query rewriting, physical independence between schema and storage by query optimization and execution)
  16. A More Architectural Picture (Figure: a declarative query over views

    passes through the Query Rewriter, bridging logical independence, to become a declarative query over base tables; the Query Processor’s Optimizer turns that into a procedural query plan, which the Executor runs against the Access Methods via the Iterator API, bridging physical independence.) N.B.: This classical QP architecture raises some problems. To be revisited!
  17. Access Methods Base data access layer Model: Data stored in

    unordered collections n Relations, tables, one type per collection Interface: iterators n Open(predicate) -> cursor n Usually simple predicates: attribute op constant n op usually arithmetic (<, >, =), though we’ll see extensions (e.g. multi-d ops) n Next(cursor) -> datum (of known type) n Close(cursor) n Insert(datum of correct type) n Delete(cursor)
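To make the access-method interface concrete, here is a minimal Python sketch under invented names (HeapFile is not any real system's API). A Python generator stands in for the cursor, since it natively supports next() and close():

```python
# Minimal sketch of the access-method iterator API; illustrative only.
from typing import Any, Callable, List

class HeapFile:
    """Unordered collection of records of one type."""
    def __init__(self) -> None:
        self.records: List[Any] = []

    def insert(self, datum: Any) -> None:
        self.records.append(datum)

    def open(self, predicate: Callable[[Any], bool] = lambda r: True):
        """Open(predicate) -> cursor; a simple predicate is pushed into the scan."""
        return (r for r in self.records if predicate(r))

emp = HeapFile()
emp.insert({"name": "alice", "sal": 120})
emp.insert({"name": "bob", "sal": 80})
cursor = emp.open(lambda r: r["sal"] > 100)   # Open(sal > 100)
print(next(cursor))                           # Next(cursor) -> first match
cursor.close()                                # Close(cursor)
```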
  18. Typical Access Methods “Heap” files n unordered array of records

    n usually sequential on disk n predicates just save cross-layer costs Traditional Index AMs n B-trees n actually, “B+”-trees: all data at leaves n Can scan across leaves for range search n predicates (<,>,=, between) result in fewer I/Os n random I/Os (at least to find beginning of range) n Linear Hash index n Litwin ‘80. Supports equality predicates only. This is it for IR and standard relational DBs n Though when IR folks say “indexing”, they sometimes mean all of query processing
  19. An Exotic Forest of Search Trees Multi-dimensional indexes n For

    geodata, multimedia search, etc. n Dozens! E.g. R-tree family, disk-based Quad- Trees, kdB-trees n And of course “linearizations” with B-trees Path indexes n For XML and OO path queries n E.g. Xfilter Etc. n Lots of one-off indexes, often many per workload n No clear winners here n Extensible indexing scheme would be nice
  20. Generalized Search Trees (GiST) What is a (tree-based) DB index?

    Typically: n A clustering of data into leaf blocks n Hierarchical summaries (subtree predicates -- SPs) for pointers in directory blocks (Figure: a directory block with SP entries p1, p2, p3, …) [Hellerstein et al., VLDB 95]
  21. Generalized Search Trees (GiST) Can realize that abstraction with simple

    interface: n User registers opaque SP objects with a few methods n Consistent(q, p): should query q traverse subtree? n Penalty(d, p): how bad is it to insert d below p n Union (p1, p2): form SP that includes p1, p2 n PickSplit({p1, …, pn}): partition SPs into 2 Tree maintenance, concurrency, recovery all doable under the covers Covers many popular multi-dimensional indexes n Most of which had no concurrency/recovery story http://gist.cs.berkeley.edu
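A hedged sketch of the four GiST extension methods, specialized to 1-D range predicates as a trivial stand-in for, say, R-tree bounding boxes. RangeSP and the method spellings are illustrative, not the actual GiST C interface:

```python
# Sketch of GiST's extension interface for 1-D interval subtree predicates.
class RangeSP:
    """Subtree predicate: [lo, hi] interval covering all keys below."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def consistent(self, q_lo, q_hi):
        """Consistent(q, p): should the query range traverse this subtree?"""
        return not (q_hi < self.lo or q_lo > self.hi)

    def penalty(self, key):
        """Penalty(d, p): how bad is it to insert key below this SP?"""
        return max(0, self.lo - key, key - self.hi)   # required enlargement

    @staticmethod
    def union(p1, p2):
        """Union(p1, p2): form an SP that includes both."""
        return RangeSP(min(p1.lo, p2.lo), max(p1.hi, p2.hi))

    @staticmethod
    def pick_split(sps):
        """PickSplit({p1..pn}): naive 2-way partition of SPs."""
        sps = sorted(sps, key=lambda p: p.lo)
        mid = len(sps) // 2
        return sps[:mid], sps[mid:]

p = RangeSP.union(RangeSP(1, 5), RangeSP(9, 12))
print(p.consistent(6, 8), p.penalty(20))    # True 8
```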
  22. Some Additional Indexing Tricks Bitmap indexing n Many matches per

    value in (secondary) index? Rather than storing pointers to heap file in leaves, store a bitmap of matches in a (sorted) heap file. n Only works if file reorg is infrequent n Can make intersection, COUNT, etc. quicker during query processing n Can mix/match bitmaps and lists in a single index n Works with any (secondary) index with duplicate matches “Vertical Partitioning” / “Columnar storage” n Again, for sorted, relatively static files [O’Neil/Quass, SIGMOD 97]
  23. Dataflow Infrastructure Dataflow abstraction is very simple n “box-and-arrow” diagrams

    n (typed) collections of objects flow along edges Details can be tricky n “Push” or “Pull”? n More to it than that n How do control-flow and dataflow interact? n Where does the data live? n Don’t want to copy data n If passing pointers, where does the “real” data live?
  24. Iterators Most uniprocessor DB engines use iterators n Open() ->

    cursor n Next(cursor) -> typed record n Close(cursor) Simple and elegant n Control-flow and dataflow coupled n Familiar single-threaded, procedure-call API n Data refs passed on stack, no buffering Blocking-agnostic n Works w/blocking ops -- e.g. Sort n Works w/pipelined ops Note: well-behaved iterators “come up for air” in inner loops n E.g. for interrupt handling (Figure: a small iterator plan over R and S.)
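A minimal sketch of pull-based iterator composition, assuming invented operator names (Scan, Select). Note how the consumer's Next() call drives the producer, coupling control flow to dataflow:

```python
# Pull-based (iterator) operator composition, as in uniprocessor engines.
class Scan:
    def __init__(self, table):
        self.table = table
    def open(self):
        self.i = 0
    def next(self):                       # returns a record, or None at EOF
        if self.i >= len(self.table):
            return None
        r = self.table[self.i]
        self.i += 1
        return r
    def close(self):
        pass

class Select:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def next(self):
        while (r := self.child.next()) is not None:   # pull from child
            if self.pred(r):
                return r                  # control-flow and dataflow coupled
        return None
    def close(self):
        self.child.close()

plan = Select(Scan([1, 5, 9, 2]), lambda x: x > 4)
plan.open()
while (r := plan.next()) is not None:
    print(r)    # 5, then 9
plan.close()
```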
  25. Where is the In-Flight Data? In standard DBMS, raw data

    lives in disk format, in shared Buffer Pool Iterators pass references to BufPool n A tuple “slot” per iterator input n Never copy along edges of dataflow n Join results are arrays of refs to base tables Operators may “pin” pages in BufPool n BufPool never replaces pinned pages n Ops should release pins ASAP (esp. across Next() calls!!) n Some operators copy data into their internal state n Can “spill” this state to private disk space
  26. Weaknesses of Simple Iterators Evolution of uniprocessor archs to parallel

    archs n esp. “shared-nothing” clusters Opportunity for pipelined parallelism Opportunity for partition parallelism n Take a single “box” in the dataflow, and split it across multiple machines Problems with iterators in this environment n Spoils pipelined parallelism opportunity n Polling (Next()) across the network is inefficient n Nodes sit idle until polled, and during comm n A blocking producer blocks its consumer But would like to keep iterator abstraction n Especially to save legacy query processor code n And simplify debugging (single-threaded, synchronous)
  27. Exchange Encapsulate partition parallelism & asynchrony n Keep the iterator

    API between ops n Exchange operator partitions input data by content n E.g. join or sort keys Note basic architectural idea! n Encapsulate dataflow tricks in operators, leaving infrastructure untouched n We’ll see this again next week, e.g. in Eddies [Graefe, SIGMOD 90]
  28. Exchange Internals (Figure: XIN routes tuples to XOUT’s queue) Really 2 operators, XIN and

    XOUT n XIN is “top” of a plan, and pulls, pushing results to XOUT queue n XOUT spins on its local queue One thread becomes two n Producer graph & XIN n Consumer graph & XOUT Routing table/fn in XIN supports partition parallelism n E.g. for || sort, join, etc. Producer and consumer see iterator API Queue + thread barrier turns NW-based “push” into iterator-style “pull”
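A hedged sketch of the XIN/XOUT split, using Python threads and a bounded local queue; a real Exchange would route by content across the network rather than to in-process queues. Function and variable names are invented:

```python
# Sketch: Exchange turns network-style "push" into iterator-style "pull".
import threading
import queue

def xin(producer_iter, out_queues, route):
    """'Top' of the producer plan: pulls from it, pushes to partition queues."""
    for rec in producer_iter:
        out_queues[route(rec)].put(rec)   # partition by content (e.g. join key)
    for q in out_queues:
        q.put(None)                       # EOF marker for each consumer

def xout(q):
    """Consumer side: spins on its local queue, presents iterator-style pull."""
    while (rec := q.get()) is not None:
        yield rec

queues = [queue.Queue(maxsize=8) for _ in range(2)]
t = threading.Thread(target=xin, args=(iter(range(10)), queues, lambda r: r % 2))
t.start()
print(list(xout(queues[0])))   # the even-keyed partition: [0, 2, 4, 6, 8]
t.join()
```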
  29. Exchange Benefits? Remember Iterator limitations? n “Spoils pipelined parallelism opportunity”

    n solved by Exchange thread boundary n “Polling (Next()) across the network is inefficient” n Solved by XIN pushing to XOUT queue n “A blocking producer blocks its consumer” n Still a problem!
  30. Exchange Limitations Doesn’t allow consumer work to overlap w/blocking producers

    n E.g. streaming data sources, events n E.g. sort, some join algs n Entire consumer graph blocks if XOUT queue empty n Control flow coupled to dataflow, so XOUT won’t return without data n Queue is encapsulated from consumer But … n Note that exchange model is fine for most traditional DB Query Processing n May need to be extended for new settings…
  31. Fjords (Figure: spectrum from iterator to exchange) Thread of control per operator Queues between

    each operator Asynch or synch calls n Can do asynch poll-and-yield iteration in each operator (for both consumer and producer) n Or can do synchronous get_next iteration Can get traditional behavior if you want: n Synch polls + queue of size 1 ≈ Iterators n Synch consumer, asynch producer = Exchange n Asynch calls solve the blocking problem of Exchange [Madden/Franklin, ICDE 01]
  32. Fjords Disadvantages: n Lots of “threads” n Best done in

    an event-programming style, not OS threads n Operators really have to “come up for air” (“yield”) n Need to write your own scheduler n Harder to debug But: n Maximizes flexibility for operators at the endpoints n Still provides a fairly simple interface for operator-writers
  33. Relational Algebra Semantics Selection: σp(R) n Returns all

    rows in R that satisfy p Projection: πC(R) n Returns all rows in R projected to columns in C n In strict relational model, remove duplicate rows n In SQL, preserve duplicates (multiset semantics) Cartesian Product: R × S Union: R ∪ S Difference: R − S n Note: R, S must have matching schemata Join: R ⋈p S = σp(R × S) Missing: Grouping & Aggregation, Sorting
  34. Operator Overview: Basics Selection n Typically “free”, so “pushed down”

    n Often omitted from diagrams Projection n In SQL, typically “free”, so “pushed down” n No duplicate elimination n Always pass the minimal set of columns downstream n Typically omitted from diagrams Cartesian Product n Unavoidable nested loop to generate output Union: n Concat, or concat followed by dup. elim.
  35. Operator Overview, Cont. Unary operators: Grouping & Sorting n Grouping

    can be done with hash or sort schemes (as we’ll see) Binary matching: Joins/Intersections n Alternative algorithms: n Nested loops n Loop with index lookup (Index N.L.) n Sort-merge n Hash Join Don’t forget: have to write as iterators n Every time you get called with Next(), you adjust your state and produce an output record
  36. Unary External Hashing E.g. GROUP BY, DISTINCT Two hash functions,

    hc (coarse) and hf (fine) Two phases: n Phase 1: for each tuple of input, hash via hc into a “spill” partition to be put on disk n B-1 blocks of memory used to hold output buffers for writing a block at a time per partition (Figure: the original relation is streamed through B main-memory buffers and hashed via hc into B-1 partitions on disk.) [Bratbergsengen, VLDB 84]
  37. Unary External Hashing (Figure: each spill partition is read from disk

    and hashed via hf into an in-memory hash table for partition Ri (k < B pages), with results going to an output buffer.) n Phase 2: for each partition, read off disk and hash into a main-memory hashtable via hf n For DISTINCT, when you find a value already in the hashtable, discard the copy n For GROUP BY, associate some agg state (e.g. running SUM) with each group in the hash table, and maintain it
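A hedged sketch of the two-phase scheme, with in-memory lists standing in for the spill partitions on disk (h_c is the coarse hash, a Python dict plays h_f). Names and the toy data are invented:

```python
# Two-phase external hashing for GROUP BY / DISTINCT; partitions are
# lists here, but would be spilled to disk in a real engine.
from collections import defaultdict

def external_group_by(tuples, key, agg_init, agg_step, n_partitions=4):
    # Phase 1: hash each input tuple via h_c into a spill partition.
    partitions = [[] for _ in range(n_partitions)]
    for t in tuples:
        partitions[hash(key(t)) % n_partitions].append(t)       # h_c
    # Phase 2: per partition, build an in-memory hash table (h_f),
    # maintaining running aggregate state per group.
    for part in partitions:
        groups = defaultdict(agg_init)
        for t in part:
            groups[key(t)] = agg_step(groups[key(t)], t)
        yield from groups.items()

# SELECT dept, SUM(sal) FROM emp GROUP BY dept
rows = [("eng", 10), ("hr", 7), ("eng", 5)]
print(list(external_group_by(rows, key=lambda t: t[0],
                             agg_init=int, agg_step=lambda s, t: s + t[1])))
# [('eng', 15), ('hr', 7)]  (order depends on partitioning)
```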
  38. External Hashing: Analysis To utilize memory well in Phase 2,

    would like each partition to be ~ B blocks big n Hence works in two phases when B ≥ √|R| n Same req as external sorting! n Else can recursively partition the partitions in Phase 2 Can be made to pipeline, to adapt nicely to small data sets, etc.
  39. Hash Join (GRACE) Phase 1: partition each relation on the

    join key with hc , spilling to disk Phase 2: n build each partition of smaller relation into a hashtable via hf n scan matching partition of bigger relation, and for each tuple probe the hashtable via hf for matches Would like each partition of smaller relation to fit in memory n So works well if B ≥ √|smaller| n Size of bigger is irrelevant!! (Vs. sort-merge join) Popular optimization: Hybrid hash join n Partition #0 doesn’t spill -- it builds and probes immediately n Partitions 1 through n use rest of memory for output buffers n [DeWitt/Katz/Olken/Shapiro/Stonebraker/Wood, SIGMOD 84] [Fushimi, et al., VLDB 84]
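A hedged sketch of GRACE hash join, again with lists standing in for disk partitions and a dict playing the in-memory hash table built via h_f:

```python
# GRACE hash join: partition both inputs, then build-and-probe per partition.
def grace_hash_join(R, S, rkey, skey, n_parts=4):
    # Phase 1: partition both relations on the join key via h_c.
    rparts = [[] for _ in range(n_parts)]
    sparts = [[] for _ in range(n_parts)]
    for r in R:
        rparts[hash(rkey(r)) % n_parts].append(r)
    for s in S:
        sparts[hash(skey(s)) % n_parts].append(s)
    # Phase 2: per partition, build a hash table on the smaller side
    # (assumed here to be R), then probe with the matching S partition.
    for rp, sp in zip(rparts, sparts):
        table = {}
        for r in rp:
            table.setdefault(rkey(r), []).append(r)   # build
        for s in sp:                                  # probe
            for r in table.get(skey(s), []):
                yield r, s

R = [("a", 1), ("b", 2)]
S = [(1, "x"), (1, "y"), (3, "z")]
print(list(grace_hash_join(R, S, rkey=lambda r: r[1], skey=lambda s: s[0])))
# [(('a', 1), (1, 'x')), (('a', 1), (1, 'y'))]
```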
  40. Hash-Join (Figure: phase 1 hashes both original relations via hc through

    B main memory buffers into B-1 partitions on disk; phase 2 builds an in-memory hash table on each partition Ri of the smaller relation (k < B-1 pages) via hf, then probes it from an input buffer over the matching Si, writing join results to an output buffer.)
  41. Symmetric Hash Join Pipelining, in-core variant Build and probe symmetrically

    n Correctness: Each output tuple generated when its last-arriving component appears Can be extended to out-of-core case n Tukwila [Ives & Halevy, SIGMOD ‘99] n Xjoin: Spill and read partitions multiple times n Correctness guaranteed by timestamping tuples and partitions n [Urhan & Franklin, DEBull ‘00] [Mikkilineni & Su, TOSE 88] [Wilschut & Apers, PDIS 91]
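A hedged sketch of the in-core symmetric (pipelining) variant: build and probe on both sides as tuples arrive from either input, so each output appears as soon as its last-arriving component does. The stream encoding is invented for illustration:

```python
# Symmetric hash join over an interleaved stream of ('L', tuple) / ('R', tuple).
def symmetric_hash_join(stream, key_l, key_r):
    left, right = {}, {}
    for side, t in stream:
        if side == 'L':
            left.setdefault(key_l(t), []).append(t)    # build left table
            for m in right.get(key_l(t), []):          # probe right table
                yield t, m
        else:
            right.setdefault(key_r(t), []).append(t)   # build right table
            for m in left.get(key_r(t), []):           # probe left table
                yield m, t

arrivals = [('L', (1, 'a')), ('R', (1, 'x')), ('R', (2, 'y')), ('L', (2, 'b'))]
print(list(symmetric_hash_join(arrivals,
                               key_l=lambda t: t[0], key_r=lambda t: t[0])))
# [((1, 'a'), (1, 'x')), ((2, 'b'), (2, 'y'))]
```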
  42. SELECT [DISTINCT] <output expressions> FROM <tables> [WHERE <predicates>] [GROUP BY

    <gb-expression> [HAVING <h-predicates>]] [ORDER BY <expression>] A Basic SQL primer n Join tables in FROM clause n applying predicates in WHERE clause n If GROUP BY, partition results by GROUP n And maintain aggregate output expressions per group n Delete groups that don’t satisfy HAVING clause n If ORDER BY, sort output accordingly
  43. Examples Single-table S-F-W n DISTINCT, ORDER BY Multi-table S-F-W n

    And self-join Scalar output expressions Aggregate output expressions n With and without DISTINCT Group By Having Nested queries n Uncorrelated and correlated
  44. A Dopey Query Optimizer For each S-F-W query block n

    Create a plan that: n Forms the Cartesian product of the FROM clause n Applies the WHERE clause n Incredibly inefficient n Huge intermediate results! Then, as needed: n Apply the GROUP BY clause n Apply the HAVING clause n Apply any projections and output expressions n Apply duplicate elimination and/or ORDER BY (Plan sketch: σ predicates over × of the FROM-clause tables.)
  45. An Oracular Query Optimizer For each possible correct plan: n

    Run the plan (infinitely fast) n Measure its performance in reality Pick the best plan, and run it in reality
  46. A Standard Query Optimizer Three aspects to the problem n

    Legal plan space (transformation rules) n Cost model n Search Strategy
  47. Plan Space Many legal algebraic transformations, e.g.: n Cartesian product

    followed by selection can be rewritten as join n Join is commutative and associative n Can reorder the join tree arbitrarily n NP-hard to find best join tree in general n Selections should (usually) be “pushed down” n Projections can be “pushed down” And “physical” choices n Choice of Access Methods n Choice of Join algorithms n Taking advantage of sorted nature of some streams n Complicates Dynamic Programming, as we’ll see
  48. Cost Model & Selectivity Estimation Cost of a physical operator

    can be modeled fairly accurately: n E.g. number of random and sequential I/Os n Requires metadata about input tables: n Number of rows (cardinality) n Bytes per tuple (physical schema) In a query pipeline, metadata on intermediate tables is trickier n Cardinality? n Requires “selectivity” (COUNT) estimation n Wet-finger estimates n Histograms, joint distributions and other summaries n Sampling
  49. Search Strategy Dynamic Programming n Used in most commercial systems

    n IBM’s System R [Selinger, et al. SIGMOD 79] Top-Down n Branch and bound with memoization n Exodus, Volcano & Cascades [Graefe, SIGMOD 87, ICDE 93, DEBull 95] n Used in a few commercial systems (Microsoft SQL Server, especially) Randomized n Simulated Annealing, etc. [Ioannidis & Kang SIGMOD 90]
  50. Dynamic Programming Use principle of optimality n Any subtree of

    the optimal plan is itself optimal for its sub- expression Plans enumerated in N passes (if N relations joined): n Pass 1: Find best 1-relation plan for each relation. n Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.) n Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.) n This gives all left-deep plans. Generalization is easy… A wrinkle: physical properties (e.g. sort orders) violate principle of optimality! n Use partial-order dynamic programming n I.e. keep undominated plans at each step -- optimal for each setting of the physical properties (each “interesting order”)
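A toy of Selinger-style dynamic programming over left-deep join orders. The cost model (nested-loops cost with a uniform 1% join selectivity) and the cardinalities are invented for illustration; interesting orders are omitted:

```python
# System R-style DP: best plan per subset of relations, built bottom-up.
def best_left_deep_plan(cards, sel=0.01):
    """cards: relation name -> cardinality. Returns (cost, out_card, order)."""
    # Pass 1: best 1-relation "plan" per relation (cost = scan = cardinality).
    best = {frozenset([r]): (c, c, [r]) for r, c in cards.items()}
    for k in range(2, len(cards) + 1):
        # Pass k: extend each best (k-1)-relation plan by one inner relation.
        for subset in [s for s in best if len(s) == k - 1]:
            cost, card, order = best[subset]
            for r, rc in cards.items():
                if r in subset:
                    continue
                s2 = subset | {r}
                c2 = cost + card * rc            # toy nested-loops join cost
                if s2 not in best or c2 < best[s2][0]:
                    best[s2] = (c2, card * rc * sel, order + [r])  # optimality
    return best[frozenset(cards)]

print(best_left_deep_plan({"R": 1000, "S": 100, "T": 10}))
# (11010.0, 100.0, ['T', 'S', 'R']) -- smallest relations join first
```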
  51. Relational Architecture Review Query Parsing and Optimization Query Executor Access

    Methods Buffer Management Disk Space Management DB Lock Manager Log Manager
  52. Information Retrieval A research field traditionally separate from Databases n

    Goes back to IBM, Rand and Lockheed in the 50’s n G. Salton at Cornell in the 60’s n Lots of research since then Products traditionally separate n Originally, document management systems for libraries, government, law, etc. n Gained prominence in recent years due to web search Today: simple IR techniques n Show similarities to DBMS techniques you already know
  53. IR vs. DBMS Seem like very different beasts Under the

    hood, not as different as they might seem n But in practice, you have to choose between the 2
    IR                                    DBMS
    Imprecise semantics                   Precise semantics
    Keyword search                        SQL
    Unstructured data format              Structured data
    Read-mostly; add docs occasionally    Expect reasonable number of updates
    Page through top k results            Generate full answer
  54. IR’s “Bag of Words” Model Typical IR data model: n

    Each document is just a bag of words (“terms”) Detail 1: “Stop Words” n Certain words are considered irrelevant and not placed in the bag n e.g. “the” n e.g. HTML tags like <H1> Detail 2: “Stemming” n Using English-specific rules, convert words to their basic form n e.g. “surfing”, “surfed” --> “surf” Detail 3: we may decorate the words with other attributes n E.g. position, font info, etc. n Not exactly “bag of words” after all
  55. Boolean Text Search Find all documents that match a Boolean

    containment expression: n “Windows” AND (“Glass” OR “Door”) AND NOT “Microsoft” Note: query terms are also filtered via stemming and stop words When web search engines say “10,000 documents found”, that’s the Boolean search result size.
  56. Text “Indexes” When IR folks say “index” or “indexing” …

    n Usually mean more than what DB people mean In our terms, both “tables” and indexes n Really a logical schema (i.e. tables) n With a physical schema (i.e. indexes) n Usually not stored in a DBMS n Tables implemented as files in a file system
  57. A Simple Relational Text Index Create and populate a table

    InvertedFile(term string, docID int64) Build a B+-tree or Hash index on InvertedFile.term n May be lots of duplicate docIDs per term n Secondary index: list compression per term possible This is often called an “inverted file” or “inverted index” n Maps from words -> docs n whereas normal files map docs to the words in the doc (?!) Can now do single-word text search queries
  58. Handling Boolean Logic How to do “term1” OR “term2”? n

    Union of two docID sets How to do “term1” AND “term2”? n Intersection (ID join) of two DocID sets! How to do “term1” AND NOT “term2”? n Set subtraction n Also a join algorithm How to do “term1” OR NOT “term2” n Union of “term1” and “NOT term2”. n “Not term2” = all docs not containing term2. Yuck! n Usually forbidden at UI/parser Refinement: what order to handle terms if you have many ANDs/NOTs?
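A hedged sketch tying slides 57-58 together: an inverted file as a Python dict from term to docID set, with the Boolean connectives as the set operations (joins) just described. The toy documents are invented; stemming and stop words are skipped:

```python
# Inverted file + Boolean search as set operations over docID sets.
docs = {
    1: "glass door installed",
    2: "microsoft windows glass",
    3: "windows with a glass door",
}
inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

def matches(term):
    return inverted.get(term, set())

# "windows" AND ("glass" OR "door") AND NOT "microsoft"
result = (matches("windows") & (matches("glass") | matches("door"))) \
         - matches("microsoft")
print(result)   # {3}
```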
  59. Boolean Search in SQL (SELECT docID FROM InvertedFile WHERE word

    = “window” INTERSECT SELECT docID FROM InvertedFile WHERE word = “glass” OR word = “door”) EXCEPT SELECT docID FROM InvertedFile WHERE word=“Microsoft” ORDER BY magic_rank() Really there’s only one query (template) in IR n Single-table selects, UNION, INTERSECT, EXCEPT n Note that INTERSECT is a shorthand for equijoin on a key n Often there’s only one query plan in the system, too! magic_rank() is the “secret sauce” in the search engines “Windows” AND (“Glass” OR “Door”) AND NOT “Microsoft”
  60. Fancier: Phrases and “Near” Suppose you want a phrase n

    E.g. “Happy Days” Add a position attribute to the schema: n InvertedFile (term string, docID int64, position int) n Index on term Enhance join condition in query n Can’t use INTERSECT syntax, but query is nearly the same SELECT I1.docID FROM InvertedFile I1, InvertedFile I2 WHERE I1.word = “HAPPY” AND I2.word = “DAYS” AND I1.docID = I2.docID AND I2.position - I1.position = 1 ORDER BY magic_rank() Can relax to “term1” NEAR “term2” n Positions within k of each other
  61. Classical Document Ranking TF × IDF (Term Freq. × Inverse

    Doc Freq.) n For each term t in the query n QueryTermRank = (#occurrences of t in q) [TF] × log((total #docs)/(#docs with this term)) [IDF] × normalization-factor n For each doc d in the boolean result n DocTermRank = (#occurrences of t in d) [TF] × log((total #docs)/(#docs with this term)) [IDF] × normalization-factor n Rank += DocTermRank × QueryTermRank Requires more to our schema n InvertedFile (term string, docID int64, position int, DocTermRank float) n TermInfo(term string, numDocs int) n Can compress DocTermRank non-relationally This basically works fine for raw text n There are other schemes, but this is the standard
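A minimal sketch of the formula above on toy data (normalization factor omitted; counts and document statistics invented):

```python
# Classical TF x IDF ranking: Rank += DocTermRank * QueryTermRank per term.
import math

def rank(query_terms, doc_terms, total_docs, docs_with):
    score = 0.0
    for t in set(query_terms):
        idf = math.log(total_docs / docs_with[t])
        q_rank = query_terms.count(t) * idf      # QueryTermRank (TF x IDF)
        d_rank = doc_terms.count(t) * idf        # DocTermRank  (TF x IDF)
        score += d_rank * q_rank
    return score

q = ["glass", "door"]
d = ["glass", "door", "glass", "installed"]
print(rank(q, d, total_docs=100, docs_with={"glass": 10, "door": 5}))
```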
  62. Some Additional Ranking Tricks Phrases/Proximity n Ranking function can incorporate

    position Query expansion, suggestions n Can keep a similarity matrix on terms, and expand/modify people’s queries Document expansion n Can add terms to a doc n E.g. in “anchor text” of refs to the doc Not all occurrences are created equal n Mess with DocTermRank based on: n Fonts, position in doc (title, etc.)
  63. Hypertext Ranking Also factor in graph structure n Social Network

    Theory (Citation Analysis) n “Hubs and Authorities” (Clever), “PageRank” (Google) n Intuition: recursively weighted in-degrees, out- degrees n Math: eigenvector computation PageRank sure seems to help n Though word on the street is that other factors matter as much n Anchor text, title/bold text, etc.
  64. Updates and Text Search Text search engines are designed to

    be query-mostly n Deletes and modifications are rare n Can postpone updates (nobody notices, no transactions!) n Updates done in batch (rebuild the index) n Can’t afford to go offline for an update? n Create a 2nd index on a separate machine n Replace the 1st index with the 2nd n Can do this incrementally with a level of indirection n So no concurrency control problems n Can compress to search-friendly, update-unfriendly format For these reasons, text search engines and DBMSs are usually separate products n Also, text-search engines tune that one SQL query to death! n The benefits of a special-case workload.
  65. Architectural Comparison (Figure: side-by-side stacks) Search Engine:

    “The Query” -> Search String Modifier -> Ranking Algorithm -> { The Access Method -> Buffer Management -> Disk Space Management -> DB/OS } (the braced layers amount to a simple DBMS). DBMS: Query Optimization and Execution -> Relational Operators -> Files and Access Methods -> Buffer Management -> Disk Space Management -> DB, with concurrency and recovery needed.
  66. Revisiting Our IR/DBMS Distinctions Data Modeling & Query Complexity n

    DBMS supports any schema & queries n Requires you to define schema n Complex query language (hard for folks to learn) n Multiple applications at output n RowSet API (cursors) n IR supports only one schema & query n No schema design required (unstructured text) n Trivial query language n Single application behavior n Page through output in rank order, ignore most of output Storage Semantics n DBMS: online, transactional storage n IR: batch, unreliable storage
  67. Roots Distributed QP vs. Parallel QP n Distributed QP envisioned

    as a k-node intranet for k ~= 10 n Sound old-fashioned? Think of multiple hosting sites (e.g. one per continent) n Parallel QP grew out of DB Machine research n All in one room, one administrator Parallel DBMS architecture options n Shared-Nothing n Shared-Everything n Shared-Disk n Shared-nothing is most general, most scalable
  68. Distributed QP: Semi-Joins Main query processing issue in distributed DB

    lit: use semi-joins n R ⋉ S = πR(R ⋈ S) n Observe that R ⋈ S ≡ (R ⋉ π(S)) ⋈ S Assume each table lives at one site, R is bigger. To reduce communication: n Ship S’s join columns to R’s site, do semijoin there, ship result to S’s site for the join Notes n I’m sloppy about dups in my def’ns above n Semi-joins aren’t always a win n Extra cost estimation task for a distributed optimizer [Bernstein/Goodman ‘79]
  69. Bloom Joins A constant optimization on semi-joins n Idea: (R ⋉

    π(S)) ⋈ S is redundant n Semi-join can safely return “false hits” from R n Rather than shipping π(S), ship a superset n A particular kind of lossy set compression allowed Bloom Filter (B. Bloom, 1970) n Hash each value in a set via k independent hash functions onto an array of n bits n Check membership correspondingly n By tuning k and n, can control false hit rate n Rediscovered recently in web lit, some new wrinkles (Mitzenmacher’s compressed B.F.’s, Rhea’s attenuated B.F.’s) [Babb, TODS 79]
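A hedged sketch of a Bloom filter and its use in a Bloom join: ship the filter over S's join keys to R's site instead of π(S). The salted-SHA1 hashing scheme is just one illustrative way to get k "independent" hashes:

```python
# Bloom filter: k hashes onto n bits; false positives possible, false
# negatives impossible -- exactly the "safe false hits" semi-joins allow.
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)

    def _hashes(self, value):
        for i in range(self.k):   # k hash functions via salting
            h = hashlib.sha1(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, value):
        for pos in self._hashes(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._hashes(value))

bf = BloomFilter()
for key in [1, 5, 9]:              # S's join column, summarized at S's site
    bf.add(key)
R = [(k, f"r{k}") for k in range(10)]
semi = [r for r in R if bf.might_contain(r[0])]   # may include false hits
print(semi)   # a superset of the true semi-join: keys 1, 5, 9 (+ rare extras)
```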
  70. Sideways Information Passing These ideas generalize more broadly n E.g.

    “magic sets” rewriting in datalog & SQL n Tricky to do optimally in those settings, but wins can be very big (Figure: a less costly set generator feeds a COMBINER, which restricts the “set o’ stuff” produced by a costly set generator.)
  71. Parallelism 101 Pipelined vs. Partitioned n Pipelined typically inter-operator n

    Nominal benefits in a dataflow n Partition typically intra-operator n E.g. hash join or sort using k nodes Speedup & Scaleup n Speedup: x=old_time/new_time n Ideal: linear n Scaleup: small_sys_elapsed_small_problem / big_sys_elapsed_big_problem n Ideal: 1 n Transaction scaleup: N times as many TPC-C’s for N machines n Batch scaleup: N times as big a DB for a query on N machines [See DeWitt & Gray, CACM 92]
  72. Impediments to Good Parallelism Startup overheads n Amortized for big

    queries Interference n usually the result of unpredictable communication delays (comm cost, empty pipelines) Skew Of these, skew is the real issue in DBs n “Embarrassingly parallel” n I.e. it works
  73. Data Layout Horizontal Partitioning n For each table, assign rows

    to machines by some key n Or assign arbitrarily (round-robin) Vertical Partitioning n Sort table, and slice off columns n Usually not a parallelism trick n But nice for processing queries on read-mostly data (projection is free!)
  74. Intra-Operator Parallelism E.g. for Hash Join n Every site with

    a horizontal partition of either R or S fires off a scan thread n Every storage site reroutes its data among join nodes based on hash of the join column(s) n Upon receipt, each site does local hash join
  75. Skew Handling Skew happens n Even when hashing? Yep. Can

    pre-sample and/or pre-summarize data to partition better Solving skew on the fly is harder n Need to migrate accumulated dataflow state n FLuX: Fault-Tolerant, Load-balancing eXchange
  76. In Current Architectures All DBMSs can run on shared memory,

    many on shared-nothing n The high end belongs to clusters The biggest web-search engines run on clusters (Google, Inktomi) n And use pretty textbook DB stuff for Boolean search n Fun tradeoffs between answer quality and availability/management here (the Inktomi story)
  77. Views and Materialization A view is a logical table n

    A query with a name n In general, not updatable If to be used often, could be materialized n Pre-compute and/or cache result Could even choose to do this for common query sub-expressions n Needn’t require a DBA to say “this is a view”
  78. Challenges in Materialized Views Three main issues: n Given a

    workload, which views should be materialized n Given a query, how can mat-views be incorporated into query optimizer n As base tables are updated, how can views be incrementally maintained? See readings book, Gupta & Mumick
  79. Precomputation in IR Often want to save results of common

    queries n E.g. no point re-running “Britney Spears” or “Harry Potter” as Boolean queries Can also use as subquery results n E.g. the query “Harry Potter Loves Britney Spears” can use the “Harry Potter” and “Britney Spears” results Constrained version of mat-views n No surprise -- constrained relational workload n And consistency of matview with raw tables is not critical, so maintenance not such an issue.
  80. Precomputed Aggregates Aggregation queries work on numerical sets of data

    n Math tricks apply here n Some trivial, some fancy n Theme: replace the raw data with small statistical summaries, get approximate results n Histograms, wavelets, samples, dependency-based models, random projections, etc. n Heavily used in query optimizers for selectivity estimation (a COUNT aggregate) n Spate of recent work on approximate query processing for AVG, SUM, etc. [Garofalakis, Gehrke, Rastogi tutorial, SIGMOD ‘02]
  81. Query Dataflows Meet NWs Some more presentation n Indirection in

    space and time n Thematic similarities, differences in NW/QP n Revisit a NW “classic” through a DB lens n Adaptive QP In Telegraph n Eddies, Stems, FluX n A taste of QP in Sensor Networks n TinyDB (TAG), Directed Diffusion n A taste of QP in p2p n PIER project at Berkeley: parallel QP over DHTs Presentations from MITers? Open Discussion
  82. Query Processing and Networking Infrastructures Day 2 of 2 Joe

    Hellerstein UC Berkeley September 27, 2002
  83. Outline Day 1: Query Processing Crash Course n Intro n

    Queries as indirection n How do relational databases run queries? n How do search engines run queries? n Scaling up: cluster parallelism and distribution Day 2: Research Synergies w/Networking n Queries as indirection, revisited n Useful (?) analogies to networking research n Some of our recent research at the seams n Some of your research? n Directions and collective discussion
  84. Standard: Spatial Indirection Allows referent to move without changes to

    referrers n Doesn’t matter where the object is, we find it. Alternative: copying n Works if updates are managed carefully, or don’t exist
  85. Temporal Indirection Asynchronous communication is indirection in time n Doesn’t

    matter when the object arrives, you find it Analogy to space n Sender ≡ referrer n Recipient ≡ referent
  86. Generalizing Indirection in Space n x-to-one or x-to-many? n Physical

    or Logical mapping? Indirection in Time n Persistence model: storage or re-xmission n Persistence role: sender or receiver
  87. Indirection in Space, Redux One-to-one, one-to-many, many-to-many? n Standard relational

    issue n E.g. virtual address is many-to-one n E.g. email distribution list is one-to-many Physical or logical n Mapping table? n E.g. page tables, mailing list, DNS, multicast group lists n Logical n E.g. queries, subscriptions, interests
  88. Indirection in Time, Redux Persistence model: storage or re-xmission n

    Storage: e.g. DB, heap, stack, NW buffer, mailqueue n Re-xmission: e.g. polling, retries. n “Joe is so persistent” Persistence of put or get n Put: e.g. DB insert, email, retry n Get: e.g. subscription, polling
  89. Examples: Storage Systems Virtual Memory System n Space: 1-to-1, physical

    n Time: synchronous (no indirection) Database System n Space: many-to-many, logical n Time: synchronous (no indirection) Broadcast Disks n Space: 1-to-1 n Time: re-xmitted put
  90. Examples: Split-Phase APIs Polling n Space: no indirection n Time:

    re-xmitted get Callbacks n Space: no indirection n Time: stored get Active Messages n Space: no indirection n Time: stored get n App stores a get with putter, which tags it on messages
  91. Examples: Communication Email n Space: One-to-many, physical n Mapping is

    one-to-many, delivery is one-to-one (copies) n Time: stored put Multicast n Space: One-to-many, physical n Both mapping and delivery are one-to-many n Time: roughly synchronous?
  92. Examples: Distributed APIs RPC n Space: 1-to-1, physical n Can

    be 1-to-many n Time: synchronous (no indirection) Messaging systems n Space: 1-to-1, physical n Often 1-to-many n Time: depends! n Transactional messaging is stored put n Exactly-once transmission guaranteed n Other schemes are re-xmitted put n At least once transmission. Idempotency of message becomes important!
  93. Examples: Logic-based APIs Publish-Subscribe n Space: one-to-many, logical n Time:

    stored receiver Tuplespaces n Space: one-to-many, logical n Time: stored sender
  94. Indirection Summary 2 binary indirection variables for space, 2 for

    time Can have indirection in one without the other Leads to 24 indirection options n 16 joint space/time indirections, 4 space-only, 4 time-only n And few lessons about the tradeoffs! n Note: issues here in performance and SW engineering and … n E.g. “Are tuplespaces better than pub/sub?” n Not a unidimensional question!
  95. Rendezvous Indirection on both sender and receiver side n In

    time and/or space on each side n Most general: neither sender nor receiver know where or when rendezvous will happen! n Each chases a reference for where n Each must persist for when
  96. Join as Rendezvous Recall pipelining hash join n Combine all

    blue and gray tuples that match A batch rendezvous n In space: the data items were not stored in a fixed location, copied into HT n In time: both sides do put-persist in the join algorithm via storage A hint of things to come: n In parallel DBs, the hash table is content-addressed (via the exchange routing function) n What if hash table is distributed? n If a tuple in the join is doing “get”, then is there a distinction between sender/recipient? Between query and data?
  97. Some resonances We said that query systems are an indirection

    mechanism. n Logical, many-to-many, but synchronous n Query-response And some dataflow techniques inside query engines seem to provide useful indirection mechanisms If we add a network into the picture, life gets very interesting n Indirection in space very useful n Indirection in time is critical n Rendezvous is a basic operation
  98. More Interaction: CS262 Experiment w/ Eric Brewer Merge OS &

    DBMS grad class, over a year Eric/Joe, point/counterpoint Some tie-ins were obvious: n memory mgmt, storage, scheduling, concurrency Surprising: QP and networks go well side by side n E.g. eddies and TCP Congestion Control n Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment
  99. Scout (Figure 3 from the paper: example router graph) Paths are the key

    to a comm-centric OS n “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96.
  100. CLICK A NW router is a query plan! n With

    a twist: flow-based context n An opportunity for “autonomous” query optimization
  101. Clark & Tennenhouse, SIGCOMM ‘90 Architectural Considerations for a New

    Generation of Protocols Love it for two reasons n Tries to capture the essence of what networks do n Great for people who need the 10,000-foot view! n I’m a fan of doing this (witness last week) n Tries to move the community up the food chain n Resonances everywhere!!
  102. C&T Overview (for amateurs like me) Core function of protocols:

    data xfer n Data Manipulation n buffer, checksum, encryption, xfer to/from app space, presentation n Transfer Control n flow/congestion ctl, detecting transmission problems, acks, muxing, timestamps, framing
  103. Exchange! Data Modeling! Query Opt! Thesis: nets are good at

    xfer control, not so good at data manipulation Some C&T wacky ideas for better data manipulation n Xfer semantic units, not packets (ALF) n Auto-rewrite layers to flatten them (ILP) n Minimize cross-layer ordering constraints n Control delivery in parallel via packet content C & T’s Wacky Ideas
  104. DB People Should Be Experts! BUT… remember: n Basic Internet

    assumption: “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson) Spoils the whole optimize-then-execute architecture of query optimization n What happens when denvironment/dt < query length?? n What about the competing conversations? n How do we handle the unknown topology? n What about partial failure? Ideally, we’d like: n the semantics and optimization of DB dataflow n with the agility and efficiency of NW dataflow
  105. The Cosmic Convergence NETWORKING RESEARCH XML Routing Router Toolkits Content

    Addressing and DHTs Directed Diffusion Adaptivity, Federated Control, GeoScalability DATABASE RESEARCH Adaptive Query Processing Continuous Queries, Streams P2P Query Engines Sensor Query Engines Data Models, Query Opt, DataScalability
  106. What does the QP perspective add? In terms of high-level

    languages? In terms of a reusable set of operators? In terms of optimization opportunities? In terms of batch-I/O tricks? In terms of approximate answers? A “safe” route to Active Networks? n Not computationally complete n Optimizable and reconfigurable -- data independence applies Fun to be had here! n Addressing a few fronts at Berkeley…
  107. Some of our work at the seams Starting with centralized

    engine for remote data sets and streams n Telegraph: eddies, SteMs, FLuX n “Deep Web”, filesharing systems, sensor streams More recently, querying sensor networks n TinyDB/TAG: in-network queries And DHT-based overlay networks n PIER
  108. Telegraph: An Adaptive Dataflow System Themes: Adaptivity and Sharing n

    Adaptivity encapsulated in operators n Eddies for order of operations n State Modules (SteMs) for transient state n FLuX for parallel load-balance and availability n Work- and state-sharing across flows n Unlike traditional relational schemes, try to share physical structures Franklin, Hellerstein, Hong and students (to follow)
  109. Telegraph Architecture (Figure: request parsing and metadata/catalog accept

    SQL and explicit dataflows; adaptive routing and optimization via Eddy, FLuX, and Juggle; online query processing; operator modules including Join, Select, Project, Group/Aggregate, Transitive Closure, DupElim, SteM, and TeSS; ingress via File Reader, Sensor Proxy, and P2P Proxy; inter-module comm and scheduling via Fjords.)
  110. Continuous Adaptivity: Eddies A little more state per tuple n

    Ready/done bits (extensible a la Volcano/Starburst) n Minimal state in Eddy itself n Queue + parameters being learned n Decisions: which tuple in queue to which operator Query processing = dataflow routing!! Ron Avnur
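A highly simplified sketch of eddy-style routing over commutative filters, loosely inspired by the lottery ("tickets") scheme from the eddies work; the reward rule and all names are invented simplifications:

```python
# Toy eddy: route each tuple through not-yet-done operators, learning to
# prefer selective ones. Done bits are tracked per tuple, per operator.
import random

def eddy(tuples, ops):
    tickets = [1] * len(ops)                 # learned routing weights
    for t in tuples:
        done = set()
        alive = True
        while alive and len(done) < len(ops):
            choices = [i for i in range(len(ops)) if i not in done]
            i = random.choices(choices, [tickets[c] for c in choices])[0]
            if ops[i](t):
                done.add(i)                  # set this operator's done bit
            else:
                tickets[i] += 1              # reward the selective operator
                alive = False                # tuple rejected: drop it
        if alive:
            yield t                          # all done bits set: output

preds = [lambda x: x % 2 == 0, lambda x: x > 3]
print(list(eddy(range(10), preds)))   # [4, 6, 8]
```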
  111. Two Key Observations Break the set-oriented boundary n Usual DB

    model: algebra expressions: (R ⋈ S) ⋈ T n Common DB implementation: pipelining operators! n Subexpressions needn’t be materialized n Typical implementation is more flexible than algebra n We can reorder in-flight operators Don’t rewrite graph. Impose a router n Graph edge = absence of routing constraint n Observe operator consumption/production rates n Consumption: cost. Production: cost*selectivity n Could break these down per values of tuples So fun! n Simple, incremental, general n Brings all of query optimization online n And hence a bridge to ML, Control Theory, Queuing Theory
  112. State Modules (SteMs) Goal: Further adaptivity through competition n Multiple

    mirrored sources (AMs) n Handle rate changes, failures, parallelism n Multiple alternate operators n Join = Routing + State n SteM operator manages tradeoffs n State Module, unifies caches, rendezvous buffers, join state n Competitive sources/operators share building/probing SteMs n Join algorithm hybridization! Eddies + SteMs tackle the full (single- site) query optimization problem online Vijayshankar Raman, Amol Deshpande (Figure: spectrum from static dataflows to eddy to eddy + SteMs.)
  113. FLuX: Routing Across Cluster Fault-tolerant, Load-balancing eXchange Continuous/long-running flows need

    high availability Big flows need parallelism n Adaptive Load-Balancing req’d FLuX operator: Exchange plus… n Adaptive flow partitioning (River) n Transient state replication & migration n Replication & checkpointing for SteMs n Note: set-based, not sequence-based! n Needs to be extensible to different ops: n Content-sensitivity n History-sensitivity n Dataflow semantics n Optimize based on edge semantics n Networking tie-in again: n At-least-once delivery? n Exactly-once delivery? n In/Out of order? Mehul Shah
  114. Continuously Adaptive Continuous Queries (CACQ) Continuous Queries clearly need all

    this stuff! n Natural application of Telegraph infrastructure 4 Ideas in CACQ: n Use eddies to allow reordering of ops. n But one eddy will serve for all queries n Queries are data: join with Grouped Filter n A la stored get! n This idea extended in PSoup (Chandrasekaran & Franklin) n Explicit tuple lineage n Mark each tuple with per-op ready/done bits n Mark each tuple with per-query completed bits n Joins via SteMs, shared across all queries n Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions! n Delete a tuple from flow only if it matches no query Sam Madden, Mehul Shah, Vijayshankar Raman, Sirish Chandrasekaran
  115. Smart Dust Motes TinyOS Palm Devices Linux Wireless Sensor Networks

    A spectrum of devices n Varying degrees of power and network constraints n Fun is on the small side! Our current platform: Mica and TinyOS n 4 MHz Atmel CPU, 4KB RAM, 40 kbit radio, 512K EEPROM, 128K Flash n Sensors: temp, light, accelerometer, magnetometer, mic, etc. n Wireless, single-ported, multi-hop ad-hoc network n Spanning-tree communication through “root”
  116. TinyDB A query/trigger engine for motes Declarative (SQL-like) language for

    optimizability n Data independence arguments in spades here! n Non-programmers can deal with it Lots of challenges at the seams of queries and routing n Query plans over dynamic multi-hop network n With power and bandwidth consumption as key metrics Sam Madden (w/Hellerstein, Hong, Franklin)
  117. Query Focus: Hierarchical Aggregation Aggregation natural in sensornets n The

    “big picture” typically interesting n Aggregation can smooth noise and loss n E.g. signal processing aggs like wavelets n Provides data reduction Power/Network Reduction: in-network aggregation n Hierarchical version of parallel aggregation n Tricky design space n power vs. quality n topology-selection n value-based routing n dynamic environment requires adaptivity
  118. TinyDB Sample Apps Habitat Monitoring: what is the average humidity

    in the populated petrel burrows on Great Duck Island right now? Smart Office: find me the conference rooms that have been reserved but unoccupied for 5 minutes. Home Automation: lower blinds when light intensity is above a threshold.
  119. Performance in SensorNets Power consumption n Communication >> Computation n

    METRIC: radio wake time n Send > Receive n METRIC: messages generated n “Run for 5 years” vs. “Burn power for critical events” vs. “Run my experiment” Bandwidth Constraints n Internal >> External n Volume >> surface area Result Quality n Noisy sensors n Discrete sampling of continuous phenomena n Lossy communication channel
  120. TinyDB SQL-like language for specifying continuous queries and triggers n

    Schema management, etc. Proxy on desktop, small query engine per mote n Plug and play (query snooping) n To keep the engine “tiny”, use an eddy-style arch n One explicit copy of each iterator’s code image Adaptive dataflow in network Alpha available for download on SourceForge
  121. Some of the Optimization Issues Extensible Aggregation API: n Init(),

    Iter(), SplitFlow(), Close() n Properties n Amount of intermediate state n Duplicate sensitivity n Monotonicity n Exemplary vs. Summary Hypothesis Testing Snooping and Suppression Compression, Presumption, Interpolation Generally, QP and NW issues intertwine!
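A hedged sketch of an extensible in-network aggregate in the spirit of the Init/Iter/SplitFlow/Close API named above, shown for AVERAGE: partial state is a (sum, count) pair that parents merge hop-by-hop up the routing tree. The method names and merge-based mapping are illustrative, not TinyDB's actual signatures:

```python
# In-network AVERAGE: small partial states merged up the spanning tree.
class Average:
    def init(self, value):
        """Leaf: make a partial state from one sensor reading."""
        return (value, 1)

    def merge(self, a, b):
        """Combine two children's partial states (stands in for Iter())."""
        return (a[0] + b[0], a[1] + b[1])

    def close(self, state):
        """Root: evaluate the final answer from the merged state."""
        s, c = state
        return s / c

agg = Average()
leaf_states = [agg.init(v) for v in [10, 20, 30, 60]]   # readings at leaves
state = leaf_states[0]
for s in leaf_states[1:]:
    state = agg.merge(state, s)    # performed hop-by-hop in the tree
print(agg.close(state))            # 30.0
```

Note how only a constant-size (sum, count) pair crosses each radio hop, rather than all readings; this is the data-reduction and power win the slide describes.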
  122. Querying the Internet As opposed to querying over the Internet

    Have to deal with Internet realities n Scale, dynamics, federated admin, partial failure, etc. n Standard distributed DBs won’t work Applications n Start with real-time, distributed network monitoring n Traffic monitoring, intrusion/spam detection, software deployment detection (e.g. via TBIT), etc. n Use PIER’s SQL as a workload generator for networks? n Virtual “tables” determine load produced by each site n “Queries” become a way of specifying site-to-site communication n Move to infect the network more deeply? n E.g. Indirection schemes like i3, rendezvous mechanisms, etc. n Overlays only?
  123. And p2p QP, Obviously Gnutella done right n And it’s

    so easy! :-) Crawler-free web search Bring WYGIWIGY queries to the people n Ranking, recommenders, etc. Got to be more fun here n If p2p takes off in a big way, queries have to be a big piece Why p2p DB, anyway? n No good reason I can think of! :-) n Focus on the grassroots nature of p2p n Schema integration and transactions and … ?? n No! Work with what you got! Query the data that’s out there n Nothing complicated for users will fly n Avoid the “DB” word: P2P QP, not P2P DB
  124. Approach: Leverage DHTs “Distributed Hash Tables” n Family of distributed

    content-routing schemes n CAN, CHORD, Pastry, Tapestry, etc. n Internet scale “hash table” n A la wide-area, adaptive Exchange routing table n With some notion of storage Leverage DHTs aggressively n As distributed indexes on stored data n As state modules for query processing n E.g. use DHTs as the hash tables in a hash join n As rendezvous points for exchanging info n E.g. Bloom Filters
  125. PIER: P2p Information Exchange and Retrieval Relational-style query executor n

    With front-ends for SQL and catalogs n Standard and continuous queries n With access to DHT APIs n Currently CAN and Chord, working on Tapestry n Common DHT API would help Currently simulating queries running on 10’s of thousands of nodes n Look ma, it scales! n Widest-scale relational engine ever, looks feasible n Most of the simulator code will live on in implementation n On Millennium and PlanetLab this fall/winter Ryan Huebsch and Boon Thau Loo (w/Hellerstein, Shenker, Stoica)
  126. PIER Challenges How does this batch workload stress DHTs? How

    does republishing of soft-state interact with dataflow? n And semantics of query answers Materialization/precomputation/caching n Physical tuning meets SteMs meets materialized views How to do query optimization in this context n Distributed eddies! Partial failure a reality n At storage nodes, query execution nodes? n Impact on results, mitigation What about aggregation? n Similarities/difference with TAG? n With Astrolabe [Birman et al]? The “usual” CQ and data stream query issues, distributed n Analogous to work in Telegraph, and at Brown, Wisconsin, Stanford…
  127. All together now? I thought about changing the names: n

    Telegraph*, Teletiny…? n The group didn’t like the branding n Teletubby! Seriously: integration? n It’s a plausible need n Sensor data + map data + historical sensor logs + … n Filesharing + Web n We have done both of these cheesily n But fun questions of doing it right n E.g. pushing predicates and data into sensor net or not?
  128. Database Texts Undergrad textbooks n Ramakrishnan & Gehrke, Database Management

    Systems n Silberschatz, Korth, Sudarshan, Database System Concepts n Garcia-Molina, Ullman, Widom, Database Systems - The Complete Book n O’Neil & O’Neil, DATABASE Principles, Programming, and Performance n Abiteboul, Hull, Vianu, Foundations of Databases Graduate texts n Stonebraker & Hellerstein, Readings in Database Systems (a.k.a “The Red Book”) n Brewer & Hellerstein: Readings book (e-book?) in progress. Fall 2003?
  129. Research Links DB group at Berkeley: db.cs.berkeley.edu GiST: gist.cs.berkeley.edu Telegraph:

    telegraph.cs.berkeley.edu TinyDB: telegraph.cs.berkeley.edu/tinydb berkeley.intel-research.net/tinydb Red Book: redbook.cs.berkeley.edu