Papers We Love BOS: Streaming Queries over Streaming Data

pbailis
April 21, 2016


Transcript

  1. Who am I?
    Berkeley PhD ‘15, spent most of this year at MIT
    T-minus 11 days until tenure-track at Stanford
    » All about systems for next-gen data applications
    » Current hotness: MacroBase (automatic, large-scale analytic monitoring for IoT)
    http://bailis.org/
    http://futuredata.stanford.edu/
  2. Why PSoup?
    Streams before streams were hot
    Interesting / cute idea
    Clean implementation
    Look to the past to learn for the future
  3. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
    Basic idea: continuously processing dataflow D(A)G
  4. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries
  5. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
    First approximation: today’s OSS stream processors are kind of like Hadoop in 2009
    Low-level API // not much Hive / Spark SQL
    Not much automated query processing
    Not always as fast as they could be
    But they’re trying!
  6. Streaming is so 2000
    System & Project   School                 Focus
    STREAM             Stanford               Semantics
    TelegraphCQ        Berkeley               Adaptivity
    Aurora             Brown, Brandeis, MIT   Shared execution
  7. SELECT dst, COUNT(*), wtime(*) AS c
     FROM network.tcpdump AS st
       [RANGE BY '5 seconds' SLIDE BY '1 sec' START AT '2003-06-06']
     GROUP BY dst;

     SELECT R.i, R.j, count(*)
     FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'],
          S [RANGE BY '3 sec' SLIDE BY '4 sec']
     WHERE R.k = S.k
     GROUP BY R.i, R.j
     HAVING R.j > C;
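A rough illustration of what the first query's window clause asks for: a count per `dst` over the last 5 seconds, re-evaluated every second. This is a minimal Python sketch, not TelegraphCQ's actual semantics. The function name `sliding_counts`, the half-open interval `(t - range, t]`, and the sample events are all assumptions made for the example.

```python
from collections import Counter

def sliding_counts(events, range_s, slide_s, start, end):
    """Count events per dst in each window (t - range_s, t], for t = start, start + slide_s, ...

    events: list of (timestamp_seconds, dst) pairs (hypothetical input format).
    Returns a list of (window_end, {dst: count}) pairs.
    """
    results = []
    t = start
    while t <= end:
        lo = t - range_s
        # Naive re-scan per window; a real engine would maintain counts incrementally.
        counts = Counter(dst for ts, dst in events if lo < ts <= t)
        results.append((t, dict(counts)))
        t += slide_s
    return results

events = [(1, "a"), (2, "b"), (3, "a"), (7, "a")]
windows = sliding_counts(events, range_s=5, slide_s=1, start=5, end=8)
for window_end, counts in windows:
    print(window_end, counts)
```

Each event can show up in several consecutive windows because the range (5 s) is larger than the slide (1 s).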
  8. Lots to learn…
    Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries
    Literature has answers!!! They may not be right…
    No huge exits yet:
      TelegraphCQ -> Truviso -> Cisco (2012)
      Aurora -> StreamBase -> TIBCO (2013)
    …but we can learn!
  9. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    This paper (PSoup): changing data, changing queries
  10. PSoup Motivation
    » What if users register new queries on demand? May not know in advance, need back-fill
    » What if we don’t want to pre-compute all query results? May be rare / infrequent
    » It’s a cool idea… (conceived over beers?)
    Semantics: SELECT-FROM-WHERE… BEGIN-END
  11. PSoup Processing
    Key Ideas:
    Treat data and queries symmetrically
    Build predicate index over queries
    Materialize result structure for rendering actual results
    Time: system assigned (why can this be problematic?)
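The symmetry between data and queries can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the paper's implementation. The class `MiniPSoup`, its method names, and the single-attribute range predicates are hypothetical. A linear scan stands in for the predicate index, and a logical counter stands in for system-assigned time.

```python
class MiniPSoup:
    """Toy model: new data probes stored queries, new queries probe stored data."""

    def __init__(self):
        self.data = []      # list of (timestamp, row) pairs
        self.queries = {}   # query id -> (attr, lo, hi) range predicate
        self.results = {}   # query id -> materialized list of (timestamp, row) matches
        self.clock = 0      # system-assigned logical time (an assumption of this sketch)

    @staticmethod
    def _matches(query, row):
        attr, lo, hi = query
        return lo <= row[attr] <= hi

    def insert_data(self, row):
        self.clock += 1
        ts = self.clock
        self.data.append((ts, row))
        # New data probes the stored queries (linear scan instead of a predicate index).
        for qid, query in self.queries.items():
            if self._matches(query, row):
                self.results[qid].append((ts, row))
        return ts

    def register_query(self, qid, attr, lo, hi):
        query = (attr, lo, hi)
        self.queries[qid] = query
        # New query probes the stored data: this is the back-fill step.
        self.results[qid] = [(ts, row) for ts, row in self.data
                             if self._matches(query, row)]

    def invoke(self, qid, begin, end):
        """Render results for the BEGIN-END window from the materialized matches."""
        return [row for ts, row in self.results[qid] if begin <= ts <= end]


p = MiniPSoup()
p.insert_data({"temp": 10})             # timestamp 1
p.insert_data({"temp": 30})             # timestamp 2
p.register_query("hot", "temp", 25, 100)
p.insert_data({"temp": 40})             # timestamp 3
print(p.invoke("hot", 1, 3))
print(p.invoke("hot", 3, 3))
```

The query registered after timestamp 2 still sees the earlier matching row, which is the back-fill behaviour the motivation slide asks for.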
  12. Fancy terminology, simple ideas:
    STeM: holds intermediate state (e.g., for a join)
    Can have Query STeMs, Data STeMs…
    Eddy: routes tuples between STeMs
    Works for Joins, too
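One way to picture STeMs and an eddy is as a symmetric hash join: each arriving tuple is built into its own side's state and then probes the other side's. The Python sketch below makes that concrete under loud assumptions. The class `STeM`, the function `route`, and the join key `k` are illustrative names. A real eddy chooses routes adaptively, whereas this one is hard-wired.

```python
from collections import defaultdict

class STeM:
    """Holds intermediate state for one input, indexed by join key."""

    def __init__(self):
        self.index = defaultdict(list)

    def build(self, key, tup):
        self.index[key].append(tup)

    def probe(self, key):
        # .get avoids creating empty entries for keys that were never built.
        return list(self.index.get(key, []))


def route(source, tup, stems, key="k"):
    """Toy stand-in for an eddy: build into the source STeM, probe the other one."""
    other = "S" if source == "R" else "R"
    stems[source].build(tup[key], tup)
    matches = stems[other].probe(tup[key])
    # Always emit pairs as (R tuple, S tuple), whichever side arrived last.
    return [(tup, m) if source == "R" else (m, tup) for m in matches]


stems = {"R": STeM(), "S": STeM()}
out = []
out += route("R", {"k": 1, "i": "r1"}, stems)   # no S tuples yet, no output
out += route("S", {"k": 1, "v": "s1"}, stems)   # joins with r1
out += route("R", {"k": 1, "i": "r2"}, stems)   # joins with s1
out += route("S", {"k": 2, "v": "s2"}, stems)   # no R tuple with k = 2
print(out)
```

If one side held queries and the other held data, the same build-then-probe pattern would give the Query STeM / Data STeM arrangement the slide mentions.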
  13. What about aggregates?
    Can compute aggregates over SPJ queries
    Allow overlapping queries to traverse shared tree…
  14. Garbage Collection?
    PSoup is main-memory only
    Can be expensive to re-execute arbitrarily in past
    Solution (not in this paper): punctuations!
    “All tuples from time T or less have arrived…”
    PSoup: need to punctuate queries, too…
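To make the punctuation idea concrete, here is a small Python sketch of a buffer that uses two watermarks. It is an illustration under assumptions, not something described in the paper. The class `PunctuatedBuffer` and its methods are hypothetical, and "punctuating queries" is interpreted here as a promise that no future query will read tuples older than some time T.

```python
class PunctuatedBuffer:
    """Toy in-memory buffer that garbage-collects using punctuations."""

    def __init__(self):
        self.rows = []               # list of (timestamp, row) pairs
        self.data_watermark = None   # all tuples with ts <= this have arrived
        self.query_watermark = None  # no future query will read ts < this (assumed meaning)

    def add(self, ts, row):
        self.rows.append((ts, row))

    def punctuate_data(self, t):
        self.data_watermark = t

    def punctuate_queries(self, t):
        self.query_watermark = t

    def is_complete(self, window_end):
        """A window ending at window_end is final once the data watermark passes it."""
        return self.data_watermark is not None and window_end <= self.data_watermark

    def collect(self):
        """Drop tuples that no future query can ask about; return how many were dropped."""
        if self.query_watermark is None:
            return 0
        keep = [(ts, row) for ts, row in self.rows if ts >= self.query_watermark]
        dropped = len(self.rows) - len(keep)
        self.rows = keep
        return dropped


b = PunctuatedBuffer()
b.add(1, "a")
b.add(5, "b")
b.add(9, "c")
b.punctuate_data(5)
dropped_before = b.collect()      # nothing can be dropped without a query punctuation
b.punctuate_queries(5)
dropped_after = b.collect()       # the tuple at ts = 1 is now unreachable
print(dropped_before, dropped_after, [ts for ts, _ in b.rows])
```

The data punctuation alone says when a window's answer is final. Memory can only be reclaimed once queries are punctuated as well.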
  15. Evaluation Continued
    Trade-off: partial materialization is slower, but requires less space
    Nitpicks:
    Workload is not very impressive
    Do the speedups really matter? (< 1ms)
  16. Hot Takes
    PSoup = Cute, promising idea
    Symmetric joins = Relatively clean mechanism
    Paper = Not an exemplar of clarity in writing
    Paper = I like the idea more than the execution
    Overall = Interesting thought experiment
    Worth an after-work discussion over beer…
  17. Processor         Queries    Data       Results
      Data Warehouse    changing   fixed      fixed
      Streaming Engine  fixed      changing   fixed
      PSoup             changing   changing   fixed
      ¿¿??              changing   changing   changing
      Can’t stop won’t stop
  18. The Trinity
    Data, Queries, Results
    “If you take the trilogy of queries, data, and answers, then in theory with any two of them you could synthesize the third. So, for example, if I give you some queries and some answers, you could synthesize a (non-unique) database that would produce those answers given those queries (this would be useful for generating test databases, for example). Likewise, if I give you some data and some answers, you should be able to synthesize a set of queries that would produce those answers on that data (could be useful for generating an optimizer or perhaps for security-related forensics). There may be a Turing award in there somewhere.”
  19. Why did 2000s streaming engines fall short?
    Too early? Wrong API?
    Story: PoC at major retailer…
    Veteran @ Brown: temporal queries were more valuable than temporal processing (everyone has WINDOW now); only finance needed latency
    …and yet: most serious SQL engines now support WINDOW functionality
  20. Why does streaming matter today?
    » Decoupling storage, compute (e.g., HDFS, Kafka)?
    » Nowcasting / minute-level latency matters?
    » Poor man’s materialized view / model maintenance?
    » Maybe it doesn’t matter!
  21. IRL: MacroBase
    We’re building a new engine for prioritizing attention in data streams: analytic monitoring = anomaly detection plus summarization
    What can we use from the literature?
    » Dataflow as computing substrate + fault-tolerance
    » Semantics for windows, time, out of order execution
    What needs to be re-done?
    » Need support for statistical operators (e.g., KDE)
    » Need support for approximate computation
    » Need support for new edge capabilities
  22. Another (Beerworthy) Idea
    When you squint, most problems look like joins with historical state: e.g., every web request is just a fancy join with all other web requests previously seen…
    Why don’t we build services / infra like this?
    (Immutable logs as building blocks approximate this)
  23. Summary
    PSoup: streaming queries, streaming data
    One weird trick: join data with queries
    Streaming: what is old is new again…
    Lots to learn from the literature!
    Thanks! @pbailis