Papers We Love BOS: Streaming Queries over Streaming Data

pbailis
April 21, 2016

Transcript

  1. Papers We Love BOS: PSoup
    Peter Bailis (@pbailis), MIT & Stanford
    21 April 2016
  2. Who am I?
    Berkeley PhD ‘15, spent most of this year at MIT
    T-minus 11 days until tenure-track at Stanford
    » All about systems for next-gen data applications
    » Current hotness: MacroBase (automatic, large-scale analytic monitoring for IoT)
    http://bailis.org/
    http://futuredata.stanford.edu/
  3. None
  4. Why PSoup?
    Streams before streams were hot
    Interesting / cute idea
    Clean implementation
    Look to the past to learn for the future
  5. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us
  6. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
    Basic idea: continuously processing dataflow D(A)G
  7. DAG DAG DAG

  8. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries
  9. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
    First approximation: today’s OSS stream processors are kind of like Hadoop in 2009
    Low-level API; not much Hive / Spark SQL
    Not much automated query processing
    Not always as fast as they could be
    But they’re trying!
  10. Streaming is so 2000
    System & Project   School                 Focus
    STREAM             Stanford               Semantics
    TelegraphCQ        Berkeley               Adaptivity
    Aurora             Brown, Brandeis, MIT   Shared execution
  11. Example windowed queries:
    SELECT dst, COUNT(*), wtime(*) AS c
    FROM network.tcpdump AS st
      [RANGE BY '5 seconds' SLIDE BY '1 sec' START AT '2003-06-06']
    GROUP BY dst;

    SELECT R.i, R.j, count(*)
    FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'],
         S [RANGE BY '3 sec' SLIDE BY '4 sec']
    WHERE R.k = S.k
    GROUP BY R.i, R.j
    HAVING R.j > C;
  12. Lots to learn…
    Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries
    Literature has answers!!! They may not be right…
    No huge exits yet:
    TelegraphCQ -> Truviso -> Cisco (2012)
    Aurora -> StreamBase -> TIBCO (2013)
    …but we can learn!
  13. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us
  14. None
  15. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    This paper (PSoup): changing data, changing queries
  16. PSoup Motivation
    » What if users register new queries on demand? May not know in advance, need back-fill
    » What if we don’t want to pre-compute all query results? May be rare / infrequent
    » It’s a cool idea… (conceived over beers?)
    Semantics: SELECT-FROM-WHERE… BEGIN-END
  17. Big Idea
    Treat data and queries symmetrically
    Join them as needed
  18. None
  19. None
  20. PSoup Processing
    Key Ideas:
    Treat data and queries symmetrically
    Build predicate index over queries
    Materialize result structure for rendering actual results
    Time: system assigned (why can this be problematic?)
  21. Fancy terminology, simple ideas:
    STeM: holds intermediate state (e.g., for a join)
    Can have Query STeMs, Data STeMs…
    Eddy: routes tuples between STeMs
    Works for Joins, too
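
The symmetric data/query join described in the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class `PSoupSketch`, its method names, and the `ts` field are all hypothetical, predicates are plain Python callables rather than indexed predicates, and there is no Eddy doing adaptive routing. New data probes the stored queries, new queries probe the stored data, and the BEGIN-END window is applied only when results are requested.

```python
class PSoupSketch:
    """Toy symmetric join of data and queries (hypothetical names throughout)."""

    def __init__(self):
        self.data = {}        # plays the role of a Data STeM: tuple id -> tuple
        self.queries = {}     # plays the role of a Query STeM: query id -> predicate
        self.matches = set()  # materialized results structure: (query id, tuple id)

    def add_data(self, tid, tup):
        # A new tuple is stored, then probed against every stored query.
        self.data[tid] = tup
        for qid, pred in self.queries.items():
            if pred(tup):
                self.matches.add((qid, tid))

    def add_query(self, qid, pred):
        # A new query is stored, then probed against every stored tuple (back-fill).
        self.queries[qid] = pred
        for tid, tup in self.data.items():
            if pred(tup):
                self.matches.add((qid, tid))

    def results(self, qid, begin, end):
        # Apply the time window lazily, when the user asks for results.
        return sorted(
            tid for (q, tid) in self.matches
            if q == qid and begin <= self.data[tid]["ts"] <= end
        )


soup = PSoupSketch()
soup.add_data(1, {"ts": 1, "temp": 70})           # arrives before the query exists
soup.add_query(10, lambda t: t["temp"] > 60)      # back-fills over tuple 1
soup.add_data(3, {"ts": 3, "temp": 90})           # matched on arrival
print(soup.results(10, begin=0, end=10))
```

Both insertion paths do the same thing with the roles swapped, which is the point of treating data and queries symmetrically.
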
  22. Predicate Indexing
    Classic technique (e.g., used in concurrency control, multi-query processing, etc.)
    Cool trick!
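
To make the predicate-indexing trick concrete, here is a minimal sketch for one attribute and one predicate shape, `value > constant`. Everything in it is an assumption for illustration: the class `GreaterThanIndex` is hypothetical, and a sorted Python list with `bisect` stands in for whatever balanced tree a real system would use. The idea is that one lookup per incoming tuple finds every matching query, instead of evaluating each query's predicate separately.

```python
import bisect


class GreaterThanIndex:
    """Index over predicates of the form `value > constant` (hypothetical sketch)."""

    def __init__(self):
        self._constants = []  # predicate constants, kept sorted
        self._query_ids = []  # query ids, parallel to _constants

    def add(self, query_id, constant):
        # Insert while keeping both lists ordered by constant.
        pos = bisect.bisect_left(self._constants, constant)
        self._constants.insert(pos, constant)
        self._query_ids.insert(pos, query_id)

    def matching(self, value):
        # `value > constant` holds exactly for constants strictly below value,
        # i.e. the prefix of the sorted list before the first constant >= value.
        pos = bisect.bisect_left(self._constants, value)
        return self._query_ids[:pos]


index = GreaterThanIndex()
index.add("q1", 10)   # q1: value > 10
index.add("q2", 50)   # q2: value > 50
index.add("q3", 30)   # q3: value > 30
print(index.matching(31))  # queries whose predicate a tuple with value 31 satisfies
```

A full index would keep a similar structure per attribute and per comparison operator; this sketch covers only the single case needed to show the lookup.
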
  23. What about aggregates?
    Can compute aggregates over SPJ queries
    Allow overlapping queries to traverse shared tree…
  24. Garbage Collection?
    PSoup is main-memory only
    Can be expensive to re-execute arbitrarily in past
    Solution (not in this paper): punctuations!
    “All tuples from time T or less have arrived…”
    PSoup: need to punctuate queries, too…
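
One way to picture "punctuating queries" is as a promise from each query about the oldest data it may still ask for, which lets the store evict anything older than every promise. The sketch below is an assumption-laden illustration, not something from the paper: `PunctuatedStore` and its methods are hypothetical, and it ignores queries that have not been registered yet, which a real design would have to bound as well.

```python
class PunctuatedStore:
    """Toy in-memory tuple store with query-side punctuations (hypothetical)."""

    def __init__(self):
        self.tuples = []    # (timestamp, payload) pairs
        self.horizons = {}  # query id -> oldest timestamp that query may still request

    def insert(self, ts, payload):
        self.tuples.append((ts, payload))

    def punctuate_query(self, query_id, oldest_needed):
        # The query promises never to ask for data older than `oldest_needed`.
        self.horizons[query_id] = oldest_needed
        return self._collect()

    def _collect(self):
        # Safe to drop only what is older than every registered query's horizon.
        cutoff = min(self.horizons.values())
        before = len(self.tuples)
        self.tuples = [(ts, p) for ts, p in self.tuples if ts >= cutoff]
        return before - len(self.tuples)  # number of tuples evicted


store = PunctuatedStore()
for ts, payload in [(1, "a"), (5, "b"), (9, "c")]:
    store.insert(ts, payload)
store.punctuate_query("q1", oldest_needed=4)  # evicts the tuple at ts=1
store.punctuate_query("q2", oldest_needed=2)  # q2 lowers the cutoff, nothing more goes
print(store.tuples)
```
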
  25. Evaluation
    PSoup-P: partially materialize
    PSoup-C: fully materialize
    NoMat: query on demand
  26. Evaluation Continued
    Trade-off: partial materialization is slower, but requires less space
    Nitpicks:
    Workload is not very impressive
    Do the speedups really matter? (< 1ms)
  27. Hot Takes
    PSoup = Cute, promising idea
    Symmetric joins = Relatively clean mechanism
    Paper = Not an exemplar of clarity in writing
    Paper = I like the idea more than the execution
    Overall = Interesting thought experiment
    Worth an after-work discussion over beer…
  28. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us
  29. Processor          Queries    Data       Results
    Data Warehouse     changing   fixed      fixed
    Streaming Engine   fixed      changing   fixed
    PSoup              changing   changing   fixed
    ¿¿??               changing   changing   changing
    Can’t stop won’t stop
  30. The Trinity
    Data, Queries, Results
    “If you take the trilogy of queries, data, and answers, then in theory with any two of them you could synthesize the third. So, for example, if I give you some queries and some answers, you could synthesize a (non-unique) database that would produce those answers given those queries (this would be useful for generating test databases, for example). Likewise, if I give you some data and some answers, you should be able to synthesize a set of queries that would produce those answers on that data (could be useful for generating an optimizer or perhaps for security-related forensics). There may be Turing award in there somewhere.”
  31. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us
  32. Why did 2000s streaming engines fall short?
    Too early?
    Wrong API?
    Story: PoC at major retailer…
    Veteran @ Brown: temporal queries were more valuable than temporal processing (everyone has WINDOW now); only finance needed latency
    …and yet: most serious SQL engines now support WINDOW functionality
  33. Why does streaming matter today?
    » Decoupling storage, compute (e.g., HDFS, Kafka)?
    » Nowcasting / minute-level latency matters?
    » Poor man’s materialized view / model maintenance?
    » Maybe it doesn’t matter!
  34. IRL: MacroBase
    We’re building a new engine for prioritizing attention in data streams: analytic monitoring = anomaly detection plus summarization
    What can we use from the literature?
    » Dataflow as computing substrate + fault-tolerance
    » Semantics for windows, time, out of order execution
    What needs to be re-done?
    » Need support for statistical operators (e.g., KDE)
    » Need support for approximate computation
    » Need support for new edge capabilities
  35. Another (Beerworthy) Idea
    When you squint, most problems look like joins with historical state: e.g., every web request is just a fancy join with all other web requests previously seen…
    Why don’t we build services / infra like this?
    (Immutable logs as building blocks approximate this)
  36. If I were to recommend three papers:

  37. None
  38. Summary
    PSoup: streaming queries, streaming data
    One weird trick: join data with queries
    Streaming: what is old is new again…
    Lots to learn from the literature!
    Thanks! @pbailis