Papers We Love BOS: Streaming Queries over Streaming Data

Slide 1

Slide 1 text

Papers We Love BOS: PSoup Peter Bailis @pbailis MIT & Stanford 21 April 2016

Slide 2

Slide 2 text

Who am I? Berkeley PhD ’15, spent most of this year at MIT T-minus 11 days until tenure-track at Stanford » All about systems for next-gen data applications » Current hotness: MacroBase (automatic, large-scale analytic monitoring for IoT) http://bailis.org/ http://futuredata.stanford.edu/

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Why PSoup? Streams before streams were hot Interesting / cute idea Clean implementation Look to the past to learn for the future

Slide 5

Slide 5 text

Talk Outline History and background Crossing the streams The Trinity Lessons 4 all of us

Slide 6

Slide 6 text

Streaming is so 2016 “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink Basic idea: continuously processing dataflow D(A)G

Slide 7

Slide 7 text

DAG DAG DAG

Slide 8

Slide 8 text

Streams: The Big Idea Data warehouse: fixed data, changing queries Streaming: changing data, fixed queries Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries

Slide 9

Slide 9 text

Streaming is so 2016 “Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink First approximation: today’s OSS stream processors are kind of like Hadoop in 2009 Low-level API // not much Hive / Spark SQL Not much automated query processing Not always as fast as they could be But they’re trying!

Slide 10

Slide 10 text

Streaming is so 2000
System & Project | School | Focus
STREAM | Stanford | Semantics
TelegraphCQ | Berkeley | Adaptivity
Aurora | Brown, Brandeis, MIT | Shared execution

Slide 11

Slide 11 text

SELECT dst, COUNT(*), wtime(*) AS c
FROM network.tcpdump AS st [RANGE BY '5 seconds' SLIDE BY '1 sec' START AT '2003-06-06']
GROUP BY dst;
SELECT R.i, R.j, count(*)
FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'], S [RANGE BY '3 sec' SLIDE BY '4 sec']
WHERE R.k = S.k
GROUP BY R.i, R.j
HAVING R.j > C;
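To make the RANGE BY / SLIDE BY clauses concrete, here is a minimal Python sketch of a sliding-window count grouped by destination. It is not TelegraphCQ code: the function name `sliding_counts`, the event format, and the choice that a window ending at time t covers the half-open interval (t - range, t] are all assumptions made for illustration.

```python
from collections import Counter

def sliding_counts(events, range_s, slide_s, start):
    """Yield (window_end, Counter of dst) for each sliding window.

    events: list of (timestamp_seconds, dst) pairs, sorted by timestamp.
    Assumed semantics (not taken from TelegraphCQ): a window ending at
    time t covers the interval (t - range_s, t].
    """
    if not events:
        return
    last_ts = events[-1][0]
    window_end = start + range_s
    # Keep emitting windows until the window has slid past the last event.
    while window_end - range_s < last_ts:
        counts = Counter(
            dst for ts, dst in events
            if window_end - range_s < ts <= window_end
        )
        yield window_end, counts
        window_end += slide_s
```

With a 5-second range and a 1-second slide, each event is counted in up to five consecutive windows, which is why shared execution across overlapping windows matters.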

Slide 12

Slide 12 text

Lots to learn… Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries Literature has answers!!! They may not be right… No huge exits yet: TelegraphCQ -> Truviso -> Cisco (2012) Aurora -> StreamBase -> TIBCO (2013) …but we can learn!

Slide 13

Slide 13 text

Talk Outline History and background Crossing the streams The Trinity Lessons 4 all of us

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Streams: The Big Idea Data warehouse: fixed data, changing queries Streaming: changing data, fixed queries This paper (PSoup): changing data, changing queries

Slide 16

Slide 16 text

PSoup Motivation » What if users register new queries on demand? May not know in advance, need back-fill » What if we don’t want to pre-compute all query results? May be rare / infrequent » It’s a cool idea… (conceived over beers?) Semantics: SELECT-FROM-WHERE… BEGIN-END

Slide 17

Slide 17 text

Big Idea Treat data and queries symmetrically Join them as needed

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

PSoup Processing Key Ideas: Treat data and queries symmetrically Build predicate index over queries Materialize result structure for rendering actual results Time: system assigned (why can this be problematic?)
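The symmetric treatment of data and queries can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: the class `PSoupSketch` and its method names are hypothetical, predicates are plain Python callables, matching uses linear scans rather than a predicate index, and timestamps come from a system-assigned logical clock.

```python
class PSoupSketch:
    """Toy model: new data probes stored queries, new queries probe stored data."""

    def __init__(self):
        self.data = []      # stand-in for a Data STeM: (timestamp, tuple) pairs
        self.queries = {}   # stand-in for a Query STeM: query_id -> predicate
        self.results = {}   # results structure: query_id -> matching timestamps
        self.clock = 0      # system-assigned logical time

    def add_data(self, tup):
        """Store a new tuple and join it against every registered query."""
        self.clock += 1
        ts = self.clock
        self.data.append((ts, tup))
        for qid, pred in self.queries.items():
            if pred(tup):
                self.results[qid].append(ts)
        return ts

    def add_query(self, qid, pred):
        """Store a new query and join it against all stored data (back-fill)."""
        self.queries[qid] = pred
        self.results[qid] = [ts for ts, tup in self.data if pred(tup)]

    def invoke(self, qid, begin, end):
        """Render results for the window [begin, end] from materialized matches."""
        wanted = {ts for ts in self.results[qid] if begin <= ts <= end}
        return [tup for ts, tup in self.data if ts in wanted]
```

Note that `add_data` and `add_query` are mirror images of each other, and `invoke` only filters the materialized matches by the BEGIN-END window instead of re-running the predicate.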

Slide 21

Slide 21 text

Fancy terminology, simple ideas: STeM: holds intermediate state (e.g., for a join) Can have Query STeMs, Data STeMs… Eddy: routes tuples between STeMs Works for Joins, too

Slide 22

Slide 22 text

Predicate Indexing Classic technique (e.g., used in concurrency control, multi-query processing, etc.) Cool trick!
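As a rough illustration of predicate indexing, the sketch below indexes many single-attribute predicates of the form `value > constant` so that one lookup finds every query a new value satisfies. The class name `GreaterThanIndex` is hypothetical, and a sorted Python list stands in for whatever balanced tree a real system would use.

```python
import bisect

class GreaterThanIndex:
    """Index over predicates of the form `value > constant`."""

    def __init__(self):
        self.constants = []  # predicate constants, kept sorted
        self.query_ids = []  # query_ids[i] owns constants[i]

    def add(self, query_id, constant):
        """Register a query whose predicate is `value > constant`."""
        i = bisect.bisect_right(self.constants, constant)
        self.constants.insert(i, constant)
        self.query_ids.insert(i, query_id)

    def matches(self, value):
        """Return ids of all queries satisfied by `value`."""
        # Every constant strictly less than value satisfies value > constant,
        # and those constants form a prefix of the sorted list.
        i = bisect.bisect_left(self.constants, value)
        return self.query_ids[:i]
```

The lookup is a binary search plus a slice, so the cost depends on the number of matching queries rather than on the total number of registered queries.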

Slide 23

Slide 23 text

What about aggregates? Can compute aggregates over SPJ queries, Allow overlapping queries to traverse shared tree…

Slide 24

Slide 24 text

Garbage Collection? PSoup is main-memory only Can be expensive to re-execute queries arbitrarily far in the past Solution (not in this paper): punctuations! “All tuples from time T or less have arrived…” PSoup: need to punctuate queries, too…
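One way to picture query punctuation is as a promise that lets the store discard old tuples. The sketch below is an assumption-laden illustration, not something from the paper: the class `WindowStore`, the method `on_query_punctuation`, and the contract that no current or future query will use a BEGIN earlier than `min_begin` are all invented here.

```python
from collections import deque

class WindowStore:
    """In-memory tuple store that reclaims space when a punctuation arrives."""

    def __init__(self):
        self.tuples = deque()  # (timestamp, payload), appended in timestamp order

    def append(self, ts, payload):
        self.tuples.append((ts, payload))

    def on_query_punctuation(self, min_begin):
        """Drop tuples that no query can reach any more.

        Assumed contract: no current or future query has BEGIN < min_begin.
        Returns the number of tuples discarded.
        """
        dropped = 0
        while self.tuples and self.tuples[0][0] < min_begin:
            self.tuples.popleft()
            dropped += 1
        return dropped
```

Without such a promise about queries, a system that accepts arbitrary BEGIN times has to keep every tuple forever, which is the problem the slide points at.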

Slide 25

Slide 25 text

Evaluation PSoup-P: partially materialize PSoup-C: fully materialize NoMat: query on demand

Slide 26

Slide 26 text

Evaluation Continued Trade-off: partial materialization is slower, but requires less space Nitpicks: Workload is not very impressive Do the speedups really matter? (< 1ms)

Slide 27

Slide 27 text

Hot Takes PSoup = Cute, promising idea Symmetric joins = Relatively clean mechanism Paper = Not an exemplar of clarity in writing Paper = I like the idea more than the execution Overall = Interesting thought experiment Worth an after-work discussion over beer…

Slide 28

Slide 28 text

Talk Outline History and background Crossing the streams The Trinity Lessons 4 all of us

Slide 29

Slide 29 text

Processor | Queries | Data | Results
Data Warehouse | changing | fixed | fixed
Streaming Engine | fixed | changing | fixed
PSoup | changing | changing | fixed
¿¿?? | changing | changing | changing
Can’t stop won’t stop

Slide 30

Slide 30 text

The Trinity: Data, Queries, Results
“If you take the trilogy of queries, data, and answers, then in theory with any two of them you could synthesize the third. So, for example, if I give you some queries and some answers, you could synthesize a (non-unique) database that would produce those answers given those queries (this would be useful for generating test databases, for example). Likewise, if I give you some data and some answers, you should be able to synthesize a set of queries that would produce those answers on that data (could be useful for generating an optimizer or perhaps for security-related forensics). There may be a Turing award in there somewhere.”

Slide 31

Slide 31 text

Talk Outline History and background Crossing the streams The Trinity Lessons 4 all of us

Slide 32

Slide 32 text

Why did 2000s streaming engines fall short? Too early? Wrong API? Story: PoC at major retailer… Veteran @ Brown: temporal queries were more valuable than temporal processing (everyone has WINDOW now); only finance needed latency …and yet: most serious SQL engines now support WINDOW functionality

Slide 33

Slide 33 text

Why does streaming matter today? » Decoupling storage, compute (e.g., HDFS, Kafka)? » Nowcasting / minute-level latency matters? » Poor man’s materialized view / model maintenance ? » Maybe it doesn’t matter!

Slide 34

Slide 34 text

IRL: MacroBase We’re building a new engine for prioritizing attention in data streams: analytic monitoring = anomaly detection plus summarization What can we use from the literature? » Dataflow as computing substrate + fault-tolerance » Semantics for windows, time, out of order execution What needs to be re-done? » Need support for statistical operators (e.g., KDE) » Need support for approximate computation » Need support for new edge capabilities

Slide 35

Slide 35 text

Another (Beerworthy) Idea When you squint, most problems look like joins with historical state: e.g., every web request is just a fancy join with all other web requests previously seen… Why don’t we build services / infra like this? (Immutable logs as building blocks approximate this)

Slide 36

Slide 36 text

If I were to recommend three papers:

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Summary PSoup: streaming queries, streaming data One weird trick: join data with queries Streaming: what is old is new again… Lots to learn from the literature! Thanks! @pbailis