Papers We Love BOS: Streaming Queries over Streaming Data

Papers We Love BOS: PSoup
Peter Bailis @pbailis
MIT & Stanford
21 April 2016

Who am I?
Berkeley PhD ’15, spent most of this year at MIT
T-minus 11 days until tenure-track at Stanford
» All about systems for next-gen data applications
» Current hotness: MacroBase (automatic, large-scale analytic monitoring for IoT)
http://bailis.org/
http://futuredata.stanford.edu/

Why PSoup?
Streams before streams were hot
Interesting / cute idea
Clean implementation
Look to the past to learn for the future

Talk Outline
History and background
Crossing the streams
The Trinity
Lessons 4 all of us

Streaming is so 2016
“Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
Basic idea: continuously processing dataflow D(A)G

DAG DAG DAG

Streams: The Big Idea
Data warehouse: fixed data, changing queries
Streaming: changing data, fixed queries
Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries

Streaming is so 2016
“Big Data”: Samza / Kafka, Storm, Spark Streaming, Amazon Kinesis, Flink
First approximation: today’s OSS stream processors are kind of like Hadoop in 2009
Low-level API // not much Hive / Spark SQL
Not much automated query processing
Not always as fast as they could be
But they’re trying!

Streaming is so 2000
System & Project   School                 Focus
STREAM             Stanford               Semantics
TelegraphCQ        Berkeley               Adaptivity
Aurora             Brown, Brandeis, MIT   Shared execution

SELECT dst, COUNT(*), wtime(*) AS c
FROM network.tcpdump AS st [RANGE BY '5 seconds' SLIDE BY '1 sec' START AT '2003-06-06']
GROUP BY dst;

SELECT R.i, R.j, count(*)
FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'],
     S [RANGE BY '3 sec' SLIDE BY '4 sec']
WHERE R.k = S.k
GROUP BY R.i, R.j
HAVING R.j > C;
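The RANGE BY / SLIDE BY clauses above describe a sliding window: every slide interval, aggregate the tuples that fall within the last range interval. A minimal Python sketch of that idea follows. It assumes integer-second timestamps and half-open windows (t - range, t]; the helper name `sliding_counts` is hypothetical, and this is not how TelegraphCQ itself is implemented.

```python
from collections import Counter


def sliding_counts(tuples, range_s, slide_s, start, end):
    """Per-key counts for each window (t - range_s, t], for t = start, start + slide_s, ... <= end.

    `tuples` is a list of (timestamp, key) pairs. The window boundaries are an
    assumption made for this sketch, not a statement of any engine's semantics.
    """
    results = []
    t = start
    while t <= end:
        counts = Counter(key for ts, key in tuples if t - range_s < ts <= t)
        results.append((t, dict(counts)))
        t += slide_s
    return results


# Toy stand-in for network.tcpdump: (arrival second, dst)
packets = [(1, "a"), (2, "b"), (3, "a"), (6, "a"), (7, "b")]
windows = sliding_counts(packets, range_s=5, slide_s=1, start=5, end=7)
```

Rescanning every tuple for every window keeps the sketch short; a real engine would maintain the aggregate incrementally as tuples enter and leave the window.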

Lots to learn…
Surprisingly hard to get right: semantics, languages, ordering, fault tolerance, efficient execution, handling multiple queries
Literature has answers!!! They may not be right…
No huge exits yet:
TelegraphCQ -> Truviso -> Cisco (2012)
Aurora -> StreamBase -> TIBCO (2013)
…but we can learn!

Talk Outline: History and background / Crossing the streams / The Trinity / Lessons 4 all of us

Streams: The Big Idea
Data warehouse: fixed data, changing queries
Streaming: changing data, fixed queries
This paper (PSoup): changing data, changing queries

PSoup Motivation
» What if users register new queries on demand? May not know in advance, need back-fill
» What if we don’t want to pre-compute all query results? May be rare / infrequent
» It’s a cool idea… (conceived over beers?)
Semantics: SELECT-FROM-WHERE… BEGIN-END

Big Idea
Treat data and queries symmetrically
Join them as needed

PSoup Processing
Key Ideas:
Treat data and queries symmetrically
Build predicate index over queries
Materialize result structure for rendering actual results
Time: system assigned (why can this be problematic?)
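The key ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `PSoupSketch` and its method names are hypothetical, queries are plain Python predicates, timestamps come from a counter the system increments on each arrival, and a linear scan stands in for the predicate index.

```python
class PSoupSketch:
    """Toy symmetric join of data and queries (all names are hypothetical)."""

    def __init__(self):
        self.data = []      # data store: (system timestamp, row)
        self.queries = {}   # query store: query id -> predicate
        self.results = {}   # results structure: query id -> indexes of matching rows
        self.clock = 0      # system-assigned time

    def add_query(self, qid, predicate):
        # A new query is stored, then probes the data store (back-fill over old data).
        self.queries[qid] = predicate
        self.results[qid] = [i for i, (_, row) in enumerate(self.data) if predicate(row)]

    def add_data(self, row):
        # A new tuple is stored, then probes the query store.
        self.clock += 1
        self.data.append((self.clock, row))
        idx = len(self.data) - 1
        for qid, predicate in self.queries.items():
            if predicate(row):
                self.results[qid].append(idx)

    def invoke(self, qid, begin, end):
        # Render results for the BEGIN-END window from the materialized matches.
        return [self.data[i][1] for i in self.results[qid]
                if begin <= self.data[i][0] <= end]


soup = PSoupSketch()
soup.add_data({"dst": "10.0.0.1", "bytes": 900})   # arrives at t=1, before any query
soup.add_query("big", lambda r: r["bytes"] > 500)  # back-fills over t=1
soup.add_data({"dst": "10.0.0.2", "bytes": 100})   # t=2
soup.add_data({"dst": "10.0.0.3", "bytes": 700})   # t=3
everything = soup.invoke("big", 1, 3)
recent = soup.invoke("big", 2, 3)
```

Both arrival paths do the same thing (store, then probe the other side), which is the symmetry the slide refers to; invoking a query only filters already-materialized matches by time.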

Fancy terminology, simple ideas:
STeM: holds intermediate state (e.g., for a join)
Can have Query STeMs, Data STeMs…
Eddy: routes tuples between STeMs
Works for Joins, too

Predicate Indexing
Classic technique (e.g., used in concurrency control, multi-query processing, etc.)
Cool trick!
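To make the trick concrete: instead of testing a new tuple against every registered query, index the queries' predicates by their constants so one lookup returns every matching query. The sketch below is a simplified illustration, assuming single-attribute predicates of the form `attr > c` or `attr < c`; the class `PredicateIndex` is hypothetical, and sorted lists stand in for whatever tree structure a real system would use.

```python
import bisect


class PredicateIndex:
    """Index of single-attribute range predicates (hypothetical sketch)."""

    def __init__(self):
        # Parallel sorted lists: constants and the query ids that own them.
        self.gt_consts, self.gt_qids = [], []   # predicates `attr > c`
        self.lt_consts, self.lt_qids = [], []   # predicates `attr < c`

    def add(self, qid, op, constant):
        if op == ">":
            consts, qids = self.gt_consts, self.gt_qids
        elif op == "<":
            consts, qids = self.lt_consts, self.lt_qids
        else:
            raise ValueError("only '>' and '<' are supported in this sketch")
        i = bisect.bisect_right(consts, constant)
        consts.insert(i, constant)
        qids.insert(i, qid)

    def matching(self, value):
        # `value > c` holds for every constant strictly below value.
        hi = bisect.bisect_left(self.gt_consts, value)
        # `value < c` holds for every constant strictly above value.
        lo = bisect.bisect_right(self.lt_consts, value)
        return set(self.gt_qids[:hi]) | set(self.lt_qids[lo:])


index = PredicateIndex()
index.add("q1", ">", 10)
index.add("q2", ">", 50)
index.add("q3", "<", 20)
index.add("q4", "<", 100)
```

With these four queries registered, `index.matching(15)` returns `{"q1", "q3", "q4"}`: the binary searches locate the boundary in each sorted list, and the matching queries are read off as a contiguous slice.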

What about aggregates?
Can compute aggregates over SPJ queries
Allow overlapping queries to traverse shared tree…

Garbage Collection?
PSoup is main-memory only
Can be expensive to re-execute arbitrarily far into the past
Solution (not in this paper): punctuations!
“All tuples from time T or less have arrived…”
PSoup: need to punctuate queries, too…
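One way to read "punctuate queries, too" is as a promise from the query side: no current or future query will ask about timestamps earlier than T, so older tuples can be discarded. The sketch below illustrates that reading only. It is an assumption of this write-up, not something the paper specifies, and the class `DataStore` and its methods are hypothetical.

```python
class DataStore:
    """Main-memory store trimmed by query punctuations (hypothetical sketch)."""

    def __init__(self):
        self.rows = []     # (timestamp, row)
        self.horizon = 0   # earliest timestamp any query may still reference

    def add(self, ts, row):
        self.rows.append((ts, row))

    def punctuate_queries(self, t):
        # Assumed promise: no current or future query has a window starting before t.
        self.horizon = max(self.horizon, t)
        before = len(self.rows)
        self.rows = [(ts, row) for ts, row in self.rows if ts >= self.horizon]
        return before - len(self.rows)   # number of tuples reclaimed

    def scan(self, begin, end):
        if begin < self.horizon:
            raise ValueError("window starts before the query punctuation")
        return [row for ts, row in self.rows if begin <= ts <= end]


store = DataStore()
for ts in range(1, 6):
    store.add(ts, {"v": ts})
dropped = store.punctuate_queries(4)   # tuples at t=1..3 can never be asked for again
remaining = store.scan(4, 5)
```

The data-side punctuation quoted on the slide bounds how long a window must wait for late tuples; the query-side promise sketched here bounds how much history must be kept in memory.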

Evaluation
PSoup-P: partially materialize
PSoup-C: fully materialize
NoMat: query on demand

Evaluation Continued
Trade-off: partial materialization is slower, but requires less space
Nitpicks:
Workload is not very impressive
Do the speedups really matter? (< 1ms)

Hot Takes
PSoup = Cute, promising idea
Symmetric joins = Relatively clean mechanism
Paper = Not an exemplar of clarity in writing
Paper = I like the idea more than the execution
Overall = Interesting thought experiment
Worth an after-work discussion over beer…

Talk Outline: History and background / Crossing the streams / The Trinity / Lessons 4 all of us

Processor          Queries    Data       Results
Data Warehouse     changing   fixed      fixed
Streaming Engine   fixed      changing   fixed
PSoup              changing   changing   fixed
¿¿??               changing   changing   changing
Can’t stop won’t stop

The Trinity
Data, Queries, Results
“If you take the trilogy of queries, data, and answers, then in theory with any two of them you could synthesize the third. So, for example, if I give you some queries and some answers, you could synthesize a (non-unique) database that would produce those answers given those queries (this would be useful for generating test databases, for example). Likewise, if I give you some data and some answers, you should be able to synthesize a set of queries that would produce those answers on that data (could be useful for generating an optimizer or perhaps for security-related forensics). There may be a Turing award in there somewhere.”

Talk Outline: History and background / Crossing the streams / The Trinity / Lessons 4 all of us

Why did 2000s streaming engines fall short?
Too early? Wrong API?
Story: PoC at major retailer…
Veteran @ Brown: temporal queries were more valuable than temporal processing (everyone has WINDOW now); only finance needed latency
…and yet: most serious SQL engines now support WINDOW functionality

Why does streaming matter today?
» Decoupling storage, compute (e.g., HDFS, Kafka)?
» Nowcasting / minute-level latency matters?
» Poor man’s materialized view / model maintenance?
» Maybe it doesn’t matter!

IRL: MacroBase
We’re building a new engine for prioritizing attention in data streams: analytic monitoring = anomaly detection plus summarization
What can we use from the literature?
» Dataflow as computing substrate + fault-tolerance
» Semantics for windows, time, out of order execution
What needs to be re-done?
» Need support for statistical operators (e.g., KDE)
» Need support for approximate computation
» Need support for new edge capabilities

Another (Beerworthy) Idea
When you squint, most problems look like joins with historical state: e.g., every web request is just a fancy join with all other web requests previously seen…
Why don’t we build services / infra like this?
(Immutable logs as building blocks approximate this)

If I were to recommend three papers:

Summary
PSoup: streaming queries, streaming data
One weird trick: join data with queries
Streaming: what is old is new again…
Lots to learn from the literature!
Thanks! @pbailis
