Papers We Love BOS: Streaming Queries over Streaming Data

pbailis
April 21, 2016

Transcript

  1. Papers We Love BOS:
    PSoup
    Peter Bailis @pbailis
    MIT & Stanford
    21 April 2016

  2. Who am I?
    Berkeley PhD ‘15, spent most of this year at MIT
    T-minus 11 days until tenure-track at Stanford
    » All about systems for next-gen data applications
    » Current hotness: MacroBase — automatic,
    large-scale analytic monitoring for IoT
    http://bailis.org/
    http://futuredata.stanford.edu/

  4. Why PSoup?
    Streams before streams were hot
    Interesting / cute idea
    Clean implementation
    Look to the past to learn for the future

  5. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us

  6. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark
    Streaming, Amazon Kinesis, Flink
    Basic idea: continuously processing dataflow D(A)G

  7. DAG DAG DAG

  8. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    Surprisingly hard to get right:
    semantics, languages, ordering, fault tolerance,
    efficient execution, handling multiple queries

  9. Streaming is so 2016
    “Big Data”: Samza / Kafka, Storm, Spark
    Streaming, Amazon Kinesis, Flink
    First approximation: today’s OSS stream
    processors are kind of like Hadoop in 2009
    Low-level API // not much Hive / Spark SQL
    Not much automated query processing
    Not always as fast as they could be
    But they’re trying!

  10. Streaming is so 2000
    System & Project | School | Focus
    STREAM | Stanford | Semantics
    TelegraphCQ | Berkeley | Adaptivity
    Aurora | Brown, Brandeis, MIT | Shared execution

  11. SELECT dst, COUNT(*), wtime(*) AS c
    FROM network.tcpdump AS st
      [RANGE BY '5 seconds' SLIDE BY '1 sec'
       START AT '2003-06-06']
    GROUP BY dst;

    SELECT R.i, R.j, COUNT(*)
    FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'],
         S [RANGE BY '3 sec' SLIDE BY '4 sec']
    WHERE R.k = S.k
    GROUP BY R.i, R.j
    HAVING R.j > C;
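
One way to read the RANGE BY / SLIDE BY clauses in the first query above: every slide interval, emit a per-destination count over the most recent range interval. The following is a minimal Python sketch of that reading, not TelegraphCQ's implementation. The function name `sliding_counts`, the `(timestamp, dst)` tuple layout, and the half-open window convention are all assumptions made for illustration.

```python
from collections import Counter

def sliding_counts(events, range_s, slide_s, start, end):
    """Yield (window_end, Counter of dst) for each sliding window.

    events: list of (timestamp, dst) pairs (hypothetical layout).
    Each window covers (window_end - range_s, window_end] and the
    window end advances by slide_s.
    """
    t = start + range_s
    while t <= end:
        counts = Counter(dst for ts, dst in events if t - range_s < ts <= t)
        yield t, counts
        t += slide_s

# Toy input: four packets, windows of 5 s sliding by 1 s.
events = [(0.5, "a"), (1.5, "b"), (2.5, "a"), (6.2, "a")]
results = list(sliding_counts(events, range_s=5, slide_s=1, start=0, end=7))
```

This rescans all events for every window, which is fine for a toy input; a real engine would maintain the counts incrementally as tuples enter and leave the window.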

  12. Lots to learn…
    Surprisingly hard to get right:
    semantics, languages, ordering, fault tolerance,
    efficient execution, handling multiple queries
    Literature has answers!!!
    They may not be right…
    No huge exits yet:
    TelegraphCQ -> Truviso -> Cisco (2012)
    Aurora -> StreamBase -> TIBCO (2013)
    …but we can learn!

  13. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us

  15. Streams: The Big Idea
    Data warehouse: fixed data, changing queries
    Streaming: changing data, fixed queries
    This paper (PSoup):
    changing data, changing queries

  16. PSoup Motivation
    » What if users register new queries on demand?
    May not know in advance, need back-fill
    » What if we don’t want to pre-compute all query
    results? May be rare / infrequent
    » It’s a cool idea… (conceived over beers?)
    Semantics: SELECT-FROM-WHERE… BEGIN-END

  17. Big Idea
    Treat data and queries symmetrically
    Join them as needed

  20. PSoup Processing Key Ideas:
    Treat data and queries symmetrically
    Build predicate index over queries
    Materialize result structure for rendering actual results
    Time: system assigned (why can this be problematic?)
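
The symmetric treatment of data and queries can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: the class name `MiniPSoup`, its methods, and the use of plain Python predicates are hypothetical, and the linear scans stand in for the indexes a real system would use. New data probes the stored queries, a new query probes the stored data (the back-fill case), and matches are materialized so that a later fetch only has to apply the BEGIN-END time range.

```python
class MiniPSoup:
    """Toy engine that joins arriving data with arriving queries."""

    def __init__(self):
        self.data = []      # stored (timestamp, value) tuples
        self.queries = {}   # query id -> predicate over a value
        self.results = {}   # query id -> materialized (timestamp, value) matches
        self.clock = 0      # system-assigned logical time

    def add_query(self, qid, predicate):
        # A new query probes all stored data (back-fill).
        self.queries[qid] = predicate
        self.results[qid] = [(ts, v) for ts, v in self.data if predicate(v)]

    def add_data(self, value):
        # A new tuple gets a system timestamp and probes all stored queries.
        self.clock += 1
        ts = self.clock
        self.data.append((ts, value))
        for qid, predicate in self.queries.items():
            if predicate(value):
                self.results[qid].append((ts, value))
        return ts

    def fetch(self, qid, begin, end):
        # Apply the BEGIN-END range to the materialized matches.
        return [v for ts, v in self.results[qid] if begin <= ts <= end]

soup = MiniPSoup()
soup.add_data(3)                          # timestamp 1
soup.add_data(12)                         # timestamp 2
soup.add_query("big", lambda v: v > 10)   # registered after data arrived
soup.add_data(20)                         # timestamp 3
whole_history = soup.fetch("big", 1, 3)
latest_only = soup.fetch("big", 3, 3)
```

Because the query "big" is registered after the first two tuples, its result still includes 12: registering a query is itself a probe against stored data.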

  21. Fancy terminology, simple ideas:
    STeM: holds intermediate state (e.g., for a join)
    Can have Query STeMs, Data STeMs…
    Eddy: routes tuples between STeMs
    Works for Joins, too

  22. Predicate Indexing
    Classic technique (e.g., used in concurrency
    control, multi-query processing, etc.)
    Cool trick!
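
As a rough illustration of predicate indexing, the Python sketch below indexes many predicates of the form `x > c` by their constant, so that one binary search finds every predicate an incoming value satisfies instead of testing each predicate in turn. The class name `GreaterThanIndex` and the sorted-list representation are assumptions for illustration; they are not taken from the paper, and a real system would likely use a balanced tree and support more operators.

```python
import bisect

class GreaterThanIndex:
    """Index over predicates of the form `x > c`, keyed by the constant c."""

    def __init__(self):
        self._consts = []   # constants, kept sorted
        self._qids = []     # query ids, parallel to _consts

    def add(self, qid, c):
        # Insert while keeping both lists ordered by constant.
        i = bisect.bisect_left(self._consts, c)
        self._consts.insert(i, c)
        self._qids.insert(i, qid)

    def matches(self, x):
        # Every predicate whose constant is strictly below x is satisfied.
        i = bisect.bisect_left(self._consts, x)
        return self._qids[:i]

index = GreaterThanIndex()
index.add("q1", 10)
index.add("q2", 5)
index.add("q3", 20)
hits = index.matches(15)   # queries "x > 5" and "x > 10" match
```

The lookup cost is logarithmic in the number of registered predicates plus the size of the answer, which is the point of the technique when thousands of queries share an attribute.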

  23. What about aggregates?
    Can compute aggregates over SPJ queries,
    Allow overlapping queries to traverse shared tree…

  24. Garbage Collection?
    PSoup is main-memory only
    Can be expensive to re-execute arbitrarily in past
    Solution (not in this paper): punctuations!
    “All tuples from time T or less have arrived…”
    PSoup: need to punctuate queries, too…
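
A hedged sketch of how punctuations could drive garbage collection, in Python. The slide notes this is not in the paper, so everything here is an assumption: the class name `PunctuatedStore`, the two punctuation methods, and the rule that a query punctuation at time T promises no current or future query will ask about tuples older than T.

```python
class PunctuatedStore:
    """Toy tuple store that uses punctuations to bound its memory."""

    def __init__(self):
        self.tuples = []            # stored (timestamp, value) pairs
        self.data_punct = None      # all tuples with ts <= this have arrived
        self.query_punct = None     # no query will reference ts < this

    def insert(self, ts, value):
        # A tuple at or before the data punctuation breaks the promise.
        if self.data_punct is not None and ts <= self.data_punct:
            raise ValueError("tuple arrived after its punctuation")
        self.tuples.append((ts, value))

    def punctuate_data(self, t):
        self.data_punct = t

    def punctuate_queries(self, t):
        # Tuples older than t can never be requested again, so drop them.
        self.query_punct = t
        self.tuples = [(ts, v) for ts, v in self.tuples if ts >= t]

store = PunctuatedStore()
store.insert(1, "a")
store.insert(2, "b")
store.insert(5, "c")
store.punctuate_queries(3)   # frees the tuples at timestamps 1 and 2
```

The data punctuation alone does not free anything in this sketch; it only rejects late tuples. Memory is reclaimed by the query-side punctuation, which matches the slide's point that queries need punctuating too.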

  25. Evaluation
    PSoup-P: partially materialize
    PSoup-C: fully materialize
    NoMat: query on demand

  26. Evaluation Continued
    Trade-off: partial mat slower, but less space required
    Nitpicks:
    Workload is not very impressive
    Do the speedups really matter? (< 1ms)

  27. Hot Takes
    PSoup = Cute, promising idea
    Symmetric joins = Relatively clean mechanism
    Paper = Not an exemplar of clarity in writing
    Paper = I like the idea more than the execution
    Overall = Interesting thought experiment
    Worth an after-work discussion over beer…

  28. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us

  29. Processor | Queries | Data | Results
    Data Warehouse | changing | fixed | fixed
    Streaming Engine | fixed | changing | fixed
    PSoup | changing | changing | fixed
    ¿¿?? | changing | changing | changing
    Can’t stop won’t stop

  30. The Trinity
    Data
    Queries
    Results
    “If you take the trilogy of queries, data, and answers, then in theory with any two of
    them you could synthesize the third. So, for example, if I give you some queries and
    some answers, you could synthesize a (non-unique) database that would produce
    those answers given those queries (this would be useful for generating test
    databases, for example). Likewise, if I give you some data and some answers, you
    should be able to synthesize a set of queries that would produce those answers on
    that data (could be useful for generating an optimizer or perhaps for security-related
    forensics). There may be a Turing award in there somewhere.”

  31. Talk Outline
    History and background
    Crossing the streams
    The Trinity
    Lessons 4 all of us

  32. Why did 2000s streaming
    engines fall short?
    Too early? Wrong API?
    Story: PoC at major retailer…
    Veteran @ Brown: temporal queries were more
    valuable than temporal processing (everyone has
    WINDOW now); only finance needed latency
    …and yet: most serious SQL engines now support
    WINDOW functionality

  33. Why does streaming
    matter today?
    » Decoupling storage, compute (e.g., HDFS, Kafka)?
    » Nowcasting / minute-level latency matters?
    » Poor man’s materialized view / model maintenance?
    » Maybe it doesn’t matter!

  34. IRL: MacroBase
    We’re building a new engine for prioritizing attention in data streams:
    analytic monitoring = anomaly detection plus summarization
    What can we use from the literature?
    » Dataflow as computing substrate + fault-tolerance
    » Semantics for windows, time, out-of-order execution
    What needs to be re-done?
    » Need support for statistical operators (e.g., KDE)
    » Need support for approximate computation
    » Need support for new edge capabilities

  35. Another (Beerworthy) Idea
    When you squint, most problems look like joins
    with historical state:
    e.g., every web request is just a fancy join with all
    other web requests previously seen…
    Why don’t we build services / infra like this?
    (Immutable logs as building blocks approximate this)

  36. If I were to recommend
    three papers:

  38. Summary
    PSoup: streaming queries, streaming data
    One weird trick: join data with queries
    Streaming: what is old is new again…
    Lots to learn from the literature!
    Thanks!
    @pbailis
