year at MIT
T-minus 11 days until tenure-track at Stanford
» All about systems for next-gen data applications
» Current hotness: MacroBase (automatic, large-scale analytic monitoring for IoT)
http://bailis.org/
http://futuredata.stanford.edu/
Spark Streaming, Amazon Kinesis, Flink
First approximation: today’s OSS stream processors are kind of like Hadoop in 2009
Low-level API // not much Hive / Spark SQL
Not much automated query processing
Not always as fast as they could be
But they’re trying!
[RANGE BY '5 seconds' SLIDE BY '1 sec' START AT '2003-06-06']
GROUP BY dst;

SELECT R.i, R.j, count(*)
FROM R [RANGE BY '1 sec' SLIDE BY '2 sec'],
     S [RANGE BY '3 sec' SLIDE BY '4 sec']
WHERE R.k = S.k
GROUP BY R.i, R.j
HAVING R.j > C;
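To make the RANGE BY / SLIDE BY idea concrete, here is a minimal Python sketch of a sliding-window grouped count. It is not TelegraphCQ's implementation. The function name `sliding_counts`, the `(timestamp, key)` event shape, and the choice that a window ending at time t covers the half-open interval (t - range, t] are all assumptions made for illustration.

```python
from collections import Counter


def sliding_counts(events, range_s, slide_s, start, end):
    """Count events per key in each sliding window.

    events:  iterable of (timestamp_seconds, key) pairs (hypothetical shape)
    range_s: window length in seconds (the RANGE BY value)
    slide_s: distance between window ends in seconds (the SLIDE BY value)
    start:   time at which windowing begins (the START AT value)
    end:     last window end to emit

    Assumed semantics: the window ending at t covers (t - range_s, t].
    Returns a list of (window_end, Counter) pairs.
    """
    events = list(events)
    results = []
    t = start + slide_s
    while t <= end:
        lo = t - range_s
        # Naive rescan per window: fine for a sketch, not for production.
        counts = Counter(key for ts, key in events if lo < ts <= t)
        results.append((t, counts))
        t += slide_s
    return results


if __name__ == "__main__":
    demo = [(0.5, "a"), (1.5, "a"), (1.7, "b"), (6.2, "a")]
    for window_end, counts in sliding_counts(demo, 5, 1, 0, 7):
        print(window_end, dict(counts))
```

With RANGE larger than SLIDE, as in the first query, consecutive windows overlap, so a single tuple is counted in several windows.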
ordering, fault tolerance, efficient execution, handling multiple queries
Literature has answers!!! They may not be right…
No huge exits yet:
TelegraphCQ -> Truviso -> Cisco (2012)
Aurora -> StreamBase -> TIBCO (2013)
…but we can learn!
demand? May not know in advance, need back-fill
» What if we don’t want to pre-compute all query results? May be rare / infrequent
» It’s a cool idea… (conceived over beers?)
Semantics: SELECT-FROM-WHERE… BEGIN-END
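One way to picture the back-fill idea is a store that remembers both tuples and queries, so a query arriving late can still be answered over data that arrived earlier. This is a minimal Python sketch, not PSoup's actual data structures. The class `BackfillStore`, its method names, and the representation of a query as a `(predicate, begin, end)` triple are hypothetical.

```python
class BackfillStore:
    """Toy store in which data and queries can arrive in either order.

    A query is a (predicate, begin, end) triple: the predicate plays the
    role of the WHERE clause, and [begin, end] the BEGIN-END time window.
    All names here are illustrative assumptions.
    """

    def __init__(self):
        self.tuples = []   # (timestamp, row) history kept for back-fill
        self.queries = []  # registered (predicate, begin, end) triples

    def add_tuple(self, ts, row):
        """Store a new tuple; return ids of registered queries it matches."""
        self.tuples.append((ts, row))
        return [
            qid
            for qid, (pred, begin, end) in enumerate(self.queries)
            if begin <= ts <= end and pred(row)
        ]

    def add_query(self, predicate, begin, end):
        """Register a query; return matches from already-stored tuples."""
        self.queries.append((predicate, begin, end))
        return [
            (ts, row)
            for ts, row in self.tuples
            if begin <= ts <= end and predicate(row)
        ]


if __name__ == "__main__":
    store = BackfillStore()
    store.add_tuple(1, {"x": 5})
    store.add_tuple(2, {"x": 20})
    # The query arrives after the data and is back-filled from history.
    print(store.add_query(lambda r: r["x"] > 10, 0, 10))
    # Later data is matched against the stored query.
    print(store.add_tuple(3, {"x": 30}))
```

Both methods scan linearly here. A real system would need indexes on both sides, plus a policy for discarding old tuples and expired queries.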
re-execute arbitrarily in past
Solution (not in this paper): punctuations!
“All tuples from time T or less have arrived…”
PSoup: need to punctuate queries, too…
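The punctuation idea can be sketched as a buffer that holds out-of-order tuples and releases them only once a punctuation for time T guarantees nothing older is still in flight. This is a minimal Python illustration, not a mechanism taken from the paper. The class `PunctuatedBuffer` and its method names are hypothetical.

```python
import heapq


class PunctuatedBuffer:
    """Buffer out-of-order tuples until a punctuation says they are final."""

    def __init__(self):
        self._heap = []  # min-heap ordered by timestamp
        self._seq = 0    # tie-breaker so payloads are never compared

    def on_tuple(self, ts, payload):
        """Accept a tuple that may arrive out of timestamp order."""
        heapq.heappush(self._heap, (ts, self._seq, payload))
        self._seq += 1

    def on_punctuation(self, t):
        """Punctuation: all tuples with timestamp <= t have arrived.

        Returns those tuples in timestamp order and drops them from
        the buffer, so the state held for them can be reclaimed.
        """
        released = []
        while self._heap and self._heap[0][0] <= t:
            ts, _, payload = heapq.heappop(self._heap)
            released.append((ts, payload))
        return released


if __name__ == "__main__":
    buf = PunctuatedBuffer()
    for ts, payload in [(3, "c"), (1, "a"), (5, "e"), (2, "b")]:
        buf.on_tuple(ts, payload)
    print(buf.on_punctuation(3))   # tuples with ts <= 3, in order
    print(buf.on_punctuation(10))  # the remaining tuple
```

The sketch assumes punctuations are truthful. A tuple that arrives with a timestamp at or below an already-seen punctuation would need a separate late-data policy.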
Relatively clean mechanism
Paper = Not an exemplar of clarity in writing
Paper = I like the idea more than the execution
Overall = Interesting thought experiment
Worth an after-work discussion over beer…
of queries, data, and answers, then in theory with any two of them you could synthesize the third. So, for example, if I give you some queries and some answers, you could synthesize a (non-unique) database that would produce those answers given those queries (this would be useful for generating test databases, for example). Likewise, if I give you some data and some answers, you should be able to synthesize a set of queries that would produce those answers on that data (could be useful for generating an optimizer or perhaps for security-related forensics). There may be a Turing award in there somewhere.”
API?
Story: PoC at major retailer…
Veteran @ Brown: temporal queries were more valuable than temporal processing (everyone has WINDOW now); only finance needed latency
…and yet: most serious SQL engines now support WINDOW functionality
in data streams: analytic monitoring = anomaly detection plus summarization
What can we use from the literature?
» Dataflow as computing substrate + fault-tolerance
» Semantics for windows, time, out of order execution
What needs to be re-done?
» Need support for statistical operators (e.g., KDE)
» Need support for approximate computation
» Need support for new edge capabilities
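As an example of the kind of statistical operator mentioned above, here is a minimal pure-Python sketch of a one-dimensional Gaussian kernel density estimate (KDE), which can serve as an anomaly score: points in low-density regions are the unusual ones. This is not MacroBase's implementation. The function name `gaussian_kde` and the fixed, caller-supplied bandwidth are assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples.

    Uses the standard Gaussian-kernel estimator
        f(x) = 1 / (n * h) * sum_i K((x - x_i) / h)
    where K is the standard normal pdf and h is the bandwidth.
    The bandwidth is supplied by the caller (an assumption; real
    systems usually select it from the data).
    """
    samples = list(samples)
    n = len(samples)
    if n == 0 or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


if __name__ == "__main__":
    readings = [0.0, 0.1, -0.1, 0.05, 0.02]
    density = gaussian_kde(readings, bandwidth=0.5)
    # A reading far from the bulk of the data gets a much lower density.
    print(density(0.0), density(5.0))
```

Each density evaluation touches every sample, so this exact form costs O(n) per point. That cost is one reason approximate computation appears on the same list.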
joins with historical state: e.g., every web request is just a fancy join with all other web requests previously seen…
Why don’t we build services / infra like this?
(Immutable logs as building blocks approximate this)