
A new streaming computation engine for real-time analytics by Michael Barton at Big Data Spain 2015


Apache Spark has successfully built on Hadoop infrastructure to encompass real-time processing, moving from rigid Map-Reduce operations to general-purpose functional operations distributed across a cluster of machines. However, data storage has become a black box. The source data for a query has to be retrieved in full and sent through the analysis pipeline, rather than being processed where it is stored, as in traditional database systems. This introduces significant cost, both in network utilisation and in the time taken to produce a result.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-14.html

Big Data Spain

October 21, 2015


Transcript

  1. Let's say I work in a bank. We have a big, complicated trading system
     (entry point → exit point). Can we calculate the latency of each order?
  2. Simple to publish data. Each message at the entry and exit points is an
     HTTP POST of a JSON document:

     { "MsgDirection": "I", "SendingTime": "2015-04-05T14:30Z", … }
     { "MsgDirection": "I", "SendingTime": "2015-04-05T14:31Z", … }
     { "MsgDirection": "O", "SendingTime": "2015-04-05T14:33Z", … }
     { "MsgDirection": "O", "SendingTime": "2015-04-05T14:32Z", … }
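Since the deck's point is that ingestion is just an HTTP POST of a JSON document, a minimal publisher sketch in Scala might look like the following. The endpoint host, port, and both helper functions are assumptions for illustration, not a documented Valo API; only the field names (MsgDirection, SendingTime) and the stream path come from the slides.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Build the JSON body for one FIX latency event. The field names follow
// the deck; this helper itself is hypothetical.
def fixEvent(direction: String, sendingTime: String): String =
  s"""{ "MsgDirection": "$direction", "SendingTime": "$sendingTime" }"""

// Hypothetical publish call: POST the event to a stream URI. The host and
// port below are placeholders.
def publish(streamUri: String, body: String): HttpRequest =
  HttpRequest.newBuilder(URI.create(streamUri))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build()

val req = publish("http://localhost:8888/streams/demo/fix/exchange",
                  fixEvent("I", "2015-04-05T14:30Z"))
// To actually send it:
// HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())
```

The send itself is left commented out, since it needs a running endpoint; everything up to building the request is plain JDK 11+ HTTP client usage.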
  3. Tell Valo the schema:

     { "schema": {
         "version": "1.0.0",
         "config": {},
         "topDef": {
           "type": "record",
           "properties": {
             ...
             "MsgDirection": { "type": "string",
               "comments": "I for input message, O for output" },
             "Account": { "type": "string", "optional": "true",
               "comments": "Account mnemonic as agreed between buy and sell sides" },
             ...
             "SendingTime": { "type": "datetime",
               "comments": "Time of message transmission (always expressed in UTC (Univ },
             "Side": { "type": "string", "optional": "true",
               "comments": "Side" },
             "Symbol": { "type": "string", "optional": "true",
               "comments": "Ticker symbol. Common, human understood representation of t },
             ...

     The four sample messages from the previous slide conform to this schema.
  4. Let's use it:

     from historical /streams/demo/fix/exchange
       where MsgDirection == "I" into left
     inner join
     from historical /streams/demo/fix/exchange
       where MsgDirection == "O" into right
     on left.ClOrdID == right.ClOrdID
       && left.MsgType == "New Order Single"
       && right.MsgType == "Execution Report"
     select left.ClOrdID as orderId,
            diff(left.SendingTime, right.SendingTime) as resTime
     select orderId, millis(resTime) as latency
  5. The same query, read as a pipeline: the two "from historical" clauses are
     the From stage, the "where" clauses the Filter stage, the "inner join …
     on" clause the Join stage, and the "select" clauses the Output stage.
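The query pairs each inbound "New Order Single" with its outbound "Execution Report" on ClOrdID and reports the time difference in milliseconds. A plain-Scala sketch of the same computation over an in-memory collection (the Msg case class and sample data are illustrative, not Valo API; field names follow the deck's FIX stream):

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Illustrative message shape mirroring the stream's fields.
final case class Msg(msgDirection: String, msgType: String,
                     clOrdID: String, sendingTime: Instant)

// Inner-join inputs to outputs on ClOrdID and compute latency in millis,
// mirroring the from / where / join / select stages of the Valo query.
def latencies(msgs: Seq[Msg]): Map[String, Long] = {
  val ins  = msgs.filter(m => m.msgDirection == "I" && m.msgType == "New Order Single")
  val outs = msgs.filter(m => m.msgDirection == "O" && m.msgType == "Execution Report")
  (for {
    l <- ins
    r <- outs
    if l.clOrdID == r.clOrdID
  } yield l.clOrdID -> ChronoUnit.MILLIS.between(l.sendingTime, r.sendingTime)).toMap
}

val sample = Seq(
  Msg("I", "New Order Single", "ord-1", Instant.parse("2015-04-05T14:30:00Z")),
  Msg("O", "Execution Report", "ord-1", Instant.parse("2015-04-05T14:30:00.250Z"))
)
// latencies(sample) gives ord-1 a latency of 250 ms
```

The point of the deck, of course, is that Valo pushes this filter and join down to where the data lives instead of materialising both sides in memory as this sketch does.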
  6. It's just VALO: a cluster of nodes on commodity hardware, with a uniform
     architecture and no special leaders or roles. Streams of data are
     immutable, append-only and distributed, with eventual consistency in
     failure cases.
  7. We know our storage. "From /streams/demo/fix/exchange" resolves to
     repositories on each node (nodeA, …): a Semi-structured Repo, a Time
     Series Repo, and so on.
  8. We know our storage.
     Semi-structured Repo: hierarchical document data, flexible schemas,
       Lucene indexes, taxonomies and facets.
     Time Series Repo: well-defined schema, custom I/O layer, bitmap and
       B+Tree indices.
  9. Push down the query: the From, Filter and Join stages execute directly
     against the data and indexes in storage (the Semi-structured and Time
     Series Repos on nodeA, …).
  10. Let's say I work in a hospital, with ECG monitors in ward-G3, ward-G5
      and intensive-care. Can we look for unusual activity in the ECG
      monitors?
  11. Here's an interesting paper: "Assumption-Free Anomaly Detection in Time
      Series" by Li Wei, Nitin Kumar, Venkata Lolla, Eamonn Keogh, Stefano
      Lonardi and Chotirat Ann Ratanamahatana, University of California,
      Riverside, Department of Computer Science & Engineering, Riverside, CA
      92521, USA.
      http://alumni.cs.ucr.edu/~ratana/SSDBM05.pdf
      http://alumni.cs.ucr.edu/~wli/SSDBM05/
  12. So let's implement it! (Full algorithm code omitted for brevity.)

      @ValoOnlineFunction("anomaly")
      @ValoOnlineFunctionAnnotation(SchemaAnnotations.ANALYTICS.ANOMALY)
      @ValoOnlineFunctionDescription("Unsupervised anomaly detection for time series")
      object OnlineAnomalyDetectionFactory
          extends OnlineAlgorithmFactory[OnlineAnomalyDetectionParams, Double, OnlineAnomalyDetectionResult] {

        override val isCommutative: Boolean = false
        override val isAssociative: Boolean = false
        override val isMergeable: Boolean = false

        override def getDependency(windowType: WindowType): AlgoDependency =
          AlgoDependency.NoDependencies

        override def init(args: OnlineAnomalyDetectionParams): OnlineAlgorithm[Double, OnlineAnomalyDetectionResult] = {
          new OnlineAnomalyDetection(args.lagWindow, args.leadWindow, args.featureSize, 3, 5)
        }
      }

      final case class OnlineAnomalyDetectionParams(lagWindow: Int, leadWindow: Int, featureSize: Int)
      final case class OnlineAnomalyDetectionResult(isTraining: Boolean, point: Double, signal: Double)
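The actual algorithm body is omitted on the slide. As a heavily simplified sketch of the lag/lead-window idea behind it: buffer recent points, treat the older lag window as the reference distribution, and score the newer lead window by how far its mean sits from that reference. This scoring rule is a stand-in chosen for illustration, not the omitted Valo implementation or the paper's exact method:

```scala
// Simplified window-based anomaly scoring in the spirit of the paper's
// lag/lead comparison. While the buffer is still filling, the detector
// reports isTraining = true and a zero signal.
final class AnomalySketch(lagWindow: Int, leadWindow: Int) {
  private val buf = scala.collection.mutable.Queue.empty[Double]

  /** Feed one point; returns (isTraining, signal). */
  def update(point: Double): (Boolean, Double) = {
    buf.enqueue(point)
    if (buf.size > lagWindow + leadWindow) buf.dequeue()
    if (buf.size < lagWindow + leadWindow) (true, 0.0)
    else {
      val (lag, lead) = buf.toSeq.splitAt(lagWindow)
      val lagMean  = lag.sum / lag.size
      val lagStd   = math.sqrt(lag.map(x => (x - lagMean) * (x - lagMean)).sum / lag.size)
      val leadMean = lead.sum / lead.size
      // Distance of the lead-window mean from the lag-window distribution.
      (false, math.abs(leadMean - lagMean) / (lagStd + 1e-9))
    }
  }
}

val detector = new AnomalySketch(lagWindow = 8, leadWindow = 2)
val flat  = Seq.fill(9)(1.0).map(detector.update) // still training on a flat series
val spike = detector.update(100.0)                // sudden jump produces a large signal
```

The real OnlineAnomalyDetection above also takes a featureSize and two further parameters, which this sketch ignores.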
  13. Simple to publish data. The monitors in ward-G3 and intensive-care POST
      their readings:

      { "ts": "2015-04-05T14:30Z", "contributor": "ward-g3-monitor0", "value": 0.25443 }
      { "ts": "2015-04-05T14:30Z", "contributor": "ward-g3-monitor1", "value": 0.36432 }
      { "ts": "2015-04-05T14:30Z", "contributor": "intensive-care", "value": 0.46580 }
      { "ts": "2015-04-05T14:31Z", "contributor": "ward-g3-monitor0", "value": 0.26073 }
  14. Let's use it:

      from /streams/demo/infrastructure/ecg
      group by contributor
      select contributor, anomaly(200, 40, 20, value) as result
      emit every value

      The arguments map onto the parameter class from the previous slide:
      OnlineAnomalyDetectionParams(lagWindow = 200, leadWindow = 40, featureSize = 20).
  15. One type of monitor is consistently having issues and producing bad
      results. Can we monitor which ones, across ward-G3, ward-G5 and
      intensive-care? Can we re-use the analysis?
  16. Domains: live-updating sets of contributors to data. For example, an
      "ACME Monitors" domain is defined by manufacturer == "ACME", and the
      same query can be re-used across domains such as ward-G3, ward-G5 and
      intensive-care.
  17. Can we re-use the analysis? Same algorithm, similar queries, real-time
      and historical.
      (Image: http://collections.rmg.co.uk/mediaLib/476/media-476182/large.jpg,
      CC BY-NC-SA)