Real-time processing with Storm on smart meter data.
Machine learning predictions with R.
Lessons learned using streaming technologies in a Big Data context.
Presented @OpenWorldForum 2014, Paris
Inputs: static or dynamic pricing, weather forecasts; data in motion and data at rest (http://storm-project.net/)
Output:
• Simple aggregations, e.g. the national curve
• Complex aggregations, e.g. curves aggregated by tariff
• Analytics, e.g. scoring (for each meter)
• Forecasts, e.g. D+1 forecasts expressed in Wh and in € (adaptive models)
To process power consumption records, we need:
• KPIs
• Adaptive machine learning algorithms to forecast power consumption
• Joins between real-time data and static historical data
• Time series processing
• Performance for 35,000,000 meters?
• A precise understanding of how Storm works, and a comparison with other CEPs
Zoom: Storm streaming
• Processing of data streams
• Packaged into the Hortonworks Hadoop Platform, runnable on YARN (not plug and play yet, but coming)
• Linear scaling with the number of servers: real capacity to operate massive processing
• Low end-to-end latencies (~100 ms)
• Unit tasks coded with Storm as bolts are automatically multithreaded and distributed among the nodes
• The plumbing (networking, queuing, distribution of the tasks) is handled by Storm
Topology example:
• Input: data A and B
• Data B: join with external data
• Data A: date processing
• Stream join and sum computation by key
• Update of the results into a database
• Sending of the results in real time to users
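The topology example can be sketched step by step. This is not Storm code: it is a minimal Python simulation of the same pipeline, where the tuple fields (`meter`, `ts`, `wh`) and the external reference table are illustrative assumptions:

```python
from collections import defaultdict

# External reference data that stream B is joined against (illustrative).
EXTERNAL = {"meter-1": "tariff-blue", "meter-2": "tariff-red"}

def process_date(tuple_a):
    """Data A: normalize the timestamp field (stand-in for 'date processing')."""
    tuple_a["ts"] = tuple_a["ts"].replace("/", "-")
    return tuple_a

def join_external(tuple_b):
    """Data B: enrich the tuple with static external data."""
    tuple_b["tariff"] = EXTERNAL.get(tuple_b["meter"], "unknown")
    return tuple_b

def join_and_sum(stream_a, stream_b):
    """Join both streams on the meter key and sum consumption per key."""
    sums = defaultdict(float)
    b_by_meter = {t["meter"]: t for t in map(join_external, stream_b)}
    for t in map(process_date, stream_a):
        b = b_by_meter.get(t["meter"])
        if b is not None:
            sums[b["tariff"]] += t["wh"]
    return dict(sums)  # in the topology: written to a database, pushed to users

a = [{"meter": "meter-1", "ts": "2014/10/30", "wh": 2.5},
     {"meter": "meter-2", "ts": "2014/10/30", "wh": 1.0}]
b = [{"meter": "meter-1"}, {"meter": "meter-2"}]
print(join_and_sum(a, b))  # {'tariff-blue': 2.5, 'tariff-red': 1.0}
```

In Storm each function above would be a bolt, and the framework would handle the routing of tuples by key between them.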
How to evaluate Storm on test data?
• Train a machine learning model to replicate load curves
• The model must generate single curves that are realistic, volatile and random
• The mean of the curves should correspond to the original mean
• The generator should output millions of curves within minutes
Markov generative model: learn on real individual data, then simulate
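The slides do not detail the generator itself; below is a minimal sketch of one possible first-order Markov generator over discretized consumption levels (the number of levels, the discretization and all function names are assumptions):

```python
import random
from collections import defaultdict

def learn_transitions(curves, levels=4):
    """Learn a first-order Markov chain over discretized consumption levels."""
    flat = [v for curve in curves for v in curve]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / levels or 1.0

    def disc(v):  # map a raw value to one of `levels` discrete states
        return min(int((v - lo) / step), levels - 1)

    counts = defaultdict(lambda: defaultdict(int))
    for curve in curves:
        states = [disc(v) for v in curve]
        for s, t in zip(states, states[1:]):
            counts[s][t] += 1
    # Normalize transition counts into probabilities.
    probs = {s: {t: n / sum(d.values()) for t, n in d.items()}
             for s, d in counts.items()}
    centers = [lo + (i + 0.5) * step for i in range(levels)]
    return probs, centers

def simulate(probs, centers, start, length, seed=42):
    """Generate one synthetic load curve by walking the learned chain."""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(length):
        out.append(centers[state])
        choices = probs.get(state)
        if not choices:
            break  # state never seen with a successor in the training data
        nxt, weights = zip(*choices.items())
        state = rng.choices(nxt, weights=weights)[0]
    return out
```

Each simulated curve is a random walk through the learned transition probabilities, so individual curves stay volatile while the average of many curves tends toward the level distribution seen in the training data; generation is cheap enough to produce millions of curves quickly.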
An API within Storm to code topologies in a different way:
• Mini-batches instead of single tuples
• Same granularity, but you gain the capability to do more global processing on batches
• Facilitates the implementation of transactionality, exactly-once processing and respect of the data flow order, as well as aggregation operations and writes to a database
• Batch size is configurable
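The mini-batch idea can be illustrated outside Storm; this is a minimal Python sketch, not the actual API (the batching and commit logic are illustrative assumptions):

```python
from itertools import islice

def batches(stream, size):
    """Group an (in principle unbounded) tuple stream into mini-batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def process(stream, size=3):
    """Aggregate per batch, then commit each batch as one ordered write."""
    committed = []
    for batch_id, batch in enumerate(batches(stream, size)):
        total = sum(batch)  # a global operation over the whole batch,
        #                     impossible tuple by tuple
        committed.append((batch_id, total))  # one write per batch, in order
    return committed

print(process([1, 2, 3, 4, 5, 6, 7], size=3))  # [(0, 6), (1, 15), (2, 7)]
```

Because each batch carries an identifier and is committed as a unit, replays after a failure can be deduplicated, which is what makes exactly-once processing tractable at batch rather than tuple granularity.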
R and Storm?
• 12 predictive models deployed in parallel on each node
• Calls to R are time-dependent instead of being data-stream-dependent
• R processing is quite long and does not match CEP latencies
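One way to make model calls time-dependent rather than data-stream-dependent is to buffer tuples on the fast path and trigger the model on a timer. A minimal sketch, where the buffering scheme and function names are assumptions and `model_call` merely stands in for a slow R invocation:

```python
import threading
import time

buffer, results = [], []
lock = threading.Lock()

def on_tuple(value):
    """Fast path, called for every stream tuple: only buffers, never calls R."""
    with lock:
        buffer.append(value)

def model_call(batch):
    """Stand-in for a slow R prediction call (purely illustrative)."""
    time.sleep(0.05)  # simulated latency, far above typical CEP latencies
    return sum(batch) / len(batch)

def tick():
    """Time-driven path: drain the buffer and call the model on a schedule."""
    with lock:
        batch, buffer[:] = list(buffer), []
    if batch:
        results.append(model_call(batch))

for v in [1.0, 2.0, 3.0, 4.0]:  # tuples arriving from the stream
    on_tuple(v)
tick()  # in production a timer would invoke this, e.g. every second
print(results)  # [2.5]
```

The stream path stays at CEP latency because it never waits on R; the slow prediction runs at its own cadence on whatever accumulated since the last tick.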
DATA SCIENTISTS
• Want to explore and innovate
• Weird technos: "What do you mean by unit testing my neural network??"
OPS
• Want to rationalize
• "What do you mean, my Scala calls some Python making C?"
Feature engineering, missing values handling, data cleaning, data analysis, machine learning (classification, clustering), statistics, data visualisation: not so Big Data.
1. 100% of the data: simple ETL processing
2. Sampling (mostly not on the number of examples): complex, CPU-consuming processing
Is it so bad to train the algorithms out of Storm?
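The question above suggests a split: train offline on a sample, and only score online in the topology. A minimal sketch of that split, where the per-meter baseline "model" is purely illustrative:

```python
import pickle

# Offline (out of Storm): train on a sample, then serialize the model.
def train(history):
    """Illustrative 'model': mean consumption per meter."""
    values = {}
    for meter, wh in history:
        values.setdefault(meter, []).append(wh)
    return {m: sum(v) / len(v) for m, v in values.items()}

blob = pickle.dumps(train([("m1", 2.0), ("m1", 4.0), ("m2", 1.0)]))

# Online (inside a bolt): load the model once, score each tuple at stream latency.
model = pickle.loads(blob)

def score(meter, wh):
    """Deviation of the current reading from the learned baseline."""
    return wh - model.get(meter, 0.0)

print(score("m1", 5.0))  # 2.0
```

Training can then use complex, CPU-consuming tools without any latency constraint, while the topology only performs a cheap lookup and arithmetic per tuple.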
Conclusions
• The needs of 90% of information systems are covered with small clusters
• Must rely on other well-chosen technologies (NoSQL, caching…)
• Usage: Java, everybody knows it
• Transparent resource handling, networking, routing, distributed processing…
• Simple, low-level API that allows great granularity and modularity, or high-level aggregations
• Simple deployment and supervision, good UI
Conclusions on Storm
• Failover: Nimbus as a SPOF
• DevOps: hot-swap?
• Resource management: Storm-on-YARN?
• Maturity: well documented, good community, supported (Hortonworks)
• Top-level Apache project, active development