
Storm, machine learning and smart meters

Simon Maby
October 31, 2014

Real-time processing with Storm on smart meter data.
Machine learning predictions with R.
Lessons learned using streaming technologies in a Big Data context.
Presented @OpenWorldForum 2014, Paris

Transcript

  1. Storm, machine learning and smart meters
    Simon MABY, Big Data consultant @OCTO Technology
    OCTO, 50, avenue des Champs-Elysées, 75008 Paris, FRANCE
    Tel: +33 (0)1 58 56 10 00, Fax: +33 (0)1 58 56 10 01, www.octo.com, © OCTO 2012
  2. Functional Overview
    Input: smart metering data stream, customer data, static or dynamic pricing, weather forecasts
    Data in motion, data at rest (http://storm-project.net/)
    Output:
    • Simple aggregations, e.g. the national curve
    • Complex aggregations, e.g. curves aggregated by tariff
    • Analytics, e.g. scoring (for each meter)
    • Forecasts, e.g. D+1 forecasts expressed in Wh and in € (adaptive models)
  3. Objectives of the POC
    Evaluate the capability of Storm to process power consumption records:
    • KPIs
    • Adaptive machine learning algorithms to forecast power consumption
    • Joins between real-time data and static historical data
    • Time series processing
    • Performance for 35,000,000 meters?
    Gain a precise understanding of how Storm works and compare it to other CEPs
  4. Zoom: Storm Streaming (N-node cluster)
    • Developed by Twitter for distributed processing of data streams
    • Packaged into the Hortonworks Hadoop platform, runnable on YARN (not plug and play yet, but this will come)
    • Linear scaling with the number of servers, real capacity to operate massive processing
    • Low end-to-end latencies (100 ms)
    • Unit tasks coded with Storm as bolts are automatically multithreaded and distributed among nodes
    • The plumbing, networking, queuing and distribution of the tasks are handled by Storm
    Topology example:
    • Input: data A and B
    • Data B: join with external data
    • Data A: date processing
    • Stream join and sum computation by key
    • Update of the results into a database
    • Sending the results in real time to users
  5. Data generation
    Not all meters are deployed yet: how to evaluate Storm on test data?
    • Training of a machine learning model to replicate load curves
    • The model must generate single curves that are realistic, volatile and random
    • The mean of the curves should correspond to the original mean
    • The generator should output millions of curves within minutes
    Markov generative model: real individual data, learning, simulation
  6. Infrastructure
    10 nodes (commodity hardware)
    • 2 master nodes: CPU 8 cores, RAM 64 GB, network 1 Gb/s
    • 8 nodes: CPU 8 cores, RAM 32 GB, network 1 Gb/s
  7. Storm Vanilla VS Storm Trident
    • Trident is an API within Storm to code topologies in a different way
    • Mini-batches instead of single tuples
    • Same granularity, but you gain the capability to do more global processing on batches
    • Facilitates the implementation of transactionality, exactly-once processing, respect of the data flow order, aggregation operations and writing to a database
    • Batch size is configurable
  8. Trident wins
    Why Trident?
    • Mini-batch is great on larger volumes
    • Need for transactionality: we process power consumption meters!
    • Some aggregation computations are much easier
  9. Analytics
    • Individual scores based on a SAX transformation
    • Power consumption forecasting based on GAM models
  10. R and Storm?
    • Java-R connector: rJava
    • One R instance on each node
    • 12 predictive models deployed in parallel on each node
    • Calls to R are time dependent instead of being data stream dependent
    • R processing is quite long and does not match CEP latency
  11. So what?
    • Complicated integration
    • Hard to train models within Storm, here we are just outputting predictions!
    • Models are independent from the Storm code
  12. Exploration VS Fast Data
    Data Scientist
    • Wants to explore and innovate
    • Weird technologies
    • « What do you mean by unit testing my neural network?? »
    OPS
    • Wants to rationalize
    • « What do you mean by my Scala calls some Python making C? »
  13. Exploration VS Fast Data: a way?
    Fast Data: Spark Streaming, Storm, Akka, reactive web architectures
  14. Is it so bad to train the algorithms out of Storm?
    1. Data cleaning: data gathering, joins, noise reduction, sampling, feature selection or engineering, missing values handling
    • 100% of the data
    • Simple ETL processing
    2. Data analysis: machine learning, classification, clustering, statistics, data visualisation
    • Not so Big Data
    • Sampling (mostly not on the number of examples)
    • Complex, CPU-consuming processing
  15. 2 common steps
    1. Data cleaning: Hadoop (Pig, Hive, Mr.Job Python MapReduce jobs), Spark
    2. Data analysis: Python (Anaconda, IPython, Scikit-Learn, NumPy, SciPy, Pandas), R (RStudio)
  16. Performance testing
    • The Storm UI gives metrics but…
    • Many components are part of the journey (Zookeeper, Redis, JEE, Camel, HDP, DRPC calls…)
    • Performance testing is hard to do and not so reliable
  17. Performance results
    • 20 workers
    • Batch size of 5000
    • High parallelism hint (~40)
    • 200,000 load curves processed per second
    • End-to-end latency < 300 ms
  18. Architecture pattern
    Data capture: structured events, unstructured events
    CEP engine
    • Decision/Action
    • Reporting
    • Result
    In-memory computation and state: time windows, operators, rules; latency: 100 ms
    • Event/Condition/Action
    • Stream-based querying
    • Multidimensional analysis
    • …
    In-memory states, historical data for joins, logging of the input stream
    • NoSQL/distributed cache: Couchbase / Redis / HBase
    • Storage: HDFS, NoSQL
    • Integration: Kafka / Flume…
  19. Conclusions
    Performance
    • Scales well
    • Hard to anticipate for some processing
    • The needs of 90% of information systems are covered with small clusters
    • Must rely on other well-chosen technologies (NoSQL, caching…)
    Usage
    • Java, which everybody knows
    • Transparent resource handling, network routing, distributed processing…
    • Simple, low-level API that allows great granularity and modularity, or high-level aggregations
    • Simple deployment and supervision
    • Good UI
  20. Conclusions on Storm
    Security
    • Wire encryption?
    • User role management (Kerberos?)
    Reliability
    • Transactionality
    • Failover, Nimbus as a SPOF
    • DevOps, hot-swap?
    • Resource management, Storm-on-YARN?
    Maturity
    • Well documented
    • Good community
    • Supported (Hortonworks)
    • Top-level Apache project, active development
  21. Thanks
    The EDF team
    • Marie-Luce Picard, Sigma² project manager
    • Benoit Grossin, research engineer
    • Alexis BONDU, research engineer, author of the generator
    • Bruno JACQUIN, research engineer
    • Charles BERNARD, IT consultant
    • Leely DAIO PIRES DOS SANTOS, research engineer
    • Yannig GOUDE, forecasting expert
    The OCTO team
    • Rémy Saissy, consultant
    • Cyrille MAILLEY, consultant
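
The topology example in the Storm streaming slide (two inputs, date processing on one branch, a join with external data on the other, then a join and a sum by key) can be illustrated with a minimal plain-Python sketch. It does not use the Storm API. The tariff table, the record layout and the function names are all hypothetical, chosen only to make the data flow concrete.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical static reference data (stands in for the "external data" join).
TARIFF_BY_METER = {"m1": "base", "m2": "peak", "m3": "base"}

def parse_reading(record):
    """Date-processing branch: parse the timestamp and truncate it to the hour."""
    meter_id, ts, wh = record
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(minute=0)
    return meter_id, hour, wh

def enrich(meter_id):
    """Join branch: look up the tariff of a meter in the external table."""
    return TARIFF_BY_METER.get(meter_id, "unknown")

def sum_by_key(records):
    """Combine both branches and sum consumption per (tariff, hour) key."""
    totals = defaultdict(float)
    for record in records:
        meter_id, hour, wh = parse_reading(record)
        totals[(enrich(meter_id), hour)] += wh
    return dict(totals)

# Assumed record layout: (meter id, timestamp, consumption in Wh).
readings = [
    ("m1", "2014-10-31 10:00", 120.0),
    ("m2", "2014-10-31 10:30", 300.0),
    ("m3", "2014-10-31 10:30", 80.0),
]
totals = sum_by_key(readings)
```

In a real topology each function would be a separate unit task distributed across nodes, and the totals would be written to a database rather than kept in a local dictionary.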
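
The data generation slide names a Markov generative model trained on real individual curves. The deck gives no details, so the following is only a toy sketch of the general idea, assuming a first-order chain over equal-width consumption bins. The bin count, the value range and the sample curves are invented for illustration, and this sketch makes no attempt to preserve the original mean.

```python
import random
from collections import defaultdict

def fit_markov(curves, n_states, max_wh):
    """Learning step: count state-to-state transitions over the observed curves."""
    width = max_wh / n_states
    counts = defaultdict(lambda: [0] * n_states)
    for curve in curves:
        # Map each value to the index of its equal-width bin.
        states = [min(int(v / width), n_states - 1) for v in curve]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return counts, width

def simulate(counts, width, start_state, length, rng):
    """Simulation step: draw one synthetic curve by walking the transition counts."""
    state, curve = start_state, []
    for _ in range(length):
        # Emit a value drawn uniformly inside the current state's bin.
        curve.append((state + rng.random()) * width)
        weights = counts.get(state)
        if weights and sum(weights) > 0:
            state = rng.choices(range(len(weights)), weights=weights)[0]
        # Otherwise no transition was observed from this state: stay in it.
    return curve

# Invented sample curves (Wh per time step), for illustration only.
real_curves = [
    [100, 150, 400, 380, 120, 90],
    [80, 200, 450, 300, 150, 100],
]
counts, width = fit_markov(real_curves, n_states=5, max_wh=500)
rng = random.Random(42)
synthetic = [simulate(counts, width, start_state=0, length=6, rng=rng) for _ in range(3)]
```

Because simulation only needs the transition counts, many curves can be generated independently, which fits the requirement of producing millions of curves quickly.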
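
The Storm Vanilla VS Storm Trident slide credits mini-batches with making exactly-once processing easier. One common way to get that effect is to store a batch (transaction) id next to the aggregated value and to skip a batch whose id is already stored. The sketch below shows that idea in plain Python, assuming batches are applied in order and a replayed batch carries the same id and the same content. The class and method names are hypothetical and are not the Trident API.

```python
class TransactionalTotal:
    """Running total that ignores a replayed batch (hypothetical helper)."""

    def __init__(self):
        self.total_wh = 0.0
        self.last_txid = None  # id of the last batch applied to the total

    def apply_batch(self, txid, readings_wh):
        # A replay carries the txid already stored with the state: skip it.
        if txid == self.last_txid:
            return self.total_wh
        self.total_wh += sum(readings_wh)
        self.last_txid = txid
        return self.total_wh

state = TransactionalTotal()
state.apply_batch(1, [100.0, 250.0])
state.apply_batch(2, [50.0])
state.apply_batch(2, [50.0])  # replay after a simulated failure: not counted twice
```

With single tuples the same guarantee would need one id per record, which is part of why batching makes aggregation and database writes simpler.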
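
The analytics slide mentions individual scores based on a SAX transformation without saying how the scores are derived. As a reminder of what SAX itself does, here is a minimal sketch: z-normalize the series, average it over equal segments (PAA), then map each segment mean to a letter using breakpoints that split a standard normal distribution into equiprobable regions. The four-letter alphabet, the segment count and the sample series are assumptions for illustration, and the series length is assumed to be a multiple of the segment count.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Return the SAX word of a series (length must be a multiple of n_segments)."""
    mu, sigma = mean(series), pstdev(series)
    # Z-normalize; a constant series maps to all zeros.
    z = [(v - mu) / sigma if sigma else 0.0 for v in series]
    # Piecewise Aggregate Approximation: mean of each equal-length segment.
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    # Breakpoints cut the standard normal into len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    # The letter index is the number of breakpoints below the segment mean.
    return "".join(alphabet[sum(p > b for b in breakpoints)] for p in paa)

# Invented load curve with four flat segments, for illustration only.
word = sax([1, 1, 1, 1, 4, 4, 4, 4, 9, 9, 9, 9, 6, 6, 6, 6], n_segments=4)
```

The resulting word is a compact symbolic summary of the curve shape, which is the kind of representation a per-meter score could then be computed on.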