
Courbospark: Decision tree for time-series on Spark

Brussels Hadoop Summit 2015 - Data Science track

Simon Maby

April 16, 2015

Transcript

  1. COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK
     Christophe Salperwyck – EDF R&D
     Simon Maby – OCTO Technology – @simonmaby
     Xdata project: www.xdata.fr, grants from the "Investissement d'Avenir" program, 'Big Data' call
  2. AGENDA
     1. Problem description
     2. Implementation
        • Courbotree: presentation of the algorithm
        • From MLlib to Courbospark
     3. Performance
        • Configuration (cluster description, Spark config…)
     4. Feedback on Spark/MLlib
  3. BIG DATA!
     • 1 measurement every 10 minutes
     • 35 million customers
     • Time series: 144 points x 365 days
     → Annual data volume: 1,800 billion records, 120 TB of raw data
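
A quick check of the stated annual volume (the per-customer and customer-count figures are from the slide; the rounding is mine):

$$
144 \;\text{points/day} \times 365 \;\text{days} \times 35 \times 10^{6} \;\text{customers} \approx 1.84 \times 10^{12} \approx 1\,800 \text{ billion records}
$$
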
  4. LOAD CURVES CLASSIFICATION

     Contract type | Region | … | Equipment type | Load curve
     9 kVA         | 75     | … | Elec           | (curve)
     6 kVA         | 22     | … | Gas            | (curve)
     …             | …      | … | …              | …
     12 kVA        | 34     | … | Elec           | (curve)
  5. WHY A DECISION TREE?
     • Easy to understand
     • Ability to explore the model
     • Ability to choose the expressivity of the model
  6. SPLIT CRITERIA: INERTIA
     Goal: find the most different curves depending on an explanatory feature.
     How to split? We can either:
     • minimize the dispersion of the curves within each branch (intra-class inertia), or
     • maximize the differences between the average curves of the branches (inter-class inertia)
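
The slide only names the criterion, so here is a minimal plain-Scala sketch of what the intra-class inertia of a candidate split could look like (the object and method names are mine, with curves assumed to be equal-length Array[Double] and Euclidean distance assumed; this is not the Courbospark code):

```scala
// Sketch only: plain Scala, assuming curves are Array[Double] of equal length
// and a candidate split groups them into left/right branches.
object Inertia {

  // Point-wise average of a group of curves.
  def meanCurve(curves: Seq[Array[Double]]): Array[Double] = {
    val n = curves.length.toDouble
    curves.transpose.map(_.sum / n).toArray
  }

  // Squared Euclidean distance between two curves.
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Intra-class inertia of a group: dispersion around its mean curve.
  def intraInertia(curves: Seq[Array[Double]]): Double = {
    val m = meanCurve(curves)
    curves.map(c => sqDist(c, m)).sum
  }

  // A split is good when the summed intra-class inertia of its branches is low
  // (equivalently, when the inter-class inertia is high).
  def splitCost(left: Seq[Array[Double]], right: Seq[Array[Double]]): Double =
    intraInertia(left) + intraInertia(right)
}
```
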
  7. EXISTING DISTRIBUTED DECISION TREES
     • Scalable Distributed Decision Trees in Spark MLlib. Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
     • A MapReduce Implementation of C4.5 Decision Tree Algorithm. Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
     • PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
     • Distributed Decision Tree Learning for Mining Big Data Streams. Arinto Murdopo, Master thesis, Yahoo! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
  8. HORIZONTAL STRATEGY
     Step 1: compute average curves per bin (e.g. [0:10[, [10:20[) locally on each host (Host 1, Host 2, Host 3).
     Step 2: collect the per-bin averages on a single host and find the best split.
     [Diagram: per-host bins aggregated onto one host]
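
A rough Spark sketch of that two-step pattern, for a single binned numeric feature (the Sample class, bin width and method names are my assumptions, not the actual Courbospark code):

```scala
import org.apache.spark.rdd.RDD

// Sketch only: one numeric feature bucketed into bins; curves are Array[Double].
// Step 1 runs distributed on the executors, step 2 runs on the driver after a small collect.
object HorizontalStrategy {

  case class Sample(featureValue: Double, curve: Array[Double])

  def binOf(v: Double, binWidth: Double): Int = (v / binWidth).toInt

  def averageCurvesPerBin(data: RDD[Sample], binWidth: Double): Map[Int, Array[Double]] = {
    data
      .map(s => (binOf(s.featureValue, binWidth), (s.curve, 1L)))
      // Step 1: sum curves point-wise and count samples, per bin, on each executor.
      .reduceByKey { case ((c1, n1), (c2, n2)) =>
        (c1.zip(c2).map { case (a, b) => a + b }, n1 + n2)
      }
      // Step 2: collect the (small) per-bin aggregates on the driver...
      .collect()
      .map { case (bin, (sum, n)) => bin -> sum.map(_ / n) }
      .toMap
    // ...then the driver scans the candidate splits between consecutive bins
    // and keeps the one with the best inertia criterion.
  }
}
```
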
  9. FROM MLLIB TO COURBOSPARK
     To build the tree (MLlib):
     • Criteria: entropy, Gini, variance
     • Data structure: LabeledPoint
  10. FROM MLLIB TO COURBOSPARK
      To build the tree (Courbospark):
      • Criteria: entropy, Gini, variance, inertia (to compare time series)
      • Data structures: LabeledPoint, TimeSeries
      • Finding split points for nominal features
      For data visualization of the tree:
      • Quantiles on the nodes and leaves
      • Loss of inertia
      • Number of curves per node and leaf
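
For illustration, a hypothetical time-series counterpart of MLlib's LabeledPoint; the TimeSeriesPoint name and shape below are assumptions, not the actual Courbospark data structure:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// MLlib's LabeledPoint: a single numeric label plus a feature vector.
val lp = LabeledPoint(1.0, Vectors.dense(9.0, 75.0, 1.0))

// Hypothetical time-series counterpart: the "label" is a whole curve.
// (Name and fields are illustrative only; not the actual Courbospark class.)
case class TimeSeriesPoint(curve: Array[Double], features: Vector)

val ts = TimeSeriesPoint(
  curve = Array(0.4, 0.5, 0.7, 1.2),          // load curve samples (kW)
  features = Vectors.dense(9.0, 75.0, 1.0)    // contract, region, equipment…
)
```
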
  11. DEALING WITH NOMINAL FEATURES
      Current MLlib implementation for regression: order the categories by their mean on the target (here A, C, B, D).
      Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
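
A small plain-Scala sketch of that ordering trick (illustrative names only): with n categories, only the n-1 splits along the order are tested instead of every subset.

```scala
// Sketch only: order categories by the mean of their target, then test
// only the n-1 "prefix" splits along that order.
def orderedSplits(meanByCategory: Map[String, Double]): Seq[(Set[String], Set[String])] = {
  val ordered = meanByCategory.toSeq.sortBy(_._2).map(_._1)    // e.g. A, C, B, D
  (1 until ordered.length).map { i =>
    val (left, right) = ordered.splitAt(i)
    (left.toSet, right.toSet)                                  // {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
  }
}
```
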
  12. DEALING WITH NOMINAL FEATURES
      Hard to order curves…
      Solution 1: compare the groups of categories two by two, i.e. test every binary partition → {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
      Problem: combinatorial explosion in n, the number of distinct categories. Complexity is O(2^n).
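
For comparison with the ordering trick above, a plain-Scala sketch (illustrative only) of the exhaustive enumeration that solution 1 implies; the 2^(n-1) - 1 partitions are what makes it blow up:

```scala
// Sketch only: enumerate every binary partition of n categories via bitmasks.
// There are 2^(n-1) - 1 distinct partitions into two non-empty groups.
def allBinaryPartitions(categories: Seq[String]): Seq[(Set[String], Set[String])] = {
  val n = categories.length
  // Fix the first category on the left side to avoid counting each split twice.
  (0 until ((1 << (n - 1)) - 1)).map { mask =>
    val left = categories.head +:
      categories.tail.zipWithIndex.collect { case (c, i) if (mask & (1 << i)) != 0 => c }
    (left.toSet, categories.toSet -- left.toSet)
  }
}
```
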
  13. DEALING WITH NOMINAL FEATURES
      Solution 2: Agglomerative Hierarchical Clustering, a bottom-up approach.
      Complexity is O(n^3) – we don't expect n > 100.
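
A naive plain-Scala sketch of solution 2 (illustrative only, not the Courbospark implementation): clusters of categories are merged bottom-up on their average curves until two groups remain, which define the candidate split.

```scala
// Sketch only: bottom-up clustering of the per-category average curves.
def agglomerativeSplit(meanCurveByCat: Map[String, Array[Double]]): (Set[String], Set[String]) = {

  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def centroid(cats: Set[String]): Array[Double] = {
    val curves = cats.toSeq.map(meanCurveByCat)
    curves.transpose.map(_.sum / curves.length).toArray
  }

  // Start with one cluster per category.
  var clusters: Vector[Set[String]] = meanCurveByCat.keySet.map(Set(_)).toVector

  while (clusters.length > 2) {
    // Find the pair of clusters with the closest centroids (O(n^2) per merge,
    // O(n^3) overall - acceptable since n, the number of categories, stays small).
    val pairs = for {
      i <- clusters.indices
      j <- clusters.indices if i < j
    } yield (i, j, sqDist(centroid(clusters(i)), centroid(clusters(j))))
    val (i, j, _) = pairs.minBy(_._3)
    // Merge the closest pair and keep the rest unchanged.
    clusters = (clusters(i) ++ clusters(j)) +: clusters.zipWithIndex
      .collect { case (c, k) if k != i && k != j => c }
  }
  (clusters(0), clusters(1))
}
```
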
  14. LOOKING FOR THE TEST CONFIGURATION
      For a constant global capacity on 12 nodes: 120 cores + 120 GB RAM

      #Executors | RAM per exec. | Cores per exec. | Time on 100 GB of data
      12         | 10 GB         | 10              | 22 minutes
      24         | 5 GB          | 5               | 17 minutes
      60         | 2 GB          | 2               | 12 minutes
      120        | 1 GB          | 1               | 15 minutes
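
As a concrete reading of the best-performing row (60 executors, 2 GB, 2 cores), a sketch of the corresponding Spark-on-YARN settings; only those three values come from the slide, the rest is an assumption:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the 60 x 2 GB x 2 cores configuration expressed as Spark properties.
val conf = new SparkConf()
  .setAppName("courbospark")
  .set("spark.executor.instances", "60")  // #Executors
  .set("spark.executor.memory", "2g")     // RAM per executor
  .set("spark.executor.cores", "2")       // Cores per executor

val sc = new SparkContext(conf)
```
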
  15. FRAMEWORK STABILITY
      Tested on:
      • 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
      • Categorical and continuous variables
      • Bin sizes from 100 to 1000
  16. REAL-LIFE DATASET
      [Chart: processing time in minutes (0-400) vs. data size in GB (0-1400)]
      • 9 executors with 20 GB and 8 cores
      • 10 to 1,000 million load curves (10 numerical and 10 categorical features)
  17. FEEDBACK
      Developers' view:
      • Flawless transition from local to cluster mode
      • Debug mode with an IDE
      • Good performance requires knowledge
  18. FEEDBACK
      Data scientists' view:
      • The API is not very data-oriented…
      • …but now we have Spark SQL and DataFrames!
      • IPython + PySpark
      • Feature engineering vs. model engineering
  19. FEEDBACK
      Ops view:
      • Better than MapReduce
      • Performance is predictable for tested code
      • Runs on YARN
      • Lots of releases; the MLlib code is evolving quickly
  20. FUTURE WORK
      • Unbalanced trees
      • Improve performance
      • Other criteria for time-series comparison
      • Missing values in explanatory features