Harnessing Big Data With Spark

Presentation at Data Summit


Lawrence Spracklen

May 10, 2016

Transcript

  1. Harnessing Big Data with Spark (Lawrence Spracklen, Alpine Data)

  2. Alpine Data

  3. Map Reduce
     •  Allows the distribution of large data computations across a cluster
     •  Computations typically composed of a sequence of MR operations (sketched below)
     [Diagram: Big Data → Map() → Reduce() → Output]
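     As a rough illustration of that map/reduce pattern, the word count below uses plain Scala collections; the input strings are made up, and a real MR job would shard this work across the cluster rather than run it locally.

     // Word count expressed as the map -> shuffle -> reduce pattern.
     val lines = Seq("spark accelerates big data", "big data needs spark")

     val counts = lines
       .flatMap(_.split("\\s+"))                                 // Map: emit one record per word
       .map(word => (word, 1))                                   // key each record
       .groupBy(_._1)                                            // Shuffle: group records by key
       .map { case (word, ones) => (word, ones.map(_._2).sum) }  // Reduce: sum the counts per key

     counts.foreach(println)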
  4. MR Performance
     •  Multiple disk interactions required in EACH MR operation
     [Diagram: disk reads/writes between the Map and Reduce stages]
  5. Performance Hierarchy
     [Chart: read bandwidth across the storage hierarchy – 0.10 GB/s, 0.10 GB/s, 0.60 GB/s, 80 GB/s; roughly a 100X spread]

  6. Optimizing MR
     •  Many companies have significant legacy MR code
        –  Either direct MR or indirect usage via Pig
     •  A variety of techniques to accelerate MR
        –  Apache Tez
        –  Tachyon or Apache Ignite
        –  SystemML
  7. Spark
     •  Several significant advancements over MR
        –  Generalizes two-stage MR into arbitrary DAGs
        –  Enables in-memory dataset caching (see the sketch below)
        –  Improved usability
     •  Reduced disk reads/writes deliver significant speedups
        –  Especially for iterative algorithms like ML
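     A minimal sketch of the caching point, using Spark's Scala RDD API; the HDFS path and log contents are illustrative, not from the talk.

     import org.apache.spark.{SparkConf, SparkContext}

     val sc = new SparkContext(new SparkConf().setAppName("CachingSketch"))

     // Read once, cache in executor memory, then reuse across several actions;
     // without cache() every action below would re-read the file from disk.
     val events = sc.textFile("hdfs:///data/events.log").cache()

     val total  = events.count()                              // first action materializes the cache
     val errors = events.filter(_.contains("ERROR")).count()  // served from the in-memory copy
     println(s"$errors errors out of $total events")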
  8. Perf comparisons
     *http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce

  9. Spark Tuning
     •  Increased reliance on memory introduces a greater requirement for tuning (see the config sketch below)
     •  Need to understand memory requirements for caching
     •  Significant performance benefits associated with “getting it right”
     •  Auto-tuning is coming…
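     A minimal configuration sketch of the memory knobs this tuning revolves around (Spark 1.6-era settings); the values are illustrative, not recommendations from the talk.

     import org.apache.spark.{SparkConf, SparkContext}

     val conf = new SparkConf()
       .setAppName("TuningSketch")
       .set("spark.executor.memory", "8g")           // heap available to each executor
       .set("spark.memory.fraction", "0.6")          // share of heap for execution + storage (unified memory, Spark 1.6+)
       .set("spark.memory.storageFraction", "0.5")   // portion of that share protected for cached data
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // cheaper serialized caching

     val sc = new SparkContext(conf)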
  10. Optimization opportunities
     •  Spark delivers improved ML performance using reduced cluster resources
     •  Enables numerous opportunities
        –  Reduced time to insights
        –  Reduced cluster size
        –  Eliminate subsampling
        –  AutoML
  11. AutoML
     •  Data sets are increasingly large and complex
     •  Increasingly difficult to intuitively “know” the optimal
        –  Feature engineering
        –  Choice of algorithm
        –  Parameterization of the algorithm(s)
     •  Significant manual trial-and-error
     •  Cult of the algorithm
  12. Feature Engineering
     •  Essential for model performance, efficacy, robustness, and simplicity
        –  Feature extraction
        –  Feature selection
        –  Feature construction
        –  Feature elimination
     •  Domain/dataset knowledge is important, but basic automation is feasible (sketched below)
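     One way that basic automation could look in Spark ML, sketched as a hypothetical assemble-then-select pipeline; the column names and feature count are assumptions for illustration.

     import org.apache.spark.ml.Pipeline
     import org.apache.spark.ml.feature.{ChiSqSelector, VectorAssembler}

     // Assemble raw numeric columns into one feature vector...
     val assembler = new VectorAssembler()
       .setInputCols(Array("age", "income", "clicks", "visits"))   // illustrative column names
       .setOutputCol("rawFeatures")

     // ...then keep only the features most predictive of the label.
     val selector = new ChiSqSelector()
       .setNumTopFeatures(2)
       .setFeaturesCol("rawFeatures")
       .setLabelCol("label")
       .setOutputCol("features")

     val pipeline = new Pipeline().setStages(Array(assembler, selector))
     // val model = pipeline.fit(trainingDF)   // trainingDF: an assumed DataFrame with the columns above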
  13. Algorithm selection
     •  Select the dependent column
     •  Indicate classification or regression
     •  Press “go”
     •  Algorithms run in parallel across the cluster (see the sketch below)
     •  Minimally provides a good starting point
     •  Significantly reduces “busy work”
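     A minimal sketch of the “press go” idea in Spark ML: fit a few candidate classifiers on the same data and rank them on a held-out split. The input DataFrame df (with the default "features" and "label" columns) is assumed, and the candidate list is illustrative.

     import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression, RandomForestClassifier}
     import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

     // `df` is an assumed DataFrame with "features" (Vector) and "label" (Double) columns.
     val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)
     val evaluator = new BinaryClassificationEvaluator()   // area under ROC by default

     val results = Seq(
       "LogisticRegression" -> evaluator.evaluate(new LogisticRegression().fit(train).transform(test)),
       "DecisionTree"       -> evaluator.evaluate(new DecisionTreeClassifier().fit(train).transform(test)),
       "RandomForest"       -> evaluator.evaluate(new RandomForestClassifier().fit(train).transform(test))
     )

     // Report every candidate, best first: at minimum a good starting point.
     results.sortBy { case (_, auc) => -auc }
            .foreach { case (name, auc) => println(f"$name%-20s AUC = $auc%.3f") }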
  14. Hyperparameter optimization
     •  Are the default parameters optimal?
     •  How do I adjust them intelligently?
        –  Number of trees? Depth of trees? Splitting criteria?
     •  Tedious trial and error
     •  Overfitting danger
     •  Intelligent automatic search (see the sketch below)
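     Grid search with cross-validation in Spark ML is one basic form of that automatic search; the sketch below tunes a random forest over exactly the questions raised above, with illustrative grid values.

     import org.apache.spark.ml.classification.RandomForestClassifier
     import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
     import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

     val rf = new RandomForestClassifier()

     val grid = new ParamGridBuilder()
       .addGrid(rf.numTrees, Array(20, 50, 100))        // number of trees
       .addGrid(rf.maxDepth, Array(5, 10, 15))          // depth of trees
       .addGrid(rf.impurity, Array("gini", "entropy"))  // splitting criterion
       .build()

     val cv = new CrossValidator()
       .setEstimator(rf)
       .setEvaluator(new BinaryClassificationEvaluator())
       .setEstimatorParamMaps(grid)
       .setNumFolds(3)   // k-fold cross-validation guards against overfitting to a single split

     // val best = cv.fit(train).bestModel   // `train`: an assumed DataFrame with "features"/"label" columns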
  15. Algorithm tuning
     •  Gradient boosted tree parameterization, e.g. (see the mapping below)
        –  # of trees
        –  Maximum tree depth
        –  Loss function
        –  Minimum node split size
        –  Bagging rate
        –  Shrinkage
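     For reference, a sketch of how these knobs map onto Spark ML's GBTClassifier setters; the values are illustrative only, not recommendations from the talk.

     import org.apache.spark.ml.classification.GBTClassifier

     val gbt = new GBTClassifier()
       .setMaxIter(100)              // # of trees (one per boosting iteration)
       .setMaxDepth(6)               // maximum tree depth
       .setLossType("logistic")      // loss function
       .setMinInstancesPerNode(10)   // minimum node split size
       .setSubsamplingRate(0.8)      // bagging rate
       .setStepSize(0.1)             // shrinkage (learning rate)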
  16. AutoML
     [Diagram: a Data Set fanned out across Alg #1 … Alg #N, then narrowed in successive stages]
     1) Investigate N ML algorithms
     2) Tune the top-performing algorithms, with feature engineering
     3) Feature elimination
  17. Spark is for large datasets
     •  If your data fits on a single node…
     •  Other high-performance options exist
     [Chart: run time comparison of random forest implementations]
     *http://datascience.la/benchmarking-random-forest-implementations/
     *http://haifengl.github.io/smile/index.html
  18. Data set size
     •  Large data lakes can consist of many small files
     •  Memory per node increasing rapidly
     *http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
  19. NVDIMMs
     •  Driving significant increases in node memory
        –  Up to 10X increase in density
     •  Coming in late 2016…
  20. Hybrid operators
     •  Time-consuming to maintain multiple ML libraries & manually determine the optimal choice
     •  Develop hybrid implementations that automatically choose the optimal approach (sketched below), based on
        –  Data set size
        –  Cluster size
        –  Cluster utilization
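     A hypothetical sketch of such a hybrid operator: route small inputs to a single-node fit and large ones to a distributed fit. The helper names, size estimate, and threshold are all assumptions for illustration, not Alpine's implementation.

     import org.apache.spark.ml.Transformer
     import org.apache.spark.ml.classification.LogisticRegression
     import org.apache.spark.sql.DataFrame

     // Placeholder backends: a real hybrid operator might wrap a fast
     // single-node library and distributed Spark ML respectively.
     def fitSingleNode(df: DataFrame): Transformer =
       new LogisticRegression().fit(df.coalesce(1))   // all data pulled onto one partition
     def fitDistributed(df: DataFrame): Transformer =
       new LogisticRegression().fit(df)               // spread across the cluster

     def fitHybrid(df: DataFrame, singleNodeMaxBytes: Long = 8L * 1024 * 1024 * 1024): Transformer = {
       // Crude size estimate; a production operator would also weigh cluster
       // size and current cluster utilization, as the slide notes.
       val estimatedBytes = df.count() * df.columns.length * 8L
       if (estimatedBytes < singleNodeMaxBytes) fitSingleNode(df) else fitDistributed(df)
     }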
  21. Single-node performance (1/2)
     *http://www.ayasdi.com/blog/LawrenceSpracklen

  22. Single-node performance (2/2)
     *http://www.ayasdi.com/blog/LawrenceSpracklen

  23. Operationalization
     •  What happens after the models are created?
     •  How does the business benefit from the insights?
     •  Operationalization is frequently the weak link
        –  Operationalizing PowerPoint?
        –  Hand-rolled scoring flows
  24. PFA
     •  Portable Format for Analytics (PFA)
     •  Successor to PMML
     •  Significant flexibility in encapsulating complex data preprocessing
  25. Conclusions
     •  Spark delivers significant performance improvements over MR
        –  Can introduce more tuning requirements
     •  Provides an opportunity for AutoML
        –  Automatically determine good solutions
     •  Understand when it's appropriate
     •  Don't forget about operationalization