Spark Autotuning

Presentation at Spark Summit East


Lawrence Spracklen

February 09, 2017


  1. Spark Autotuning
     Lawrence Spracklen, Alpine Data

  2. Overview
     •  Motivation
     •  Spark Autotuning
     •  Future enhancements

  3. Motivation

  4. We use Spark
     •  End-2-end support
        –  Problem inception through model operationalization
     •  Extensive use of Spark at all stages
        –  ETL
        –  Feature engineering
        –  Model training
        –  Model scoring
  5. Alpine UI

  6. Spark Configuration
     Data scientists must set:
     •  Size of driver
     •  Size of executors
     •  Number of executors
     •  Number of partitions
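
For concreteness, the four settings on this slide correspond to standard Spark properties. A minimal sketch, with purely illustrative values and a hypothetical helper `to_submit_args` (not from the talk), that renders them as `spark-submit --conf` arguments:

```python
# Illustrative values only; the property names are standard Spark settings.
manual_settings = {
    "spark.driver.memory": "4g",        # size of driver
    "spark.executor.memory": "8g",      # size of executors (heap)
    "spark.executor.cores": "4",        # cores per executor
    "spark.executor.instances": "10",   # number of executors
    "spark.default.parallelism": "80",  # default number of RDD partitions
}


def to_submit_args(settings):
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(settings):
        args += ["--conf", f"{key}={settings[key]}"]
    return args


print(" ".join(["spark-submit"] + to_submit_args(manual_settings)))
```

Every one of these values has to be chosen by hand today, which is the burden the rest of the talk tries to remove.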
  7. Inefficient
     •  Manual Spark tuning is a time-consuming and inefficient process
        –  Frequently results in a “trial-and-error” approach
        –  Can be hours (or more) before OOM failures occur
     •  Less problematic for unchanging operationalized jobs
        –  Run same analysis every hour/day/week on new data
        –  Worth spending time & resources to fine tune
  8. AutoML
     [Figure: Data Set, Feature engineering, and candidate algorithms Alg #1 … Alg #N]
     1) Investigate N ML algorithms
     2) Tune top performing algorithms
     2) Feature elimination
  9. BI Integration
     •  Sometimes there is no data scientist…
        –  ML behind the scenes deep in the stack
  10. Example Algorithm

  11. Spark Architecture
      •  Driver can be client- or cluster-resident

  12. It depends…
      •  Size & complexity of data
      •  Complexity of algorithm(s)
      •  Nuances of the implementation
      •  Cluster size
      •  YARN configuration
      •  Cluster utilization
  13. Default settings?
      •  Spark defaults are targeted at toy data sets & small clusters
         –  Not applicable to real-world data science
      •  Resource computations often assume a dedicated cluster
         –  Looking at total vCPU and memory resources
      •  Enterprise clusters are typically shared
         –  Applying dedicated computations is wasteful
  14. Spark Overheads

  15. Challenges
      •  Algorithm needs to
         –  Determine compute requirements for current analysis
         –  Reconcile requirements with available resources
         –  Allow users to set specific variables
      •  Changing one parameter will generally impact others
         1. Increase memory per executor → Decrease number of executors
         2. Increase cores per executor → Decrease memory per core
      [N.B. Not just a single “correct” configuration!!!]
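
The two couplings named on this slide can be illustrated with simple arithmetic. This is a toy sketch under an assumed fixed memory budget; the function names and the 96 GB figure are made up for the example:

```python
def executors_that_fit(budget_mem_gb, mem_per_executor_gb):
    """Number of whole executors that fit in a fixed memory budget."""
    return budget_mem_gb // mem_per_executor_gb


def mem_per_core_gb(mem_per_executor_gb, cores_per_executor):
    """Memory each concurrently running task can expect inside one executor."""
    return mem_per_executor_gb / cores_per_executor


# 1. More memory per executor means fewer executors in the same budget.
print(executors_that_fit(96, 8), executors_that_fit(96, 16))

# 2. More cores per executor means less memory per core.
print(mem_per_core_gb(16, 4), mem_per_core_gb(16, 8))
```

Doubling executor memory halves the executor count, and doubling cores halves memory per core, so no parameter can be tuned in isolation.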
  16. Understanding the inputs
      •  Sample from input data set(s)
      •  Estimate
         –  Schema inference
         –  Dimensionality
         –  Cardinality
         –  Row count
      •  Remember to account for compressed formats
      •  Offline background scans feasible
         –  Retain snapshots in metastore
      •  Take into account impact of downstream operations
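
One simple way to turn a sample into a row-count estimate is linear extrapolation from the average sampled row size. The sketch below is an assumption about how such an estimator could look, not the talk's implementation; `compression_ratio` (uncompressed size divided by on-disk size) is a hypothetical parameter standing in for the "compressed formats" caveat:

```python
def estimate_row_count(sample_rows, sample_bytes, total_bytes_on_disk,
                       compression_ratio=1.0):
    """Extrapolate the total row count from a sample.

    sample_rows / sample_bytes describe the uncompressed sample;
    total_bytes_on_disk is the stored size of the full input;
    compression_ratio scales the stored size back to uncompressed bytes.
    """
    if sample_rows <= 0 or sample_bytes <= 0:
        return 0
    avg_row_bytes = sample_bytes / sample_rows
    uncompressed_bytes = total_bytes_on_disk * compression_ratio
    return int(uncompressed_bytes / avg_row_bytes)


def sampled_cardinality(values):
    """Distinct values seen in a sample (a lower bound on true cardinality)."""
    return len(set(values))
```

Ignoring the compression ratio would undersize the job by the same factor, which is why the slide calls it out.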
  17. Consider the entire flow
      [Figure: three example flows, each with a csv input: NRV, Norm, LoR; NRV, PCA, LoR; NRV, 1-Hot, LoR]
  18. Algorithmic complexity
      •  Each operation has unique requirements
         –  Wide range in memory and compute requirements across algorithms
      •  Provide levers for algorithms to provide input on resource asks
         –  Resource heavy or resource light (Memory & CPU)
      •  Per algorithm modifiers computed for all algorithms in Application toolbox
  19. Wrapping OSS
      •  Autotuning APIs are made available to user & OSS algorithms wrapped using our extensions SDK
         –  H2O
         –  MLlib
         –  Etc.
      •  Provide input on relative memory requirements and computational complexity
  20. YARN Queues
      •  Queues frequently used in enterprise settings
         –  Control resource utilization
      •  Variety of queuing strategies
         –  Users bound to queues
         –  Groups bound to queues
         –  Applications bound to queues
      •  Queues can have their own maximum container sizes
      •  Queues can have different scheduling policies
  21. Dynamic allocation
      •  Spark provides support for auto-scaling
         –  Dynamically add/remove executors according to demand
      •  Is preemption also enabled?
         –  Minimize wait time for other users
      •  Still need to determine optimal executor and driver sizing
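
As a reference point, dynamic allocation is switched on through a handful of standard Spark properties. The helper below is a hypothetical sketch that only assembles those properties; executor and driver sizes still have to be supplied separately, as the slide notes:

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Build the Spark properties that enable dynamic allocation."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an executor has been removed.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


print(dynamic_allocation_conf(2, 20))
```

The autotuner's executor-count result can serve as the upper bound here rather than as a fixed instance count.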
  22. Query cluster
      •  Query YARN Resource Manager to get cluster state
         –  Kerberised REST API
      •  Determine key resource considerations
         –  Preemption enabled?
         –  Queue mapping
         –  Queue type
         –  Queue resources
         –  Real-time node utilization
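
Real-time node utilization can be read from the ResourceManager's `/ws/v1/cluster/nodes` endpoint. The sketch below only parses a response body; the HTTP call and Kerberos (SPNEGO) authentication are deliberately omitted, the sample payload is hand-written, and the field names (`availMemoryMB`, `availableVirtualCores`, `state`) are taken from the Hadoop REST documentation as I understand it and should be verified against your Hadoop version:

```python
import json


def summarize_nodes(nodes_json):
    """Return (available_memory_mb, available_vcores) for each RUNNING node.

    `nodes_json` is the response body of GET /ws/v1/cluster/nodes.
    """
    payload = json.loads(nodes_json)
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        (node["availMemoryMB"], node["availableVirtualCores"])
        for node in nodes
        if node.get("state") == "RUNNING"
    ]


# Hand-written sample payload, trimmed to the fields used above.
sample = json.dumps({"nodes": {"node": [
    {"state": "RUNNING", "availMemoryMB": 65536, "availableVirtualCores": 16},
    {"state": "LOST", "availMemoryMB": 0, "availableVirtualCores": 0},
]}})
print(summarize_nodes(sample))
```

The per-node pairs returned here are the kind of input the executor-count calculation needs.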
  23. Executor Sizing
      •  Try to make efficient use of the available resources
         –  Start with a balanced strategy
      •  Consume resources based on their relative abundance

      $$\text{MemPerCore} = \frac{\text{QueueMem}}{\text{QueueCores}} \times \text{MemHungry}$$

      $$\text{CoresPerExecutor} = \min\left(\frac{\text{MaxUsableMemoryPerContainer}}{\text{MemPerCore}},\ 5\right)$$

      $$\text{MemPerExecutor} = \max\left(1\,\text{GB},\ \text{MemPerCore} \times \text{CoresPerExecutor}\right)$$
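
The three executor-sizing formulas on this slide translate almost directly into code. This is a minimal sketch, not the talk's implementation: flooring the core count and clamping it to at least 1 are my assumptions, since the slide does not say how fractional or zero results are handled.

```python
def size_executor(queue_mem_gb, queue_cores, max_container_mem_gb,
                  mem_hungry=1.0, max_cores=5, min_mem_gb=1.0):
    """Return (mem_per_core_gb, cores_per_executor, mem_per_executor_gb)."""
    # MemPerCore = QueueMem / QueueCores * MemHungry
    mem_per_core = queue_mem_gb / queue_cores * mem_hungry
    # CoresPerExecutor = min(MaxUsableMemoryPerContainer / MemPerCore, 5)
    # Assumption: floor to whole cores and never go below one core.
    cores = max(1, min(int(max_container_mem_gb // mem_per_core), max_cores))
    # MemPerExecutor = max(1 GB, MemPerCore * CoresPerExecutor)
    mem_per_executor = max(min_mem_gb, mem_per_core * cores)
    return mem_per_core, cores, mem_per_executor


# Queue with 400 GB and 100 cores, containers capped at 16 GB.
print(size_executor(400, 100, 16))
# Same queue, but a memory-hungry algorithm (MemHungry = 2).
print(size_executor(400, 100, 16, mem_hungry=2.0))
```

With the memory-hungry modifier the executor keeps the same total memory but gives up cores, so each task gets twice the memory.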
  24. Number of executors (1/2)
      •  Determine how many executors can be accommodated per node
      •  Consider queue constraints
      •  Consider testing per node with different core values

      $$\text{AvailableExecutors} = \sum_{n=0}^{\text{ActiveNodes}} \min\left(\frac{\text{AvailableNodeMemory}}{\text{MemoryPerExecutor} + \text{Overheads}},\ \frac{\text{AvailableNodeCores}}{\text{CoresPerExecutor}}\right)$$

      $$\text{UseableExecutors} = \min\left(\text{AvailableExecutors},\ \frac{\text{QueueMemory}}{\text{MemoryPerExecutor}},\ \frac{\text{QueueCores}}{\text{CoresPerExecutor}}\right)$$
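
A sketch of the AvailableExecutors and UseableExecutors formulas follows. It reads the sum as running over active nodes, with the min taken per node, and floors each ratio to whole executors; both readings are assumptions on my part.

```python
def available_executors(nodes, mem_per_executor_gb, cores_per_executor,
                        overhead_gb):
    """Sum, over nodes, the executors each node can host.

    `nodes` is a list of (available_memory_gb, available_cores) pairs.
    """
    total = 0
    for node_mem_gb, node_cores in nodes:
        by_memory = int(node_mem_gb // (mem_per_executor_gb + overhead_gb))
        by_cores = node_cores // cores_per_executor
        total += min(by_memory, by_cores)
    return total


def useable_executors(available, queue_mem_gb, queue_cores,
                      mem_per_executor_gb, cores_per_executor):
    """Cap the cluster-wide count by what the queue itself allows."""
    return min(available,
               int(queue_mem_gb // mem_per_executor_gb),
               queue_cores // cores_per_executor)


nodes = [(64, 16), (40, 16), (10, 8)]
available = available_executors(nodes, 16, 4, overhead_gb=2)
print(available, useable_executors(available, 64, 32, 16, 4))
```

In this example the third node has spare cores but too little memory for even one executor, and the queue's memory limit then trims the total further.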
  25. Number of executors (2/2)
      •  Compute executor count required for memory requirements
      •  Determine final executor count

      $$\text{NeededExecutors} = \frac{\text{CacheSize}}{\text{SparkMemPerExecutor} \times \text{SparkStorageFraction}}$$

      $$\text{FinalExecutors} = \min\left(\text{UseableExecutors},\ \text{NeededExecutors} \times \text{ComputeHungry}\right)$$
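
The NeededExecutors and FinalExecutors formulas can be sketched as below. Rounding up to a whole executor and never returning fewer than one are assumptions added for the example.

```python
import math


def final_executors(cache_size_gb, spark_mem_per_executor_gb,
                    storage_fraction, useable, compute_hungry=1.0):
    """Executors needed to cache the data, capped by what is useable."""
    # NeededExecutors = CacheSize / (SparkMemPerExecutor * SparkStorageFraction)
    needed = cache_size_gb / (spark_mem_per_executor_gb * storage_fraction)
    # FinalExecutors = min(UseableExecutors, NeededExecutors * ComputeHungry)
    # Assumption: round up, and always ask for at least one executor.
    return max(1, min(useable, math.ceil(needed * compute_hungry)))


# 100 GB to cache, 10 GB of Spark memory per executor, half of it for storage.
print(final_executors(100, 10, 0.5, useable=30))
print(final_executors(100, 10, 0.5, useable=12))
```

When the cluster cannot supply enough executors to cache everything, the shortfall is one of the cases the partitioning step has to absorb.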
  26. Data Partitioning
      •  Minimally want a partition per executor core
      •  Increase number of partitions if memory constrained

      $$\text{MinPartitions} = \text{NumExecutors} \times \text{CoresPerExecutor}$$

      $$\text{MemPerTask} = \frac{\text{SparkMemPerExecutor} \times (1 - \text{SparkStorageFraction})}{\text{CoresPerExecutor}}$$

      $$\text{MemPartitions} = \frac{\text{CacheSize}}{\text{MemPerTask}}$$

      $$\text{FinalPartitions} = \max\left(\text{MinPartitions},\ \text{MemPartitions}\right)$$
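
The partition-count formulas on this slide can be sketched as follows; rounding MemPartitions up to a whole partition is my assumption.

```python
import math


def final_partitions(num_executors, cores_per_executor,
                     spark_mem_per_executor_gb, storage_fraction,
                     cache_size_gb):
    """Partition count: at least one per core, more if memory constrained."""
    # MinPartitions = NumExecutors * CoresPerExecutor
    min_partitions = num_executors * cores_per_executor
    # MemPerTask = SparkMemPerExecutor * (1 - SparkStorageFraction) / CoresPerExecutor
    mem_per_task = (spark_mem_per_executor_gb * (1 - storage_fraction)
                    / cores_per_executor)
    # MemPartitions = CacheSize / MemPerTask (rounded up: an assumption)
    mem_partitions = math.ceil(cache_size_gb / mem_per_task)
    # FinalPartitions = max(MinPartitions, MemPartitions)
    return max(min_partitions, mem_partitions)


# 10 executors x 4 cores, 8 GB of Spark memory each, half reserved for storage.
print(final_partitions(10, 4, 8, 0.5, cache_size_gb=20))   # core-bound
print(final_partitions(10, 4, 8, 0.5, cache_size_gb=100))  # memory-bound
```

For the smaller input the per-core minimum dominates; for the larger one the partition count rises so that each task's share fits in its memory.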
  27. Driver sizing
      •  Leverage executor size as baseline
      •  Increase driver memory if analysis requires
         –  Some algorithms are very driver-memory heavy
      •  Reduce executor count if needed to reflect presence of cluster-side driver
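
One possible reading of these driver-sizing rules is sketched below. The slide gives no formula, so everything here is an assumption: `driver_mem_factor` is a hypothetical per-algorithm modifier, and a cluster-resident driver is charged as the number of executor-sized slots its memory occupies.

```python
import math


def size_driver(mem_per_executor_gb, executors, driver_mem_factor=1.0,
                cluster_mode=True):
    """Return (driver_mem_gb, adjusted_executor_count)."""
    # Baseline: same size as an executor, scaled up for driver-heavy algorithms.
    driver_mem_gb = mem_per_executor_gb * driver_mem_factor
    if cluster_mode:
        # A cluster-side driver consumes resources executors would have used.
        displaced = math.ceil(driver_mem_gb / mem_per_executor_gb)
        executors = max(1, executors - displaced)
    return driver_mem_gb, executors


print(size_driver(8, 10))                        # cluster-side driver
print(size_driver(8, 10, driver_mem_factor=2.0)) # driver-heavy algorithm
print(size_driver(8, 10, cluster_mode=False))    # client-side driver
```

A client-resident driver leaves the executor count untouched because it runs outside the cluster's resource pool.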
  28. Sanity checks
      •  Make sure none of the chosen parameters look crazy!
         –  Massive partition counts
         –  Crazy resourcing for small files
         –  Complex jobs squeezed onto overwhelmed clusters
      •  Think about edge cases
      •  Iterate if necessary!
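
The three checks listed on this slide could be expressed as simple guards. This is only a sketch: every threshold below is a made-up placeholder, and `needed_executors` is a hypothetical input representing what the job would ideally get.

```python
def sanity_warnings(partitions, executors, input_size_gb,
                    needed_executors=None,
                    max_partitions=10_000,          # placeholder threshold
                    small_input_gb=1.0,             # placeholder threshold
                    small_input_max_executors=4,    # placeholder threshold
                    min_grant_fraction=0.5):        # placeholder threshold
    """Return human-readable warnings for suspicious configurations."""
    warnings = []
    if partitions > max_partitions:
        warnings.append("massive partition count")
    if input_size_gb < small_input_gb and executors > small_input_max_executors:
        warnings.append("over-resourced for a small input")
    if (needed_executors is not None
            and executors < needed_executors * min_grant_fraction):
        warnings.append("job squeezed onto an overwhelmed cluster")
    return warnings


print(sanity_warnings(partitions=50_000, executors=10, input_size_gb=500))
print(sanity_warnings(partitions=8, executors=50, input_size_gb=0.1))
```

Any warning would be a cue to adjust the inputs and rerun the sizing steps, matching the slide's "iterate if necessary".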
  29. Future Improvements

  30. Future opportunities
      Cloud Scaling
      •  Autoscale cloud EMR clusters
         –  Leverage understanding of required resources to dynamically manage the number of compute nodes
         –  Significant performance benefits and cost savings
      Iterative improvements
      •  Retain decisions from prior runs in metastore
      •  Harvest additional sizing and runtime information
         –  Spark REST APIs
      •  Leverage data to fine tune future runs
  31. Conclusions
      •  Enterprise clusters are noisy, shared environments
      •  Configuring Spark can be labor intensive & error prone
         –  Significant trial and error process
      •  Important to automate the Spark configuration process
         –  Remove requirement for data scientists to understand Spark and cluster configuration details
      •  Software automatically reconciles analysis requirements with available resources
      •  Even simple estimators can do a good job
  32. More details?
      •  Too little time…
         –  Will blog more details, example resourcing and Scala code fragments
  33. Acknowledgements
      •  Rachel Warren – Lead Developer

  34. Thank You. Questions? Come by Booth K3 @spracklen