
User Based Optimization Goals and Performance Evaluation in High Performance Computing

SciTech
September 23, 2015


The performance of parallel schedulers is a crucial factor in the efficiency of high performance computing environments. To improve particular metrics, we must evaluate scheduler performance in realistic testing environments. Since real users submit jobs to these systems, we need to pay special attention to their job submission behavior and its causes. We investigate workload traces to find and model behavioral patterns. Furthermore, we present the results of a survey among users of compute clusters at TU Dortmund University and draw conclusions about important aspects of simulating submission behavior, as well as possible goals for increasing user satisfaction.


Transcript

  1. User Based Optimization Goals and Performance Evaluation in High Performance Computing
     Stephan Schlagkamp, technische universität dortmund, Robotics Research Institute, Section Information Technology, September 2015
  2. Short Biography
     § PhD student at the research training group "Discrete Optimization of Technical Systems under Uncertainty", funded by the German Research Foundation
     § Studied Computer Science at TU Dortmund University
       Ø 2011 B.Sc.
       Ø 2013 M.Sc.
     § Research interests
       Ø User-based workload modeling
       Ø Hybrid cloud planning
     § Current projects at ISI
       Ø In-depth trace analysis of the Mira workload trace
       Ø HPC vs. HTC
  3. Outline
     § Motivation and overview: feedback-aware performance evaluation
     § Questionnaire QUHCC
     § Trace analysis: Mira, Argonne National Lab
  4. Feedback-Aware Performance Evaluation
     § Evaluation with previously recorded workload traces
       Ø Crucial difference (E. Shmueli and D. Feitelson. Using site-level modeling to evaluate the performance of parallel system schedulers, MASCOTS 2006)
     § A trace is one instantiation of a dynamic process
       Ø Replaying a trace (D. G. Feitelson and N. Zakay. Preserving user behavior characteristics in trace-based simulation of parallel job scheduling, MASCOTS 2014)
       Ø User reaction is a "mystery" (D. G. Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2015)
     § Gap between theory and practice (U. Schwiegelshohn. How to Design a Job Scheduling Algorithm. JSSPP, 2014)
       Ø Model user reaction to system performance
     Diagram: feedback loop in which the user generates load, the scheduler produces response times, and the user's reaction to system performance feeds back into the generated load until a stable state is reached (a code sketch of this loop follows below).
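To make the feedback loop concrete, here is a minimal sketch of a generative simulation in which a user's next submission depends on the response time of the previous job. The single-user setup, the toy FIFO "scheduler", and the fixed runtime are assumptions, and this is not the author's simulator; the think-time function reuses the linear fit reported on the think-time slide later in this deck.

```python
# Minimal sketch of a feedback-aware simulation loop: one user, a toy
# single-server FIFO "scheduler", and fixed job runtimes (all assumptions).
# The think-time model tt(r) = 0.4826 * r + 1779 s is the linear fit reported
# later in this deck; everything else is illustrative.

def think_time(response_time_s: float) -> float:
    """Subsequent think time as a linear function of the experienced response time."""
    return 0.4826 * response_time_s + 1779.0


def simulate(num_jobs: int = 10, runtime_s: float = 3600.0) -> list:
    """Response times of a chain of jobs submitted by a single simulated user."""
    response_times = []
    submit = 0.0        # submission time of the current job
    server_free = 0.0   # earliest time the toy scheduler can start the next job
    for _ in range(num_jobs):
        start = max(submit, server_free)   # FIFO: wait until the server is free
        wait = start - submit
        response = wait + runtime_s        # response time = waiting time + runtime
        response_times.append(response)
        server_free = start + runtime_s
        # Feedback: the next job is submitted only after a think time that
        # depends on the response time the user just experienced.
        submit = start + runtime_s + think_time(response)
    return response_times


if __name__ == "__main__":
    print(simulate())
```

The point of such a closed loop, as opposed to replaying a trace, is that longer response times delay subsequent submissions and thus change the generated load the scheduler sees.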
  5. Forms of Feedback
     § Working times
       Ø Beginning and end of the working day
     § Work on weekends
       Ø Decision to work on weekends
     § Job characteristics
       Ø Used resources and runtime
  6. Analyze User Behavior
     § Data/trace analysis
       Ø Statistical analyses
       Ø Assumptions on user behavior
     § Analyze users "personally"
       Ø Questionnaires
       Ø Working diaries
       Ø etc.
     Two approaches pursued in this talk: the questionnaire QUHCC and an analysis of the Mira trace.
  7. QUHCC (Questionnaire for User Habits in Compute Clusters)
     § Focus group
       Ø Admins of the compute cluster of the Physics department, TU Dortmund (HTC)
     § Questions on
       Ø Acceptance of waiting times
       Ø Influence on working times
       Ø Strategies to cope with limited/slow resources
       Ø Satisfaction
     § Questionnaire conducted at
       Ø Compute cluster of the Physics department, TU Dortmund (HTC)
       Ø LiDO cluster, TU Dortmund (HPC)
       Ø 24 participants
  8. User Behavior (Strategies)
     § Strategies to receive results faster
       1. Submit bag of tasks
       2. Move to another cluster / make arrangements
     Chart "Strategies on faster results" (number of answers): bag of tasks 13, prioritize 3, other cluster 8, arrangements 8, none of the above 4.
  9. User Behavior (Working Times)
     § Influence on working times
       1. Weekends
       2. At night / later in the evening
       3. Earlier in the morning
     Chart "Influence on working times" (number of answers): earlier in the morning 7, later in the evening 8, at night 8, on weekends 16, none of the above 5.
  10. Correlations
      § ~5 questions on each scale
      Figure: correlation matrix of the questionnaire scales WaitSmall, WaitMedium, WaitLarge, Experience, Satisfaction, WorkingTimes, Strategies, GeneralAdjustment, EgoAdjustment (values from -1 to 1; a sketch of the computation follows below).
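As an illustration of how such a matrix can be computed, the sketch below aggregates questionnaire items into per-participant scale scores and correlates the scales with pandas. The file name and item column names are assumptions; the actual QUHCC items and scoring are not reproduced here.

```python
# Sketch: scale-level correlation matrix from questionnaire responses.
# Assumptions: answers are stored in "quhcc_responses.csv" with item columns
# named "<Scale>_<item>", e.g. "WaitSmall_1" ... "WaitSmall_5" (roughly five
# items per scale, as mentioned on the slide).
import pandas as pd

SCALES = ["WaitSmall", "WaitMedium", "WaitLarge", "Experience", "Satisfaction",
          "WorkingTimes", "Strategies", "GeneralAdjustment", "EgoAdjustment"]

def scale_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Average the items of each scale into one score per participant."""
    scores = {}
    for scale in SCALES:
        items = [col for col in responses.columns if col.startswith(scale + "_")]
        scores[scale] = responses[items].mean(axis=1)
    return pd.DataFrame(scores)

responses = pd.read_csv("quhcc_responses.csv")
print(scale_scores(responses).corr())  # Pearson correlations, values in [-1, 1]
```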
  11. Acceptance of Waiting Times
      § Waiting time categories:
        Ø int: interactive (10 minutes)
        Ø small: 2-4 hours (180 minutes)
        Ø medium: 1-3 days
        Ø large: more than 3 days
      Figure: acceptable waiting time per job category (INT, SMA, MED, LAR).
  12. Consequences on Job Scheduling
      § Dissatisfaction follows an exponential distribution
      § Satisfaction rates (exponential CDF fit, µ = 3.85, shift +1):
        Ø SD = 1.0: 100.0%
        Ø SD = 2.0: 74.4%
        Ø SD = 2.5: 65.4%
        Ø SD = 3.0: 55.1%
      § Waiting time optimization goal
      § Search for working time influence in traces
      Figure: satisfaction CDF over slowdown (ALL) with the fitted exponential CDF (a sketch of the model follows below).
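A minimal sketch of the reported model, reading it as the survival function of an exponential distribution with mean µ = 3.85, shifted so that a slowdown of 1 means full satisfaction. The percentages on the slide appear to be read from the empirical CDF, so this fitted curve only approximates them.

```python
import math

MU = 3.85  # mean of the fitted exponential distribution (value from the slide)

def satisfaction(slowdown: float) -> float:
    """Modeled satisfaction for a given slowdown; shift +1 makes SD = 1 ideal."""
    if slowdown <= 1.0:
        return 1.0
    return math.exp(-(slowdown - 1.0) / MU)  # survival function of Exp(mean = MU)

for sd in (1.0, 2.0, 2.5, 3.0):
    print(f"SD = {sd}: modeled satisfaction {satisfaction(sd):.1%}")
```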
  13. Think Time
      § Think time: tt_{j+1} = s_{j+1} - (s_j + p_j + w_j), i.e. the time between the completion of job j (submitted at s_j, waiting w_j, running p_j) and the submission of the next job j+1
      § Correlation between response time and think time (D. G. Feitelson. Looking at data, IPDPS'08)
      § Traces from the Parallel Workloads Archive:
        • LANL CM5
        • KTH-SP2
        • OSC Cluster
        • HPC2N
        • ANL Intrepid
        • SDSC SP2
      § Linear least-squares fit: tt(r_j) = 0.4826 * r_j + 1779 s
      Figure: subsequent think time over response time (log scale) with linear fit (code sketch below).
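The definition translates directly into code. The sketch below computes per-user think times from a trace and refits the linear relation by least squares; the DataFrame column names are assumptions standing in for the corresponding Standard Workload Format fields.

```python
# Sketch: per-user think times from a workload trace plus a least-squares fit.
# Assumption: `jobs` is a DataFrame with columns user, submit_s, wait_s, run_s
# (the corresponding fields of the Standard Workload Format used by the
# Parallel Workloads Archive).
import numpy as np
import pandas as pd

def think_times(jobs: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, user_jobs in jobs.sort_values("submit_s").groupby("user"):
        submit = user_jobs["submit_s"].to_numpy()
        response = user_jobs["wait_s"].to_numpy() + user_jobs["run_s"].to_numpy()
        # tt_{j+1} = s_{j+1} - (s_j + w_j + p_j): gap between the completion of
        # job j and the submission of the next job j+1 by the same user
        tt = submit[1:] - (submit[:-1] + response[:-1])
        rows.append(pd.DataFrame({"response_s": response[:-1], "think_s": tt}))
    return pd.concat(rows, ignore_index=True)

def linear_fit(pairs: pd.DataFrame):
    """Least-squares fit think_s ~ a * response_s + b."""
    a, b = np.polyfit(pairs["response_s"], pairs["think_s"], deg=1)
    return a, b  # the deck reports roughly a = 0.4826 and b = 1779 s
```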
  14. Analysis of User Behavior at Mira
      § General information
        Ø 49,152 nodes
        Ø 786,432 processors
        Ø minimum allocation: 512 nodes (8,192 cores)
        Ø 78,782 jobs
      § Several science fields
      § Submission behavior follows daily and weekly cycles
  15. Think Times at Mira
      § Think time
        Ø considered only if positive and less than eight hours
      § No change in the past 20 years
  16. Correlations (Response Time)
      § Response time = waiting time + runtime
  17. Correlations (Run and Waiting Time)
      § Influence on the same scale as response time
  18. Correlations (Slowdown)
      § Slowdown (response time divided by runtime)
      § Outliers
  19. Correlations (Complexity)
      § Workload (CPU seconds)
      § Job complexity grows with number of nodes and workload
  20. Slowdown (Nodes)
      § SD <= 2: runtime dominates
      § SD > 2: waiting time dominates
      Panels: jobs with 512 nodes vs. jobs with more than 512 nodes (see the sketch below for the split).
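The SD = 2 threshold follows from the usual slowdown definition: with SD = (waiting time + runtime) / runtime, SD <= 2 is equivalent to waiting time <= runtime. A short sketch of this split, assuming the same trace columns as in the think-time sketch:

```python
# Sketch: split jobs by whether runtime or waiting time dominates the response
# time, assuming slowdown = (wait_s + run_s) / run_s and the trace columns
# introduced above.
import pandas as pd

def dominance_split(jobs: pd.DataFrame) -> pd.Series:
    slowdown = (jobs["wait_s"] + jobs["run_s"]) / jobs["run_s"]
    # SD <= 2  <=>  wait_s <= run_s, i.e. the runtime dominates
    label = pd.Series("runtime-dominated", index=jobs.index)
    label[slowdown > 2.0] = "waiting-dominated"
    return label
```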
  21. Slowdown (Workload)
      § SD <= 2: runtime dominates
      § SD > 2: waiting time dominates
      Panels: jobs with <= 1 million CPU seconds vs. jobs with > 1 million CPU seconds.
  22. Notification
      § Think time = "don't know time" + actual think time?
      § Notification mechanism:
        Ø Receive a mail on job completion
      § 17,736 out of 78,782 jobs (a comparison sketch follows below)
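One way to test whether notification changes subsequent behavior is to compare the think-time distributions of jobs with and without a completion mail, for example with a two-sample Kolmogorov-Smirnov test. A sketch under the assumption that the think-time table carries a boolean notify flag:

```python
# Sketch: compare think-time distributions with and without a completion mail.
# Assumptions: `pairs` has a think_s column (as computed above) and a boolean
# notify column marking whether the preceding job requested notification.
from scipy.stats import ks_2samp

def notification_effect(pairs):
    with_mail = pairs.loc[pairs["notify"], "think_s"]
    without_mail = pairs.loc[~pairs["notify"], "think_s"]
    statistic, p_value = ks_2samp(with_mail, without_mail)
    return statistic, p_value  # a large p-value is consistent with "no influence"
```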
  23. Notification (small jobs)
      § 512 nodes
      Panels: think times with notification vs. without notification.
  24. Notification (large jobs)
      § > 512 nodes
      Panels: think times with notification vs. without notification.
  25. Conclusions
      § Waiting time and runtime seemed to have equal influence on subsequent behavior
        Ø Detailed analysis shows a difference depending on whether runtime or waiting time dominates
      § Notification does not influence subsequent behavior
      § Investigate users specifically regarding these findings
  26. References
      § Schlagkamp, S., Renker, J.: Acceptance of Waiting Times in High Performance Computing. In: HCI International 2015 - Posters' Extended Abstracts. Springer Berlin Heidelberg
      § Renker, J., Schlagkamp, S.: QUHCC: Questionnaire for User Habits of Compute Clusters. In: HCI International 2015 - Posters' Extended Abstracts. Springer Berlin Heidelberg
      § Schlagkamp, S.: Influence of Dynamic Think Times on Parallel Job Scheduler Performances in Generative Simulations. In: JSSPP 2015 - 19th Workshop on Job Scheduling Strategies for Parallel Processing