Slide 1

Slide 1 text

technische universität dortmund, Robotics Research Institute, Section Information Technology
September 2015

User Based Optimization Goals and Performance Evaluation in High Performance Computing
Stephan Schlagkamp

Slide 2

Slide 2 text

Stephan Schlagkamp, 09/21/15

Short Biography
- PhD student in the research training group "Discrete Optimization of Technical Systems under Uncertainty", funded by the German Research Foundation
- Studied Computer Science at TU Dortmund University
  - 2011 B.Sc.
  - 2013 M.Sc.
- Research interests
  - User-based workload modeling
  - Hybrid cloud planning
- Current projects at ISI
  - In-depth analysis of the Mira workload trace
  - HPC vs. HTC

Slide 3

Slide 3 text

Outline
- Motivation and overview: feedback-aware performance evaluation
- Questionnaire QUHCC
- Trace analysis: Mira, Argonne National Lab

Slide 4

Slide 4 text

Feedback-Aware Performance Evaluation
- Evaluation with previously recorded workload traces
  - Crucial difference (E. Shmueli and D. Feitelson. Using site-level modeling to evaluate the performance of parallel system schedulers, MASCOTS 2006)
- A trace is one instantiation of a dynamic process
  - Replaying a trace (D. G. Feitelson and N. Zakay. Preserving user behavior characteristics in trace-based simulation of parallel job scheduling, MASCOTS 2014)
  - User reaction is a "mystery" (D. G. Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2015)
- Gap between theory and practice (U. Schwiegelshohn. How to Design a Job Scheduling Algorithm. JSSPP, 2014)
  - Model user reaction to system performance

[Figures: workload trace -> scheduler -> performance; user -> scheduler -> performance; plot of generated load (0 to 1) against response time, with curves for user reaction and system performance and their stable state]

Slide 5

Slide 5 text

Forms of Feedback
- Working times
  - Beginning and end of the working day
- Work on weekends
  - Decision to work on weekends
- Job characteristics
  - Used resources and runtime

Slide 6

Slide 6 text

User Model

Slide 7

Slide 7 text

Analyze User Behavior
- Data/trace analysis
  - Statistical analyses
  - Assumptions on user behavior
- Analyze users "personally"
  - Questionnaires
  - Working diaries
  - etc.

[Labels: Questionnaire - QUHCC; Analysis of Mira Trace]

Slide 8

Slide 8 text

QUHCC (Questionnaire for User Habits in Compute Clusters)
- Focus group
  - Admins of the compute cluster of the Physics department, TU Dortmund (HTC)
- Questions on
  - Acceptance of waiting times
  - Influence on working times
  - Strategies to cope with limited/slow resources
  - Satisfaction
- Questionnaire
  - Compute cluster of the Physics department, TU Dortmund (HTC)
  - LiDO cluster, TU Dortmund (HPC)
  - 24 participants

Slide 9

Slide 9 text

User Behavior (Strategies)

[Chart: Strategies on faster results]
  Bag of tasks: 13
  Prioritize: 3
  Other cluster: 8
  Arrangements: 8
  None of the above: 4

- Strategies to receive results faster
  1. Submit bag of tasks
  2. Move to other cluster / arrangements

Slide 10

Slide 10 text

User Behavior (Working Times)

[Chart: Influence on working times]
  Earlier in the morning: 7
  Later in the evening: 8
  At night: 8
  On weekends: 16
  None of the above: 5

- Influence on working times
  1. Weekends
  2. At night / later in the evening
  3. Earlier in the morning

Slide 11

Slide 11 text

Correlations

[Figure: correlation matrix, color scale from -1 to 1, over the scales WaitSmall, WaitMedium, WaitLarge, Experience, Satisfaction, WorkingTimes, Strategies, GeneralAdjustment, EgoAdjustment]

- ~5 questions on each scale

Slide 12

Slide 12 text

Acceptance of Waiting Times

[Figure: Acceptable Waiting Times; x-axis: job category (INT, 10M, SMA, 180M, MED, LAR); y-axis: acceptable waiting time, scale x10^4, range 0 to 2]

- Waiting time categories:
  - int: interactive
  - 10 minutes
  - small: 2-4 hours
  - 180 minutes
  - medium: 1-3 days
  - large: more than 3 days

Slide 13

Slide 13 text

Consequences on Job Scheduling
- Dissatisfaction follows an exponential distribution
- Satisfaction rates by slowdown (SD):
    SD = 1.0   100.0%
    SD = 2.0    74.4%
    SD = 2.5    65.4%
    SD = 3.0    55.1%
- Waiting time as an optimization goal
- Search for working time influence in traces

[Figure: Satisfaction CDF (ALL); x-axis: slowdown (0 to 20); y-axis: satisfaction level (0 to 1); curves: CDF, exponential CDF with µ = 3.85 and shift +1]
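The exponential fit named in the figure legend can be sketched as follows. This is a minimal illustration, not the author's code: it assumes that µ = 3.85 is the mean (scale) of the exponential distribution and that "shift +1" means the distribution starts at a slowdown of 1. The function names are hypothetical. The fitted curve is an approximation and need not reproduce the percentages listed on the slide exactly.

```python
import math

MU = 3.85     # assumed: mean (scale) of the fitted exponential distribution
SHIFT = 1.0   # assumed: the distribution starts at slowdown 1


def dissatisfaction_cdf(slowdown):
    """Fitted share of dissatisfied users at a given slowdown (assumed form)."""
    if slowdown <= SHIFT:
        return 0.0
    return 1.0 - math.exp(-(slowdown - SHIFT) / MU)


def satisfaction(slowdown):
    """Fitted share of satisfied users: the complement of the dissatisfaction CDF."""
    return 1.0 - dissatisfaction_cdf(slowdown)


if __name__ == "__main__":
    for sd in (1.0, 2.0, 2.5, 3.0):
        print(f"SD = {sd}: fitted satisfaction {satisfaction(sd):.3f}")
```

Under these assumptions everybody is satisfied at a slowdown of 1, and satisfaction decays monotonically as the slowdown grows.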

Slide 14

Slide 14 text

Think Time
- Think time: tt_{j+1} = s_{j+1} - (s_j + p_j + w_j)
- Correlation between response time and think time (D. G. Feitelson. Looking at data, IPDPS'08)
- Traces:
  - Parallel Workloads Archive
    - LANL CM5
    - KTH-SP2
    - OSC Cluster
    - HPC2N
    - ANL Intrepid
    - SDSC SP2
- tt(r_j) = 0.4826 * r_j + 1779 s
  - Least squares

[Figure: subsequent think time (1000 to 10000) against response time (log scale, 10^1 to 10^4), with linear fit tt]
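The think time definition and the linear model above can be sketched in a few lines. This is a minimal illustration under stated assumptions: s, p and w are read as submission time, processing (run) time and waiting time in seconds, which the slide does not spell out; the `Job` class and all function names are hypothetical; and the plain least-squares routine only shows the technique, since the slide does not say how the data were prepared before fitting.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float  # s_j: submission time in seconds (assumed meaning)
    wait: float    # w_j: waiting time in seconds (assumed meaning)
    run: float     # p_j: processing (run) time in seconds (assumed meaning)


def think_times(jobs):
    """Return (response_time, think_time) pairs for consecutive jobs of one user."""
    ordered = sorted(jobs, key=lambda job: job.submit)
    pairs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        response = prev.wait + prev.run        # r_j = w_j + p_j
        completion = prev.submit + response    # s_j + p_j + w_j
        pairs.append((response, nxt.submit - completion))
    return pairs


def fitted_think_time(response_time):
    """Linear model from the slide: tt(r_j) = 0.4826 * r_j + 1779 s."""
    return 0.4826 * response_time + 1779.0


def least_squares_line(points):
    """Ordinary least-squares fit y = slope * x + intercept over (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A think time computed this way can be negative when a user submits the next job before the previous one has finished.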

Slide 15

Slide 15 text

Analysis of User Behavior at Mira
- General information
  - 49,152 nodes
  - 786,432 processors
  - Minimum allocation: 512 nodes (8192 cores)
  - 78,782 jobs
- Several science fields
- Submission behavior follows a daily and weekly cycle

Slide 16

Slide 16 text

Think Times at Mira
- Think time
  - Considered only if positive and less than eight hours
- No change in the past 20 years
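The filtering rule above (positive and less than eight hours) can be sketched as a simple predicate. This is an illustration only; the function names are hypothetical and think times are assumed to be given in seconds.

```python
EIGHT_HOURS = 8 * 3600  # upper bound in seconds


def is_valid_think_time(tt_seconds):
    """True if the think time is positive and shorter than eight hours."""
    return 0 < tt_seconds < EIGHT_HOURS


def filter_think_times(values):
    """Keep only the think times that satisfy the rule above."""
    return [tt for tt in values if is_valid_think_time(tt)]
```

With this rule, zero and negative values (the next job was submitted before the previous one completed) are dropped, as are gaps of eight hours or more.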

Slide 17

Slide 17 text

Correlations (Response Time)
- Response time = waiting time + runtime

Slide 18

Slide 18 text

Correlations (Run and Waiting Time)
- Influence on the same scale as response time

Slide 19

Slide 19 text

Correlations (Slowdown)
- Slowdown
- Outliers

Slide 20

Slide 20 text

Correlations (Complexity)
- Workload (CPU seconds)
- Job complexity grows with number of nodes and workload

Slide 21

Slide 21 text

Slowdown (Nodes)
- SD <= 2: runtime dominates
- SD > 2: waiting time dominates

[Figure panels: 512 nodes; > 512 nodes]
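The threshold of 2 follows if slowdown is taken as response time divided by runtime, SD = (waiting time + runtime) / runtime. The slide does not state this definition, so it is an assumption here, although it is the common one. Then SD <= 2 is equivalent to waiting time <= runtime. A minimal sketch with hypothetical function names:

```python
def slowdown(wait, run):
    """Slowdown as response time over runtime (assumed definition)."""
    if run <= 0:
        raise ValueError("runtime must be positive")
    return (wait + run) / run


def dominant_component(wait, run):
    """Classify a job by the slide's rule: SD <= 2 means runtime dominates."""
    return "runtime" if slowdown(wait, run) <= 2 else "waiting time"
```

For example, a job that waits 300 s and runs 100 s has a slowdown of 4 and is classified as dominated by waiting time.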

Slide 22

Slide 22 text

Slowdown (Workload)
- SD <= 2: runtime dominates
- SD > 2: waiting time dominates

[Figure panels: <= 1 million CPU s; > 1 million CPU s]

Slide 23

Slide 23 text

Notification
- Think time = "don't know time" + actual think time?
- Notification mechanism:
  - Receive a mail on job completion
- 17,736 out of 78,782 jobs

Slide 24

Slide 24 text

Notification (small jobs)
- 512 nodes

[Figure panels: with notification; without notification]

Slide 25

Slide 25 text

Notification (large jobs)
- > 512 nodes

[Figure panels: with notification; without notification]

Slide 26

Slide 26 text

Conclusions
- Waiting time and runtime seem to have equal influence on subsequent behavior
  - Detailed analysis shows a difference depending on whether runtime or waiting time dominates
- Notification does not influence subsequent behavior
- Next step: question users specifically about these findings

Slide 27

Slide 27 text

References
- Schlagkamp, S., Renker, J.: Acceptance of waiting times in high performance computing. In: HCI International 2015 - Posters' Extended Abstracts. Springer Berlin Heidelberg
- Renker, J., Schlagkamp, S.: QUHCC: Questionnaire for User Habits of Compute Clusters. In: HCI International 2015 - Posters' Extended Abstracts. Springer Berlin Heidelberg
- Schlagkamp, S.: Influence of dynamic think times on parallel job scheduler performances in generative simulations. In: 19th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2015)

Slide 28

Slide 28 text

Thanks for your attention!