A video conversation with performance and capacity management veteran Boris Zibitsker about how I saved a multi-million dollar computing platform using a 1-line performance model (at 21:50 minutes). "Best practices" caused the problem.
English and technology:
- UML software modeling
- model train set
- Kim Kardashian
- financial/accounting models
- Amdahl's law
- statistical regression
- numerical mesh simulation
- benchmark workload simulation
- support vector machines
- convolutional neural nets
We need to specify clearly and unambiguously which model
mathematical framework used to assess the validity of performance data (an overlooked necessity)
data + model = information
1 Select performance metrics as inputs: λ, R, S, Q, ...
2 Model is a relationship between those metrics: Q = λ R
3 Model outputs are calculated metrics
4 Compare calculated metrics with (other) measured metrics
5 Repeat until satisfied
Can then project metric values into circumstances that are not measured or not measurable
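The steps above can be sketched in a few lines of Python, using the relationship Q = λ R from step 2 as the model. This is only an illustration: the function names and all three measured values are hypothetical and do not come from the talk.

```python
def model_queue_length(arrival_rate, residence_time):
    """Step 2: the model Q = lambda * R relates the input metrics."""
    return arrival_rate * residence_time


def relative_error(measured, calculated):
    """Step 4: compare a calculated metric with an independently measured one."""
    return (measured - calculated) / measured


# Step 1: input metrics (HYPOTHETICAL values, for illustration only).
arrival_rate = 50.0     # lambda, requests per second
residence_time = 0.2    # R, seconds
measured_queue = 10.4   # Q, measured separately

# Step 3: the model output is a calculated metric.
calculated_queue = model_queue_length(arrival_rate, residence_time)

# Step 4: a large discrepancy suggests the data (or the model) is suspect.
discrepancy = relative_error(measured_queue, calculated_queue)
print(f"calculated Q = {calculated_queue:.2f}, relative error = {discrepancy:.1%}")
```

If the discrepancy is large, step 5 applies: revisit the measurements or the model and repeat.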
[Figure: system diagram. Labels: tape silos; IBM AIX/SP-2, 50 nodes (shown twice); FDDI rings; user Tek X-terminals.]
Remote server   Files   Time (s)
Xfs1            8371    18.57
Xfs2            7113    16.72
NFS1            4781    17.01
NFS2            109     9.41

Observation:
109 files is nearly 128 = 2^7; log2(128) = 7 is close to 9 seconds
4781 is near 4096 = 2^12; log2(4096) = 12 is close to 17 seconds
8371 is near 8192 = 2^13; log2(8192) = 13 is close to 18 seconds
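The observation can be checked with a short Python sketch that rounds log2 of each file count to the nearest integer and sets it beside the measured time. The data come from the table; the variable and function names are illustrative assumptions.

```python
import math

# From the table: server -> (number of files, mean time in seconds).
observations = {
    "Xfs1": (8371, 18.57),
    "Xfs2": (7113, 16.72),
    "NFS1": (4781, 17.01),
    "NFS2": (109, 9.41),
}


def nearest_power_of_two_exponent(count):
    """Exponent e such that 2**e is the power of two nearest to count on a log scale."""
    return round(math.log2(count))


exponents = {
    name: nearest_power_of_two_exponent(files)
    for name, (files, _seconds) in observations.items()
}

for name, (files, seconds) in observations.items():
    e = exponents[name]
    print(f"{name}: {files} files ~ 2^{e} = {2 ** e}, measured {seconds} s")
```

The exponents track the measured seconds only roughly, which is what motivates fitting a proportionality constant rather than reading the logarithm as seconds directly.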
R = k log10(N)
where N is the number of remote-server files and k = 4.57 is the proportionality constant for base-10 logarithms

Table: Log model of mean R times
Remote server   Measured seconds   Log model    %Error
Xfs1            18.57              17.929948    3.446698
Xfs2            16.72              17.606686    -5.303144
NFS1            17.01              16.818079    1.128281
NFS2            9.41               9.312522     1.035894

Model is accurate to within about 5%
But where does logarithmic behavior come from?
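The table can be reproduced approximately with the sketch below. The talk does not say how k was estimated; this sketch assumes a least-squares fit of R = k log10(N) through the origin, which should come out close to the quoted k = 4.57. Function and variable names are illustrative.

```python
import math

# Measured data from the slides: server -> (number of files N, mean seconds R).
measurements = {
    "Xfs1": (8371, 18.57),
    "Xfs2": (7113, 16.72),
    "NFS1": (4781, 17.01),
    "NFS2": (109, 9.41),
}


def fit_k(data):
    """ASSUMED method: least-squares fit of R = k * log10(N) through the origin."""
    numerator = sum(r * math.log10(n) for n, r in data.values())
    denominator = sum(math.log10(n) ** 2 for n, _r in data.values())
    return numerator / denominator


def log_model(k, n_files):
    """Modeled mean time R for a server holding n_files files."""
    return k * math.log10(n_files)


def percent_error(measured, modeled):
    """Signed error relative to the measurement, in percent."""
    return 100.0 * (measured - modeled) / measured


k = fit_k(measurements)
table = {
    name: (r, log_model(k, n), percent_error(r, log_model(k, n)))
    for name, (n, r) in measurements.items()
}

print(f"k = {k:.4f}")
for name, (r, modeled, err) in table.items():
    print(f"{name:5s} {r:6.2f} {modeled:10.6f} {err:10.6f}")
```

Fitting through the origin keeps the model to a single parameter, in keeping with the "1-line model" framing; an intercept term would be a different model from the one shown on the slide.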
Get a Log, You Need a Tree
[Figure: tree diagram]
Level    0    1    2
Number   1    10   100
Level is the logarithm of Number
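The relationship between level and number can be illustrated with a small Python sketch. It assumes a tree with a uniform fan-out of 10, matching the 1, 10, 100 progression on the slide; the function names are hypothetical, and nothing here describes how the font server actually stores its files.

```python
def nodes_at_level(fanout, level):
    """Number of nodes at a given level of a tree with uniform fan-out."""
    return fanout ** level


def level_needed(fanout, count):
    """Smallest level whose node count is at least `count`.

    Uses integer arithmetic, so it is an exact ceiling of log_fanout(count).
    """
    level = 0
    while fanout ** level < count:
        level += 1
    return level


# Reproduce the slide: levels 0, 1, 2 hold 1, 10, 100 nodes.
slide_counts = [nodes_at_level(10, level) for level in range(3)]
print(slide_counts)

# Going the other way, the level grows only logarithmically with the count.
for count in (1, 100, 1000, 15000):
    print(count, "->", level_needed(10, count))
```

Because each step down the tree multiplies the number of reachable items by the fan-out, the number of steps needed to reach any one item grows as the logarithm of the total.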
replacement would NOT have solved anything
3 Problem caused by "best practices" for system management
4 Performance management was completely overlooked
5 Font server held ~15000 files but only ~1000 needed
6 Simple log performance model told the whole story
7 Simple fix with no CapEx cost: prune the tree!
8 300% performance win in shortened launch times!
9 Log model more about explanation than prediction/forecasting