
Applying The Universal Scalability Law to Distributed Systems


After demonstrating that both Amdahl's law and the Universal Scalability law (USL) are fundamental, in that they can be derived from queueing theory, the USL is used to quantify the scalability of a wide variety of distributed systems, including Memcache, Hadoop, Zookeeper, AWS cloud, and Avalanche DLT. Contact is also made with some well-known concepts in modern distributed systems, viz., Hasse diagrams, the CAP theorem, eventual consistency and Paxos coordination.

Dr. Neil Gunther

February 16, 2019

Transcript

  1. Applying The Universal Scalability Law
    to Distributed Systems
    Dr. Neil J. Gunther
    Performance Dynamics
    Distributed Systems Conference
    Pune, INDIA
    16 February 2019

  2. Introduction
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  3. Introduction
    It’s all about NONLINEARITIES
    “Using a term like nonlinear science is like referring to the bulk of
    zoology as the study of non-elephant animals.” –Stan Ulam
    Models are a must
    All performance analysis (including scalability analysis) is about nonlinearity. But nonlinear is
    a class name, like cancer: there are at least as many forms of cancer as there are cell types. You
    need to correctly characterize the specific cell type to determine the best treatment. Similarly,
    you need to correctly characterize the performance nonlinearity in order to improve it. The
    only sane way to do that is with the appropriate performance model. Data alone is not
    sufficient.
    1 Algorithms are only half the scalability story
    2 Measurements are the other half
    3 And models are the other other half

  4. Introduction
    What is distributed computing?
    “A collection of independent computers that appears to its users as a
    single coherent system,” –Andy Tanenbaum (2007)
    Anything outside a single execution unit (von Neumann bottleneck)
    PROBLEM: How to connect these?
    [Diagram: Processor, Cache, RAM, Disk, with a “?” for the interconnect between them]
    Tightly coupled vs. loosely coupled
    Many issues are not really new; a question of scale
    Confusing overloaded terminology: function, consistency, linearity,
    monotonicity, concurrency, ....

  5. Introduction
    Easy to be naive about scalability
    COST (Configuration that Outperforms a Single Thread), 2015 [1]:
    Single-threaded scaling can be better than multi-threaded scaling. (And your point is? —njg)
    Scalability! But at what COST?
    Frank McSherry, Michael Isard, Derek G. Murray (Unaffiliated)
    Abstract
    We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads. We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.
    “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham
    [Figure 1: Scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization. The unoptimized implementation “scales” far better, despite (or rather, because of) its poor performance.]
    While this may appear to be a contrived example, we will argue that many published big data systems more closely resemble system A than they resemble system B.
    Gene Amdahl, 1967 [2]:
    Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach. (Wrong then but REALLY wrong now! —njg)
    All single CPUs/cores now tapped out at ≤ 5 GHz since 2005
    SPEC.org benchmarks made the same mistake 30 yrs ago
    SPECrate: multiple single-threaded instances (dumb fix)
    SPEC SDM: Unix multi-user processes (very revealing)
    [1] McSherry et al., “Scalability! But at what COST?”, USENIX conference, 2015
    [2] “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS Conference, 1967

  6. Introduction
    Can haz linear scalability
    Shared nothing architecture [3] — not easy in general [4]
    Tandem Himalaya with NonStop SQL
    Local processing and memory + no global lock
    Linear OLTP scaling on TPC-C up to N = 128 nodes c.1996
    [Diagram: shared-nothing cluster of processor/cache/memory (P, C, M) nodes on an interconnect network]
    [3] D. J. DeWitt and J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing,” Comm. ACM, Vol. 36, No. 6, 85–98 (1992)
    [4] NJG, “Scaling and Shared Nothingness,” 4th HPTS Workshop, Asilomar, California (1993)

  7. Universal Computational Scalability (8)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  8. Universal Computational Scalability (8)
    Motivation for USL
    Pyramid Technology, c.1992
    Vendor    Model            Unix DBMS    TPS-B
    nCUBE     nCUBE2           OPS         1017.0
    Pyramid   MISserver        UNIFY        468.5
    DEC       VaxCluster ×4    OPS          425.7
    DEC       VaxCluster ×3    OPS          329.8
    Sequent   Symmetry 2000    ORA          319.6

  9. Universal Computational Scalability (8)
    Amdahl’s law
    How not to write a formula (Hennessy & Patterson, p.30)
    How not to plot performance data (Wikipedia)

  13. Universal Computational Scalability (8)
    Universal scaling properties
    1. Equal bang for the buck (α = 0, β = 0): everybody wants this
    2. Diminishing returns (α > 0, β = 0): everybody usually gets this
    3. Bottleneck limit, ceiling at 1/α (α ≫ 0, β = 0): everybody hates this
    4. Retrograde throughput, rolling off as 1/N (α > 0, β > 0): everybody thinks this never happens

  22. Universal Computational Scalability (8)
    The universal scalability law (USL)
    N: processes provide system stimulus or load
    X_N: response function or relative capacity
    Question: What kind of function?
    Answer: A rational function [5] (nonlinear):
        X_N(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)]
    Three Cs:
    1 Concurrency (0 < γ < ∞)
    2 Contention (0 < α < 1)
    3 Coherency (0 < β < 1)
    [5] NJG, “A Simple Capacity Model of Massively Parallel Transaction Systems,” CMG Conference (1993)
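    As a concrete reading of the formula, here is a minimal sketch of the USL as an R function; the function name and the trial parameter values are mine, for illustration only:

      # Universal Scalability Law: relative capacity at load N (vectorized over N).
      usl <- function(N, alpha, beta, gamma = 1) {
        gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))
      }
      usl(c(1, 2, 4, 8, 16, 32), alpha = 0.05, beta = 0.001)
      # With alpha = beta = 0 this recovers linear scaling: usl(N, 0, 0) == N.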

  26. Universal Computational Scalability (8)
    Measurement meets model
    X(N)/X(1) (throughput data) −→ C_N (capacity metric) ←− N / [1 + α(N − 1) + βN(N − 1)] (USL model)
    [Plot: relative capacity C(N) vs. processes N, showing linear, Amdahl-like, and retrograde scaling curves]

  27. Universal Computational Scalability (8)
    Why is coherency quadratic?
    USL coherency term βN(N − 1) is O(N²)
    A fully connected graph K_N with N nodes (e.g., K3, K6, K8, K16) has
        C(N, 2) = N! / [2!(N − 2)!] = ½ N(N − 1)
    edges (quadratic in N) that represent:
    communication between each pair of nodes [6]
    exchange of data or objects between processing nodes
    actual performance impact captured by magnitude of the β parameter
    [6] This term can also be related to Brooks’ law (too many cooks increase delivery time)
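    A one-line check of that edge count in R (the example values are mine):

      choose(c(3, 6, 8, 16), 2)   # edges in K3, K6, K8, K16: 3, 15, 28, 120
      N <- 16; N * (N - 1) / 2    # same as choose(N, 2), i.e., 120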

  31. Universal Computational Scalability (8)
    Coherency or consistency?
    The CAP Theorem (Eric Brewer, PODC keynote, July 19, 2000):
    Consistency, Availability, Tolerance to network Partitions
    Theorem: You can have at most two of these properties for any shared-data system
    What the man said
    [Venn diagram: C, A, P circles with pairwise overlaps CA, CP, AP]
    Correct Venn diagram

  32. Universal Computational Scalability (8)
    Eventual consistency and monotonicity
    Consistency: partial order of replicates via a binary relation (posets)
    Hasse diagrams: replica events converge (like time) to the largest (cumulative) value
    Monotonicity: any upward Hasse path terminates with the same result
    Developer view is not a performance guarantee
    [Hasse diagram: lattice of integers 2, 4, 7, 9 ordered by the binary max relation, converging upward over time to 9]
    [Hasse diagram: lattice of sets {a}, {b}, {c}, {d} ordered by the binary join relation, converging upward over time to {a, b, c, d}]
    Logical monotonicity = monotonically increasing function
    A monotonic application can exhibit monotonically decreasing scalability

  33. Universal Computational Scalability (8)
    Determining the USL parameters
    Throughput measurements X_N at various process loads N sourced from:
    1 Load test harness, e.g., LoadRunner, JMeter
    2 Production monitoring, e.g., Linux /proc, JMX interface
    Want to determine the α, β, γ that best model the X_N data:
        X_N(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)]
    X_N is a rational function (tricky)
    Brute force: Gene Amdahl 1967 (only α)
    Clever ways:
    1 Optimization Solver in Excel
    2 Nonlinear statistical regression in R (a minimal sketch follows)
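    A minimal sketch of option 2, fitting the USL with base-R nls(); the data frame below is made up for illustration, not measurements from this talk:

      measured <- data.frame(
        N = c(1, 2, 4, 6, 8, 12),            # process load
        X = c(84, 160, 272, 276, 264, 226)   # throughput, e.g., KOPS
      )
      fit <- nls(X ~ gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1)),
                 data  = measured,
                 start = list(alpha = 0.1, beta = 0.01, gamma = measured$X[1]))
      coef(fit)   # estimated alpha, beta, gamma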

  34. Queueing Theory View of the USL (15)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  35. Queueing Theory View of the USL (15)
    Queueing theory for those who can’t wait
    Amdahl’s law for parallel speedup S_A = T_1/T_N with N processors [7]:
        S_A(N) = N / [1 + α(N − 1)]    (1)
    1 Parameter 0 ≤ α ≤ 1 is the serial fraction of time spent running single-threaded
    2 Subsumed by the USL model contention term (with γ = 1 and β = 0)
    3 Msg: Forget linearity, you’re gonna hit a ceiling at lim N→∞ S_A ∼ 1/α
    Some people (all named Frank?) consider Amdahl’s law to be ad hoc:
    Franco P. Preparata, “Should Amdahl’s Law Be Repealed?”, Proc. 6th International Symposium on Algorithms and Computation (ISAAC’95), Cairns, Australia, December 4–6, 1995. https://www.researchgate.net/publication/221543174_Should_Amdahl%27s_Law_Be_Repealed_Abstract
    Frank L. Devai, “The Refutation of Amdahl’s Law and Its Variants,” 17th International Conference on Computational Science and Applications (ICCSA), Trieste, Italy, July 2017. https://www.researchgate.net/publication/318648496_The_Refutation_of_Amdahl%27s_Law_and_Its_Variants
    [7] Gene Amdahl (1967) never wrote down this equation; he measured the mainframe workloads by brute force
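    The subsumption claim in point 2 is easy to check numerically against the usl() sketch from earlier (the function names are mine):

      amdahl <- function(N, alpha) N / (1 + alpha * (N - 1))
      all.equal(amdahl(1:32, 0.05), usl(1:32, alpha = 0.05, beta = 0))
      # TRUE: Amdahl's law is the USL with gamma = 1 and beta = 0.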

  36. Queueing Theory View of the USL (15)
    [Plot: relative capacity C(N) vs. processes N, with linear, Amdahl-like, and retrograde scaling curves]
    Top blue curve is Amdahl approaching a ceiling speedup C(N) ∼ 20
    Which also means α ∼ 0.05, i.e., the app is serialized 5% of the runtime

  42. Queueing Theory View of the USL (15)
    Queue characterization
    [Diagram: a familiar queue and an abstract queue: customers arriving, waiting (time W, length L), in service (time S, m servers at utilization ρ), departing. Time: R = W + S; Space: Q = L + mρ]
    1. Where is the service facility?
    2. Did anything finish yet?
    3. Is anything waiting?
    4. Is anything arriving?

  43. Queueing Theory View of the USL (15)
    Amdahl queueing model
    [Figure 1: Amdahl queueing model: processing nodes coupled to an interconnect network]
    Classical: Queueing model of assembly-line robots (LHS bubbles)
    Classical: Robots that fail go to service stations (RHS) for repairs
    Classical: Repair queues impact manufacturing “cycle time”
    Amdahl: Robots become a finite number of (distributed) processing nodes
    Amdahl: Repair stations become the interconnect network
    Amdahl: Network latency impacts application scalability

  44. Queueing Theory View of the USL (15)
    Amdahl queueing theorem
    Theorem 1 (Gunther 2002)
    Amdahl’s law for parallel speedup is equivalent to the synchronous queueing
    bound on throughput in the machine repairman model of a multi-node
    computer system.
    Proof.
    1 “A New Interpretation of Amdahl’s Law and Geometric Scalability,” arXiv, submitted 17 Oct (2002)
    2 “Celebrity Boxing (and sizing): Alan Greenspan vs. Gene Amdahl,” 28th International
    Computer Measurement Group Conference, December 8–13, Reno, NV (2002)
    3 “Unification of Amdahl’s Law, LogP and Other Performance Models for Message-Passing
    Architectures”, Proc. Parallel and Distributed Computer Systems (PDCS’05) November
    14–16, 569–576, Phoenix, AZ (2005)

  45. Queueing Theory View of the USL (15)
    Queueing bounds on throughput
    X_lin(N) = N / (R_1 + Z) (linear bound)
    X_max = 1 / S_max (saturation bound)
    X_mean(N) = N / (R_N + Z) (average throughput)
    X_sync(N) = N / (N R_1 + Z) (synchronous throughput)
    X_sax = 1 / R_1 (asymptote of X_sync)
    [Figure 2: Average (blue) and synchronous (red) throughput X(N) vs. processing nodes N]

  46. Queueing Theory View of the USL (15)
    The speedup mapping
    1 Z: the mean processor execution time
    2 S_k: service time at the kth (network) queue
    3 R_1: minimum round-trip time, i.e., Σ_k S_k + Z for N = 1
    Amdahl speedup S_A(N) is given by
        S_A(N) = N R_1 / (N Σ_k S_k + Z) — queueing form (2)
        S_A(N) = N / [1 + α(N − 1)] — usual “ad hoc” law (3)
    Connecting (2) and (3) requires the identity
        Σ_k S_k / R_1 ≡ α → 0 as S_k → 0 (processor only); → 1 as Z → 0 (network only)    (4)
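    The step from (2) to (3) is a single substitution; a short check in LaTeX, using only the definitions above (Z = R_1 − Σ_k S_k and the identity (4)):

      % Substitute Z = R_1 - \sum_k S_k and \alpha = \sum_k S_k / R_1 into (2):
      S_A(N) = \frac{N R_1}{N \sum_k S_k + Z}
             = \frac{N R_1}{N \alpha R_1 + (1 - \alpha) R_1}
             = \frac{N}{1 + \alpha (N - 1)}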

  47. Queueing Theory View of the USL (15)
    USL queueing model
    [Figure 3: USL load-dependent queueing model: processing nodes coupled to a load-dependent interconnect]
    Amdahl: Robots become a finite number of (distributed) processing nodes
    Amdahl: Repair stations become the interconnect network
    Amdahl: Network latency impacts application scalability
    USL: Interconnect service time (latency) depends on queue length
    USL: Requests are first exchange-sorted, O(N²)

  48. Queueing Theory View of the USL (15)
    USL load-dependent queueing
    Theorem 2 (Gunther 2008)
    The USL relative capacity is equivalent to the synchronous queueing bound
    on throughput for a linear load-dependent machine repairman model of a
    multi-node computer system.
    Proof.
    1 Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable
    Applications and Services (Springer 2007)
    2 A General Theory of Computational Scalability Based on Rational Functions, arXiv,
    Submitted on 11 Aug (2008)
    3 Discrete-event simulations presented in “Getting in the Zone for Successful Scalability”
    Proc. Computer Measurement Group International Conference, December 7th–12, Las
    Vegas, NV (2008)

  49. Queueing Theory View of the USL (15)
    Experimental results
    [Plot: relative capacity C(N) vs. processing nodes N: synchronized repairman simulation vs. USL model (α = 0.1, β = 0), and load-dependent repairman simulation vs. USL model (α = 0.1, β = 0.001)]
    Discrete-event simulations (Jim Holtman 2008, 2018)

  50. Queueing Theory View of the USL (15)
    Key points for distributed USL
    Each USL denominator term has a physical meaning (not a purely statistical model):
    1 Constant O(1) ⇒ maximum nodal concurrency (no queueing)
    2 Linear O(N) ⇒ intra-nodal contention (synchronous queueing)
    3 Quadratic O(N²) ⇒ inter-nodal exchange (load-dependent sync queueing)
    Useful for interpreting USL analysis to developers, engineers, architects, etc.
    Example 1
    If β > α then look for inter-nodal messaging activity rather than the size of message queues in any particular node
    Worst-case queueing corresponds to an analytic equation (a rational function)
    USL is agnostic about the application architecture and interconnect topology
    So where is all that information contained? (exercise for the reader)

  51. Distributed Application Scaling (22)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  52. Distributed Application Scaling (22)
    Memcached
    “Hidden Scalability Gotchas in Memcached and Friends”
    NJG, S. Subramanyam, and S. Parvu (Sun Microsystems)
    Presented at Velocity 2010

  53. Distributed Application Scaling (22)
    Distributed scaling strategies
    [Diagrams: scaleup (a bigger node) vs. scaleout (more nodes)]

  54. Distributed Application Scaling (22)
    Scaleout generations
    Distributed cache of key-value pairs
    Data pre-loaded from RDBMS backend
    Deploy memcache on previous generation CPUs (not always multicore)
    Single worker thread ok — until next hardware roll (multicore)

  55. Distributed Application Scaling (22)
    Controlled load-tests
    [Plot: memcached throughput X(N) in KOPS vs. worker threads N]

  56. Distributed Application Scaling (22)
    Explains these memcached warnings
    Configuring the memcached server
    Threading is used to scale memcached across CPU’s. The model is
    by ‘‘worker threads’’, meaning that each thread handles concurrent
    connections. ... By default 4 threads are allocated. Setting it to
    very large values (80+) will make it run considerably slower.
    Linux man pages - memcached (1)
    -t
    Number of threads to use to process incoming requests. ... It is
    typically not useful to set this higher than the number of CPU cores
    on the memcached server. Setting a high number (64 or more) of worker
    threads is not recommended. The default is 4.

  57. Distributed Application Scaling (22)
    Load test data in R

  58. Distributed Application Scaling (22)
    R regression analysis
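    Continuing the earlier nls() sketch (the fit object and these variable names are mine), the headline numbers reported on the next slide follow directly from the fitted coefficients:

      p <- as.list(coef(fit))                # alpha, beta, gamma
      Nmax  <- sqrt((1 - p$alpha) / p$beta)  # load at peak throughput
      Xmax  <- p$gamma * Nmax /
               (1 + p$alpha * (Nmax - 1) + p$beta * Nmax * (Nmax - 1))
      Xroof <- p$gamma / p$alpha             # contention-only ceiling, X(1)/alpha
      c(Nmax = Nmax, Xmax = Xmax, Xroof = Xroof)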

  59. Distributed Application Scaling (22)
    USL scalability model
    [Plot: throughput X(N) in KOPS vs. worker threads N with the fitted USL curve]
    Fitted parameters: α = 0.0468, β = 0.021016, γ = 84.89; Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82

  60. Distributed Application Scaling (22)
    Concurrency parameter
    [Plot: same USL fit, with the ideal linear bound (α = 0, β = 0) inset]
    1 γ = 84.89
    2 Slope of the linear bound in KOPS/thread
    3 Estimate of throughput X(1) = 84.89 KOPS at N = 1 thread

  61. Distributed Application Scaling (22)
    Contention parameter
    [Plot: same USL fit, with the 1/α bottleneck ceiling (α ≫ 0, β = 0) inset]
    α = 0.0468
    Waiting or queueing for resources about 4.6% of the time
    Max possible throughput is X(1)/α = 1814.78 KOPS (Xroof)

  62. Distributed Application Scaling (22)
    Coherency parameter
    [Plot: same USL fit, with the retrograde-throughput case (α > 0, β > 0) inset]
    β = 0.0210 corresponds to retrograde throughput
    Distributed copies of data (e.g., caches) have to be exchanged/updated about 2.1% of the time to be consistent
    Peak occurs at Nmax = √((1 − α)/β) = 6.73 threads
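    Plugging the fitted values into that expression confirms the reported peak (a quick arithmetic check, not on the slide):

      N_{max} = \sqrt{\frac{1 - \alpha}{\beta}} = \sqrt{\frac{1 - 0.0468}{0.021016}} \approx 6.73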

  63. Distributed Application Scaling (22)
    Improving scalability performance
    [Plot: speedup S(N) vs. threads N for memcached 1.2.8, 1.3.2, and 1.3.2 + patch]

  64. Distributed Application Scaling (22)
    Sirius and Zookeeper
    M. Bevilacqua-Linn, M. Byron, P. Cline, J. Moore and S. Muir (Comcast), “Sirius: Distributing and Coordinating Application Reference Data”, USENIX ATC 2014
    P. Hunt, M. Konar, F. P. Junqueira and B. Reed (Yahoo), “ZooKeeper: Wait-free coordination for Internet-scale systems”, USENIX ATC 2010

  68. Distributed Application Scaling (22)
    Coordination throughput data
    [Plot: writes per second vs. cluster size (1–15) for Sirius, Sirius-NoBrain, and Sirius-NoDisk, with the USL model curve]
    It’s all downhill on the coordination ski slopes
    But that’s very bad ... isn’t it?
    What does the USL say?

  69. Distributed Application Scaling (22)
    USL scalability model
    [Plot: Sirius writes per second vs. cluster size with the fitted USL curve]
    Fitted parameters: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54

  70. Distributed Application Scaling (22)
    Concurrency parameter
    [Plot: same Sirius USL fit, with the ideal linear bound inset]
    1 γ = 1024.98
    2 Single node is meaningless (need N ≥ 3 for a majority)
    3 Interpret γ as the N = 1 virtual throughput
    4 USL estimates X(1) = 1,024.98 WPS (black square)

  71. Distributed Application Scaling (22)
    Contention parameter
    [Plot: same Sirius USL fit, with the 1/α ceiling inset]
    α = 0.05
    Queueing for resources about 5% of the time
    Max possible throughput is X(1)/α = 20,499.54 WPS (Xroof)
    But Xroof is not feasible in these systems

  72. Distributed Application Scaling (22)
    Coherency parameter
    [Plot: same Sirius USL fit, with the retrograde-throughput case inset]
    β = 0.1651 says retrograde throughput dominates!
    Distributed data being exchanged (compared?) about 16.5% of the time
    (Virtual) peak at Nmax = √((1 − α)/β) = 2.4 cluster nodes
    Shocking, but that’s what Paxos-type scaling looks like

  73. Distributed Application Scaling (22)
    Comparison of Sirius and Zookeeper
    [Left plot: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
    [Right plot: Zookeeper requests per second vs. cluster size (mean with confidence intervals CI_lo, CI_hi); USL fit: α = 0.05, β = 0.157701, γ = 41536.96; Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
    Both coordination services exhibit equivalent USL scaling parameters:
    5% contention delay — not “wait-free” (per the Yahoo paper title)
    16% coherency delay
    despite Sirius being write-intensive and Zookeeper being read-intensive (i.e., 100x throughput)

  74. Distributed Application Scaling (22)
    Distributed Ledger Scalability
    A. Charapko, A. Ailijiang and M. Demirbas “Bridging Paxos and
    Blockchain Consensus”, 2018
    Team Rocket (Cornell), “Snowflake to Avalanche: A Novel
    Metastable Consensus Protocol Family for Cryptocurrencies”, 2018

  75. Distributed Application Scaling (22)
    Avalanche protocol
    VISA can handle 60,000 TPS
    Bitcoin bottlenecks at 7 TPS [8]
    That’s FOUR decades (orders of magnitude) of performance inferiority!
    Avalanche rumor-mongering coordination:
    Leaderless BFT
    Multiple samples of a small nodal neighborhood (n_s)
    Threshold > k · n_s when most of the system has “heard” the rumor
    Greener than Bitcoin (Nakamoto)
    No PoW needed
    [8] K. Stinchcombe, “Ten years in, nobody has come up with a use for blockchain,” Hackernoon, Dec 22, 2017

  76. Distributed Application Scaling (22)
    Avalanche scalability
    [Plot: throughput (TPS) vs. network nodes with the fitted USL curve]
    Fitted parameters: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41

  77. Distributed Application Scaling (22)
    Comparison with Paxos
    [Left plot: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
    [Right plot: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
    Contention: α about 90%
    X(1)/α ⇒ immediate saturation at about 2000 TPS
    Coherency: β extremely small but not zero
    Much slower ‘linear’ degradation than Zookeeper/Paxos
    Saturation throughput maintained over 2000 nodes (100x Sirius)

  78. Distributed Application Scaling (22)
    Hadoop Super Scaling
    NJG, P. J. Puglia, and K. Tomasette, “Hadoop Superlinear
    Scalability: The perpetual motion of parallel performance”, Comm.
    ACM, Vol. 58 No. 4, 46–55 (2015)

  79. Distributed Application Scaling (22)
    Super-scalability is a thing
    [Plot: speedup vs. processors showing superlinear, linear, and sublinear regimes]
    More than 100% efficient!

  80. Distributed Application Scaling (22)
    Seeing is believing
    “Speedup in Parallel Contexts.”
    http://en.wikipedia.org/wiki/Speedup#Speedup_in_Parallel_Contexts
    “Where does super-linear speedup come from?”
    http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from
    “Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst
    Benchmarks.”
    https://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop
    Haas, R. “Scalability, in Graphical Form, Analyzed.”
    http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
    Sutter, H. 2008. “Going Superlinear.”
    Dr. Dobb’s Journal 33(3), March.
    http://www.drdobbs.com/cpp/going-superlinear/206100542
    Sutter, H. 2008. “Super Linearity and the Bigger Machine.”
    Dr. Dobb’s Journal 33(4), April.
    http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306
    “SDN analytics and control using sFlow standard — Superlinear.”
    http://blog.sflow.com/2010/09/superlinear.html
    Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com

  81. Distributed Application Scaling (22)
    Empirical evidence
    16-node Sun Fire X2270 M2 cluster

  82. Distributed Application Scaling (22)
    Smells like perpetual motion

  83. Distributed Application Scaling (22)
    Terasort cluster emulations
    Want a controlled environment to study superlinearity
    Terasort workload sorts 1 TB data in parallel
    Terasort has benchmarked Hadoop MapReduce performance
    We used just 100 GB data input (weren’t benchmarking anything)
    Simulate in AWS cloud EC2 instances (more flexible and much cheaper)
    Enabled multiple runs in parallel (thank you Comcast for cycle $s)
    Table 1: Amazon EC2 Configurations
             Optimized for  Processor Arch  vCPUs  Memory (GiB)  Instance Storage (GB)  Network Performance
    BigMem   Memory         64-bit          4      34.2          1 x 850                Moderate
    BigDisk  Compute        64-bit          8      7             4 x 420                High

  84. Distributed Application Scaling (22)
    USL analysis of BigMem N ≤ 50 nodes
    [Plot: “USL Analysis of BigMem Hadoop Terasort”: speedup S(N) vs. EC2 m2 nodes N]
    Fitted parameters: α = −0.0288, β = 0.000447; Nmax = 47.96, Smax = 73.48, Sroof = NA, Ncross = 64.5

  85. Distributed Application Scaling (22)
    USL analysis of BigMem N ≤ 150 nodes
    [Plot: “USL Model of BigMem Hadoop TS Data”: speedup S(N) vs. EC2 m2 nodes N]
    Fitted parameters: α = −0.0089, β = 9e−05; Nmax = 105.72, Smax = 99.53, Sroof = NA, Ncross = 99.14

  86. Distributed Application Scaling (22)
    Speedup on N ≤ 10 BigMem nodes (1 disk)
    [Plot: “BigMem Hadoop Terasort Speedup Data”: speedup S(N) vs. EC2 m2 nodes N, with superlinear and sublinear regions]

  87. Distributed Application Scaling (22)
    Speedup on N ≤ 10 BigDisk nodes (4 disks)
    [Plot: “BigDisk Hadoop Terasort Speedup Data”: speedup S(N) vs. EC2 c1 nodes N, with superlinear and sublinear regions]

  88. Distributed Application Scaling (22)
    Superlinear payback trap
    Superlinearity is real (i.e., measurable)
    But due to a hidden resource constraint (disk or memory subsystem sizing)
    Sets the wrong normalization for the speedup scale
    Shows up in the USL as a negative α value (cf. GCAP book)
    Theorem 3 (Gunther 2013)
    The apparent advantage of superlinear scaling (α < 0) is inevitably lost due to crossover into a region of severe performance degradation (β > 0) that can be worse than if superlinearity had not been present in the first place. Proof was presented at Hotsos 2013.
    [Plots: superlinear, linear, and sublinear speedup vs. processors; and the superlinear payback crossover]

  89. Distributed in the Cloud (30)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  90. Distributed in the Cloud (30)
    AWS Cloud Application
    “Exposing the Cost of Performance Hidden in the Cloud”
    NJG and M. Chawla
    Presented at CMG cloudXchange 2018

  91. Distributed in the Cloud (30)
    Production data
    Previously measured X and R directly on a test rig
    Table 2: Converting data to performance metrics
    Data    Meaning           Metric             Meaning
    T       Elapsed time      X = C/T            Throughput
    Tp      Processing time   R = (Tp/T)(T/C)    Response time
    C       Completed work    N = X × R          Concurrent threads
    Ucpu    CPU utilization   S = Ucpu/X         Service time
    Example 4 (Coalesced metrics)
    Linux epoch Timestamp interval between rows is 300 seconds
    Timestamp, X, N, S, R, U_cpu
    1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
    1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
    1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
    1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
    1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
    ...
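    A minimal sketch of the Table 2 conversions in R; the raw row is reverse-engineered from the first CSV line above, so it is illustrative rather than measured:

      # One raw 300 s sample: elapsed time T (s), processing time Tp (s),
      # completions C, and CPU utilization Ucpu.
      raw <- data.frame(T = 300, Tp = 50724, C = 150650, Ucpu = 0.45812)
      X <- raw$C / raw$T     # throughput: ~502.2 RPS
      R <- raw$Tp / raw$C    # response time: (Tp/T)(T/C) simplifies to Tp/C, ~0.337 s
      N <- X * R             # concurrent threads via Little's law: ~169
      S <- raw$Ucpu / X      # service time: ~0.000912 s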

  92. Distributed in the Cloud (30)
    Tomcat data from AWS
    [Scatter plot: throughput (RPS) vs. Tomcat threads]

  93. Distributed in the Cloud (30)
    USL nonlinear analysis
    [Plot: throughput (RPS) vs. Tomcat threads with the fitted USL curve]
    Fitted parameters: α = 0, β = 3e−06, γ = 3; Nmax = 539.2, Xmax = 809.55, Nopt = 274.8

  94. Distributed in the Cloud (30)
    Concurrency parameter
    [Plot: same Tomcat USL fit, with the ideal linear bound inset]
    1 γ = 3.0
    2 Smallest number of threads during the 24 hr sample is N > 100
    3 Nonetheless USL estimates throughput X(1) = 3 RPS

  95. Distributed in the Cloud (30)
    Contention parameter
    [Plot: same Tomcat USL fit, with the 1/α ceiling inset]
    α = 0
    No significant waiting or queueing
    Max possible throughput Xroof not defined

  96. Distributed in the Cloud (30)
    Coherency parameter
    [Plot: same Tomcat USL fit, with the retrograde-throughput case inset]
    β = 3 × 10−6 implies very weak retrograde throughput
    Extremely little data exchange
    But entirely responsible for the sublinearity
    And the peak throughput Xmax = 809.55 RPS
    Peak occurs at Nmax = 539.2 threads

  97. Distributed in the Cloud (30)
    Revised USL analysis
    Parallel threads imply linear scaling
    Linear slope γ ∼ 3 (revised fit: γ = 2.65)
    Should be no contention, i.e., α = 0
    Discontinuity at N ∼ 275 threads
    Throughput plateaus, i.e., β = 0
    Saturation occurs at processor utilization U_cpu ≥ 75%
    Linux OS can’t do that!
    Pseudo-saturation due to AWS Auto Scaling policy (hypervisor?)
    Many EC2 instances spun up and down during the 24 hrs

  98. Distributed in the Cloud (30)
    Corrected USL linear model
    [Plot: throughput (RPS) vs. Tomcat threads: a parallel-threads (linear) regime followed by pseudo-saturation]
    Fitted parameters: α = 0, β = 0, γ = 2.65; Nmax = NaN, Xmax = 727.03, Nopt = 274.8

  99. Optimal Node Sizing (38)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  100. Optimal Node Sizing (38)
    1. Canonical sublinear scalability
    [Plot: throughput X(N) vs. processes N; USL fit: α = 0.05, β = 6e−05, γ = 20.8563; Nmax = 125.5, Xmax = 320.47, Xroof = 417.13]
    Amdahl curve (dotted)
    Linear scaling bound (LHS red)
    Saturation asymptote (top red)
    Nopt = 1/α = 20 at the intersection
    USL curve (solid blue) is retrograde
    Nopt shifts right to Nmax = 125
    Throughput is lower than Xroof
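    Why Nopt = 1/α: the linear bound X = γN meets the saturation asymptote X = γ/α exactly there (a one-line derivation, not shown on the slide):

      \gamma N_{opt} = \frac{\gamma}{\alpha}
        \;\Rightarrow\; N_{opt} = \frac{1}{\alpha} = \frac{1}{0.05} = 20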

  101. Optimal Node Sizing (38)
    2. Memcache scalability
    [Left plot: throughput X(N) vs. worker threads; USL fit: α = 0.0468, β = 0.021016, γ = 84.89; Nmax = 6.73, Xmax = 274.87, Nopt = 21.38]
    [Right plot: throughput X(N) vs. worker processes; USL fit: α = 0, β = 0.00052, γ = 17.76; Npeak = 43.87, Xpeak = 394.06, Nopt = Inf]
    Initial scalability (green curve):
    Nmax occurs before Nopt
    Nmax = 6.73
    Nopt = 1/0.0468 = 21.37
    Sun SPARC patch (blue curve):
    Nmax = 43.87
    Now exceeds the previous Nopt
    And throughput increased by approx 50%

  102. Optimal Node Sizing (38)
    3. Cloud scalability
    [Left plot: throughput (RPS) vs. Tomcat threads; α = 0, β = 0, γ = 2.65; Nmax = NaN, Xmax = 727.03, Nopt = 274.8; parallel threads, then pseudo-saturation]
    [Right plot: response time R(N) vs. Tomcat processes; α = 0, β = 0, γ = 2.65; Rmin = 0.38, Sbnk = 0.001365, Nopt = 274.8]
    Throughput profile:
    X(N) data tightly follow the bounds
    Nopt = 274.8 due to the autoscaling policy
    Significant dispersion about the mean X
    Response time profile:
    Autoscaling throttles throughput
    Causes R(N) to increase linearly
    Will users tolerate R(N) for N > Nopt?

  103. Optimal Node Sizing (38)
    4. DLT scalability
    [Left plot: Zookeeper requests per second vs. cluster size; USL fit: α = 0.05, β = 0.157701, γ = 41536.96; Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
    [Right plot: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
    Zookeeper throughput profile:
    The optimum is the smallest configuration
    Likely bigger in real applications
    Coordination kills
    Scaling degrades (that’s how it works!)
    No good for DLT
    Avalanche throughput profile:
    There is no optimum
    Throughput similar to the Zookeeper peak
    Pick a number N
    Throughput is independent of scale
    But the N scale is 100x Zookeeper
    Generally needed for DLT

  104. Summary (45)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  105. Summary (45)
    Summary
    All performance is nonlinear
    Need both data and models (e.g., USL)
    USL is a nonlinear (rational) function
    USL is architecture agnostic
    All scalability information contained in parameters:
    1 α: contention, queueing
    2 β: coherency, consistency (pairwise data exchange)
    3 γ: concurrency, X(1) value
    Apply USL via nonlinear statistical regression
    CAP-like consistency associated with magnitude of quadratic β term
    Can apply USL to both controlled (testing) and production platforms
    USL analysis often different from paper designs and algorithms

  106. Summary (45)
    Thank you for the (virtual) invitation!
    www.perfdynamics.com
    Castro Valley, California
    Twitter twitter.com/DrQz
    Facebook facebook.com/PerformanceDynamics
    Blog perfdynamics.blogspot.com
    Training classes perfdynamics.com/Classes
    [email protected]
    +1-510-537-5758