Dr. Neil Gunther
February 16, 2019

# Applying The Universal Scalability Law to Distributed Systems

After demonstrating that both Amdahl's law and the Universal Scalability Law (USL) are fundamental, in that they can be derived from queueing theory, the USL is used to quantify the scalability of a wide variety of distributed systems, including Memcache, Hadoop, Zookeeper, the AWS cloud, and the Avalanche DLT. Contact is also made with some well-known concepts in modern distributed systems, viz., Hasse diagrams, the CAP theorem, eventual consistency, and Paxos coordination.

## Transcript

1. Applying The Universal Scalability Law to Distributed Systems
Dr. Neil J. Gunther
Performance Dynamics℠
Distributed Systems Conference, Pune, INDIA
16 February 2019
© 2019 Performance Dynamics. Applying The Universal Scalability Law to Distributed Systems. March 13, 2019 (1 / 81).

2. Introduction
Outline
1. Introduction
2. Universal Computational Scalability (8)
3. Queueing Theory View of the USL (15)
4. Distributed Application Scaling (22)
5. Distributed in the Cloud (30)
6. Optimal Node Sizing (38)
7. Summary (45)

3. Introduction
“Using a term like nonlinear science is like referring to the bulk of
zoology as the study of non-elephant animals.” –Stan Ulam
Models are a must
All performance analysis (including scalability analysis) is about nonlinearity. But nonlinear is a class name, like cancer: there are at least as many forms of cancer as there are cell types. You need to correctly characterize the specific cell type to determine the best treatment. Similarly, you need to correctly characterize the performance nonlinearity in order to improve it. The only sane way to do that is with the appropriate performance model. Data alone is not sufficient.
1. Algorithms are only half the scalability story
2. Measurements are the other half
3. And models are the other other half

4. Introduction
What is distributed computing?
“A collection of independent computers that appears to its users as a
single coherent system,” –Andy Tanenbaum (2007)
Anything outside a single execution unit (the von Neumann bottleneck)
PROBLEM: How to connect these? (diagram: Processor, Cache, RAM, Disk, with a "?" for the interconnect)
Tightly coupled vs. loosely coupled
Many issues are not really new; a question of scale
Confusing overloaded terminology: function, consistency, linearity,
monotonicity, concurrency, ....

5. Introduction
Easy to be naive about scalability
COST (Configuration that Outperforms a Single Thread), 2015¹ (…point is? —njg):

"Scalability! But at what COST?"
Frank McSherry, Michael Isard, Derek G. Murray (Unaffiliated)

Abstract: We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system's scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads. We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.

"You can have a second computer once you've shown you know how to use the first one." –Paul Barham

Figure 1: Scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization. The unoptimized implementation "scales" far better, despite (or rather, because of) its poor performance. "While this may appear to be a contrived example, we will argue that many published big data systems more closely resemble system A than they resemble system B."

§1 Introduction: "The published work on big data systems has fetishized …"
§1.1 Methodology: "In this paper we take several recent graph processing papers from the systems literature and compare their reported performance against simple, single-threaded implementations on the same datasets using a high-end …"

Gene Amdahl, 1967²:
"Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach." (Wrong then but REALLY wrong now! —njg)
- All single CPUs/cores now tapped out ≤ 5 GHz since 2005
- SPEC.org benchmarks made the same mistake 30 yrs ago
- SPECrate: multiple single-threaded instances (dumb fix)
- SPEC SDM: Unix multi-user processes (very revealing)

1. McSherry et al., "Scalability! But at what COST?", Usenix conference, 2015
2. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," AFIPS conference, 1967

6. Introduction
Can haz linear scalability
Shared-nothing architecture³ — not easy in general⁴
- Tandem Himalaya with NonStop SQL
- Local processing and memory + no global lock
- Linear OLTP scaling on TPC-C up to N = 128 nodes, c.1996
(diagram: processing nodes, each a processor P with cache C and memory M, attached to an interconnect network)

3. D. J. DeWitt and J. Gray, "Parallel Database Systems: The Future of High Performance Database Processing," Comm. ACM, Vol. 35, No. 6, 85–98 (1992)
4. NJG, "Scaling and Shared Nothingness," 4th HPTS Workshop, Asilomar, California (1993)

7. Universal Computational Scalability (8)
Outline
1. Introduction
2. Universal Computational Scalability (8)
3. Queueing Theory View of the USL (15)
4. Distributed Application Scaling (22)
5. Distributed in the Cloud (30)
6. Optimal Node Sizing (38)
7. Summary (45)

8. Universal Computational Scalability (8)
Motivation for USL
Pyramid Technology, c.1992
| Vendor  | Model         | Unix DBMS | TPS-B  |
|---------|---------------|-----------|--------|
| nCUBE   | nCUBE2        | OPS       | 1017.0 |
| Pyramid | MISserver     | UNIFY     | 468.5  |
| DEC     | VaxCluster 4× | OPS       | 425.7  |
| DEC     | VaxCluster 3× | OPS       | 329.8  |
| Sequent | Symmetry 2000 | ORA       | 319.6  |

9. Universal Computational Scalability (8)
Amdahl’s law
How not to write a formula (Hennessy & Patterson, p.30)
How not to plot performance data (Wikipedia)

10.–13. Universal Computational Scalability (8)
Universal scaling properties
1. Equal bang for the buck (α = 0, β = 0): capacity rises linearly with processes. Everybody wants this.
2. Diminishing returns (α > 0, β = 0). Everybody usually gets this.
3. Bottleneck limit (α ≫ 0, β = 0): capacity saturates at the ceiling 1/α. Everybody hates this.
4. Retrograde scaling (α > 0, β > 0): capacity peaks below 1/α and then falls off like 1/N. Everybody thinks this never happens.

14.–22. Universal Computational Scalability (8)
The universal scalability law (USL)⁵
- N: processes provide system stimulus or load
- X_N: response function or relative capacity
- Question: What kind of function?

X_N(α, β, γ) = γN / (1 + α(N − 1) + βN(N − 1))

Three Cs:
1. Concurrency (0 < γ < ∞)
2. Contention (0 < α < 1)
3. Coherency (0 < β < 1)

5. NJG, "A Simple Capacity Model of Massively Parallel Transaction Systems," CMG Conference (1993)
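The USL expression above drops straight into code. A minimal sketch; the parameter values in the demo calls are illustrative, not taken from the talk:

```python
# A minimal sketch of the USL response function defined on this slide.
def usl(N, alpha, beta, gamma=1.0):
    """Relative capacity X_N = gamma*N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))

print(usl(32, 0.00, 0.000))   # pure concurrency: linear, 32.0
print(usl(32, 0.05, 0.000))   # contention only: Amdahl-like, capped near 1/alpha = 20
print(usl(32, 0.05, 0.001))   # contention + coherency: retrograde onset
```

With γ set to the measured single-process throughput X(1), the same function returns throughput instead of relative capacity.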

23.–26. Universal Computational Scalability (8)
Measurement meets model
X(N)/X(1) (thruput data) → C_N (capacity metric) ← N / (1 + α(N − 1) + βN(N − 1)) (USL model)
(plot: relative capacity C(N) vs processes N ≤ 100, comparing linear scaling with Amdahl-like scaling)

27. Universal Computational Scalability (8)
USL coherency term βN(N − 1) is O(N²)
The fully connected graph K_N with N nodes (K3, K6, K8, K16 pictured) has

C(N, 2) = N! / (2! (N − 2)!) = ½ N(N − 1)

edges (quadratic in N) that represent:
- communication between each pair of nodes⁶
- exchange of data or objects between processing nodes
- actual performance impact captured by magnitude of the β parameter

6. This term can also be related to Brooks' law (too many cooks increase delivery time)
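A quick check that the edge count of K_N matches ½N(N − 1), the O(N²) pairwise exchange behind the coherency term:

```python
from math import comb

# Edge count of the fully connected graph K_N grows as N(N-1)/2 -- the
# O(N^2) pairwise exchange behind the USL coherency term beta*N*(N-1).
for n in (3, 6, 8, 16):
    assert comb(n, 2) == n * (n - 1) // 2
    print(f"K{n}: {comb(n, 2)} edges")   # K3: 3, K6: 15, K8: 28, K16: 120
```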

28.–31. Universal Computational Scalability (8)
Coherency or consistency?
Eric Brewer, PODC Keynote, July 19, 2000: The CAP Theorem. Consistency, Availability, Tolerance to network Partitions. "Theorem: You can have at most two of these properties for any shared-data system." (What the man said)
Correct Venn diagram: three overlapping sets C, A, P, with pairwise intersections CA, CP, and AP.

32. Universal Computational Scalability (8)
Eventual consistency and monotonicity
- Consistency: partial order of replicas via a binary relation (posets)
- Hasse diagrams: replica events converge (like time) to the largest (cumulative) value
- Monotonicity: any upward Hasse path terminates with the same result
- Developer view is not a performance guarantee

Lattice of integers ordered by the binary max relation (Hasse diagram, time flowing upward): 2, 4, 7, 9 → 4, 7, 9 → 7, 9 → 9

Lattice of sets ordered by the binary join relation: {a}, {b}, {c}, {d} → {b, c}, {a, b}, {c, d}, {a, d} → {a, b, c}, {b, c, d}, {a, b, d} → {a, b, c, d}

- Logical monotonicity = monotonically increasing function
- A monotonic application can exhibit monotonically decreasing scalability
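The monotone-merge idea behind these Hasse diagrams can be sketched in a few lines. A toy illustration, not from the slides: folding replica states with a join operator (max for integers, union for sets) reaches the same top element in every merge order.

```python
from functools import reduce
from itertools import permutations

# Toy illustration of convergence on a join semilattice: every merge order
# of the replica states yields the same top element.
def converges(states, join):
    return len({reduce(join, perm) for perm in permutations(states)}) == 1

ints = [2, 4, 7, 9]                                   # join = max,   top = 9
sets = [frozenset(s) for s in ("a", "b", "c", "d")]   # join = union, top = {a,b,c,d}
print(converges(ints, max))               # True
print(converges(sets, frozenset.union))   # True
```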

33. Universal Computational Scalability (8)
Determining the USL parameters
Throughput measurements X_N at various process loads N, sourced from, e.g., production monitoring (Linux /proc, the JMX interface).
Want to determine the α, β, γ that best model the X_N data:

X_N(α, β, γ) = γN / (1 + α(N − 1) + βN(N − 1))

X_N is a rational function (tricky)
Brute force: Gene Amdahl 1967 (only α)
Clever ways:
1. Optimization Solver in Excel
2. Nonlinear statistical regression in R
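The slide names Excel Solver and R's nonlinear regression; as a plain-Python sketch of the same least-squares idea, here is a coarse grid search over (α, β, γ). The data are hypothetical, generated from the USL with α = 0.05, β = 0.002, γ = 100, so the fit should land near those values:

```python
# Least-squares USL fit by coarse grid search (hypothetical data).
def usl(N, alpha, beta, gamma):
    return gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))

data = {1: 100.0, 2: 189.8, 4: 340.7, 8: 547.2, 16: 717.5, 32: 705.8}

def sse(params):
    a, b, g = params
    return sum((usl(N, a, b, g) - X) ** 2 for N, X in data.items())

grid = ((a / 1000, b / 10000, float(g))
        for a in range(0, 101, 5)       # alpha: 0 .. 0.10
        for b in range(0, 51, 2)        # beta:  0 .. 0.005
        for g in range(90, 111, 2))     # gamma: 90 .. 110
alpha, beta, gamma = min(grid, key=sse)
print(alpha, beta, gamma)   # expect values near (0.05, 0.002, 100)
```

In practice a proper nonlinear regression (R's `nls`, or a numerical optimizer) replaces the grid, but the objective being minimized is the same.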

34. Queueing Theory View of the USL (15)
Outline
1. Introduction
2. Universal Computational Scalability (8)
3. Queueing Theory View of the USL (15)
4. Distributed Application Scaling (22)
5. Distributed in the Cloud (30)
6. Optimal Node Sizing (38)
7. Summary (45)

35. Queueing Theory View of the USL (15)
Queueing theory for those who can’t wait
Amdahl's law for parallel speedup S_A = T_1/T_N with N processors⁷:

S_A(N) = N / (1 + α(N − 1))   (1)

1. Parameter 0 ≤ α ≤ 1 is the serial fraction of time spent running single-threaded
2. Subsumed by the USL model contention term (with γ = 1 and β = 0)
3. Msg: Forget linearity, you're gonna hit a ceiling at lim_{N→∞} S_A ∼ 1/α

Some people (all named Frank?) consider Amdahl's law to be ad hoc:
- Franco P. Preparata, "Should Amdahl's Law Be Repealed?," Proc. 6th International Symposium on Algorithms and Computation (ISAAC'95), Cairns, Australia, December 4–6, 1995. https://www.researchgate.net/publication/221543174_Should_Amdahl%27s_Law_Be_Repealed_Abstract
- Frank L. Devai, "The Refutation of Amdahl's Law and Its Variants," 17th International Conference on Computational Science and Applications (ICCSA), Trieste, Italy, July 2017. https://www.researchgate.net/publication/318648496_The_Refutation_of_Amdahl%27s_Law_and_Its_Variants

7. Gene Amdahl (1967) never wrote down this equation; he measured the mainframe workloads by brute force
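A quick numerical check of the ceiling claimed in point 3 (the value α = 0.05 is an arbitrary choice):

```python
# With serial fraction alpha, Amdahl speedup can never exceed 1/alpha.
def amdahl(N, alpha):
    return N / (1 + alpha * (N - 1))

alpha = 0.05
for N in (10, 100, 1000, 100000):
    print(N, round(amdahl(N, alpha), 3))   # creeps toward the ceiling 1/alpha = 20
```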

36. Queueing Theory View of the USL (15)
(plot: relative capacity C(N) vs processes N ≤ 100, linear scaling vs Amdahl-like scaling)
- Top blue curve is Amdahl, approaching a ceiling speedup C(N) ∼ 20
- Which also means α ∼ 0.05, i.e., the app is serialized for 5% of the runtime

37.–42. Queueing Theory View of the USL (15)
Queue characterization
(diagram of a queue: arrivals wait in a line of length L ahead of m servers at utilization ρ; Arriving → Waiting → Servicing → Departing)
Time: R = W + S (residence = waiting + service)
Space: Q = L + mρ (in system = in line + in service)
A familiar queue; an abstract queue:
1. Where is the service facility?
2. Did anything finish yet?
3. Is anything waiting?
4. Is anything arriving?

43. Queueing Theory View of the USL (15)
Amdahl queueing model
(Figure 1: Amdahl queueing model, processing nodes serviced via an interconnect network)
- Classical: Queueing model of assembly-line robots (LHS bubbles)
- Classical: Robots that fail go to service stations (RHS) for repairs
- Classical: Repair queues impact manufacturing "cycle time"
- Amdahl: Robots become a finite number of (distributed) processing nodes
- Amdahl: Repair stations become the interconnect network
- Amdahl: Network latency impacts application scalability

44. Queueing Theory View of the USL (15)
Amdahl queueing theorem
Theorem 1 (Gunther 2002)
Amdahl’s law for parallel speedup is equivalent to the synchronous queueing
bound on throughput in the machine repairman model of a multi-node
computer system.
Proof.
1. "A New Interpretation of Amdahl's Law and Geometric Scalability," arXiv, submitted 17 Oct (2002)
2. "Celebrity Boxing (and sizing): Alan Greenspan vs. Gene Amdahl," 28th International Computer Measurement Group Conference, December 8–13, Reno, NV (2002)
3. "Unification of Amdahl's Law, LogP and Other Performance Models for Message-Passing Architectures," Proc. Parallel and Distributed Computer Systems (PDCS'05), November 14–16, 569–576, Phoenix, AZ (2005)

45. Queueing Theory View of the USL (15)
Queueing bounds on throughput
(plot: throughput X(N) vs processing nodes N ≤ 120)
X_lin(N) = N / (R_1 + Z) (linear, no-queueing bound)
X_max = 1 / S_max (bottleneck ceiling)
X_mean(N) = N / (R_N + Z) (average throughput)
X_sync(N) = N / (N R_1 + Z) (synchronous queueing bound)
X_sax = 1 / R_1 (asymptote of the synchronous bound)
Figure 2: Average (blue) and synchronous (red) throughput X(N)
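A small sketch of the linear and synchronous bounds, with illustrative values for R_1, Z, and S_max (not from the talk):

```python
# Bounding curves in the spirit of Figure 2 (illustrative parameter values).
R1, Z, Smax = 2.0, 10.0, 0.5   # round-trip time, think time, bottleneck service time

def x_lin(N):                  # optimistic bound: nobody queues
    return N / (R1 + Z)

def x_sync(N):                 # pessimistic bound: all N requests queue together
    return N / (N * R1 + Z)

x_max = 1 / Smax               # bottleneck ceiling

for N in (1, 4, 16, 64):
    print(N, round(x_lin(N), 3), round(x_sync(N), 3))
# The two bounds coincide at N = 1; x_sync approaches 1/R1 from below.
```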

46. Queueing Theory View of the USL (15)
The speedup mapping
1. Z: mean processor execution time
2. S_k: service time at the kth (network) queue
3. R_1: minimum round-trip time, i.e., Σ_k S_k + Z for N = 1

Amdahl speedup S_A(N) is given by

S_A(N) = N R_1 / (N Σ_k S_k + Z) — queueing form (2)
S_A(N) = N / (1 + α(N − 1)) — usual "ad hoc" law (3)

Connecting (2) and (3) requires the identity

Σ_k S_k / R_1 ≡ α, which → 0 as S_k → 0 (processor only), and → 1 as Z → 0 (network only)   (4)
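Identity (4) can be checked numerically: with α = Σ_k S_k / R_1, the queueing form (2) and the Amdahl form (3) agree for every N. The S_k and Z values below are hypothetical:

```python
# Numerical check that queueing form (2) reduces to ad hoc form (3)
# under identity (4). Service times and think time are hypothetical.
Sk = [0.3, 0.2, 0.1]          # per-queue network service times
Z = 5.4                       # mean processor execution time
R1 = sum(Sk) + Z              # minimum round-trip time at N = 1
alpha = sum(Sk) / R1          # identity (4)

def s_queue(N):
    return N * R1 / (N * sum(Sk) + Z)

def s_amdahl(N):
    return N / (1 + alpha * (N - 1))

for N in (1, 2, 8, 64, 1024):
    assert abs(s_queue(N) - s_amdahl(N)) < 1e-9
print("queueing form (2) matches Amdahl form (3) for all tested N")
```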

47. Queueing Theory View of the USL (15)
USL queueing model
(Figure 3: USL load-dependent queueing model, processing nodes and interconnect)
- Amdahl: Robots become a finite number of (distributed) processing nodes
- Amdahl: Repair stations become the interconnect network
- Amdahl: Network latency impacts application scalability
- USL: Interconnect service time (latency) depends on queue length
- USL: Requests are first exchange-sorted, O(N²)

48. Queueing Theory View of the USL (15)
Theorem 2 (Gunther 2008)
The USL relative capacity is equivalent to the synchronous queueing bound
on throughput for a linear load-dependent machine repairman model of a
multi-node computer system.
Proof.
1. Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services (Springer 2007)
2. "A General Theory of Computational Scalability Based on Rational Functions," arXiv, submitted 11 Aug (2008)
3. Discrete-event simulations presented in "Getting in the Zone for Successful Scalability," Proc. Computer Measurement Group International Conference, December 7–12, Las Vegas, NV (2008)

49. Queueing Theory View of the USL (15)
Experimental results
(plot: relative capacity C(N) vs processing nodes N ≤ 100, comparing the synchronized repairman simulation with the USL model at α = 0.1, β = 0 and at α = 0.1, β = 0.001)
Discrete-event simulations (Jim Holtman 2008, 2018)

50. Queueing Theory View of the USL (15)
Key points for distributed USL
Each USL denominator term has a physical meaning (not a purely statistical model):
1. Constant O(1) ⇒ maximum nodal concurrency (no queueing)
2. Linear O(N) ⇒ intra-nodal contention (synchronous queueing)
3. Quadratic O(N²) ⇒ inter-nodal coherency (pairwise data exchange)
Useful for interpreting USL analysis to developers, engineers, architects, etc.
Example 1: If β > α, then look for inter-nodal messaging activity rather than the size of message queues in any particular node.
- Worst-case queueing corresponds to an analytic equation (rational function)
- The USL is agnostic about the application architecture and interconnect topology
- So where is all that information contained? (exercise for the reader)

51. Distributed Application Scaling (22)
Outline
1. Introduction
2. Universal Computational Scalability (8)
3. Queueing Theory View of the USL (15)
4. Distributed Application Scaling (22)
5. Distributed in the Cloud (30)
6. Optimal Node Sizing (38)
7. Summary (45)

52. Distributed Application Scaling (22)
Memcached
“Hidden Scalability Gotchas in Memcached and Friends”
NJG, S. Subramanyam, and S. Parvu (Sun Microsystems)
Presented at Velocity 2010

53. Distributed Application Scaling (22)
Distributed scaling strategies
Scale-up vs. scale-out (diagrams)

54. Distributed Application Scaling (22)
Scaleout generations
Distributed cache of key-value pairs
Deploy memcache on previous generation CPUs (not always multicore)
Single worker thread ok — until next hardware roll (multicore)

55. Distributed Application Scaling (22)
(plot: memcached throughput X(N) in KOPS vs N = 1…12 threads)

56. Distributed Application Scaling (22)
Explains these memcached warnings
Configuring the memcached server:
"Threading is used to scale memcached across CPUs. The model is connections. ... By default 4 threads are allocated. Setting it to very large values (80+) will make it run considerably slower."

Linux man pages, memcached(1):
"-t  Number of threads to use to process incoming requests. ... It is typically not useful to set this higher than the number of CPU cores on the memcached server. Setting a high number (64 or more) of worker threads is not recommended. The default is 4."


58. Distributed Application Scaling (22)
R regression analysis
(screenshot of R output, not captured in the transcript)

59. Distributed Application Scaling (22)
USL scalability model
(plot: memcached throughput X(N) in KOPS vs N = 1…12 threads, with the fitted USL curve)
α = 0.0468, β = 0.021016, γ = 84.89
Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82

60. Distributed Application Scaling (22)
Concurrency parameter
(same plot and fitted parameters as the previous slide)
1. γ = 84.89
2. Slope of the linear bound, in Kops/thread
3. Estimate of throughput X(1) = 84.89 Kops at N = 1 thread

61. Distributed Application Scaling (22)
Contention parameter
(same plot and fitted parameters)
- α = 0.0468
- Waiting or queueing for resources about 4.6% of the time
- Max possible throughput is X(1)/α = 1814.78 Kops (Xroof)

62. Distributed Application Scaling (22)
Coherency parameter
(same plot and fitted parameters)
- β = 0.0210 corresponds to retrograde throughput
- Distributed copies of data (e.g., caches) have to be exchanged/updated about 2.1% of the time to be consistent
- Peak occurs at Nmax = √((1 − α)/β) = 6.73 threads
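The quoted memcached figures can be recomputed from the three fitted parameters alone, using Nmax = √((1 − α)/β) and Xroof = γ/α; tiny differences from the slide values come from parameter rounding:

```python
from math import sqrt

# Recomputing the quoted memcached numbers from the fitted USL parameters.
alpha, beta, gamma = 0.0468, 0.021016, 84.89

Nmax = sqrt((1 - alpha) / beta)     # peak concurrency
Xmax = gamma * Nmax / (1 + alpha * (Nmax - 1) + beta * Nmax * (Nmax - 1))
Xroof = gamma / alpha               # contention-only ceiling

print(Nmax, Xmax, Xroof)   # compare with the slide: 6.73, 274.87, 1814.82
```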

63. Distributed Application Scaling (22)
Improving scalability performance
(plot: speedup S(N) vs N ≤ 50 for mcd 1.2.8, mcd 1.3.2, and mcd 1.3.2 + patch)

64. Distributed Application Scaling (22)
Sirius and Zookeeper
M. Bevilacqua-Linn, M. Byron, P. Cline, J. Moore and S. Muir (Comcast), "Sirius: Distributing and Coordinating Application Reference Data," Usenix ATC 2014
P. Hunt, M. Konar, F. P. Junqueira and B. Reed (Yahoo), "ZooKeeper: Wait-free coordination for Internet-scale systems," Usenix ATC 2010

65.–68. Distributed Application Scaling (22)
Coordination throughput data
(plot: writes per second, 0–1500, vs cluster size 1–15 for Sirius, Sirius-NoBrain, and Sirius-NoDisk, with the USL model)
- It's all downhill on the coordination ski slopes
- But that's very bad ... isn't it?
- What does the USL say?

69. Distributed Application Scaling (22)
USL scalability model
[Figure: Sirius writes per second vs. cluster size (1–15) with the fitted USL curve]
α = 0.05
β = 0.165142
γ = 1024.98
Nmax = 2.4
Xmax = 1513.93
Xroof = 20499.54
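The fitted values above follow from the USL throughput model X(N) = γN / (1 + α(N−1) + βN(N−1)). A minimal Python sketch (not from the talk; parameter values copied from the slide) shows how Nmax, Xmax and Xroof derive from α, β and γ:

```python
import math

def usl_throughput(n, alpha, beta, gamma):
    """USL throughput: X(N) = gamma*N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return gamma * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Fitted Sirius parameters taken from the slide
alpha, beta, gamma = 0.05, 0.165142, 1024.98

n_max = math.sqrt((1 - alpha) / beta)              # location of throughput peak
x_max = usl_throughput(n_max, alpha, beta, gamma)  # peak throughput
x_roof = gamma / alpha                             # contention ceiling X(1)/alpha

assert abs(n_max - 2.4) < 0.01        # slide: Nmax = 2.4
assert abs(x_max - 1513.93) < 0.5     # slide: Xmax = 1513.93
assert abs(x_roof - 20499.54) < 1.0   # slide: Xroof = 20499.54
```

Note that X(1) = γ by construction, which is why γ is read as the single-node (virtual) throughput on the next slide.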

70. Distributed Application Scaling (22)
Concurrency parameter
[Figure: USL fit as on the previous slide; inset: ideal concurrency, α = 0, β = 0]
1 γ = 1024.98
2 Single node is meaningless (need N ≥ 3 for majority)
3 Interpret γ as N = 1 virtual throughput
4 USL estimates X(1) = 1,024.98 WPS (black square)

71. Distributed Application Scaling (22)
Contention parameter
[Figure: USL fit as before; inset: contention-limited scaling, α >> 0, β = 0, with asymptote 1/α]
α = 0.05
Queueing for resources about 5% of the time
Max possible throughput is X(1)/α = 20,499.54 WPS (Xroof)
But Xroof is not feasible in these systems

72. Distributed Application Scaling (22)
Coherency parameter
[Figure: USL fit as before; inset: retrograde scaling, α > 0, β > 0, with asymptotes 1/α and 1/N]
β = 0.1651 says retrograde throughput dominates!
Distributed data being exchanged (compared?) about 16.5% of the time
(virtual) Peak at Nmax = √((1 − α)/β) = 2.4 cluster nodes
Shocking, but that’s what Paxos-type scaling looks like

73. Distributed Application Scaling (22)
Comparison of Sirius and Zookeeper
[Left figure: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
[Right figure: Zookeeper requests per second vs. cluster size (mean with confidence intervals); USL fit: α = 0.05, β = 0.157701, γ = 41536.96, Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
Both coordination services exhibit equivalent USL scaling parameters:
5% contention delay — not “wait-free” (per Yahoo paper title)
16% coherency delay
despite Sirius being write-intensive and Zookeeper being read-intensive (i.e., 100x throughput)

74. Distributed Application Scaling (22)
Distributed Ledger Scalability
A. Charapko, A. Ailijiang and M. Demirbas, “Bridging Paxos and Blockchain Consensus”, 2018
Team Rocket (Cornell), “Snowflake to Avalanche: A Novel Metastable Consensus Protocol Family for Cryptocurrencies”, 2018

75. Distributed Application Scaling (22)
Avalanche protocol
VISA can handle 60,000 TPS
Bitcoin bottlenecks at 7 TPS 8
Avalanche rumor-mongering coordination:
Multiple samples of small nodal neighborhood (ns)
Threshold > k ns when most of system has “heard” the rumor
Greener than Bitcoin (Nakamoto)
No PoW needed
8 K. Stinchcombe, “Ten years in, nobody has come up with a use for blockchain”, Hackernoon, Dec 22, 2017

76. Distributed Application Scaling (22)
Avalanche scalability
[Figure: throughput (TPS) vs. network nodes (0–2000) with the fitted USL curve]
α = 0.9
β = 6.5e−05
γ = 1648.27
Npeak = 39.15
Xpeak = 1821.21
Xroof = 1831.41

77. Distributed Application Scaling (22)
Comparison with Paxos
[Left figure: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27, Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
[Right figure: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98, Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
X(1)/α ⇒ immediate saturation at about 2000 TPS
Coherency: β extremely small but not zero
Much slower ‘linear’ degradation than Zookeeper/Paxos
Saturation throughput maintained over 2000 nodes (100x Sirius)

78. Distributed Application Scaling (22)
NJG, P. J. Puglia, and K. Tomasette, “Hadoop Superlinear
Scalability: The perpetual motion of parallel performance”, Comm.
ACM, Vol. 58 No. 4, 46–55 (2015)

79. Distributed Application Scaling (22)
Super-scalability is a thing
[Figure: speedup vs. processors, showing superlinear, linear and sublinear regimes]
More than 100% efficient!

80. Distributed Application Scaling (22)
Seeing is believing
“Speedup in Parallel Contexts.”
http://en.wikipedia.org/wiki/Speedup#Speedup_in_Parallel_Contexts
“Where does super-linear speedup come from?”
http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from
“Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst
Benchmarks.”
Haas, R. “Scalability, in Graphical Form, Analyzed.”
http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
Sutter, H. 2008. “Going Superlinear.”
Dr. Dobb’s Journal 33(3), March.
http://www.drdobbs.com/cpp/going-superlinear/206100542
Sutter, H. 2008. “Super Linearity and the Bigger Machine.”
Dr. Dobb’s Journal 33(4), April.
http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306
“SDN analytics and control using sFlow standard — Superlinear.”
http://blog.sflow.com/2010/09/superlinear.html
Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com

81. Distributed Application Scaling (22)
Empirical evidence
16-node Sun Fire X2270 M2 cluster

82. Distributed Application Scaling (22)
Smells like perpetual motion

83. Distributed Application Scaling (22)
Terasort cluster emulations
Want a controlled environment to study superlinearity
Terasort workload sorts 1 TB data in parallel
Terasort has benchmarked Hadoop MapReduce performance
We used just 100 GB data input (we weren’t benchmarking anything)
Simulated in the AWS cloud on EC2 instances (more flexible and much cheaper)
Enabled multiple runs in parallel (thank you Comcast for cycle $s)
Table 1: Amazon EC2 Configurations

| | Optimized for | Processor Arch | vCPU number | Memory (GiB) | Instance Storage (GB) | Network Performance |
|---|---|---|---|---|---|---|
| BigMem | Memory | 64-bit | 4 | 34.2 | 1 x 850 | Moderate |
| BigDisk | Compute | 64-bit | 8 | 7 | 4 x 420 | High |

84. Distributed Application Scaling (22)
USL analysis of BigMem N ≤ 50 nodes
[Figure: USL analysis of BigMem Hadoop Terasort, speedup S(N) vs. EC2 m2 nodes, fitted to N ≤ 50]
α = −0.0288
β = 0.000447
Nmax = 47.96
Smax = 73.48
Sroof = NA
Ncross = 64.5
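With a negative contention parameter the USL speedup S(N) = N / (1 + α(N−1) + βN(N−1)) starts above the linear bound and crosses back below it where α(N−1) + βN(N−1) = 0, i.e. at Ncross = −α/β. A quick sketch (not from the talk; fit values copied from the slide, so small rounding differences from the quoted Ncross = 64.5 are expected):

```python
def usl_speedup(n, alpha, beta):
    """USL speedup: S(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# BigMem fit (N <= 50) from the slide: negative alpha means superlinear start
alpha, beta = -0.0288, 0.000447

# Crossover back to the linear bound S(N) = N
n_cross = -alpha / beta
assert abs(n_cross - 64.43) < 0.05  # slide quotes Ncross = 64.5 (unrounded fit)

# Just below the crossover the speedup is still superlinear,
# just above it the speedup is sublinear
assert usl_speedup(n_cross - 1, alpha, beta) > n_cross - 1
assert usl_speedup(n_cross + 1, alpha, beta) < n_cross + 1
```

This is the quantitative content of the “payback trap” slide further on: the superlinear advantage is guaranteed to be given back once N exceeds −α/β.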

85. Distributed Application Scaling (22)
USL analysis of BigMem N ≤ 150 nodes
[Figure: USL model of BigMem Hadoop Terasort data, speedup S(N) vs. EC2 m2 nodes, fitted to N ≤ 150]
α = −0.0089
β = 9e−05
Nmax = 105.72
Smax = 99.53
Sroof = NA
Ncross = 99.14

86. Distributed Application Scaling (22)
Speedup on N ≤ 10 BigMem Nodes (1 disk)
[Figure: speedup S(N) vs. EC2 m2 nodes, showing the superlinear region crossing into the sublinear region]

87. Distributed Application Scaling (22)
Speedup on N ≤ 10 BigDisk Nodes (4 disks)
[Figure: speedup S(N) vs. EC2 c1 nodes, showing the superlinear region crossing into the sublinear region]

88. Distributed Application Scaling (22)
Superlinear payback trap
Superlinearity is real (i.e., measurable)
But due to hidden resource constraint (disk or memory subsystem sizing)
Sets wrong normalization for speedup scale
Shows up in USL as negative α value (cf. GCAP book)
Theorem 3 (Gunther 2013)
The apparent advantage of superlinear scaling (α < 0) is inevitably lost due to crossover into a region of severe performance degradation (β > 0) that can be worse than if superlinearity had not been present in the first place. Proof was presented at Hotsos 2013.
[Figures: speedup vs. processors, showing the superlinear regime and the subsequent “superlinear payback” crossover into sublinear scaling]

89. Distributed in the Cloud (30)
Outline
1 Introduction
2 Universal Computational Scalability (8)
3 Queueing Theory View of the USL (15)
4 Distributed Application Scaling (22)
5 Distributed in the Cloud (30)
6 Optimal Node Sizing (38)
7 Summary (45)

90. Distributed in the Cloud (30)
AWS Cloud Application
“Exposing the Cost of Performance Hidden in the Cloud”
NJG and M. Chawla
Presented at CMG cloudXchange 2018

91. Distributed in the Cloud (30)
Production data
Previously measured X and R directly on test rig
Table 2: Converting data to performance metrics

| Data | Meaning | Metric | Meaning |
|---|---|---|---|
| T | Elapsed time | X = C/T | Throughput |
| Tp | Processing time | R = (Tp/T)(T/C) | Response time |
| C | Completed work | N = X × R | Concurrent threads |
| Ucpu | CPU utilization | S = Ucpu/X | Service time |

Example 4 (Coalesced metrics)
Linux epoch timestamps; interval between rows is 300 seconds
Timestamp, X, N, S, R, U_cpu
1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
...
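Table 2’s conversions can be applied row by row. A minimal sketch (not from the talk; the helper name and raw counter values are mine, chosen to land near the first row of Example 4, which requires Tp to be processing time summed over all threads, since N = X × R reduces to Tp/T):

```python
def to_metrics(T, Tp, C, Ucpu):
    """Convert raw counters (Table 2) into performance metrics.

    T:    elapsed time of the sample window (s)
    Tp:   processing time summed over all threads (s)
    C:    completed work (requests)
    Ucpu: CPU utilization (fraction)
    """
    X = C / T               # throughput
    R = (Tp / T) * (T / C)  # response time (simplifies to Tp / C)
    N = X * R               # concurrent threads (Little's law)
    S = Ucpu / X            # service time
    return X, R, N, S

# Illustrative counters for one 300 s window (not measured values)
X, R, N, S = to_metrics(T=300.0, Tp=50730.0, C=150651.0, Ucpu=0.4581)

assert abs(N - X * R) < 1e-9  # Little's law holds by construction
```

With these illustrative inputs X comes out near 502 RPS and S near 0.0009 s, in the same ballpark as the first sample row above.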

92. Distributed in the Cloud (30)
Tomcat data from AWS
[Figure: measured throughput (RPS, 0–800) vs. threads (N, 0–500)]

93. Distributed in the Cloud (30)
USL nonlinear analysis
[Figure: Tomcat throughput data with the fitted USL curve]
α = 0
β = 3e−06
γ = 3
Nmax = 539.2
Xmax = 809.55
Nopt = 274.8

94. Distributed in the Cloud (30)
Concurrency parameter
[Figure: USL fit as before; inset: ideal concurrency, α = 0, β = 0]
1 γ = 3.0
2 Smallest number of threads during 24 hr sample is N > 100
3 Nonetheless USL estimates throughput X(1) = 3 RPS

95. Distributed in the Cloud (30)
Contention parameter
[Figure: USL fit as before; inset: contention-limited scaling, α >> 0, β = 0, with asymptote 1/α]
α = 0
No significant waiting or queueing
Max possible throughput Xroof is not defined

96. Distributed in the Cloud (30)
Coherency parameter
[Figure: USL fit as before; inset: retrograde scaling, α > 0, β > 0, with asymptotes 1/α and 1/N]
β = 3 × 10⁻⁶ implies very weak retrograde throughput
Extremely little data exchange
But entirely responsible for sublinearity
And peak throughput Xmax = 809.55 RPS
Peak occurs at Nmax = 539.2 threads

97. Distributed in the Cloud (30)
Revised USL analysis
Linear slope γ ∼ 3: γ = 2.65
Should be no contention, i.e., α = 0
Discontinuity at N ∼ 275 threads
Throughput plateaus, i.e., β = 0
Saturation occurs at processor utilization UCPU ≥ 75%
Linux OS can’t do that!
Pseudo-saturation due to AWS Auto Scaling policy (hypervisor?)
Many EC2 instances spun up and down during 24 hrs

98. Distributed in the Cloud (30)
Corrected USL linear model
[Figure: Tomcat throughput data with the corrected piecewise-linear model]
α = 0
β = 0
γ = 2.65
Nmax = NaN
Xmax = 727.03
Nopt = 274.8
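The corrected model is piecewise linear: throughput climbs at slope γ until the autoscaling cutoff at Nopt, then plateaus. A sketch (not from the talk; the function name is mine, values from the slide):

```python
def pseudo_saturated_throughput(n, gamma=2.65, n_opt=274.8):
    """Corrected USL linear model: X(N) = gamma*N below the autoscaling
    cutoff N_opt, flat at gamma*N_opt above it."""
    return gamma * min(n, n_opt)

# In the linear region throughput scales with N
x_100 = pseudo_saturated_throughput(100)

# Beyond N_opt the plateau height is gamma * N_opt
x_plateau = pseudo_saturated_throughput(500)
assert abs(x_plateau - 2.65 * 274.8) < 1e-9
```

With γ rounded to 2.65 the plateau comes out near 728 RPS, close to the slide’s Xmax = 727.03 (which presumably uses the unrounded fit).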

99. Optimal Node Sizing (38)
Outline
1 Introduction
2 Universal Computational Scalability (8)
3 Queueing Theory View of the USL (15)
4 Distributed Application Scaling (22)
5 Distributed in the Cloud (30)
6 Optimal Node Sizing (38)
7 Summary (45)

100. Optimal Node Sizing (38)
1. Canonical sublinear scalability
[Figure: throughput X(N) vs. processes N, with USL fit, Amdahl bound, linear bound and saturation asymptote]
α = 0.05
β = 6e−05
γ = 20.8563
Nmax = 125.5
Xmax = 320.47
Xroof = 417.13
Amdahl curve (dotted)
Linear scaling bound (LHS red)
Saturation asymptote (top red)
Nopt = 1/α = 20 at intersection
USL curve (solid blue):
Nopt shifts right to Nmax = 125
Throughput is lower than Xroof
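The geometry of this slide can be checked numerically: the linear bound γN meets the saturation asymptote Xroof = γ/α at Nopt = 1/α, while the USL peak sits further right at Nmax = √((1 − α)/β). A sketch with the slide’s parameter values (not from the talk):

```python
import math

# Canonical sublinear example from the slide
alpha, beta, gamma = 0.05, 6e-05, 20.8563

n_opt = 1 / alpha                      # intersection of linear bound and Xroof
x_roof = gamma / alpha                 # saturation asymptote
n_max = math.sqrt((1 - alpha) / beta)  # USL peak, well to the right of n_opt

# The linear bound evaluated at n_opt equals the asymptote by construction
assert abs(gamma * n_opt - x_roof) < 1e-9
assert abs(n_opt - 20.0) < 1e-9     # slide: Nopt = 20
assert abs(x_roof - 417.13) < 0.01  # slide: Xroof = 417.13
```

Here Nmax comes out near 125.8, matching the slide’s 125.5 up to rounding of the quoted fit parameters.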

101. Optimal Node Sizing (38)
2. Memcache scalability
[Left figure: initial Memcache throughput X(N); USL fit: α = 0.0468, β = 0.021016, γ = 84.89, Nmax = 6.73, Xmax = 274.87, Nopt = 21.38]
[Right figure: patched Memcache throughput X(N) vs. worker processes; USL fit: α = 0, β = 0.00052, γ = 17.76, Npeak = 43.87, Xpeak = 394.06, Nopt = Inf]
Initial scalability (green curve)
Nmax occurs before Nopt
Nmax = 6.73
Nopt = 1/0.0468 = 21.37
Sun SPARC patch (blue curve)
Nmax = 43.87
Now exceeds previous Nopt
And throughput increased by approx 50%

102. Optimal Node Sizing (38)
3. Cloud scalability
[Left figure: throughput (RPS) with the corrected linear model; α = 0, β = 0, γ = 2.65, Nmax = NaN, Xmax = 727.03, Nopt = 274.8]
[Right figure: response time R(N) vs. Tomcat processes; α = 0, β = 0, γ = 2.65, Rmin = 0.38, Sbnk = 0.001365, Nopt = 274.8]
Throughput profile
Nopt = 274.8 due to autoscaling policy
Response time profile
Autoscaling throttles throughput
Causes R(N) to increase linearly
Will users tolerate R(N) when N > Nopt?

103. Optimal Node Sizing (38)
4. DLT scalability
[Left figure: Zookeeper requests per second vs. cluster size; USL fit: α = 0.05, β = 0.157701, γ = 41536.96, Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
[Right figure: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27, Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
Zookeeper throughput profile:
Optimum is the smallest configuration
Likely bigger in real applications
Coordination kills
Scaling degrades (that’s how it works!)
No good for DLT
Avalanche throughput profile:
There is no optimum
Throughput similar to Zookeeper peak
Pick a number N
Throughput is independent of scale
But N scale is 100x Zookeeper
Generally needed for DLT

104. Summary (45)
Outline
1 Introduction
2 Universal Computational Scalability (8)
3 Queueing Theory View of the USL (15)
4 Distributed Application Scaling (22)
5 Distributed in the Cloud (30)
6 Optimal Node Sizing (38)
7 Summary (45)

105. Summary (45)
Summary
All performance is nonlinear
Need both data and models (e.g., USL)
USL is a nonlinear (rational) function
USL is architecture agnostic
All scalability information contained in parameters:
1 α: contention, queueing
2 β: coherency, consistency (pairwise data exchange)
3 γ: concurrency, X(1) value
Apply USL via nonlinear statistical regression
CAP-like consistency associated with magnitude of quadratic β term
Can apply USL to both controlled (testing) and production platforms
USL analysis often different from paper designs and algorithms
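“Apply USL via nonlinear statistical regression”: in practice γ, α, β can also be recovered by noting that N/X(N) is a quadratic in N, namely N/X = [(1 − α) + (α − β)N + βN²]/γ. A self-contained sketch (not from the talk; the “true” parameters are illustrative) that generates noise-free data and recovers the parameters by solving for the quadratic’s coefficients. Three points suffice here because the data are exact; real measurements need a least-squares fit:

```python
def usl(n, alpha, beta, gamma):
    """USL throughput: X(N) = gamma*N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return gamma * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Known "true" parameters (illustrative values, not from the talk)
alpha0, beta0, gamma0 = 0.05, 0.02, 100.0

# Three noise-free samples determine the quadratic N/X exactly
ns = [1.0, 4.0, 16.0]
ys = [n / usl(n, alpha0, beta0, gamma0) for n in ns]

def det3(m):
    """Determinant of a 3x3 matrix (cofactor expansion)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Solve the 3x3 Vandermonde system y = c0 + c1*N + c2*N^2 by Cramer's rule
A = [[1.0, n, n * n] for n in ns]
d = det3(A)
coeffs = []
for j in range(3):
    Aj = [row[:] for row in A]
    for i in range(3):
        Aj[i][j] = ys[i]
    coeffs.append(det3(Aj) / d)
c0, c1, c2 = coeffs

# Map quadratic coefficients back to USL parameters:
# c0 = (1 - alpha)/gamma, c1 = (alpha - beta)/gamma, c2 = beta/gamma
gamma = 1.0 / (c0 + c1 + c2)  # since c0 + c1 + c2 = 1/gamma
beta = c2 * gamma
alpha = c1 * gamma + beta

assert abs(alpha - alpha0) < 1e-6
assert abs(beta - beta0) < 1e-6
assert abs(gamma - gamma0) < 1e-4
```

This quadratic transformation is also why measurements at N = 1 (or an extrapolated γ, as in the Sirius analysis) anchor the whole fit: c0 + c1 + c2 is exactly 1/X(1).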

106. Summary (45)
Thank you for the (virtual) invitation!
www.perfdynamics.com
Castro Valley, California