
Applying The Universal Scalability Law to Distributed Systems


After demonstrating that both Amdahl's law and the Universal Scalability law (USL) are fundamental, in that they can be derived from queueing theory, the USL is used to quantify the scalability of a wide variety of distributed systems, including Memcache, Hadoop, Zookeeper, AWS cloud, and Avalanche DLT. Contact is also made with some well-known concepts in modern distributed systems, viz., Hasse diagrams, the CAP theorem, eventual consistency and Paxos coordination.

Dr. Neil Gunther

February 16, 2019

Transcript

  1. Applying The Universal Scalability Law
    to Distributed Systems
    Dr. Neil J. Gunther
    Performance Dynamics
    Distributed Systems Conference
    Pune, INDIA
    16 February 2019

  2. Introduction
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  3. Introduction
    It’s all about NONLINEARITIES
    “Using a term like nonlinear science is like referring to the bulk of
    zoology as the study of non-elephant animals.” –Stan Ulam
    Models are a must
    All performance analysis (including scalability analysis) is about nonlinearity. But nonlinear is
    a class name, like cancer: there are at least as many forms of cancer as there are cell types. You
    need to correctly characterize the specific cell type to determine the best treatment. Similarly,
    you need to correctly characterize the performance nonlinearity in order to improve it. The
    only sane way to do that is with the appropriate performance model. Data alone is not
    sufficient.
    1 Algorithms are only half the scalability story
    2 Measurements are the other half
    3 And models are the other other half

  4. Introduction
    What is distributed computing?
    “A collection of independent computers that appears to its users as a
    single coherent system,” –Andy Tanenbaum (2007)
    Anything outside a single execution unit (von Neumann bottleneck)
    PROBLEM: How to connect these?
    [Diagram: Processor, Cache, RAM, Disk, with a “?” for the interconnect between them]
    Tightly coupled vs. loosely coupled
    Many issues are not really new; a question of scale
    Confusing overloaded terminology: function, consistency, linearity,
    monotonicity, concurrency, ....

  5. Introduction
    Easy to be naive about scalability
    COST (Configuration that Outperforms a Single Thread), 2015 [1]:
    Single-threaded scaling can be better than multi-threaded scaling. (And your point is? —njg)
    Scalability! But at what COST?
    Frank McSherry, Michael Isard, Derek G. Murray (Unaffiliated)
    Abstract
    We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads. We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.
    “You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham
    [Figure 1: Scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization. The unoptimized implementation “scales” far better, despite (or rather, because of) its poor performance.]
    While this may appear to be a contrived example, we will argue that many published big data systems more closely resemble system A than they resemble system B.
    Gene Amdahl, 1967 [2]:
    Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach. (Wrong then but REALLY wrong now! —njg)
    All single CPUs/cores now tapped out at ≤ 5 GHz since 2005
    SPEC.org benchmarks made the same mistake 30 yrs ago
    SPECrate: multiple single-threaded instances (dumb fix)
    SPEC SDM: Unix multi-user processes (very revealing)
    [1] McSherry et al., “Scalability! But at what COST?”, USENIX conference, 2015
    [2] “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS Conference, 1967

  6. Introduction
    Can haz linear scalability
    Shared nothing architecture [3] — not easy in general [4]
    Tandem Himalaya with NonStop SQL
    Local processing and memory + no global lock
    Linear OLTP scaling on TPC-C up to N = 128 nodes c.1996
    [Diagram: shared-nothing cluster of processor/cache/memory (P, C, M) nodes on an interconnect network]
    [3] D. J. DeWitt and J. Gray, “Parallel Database Systems: The Future of High Performance Database Processing,” Comm. ACM, Vol. 36, No. 6, 85–98 (1992)
    [4] NJG, “Scaling and Shared Nothingness,” 4th HPTS Workshop, Asilomar, California (1993)

  7. Universal Computational Scalability (8)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  8. Universal Computational Scalability (8)
    Motivation for USL
    Pyramid Technology, c.1992
    Vendor    Model            Unix DBMS    TPS-B
    nCUBE     nCUBE2           OPS         1017.0
    Pyramid   MISserver        UNIFY        468.5
    DEC       VaxCluster ×4    OPS          425.7
    DEC       VaxCluster ×3    OPS          329.8
    Sequent   Symmetry 2000    ORA          319.6

  9. Universal Computational Scalability (8)
    Amdahl’s law
    How not to write a formula (Hennessy & Patterson, p.30)
    How not to plot performance data (Wikipedia)

  13. Universal Computational Scalability (8)
    Universal scaling properties
    1. Equal bang for the buck (α = 0, β = 0): everybody wants this
    2. Diminishing returns (α > 0, β = 0): everybody usually gets this
    3. Bottleneck limit, ceiling at 1/α (α ≫ 0, β = 0): everybody hates this
    4. Retrograde throughput, rolling off as 1/N (α > 0, β > 0): everybody thinks this never happens

  22. Universal Computational Scalability (8)
    The universal scalability law (USL)
    N: processes provide system stimulus or load
    X_N: response function or relative capacity
    Question: What kind of function?
    Answer: A rational function [5] (nonlinear):
        X_N(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)]
    Three Cs:
    1 Concurrency (0 < γ < ∞)
    2 Contention (0 < α < 1)
    3 Coherency (0 < β < 1)
    [5] NJG, “A Simple Capacity Model of Massively Parallel Transaction Systems,” CMG Conference (1993)
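    As a concrete reading of the formula, here is a minimal sketch of the USL as an R function; the function name and the trial parameter values are mine, for illustration only:

      # Universal Scalability Law: relative capacity at load N (vectorized over N).
      usl <- function(N, alpha, beta, gamma = 1) {
        gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1))
      }
      usl(c(1, 2, 4, 8, 16, 32), alpha = 0.05, beta = 0.001)
      # With alpha = beta = 0 this recovers linear scaling: usl(N, 0, 0) == N.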

  26. Universal Computational Scalability (8)
    Measurement meets model
    X(N)/X(1) (throughput data) −→ C_N (capacity metric) ←− N / [1 + α(N − 1) + βN(N − 1)] (USL model)
    [Plot: relative capacity C(N) vs. processes N, showing linear, Amdahl-like, and retrograde scaling curves]

  27. Universal Computational Scalability (8)
    Why is coherency quadratic?
    USL coherency term βN(N − 1) is O(N²)
    A fully connected graph K_N with N nodes (e.g., K3, K6, K8, K16) has
        C(N, 2) = N! / [2!(N − 2)!] = ½ N(N − 1)
    edges (quadratic in N) that represent:
    communication between each pair of nodes [6]
    exchange of data or objects between processing nodes
    actual performance impact captured by magnitude of the β parameter
    [6] This term can also be related to Brooks’ law (too many cooks increase delivery time)
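    A one-line check of that edge count in R (the example values are mine):

      choose(c(3, 6, 8, 16), 2)   # edges in K3, K6, K8, K16: 3, 15, 28, 120
      N <- 16; N * (N - 1) / 2    # same as choose(N, 2), i.e., 120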

  31. Universal Computational Scalability (8)
    Coherency or consistency?
    The CAP Theorem (Eric Brewer, PODC keynote, July 19, 2000):
    Consistency, Availability, Tolerance to network Partitions
    Theorem: You can have at most two of these properties for any shared-data system
    What the man said
    [Venn diagram: C, A, P circles with pairwise overlaps CA, CP, AP]
    Correct Venn diagram

  32. Universal Computational Scalability (8)
    Eventual consistency and monotonicity
    Consistency: partial order of replicates via a binary relation (posets)
    Hasse diagrams: replica events converge (like time) to the largest (cumulative) value
    Monotonicity: any upward Hasse path terminates with the same result
    Developer view is not a performance guarantee
    [Hasse diagram: lattice of integers 2, 4, 7, 9 ordered by the binary max relation, converging upward over time to 9]
    [Hasse diagram: lattice of sets {a}, {b}, {c}, {d} ordered by the binary join relation, converging upward over time to {a, b, c, d}]
    Logical monotonicity = monotonically increasing function
    A monotonic application can exhibit monotonically decreasing scalability

  33. Universal Computational Scalability (8)
    Determining the USL parameters
    Throughput measurements X_N at various process loads N sourced from:
    1 Load test harness, e.g., LoadRunner, JMeter
    2 Production monitoring, e.g., Linux /proc, JMX interface
    Want to determine the α, β, γ that best model the X_N data:
        X_N(α, β, γ) = γN / [1 + α(N − 1) + βN(N − 1)]
    X_N is a rational function (tricky)
    Brute force: Gene Amdahl 1967 (only α)
    Clever ways:
    1 Optimization Solver in Excel
    2 Nonlinear statistical regression in R (a minimal sketch follows)
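    A minimal sketch of option 2, fitting the USL with base-R nls(); the data frame below is made up for illustration, not measurements from this talk:

      measured <- data.frame(
        N = c(1, 2, 4, 6, 8, 12),            # process load
        X = c(84, 160, 272, 276, 264, 226)   # throughput, e.g., KOPS
      )
      fit <- nls(X ~ gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1)),
                 data  = measured,
                 start = list(alpha = 0.1, beta = 0.01, gamma = measured$X[1]))
      coef(fit)   # estimated alpha, beta, gamma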

  34. Queueing Theory View of the USL (15)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  35. Queueing Theory View of the USL (15)
    Queueing theory for those who can’t wait
    Amdahl’s law for parallel speedup S_A = T_1/T_N with N processors [7]:
        S_A(N) = N / [1 + α(N − 1)]    (1)
    1 Parameter 0 ≤ α ≤ 1 is the serial fraction of time spent running single-threaded
    2 Subsumed by the USL model contention term (with γ = 1 and β = 0)
    3 Msg: Forget linearity, you’re gonna hit a ceiling at lim N→∞ S_A ∼ 1/α
    Some people (all named Frank?) consider Amdahl’s law to be ad hoc:
    Franco P. Preparata, “Should Amdahl’s Law Be Repealed?”, Proc. 6th International Symposium on Algorithms and Computation (ISAAC’95), Cairns, Australia, December 4–6, 1995. https://www.researchgate.net/publication/221543174_Should_Amdahl%27s_Law_Be_Repealed_Abstract
    Frank L. Devai, “The Refutation of Amdahl’s Law and Its Variants,” 17th International Conference on Computational Science and Applications (ICCSA), Trieste, Italy, July 2017. https://www.researchgate.net/publication/318648496_The_Refutation_of_Amdahl%27s_Law_and_Its_Variants
    [7] Gene Amdahl (1967) never wrote down this equation; he measured the mainframe workloads by brute force
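    The subsumption claim in point 2 is easy to check numerically against the usl() sketch from earlier (the function names are mine):

      amdahl <- function(N, alpha) N / (1 + alpha * (N - 1))
      all.equal(amdahl(1:32, 0.05), usl(1:32, alpha = 0.05, beta = 0))
      # TRUE: Amdahl's law is the USL with gamma = 1 and beta = 0.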

  36. Queueing Theory View of the USL (15)
    [Plot: relative capacity C(N) vs. processes N, with linear, Amdahl-like, and retrograde scaling curves]
    Top blue curve is Amdahl approaching a ceiling speedup C(N) ∼ 20
    Which also means α ∼ 0.05, i.e., the app is serialized 5% of the runtime

  42. Queueing Theory View of the USL (15)
    Queue characterization
    [Diagram: a familiar queue and an abstract queue: customers arriving, waiting (time W, length L), in service (time S, m servers at utilization ρ), departing. Time: R = W + S; Space: Q = L + mρ]
    1. Where is the service facility?
    2. Did anything finish yet?
    3. Is anything waiting?
    4. Is anything arriving?

  43. Queueing Theory View of the USL (15)
    Amdahl queueing model
    [Figure 1: Amdahl queueing model: processing nodes coupled to an interconnect network]
    Classical: Queueing model of assembly-line robots (LHS bubbles)
    Classical: Robots that fail go to service stations (RHS) for repairs
    Classical: Repair queues impact manufacturing “cycle time”
    Amdahl: Robots become a finite number of (distributed) processing nodes
    Amdahl: Repair stations become the interconnect network
    Amdahl: Network latency impacts application scalability

  44. Queueing Theory View of the USL (15)
    Amdahl queueing theorem
    Theorem 1 (Gunther 2002)
    Amdahl’s law for parallel speedup is equivalent to the synchronous queueing
    bound on throughput in the machine repairman model of a multi-node
    computer system.
    Proof.
    1 “A New Interpretation of Amdahl’s Law and Geometric Scalability,” arXiv, submitted 17 Oct (2002)
    2 “Celebrity Boxing (and sizing): Alan Greenspan vs. Gene Amdahl,” 28th International
    Computer Measurement Group Conference, December 8–13, Reno, NV (2002)
    3 “Unification of Amdahl’s Law, LogP and Other Performance Models for Message-Passing
    Architectures”, Proc. Parallel and Distributed Computer Systems (PDCS’05) November
    14–16, 569–576, Phoenix, AZ (2005)

  45. Queueing Theory View of the USL (15)
    Queueing bounds on throughput
    X_lin(N) = N / (R_1 + Z) (linear bound)
    X_max = 1 / S_max (saturation bound)
    X_mean(N) = N / (R_N + Z) (average throughput)
    X_sync(N) = N / (N R_1 + Z) (synchronous throughput)
    X_sax = 1 / R_1 (asymptote of X_sync)
    [Figure 2: Average (blue) and synchronous (red) throughput X(N) vs. processing nodes N]

  46. Queueing Theory View of the USL (15)
    The speedup mapping
    1 Z: the mean processor execution time
    2 S_k: service time at the kth (network) queue
    3 R_1: minimum round-trip time, i.e., Σ_k S_k + Z for N = 1
    Amdahl speedup S_A(N) is given by
        S_A(N) = N R_1 / (N Σ_k S_k + Z) — queueing form (2)
        S_A(N) = N / [1 + α(N − 1)] — usual “ad hoc” law (3)
    Connecting (2) and (3) requires the identity
        Σ_k S_k / R_1 ≡ α → 0 as S_k → 0 (processor only); → 1 as Z → 0 (network only)    (4)
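    The step from (2) to (3) is a single substitution; a short check in LaTeX, using only the definitions above (Z = R_1 − Σ_k S_k and the identity (4)):

      % Substitute Z = R_1 - \sum_k S_k and \alpha = \sum_k S_k / R_1 into (2):
      S_A(N) = \frac{N R_1}{N \sum_k S_k + Z}
             = \frac{N R_1}{N \alpha R_1 + (1 - \alpha) R_1}
             = \frac{N}{1 + \alpha (N - 1)}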

  47. Queueing Theory View of the USL (15)
    USL queueing model
    [Figure 3: USL load-dependent queueing model: processing nodes coupled to a load-dependent interconnect]
    Amdahl: Robots become a finite number of (distributed) processing nodes
    Amdahl: Repair stations become the interconnect network
    Amdahl: Network latency impacts application scalability
    USL: Interconnect service time (latency) depends on queue length
    USL: Requests are first exchange-sorted, O(N²)

  48. Queueing Theory View of the USL (15)
    USL load-dependent queueing
    Theorem 2 (Gunther 2008)
    The USL relative capacity is equivalent to the synchronous queueing bound
    on throughput for a linear load-dependent machine repairman model of a
    multi-node computer system.
    Proof.
    1 Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable
    Applications and Services (Springer 2007)
    2 A General Theory of Computational Scalability Based on Rational Functions, arXiv,
    Submitted on 11 Aug (2008)
    3 Discrete-event simulations presented in “Getting in the Zone for Successful Scalability”
    Proc. Computer Measurement Group International Conference, December 7th–12, Las
    Vegas, NV (2008)

  49. Queueing Theory View of the USL (15)
    Experimental results
    [Plot: relative capacity C(N) vs. processing nodes N: synchronized repairman simulation vs. USL model (α = 0.1, β = 0), and load-dependent repairman simulation vs. USL model (α = 0.1, β = 0.001)]
    Discrete-event simulations (Jim Holtman 2008, 2018)

  50. Queueing Theory View of the USL (15)
    Key points for distributed USL
    Each USL denominator term has a physical meaning (not a purely statistical model):
    1 Constant O(1) ⇒ maximum nodal concurrency (no queueing)
    2 Linear O(N) ⇒ intra-nodal contention (synchronous queueing)
    3 Quadratic O(N²) ⇒ inter-nodal exchange (load-dependent sync queueing)
    Useful for interpreting USL analysis to developers, engineers, architects, etc.
    Example 1
    If β > α then look for inter-nodal messaging activity rather than the size of message queues in any particular node
    Worst-case queueing corresponds to an analytic equation (a rational function)
    USL is agnostic about the application architecture and interconnect topology
    So where is all that information contained? (exercise for the reader)

  51. Distributed Application Scaling (22)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  52. Distributed Application Scaling (22)
    Memcached
    “Hidden Scalability Gotchas in Memcached and Friends”
    NJG, S. Subramanyam, and S. Parvu (Sun Microsystems)
    Presented at Velocity 2010

  53. Distributed Application Scaling (22)
    Distributed scaling strategies
    [Diagrams: scaleup (a bigger node) vs. scaleout (more nodes)]

  54. Distributed Application Scaling (22)
    Scaleout generations
    Distributed cache of key-value pairs
    Data pre-loaded from RDBMS backend
    Deploy memcache on previous generation CPUs (not always multicore)
    Single worker thread ok — until next hardware roll (multicore)

  55. Distributed Application Scaling (22)
    Controlled load-tests
    [Plot: memcached throughput X(N) in KOPS vs. worker threads N]

  56. Distributed Application Scaling (22)
    Explains these memcached warnings
    Configuring the memcached server
    Threading is used to scale memcached across CPU’s. The model is
    by ‘‘worker threads’’, meaning that each thread handles concurrent
    connections. ... By default 4 threads are allocated. Setting it to
    very large values (80+) will make it run considerably slower.
    Linux man pages - memcached (1)
    -t
    Number of threads to use to process incoming requests. ... It is
    typically not useful to set this higher than the number of CPU cores
    on the memcached server. Setting a high number (64 or more) of worker
    threads is not recommended. The default is 4.

  57. Distributed Application Scaling (22)
    Load test data in R

  58. Distributed Application Scaling (22)
    R regression analysis
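    Continuing the earlier nls() sketch (the fit object and these variable names are mine), the headline numbers reported on the next slide follow directly from the fitted coefficients:

      p <- as.list(coef(fit))                # alpha, beta, gamma
      Nmax  <- sqrt((1 - p$alpha) / p$beta)  # load at peak throughput
      Xmax  <- p$gamma * Nmax /
               (1 + p$alpha * (Nmax - 1) + p$beta * Nmax * (Nmax - 1))
      Xroof <- p$gamma / p$alpha             # contention-only ceiling, X(1)/alpha
      c(Nmax = Nmax, Xmax = Xmax, Xroof = Xroof)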

  59. Distributed Application Scaling (22)
    USL scalability model
    [Plot: throughput X(N) in KOPS vs. worker threads N with the fitted USL curve]
    Fitted parameters: α = 0.0468, β = 0.021016, γ = 84.89; Nmax = 6.73, Xmax = 274.87, Xroof = 1814.82

  60. Distributed Application Scaling (22)
    Concurrency parameter
    [Plot: same USL fit, with the ideal linear bound (α = 0, β = 0) inset]
    1 γ = 84.89
    2 Slope of the linear bound in KOPS/thread
    3 Estimate of throughput X(1) = 84.89 KOPS at N = 1 thread

  61. Distributed Application Scaling (22)
    Contention parameter
    [Plot: same USL fit, with the 1/α bottleneck ceiling (α ≫ 0, β = 0) inset]
    α = 0.0468
    Waiting or queueing for resources about 4.6% of the time
    Max possible throughput is X(1)/α = 1814.78 KOPS (Xroof)

  62. Distributed Application Scaling (22)
    Coherency parameter
    [Plot: same USL fit, with the retrograde-throughput case (α > 0, β > 0) inset]
    β = 0.0210 corresponds to retrograde throughput
    Distributed copies of data (e.g., caches) have to be exchanged/updated about 2.1% of the time to be consistent
    Peak occurs at Nmax = √((1 − α)/β) = 6.73 threads
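    Plugging the fitted values into that expression confirms the reported peak (a quick arithmetic check, not on the slide):

      N_{max} = \sqrt{\frac{1 - \alpha}{\beta}} = \sqrt{\frac{1 - 0.0468}{0.021016}} \approx 6.73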

  63. Distributed Application Scaling (22)
    Improving scalability performance
    [Plot: speedup S(N) vs. threads N for memcached 1.2.8, 1.3.2, and 1.3.2 + patch]

  64. Distributed Application Scaling (22)
    Sirius and Zookeeper
    M. Bevilacqua-Linn, M. Byron, P. Cline, J. Moore and S. Muir (Comcast), “Sirius: Distributing and Coordinating Application Reference Data”, USENIX ATC 2014
    P. Hunt, M. Konar, F. P. Junqueira and B. Reed (Yahoo), “ZooKeeper: Wait-free coordination for Internet-scale systems”, USENIX ATC 2010

  68. Distributed Application Scaling (22)
    Coordination throughput data
    [Plot: writes per second vs. cluster size (1–15) for Sirius, Sirius-NoBrain, and Sirius-NoDisk, with the USL model curve]
    It’s all downhill on the coordination ski slopes
    But that’s very bad ... isn’t it?
    What does the USL say?

  69. Distributed Application Scaling (22)
    USL scalability model
    [Plot: Sirius writes per second vs. cluster size with the fitted USL curve]
    Fitted parameters: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54

  70. Distributed Application Scaling (22)
    Concurrency parameter
    [Plot: same Sirius USL fit, with the ideal linear bound inset]
    1 γ = 1024.98
    2 Single node is meaningless (need N ≥ 3 for a majority)
    3 Interpret γ as the N = 1 virtual throughput
    4 USL estimates X(1) = 1,024.98 WPS (black square)

  71. Distributed Application Scaling (22)
    Contention parameter
    [Plot: same Sirius USL fit, with the 1/α ceiling inset]
    α = 0.05
    Queueing for resources about 5% of the time
    Max possible throughput is X(1)/α = 20,499.54 WPS (Xroof)
    But Xroof is not feasible in these systems

  72. Distributed Application Scaling (22)
    Coherency parameter
    [Plot: same Sirius USL fit, with the retrograde-throughput case inset]
    β = 0.1651 says retrograde throughput dominates!
    Distributed data being exchanged (compared?) about 16.5% of the time
    (Virtual) peak at Nmax = √((1 − α)/β) = 2.4 cluster nodes
    Shocking, but that’s what Paxos-type scaling looks like

  73. Distributed Application Scaling (22)
    Comparison of Sirius and Zookeeper
    [Left plot: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
    [Right plot: Zookeeper requests per second vs. cluster size (mean with confidence intervals CI_lo, CI_hi); USL fit: α = 0.05, β = 0.157701, γ = 41536.96; Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
    Both coordination services exhibit equivalent USL scaling parameters:
    5% contention delay — not “wait-free” (per the Yahoo paper title)
    16% coherency delay
    despite Sirius being write-intensive and Zookeeper being read-intensive (i.e., 100x throughput)

  74. Distributed Application Scaling (22)
    Distributed Ledger Scalability
    A. Charapko, A. Ailijiang and M. Demirbas “Bridging Paxos and
    Blockchain Consensus”, 2018
    Team Rocket (Cornell), “Snowflake to Avalanche: A Novel
    Metastable Consensus Protocol Family for Cryptocurrencies”, 2018

  75. Distributed Application Scaling (22)
    Avalanche protocol
    VISA can handle 60,000 TPS
    Bitcoin bottlenecks at 7 TPS [8]
    That’s FOUR decades (orders of magnitude) of performance inferiority!
    Avalanche rumor-mongering coordination:
    Leaderless BFT
    Multiple samples of a small nodal neighborhood (n_s)
    Threshold > k · n_s when most of the system has “heard” the rumor
    Greener than Bitcoin (Nakamoto)
    No PoW needed
    [8] K. Stinchcombe, “Ten years in, nobody has come up with a use for blockchain,” Hackernoon, Dec 22, 2017

  76. Distributed Application Scaling (22)
    Avalanche scalability
    [Plot: throughput (TPS) vs. network nodes with the fitted USL curve]
    Fitted parameters: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41

  77. Distributed Application Scaling (22)
    Comparison with Paxos
    [Left plot: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
    [Right plot: Sirius writes per second vs. cluster size; USL fit: α = 0.05, β = 0.165142, γ = 1024.98; Nmax = 2.4, Xmax = 1513.93, Xroof = 20499.54]
    Contention: α about 90%
    X(1)/α ⇒ immediate saturation at about 2000 TPS
    Coherency: β extremely small but not zero
    Much slower ‘linear’ degradation than Zookeeper/Paxos
    Saturation throughput maintained over 2000 nodes (100x Sirius)

  78. Distributed Application Scaling (22)
    Hadoop Super Scaling
    NJG, P. J. Puglia, and K. Tomasette, “Hadoop Superlinear
    Scalability: The perpetual motion of parallel performance”, Comm.
    ACM, Vol. 58 No. 4, 46–55 (2015)

  79. Distributed Application Scaling (22)
    Super-scalability is a thing
    [Plot: speedup vs. processors showing superlinear, linear, and sublinear regimes]
    More than 100% efficient!

  80. Distributed Application Scaling (22)
    Seeing is believing
    “Speedup in Parallel Contexts.”
    http://en.wikipedia.org/wiki/Speedup#Speedup_in_Parallel_Contexts
    “Where does super-linear speedup come from?”
    http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from
    “Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst
    Benchmarks.”
    https://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop
    Haas, R. “Scalability, in Graphical Form, Analyzed.”
    http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
    Sutter, H. 2008. “Going Superlinear.”
    Dr. Dobb’s Journal 33(3), March.
    http://www.drdobbs.com/cpp/going-superlinear/206100542
    Sutter, H. 2008. “Super Linearity and the Bigger Machine.”
    Dr. Dobb’s Journal 33(4), April.
    http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306
    “SDN analytics and control using sFlow standard — Superlinear.”
    http://blog.sflow.com/2010/09/superlinear.html
    Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com

  81. Distributed Application Scaling (22)
    Empirical evidence
    16-node Sun Fire X2270 M2 cluster

  82. Distributed Application Scaling (22)
    Smells like perpetual motion

  83. Distributed Application Scaling (22)
    Terasort cluster emulations
    Want a controlled environment to study superlinearity
    Terasort workload sorts 1 TB data in parallel
    Terasort has benchmarked Hadoop MapReduce performance
    We used just 100 GB data input (weren’t benchmarking anything)
    Simulate in AWS cloud EC2 instances (more flexible and much cheaper)
    Enabled multiple runs in parallel (thank you Comcast for cycle $s)
    Table 1: Amazon EC2 Configurations
             Optimized for  Processor Arch  vCPUs  Memory (GiB)  Instance Storage (GB)  Network Performance
    BigMem   Memory         64-bit          4      34.2          1 x 850                Moderate
    BigDisk  Compute        64-bit          8      7             4 x 420                High

  84. Distributed Application Scaling (22)
    USL analysis of BigMem N ≤ 50 nodes
    [Plot: “USL Analysis of BigMem Hadoop Terasort”: speedup S(N) vs. EC2 m2 nodes N]
    Fitted parameters: α = −0.0288, β = 0.000447; Nmax = 47.96, Smax = 73.48, Sroof = NA, Ncross = 64.5

  85. Distributed Application Scaling (22)
    USL analysis of BigMem N ≤ 150 nodes
    [Plot: “USL Model of BigMem Hadoop TS Data”: speedup S(N) vs. EC2 m2 nodes N]
    Fitted parameters: α = −0.0089, β = 9e−05; Nmax = 105.72, Smax = 99.53, Sroof = NA, Ncross = 99.14

  86. Distributed Application Scaling (22)
    Speedup on N ≤ 10 BigMem nodes (1 disk)
    [Plot: “BigMem Hadoop Terasort Speedup Data”: speedup S(N) vs. EC2 m2 nodes N, with superlinear and sublinear regions]

  87. Distributed Application Scaling (22)
    Speedup on N ≤ 10 BigDisk nodes (4 disks)
    [Plot: “BigDisk Hadoop Terasort Speedup Data”: speedup S(N) vs. EC2 c1 nodes N, with superlinear and sublinear regions]

  88. Distributed Application Scaling (22)
    Superlinear payback trap
    Superlinearity is real (i.e., measurable)
    But due to a hidden resource constraint (disk or memory subsystem sizing)
    Sets the wrong normalization for the speedup scale
    Shows up in the USL as a negative α value (cf. GCAP book)
    Theorem 3 (Gunther 2013)
    The apparent advantage of superlinear scaling (α < 0) is inevitably lost due to crossover into a region of severe performance degradation (β > 0) that can be worse than if superlinearity had not been present in the first place. Proof was presented at Hotsos 2013.
    [Plots: superlinear, linear, and sublinear speedup vs. processors; and the superlinear payback crossover]

  89. Distributed in the Cloud (30)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  90. Distributed in the Cloud (30)
    AWS Cloud Application
    “Exposing the Cost of Performance Hidden in the Cloud”
    NJG and M. Chawla
    Presented at CMG cloudXchange 2018

  91. Distributed in the Cloud (30)
    Production data
    Previously measured X and R directly on a test rig
    Table 2: Converting data to performance metrics
    Data    Meaning           Metric             Meaning
    T       Elapsed time      X = C/T            Throughput
    Tp      Processing time   R = (Tp/T)(T/C)    Response time
    C       Completed work    N = X × R          Concurrent threads
    Ucpu    CPU utilization   S = Ucpu/X         Service time
    Example 4 (Coalesced metrics)
    Linux epoch Timestamp interval between rows is 300 seconds
    Timestamp, X, N, S, R, U_cpu
    1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
    1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
    1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
    1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
    1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
    ...
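    A minimal sketch of the Table 2 conversions in R; the raw row is reverse-engineered from the first CSV line above, so it is illustrative rather than measured:

      # One raw 300 s sample: elapsed time T (s), processing time Tp (s),
      # completions C, and CPU utilization Ucpu.
      raw <- data.frame(T = 300, Tp = 50724, C = 150650, Ucpu = 0.45812)
      X <- raw$C / raw$T     # throughput: ~502.2 RPS
      R <- raw$Tp / raw$C    # response time: (Tp/T)(T/C) simplifies to Tp/C, ~0.337 s
      N <- X * R             # concurrent threads via Little's law: ~169
      S <- raw$Ucpu / X      # service time: ~0.000912 s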

  92. Distributed in the Cloud (30)
    Tomcat data from AWS
    [Scatter plot: throughput (RPS) vs. Tomcat threads]

  93. Distributed in the Cloud (30)
    USL nonlinear analysis
    [Plot: throughput (RPS) vs. Tomcat threads with the fitted USL curve]
    Fitted parameters: α = 0, β = 3e−06, γ = 3; Nmax = 539.2, Xmax = 809.55, Nopt = 274.8

  94. Distributed in the Cloud (30)
    Concurrency parameter
    [Plot: same Tomcat USL fit, with the ideal linear bound inset]
    1 γ = 3.0
    2 Smallest number of threads during the 24 hr sample is N > 100
    3 Nonetheless USL estimates throughput X(1) = 3 RPS

  95. Distributed in the Cloud (30)
    Contention parameter
    [Plot: same Tomcat USL fit, with the 1/α ceiling inset]
    α = 0
    No significant waiting or queueing
    Max possible throughput Xroof not defined

  96. Distributed in the Cloud (30)
    Coherency parameter
    [Plot: same Tomcat USL fit, with the retrograde-throughput case inset]
    β = 3 × 10−6 implies very weak retrograde throughput
    Extremely little data exchange
    But entirely responsible for the sublinearity
    And the peak throughput Xmax = 809.55 RPS
    Peak occurs at Nmax = 539.2 threads

  97. Distributed in the Cloud (30)
    Revised USL analysis
    Parallel threads imply linear scaling
    Linear slope γ ∼ 3 (revised fit: γ = 2.65)
    Should be no contention, i.e., α = 0
    Discontinuity at N ∼ 275 threads
    Throughput plateaus, i.e., β = 0
    Saturation occurs at processor utilization U_cpu ≥ 75%
    Linux OS can’t do that!
    Pseudo-saturation due to AWS Auto Scaling policy (hypervisor?)
    Many EC2 instances spun up and down during the 24 hrs

  98. Distributed in the Cloud (30)
    Corrected USL linear model
    [Plot: throughput (RPS) vs. Tomcat threads: a parallel-threads (linear) regime followed by pseudo-saturation]
    Fitted parameters: α = 0, β = 0, γ = 2.65; Nmax = NaN, Xmax = 727.03, Nopt = 274.8

  99. Optimal Node Sizing (38)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  100. Optimal Node Sizing (38)
    1. Canonical sublinear scalability
    [Plot: throughput X(N) vs. processes N; USL fit: α = 0.05, β = 6e−05, γ = 20.8563; Nmax = 125.5, Xmax = 320.47, Xroof = 417.13]
    Amdahl curve (dotted)
    Linear scaling bound (LHS red)
    Saturation asymptote (top red)
    Nopt = 1/α = 20 at the intersection
    USL curve (solid blue) is retrograde
    Nopt shifts right to Nmax = 125
    Throughput is lower than Xroof
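    Why Nopt = 1/α: the linear bound X = γN meets the saturation asymptote X = γ/α exactly there (a one-line derivation, not shown on the slide):

      \gamma N_{opt} = \frac{\gamma}{\alpha}
        \;\Rightarrow\; N_{opt} = \frac{1}{\alpha} = \frac{1}{0.05} = 20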

  101. Optimal Node Sizing (38)
    2. Memcache scalability
    [Left plot: throughput X(N) vs. worker threads; USL fit: α = 0.0468, β = 0.021016, γ = 84.89; Nmax = 6.73, Xmax = 274.87, Nopt = 21.38]
    [Right plot: throughput X(N) vs. worker processes; USL fit: α = 0, β = 0.00052, γ = 17.76; Npeak = 43.87, Xpeak = 394.06, Nopt = Inf]
    Initial scalability (green curve):
    Nmax occurs before Nopt
    Nmax = 6.73
    Nopt = 1/0.0468 = 21.37
    Sun SPARC patch (blue curve):
    Nmax = 43.87
    Now exceeds the previous Nopt
    And throughput increased by approx 50%

  102. Optimal Node Sizing (38)
    3. Cloud scalability
    [Left plot: throughput (RPS) vs. Tomcat threads; α = 0, β = 0, γ = 2.65; Nmax = NaN, Xmax = 727.03, Nopt = 274.8; parallel threads, then pseudo-saturation]
    [Right plot: response time R(N) vs. Tomcat processes; α = 0, β = 0, γ = 2.65; Rmin = 0.38, Sbnk = 0.001365, Nopt = 274.8]
    Throughput profile:
    X(N) data tightly follow the bounds
    Nopt = 274.8 due to the autoscaling policy
    Significant dispersion about the mean X
    Response time profile:
    Autoscaling throttles throughput
    Causes R(N) to increase linearly
    Will users tolerate R(N) for N > Nopt?

  103. Optimal Node Sizing (38)
    4. DLT scalability
    [Left plot: Zookeeper requests per second vs. cluster size; USL fit: α = 0.05, β = 0.157701, γ = 41536.96; Nmax = 2.45, Xmax = 62328.47, Xroof = 830739.1]
    [Right plot: Avalanche throughput (TPS) vs. network nodes; USL fit: α = 0.9, β = 6.5e−05, γ = 1648.27; Npeak = 39.15, Xpeak = 1821.21, Xroof = 1831.41]
    Zookeeper throughput profile:
    The optimum is the smallest configuration
    Likely bigger in real applications
    Coordination kills
    Scaling degrades (that’s how it works!)
    No good for DLT
    Avalanche throughput profile:
    There is no optimum
    Throughput similar to the Zookeeper peak
    Pick a number N
    Throughput is independent of scale
    But the N scale is 100x Zookeeper
    Generally needed for DLT

  104. Summary (45)
    Outline
    1 Introduction
    2 Universal Computational Scalability (8)
    3 Queueing Theory View of the USL (15)
    4 Distributed Application Scaling (22)
    5 Distributed in the Cloud (30)
    6 Optimal Node Sizing (38)
    7 Summary (45)

  105. Summary (45)
    Summary
    All performance is nonlinear
    Need both data and models (e.g., USL)
    USL is a nonlinear (rational) function
    USL is architecture agnostic
    All scalability information contained in parameters:
    1 α: contention, queueing
    2 β: coherency, consistency (pairwise data exchange)
    3 γ: concurrency, X(1) value
    Apply USL via nonlinear statistical regression
    CAP-like consistency associated with magnitude of quadratic β term
    Can apply USL to both controlled (testing) and production platforms
    USL analysis often different from paper designs and algorithms

  106. Summary (45)
    Thank you for the (virtual) invitation!
    www.perfdynamics.com
    Castro Valley, California
    Twitter twitter.com/DrQz
    Facebook facebook.com/PerformanceDynamics
    Blog perfdynamics.blogspot.com
    Training classes perfdynamics.com/Classes
    [email protected]
    +1-510-537-5758