Quantifying Scalability FTW (∆t < 1 hour)
Dr. Neil J. Gunther, Performance Dynamics
SURGE 2010, Sept 30 – Oct 1
© 2010 Performance Dynamics, Quantifying Scalability FTW, October 1, 2010, slide 1 / 45
- Assessing the cost-benefit of a given scalability strategy
- Quantify system scalability
- Scalability is not a single number (it's a function)
- All measurements are wrong by definition: need a framework to validate data
- Measurement + model == information

Scalability: sustainable performance under increasing load (size N)
...magic beanstalk up into the clouds (10,000 ft?)
Guarded by a giant who is 10 times bigger than Jack
"Fee-fie-foe-fum!" and all that
Can a 10,000′ beanstalk (or the giant) exist?
Guinness world record: Robert P. Wadlow (USA), height 8′11″ (2.72 m)
Jack: height 1.8 m (L), weight 90 kg
Giant (10× bigger): height 18 m (10 × L); weight scales as L³, so 10³ × 90 kg = 90,000 kg
A bone-crushing ~100 tons!
...limits to sustainable loads: when the load (volume) exceeds the material strength (supporting area), things tend to snap.
Load ∼ L³ (volume), but strength ∼ L² (cross-sectional area)
Computer scalability: no critical limit, but a point of diminishing returns
Scalability is about sustainable size
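The L³-versus-L² argument above can be checked in a couple of lines. This is my own illustration (the function names are not from the talk), just reproducing the slide's arithmetic:

```python
# Scale a body by a linear factor k:
#   load (weight) grows with volume   ~ k^3
#   strength grows with cross-section ~ k^2
# so the stress on the supporting structure grows ~ k.

def scaled_load_kg(base_mass_kg, k):
    """Weight after scaling every linear dimension by k (volume scaling)."""
    return base_mass_kg * k ** 3

def stress_ratio(k):
    """How much harder the structure works: load/strength = k^3 / k^2 = k."""
    return k ** 3 / k ** 2

# The slide's giant: Jack at 90 kg, scaled up 10x -> 90,000 kg.
giant_mass = scaled_load_kg(90, 10)
```

So the 10× giant carries 1000× the weight on only 100× the bone area: 10× the stress, which is why he (and the beanstalk) can't exist.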
[Plot: throughput vs. Users (N up to ~1000, throughput up to ~600)]
The critical point is the maximum in the throughput curve.
Beyond the maximum: performance degradation, or retrograde scalability.
...paper "Parallel Analysis with Sawzall":
"If scaling were perfect, performance would be proportional to the number of machines... In our test, the effect is to contribute 0.98 machines."
Translation: not 100% linear, but 98% of linear.
[Slide reproduces a page of the Sawzall paper, including its Figure 2: the overall flow of filtering, aggregating, and collating.]
Theo Schlossnagle: "Linear scaling is simply a falsehood" (p. 71)
- Scalability is a function, not a number
- There are always limits, e.g., throughput capacity
- We want to quantify such limits
...math. Quantifying scalability requires some math, but nothing as complicated as this¹:

Pr{Murphy} = [(U + C + I) × (10 − S)] / 20 × A / (1 − sin(F/10))

I have no idea what this equation is (ask Theo).

¹ Source: Theo Schlossnagle, Scalable Internet Architectures, p. 12
...together: N users or processes. C is the scalability function of N:

C(N, α, β) = N / (1 + α(N − 1) + β N(N − 1))

Three Cs:
1. Concurrency
2. Contention (amount α)
3. Consistency, as in ACID & the CAP theorem (amount β)

Theorem (Universality): only the α, β coefficients are needed to determine the maximum in C(N).
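The USL above is simple to sketch in code. A minimal illustration (function names are mine); the expression N* = √((1 − α)/β) for the location of the maximum is the standard USL result implied by the Universality theorem:

```python
import math

# Universal Scalability Law from the slide:
#   C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
# alpha = contention, beta = consistency (coherency) penalty.

def usl(N, alpha, beta):
    """Relative capacity at concurrency N."""
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

def n_max(alpha, beta):
    """Concurrency at which C(N) peaks: N* = sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)
```

As a sanity check against the memcached table later in the deck: with the 1.2.8 fit (α = 0.0255, β = 0.0210), n_max gives ≈ 6.8, consistent with the reported Nmax = 7.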
Data come from the Devil; models come from God. Skepticism should rule!

Theorem: Data + Models ≡ Insight
Data needs to be put in prison (a model) and made to confess the truth.
Corollary: Waterboarding your data is OK.
...looks OK visually, but some data are > 100% efficient. Can't haz! Or you have some very serious explaining to do.
...table of various USL quantities. Column F is the scaling efficiency C(N)/N. Between N = 5 and 150 vusers, efficiencies are > 1.0, but you can't have more than 100% of anything. Need to explain? Data + Model == Information: merely attempting to set up the USL model in Excel or R shows that the measurement data (not the model) are wrong.
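The efficiency check described above is mechanical enough to automate. A minimal sketch, with made-up throughput numbers and hypothetical function names (not the talk's spreadsheet):

```python
# Scaling efficiency is C(N)/N, where C(N) = X(N)/X(1) is throughput
# relative to the single-user baseline. Efficiency > 1.0 is impossible,
# so any such point flags bad *measurements*, not a bad model.

def efficiencies(data):
    """data: list of (N, X_N) pairs, first entry the N=1 baseline."""
    n1, x1 = data[0]
    assert n1 == 1, "need a single-user baseline measurement"
    return [(n, (x / x1) / n) for n, x in data]

def suspect_points(data):
    """Return the (N, efficiency) pairs that exceed 100%."""
    return [(n, e) for n, e in efficiencies(data) if e > 1.0]

# Example: the N=5 point is super-linear (efficiency 1.04), so it
# needs explaining before any USL fit is attempted.
measured = [(1, 100), (5, 520), (10, 900)]
```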
Want to compare capacity upgrades for a Sun E10K backend running the ORA dbms for both OLTP and DSS.
- eBay 1.0 had no performance measurements of their app
- eBay Inc. was just hiring into a QA/load-test group
- No scalability measurements
- Sun PS provided me with their M-values
- M-values ⇒ α ≈ 0.005
- But that's only α = ½% contention ... WTF!?
- An ORA dbms is more typically α ≈ 3%
- But at least we have things quantified
Solution: apply the USL model to Sun's M-values.

C(N) = N / (1 + α(N − 1) + β N(N − 1))

ORA backend ⇒ α ≈ 0.03. Simply re-run the USL curves with that value. Voilà! This creates a scalability envelope; eBay's mileage will vary within that envelope.
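The envelope idea can be sketched directly: evaluate the same USL curve at both contention values and treat the pair as lower and upper bounds. This is my own illustration, assuming β = 0 since only contention (α) is discussed here:

```python
# Scalability envelope: run the USL with the optimistic Sun figure
# (alpha = 0.005) and the typical Oracle figure (alpha = 0.03).
# Real throughput should land between the two curves.
# beta = 0 is an assumption for this sketch, not a fitted value.

def usl(N, alpha, beta=0.0):
    return N / (1 + alpha * (N - 1) + beta * N * (N - 1))

def envelope(N, a_low=0.005, a_high=0.03):
    """Return (pessimistic, optimistic) relative capacity at N."""
    return usl(N, a_high), usl(N, a_low)   # more contention => lower C
```

For example, at N = 64 CPUs the envelope runs from roughly 22 (α = 3%) up to roughly 49 (α = ½%) relative capacity, which is exactly why the half-percent figure looked too good to be true.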
...servers
- Servers are often blades
- Mostly single processor
- Single threading OK
...blades will be replaced with multicores; multicores will be the only game in town (an HW vendor decision).
The Problem: memcached is thread-limited on multicores.
...core + SMT

Version | α      | β      | Nmax
1.2.8   | 0.0255 | 0.0210 | 7
1.4.1   | 0.0821 | 0.0207 | 6
1.4.5   | 0.0988 | 0.0209 | 6

Little's law³: N = X(R + Z) threads
- R is on the order of ms (10⁻³ s), so latency is dominated by the client-side "think time" Z = 5 s in these tests
- Avg X ≈ 350 KOPS on an Intel quad-core
- Therefore N ≈ 350 × 10³ × 5 = 1,750,000 threads
- Same as users, assuming 1 user process per thread

³ See e.g., Scalable Internet Architectures, p. 127
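The Little's-law arithmetic above is easy to reproduce; a one-function sketch (names are mine):

```python
# Little's law as used on the slide: N = X * (R + Z).
# With R in the millisecond range and think time Z = 5 s, latency is
# dominated by Z, so N is essentially X * Z.

def concurrency(X_ops_per_s, R_s, Z_s):
    """Offered concurrency (threads/users) implied by Little's law."""
    return X_ops_per_s * (R_s + Z_s)

# Slide's numbers: X ~= 350 KOPS, Z = 5 s, R taken as negligible.
n_users = concurrency(350e3, 0.0, 5.0)   # 1,750,000
```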
Solaris

Version  | α      | β      | Nmax
Vanilla  | 0.0041 | 0.0092 | 22
Modified | 0.0000 | 0.0004 | 48

The Solution: a partitioned mcd hash table
- Single-hash-table contention avoided by partitioning the table
- Solaris patches improve scalability to ≈ 40 threads
- Throughput X increases from 200 → 400 KOPS on SPARC CMT
- Can't assume the same 2× win on the x86 arch
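The partitioning idea behind that fix can be sketched generically. This is not the Solaris patch or memcached's code, just a toy illustration of per-partition locking in place of one global lock:

```python
import threading

# Instead of one lock guarding one hash table, stripe the table into
# independent partitions, each with its own lock. Threads touching
# different keys then no longer contend, shrinking the effective
# alpha (contention) term in the USL.

class PartitionedTable:
    def __init__(self, partitions=16):
        self.tables = [{} for _ in range(partitions)]
        self.locks = [threading.Lock() for _ in range(partitions)]

    def _index(self, key):
        # Key's hash picks its partition.
        return hash(key) % len(self.tables)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:
            self.tables[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self.locks[i]:
            return self.tables[i].get(key, default)
```

With 16 partitions, two threads collide on a lock only when their keys hash to the same stripe, rather than on every operation.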
...CTO: "Scalability is hard because it cannot be an after-thought. Good scalability is possible, but only if we architect and engineer our systems to take scalability into account."
- Old reason: concurrent programming was hard on SMPs
- New reason: multicores are SMPs on a chip (an HW vendor decision)
- More threads enable higher concurrency and shorter user latencies
- But it's hard: beware the third C in the USL (the β coefficient)

Theo Schlossnagle, OmniTI CEO: "Simply having a solution that scales horizontally doesn't mean that you are safe."
...than Scalability curves
[Plot: throughput X vs. N (N up to ~120, X up to ~1000), with WebSphere measurements (dots) and three curves]
- A: Asynchronous messaging (average queue lengths)
- B: Synchronous messaging (worst queue lengths)
- C: Synchronous messaging + pairwise exchanges