as configuration recipes
Lots of words (on blogs), but no numbers
Cost-benefit analysis demands numbers, not words.
Need to measure scalability appropriately to quantify it.
But how?
Need controlled measurements (e.g., Apache JMeter)
Cannot understand scalability by monitoring production systems. The human brain is not built for that.
Need to transform time-series data into informational performance metrics.
© 2016 Performance Dynamics Labs, Hadoop Super Scaling, August 8, 2016

performance would be proportional to the number of machines. In our test, it was 0.98 of the machines.”

Since the data records we wish to process do live on many machines, it would be fruitful to exploit the combined computing power to perform these analyses. In particular, if the individual steps can be expressed as query operations that can be evaluated one record at a time, we can distribute the calculation across all the machines and achieve very high throughput. The results of these operations will then require an aggregation phase. For example, if we are counting records, we need to gather the counts from the individual machines before we can report the total count.

We therefore break our calculations into two phases. The first phase evaluates the analysis on each record individually, while the second phase aggregates the results (Figure 2). The system described in this paper goes even further, however. The analysis in the first phase is expressed in a new procedural programming language that executes one record at a time, in isolation, to calculate query results for each record. The second phase is restricted to a set of predefined aggregators that process the intermediate results generated by the first phase. By restricting the calculations to this model, we can achieve very high throughput. Although not all calculations fit this model well, the ability to harness a thousand or more machines with a few lines of code provides some compensation.

Figure 2: The overall flow of filtering, aggregating, and collating. Each stage typically involves less data than the previous.

Of course, there are still many subproblems that remain to be solved. The calculation must be divided into pieces and distributed across the machines holding the data, keeping the computation as near the data as possible to avoid network bottlenecks. And when there are many machines there is a high probability of some of them failing during the analysis, so the system must be

Translation: MR scalability is 98% of ideal linear scaling
Scalability is a function, not a single number
Diminishing returns due to increasing overhead
Want to express overhead loss quantitatively
But what (mathematical) function?

processing performance
I'll denote it by the symbol S_p in this talk
Expect S_p = p if linear with p parallel processors
Superlinear if S_p > p
Example (MIT Swarm processor)
Some of these speedup profiles look superlinear(?)
[Figure 9 plot: panels bfs, sssp, astar, msf, des, silo; x-axis 1c to 64c (cores); y-axis Speedup; series Swarm and Software-only parallel]
Figure 9. Speedup of Swarm and state-of-the-art software-parallel implementations from 1 to 64 cores, relative to a tuned serial implementation running on a system of the same size. At 64 cores, Swarm programs are 43 to 117 times faster than the serial versions and 2.7 to 18 times faster than software-parallel versions.
"Unlocking Ordered Parallelism with the Swarm Architecture," IEEE Micro, vol. 36, no. 3, pp. 105-117 (2016)
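To make the S_p > p test concrete, here is a minimal Python sketch. It assumes the usual definition of speedup as a runtime ratio, S_p = T_1 / T_p, which the slide itself does not spell out. The runtimes and the function names are hypothetical and chosen only for illustration.

```python
def speedup(t1, tp):
    """Speedup S_p = T_1 / T_p (standard definition, assumed here)."""
    return t1 / tp


def classify(p, s_p, tol=1e-9):
    """Label a measured speedup relative to the linear expectation S_p = p."""
    if s_p > p + tol:
        return "superlinear"
    if s_p < p - tol:
        return "sublinear"
    return "linear"


# Hypothetical runtimes in seconds, keyed by number of processors p.
runtimes = {1: 100.0, 2: 50.0, 4: 22.0, 8: 15.0}

t1 = runtimes[1]
profile = {p: classify(p, speedup(t1, tp)) for p, tp in runtimes.items()}

for p in sorted(profile):
    print(p, round(speedup(t1, runtimes[p]), 2), profile[p])
```

Note that every value depends on the single measurement T_1, so any error in that baseline shifts the whole speedup profile.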

    Components of Scalability
    Universal Scalability Law (USL)
2 Applying the USL
    Varnish
    Memcached
    Tomcat
    Java Application
    Sirius (Zookeeper)
3 Superlinear Scaling
    What it looks like
    Perpetual motion
    Hunting the Superlinear Snark
4 Superlinear Payback Trap

Law (USL)
p processors or processes provide system load
S_p speedup performance function ≡ normalized thruput
Question: What kind of function is S_p?
Answer: A rational function [1]

$$ S_p(\sigma, \kappa) = \frac{p}{1 + \sigma\,(p - 1) + \kappa\,p\,(p - 1)} $$

The three Cs:
1 Concurrency
2 Contention (0 < σ < 1)
3 Coherency (0 < κ < 1)

[1] NJG. "A Simple Capacity Model of Massively Parallel Transaction Systems," CMG Conf. 1993
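As a rough illustration of how the rational function behaves, the following Python sketch evaluates the USL and locates its maximum. The peak formula p* = sqrt((1 − σ)/κ) is not on the slide; it follows from setting dS_p/dp = 0 and holds only for κ > 0. The coefficient values are hypothetical.

```python
import math


def usl(p, sigma, kappa):
    """USL speedup S_p = p / (1 + sigma*(p - 1) + kappa*p*(p - 1))."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


def peak_load(sigma, kappa):
    """Load p* that maximizes S_p; derived from dS_p/dp = 0, needs kappa > 0."""
    return math.sqrt((1.0 - sigma) / kappa)


# Hypothetical coefficients, for illustration only.
sigma, kappa = 0.05, 0.001

p_star = peak_load(sigma, kappa)
print("peak near p =", round(p_star, 1))
for p in (1, 10, 31, 60):
    print(p, round(usl(p, sigma, kappa), 2))
```

With σ = κ = 0 the function reduces to linear scaling S_p = p, and with κ = 0 alone it reduces to Amdahl's law.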

we determine σ and κ?

$$ S(p, \sigma, \kappa) = \frac{p}{1 + \sigma\,(p - 1) + \kappa\,p\,(p - 1)} $$

Brute force measurements (good luck!)
Data from controlled measurements, e.g., JMeter
Clever way: Apply statistical regression
I'll use R stats tools throughout this talk:
    FOSS with 40 yr history since S at Bell Labs
    GDAT: Guerrilla Data Analysis Techniques
Magic functions in R:
    nls() nonlinear regression → σ, κ in one swell foop
    optimize() to estimate X_data(1) if missing
    predict() smooth interpolation/extrapolation from data
    plot() with various bells & whistles
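The talk does its fitting with nls() in R. As a dependency-free sketch of the same idea, the Python below swaps in a different technique: it linearizes the USL as p/S − 1 = σ(p − 1) + κ p(p − 1) and solves the resulting two-parameter least-squares problem directly. This weights the data differently from a nonlinear fit on S, so the estimates can differ on noisy data. The function names and the synthetic throughput values are assumptions, and the sketch requires a measurement at p = 1 rather than estimating it.

```python
def usl(p, sigma, kappa):
    """USL speedup S_p = p / (1 + sigma*(p - 1) + kappa*p*(p - 1))."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


def fit_usl(loads, throughputs):
    """Estimate (sigma, kappa) by least squares on the linearized USL.

    Assumes loads contains p = 1, which is used to normalize throughput
    into speedup, and at least two distinct loads with p > 1.
    """
    x1 = throughputs[loads.index(1)]
    saa = sab = sbb = say = sby = 0.0
    for p, x in zip(loads, throughputs):
        s = x / x1                # normalized throughput = speedup
        y = p / s - 1.0           # deviation from linear scaling
        a = p - 1.0               # regressor multiplying sigma
        b = p * (p - 1.0)         # regressor multiplying kappa
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    det = saa * sbb - sab * sab   # determinant of the 2x2 normal equations
    sigma = (say * sbb - sab * sby) / det
    kappa = (saa * sby - sab * say) / det
    return sigma, kappa


# Synthetic, noise-free data generated from known (hypothetical) coefficients.
loads = [1, 2, 4, 8, 16, 32]
throughputs = [1000.0 * usl(p, 0.03, 0.0005) for p in loads]

sigma_hat, kappa_hat = fit_usl(loads, throughputs)
print(sigma_hat, kappa_hat)
```

On real measurements, a fitted σ that comes out negative is the case the later slides focus on.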

proxy caching system
Sits in front of classic web server
Caching handled by virtual memory
Claim: Highly scalable (read: linear) ... but is it?

key-value pairs
Pre-loaded from RDBMS
Deploy mcd on tier of cheap, older CPUs (but not multicores)
Single-threaded mcd OK, until next hardware roll (i.e., multicores)

and K. Tomasette (Comcast)
Published in journal Comm. ACM, Vol. 58, No. 4, April 2015 and online at ACM Queue (unabridged)

speedup come from?" http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from
"Sun Fire X2270 M2 Super-Linear Scaling of Hadoop TeraSort and CloudBurst Benchmarks." https://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop
Haas, R. "Scalability, in Graphical Form, Analyzed." http://rhaas.blogspot.com/2011/09/scalability-in-graphical-form-analyzed.html
Sutter, H. 2008. "Going Superlinear." Dr. Dobb's Journal 33(3), March. http://www.drdobbs.com/cpp/going-superlinear/206100542
Sutter, H. 2008. "Super Linearity and the Bigger Machine." Dr. Dobb's Journal 33(4), April. http://www.drdobbs.com/parallel/super-linearity-and-the-bigger-machine/206903306
"SDN analytics and control using sFlow standard: Superlinear." http://blog.sflow.com/2010/09/superlinear.html
Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. Lulu.com

on SunFire cluster
Superlinear speedup on 16-node SunFire (158% linear scaling)
Linear superlinearity ??? ← Ship it !!!

motion contraptions violate conservation of energy law.
    Super efficiency is tantamount to more than 100% output.
    Even if you know it's wrong, proving it is the hard part.
Superlinear scalability:
    Superlinearity exceeds 100% of total capacity.
    Violates the Universal Scalability Law (USL) bounds.
    Again, proving it wrong is the hard part.
    Requires serious analysis and debugging.

to study superlinearity
TeraSort workload sorts 1 TB of data in parallel
TeraSort has benchmarked Hadoop MapReduce performance
We used just 100 GB data input (not benchmarking anything)
Simulate in AWS cloud (more flexible and much cheaper)
Many test runs, some done in parallel

Table 1: Amazon EC2 Configurations
          Optimized for  Processor Arch  vCPU number  Memory (GiB)  Instance Storage (GB)  Network Performance
BigMem    Memory         64-bit          4            34.2          1 x 850                Moderate
BigDisk   Compute        64-bit          8            7             4 x 420                High

USL contention coefficient is negative:
σ = −0.0288        σ = −0.0089
The sign that superlinear scaling is really there (get it?)
Positive σ means capacity loss due to overhead
Negative σ therefore implies capacity gain or credit
But what could provide such credit?
And like a credit card, do you have to pay for it later?
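The effect of the sign of σ can be sketched numerically with the efficiency S_p / p. The two negative σ values below are the ones on the slide. The slide gives no κ, so setting κ = 0 is an assumption made here only to isolate the sign of σ, and the positive σ is a hypothetical contrast case.

```python
def efficiency(p, sigma, kappa=0.0):
    """Efficiency S_p / p under the USL; 1.0 means ideal linear scaling.

    kappa defaults to 0.0, an assumption made to isolate the sign of sigma.
    """
    return 1.0 / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


# First two sigma values are from the slide; the positive one is hypothetical.
for sigma in (-0.0288, -0.0089, 0.0288):
    row = [round(efficiency(p, sigma), 3) for p in (1, 4, 16)]
    print(sigma, row)
```

Efficiency above 1.0 for p > 1 corresponds to the capacity credit described above, and efficiency below 1.0 corresponds to the usual capacity loss.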

Configs
From Table 1:
1 BigMem has 1 disk per EC2 node
2 BigDisk has 4 disks per EC2 node

capacity credit produced by IO bottleneck:
    Credit = Gradual reduction in IO constraint
    Relaxation of the latent IO bandwidth constraint.
    Constraint decreases with cluster size p = 1, 2, 3, ...
2 IO bottleneck induces random Reducer retries:
    Up to 10% variation in runtimes
    Stretches measured runtimes
    Distorts normalization of the speedup data
Details are discussed in the unabridged ACM Queue article
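The normalization point can be illustrated with plain arithmetic. The Python sketch below assumes speedup is computed as T_1 / T_p and that only the single-node baseline runtime is stretched by 10%. Both are assumptions for illustration, and the runtimes are hypothetical; this is not a model of the Hadoop retry mechanism itself.

```python
def speedups(runtimes):
    """Speedup profile S_p = T_1 / T_p from a {p: runtime} mapping."""
    t1 = runtimes[1]
    return {p: t1 / tp for p, tp in runtimes.items()}


# Hypothetical runtimes (seconds) that scale perfectly linearly.
true_runtimes = {p: 1000.0 / p for p in (1, 2, 4, 8)}

# Same data, but with the p = 1 baseline stretched by 10% (assumed).
stretched = dict(true_runtimes)
stretched[1] = true_runtimes[1] * 1.10

clean = speedups(true_runtimes)
inflated = speedups(stretched)

for p in sorted(clean):
    print(p, round(clean[p], 2), round(inflated[p], 2))
```

Under these assumptions a perfectly linear system appears superlinear at every p > 1, purely because the baseline used for normalization was too slow.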

for S(p) to be a concave function
Convex efficiencies S(p)/p > 100% do (appear to) exist
Data → σ < 0 in USL model is a superlinear detector
Super efficiency is not free
    Like perpetual motion, it's an illusion
    You will pay the piper eventually
    Debugging latent capacity credit can be very tricky

Theorem (Gunther 2012)
USL Payback Trap: Superlinearity is always followed by severe loss of speedup in the payback region
Verified by Kris Tomasette on April 15, 2014
[Figure: Speedup vs. Processors, showing a Superlinear region followed by a Payback region]
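One way to see where the payback region begins is to ask where the USL curve crosses the linear bound S(p) = p. Setting the denominator of the USL to 1 gives (p − 1)(σ + κp) = 0, so for σ < 0 < κ the crossing is at p = −σ/κ. That derivation and the Python sketch below are illustrative additions, not taken from the slides. σ is one of the fitted values quoted earlier in the talk, while κ is a hypothetical value, since the talk does not state one.

```python
def usl(p, sigma, kappa):
    """USL speedup S(p) = p / (1 + sigma*(p - 1) + kappa*p*(p - 1))."""
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))


def crossover(sigma, kappa):
    """Load where S(p) = p again; requires sigma < 0 < kappa."""
    return -sigma / kappa


sigma = -0.0288    # fitted value quoted in the talk
kappa = 0.0005     # hypothetical: the talk does not give kappa

p_x = crossover(sigma, kappa)
print("superlinear for 1 < p <", p_x)
for p in (10, 40, 100):
    s = usl(p, sigma, kappa)
    print(p, round(s, 2), "above linear" if s > p else "below linear")
```

Below the crossover the model sits above the linear bound, and beyond it the speedup falls below linear, which matches the qualitative shape of the payback region.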

[Figure: Speedup vs. Processors, with Superlinear, Linear, and Sublinear curves]
But it's really this
[Figure: Speedup vs. Processors, with a Superlinear region followed by a Payback region]
USL explains TeraSort superlinearity on Hadoop
Superlinear effects do appear in other guises (see the cited links)
More and more apps becoming massively distributed
Look for negative USL σ in your performance data