Slide 1

How to Scale in the Cloud
Chargeback is Back, Baby!

Neil J. Gunther @DrQz
Performance Dynamics

Rocky Mountain CMG, Denver, Colorado
December 5, 2019

© 2019 Performance Dynamics | How to Scale in the Cloud | December 6, 2019 | 1 / 38

Slide 2

Everything old is new again

Abstract: The need for system administrators—especially Linux sys admins—to do performance management has returned with a vengeance. Why? The cloud. Resource consumption in the cloud is all about run now, pay later¹ (AKA chargeback² in mainframe-ese). This talk will show you how performance models can help to find the most cost-effective deployment of your applications on Amazon Web Services (AWS). The same technique should be transferable to other cloud services.

1. Chargeback disappeared with the arrival of the PC revolution and the advent of distributed client-server architectures.
2. Chargeback underpins the cloud business model, especially when it comes to the development of hot applications, e.g., "Microsoft wants every developer to be an AI developer, which would help its already booming Azure Cloud business do better still: AI demands data, which requires cloud processing power and generates bills." —The Register, May 2018

Slide 3

Previous work
1. Joint work with Mohit Chawla, Senior Systems Engineer, Hamburg, Germany
   - First foray into modeling cloud applications with PDQ
   - Period from June 2016 to April 2018
   - First validated cloud queueing model (AFAIK)
2. Presented jointly at CMG cloudXchange, July 2018
3. Published in Linux Magazin, February 2019 (in German); English manuscript available on arXiv.org

Slide 4

AWS cloud environment

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 5

AWS cloud environment: AWS Cloud Application Platform

- Entire application runs in the Amazon cloud
- Elastic Load Balancer (ELB)
- AWS Elastic Compute Cloud (EC2) instance type m4.10xlarge: 20 CPUs = 40 vCPUs
- Auto Scaling group (A/S)
- Mobile users make requests to Apache HTTP server³ via ELB on EC2
- Tomcat thread server⁴ on EC2 calls external services (i.e., 3rd-party web servers)
- A/S controls the number of active EC2 instances based on incoming traffic and configured policies
- ELB balances incoming traffic across all active EC2 nodes in the AWS cluster

3. Versions 2.2 and 2.4
4. Versions 7 and 8

Slide 6

AWS cloud environment: Request Processing Workflow

On a single EC2 instance:
1. Incoming HTTP Request from a mobile user is processed by Apache + Tomcat
2. Tomcat then sends multiple requests to External Services based on the original request
3. External services respond and Tomcat computes business logic based on all those Responses
4. Tomcat sends the final Response back to the originating mobile user

Slide 7

Production data acquisition

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 8

Production data acquisition: Data Tools and Scripts

- JMX (Java Management Extensions) data from the JVM: jmxterm, VisualVM, Java Mission Control, Datadog dd-agent
- Datadog — also integrates with AWS CloudWatch metrics
- Collectd — Linux performance data collection
- Graphite and statsd — application metrics collection & storage
- Grafana — time-series data plotting
- Custom data collection scripts by M. Chawla
- R statistical libs and RStudio IDE
- PDQ performance modeling tool by NJG

Slide 9

Production data acquisition: Distilled EC2 Instance Data

These few perf metrics are sufficient to parameterize our PDQ model:

Timestamp, Xdat, Nest, Sest, Rdat, Udat
1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
1486772700000, 528.587722, 201.187500, 0.000914, 0.366283, 0.483160
1486773000000, 533.439054, 202.600006, 0.000892, 0.378207, 0.476080
1486773300000, 531.708059, 208.187500, 0.000909, 0.392556, 0.483160
1486773600000, 532.693783, 203.266663, 0.000894, 0.379749, 0.476020
1486773900000, 519.748550, 200.937500, 0.000895, 0.381078, 0.465260
...

- Interval between Unix Timestamp rows is 300 seconds
- Little's law (LL) gives relationships between the above metrics:
  1. Nest = Xdat × Rdat: macroscopic LL ⟹ thread concurrency
  2. Udat = Xdat × Sest: microscopic LL ⟹ resource service times
- LL provides a consistency check of the data
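The Little's law consistency check is one line of arithmetic per row; a minimal Python sketch over the first data row above:

```python
# Little's law consistency check on one distilled EC2 data row.
# Column meanings follow the table: Xdat (req/s), Nest (threads),
# Sest (s), Rdat (s), Udat (fractional CPU busy).
row = dict(Xdat=502.171674, Nest=170.266663, Sest=0.000912,
           Rdat=0.336740, Udat=0.458120)

# Macroscopic Little's law: N = X * R  (thread concurrency)
n_ll = row["Xdat"] * row["Rdat"]

# Microscopic Little's law: U = X * S  (resource service time)
u_ll = row["Xdat"] * row["Sest"]

print(f"N = X*R = {n_ll:.2f}   vs measured Nest = {row['Nest']:.2f}")
print(f"U = X*S = {u_ll:.4f} vs measured Udat = {row['Udat']:.4f}")
```

Both derived values land within about 1% of the measured columns, which is the sanity check the slide describes.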

Slide 10

Initial scaling model

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 11

Initial scaling model: Usual Time Series (Monitoring) View

- Our brains are not built for this
- Want the best impedance match for cognitive processing⁵

5. "Seeing It All at Once with Barry," Gunther and Jauvin, CMG 2007 (PDF)

Slide 12

Initial scaling model: Time-Independent (Steady State) View

[Figures: canonical closed-queue throughput X(N) and canonical closed-queue latency R(N)]

Queueing theory with finite requests tells us what to expect:
- Relationships between metrics, e.g., X and N
- Number of requests seen in our daily data: N ≈ 500
- Throughput X approaches a saturation ceiling as N → 500 (concave)
- Response time R grows linearly as a "hockey stick handle" (convex)
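The canonical concave X(N) and hockey-stick R(N) shapes can be generated with exact mean-value analysis (MVA) for a closed queue plus a delay stage. The S and Z values below are illustrative only; a nonzero think time Z is used purely so the ramp-up region is visible (the production model has Z = 0):

```python
# Exact MVA for a closed network: one queueing station (service time S)
# plus a delay (think) stage Z. Shows the concave throughput ceiling
# and the linear response-time "hockey stick" past the knee.
def mva(S, Z, N):
    X, R = 0.0, 0.0
    for n in range(1, N + 1):
        R = S * (1.0 + X * R)   # arrival theorem: X*R = mean queue at N-1
        X = n / (R + Z)         # Little's law over the whole circuit
    return X, R

S, Z = 0.001, 0.1               # 1 ms service, 100 ms think (illustrative)
for n in (1, 50, 101, 500):
    X, R = mva(S, Z, n)
    print(f"N={n:4d}  X={X:7.1f} req/s  R={R*1000:8.2f} ms")
```

The knee sits at N* = (S + Z)/S = 101 here: below it X grows almost linearly, above it X flattens at 1/S = 1000 req/s while R grows linearly with N.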

Slide 13

Initial scaling model: Production X-N Data, July 2016

[Plot: Production Data July 2016 — throughput (req/s) vs. concurrent users]

Slide 14

Initial scaling model: PDQ Model of Throughput X(N)

[Plot: PDQ Model of Production Data July 2016 — throughput (req/s) vs. concurrent users; Data vs. PDQ; annotations: Nopt = 174.5367, thrds = 250.00]

Slide 15

Initial scaling model: PDQ Model of Response Time R(N)

[Plot: PDQ Model of Production Data July 2016 — response time (s) vs. concurrent users; Data vs. PDQ; annotations: Nopt = 174.5367, thrds = 250.00]

Slide 16

Initial scaling model: PDQ (Closed) Queueing Model

[Diagram: closed queue with population N, think time Z, service time S, response time R(N), throughput X(N), arrival rate λ(N)]

- Finite N mobile user-requests; think time Z = 0 for mobile
- Only 1 service time measured: Tomcat on CPU, S = 0.8 ms
- Only 1 queue is definable: it represents the Tomcat server on an EC2 instance
- λ(N): mean request rate; R(N): Tomcat response time; X(N): Tomcat throughput
- X(N) = λ(N) in steady state

Slide 17

Initial scaling model: PDQ (Closed) Queueing Model (cont.)

(Same model and bullet points as the previous slide.)

Erm ... except there's a small problem ...

Slide 18

Initial scaling model: Dummy Queues

A single request (N = 1) takes about 1 ms. But the first data point in the plot occurs at Nest = 133.8338 (see Slide 15):

Nest      Xdat      Sest     Rdat     Udat
133.8338  416.4605  0.00088  0.32136  0.36642

for which R = 321.36 ms. The precise service time comes from LL: Sest = Udat / Xdat

> 0.36642 / 416.4605
[1] 0.0008798433

Sest = 0.0008798433 s = 0.8798433 ms ≈ 1 ms. Using the linear hockey-handle characteristic (Z = 0):

> 321.36 - (0.8798433 * 133.8338)
[1] 203.6072

the model underestimates the measured 321.36 ms by 203.6 ms. Compensate with ≈ 200 dummy queues, each with S ≈ 1 ms.
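The slide's R-console arithmetic can be reproduced directly; a short Python sketch using the measured values quoted above:

```python
# Reproduce the dummy-queue sizing arithmetic from the first
# measured data point (values from the table above).
Nest, Xdat, Rdat, Udat = 133.8338, 416.4605, 0.32136, 0.36642

# Precise service time from the microscopic Little's law U = X*S
Sest = Udat / Xdat                    # ~0.00088 s

# Linear hockey-handle prediction with Z = 0: R(N) ~ N * S,
# so the residual is the latency the single-queue model cannot explain.
residual = Rdat - Sest * Nest

print(f"Sest     = {Sest*1000:.4f} ms")
print(f"residual = {residual*1000:.1f} ms")  # ~200 ms -> ~200 dummy queues of ~1 ms
```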

Slide 19

Initial scaling model: 3-Tier Model (CMG 2001)

Similar model in Chap. 12 of my Perl::PDQ book.

[Diagram: N clients (Z = 0 ms) sending requests through Web Server (Dws), App Server (Das), and DBMS Server (Ddb), plus dummy servers; responses return to the clients]

- Based on CMG 2001 Best Paper by Buch & Pentkovski (Intel Corp.)
- Tricky: dummy queues cannot have S > Sbottleneck
- (The meaning of these dummy latencies was never resolved)

Slide 20

Initial scaling model: Outstanding Issues

PDQ Tomcat model fits the data visually, but ...
- Need ~200 dummy queues to get the correct Rmin. What do they represent in the actual Tomcat server?
- Service time Sest ≈ 0.001 s = 1 ms (from the table on Slide 9); CPU time derived from Udat = ρCPU in /proc
- Oh yeah, and what about those external service times?

Hypotheses:
(a) Successive polling (visits) to external services? (MC)
(b) Some kind of hidden parallelism? (NJG) ... see Slide 23

Slide 21

Initial scaling model: Outstanding Issues (cont.)

(Same issues as the previous slide.)

All this remained unresolved until ...

Slide 22

Initial scaling model: New Data Breaks PDQ Model

Guerrilla mantra 1.16: Data comes from the Devil, only models come from God.

(except when they don't)

Slide 23

Corrected scaling model

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 24

Corrected scaling model: Surprisingly ... Less (Data) Is Better

Too much initial data clouded⁶ the actual scaling behavior.

6. See what I did there?

Slide 25

Corrected scaling model: Hypothesis (b) ... Backwards⁷

Data vs. corrected PDQ model:

Date          Pricing  Nknee†  NA/S  Rmin (ms)  Rmin PDQ‡ (ms)  Xmax (req/s)  Xmax PDQ‡ (req/s)  Savings
October 2016  Sched    300     300   444.41     411.62 ± 7.36   675.05        651.54 ± 3.66      Approx. 10%
March 2018    Spot     254     254   203.60     199.36 ± 1.48   1247.54       1192.03 ± 8.54     Approx. 90%

† Nknee is an input parameter to the PDQ model
‡ Corrected PDQ model

Parallel Is Just Fast Serial

From the standpoint of queueing theory, parallel processing can be regarded as a form of fast serial processing. The left side of the diagram shows a pair of parallel queues, where requests arriving from outside at rate λ are split equally, so each queue sees a reduced arrival rate λ/2. Assume λ = 0.5 requests/second and S = 1 second. When a request joins the tail of one of the parallel waiting lines, its expected time to get through that queue (waiting + service) is given by equation (1) in Berechenbare Performance [9], namely:

    Tpara = S / (1 − (λ/2) S) = 1.33 seconds    (1)

The right side of the diagram shows two queues in tandem, each twice as fast (S/2) as a parallel queue. Since the arrival flow is not split, the expected time to get through both queues is the sum of the times spent in each queue:

    Tserial = (S/2) / (1 − λ(S/2)) + (S/2) / (1 − λ(S/2)) = S / (1 − λ(S/2)) = 1.33 seconds    (2)

Tserial in equation (2) is identical to Tpara in equation (1). Conversely, multi-stage serial processing can be transformed into an equivalent form of parallel processing [6, 8]. This insight helped identify the "hidden parallelism" in the July and October 2016 performance data that led to the correction of the initial PDQ Tomcat model.

7. Inspired by a CMG 1993 paper, I developed an algorithm to solve parallel queues in the PDQ analyzer circa 1994, based on my observation above, and used it in my 1998 book The Practical Performance Analyst.
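Equations (1) and (2) can be verified numerically with the example values λ = 0.5 req/s and S = 1 s; a quick Python check:

```python
# Two parallel M/M/1 queues (each fed lam/2) vs. two tandem M/M/1
# queues that are each twice as fast (S/2). Values from the slide.
lam, S = 0.5, 1.0

T_para = S / (1.0 - (lam / 2.0) * S)                  # equation (1)
T_serial = 2.0 * (S / 2.0) / (1.0 - lam * (S / 2.0))  # equation (2)

print(f"T_para   = {T_para:.4f} s")    # 1.3333 s
print(f"T_serial = {T_serial:.4f} s")  # 1.3333 s
```

Both come out to 4/3 s, confirming that the split-flow parallel pair and the unsplit fast-serial tandem are latency-equivalent.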

Slide 26

Corrected scaling model: Parallel PDQ Model of Throughput X(N)

[Plot: PDQ Model of Oct 2016 Data — throughput (req/s) vs. concurrent users; corrected PDQ model (blue dots)]

Slide 27

Corrected scaling model: Parallel PDQ Model of Response Time R(N)

[Plot: PDQ Model of Oct 2016 Data — response time (s) vs. concurrent users; Data vs. PDQ; corrected PDQ model (blue dots)]

Slide 28

Corrected scaling model: Parallel PDQ Model

[Diagram: closed network with population N, think time Z, a single waiting line W feeding m parallel servers, each with service time S; throughput X(N), response time R(N), arrival rate λ(N)]

Key differences:
- Rmin dominated by time inside external services
- True service time is Rmin: S = 444.4 ms (not CPU)
- Tomcat threads are now parallel service facilities
- Single waiting line (W) produces the hockey handle
- Like every Fry's customer waiting in a single line for their own cashier
- But where is W located in the EC2 system? (still unresolved)
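The corrected model is a closed PDQ network, but its structure (one waiting line feeding m parallel servers) behaves like the textbook M/M/m queue. A rough open-queue sketch using the Erlang-C formula; the arrival rate is an assumed illustration, with S taken as the 444.4 ms external-service time and m = 300 threads:

```python
def erlang_c(a, m):
    """P(wait) for an M/M/m queue with offered load a = lam*S Erlangs.
    Uses the iterative Erlang-B recursion to avoid factorial overflow."""
    b = 1.0
    for k in range(1, m + 1):
        b = a * b / (k + a * b)
    rho = a / m
    return b / (1.0 - rho + rho * b)

def mmm_response(lam, S, m):
    """Mean response time (wait + service) of an M/M/m queue."""
    a = lam * S
    rho = a / m
    assert rho < 1.0, "queue is unstable"
    Wq = erlang_c(a, m) * S / (m * (1.0 - rho))  # wait in the single line
    return S + Wq

# Illustrative numbers only: lam is made up, S and m echo the slide.
print(mmm_response(lam=600.0, S=0.4444, m=300))
```

With m large, the waiting-line term stays tiny until utilization approaches 1, which is why Rmin is dominated by the service (external-call) time, exactly the "hockey handle" behavior described above.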

Slide 29

Corrected scaling model: PDQ Numerical Validation 2016

[Plots: PDQ Model of Oct 2016 Data — throughput (req/s) and response time (s) vs. concurrent users; Data vs. PDQ]

Slide 30

Corrected scaling model: Auto-Scaling Knee and Pseudo-Saturation

[Plot: PDQ Model of Oct 2016 Data — throughput (req/s) vs. concurrent users, with a vertical line at the knee]

- A/S policy triggered when instance CPU busy > 75%
- Induces pseudo-saturation at Nknee = 300 threads (vertical line)
- No additional Tomcat threads invoked above Nknee in this instance
- A/S spins up additional new EC2 instances (elastic capacity)

Slide 31

AWS auto-scaling costs

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 32

AWS auto-scaling costs: AWS Scheduled Scaling

- A/S policy threshold: CPU > 75%
- Additional EC2 instances require up to 10 minutes to spin up
- Based on the PDQ model, considered pre-emptive (clock-based) scheduling of EC2s
- Cheaper than A/S, but only 10% savings
- Use N service threads to size the number of EC2 instances required for incoming traffic
- Removes expected spikes in latency and traffic (seen in time series analysis)
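Sizing by concurrency N, as the bullet above suggests, can be sketched as follows. N_KNEE echoes the PDQ model's knee; the forecast values and the 10% headroom are hypothetical choices for illustration:

```python
from math import ceil

N_KNEE = 300   # threads per EC2 instance before pseudo-saturation

def instances_needed(n_forecast, headroom=0.10):
    """EC2 instances required for a forecast concurrency level,
    with a safety headroom above the auto-scaling knee."""
    return ceil(n_forecast * (1.0 + headroom) / N_KNEE)

# Hypothetical scheduled-scaling plan for three forecast levels:
for n in (150, 450, 900):
    print(f"N = {n:4d} concurrent requests -> {instances_needed(n)} instance(s)")
```

Driving the schedule from forecast N rather than reactive CPU avoids waiting out the ~10-minute spin-up lag after a traffic spike has already arrived.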

Slide 33

AWS auto-scaling costs: AWS Spot Pricing

- Spot instances offer up to a 90% discount on on-demand pricing
- Challenging to diversify instance types and sizes across the same group, e.g.:
  - Default instance type is m4.10xlarge
  - Spot market only has the smaller m4.2xlarge type
  - This forces manual reconfiguration of the application
- CPU (ρ), latency (R), traffic (λ) are no longer useful metrics for the A/S policy
- Instead, use concurrency N as the primary metric in the A/S policy

Slide 34

AWS auto-scaling costs: PDQ Numerical Validation 2018 (cf. Slide 27)

[Plots: PDQ Model of Prod Data Mar 2018 — throughput (req/sec) and response time (s) vs. concurrent users; annotations: Rmin = 0.2236, Xknee = 1137.65, Nknee = 254.35]

Slide 35

AWS auto-scaling costs: Performance Improvements 2016–2018

[Plots: 2016 and 2018 daily user profiles — user requests (N) vs. UTC time (hours)]

- Typical numero uno traffic profile discussed in my GCAP performance class
- Increasing cost-effective performance:

  Date      Rmin (ms)  Xmax (RPS)  NA/S
  Jul 2016  394.1      761.23      350
  Oct 2016  444.4      675.07      300
  ...       ...        ...         ...
  Mar 2018  223.6      1135.96     254

- Less variation in X and R due to improved traffic design

Slide 36

Cloudy economics

Outline
1. AWS cloud environment
2. Production data acquisition
3. Initial scaling model
4. Corrected scaling model
5. AWS auto-scaling costs
6. Cloudy economics

Slide 37

Cloudy economics: EC2 Instance Pricing

[Diagram: instance capacity lines⁸ over time — reserved instances (lower-risk capex), on-demand instances, spot instances (higher-risk capex), a max capacity line, and the "missed revenue?" gap]

This is how AWS sees their own infrastructure capacity.

8. J.D. Mills, "Amazon Lambda and the Transition to Cloud 2.0", SF Bay ACM meetup, May 16, 2018

Slide 38

Cloudy economics: Name of the Game is Chargeback

Google Compute Engine also offers reserved and spot pricing.

Table 1: Google VM per-hour pricing⁹

Machine      vCPUs  RAM (GB)  Price ($)  Preempt ($)
n1-umem-40   40     938       6.3039     1.3311
n1-umem-80   80     1922      12.6078    2.6622
n1-umem-96   96     1433      10.6740    2.2600
n1-umem-160  160    3844      25.2156    5.3244

Capacity planning has not gone away because of the cloud:
- "Capacity Planning For The Cloud: A New Way Of Thinking Needed" (DZone, April 25, 2018)
- "Cloud Capacity Management" (DZone, July 10, 2018)

9. TechCrunch, May 2018
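The chargeback arithmetic behind Table 1 is easy to check in a short script. The row values are copied from the table; the per-vCPU rates and preemptible discounts are derived here, not quoted prices:

```python
# Per-hour pricing rows from Table 1:
# (machine, vCPUs, on-demand $, preemptible $)
rows = [
    ("n1-umem-40",   40,  6.3039, 1.3311),
    ("n1-umem-80",   80, 12.6078, 2.6622),
    ("n1-umem-96",   96, 10.6740, 2.2600),
    ("n1-umem-160", 160, 25.2156, 5.3244),
]

for name, vcpus, od, pre in rows:
    per_vcpu = od / vcpus          # on-demand cost per vCPU-hour
    discount = 1.0 - pre / od      # preemptible discount vs. on-demand
    print(f"{name:12s} ${per_vcpu:.4f}/vCPU-hr  preemptible discount {discount:.0%}")
```

The derived discount is roughly 79% across the listed machine types, which is the kind of spot/preemptible economics the talk exploits on AWS.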

Slide 39

Cloudy economics: Summary

- Cloud services are more about economic benefit for the hosting company than they are about technological innovation for the consumer
- Old-fashioned chargeback is back! Incumbent on you, the customer, to minimize your own cloud services costs
- Evolving services: containers, microservices, "serverless" (e.g., AWS Lambda)
- Performance and capacity management have not gone away
- The PDQ Tomcat model is a relatively simple yet insightful example of a cloud sizing tool
- EC2 instance scalability was not a significant issue for this application
- You can pay LESS for MORE cloud performance!

Slide 40

Cloudy economics: Questions?

www.perfdynamics.com
Castro Valley, California

- Training — note the PDQ Workshop
- Blog, Twitter, Facebook
- [email protected] — outstanding questions
- +1-510-537-5758