Slide 1

Exposing the Cost of Performance Hidden in the Cloud

Neil Gunther (@DrQz), Performance Dynamics Consulting, Castro Valley, California
Mohit Chawla (@a1cy), Independent Systems Engineer, Hamburg, Germany

CMG Event: CLOUDXCHANGE Online, 10am Pacific (5pm UTC), June 19, 2018

© 2018 Performance Dynamics Co.

Slide 2

Exposing the Cost of Performance Hidden in the Cloud
Neil Gunther and Mohit Chawla

Abstract: Whilst offering lift-and-shift migration and versatile elastic capacity, the cloud also reintroduces an old mainframe concept — chargeback [1] — which thereby rejuvenates the need for traditional performance and capacity management in the new cloud context. Combining production JMX data with an appropriate performance model, we show how to assess fee-based Amazon AWS configurations for a mobile-user application running on a Linux-hosted Tomcat cluster. The performance model also facilitates ongoing cost-benefit analysis of various EC2 Auto Scaling policies.

[1] Chargeback underpins the cloud business model, especially for hot application development, e.g., "Microsoft wants every developer to be an AI developer, which would help its already booming Azure Cloud business do better still: AI demands data, which requires cloud processing power and generates bills." —The Register, May 2018

Slide 3

Outline (current section: AWS cloud environment)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 4

AWS cloud environment: Application Cloud Platform

- The entire application runs in the Amazon cloud: mobile Internet users → ELB load balancer → Auto Scaling (A/S) group → AWS EC2 cluster
- Mobile users make requests to the Apache HTTP server [2] via ELB on EC2
- The Tomcat thread-server [3] on EC2 calls external services (belonging to 3rd parties)
- Auto Scaling controls the number of EC2 instances based on incoming traffic and the configured A/S policies
- ELB balances incoming traffic across all EC2 nodes in the AWS cluster

[2] Versions 2.2 and 2.4
[3] Versions 7 and 8

Slide 5

AWS cloud environment: Request Processing

On a single EC2 instance:

1. An incoming HTTP request from a mobile user is processed by Apache + Tomcat
2. Tomcat then sends multiple requests to external services, based on the original request
3. The external services respond and Tomcat computes business logic based on all those responses
4. Tomcat sends the final response back to the originating mobile user

Slide 6

AWS cloud environment: Performance Tools and Scripts

- JMX (Java Management Extensions) data from the JVM: jmxterm, VisualVM, Java Mission Control, Datadog dd-agent
- Datadog: also integrates with AWS CloudWatch metrics
- Collectd: Linux performance statistics collection
- Graphite and statsd: application metrics collection and storage
- Grafana: time-series data plotting
- Custom data collection scripts
- R statistical libs and RStudio IDE
- PDQ performance modeling lib

Slide 7

Outline (current section: Performance data validation)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 8

Performance data validation: Production Data Collection

1. Raw performance metrics:
   - Performance data is primarily collected by Datadog (dd-agent)
   - Mobile-user requests are analyzed as a homogeneous workload
   - JMX provides the GlobalRequestProcessor MBean:
     - requestCount: total number of requests
     - processingTime: total processing time for all requests

2. Derived performance metrics:
   - Convert requestCount to a rate in the Datadog config to get the average throughput Xdat in requests/second
   - The average request processing time (seconds) is then derived as

         Rdat = Δ(processingTime) / Δ(requestCount)

     where Δ denotes the change during the same measurement interval, e.g., T = 300 seconds
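To make this concrete, here is a minimal R sketch (our illustration, not from the deck) that turns the cumulative JMX counters into Xdat and Rdat; the counter values are hypothetical and processingTime is assumed to be reported in milliseconds:

    # Derive Xdat and Rdat from cumulative GlobalRequestProcessor counters,
    # sampled every 300 seconds. Values are hypothetical.
    Tsamp <- 300                                    # sampling interval (s)
    requestCount   <- c(1500000, 1650651, 1798972)  # cumulative requests
    processingTime <- c(5.000e8, 5.507e8, 6.035e8)  # cumulative ms (assumed unit)
    dReq <- diff(requestCount)           # requests completed per interval
    dSec <- diff(processingTime) / 1e3   # processing seconds per interval
    Xdat <- dReq / Tsamp                 # throughput: 502.2, 494.4 req/s
    Rdat <- dSec / dReq                  # processing time: ~0.34, ~0.36 s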

Slide 9

Performance data validation: Concurrency and Service Times

Apply Little's law to derive additional performance metrics, concurrency (N) and service time (S), from the data:

1. Little's law, macroscopic version: N = X ∗ R (gives concurrency)
   - Nest is the calculated, or estimated, number of concurrent requests in Tomcat during each measurement interval
   - Verify correctness by comparing Nest with the measured number of threads Ndat in the service stage of Tomcat
   - We find Nest ≡ Ndat

2. Little's law, microscopic version: U = X ∗ S (gives service time)
   - Udat is the measured processor utilization reported by dd-agent (as a decimal fraction, not %)
   - We already have the throughput X (req/s) from the collected JMX data
   - The estimated service time metric is then S = U / X
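Both identities in a minimal R sketch (our illustration, using values in the spirit of the first row of the reduced data on the next slide):

    Xdat <- 502.17        # throughput (req/s), from JMX
    Rdat <- 0.3367        # processing time per request (s), from JMX
    Udat <- 0.4581        # CPU utilization (fraction), from dd-agent
    Nest <- Xdat * Rdat   # macroscopic LL: ~169 concurrent requests
    Sest <- Udat / Xdat   # microscopic LL: ~0.00091 s, i.e., ~0.9 ms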

Slide 10

Performance data validation: Reduced EC2 Instance Data

These few metrics will be used to parameterize our capacity model:

Timestamp, Xdat, Nest, Sest, Rdat, Udat
1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
1486772700000, 528.587722, 201.187500, 0.000914, 0.366283, 0.483160
1486773000000, 533.439054, 202.600006, 0.000892, 0.378207, 0.476080
1486773300000, 531.708059, 208.187500, 0.000909, 0.392556, 0.483160
1486773600000, 532.693783, 203.266663, 0.000894, 0.379749, 0.476020
1486773900000, 519.748550, 200.937500, 0.000895, 0.381078, 0.465260
...

- The Unix timestamp interval between rows is 300 seconds
- Little's law gives the relationships between the above metrics:
  1. Nest = Xdat ∗ Rdat (macroscopic LL)
  2. Udat = Xdat ∗ Sest (microscopic LL)
  3. Time-averaged over T = 300 s sampling intervals
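As a sanity check (ours, not in the deck), both Little's law identities can be verified directly on the rows above, e.g.:

    # Verify the LL identities on the first two rows of the reduced data.
    df <- data.frame(
      Xdat = c(502.171674, 494.403035),
      Nest = c(170.266663, 175.375000),
      Sest = c(0.000912, 0.001043),
      Rdat = c(0.336740, 0.355975),
      Udat = c(0.458120, 0.515420)
    )
    all.equal(df$Nest, df$Xdat * df$Rdat, tolerance = 0.01)  # TRUE: macroscopic LL
    all.equal(df$Udat, df$Xdat * df$Sest, tolerance = 0.01)  # TRUE: microscopic LL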

Slide 11

Outline (current section: Initial capacity model)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 12

Initial capacity model: Time Series View (Monitoring)

[Chart: time-series plots of the monitored production metrics; no further text on this slide]

Slide 13

Initial capacity model: Time-Independent View

[Charts: thread-limited throughput (X vs. N, concave) and thread-limited latency (R vs. N, convex)]

Queueing theory tells us what to expect:

- The relationship between metrics, e.g., X and N
- The number of requests is thread-limited, typically to N ≤ 500
- Throughput X approaches a saturation ceiling as N → 500 (concave)
- Response time R grows linearly, aka the "hockey stick handle" (convex)
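For reference, these expected shapes are just the standard operational bounds of queueing theory (our addition; Rmin is the zero-contention response time and Xmax the saturation throughput):

    \[
      X(N) \le \min\!\left(\frac{N}{R_{\min}},\; X_{\max}\right),
      \qquad
      R(N) \ge \max\!\left(R_{\min},\; \frac{N}{X_{\max}}\right)
    \]

The linear branch of the second bound is the hockey stick handle seen in the R vs. N plots.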

Slide 14

Initial capacity model: Production X vs. N Data – July 2016

[Scatter plot "Production Data July 2016": throughput (req/s, 0–1000) vs. concurrent users (0–500)]

Slide 15

Initial capacity model: Interpreting X vs. N Data

[Plot "PDQ Model of Production Data July 2016": throughput (req/s, 0–1000) vs. concurrent users (0–500); Data and PDQ curves, annotated with Nopt = 174.5367 and thrds = 250.00]

Slide 16

Initial capacity model: Interpreting R vs. N Data

[Plot "PDQ Model of Production Data July 2016": response time (s, 0.0–0.8) vs. concurrent users (0–400); Data and PDQ curves, annotated with Nopt = 174.5367 and thrds = 250.00]

Slide 17

Initial capacity model: Outstanding Questions

The July PDQ model looks good visually, but...

- It requires ~350 "dummy" queues internally to get the correct Rmin
- The service time is assumed to be CPU time, ~1 ms (see later)
- What do the dummy queues represent in the Tomcat server?
  - Successive polling of the external services?
  - Some kind of hidden parallelism?
- The October 2016 data breaks the July PDQ model. Why?

Slide 18

Outline (current section: Improved capacity model)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 19

Improved capacity model: Production X vs. N Data – October 2016

[Scatter plot "Production data Oct 2016": throughput (req/s, 0–1000) vs. concurrent users (0–500)]

Too much data "clouded" the July 2016 analysis.

Slide 20

Improved capacity model: Interpreting X vs. N Data

[Plot "PDQ Model of Oct 2016 Data": throughput (req/s, 0–1000) vs. concurrent users (0–500)]

Slide 21

Improved capacity model: Interpreting R vs. N Data

[Plot "PDQ Model of Oct 2016 Data": response time (s, 0.0–0.8) vs. concurrent users (0–400); Data and PDQ curves]

Slide 22

Improved capacity model: Adjusted PDQ Model

    library(pdq)

    usrmax <- 500
    nknee  <- 350
    smean  <- 0.4444      # Rmin seconds
    srate  <- 1 / smean
    arate  <- 2.1         # per user
    users  <- seq(100, usrmax, 50)
    tp     <- NULL
    rt     <- NULL
    pdqr   <- TRUE        # PDQ Report

    for (i in 1:length(users)) {
      if (users[i] <= nknee) {
        Arate <- users[i] * arate   # total arrivals
        pdq::Init("Tomcat Submodel")
        pdq::CreateOpen("requests", Arate)
        pdq::CreateMultiNode(users[i], "TCthreads")
        pdq::SetDemand("TCthreads", "requests", smean)
        pdq::SetWUnit("Reqs")
        pdq::Solve(CANON)
        tp[i] <- pdq::GetThruput(TRANS, "requests")
        rt[i] <- pdq::GetResponse(TRANS, "requests")
        # ... (remainder of loop elided on the original slide)

Key differences from the July model:

- The old service time was based on %CPU busy: S = 0.8 ms
- Rmin is dominated by time spent inside the external services
- The new service time is based on Rmin: S = 444.4 ms
- Tomcat threads are now parallel service centers in the PDQ model, analogous to every supermarket customer getting their own checkout lane
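For readers reproducing the adjusted model, a hedged sketch of how the collected tp and rt vectors might be plotted (the plotting code is not shown on the slide):

    # Plot the PDQ predictions for the points actually solved (users <= nknee).
    n <- length(tp)
    plot(users[1:n], tp, type = "b",
         xlab = "Concurrent users", ylab = "Throughput (req/s)")
    plot(users[1:n], rt, type = "b",
         xlab = "Concurrent users", ylab = "Response time (s)")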

Slide 23

Improved capacity model: Adjusted 2016 PDQ Outputs

[Two plots "PDQ Model of Oct 2016 Data": throughput (req/s, 0–1000) vs. concurrent users (0–500), and response time (s, 0.0–0.8) vs. concurrent users (0–400); Data and PDQ curves]

Slide 24

Improved capacity model: Auto Scaling knee and pseudo-saturation

[Plot "PDQ Model of Oct 2016 Data": throughput (req/s, 0–1000) vs. concurrent users (0–500), with a vertical line at the knee]

- The A/S policy is triggered when instance CPU busy > 75%
- This induces pseudo-saturation at Nknee = 300 threads (vertical line)
- No additional Tomcat threads are invoked above Nknee on this instance
- Instead, A/S spins up additional new EC2 instances (elastic capacity)

Slide 25

Outline (current section: Cost of Auto Scaling variants)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 26

Cost of Auto Scaling variants: AWS Scheduled Scaling

- The A/S policy threshold is CPU > 75%
- Additional EC2 instances require up to 10 minutes to spin up
- Based on the PDQ model, we considered pre-emptive, clock-based scheduling of EC2 instances
  - Cheaper than A/S, but only a 10% saving
- Use the N service threads to size the number of EC2 instances required for the incoming traffic, as sketched below
- Removes the expected spikes in latency and traffic (seen in the time-series analysis)
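A hedged sizing sketch (our illustration; the forecast arrival rate is hypothetical) showing how Little's law and the per-instance thread knee translate traffic into an instance count:

    # Size the EC2 fleet for a forecast arrival rate via N = X * R.
    lambda <- 2000       # hypothetical forecast arrival rate (req/s)
    Rmin   <- 0.4444     # time in flight per request (s), from the PDQ model
    Nknee  <- 300        # usable Tomcat threads per instance
    Nneeded   <- lambda * Rmin             # concurrent requests in flight: ~889
    instances <- ceiling(Nneeded / Nknee)  # instances to pre-schedule: 3 here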

Slide 27

Cost of Auto Scaling variants: AWS Spot Pricing

- Spot instances are available at up to a 90% discount relative to On-Demand pricing
- It is challenging to diversify instance types and sizes across the same group, e.g.:
  - The default instance type is m4.10xlarge
  - The spot market only has the smaller m4.2xlarge type
  - This forces manual reconfiguration of the application
- Thus CPU%, latency, and traffic are no longer useful metrics for the A/S policy
- Instead, use concurrency N as the primary metric in the A/S policy

Slide 28

Outline (current section: Cloudy economics)

1. AWS cloud environment
2. Performance data validation
3. Initial capacity model
4. Improved capacity model
5. Cost of Auto Scaling variants
6. Cloudy economics

Slide 29

Cloudy economics: EC2 Instance Pricing

[Diagram "Instance capacity lines" [4]: stacked capacity bands plotted as instances vs. time; Reserved instances (lower-risk capex), On-Demand instances, and Spot instances (higher-risk capex) under a max-capacity line, with "Missed revenue?" above the line]

This is how AWS sees their own infrastructure capacity.

[4] J.D. Mills, "Amazon Lambda and the Transition to Cloud 2.0", SF Bay ACM meetup, May 16, 2018

Slide 30

Cloudy economics: Updated 2018 PDQ Outputs

[Two plots "PDQ Model of Prod Data Mar 2018": throughput (req/sec, 0–1500) vs. concurrent users (0–600), and response time (s, 0.0–0.5) vs. concurrent users (0–600); both annotated with Rmin = 0.2236, Xknee = 1137.65, Nknee = 254.35]

Slide 31

Cloudy economics: Performance Evolution 2016 – 2018

[Two plots of daily traffic, user requests (N) vs. UTC time (hours): "2016 daily users" (150–450) and "2018 daily users" (0–600)]

- Typical numero uno traffic profile
- Increasing cost-effective performance:

Date       Rmin (ms)   Xmax (RPS)   Nknee
Jul 2016   394.1       761.23       350
Oct 2016   444.4       675.07       300
...        ...         ...          ...
Mar 2018   223.6       1135.96      254
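A consistency check on the table (our observation): for the October 2016 and March 2018 rows, the saturation throughput is just Little's law evaluated at the knee, Xmax = Nknee / Rmin. (The July 2016 row comes from the earlier dummy-queue model and does not satisfy this identity.)

    # Xmax = Nknee / Rmin for the improved-model rows.
    Rmin  <- c(0.4444, 0.2236)   # seconds: Oct 2016, Mar 2018
    Nknee <- c(300, 254)
    round(Nknee / Rmin, 2)       # 675.07 and 1135.96, matching Xmax above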

Slide 32

Cloudy economics: Name of the Game is Chargeback

- Google Compute Engine also offers reserved and spot pricing

Table 1: Google VM per-hour pricing [5]

Machine       vCPUs   RAM (GB)   Price ($)   Preempt ($)
n1-umem-40    40      938        6.3039      1.3311
n1-umem-80    80      1922       12.6078     2.6622
n1-umem-96    96      1433       10.6740     2.2600
n1-umem-160   160     3844       25.2156     5.3244

- Capacity planning has not gone away because of the cloud:
  - "Cloud Capacity Management" (DZone, July 10, 2018)
  - "Capacity Planning For The Cloud: A New Way Of Thinking Needed" (DZone, April 25, 2018)

[5] TechCrunch, May 2018
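A quick calculation on Table 1 (ours): the preemptible price implies a uniform discount of roughly 79% across these machine types:

    # Implied preemptible discount for each machine type in Table 1.
    price   <- c(6.3039, 12.6078, 10.6740, 25.2156)
    preempt <- c(1.3311,  2.6622,  2.2600,  5.3244)
    round(100 * (1 - preempt / price), 1)   # 78.9 78.9 78.8 78.9 (percent)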

Slide 33

Cloudy economics: Microsoft Azure Capacity Planning

[Links shown on the slide: Azure config pricing; Azure sizing spreadsheet]

Slide 34

Cloudy economics: Microsoft Acquires GitHub (cloud) for $7.5 BB [6]

- GitHub Enterprise, on-site or cloud instances on AWS, Azure, Google or IBM Cloud, is $21 per user per month
- From Twitter:
  - "Supporting the open source ecosystem is way more important to MS than anything else—the revenue they make from hosting OSS-based apps on Azure in the future will dwarf their current devtools revenue."
  - "[MS] isn't the same company that [previously] hated on open source, mostly because it's [now] symbiotic to their hosting business. They didn't start supporting open source from altruism!"

[6] NOTE: That's Bs, as in billions, not Ms

Slide 35

Cloudy economics: Goldman Sachs — IT Spending Survey 2018

- Typical headline: "Company ABC migrates to cloud service XYZ in 10 days, reduces costs by 60%"
- Public cloud: AWS, Microsoft and Google were the three big winners, but that will be of no shock to anyone as the triumvirate already hold the lion's share of market sales.
- "While we remain bullish on the overall public cloud opportunity, we see an increasing numbers of companies confronting the realities and challenges of migrating workloads and re-platforming apps."
- "As enterprises come to the conclusion that their IT paradigm will likely be hybrid for longer than anticipated, this dynamic continues to benefit on-premise spending."

[Source: The Register, 10 July 2018]

Slide 36

Cloudy economics: Summary

- Cloud services are more about economic benefit for the hosting company than they are about technological innovation for the consumer [7]
- Old-fashioned mainframe chargeback is back! [8]
- It's incumbent on paying customers to minimize their own cloud services costs
- Meaningful cost-benefit decisions require ongoing performance analysis and capacity planning
- The PDQ model presented here is a simple yet insightful example of cloud sizing and performance tools [9]
- The queueing model framework helps expose where hidden performance costs actually reside
- You only have the cloud capacity that you pay for

[7] Not just plug-and-play. More like pay-and-pay!
[8] Chargeback had disappeared with the advent of non-monolithic client-server architectures
[9] The PDQ Workshop is available at a discount to CMG members. Email [email protected] for details.

Slide 37

Questions?

- www.perfdynamics.com, Castro Valley, California
- Training, including the PDQ Workshop
- Blog, Twitter, Facebook
- [email protected] for any outstanding questions
- +1-510-537-5758