Exposing the Cost of Performance Hidden in the Cloud

Whilst offering lift-and-shift deployment and elastic capacity, the cloud also reintroduces an old mainframe concept—chargeback—which thereby rejuvenates the need for performance and capacity management. Combining production JMX data with an appropriate performance model, we show how to assess fee-based EC2 configurations for a mobile-user application running on Linux-Tomcat hosted by AWS. The capacity model can be used for ongoing cost-benefit analysis of different AWS Auto Scaling policies.

Dr. Neil Gunther

June 19, 2018

Transcript

  1. Exposing the Cost of Performance
    Hidden in the Cloud
    Neil Gunther @DrQz
    Performance Dynamics Consulting, Castro Valley, California
    Mohit Chawla @a1cy
    Independent Systems Engineer, Hamburg, Germany
    Performance Dynamics Co.
    CMG Event CLOUDXCHANGE Online
    10am Pacific (5pm UTC), June 19, 2018

  2. Exposing the Cost of Performance
    Hidden in the Cloud
    Neil Gunther and Mohit Chawla
    Abstract
    Whilst offering lift-and-shift migration and versatile elastic capacity, the cloud also
    reintroduces an old mainframe concept — chargeback1 — which thereby
    rejuvenates the need for traditional performance and capacity management in the
    new cloud context. Combining production JMX data with an appropriate
    performance model, we show how to assess fee-based Amazon AWS configurations
    for a mobile-user application running on a Linux-hosted Tomcat cluster. The
    performance model also facilitates ongoing cost-benefit analysis of various EC2
    Auto Scaling policies.
    1
    Chargeback underpins the cloud business model, especially for hot application development, e.g., “Microsoft wants every
    developer to be an AI developer, which would help its already booming Azure Cloud business do better still: AI demands data,
    which requires cloud processing power and generates bills.” —The Register, May 2018

  3. AWS cloud environment
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  4. AWS cloud environment
    Application Cloud Platform
    Entire application runs in the Amazon cloud
    Mobile Internet users
    ELB load balancer
    Auto Scaling (A/S) group
    AWS EC2 cluster
    Mobile users make requests to Apache
    HTTP-server2 via ELB on EC2
    Tomcat thread-server3 on EC2 calls external
    services (belonging to 3rd parties)
    Auto Scaling controls number of EC2 instances
    based on incoming traffic and configured A/S
    policies
    ELB balances incoming traffic across all EC2
    nodes in AWS cluster
    2
    Versions 2.2 and 2.4
    3
    Versions 7 and 8

  5. AWS cloud environment
    Request Processing
    On a single EC2 instance:
    1 Incoming HTTP Request from mobile user processed by Apache + Tomcat
    2 Tomcat then sends multiple requests to External Services based on original request
    3 External services respond and Tomcat computes business logic based on all those
    Responses
    4 Tomcat sends the final Response back to originating mobile user

  6. AWS cloud environment
    Performance Tools and Scripts
    JMX (Java Management Extensions) data from JVM
    jmxterm
    VisualVM
    Java Mission Control
    Datadog dd-agent
    Datadog — also integrates with AWS CloudWatch metrics
    Collectd — Linux performance statistics collection
    Graphite and statsd — application metrics collection & storage
    Grafana — time-series data plotting
    Custom data collection scripts
    R statistical libs and RStudio IDE
    PDQ performance modeling lib

  7. Performance data validation
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  8. Performance data validation
    Production Data Collection
    1 Raw performance metrics:
    Performance data primarily collected by datadog (dd-agent)
    Mobile-user requests are analyzed as a homogeneous workload
    JMX provides the GlobalRequestProcessor MBean:
    requestCount: total number of requests
    processingTime: total processing time for all requests
    2 Derived performance metrics:
    Convert requestCount to a rate in the datadog config to get average
    throughput Xdat as requests/second
    Average request processing time (seconds) is then derived as
    Rdat = processingTime_T / requestCount_T
    during the same measurement interval, e.g., T = 300 seconds
    (a worked sketch of this reduction follows below)
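    As an illustration of this reduction, a minimal R sketch for a single T = 300 s interval; the counter values are made up (chosen to land near the first data row shown later), and processingTime is assumed to be reported in milliseconds:
    Tsamp <- 300                        # sampling interval T (seconds)

    # Cumulative GlobalRequestProcessor counters at the start and end of the interval
    requestCount.t0   <- 1.200e6
    requestCount.t1   <- 1.200e6 + 150651
    processingTime.t0 <- 4.100e8        # assumed milliseconds
    processingTime.t1 <- 4.100e8 + 50.73e6

    dReq  <- requestCount.t1   - requestCount.t0    # requests completed during T
    dProc <- processingTime.t1 - processingTime.t0  # processing time accrued during T (ms)

    Xdat <- dReq / Tsamp                # average throughput, ~502 req/s
    Rdat <- (dProc / dReq) / 1000       # average processing time, ~0.337 s per request
    c(Xdat = Xdat, Rdat = Rdat)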

  9. Performance data validation
    Concurrency and Service Times
    Apply Little’s law to derive additional performance metrics from the data:
    concurrency (N) and service time (S)
    1 Little’s Law — macroscopic version
    N = X ∗ R (gives concurrency)
    Nest is the calculated or estimated number of concurrent requests in
    Tomcat during each measurement interval
    Verify correctness by comparing Nest with the measured number of
    threads Ndat in the service stage of Tomcat
    We find Nest ≡ Ndat
    2 Little’s Law — microscopic version
    U = X ∗ S (gives service time)
    Udat is the measured processor utilization reported by dd-agent
    (as a decimal fraction, not %)
    Already have throughput X (reqs/sec) from the collected JMX data
    Estimated service time metric is S = U/X (a numeric check follows below)
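    A minimal numeric check in R, using the first row of the reduced data shown on the next slide:
    Xdat <- 502.171674      # throughput (req/s) from JMX
    Rdat <- 0.336740        # request processing time (s) from JMX
    Udat <- 0.458120        # CPU utilization (fraction) from dd-agent

    Nest <- Xdat * Rdat     # macroscopic LL: ~169 concurrent requests,
                            # in line with the ~170 busy Tomcat threads measured
    Sest <- Udat / Xdat     # microscopic LL: ~0.000912 s, i.e. ~0.9 ms CPU per request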

  10. Performance data validation
    Reduced EC2 Instance Data
    These few metrics will be used to parameterize our capacity model
    Timestamp, Xdat, Nest, Sest, Rdat, Udat
    1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
    1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
    1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
    1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
    1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
    1486772700000, 528.587722, 201.187500, 0.000914, 0.366283, 0.483160
    1486773000000, 533.439054, 202.600006, 0.000892, 0.378207, 0.476080
    1486773300000, 531.708059, 208.187500, 0.000909, 0.392556, 0.483160
    1486773600000, 532.693783, 203.266663, 0.000894, 0.379749, 0.476020
    1486773900000, 519.748550, 200.937500, 0.000895, 0.381078, 0.465260
    ...
    Unix Timestamp interval between rows is 300 seconds
    Little’s law gives relationships between the above metrics (checked in the sketch below):
    1 Nest = Xdat ∗ Rdat is the macroscopic LL
    2 Udat = Xdat ∗ Sest is the microscopic LL
    3 Time-averaged over T = 300 sec sampling intervals
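    A minimal R sketch of this check across the whole series; the file name is hypothetical and the columns are assumed to be ordered as listed above:
    ec2 <- read.csv("ec2_reduced_metrics.csv")   # columns: Timestamp, Xdat, Nest, Sest, Rdat, Udat

    macro.err <- (ec2$Xdat * ec2$Rdat - ec2$Nest) / ec2$Nest   # N = X * R
    micro.err <- (ec2$Xdat * ec2$Sest - ec2$Udat) / ec2$Udat   # U = X * S

    summary(abs(macro.err))   # expect relative errors of only a few percent
    summary(abs(micro.err))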

  11. Initial capacity model
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  12. Initial capacity model
    Time Series View (Monitoring)

  13. Initial capacity model
    Time-Independent View
    [Sketch plots: Thread-limited Throughput (X vs. N) and Thread-limited Latency (R vs. N)]
    Queueing theory tells us what to expect (bounds sketched after this list):
    Relationship between metrics, e.g., X and N
    Number of requests is thread-limited to N ≤ 500 typically
    Throughput X approaches a saturation ceiling as N → 500 (concave)
    Response time R grows linearly, aka “hockey stick handle” (convex)
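    These expectations can be summarized by the standard asymptotic bounds for a thread-limited server (a sketch, not from the original deck), writing Rmin for the unloaded response time and Nknee for the thread limit:
    % Standard asymptotic bounds for a thread-limited server
    \[
      X(N) \le \min\!\left(\frac{N}{R_{\min}},\ \frac{N_{\mathrm{knee}}}{R_{\min}}\right),
      \qquad
      R(N) \ge \max\!\left(R_{\min},\ \frac{N\,R_{\min}}{N_{\mathrm{knee}}}\right)
    \]
    % Below the knee X climbs toward its ceiling (concave); above it X is capped
    % and R grows linearly in N (the "hockey stick handle").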

  14. Initial capacity model
    Production X vs. N Data – July 2016
    [Plot: "Production Data July 2016", Throughput (req/s) vs. Concurrent users]

  15. Initial capacity model
    Interpreting X vs. N Data
    [Plot: "PDQ Model of Production Data July 2016", Throughput (req/s) vs. Concurrent users; Nopt = 174.5367, thrds = 250.00; Data and PDQ curves]

  16. Initial capacity model
    Interpreting R vs. N Data
    [Plot: "PDQ Model of Production Data July 2016", Response time (s) vs. Concurrent users; Nopt = 174.5367, thrds = 250.00; Data and PDQ curves]

  17. Initial capacity model
    Outstanding Questions
    PDQ July model looks good visually but ...
    Requires ∼ 350 “dummy” queues internally to get the correct Rmin
    Service time assumed to be CPU time ∼ 1 ms (see later)
    What do the dummy queues represent in the Tomcat server?
    Successive polling of external services?
    Some kind of hidden parallelism?
    October 2016 data breaks the July PDQ model. Why?

  18. Improved capacity model
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  19. Improved capacity model
    Production X vs. N Data – October 2016
    [Plot: "Production data Oct 2016", Throughput (req/s) vs. Concurrent users]
    Too much data “clouded” the July 2016 analysis

  20. Improved capacity model
    Interpreting X vs. N Data
    [Plot: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users]

  21. Improved capacity model
    Interpreting R vs. N Data
    [Plot: "PDQ Model of Oct 2016 Data", Response time (s) vs. Concurrent users; Data and PDQ curves]

  22. Improved capacity model
    Adjusted PDQ Model
    library(pdq)                # PDQ analytic queueing model library

    usrmax <- 500
    nknee  <- 350
    smean  <- 0.4444            # Rmin seconds
    srate  <- 1 / smean
    arate  <- 2.1               # per user
    users  <- seq(100, usrmax, 50)
    tp     <- NULL              # modeled throughput X(N)
    rt     <- NULL              # modeled response time R(N)
    pdqr   <- TRUE              # PDQ Report

    for (i in 1:length(users)) {
      if (users[i] <= nknee) {
        Arate <- users[i] * arate                     # total arrivals
        pdq::Init("Tomcat Submodel")
        pdq::CreateOpen("requests", Arate)            # open arrival stream
        pdq::CreateMultiNode(users[i], "TCthreads")   # one service center per thread
        pdq::SetDemand("TCthreads", "requests", smean)
        pdq::SetWUnit("Reqs")
        pdq::Solve(CANON)
        tp[i] <- pdq::GetThruput(TRANS, "requests")
        rt[i] <- pdq::GetResponse(TRANS, "requests")
        ....
    Key differences:
    Old service time based on %CPU busy: S = 0.8 ms
    Rmin dominated by time spent inside the external services
    New service time based on Rmin: S = 444.4 ms
    Tomcat threads are now parallel service centers in the PDQ model
    Analogous to every supermarket customer getting their own checkout lane
    (a standalone sketch of this picture follows below)
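    A hypothetical standalone R sketch of this parallel-threads picture, including the region above the knee where the loop above is truncated; this is an assumption about the model's limiting shape, not the authors' full script:
    usrmax <- 500
    nknee  <- 350                 # pseudo-saturation knee (threads)
    smean  <- 0.4444              # Rmin (s), dominated by the external services
    arate  <- 2.1                 # arrival rate per user (req/s)

    users <- seq(50, usrmax, 50)
    Xknee <- nknee * arate        # offered load at the knee (req/s)

    # Below the knee every request gets its own thread, so R ~ Rmin and X tracks
    # the offered load (residual queueing in the thread pool is ignored here);
    # above the knee X is pinned and R = N/X grows linearly.
    tp <- ifelse(users <= nknee, users * arate, Xknee)
    rt <- ifelse(users <= nknee, smean, users / Xknee)

    plot(users, tp, type = "b", xlab = "Concurrent users",
         ylab = "Throughput (req/s)")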

  23. Improved capacity model
    Adjusted 2016 PDQ Outputs
    [Plots: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users and Response time (s) vs. Concurrent users; Data and PDQ curves]

  24. Improved capacity model
    Auto Scaling knee and pseudo-saturation
    [Plot: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users, with the A/S knee marked]
    A/S policy triggered when instance CPU busy > 75%
    Induces pseudo-saturation at Nknee = 300 threads (vertical line)
    No additional Tomcat threads invoked above Nknee in this instance
    A/S spins up additional new EC2 instances (elastic capacity)

  25. Cost of Auto Scaling variants
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  26. Cost of Auto Scaling variants
    AWS Scheduled Scaling
    A/S policy threshold CPU > 75%
    Additional EC2 instances require up to 10 minutes to spin up
    Based on the PDQ model, considered pre-emptive scheduling of EC2s (by the clock)
    Cheaper than A/S but only 10% savings
    Use N service threads to size the number of EC2 instances required for incoming traffic (a sizing sketch follows this list)
    Removes expected spikes in latency and traffic (seen in the time series analysis)
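    A minimal R sizing sketch; the per-instance knee is taken from the PDQ model, while the hourly concurrency forecast and headroom factor are purely illustrative:
    nknee    <- 300                        # concurrent requests one EC2 instance sustains
    n.fcast  <- c(1800, 2600, 3400, 2900)  # forecast total concurrency per time slot (illustrative)
    headroom <- 0.8                        # schedule instances at ~80% of the knee

    instances <- ceiling(n.fcast / (headroom * nknee))
    instances                              # 8 11 15 13 instances to schedule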

  27. Cost of Auto Scaling variants
    AWS Spot Pricing
    Spot instances available at 90% discount over On-demand pricing
    Challenging to diversify instance types and sizes across the same group, e.g.,
    Default instance type m4.10xlarge
    Spot market only has the smaller m4.2xlarge type
    Forces manual reconfiguration of the application
    Thus CPU%, latency, and traffic are no longer useful metrics for the A/S policy
    Instead, use concurrency N as the primary metric in the A/S policy

  28. Cloudy economics
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  29. Cloudy economics
    EC2 Instance Pricing
    [Diagram: "Instance capacity lines"4, Instances vs. Time; capacity lines for Reserved, On-demand, and Spot instances below a Max capacity line, "Missed revenue?" above it, and higher-risk vs. lower-risk capex regions marked]
    This is how AWS sees their own infrastructure capacity
    4
    J.D. Mills, “Amazon Lambda and the Transition to Cloud 2.0”, SF Bay ACM meetup, May 16, 2018

  30. Cloudy economics
    Updated 2018 PDQ Outputs
    [Plots: "PDQ Model of Prod Data Mar 2018", Throughput (req/sec) vs. Concurrent users and Response time (s) vs. Concurrent users; Rmin = 0.2236, Xknee = 1137.65, Nknee = 254.35]
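    As a quick consistency check, the three quoted values obey Little's law at saturation:
    \[
      X_{\mathrm{knee}} \;=\; \frac{N_{\mathrm{knee}}}{R_{\min}}
                        \;=\; \frac{254.35}{0.2236\ \mathrm{s}}
                        \;\approx\; 1137.5\ \mathrm{req/s}
    \]
    % in agreement with the Xknee = 1137.65 req/s reported by the PDQ model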

  31. Cloudy economics
    Performance Evolution 2016 – 2018
    [Plots: "2016 daily users" and "2018 daily users", User requests (N) vs. UTC time (hours)]
    Typical numero uno traffic profile
    Increasingly cost-effective performance:
    Date        Rmin (ms)   Xmax (RPS)   Nknee
    Jul 2016    394.1       761.23       350
    Oct 2016    444.4       675.07       300
    ...         ...         ...          ...
    Mar 2018    223.6       1135.96      254

  32. Cloudy economics
    Name of the Game is Chargeback
    Google Compute Engine also offers reserved and spot pricing
    Table 1: Google VM per-hour pricing5
    Machine        vCPUs   RAM (GB)   Price ($)   Preempt ($)
    n1-umem-40     40      938        6.3039      1.3311
    n1-umem-80     80      1922       12.6078     2.6622
    n1-umem-96     96      1433       10.6740     2.2600
    n1-umem-160    160     3844       25.2156     5.3244
    Capacity planning has not gone away because of the cloud
    Cloud Capacity Management (DZone 2018-07-10)
    Capacity Planning For The Cloud: A New Way Of Thinking Needed
    (DZone April 25, 2018)
    5
    TechCrunch, May 2018

  33. Cloudy economics
    Microsoft Azure Capacity Planning
    Azure config pricing
    Azure sizing spreadsheet

  34. Cloudy economics
    Microsoft Acquires GitHub (cloud) for $7.5 BB 6
    GitHub Enterprise on-site or cloud instances on AWS, Azure, Google or IBM Cloud is
    $21 per user per month
    From Twitter:
    “Supporting the open source ecosystem is way more important to MS than anything else—the revenue they make from
    hosting OSS-based apps on Azure in the future will dwarf their current devtools revenue.”
    “[MS] isn’t the same company that [previously] hated on open source, mostly because it’s [now] symbiotic to their hosting
    business. They didn’t start supporting open source from altruism!”
    6
    NOTE: That’s Bs, as in billions, not Ms

  35. Cloudy economics
    Goldman Sachs — IT Spending Survey 2018
    Typical headline:
    “Company ABC migrates to cloud service XYZ in 10 days, reduces costs by 60%”
    Public cloud: AWS, Microsoft and Google were the three big winners but
    that will be of no shock to anyone as the triumvirate already hold the
    lion’s share of market sales.
    “While we remain bullish on the overall public cloud opportunity, we see
    an increasing number of companies confronting the realities and
    challenges of migrating workloads and re-platforming apps,”
    “As enterprises come to the conclusion that their IT paradigm will likely
    be hybrid for longer than anticipated, this dynamic continues to benefit
    on-premise spending.”
    [Source: The Register, 10 July 2018]

  36. Cloudy economics
    Summary
    Cloud services are more about economic benefit for
    the hosting company than they are about technological
    innovation for the consumer 7
    Old-fashioned mainframe chargeback is back! 8
    It’s incumbent on paying customers to minimize their
    own cloud services costs
    Meaningful cost-benefit decisions require ongoing
    performance analysis and capacity planning
    PDQ model presented here is a simple yet insightful
    example of cloud sizing and performance tools 9
    Queueing model framework helps expose where
    hidden performance costs actually reside
    You only have the cloud capacity that you pay for
    7
    Not just plug-and-play. More like pay-and-pay!
    8
    Chargeback had disappeared with the advent of non-monolithic client-server architectures
    9
    PDQ Workshop is available at a discount to CMG members. Email [email protected] for details.

  37. Cloudy economics
    Questions?
    www.perfdynamics.com
    Castro Valley, California
    Training — including the PDQ Workshop
    Blog
    Twitter
    Facebook
    [email protected] — any outstanding questions
    +1-510-537-5758