Exposing the Cost of Performance Hidden in the Cloud

Whilst offering lift-and-shift deployment and elastic capacity, the cloud also reintroduces an old mainframe concept—chargeback—which thereby rejuvenates the need for performance and capacity management. Combining production JMX data with an appropriate performance model, we show how to assess fee-based EC2 configurations for a mobile-user application running on Linux-Tomcat hosted by AWS. The capacity model can be used for ongoing cost-benefit analysis of different AWS Auto Scaling policies.

Dr. Neil Gunther

June 19, 2018

Transcript

  1. Exposing the Cost of Performance
    Hidden in the Cloud
    Neil Gunther @DrQz
    Performance Dynamics Consulting, Castro Valley, California
    Mohit Chawla @a1cy
    Independent Systems Engineer, Hamburg, Germany
    Performance Dynamics Co.
    CMG Event CLOUDXCHANGE Online
    10am Pacific (5pm UTC), June 19, 2018

  2. Exposing the Cost of Performance
    Hidden in the Cloud
    Neil Gunther and Mohit Chawla
    Abstract
    Whilst offering lift-and-shift migration and versatile elastic capacity, the cloud also
    reintroduces an old mainframe concept — chargeback1 — which thereby
    rejuvenates the need for traditional performance and capacity management in the
    new cloud context. Combining production JMX data with an appropriate
    performance model, we show how to assess fee-based Amazon AWS configurations
    for a mobile-user application running on a Linux-hosted Tomcat cluster. The
    performance model also facilitates ongoing cost-benefit analysis of various EC2
    Auto Scaling policies.
    1
    Chargeback underpins the cloud business model, especially for hot application development, e.g., “Microsoft wants every
    developer to be an AI developer, which would help its already booming Azure Cloud business do better still: AI demands data,
    which requires cloud processing power and generates bills.” —The Register, May 2018

  3. AWS cloud environment
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  4. AWS cloud environment
    Application Cloud Platform
    Entire application runs in the Amazon cloud
    Mobile Internet users
    ELB load balancer
    Auto Scaling (A/S) group
    AWS EC2 cluster
    Mobile users make requests to Apache
    HTTP-server2 via ELB on EC2
    Tomcat thread-server3 on EC2 calls external
    services (belonging to 3rd parties)
    Auto Scaling controls number of EC2 instances
    based on incoming traffic and configured A/S
    policies
    ELB balances incoming traffic across all EC2
    nodes in AWS cluster
    2
    Versions 2.2 and 2.4
    3
    Versions 7 and 8

  5. AWS cloud environment
    Request Processing
    On a single EC2 instance:
    1 Incoming HTTP Request from mobile user processed by Apache + Tomcat
    2 Tomcat then sends multiple requests to External Services based on original request
    3 External services respond and Tomcat computes business logic based on all those
    Responses
    4 Tomcat sends the final Response back to originating mobile user

  6. AWS cloud environment
    Performance Tools and Scripts
    JMX (Java Management Extensions) data from JVM
    jmxterm
    VisualVM
    Java Mission Control
    Datadog dd-agent
    Datadog — also integrates with AWS CloudWatch metrics
    Collectd — Linux performance statistics collection
    Graphite and statsd — application metrics collection & storage
    Grafana — time-series data plotting
    Custom data collection scripts
    R statistical libs and RStudio IDE
    PDQ performance modeling lib

  7. Performance data validation
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  8. Performance data validation
    Production Data Collection
    1 Raw performance metrics:
    Performance data primarily collected by datadog (dd-agent)
    Mobile-user requests are analyzed as a homogeneous workload
    JMX provides the GlobalRequestProcessor MBean:
    requestCount: total number of requests
    processingTime: total processing time for all requests
    2 Derived performance metrics:
    Convert requestCount to a rate in the datadog config to get average
    throughput Xdat as requests/second
    Average request processing time (seconds) is then derived as
    Rdat = processingTime_T / requestCount_T
    during the same measurement interval, e.g., T = 300 seconds
    (a worked sketch of this reduction follows below)
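    As an illustration of this reduction, a minimal R sketch for a single T = 300 s interval; the counter values are made up (chosen to land near the first data row shown later), and processingTime is assumed to be reported in milliseconds:
    Tsamp <- 300                        # sampling interval T (seconds)

    # Cumulative GlobalRequestProcessor counters at the start and end of the interval
    requestCount.t0   <- 1.200e6
    requestCount.t1   <- 1.200e6 + 150651
    processingTime.t0 <- 4.100e8        # assumed milliseconds
    processingTime.t1 <- 4.100e8 + 50.73e6

    dReq  <- requestCount.t1   - requestCount.t0    # requests completed during T
    dProc <- processingTime.t1 - processingTime.t0  # processing time accrued during T (ms)

    Xdat <- dReq / Tsamp                # average throughput, ~502 req/s
    Rdat <- (dProc / dReq) / 1000       # average processing time, ~0.337 s per request
    c(Xdat = Xdat, Rdat = Rdat)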

  9. Performance data validation
    Concurrency and Service Times
    Apply Little’s law to derive additional performance metrics from the data:
    concurrency (N) and service time (S)
    1 Little’s Law — macroscopic version
    N = X ∗ R (gives concurrency)
    Nest is the calculated or estimated number of concurrent requests in
    Tomcat during each measurement interval
    Verify correctness by comparing Nest with the measured number of
    threads Ndat in the service stage of Tomcat
    We find Nest ≡ Ndat
    2 Little’s Law — microscopic version
    U = X ∗ S (gives service time)
    Udat is the measured processor utilization reported by dd-agent
    (as a decimal fraction, not %)
    Already have throughput X (reqs/sec) from the collected JMX data
    Estimated service time metric is S = U/X (a numeric check follows below)
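    A minimal numeric check in R, using the first row of the reduced data shown on the next slide:
    Xdat <- 502.171674      # throughput (req/s) from JMX
    Rdat <- 0.336740        # request processing time (s) from JMX
    Udat <- 0.458120        # CPU utilization (fraction) from dd-agent

    Nest <- Xdat * Rdat     # macroscopic LL: ~169 concurrent requests,
                            # in line with the ~170 busy Tomcat threads measured
    Sest <- Udat / Xdat     # microscopic LL: ~0.000912 s, i.e. ~0.9 ms CPU per request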

  10. Performance data validation
    Reduced EC2 Instance Data
    These few metrics will be used to parameterize our capacity model
    Timestamp, Xdat, Nest, Sest, Rdat, Udat
    1486771200000, 502.171674, 170.266663, 0.000912, 0.336740, 0.458120
    1486771500000, 494.403035, 175.375000, 0.001043, 0.355975, 0.515420
    1486771800000, 509.541751, 188.866669, 0.000885, 0.360924, 0.450980
    1486772100000, 507.089094, 188.437500, 0.000910, 0.367479, 0.461700
    1486772400000, 532.803039, 191.466660, 0.000880, 0.362905, 0.468860
    1486772700000, 528.587722, 201.187500, 0.000914, 0.366283, 0.483160
    1486773000000, 533.439054, 202.600006, 0.000892, 0.378207, 0.476080
    1486773300000, 531.708059, 208.187500, 0.000909, 0.392556, 0.483160
    1486773600000, 532.693783, 203.266663, 0.000894, 0.379749, 0.476020
    1486773900000, 519.748550, 200.937500, 0.000895, 0.381078, 0.465260
    ...
    Unix Timestamp interval between rows is 300 seconds
    Little’s law gives relationships between the above metrics (checked in the sketch below):
    1 Nest = Xdat ∗ Rdat is the macroscopic LL
    2 Udat = Xdat ∗ Sest is the microscopic LL
    3 Time-averaged over T = 300 sec sampling intervals
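    A minimal R sketch of this check across the whole series; the file name is hypothetical and the columns are assumed to be ordered as listed above:
    ec2 <- read.csv("ec2_reduced_metrics.csv")   # columns: Timestamp, Xdat, Nest, Sest, Rdat, Udat

    macro.err <- (ec2$Xdat * ec2$Rdat - ec2$Nest) / ec2$Nest   # N = X * R
    micro.err <- (ec2$Xdat * ec2$Sest - ec2$Udat) / ec2$Udat   # U = X * S

    summary(abs(macro.err))   # expect relative errors of only a few percent
    summary(abs(micro.err))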

  11. Initial capacity model
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  12. Initial capacity model
    Time Series View (Monitoring)

  13. Initial capacity model
    Time-Independent View
    [Sketch plots: Thread-limited Throughput (X vs. N) and Thread-limited Latency (R vs. N)]
    Queueing theory tells us what to expect (bounds sketched after this list):
    Relationship between metrics, e.g., X and N
    Number of requests is thread-limited to N ≤ 500 typically
    Throughput X approaches a saturation ceiling as N → 500 (concave)
    Response time R grows linearly, aka “hockey stick handle” (convex)
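    These expectations can be summarized by the standard asymptotic bounds for a thread-limited server (a sketch, not from the original deck), writing Rmin for the unloaded response time and Nknee for the thread limit:
    % Standard asymptotic bounds for a thread-limited server
    \[
      X(N) \le \min\!\left(\frac{N}{R_{\min}},\ \frac{N_{\mathrm{knee}}}{R_{\min}}\right),
      \qquad
      R(N) \ge \max\!\left(R_{\min},\ \frac{N\,R_{\min}}{N_{\mathrm{knee}}}\right)
    \]
    % Below the knee X climbs toward its ceiling (concave); above it X is capped
    % and R grows linearly in N (the "hockey stick handle").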

  14. Initial capacity model
    Production X vs. N Data – July 2016
    [Plot: "Production Data July 2016", Throughput (req/s) vs. Concurrent users]

  15. Initial capacity model
    Interpreting X vs. N Data
    [Plot: "PDQ Model of Production Data July 2016", Throughput (req/s) vs. Concurrent users; Nopt = 174.5367, thrds = 250.00; Data and PDQ curves]

  16. Initial capacity model
    Interpreting R vs. N Data
    [Plot: "PDQ Model of Production Data July 2016", Response time (s) vs. Concurrent users; Nopt = 174.5367, thrds = 250.00; Data and PDQ curves]

  17. Initial capacity model
    Outstanding Questions
    PDQ July model looks good visually but ...
    Requires ∼ 350 “dummy” queues internally to get the correct Rmin
    Service time assumed to be CPU time ∼ 1 ms (see later)
    What do the dummy queues represent in the Tomcat server?
    Successive polling of external services?
    Some kind of hidden parallelism?
    October 2016 data breaks the July PDQ model. Why?

  18. Improved capacity model
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  19. Improved capacity model
    Production X vs. N Data – October 2016
    [Plot: "Production data Oct 2016", Throughput (req/s) vs. Concurrent users]
    Too much data “clouded” the July 2016 analysis

  20. Improved capacity model
    Interpreting X vs. N Data
    [Plot: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users]

  21. Improved capacity model
    Interpreting R vs. N Data
    [Plot: "PDQ Model of Oct 2016 Data", Response time (s) vs. Concurrent users; Data and PDQ curves]

  22. Improved capacity model
    Adjusted PDQ Model
    library(pdq)                # PDQ analytic queueing model library

    usrmax <- 500
    nknee  <- 350
    smean  <- 0.4444            # Rmin seconds
    srate  <- 1 / smean
    arate  <- 2.1               # per user
    users  <- seq(100, usrmax, 50)
    tp     <- NULL              # modeled throughput X(N)
    rt     <- NULL              # modeled response time R(N)
    pdqr   <- TRUE              # PDQ Report

    for (i in 1:length(users)) {
      if (users[i] <= nknee) {
        Arate <- users[i] * arate                     # total arrivals
        pdq::Init("Tomcat Submodel")
        pdq::CreateOpen("requests", Arate)            # open arrival stream
        pdq::CreateMultiNode(users[i], "TCthreads")   # one service center per thread
        pdq::SetDemand("TCthreads", "requests", smean)
        pdq::SetWUnit("Reqs")
        pdq::Solve(CANON)
        tp[i] <- pdq::GetThruput(TRANS, "requests")
        rt[i] <- pdq::GetResponse(TRANS, "requests")
        ....
    Key differences:
    Old service time based on %CPU busy: S = 0.8 ms
    Rmin dominated by time spent inside the external services
    New service time based on Rmin: S = 444.4 ms
    Tomcat threads are now parallel service centers in the PDQ model
    Analogous to every supermarket customer getting their own checkout lane
    (a standalone sketch of this picture follows below)
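    A hypothetical standalone R sketch of this parallel-threads picture, including the region above the knee where the loop above is truncated; this is an assumption about the model's limiting shape, not the authors' full script:
    usrmax <- 500
    nknee  <- 350                 # pseudo-saturation knee (threads)
    smean  <- 0.4444              # Rmin (s), dominated by the external services
    arate  <- 2.1                 # arrival rate per user (req/s)

    users <- seq(50, usrmax, 50)
    Xknee <- nknee * arate        # offered load at the knee (req/s)

    # Below the knee every request gets its own thread, so R ~ Rmin and X tracks
    # the offered load (residual queueing in the thread pool is ignored here);
    # above the knee X is pinned and R = N/X grows linearly.
    tp <- ifelse(users <= nknee, users * arate, Xknee)
    rt <- ifelse(users <= nknee, smean, users / Xknee)

    plot(users, tp, type = "b", xlab = "Concurrent users",
         ylab = "Throughput (req/s)")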

  23. Improved capacity model
    Adjusted 2016 PDQ Outputs
    [Plots: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users and Response time (s) vs. Concurrent users; Data and PDQ curves]

  24. Improved capacity model
    Auto Scaling knee and pseudo-saturation
    [Plot: "PDQ Model of Oct 2016 Data", Throughput (req/s) vs. Concurrent users, with the A/S knee marked]
    A/S policy triggered when instance CPU busy > 75%
    Induces pseudo-saturation at Nknee = 300 threads (vertical line)
    No additional Tomcat threads invoked above Nknee in this instance
    A/S spins up additional new EC2 instances (elastic capacity)

  25. Cost of Auto Scaling variants
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  26. Cost of Auto Scaling variants
    AWS Scheduled Scaling
    A/S policy threshold CPU > 75%
    Additional EC2 instances require up to 10 minutes to spin up
    Based on the PDQ model, considered pre-emptive scheduling of EC2s (by the clock)
    Cheaper than A/S but only 10% savings
    Use N service threads to size the number of EC2 instances required for incoming traffic (a sizing sketch follows this list)
    Removes expected spikes in latency and traffic (seen in the time series analysis)
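    A minimal R sizing sketch; the per-instance knee is taken from the PDQ model, while the hourly concurrency forecast and headroom factor are purely illustrative:
    nknee    <- 300                        # concurrent requests one EC2 instance sustains
    n.fcast  <- c(1800, 2600, 3400, 2900)  # forecast total concurrency per time slot (illustrative)
    headroom <- 0.8                        # schedule instances at ~80% of the knee

    instances <- ceiling(n.fcast / (headroom * nknee))
    instances                              # 8 11 15 13 instances to schedule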

  27. Cost of Auto Scaling variants
    AWS Spot Pricing
    Spot instances available at 90% discount over On-demand pricing
    Challenging to diversify instance types and sizes across the same group, e.g.,
    Default instance type m4.10xlarge
    Spot market only has the smaller m4.2xlarge type
    Forces manual reconfiguration of the application
    Thus CPU%, latency, and traffic are no longer useful metrics for the A/S policy
    Instead, use concurrency N as the primary metric in the A/S policy

  28. Cloudy economics
    Outline
    1 AWS cloud environment
    2 Performance data validation
    3 Initial capacity model
    4 Improved capacity model
    5 Cost of Auto Scaling variants
    6 Cloudy economics

  29. Cloudy economics
    EC2 Instance Pricing
    [Diagram: "Instance capacity lines"4, Instances vs. Time; capacity lines for Reserved, On-demand, and Spot instances below a Max capacity line, "Missed revenue?" above it, and higher-risk vs. lower-risk capex regions marked]
    This is how AWS sees their own infrastructure capacity
    4
    J.D. Mills, “Amazon Lambda and the Transition to Cloud 2.0”, SF Bay ACM meetup, May 16, 2018

  30. Cloudy economics
    Updated 2018 PDQ Outputs
    [Plots: "PDQ Model of Prod Data Mar 2018", Throughput (req/sec) vs. Concurrent users and Response time (s) vs. Concurrent users; Rmin = 0.2236, Xknee = 1137.65, Nknee = 254.35]
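    As a quick consistency check, the three quoted values obey Little's law at saturation:
    \[
      X_{\mathrm{knee}} \;=\; \frac{N_{\mathrm{knee}}}{R_{\min}}
                        \;=\; \frac{254.35}{0.2236\ \mathrm{s}}
                        \;\approx\; 1137.5\ \mathrm{req/s}
    \]
    % in agreement with the Xknee = 1137.65 req/s reported by the PDQ model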

  31. Cloudy economics
    Performance Evolution 2016 – 2018
    [Plots: "2016 daily users" and "2018 daily users", User requests (N) vs. UTC time (hours)]
    Typical numero uno traffic profile
    Increasingly cost-effective performance:
    Date        Rmin (ms)   Xmax (RPS)   Nknee
    Jul 2016    394.1       761.23       350
    Oct 2016    444.4       675.07       300
    ...         ...         ...          ...
    Mar 2018    223.6       1135.96      254

  32. Cloudy economics
    Name of the Game is Chargeback
    Google Compute Engine also offers reserved and spot pricing
    Table 1: Google VM per-hour pricing5
    Machine        vCPUs   RAM (GB)   Price ($)   Preempt ($)
    n1-umem-40     40      938        6.3039      1.3311
    n1-umem-80     80      1922       12.6078     2.6622
    n1-umem-96     96      1433       10.6740     2.2600
    n1-umem-160    160     3844       25.2156     5.3244
    Capacity planning has not gone away because of the cloud
    Cloud Capacity Management (DZone 2018-07-10)
    Capacity Planning For The Cloud: A New Way Of Thinking Needed
    (DZone April 25, 2018)
    5
    TechCrunch, May 2018

  33. Cloudy economics
    Microsoft Azure Capacity Planning
    Azure config pricing
    Azure sizing spreadsheet

  34. Cloudy economics
    Microsoft Acquires GitHub (cloud) for $7.5 BB 6
    GitHub Enterprise on-site or cloud instances on AWS, Azure, Google or IBM Cloud is
    $21 per user per month
    From Twitter:
    “Supporting the open source ecosystem is way more important to MS than anything else—the revenue they make from
    hosting OSS-based apps on Azure in the future will dwarf their current devtools revenue.”
    “[MS] isn’t the same company that [previously] hated on open source, mostly because it’s [now] symbiotic to their hosting
    business. They didn’t start supporting open source from altruism!”
    6
    NOTE: That’s Bs, as in billions, not Ms

  35. Cloudy economics
    Goldman Sachs — IT Spending Survey 2018
    Typical headline:
    “Company ABC migrates to cloud service XYZ in 10 days, reduces costs by 60%”
    Public cloud: AWS, Microsoft and Google were the three big winners but
    that will be of no shock to anyone as the triumvirate already hold the
    lion’s share of market sales.
    “While we remain bullish on the overall public cloud opportunity, we see
    an increasing number of companies confronting the realities and
    challenges of migrating workloads and re-platforming apps,”
    “As enterprises come to the conclusion that their IT paradigm will likely
    be hybrid for longer than anticipated, this dynamic continues to benefit
    on-premise spending.”
    [Source: The Register, 10 July 2018]

  36. Cloudy economics
    Summary
    Cloud services are more about economic benefit for
    the hosting company than they are about technological
    innovation for the consumer 7
    Old-fashioned mainframe chargeback is back! 8
    It’s incumbent on paying customers to minimize their
    own cloud services costs
    Meaningful cost-benefit decisions require ongoing
    performance analysis and capacity planning
    PDQ model presented here is a simple yet insightful
    example of cloud sizing and performance tools 9
    Queueing model framework helps expose where
    hidden performance costs actually reside
    You only have the cloud capacity that you pay for
    7
    Not just plug-and-play. More like pay-and-pay!
    8
    Chargeback had disappeared with the advent of non-monolithic client-server architectures
    9
    PDQ Workshop is available at a discount to CMG members. Email [email protected] for details.

  37. Cloudy economics
    Questions?
    www.perfdynamics.com
    Castro Valley, California
    Training — including the PDQ Workshop
    Blog
    Twitter
    Facebook
    [email protected] — any outstanding questions
    +1-510-537-5758