Optimization of Public Cloud for Efficiency and Scale

Twelve months ago, Cloudability's operations team set out on an ambitious project to control AWS spending while adopting automation best practices and improving overall efficiency. This talk highlights lessons learned in optimizing AWS operations, resulting in a spending reduction of over 50% while doubling the improvement in resource utilization and growing top-line company revenue.

Specific topics will include Cloudability's use of automation tools including Puppet, Docker and Packer in addition to AWS-specific tools like Lambda functions, Elastic Container Service, the EC2 Spot Market, Auto Scaling Groups and AWS Aurora.

This talk will cover specific elements of operating a SaaS infrastructure for cloud efficiency but should be applicable to all users of cloud computing.

Erik Onnen

June 12, 2017
Transcript

  1. @cloudability Introduction: About Me
    Erik Onnen, VP of Engineering
    ๏ VP Engineering @cloudability (2015)
    ๏ Previously Urban Airship, Jive Software, Liberty Mutual, Opsware
    ๏ Background as systems engineer and operator at venture-backed startups
    ๏ Contributions to HBase, Cassandra, Netty
    ๏ AWS user since 2009
  2. Cloudability: A Brief History of Cloudability
    ๏ Founded in 2010 in Portland, OR
    ๏ Series B venture funded company
    ๏ Customers include the largest consumers of public cloud
    ๏ Pure SaaS offering
    ๏ Built predominantly in AWS
    ๏ Followed typical startup path
      ๏ Grow fast and worry about consequences later
      ๏ Do what’s comfortable
  3. Context: Topics
    ๏ Terminology
    ๏ History of Cloudability’s use of cloud computing
    ๏ Why we changed
    ๏ What we changed
    ๏ Evolution of systems and improvement initiatives
    ๏ Q&A
  4. Context: Terminology
    ๏ Public Cloud and PaaS
      ๏ AWS
      ๏ Azure
      ๏ GCP
    ๏ Efficiency
    ๏ Scale
    ๏ Compute Resources
  5. Evolution: Why we changed
    ๏ Ran our cloud like a data center
    ๏ Never trued up decisions or reconciled value
    ๏ Sporadic, inaccurate capacity planning
    ๏ Platform was not adapting to change in AWS
    ๏ One VM per workload
    ๏ No team accountability or ownership
    ๏ Little financial leverage over infrastructure
    ๏ Growth in cloud spend was not sustainable
    ๏ We need to tell an authentic, credible story
  6. Evolution: What we changed
    ๏ Team Structure
    ๏ Architecture
    ๏ Operations Approach
    ๏ Approach to Cloud
  7. Evolution: What we changed
    ๏ Team Structure
      ๏ Functional teams with clearly accountable leadership
      ๏ SRE model for system accountability
      ๏ KPIs around efficiency
      ๏ Unit economics driven
    ๏ Architecture
      ๏ Embrace native platform tools and services
      ๏ Don’t operate things we’re not good at
      ๏ Simplify where the business allows
  8. Evolution: What we changed (cont.)
    ๏ Revamped Operations team to:
      ๏ Become the experts in platform capabilities
      ๏ Further the advancement of “cloud native” approaches including elasticity, spot, containerization across engineering teams
      ๏ Provide shared services including log aggregation, metrics, monitoring and alerting
      ๏ Provide security resources
      ๏ Act as a force multiplier
      ๏ Drink our own champagne
  9. Cloud Adoption: The Journey
    [Diagram: cloud efficiency plotted against time across four stages. Stage I: Visibility; Stage II: Financial Leverage; Stage III: Platform Leverage & Optimization; Stage IV: Business Alignment]
  10. Cloud Adoption: The Journey
    [Diagram: cloud efficiency plotted against time across four stages. Stage I: Visibility; Stage II: Financial Leverage; Stage III: Platform Leverage & Optimization; Stage IV: Business Alignment]
  11. Evolution: Resource Visibility
    ๏ Implemented proper tagging (97% attainment)
    ๏ Team structure allows visibility into spend by architectural function
    ๏ Review spend for a team on a regular basis and as part of QBRs
    ๏ Planning simplified through a single point of accountability for each function
    ๏ Teams make more informed decisions and learn from each other
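The 97% tagging attainment figure can be read as the share of resources that carry every required tag. Below is a minimal sketch of that measurement, assuming resources are available as plain dictionaries. The tag keys `team` and `function` and the sample inventory are hypothetical and do not come from the talk.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical required tag keys; the talk does not list the actual ones.
REQUIRED_TAGS: Sequence[str] = ("team", "function")


def tag_attainment(resources: Iterable[Mapping],
                   required: Sequence[str] = REQUIRED_TAGS) -> float:
    """Fraction of resources that carry a non-empty value for every required tag."""
    resources = list(resources)
    if not resources:
        return 0.0
    compliant = sum(
        1 for r in resources
        if all(r.get("tags", {}).get(key) for key in required)
    )
    return compliant / len(resources)


# Illustrative inventory (made-up IDs and tags).
inventory = [
    {"id": "i-001", "tags": {"team": "data", "function": "ingest"}},
    {"id": "i-002", "tags": {"team": "web", "function": "api"}},
    {"id": "i-003", "tags": {"team": "web"}},  # missing "function"
]
print(f"tag attainment: {tag_attainment(inventory):.0%}")
```

Weighting each resource by its spend instead of counting resources would be a reasonable variant when a few large resources dominate the bill.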
  12. Cloud Adoption: The Journey
    [Diagram: cloud efficiency plotted against time across four stages. Stage I: Visibility; Stage II: Financial Leverage; Stage III: Platform Leverage & Optimization; Stage IV: Business Alignment]
  13. Evolution: Financial Leverage
    ๏ Accountable teams simplify assessment of spending commitments
    ๏ Optimized with ML models
    ๏ ≈90% RI coverage for specific workloads
    ๏ ≈30% savings with near zero engineering effort
    ๏ Cover modeled unit consumption, not individual hosts*
    ๏ Hedge using secondary market for RIs
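To illustrate covering modeled consumption rather than individual hosts, here is a small sketch that picks how many instances to reserve from an hourly usage series. It substitutes a brute-force search for the ML models the talk mentions, and the usage series and both hourly rates are made-up assumptions.

```python
from typing import Sequence, Tuple


def best_reservation_count(hourly_usage: Sequence[int],
                           on_demand_rate: float,
                           reserved_rate: float) -> Tuple[int, float]:
    """Return (reservation count, total cost) minimizing cost over the series.

    reserved_rate is the effective hourly price of one reservation, paid for
    every hour whether or not the capacity is used.
    """
    if not hourly_usage:
        return 0, 0.0
    hours = len(hourly_usage)
    best_n, best_cost = 0, float("inf")
    for n in range(max(hourly_usage) + 1):
        reserved_cost = n * reserved_rate * hours
        # Usage above the reserved count is billed at the on-demand rate.
        overflow_cost = on_demand_rate * sum(max(0, u - n) for u in hourly_usage)
        total = reserved_cost + overflow_cost
        if total < best_cost:
            best_n, best_cost = n, total
    return best_n, best_cost


def coverage(hourly_usage: Sequence[int], reserved_count: int) -> float:
    """Share of consumed instance-hours that fall under a reservation."""
    total = sum(hourly_usage)
    if total == 0:
        return 0.0
    return sum(min(u, reserved_count) for u in hourly_usage) / total


usage = [4, 4, 6, 10]  # instances running in each hour (illustrative)
count, cost = best_reservation_count(usage, on_demand_rate=1.0, reserved_rate=0.6)
print(count, round(cost, 2), f"{coverage(usage, count):.0%}")
```

The search reserves for the steady baseline and leaves the peaks on demand, which is why coverage can sit well below 100% and still minimize cost.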
  14. Cloud Adoption: The Journey
    [Diagram: cloud efficiency plotted against time across four stages. Stage I: Visibility; Stage II: Financial Leverage; Stage III: Platform Leverage & Optimization; Stage IV: Business Alignment]
  15. Evolution: Leverage & Simplify
    Initiative: Migrate from using Riak as a time series data store to an event-driven S3 architecture for the data pipeline
    [Before and after architecture diagrams]
  16. Evolution: Leverage & Simplify
    ๏ Unraveling NIH-induced complexity
    ๏ €10,000 monthly spend reduction by leveraging platform features
    ๏ Significant reduction in operational complexity
    ๏ No Erlang stack traces!
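The slides do not show how the event-driven S3 pipeline is wired. One common shape, sketched here as an assumption, is a Lambda-style function invoked by S3 event notifications. `process_object` and the bucket and key names are hypothetical; the record layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) follows the S3 notification message format.

```python
from urllib.parse import unquote_plus


def process_object(bucket: str, key: str) -> str:
    """Hypothetical pipeline step; a real one would read and transform the object."""
    return f"s3://{bucket}/{key}"


def handler(event, context=None):
    """Lambda-style entry point, called once per batch of S3 event records."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_object(bucket, key))
    return results


# Hand-built sample event with made-up names, for local experimentation.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "usage/2017-06-12/part+0001.csv"}}}
    ]
}
print(handler(sample_event))
```

Because each new object triggers its own invocation, there is no always-on datastore cluster to operate for this stage, which fits the reduction in operational complexity the slide reports.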
  17. Evolution: MySQL to Aurora

    Comparison                 RDS MySQL                      AWS Aurora
    SELECT .9 Latency          360 seconds                    120 seconds
    PIOPS                      30,000 (23,000)                0
    Replica Provision Time     30 hours                       7 minutes
    DML .9 Replication Time    ≈ 7 minutes                    20 ms
    Failure Mode Operations    Manual diagnosis and election  Aurora managed
    Operational Costs          € x                            € .75x
  18. Evolution: Platform Adaptation

    Comparison                 i2.8xlarge                     i3.8xlarge
    vCPU                       32                             32
    GB RAM                     244                            244
    Local Storage              8 x 800GB SSD                  4 x 1900GB NVMe SSD
    Network                    10Gbps                         40Gbps
    Mean HBase Scan Latency    100ms                          85ms
    Availability               Single AZ placement group      Multi-AZ
    Encryption at Rest         Operator managed               Default
    Hourly Cost                $6.82                          $2.496
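The per-instance differences in the comparison above can be worked out directly from the slide's figures. This sketch uses only the hourly cost and scan latency from the slide; the helper name is illustrative. It compares list prices for a single instance and says nothing about fleet-level savings, which also depend on instance counts and reservations.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Figures taken from the slide.
I2_HOURLY_USD, I3_HOURLY_USD = 6.82, 2.496
I2_SCAN_MS, I3_SCAN_MS = 100.0, 85.0

cost_reduction = percent_reduction(I2_HOURLY_USD, I3_HOURLY_USD)
latency_reduction = percent_reduction(I2_SCAN_MS, I3_SCAN_MS)

print(f"hourly cost reduction per instance: {cost_reduction:.1f}%")
print(f"mean HBase scan latency reduction: {latency_reduction:.1f}%")
```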
  19. Evolution: Resource Optimization - Containers
    ๏ Ported asynchronous queue consumers to containers from individual EC2 hosts
    ๏ Microservices as containers on ECS with ALB with blue/green deploys
      ๏ Automated smoke testing of new deployments prior to cutover
      ๏ Graceful cutover of ALB target groups
      ๏ Seamless rollback for stateless services
    ๏ Improvements in
      ๏ Utilization - 50% async worker efficiency
      ๏ Cost savings - 60% before RIs
      ๏ Operational complexity
      ๏ Developer agility
      ๏ Security
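The blue/green sequence on this slide (smoke test, cutover, rollback) can be sketched as plain control flow. This is an illustration under assumptions, not Cloudability's tooling: `smoke_test` and `route_traffic_to` are hypothetical callables that a real deployment would back with health checks and a load balancer target group switch.

```python
from typing import Callable


def blue_green_cutover(blue: str,
                       green: str,
                       smoke_test: Callable[[str], bool],
                       route_traffic_to: Callable[[str], None]) -> str:
    """Return the name of the environment serving traffic after the attempt."""
    # 1. Test the new (green) deployment before it receives any traffic.
    if not smoke_test(green):
        return blue  # blue was never taken out of service
    # 2. Cut over, for example by pointing the load balancer at green.
    route_traffic_to(green)
    # 3. Re-check under live traffic and roll back if green misbehaves.
    if not smoke_test(green):
        route_traffic_to(blue)
        return blue
    return green


# Usage with stand-in callables that record what happened.
routed = []
live = blue_green_cutover("blue", "green",
                          smoke_test=lambda env: True,
                          route_traffic_to=routed.append)
print(live, routed)
```

Rollback is a single routing change here because the services are stateless, which matches the slide's "seamless rollback for stateless services".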
  20. Evolution: Case Study - EC2 Spot Optimization
    ๏ Multiple workloads moved to EC2 Spot:
      ๏ Time insensitive EMR jobs
      ๏ CI worker scaling
      ๏ Non-critical container scaling
      ๏ Event-driven ASG scale up for large file processing
    ๏ Resulting in:
      ๏ €40,000 monthly savings over on-demand
      ๏ Lower sensitivity to elastic EC2 misuse
      ๏ Improved systems design discipline
  21. Evolution: Case Study - EC2 Spot Optimization
    ๏ Proper use of Spot requires:
      ๏ Probabilistic model and coordination of unused RIs
      ๏ Multinomial model for market pricing and termination probability
      ๏ Balanced use of Spot Block and Fleet
      ๏ Deliberate overfitting for price optimization
      ๏ Deliberate workload deferral to maximize price optimization
      ๏ Ensemble approaches to all of the above
      ๏ Architecture portability
      ๏ Checkpointing long-running workloads
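Of the requirements above, checkpointing is the easiest to show in a few lines. The sketch below assumes a workload that can be split into ordered items and a local JSON file as the checkpoint store. The function name, file layout and `should_stop` hook are all hypothetical; a real Spot worker would more likely persist progress to durable storage and tie `should_stop` to the instance termination notice.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, process, checkpoint_path, should_stop=lambda: False):
    """Process items in order, resuming from the last saved position.

    Returns the index of the next unprocessed item (len(items) when done).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        if should_stop():  # e.g. an interruption notice was observed
            return i
        process(items[i])
        # Write to a temp file and rename so a crash never leaves a torn checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return len(items)


# Simulate an interruption after two items, then a resumed run.
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "checkpoint.json")
done = []
polls = {"n": 0}


def interrupted_after_two():
    polls["n"] += 1
    return polls["n"] > 2


first = run_with_checkpoints(list("abcde"), done.append, checkpoint, interrupted_after_two)
second = run_with_checkpoints(list("abcde"), done.append, checkpoint)
print(first, second, done)
```

If the process dies between `process` and the checkpoint write, that one item is processed again on resume, so this pattern suits idempotent steps.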
  22. Cloudability: The Journey
    [Diagram: cloud efficiency plotted against time across four stages. Stage I: Visibility; Stage II: Financial Leverage; Stage III: Platform Leverage & Optimization; Stage IV: Business Alignment]
  23. Cloudability Engineering: Key SaaS Operating Concepts
    ๏ As a SaaS business:
      ๏ Selling € x worth of features to our customers has a non-zero cost to deliver
      ๏ Revenue has a direct correlation to our AWS bill
      ๏ Cost of doing business == COGS
      ๏ Managing the top line growth vs. COGS is gross margin
      ๏ Active management of gross margin is critical to success for our business
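The relationship between revenue, COGS and gross margin described above is the standard formula, gross margin = (revenue - COGS) / revenue. A short sketch follows; the feature names and amounts are invented for illustration and are not Cloudability figures.

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """Gross margin as a fraction of revenue: (revenue - COGS) / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cogs) / revenue


# Hypothetical monthly figures per feature: (revenue, cloud cost to deliver).
features = {
    "reporting": (100_000.0, 25_000.0),
    "forecasting": (40_000.0, 30_000.0),
}
for name, (revenue, cogs) in features.items():
    print(f"{name}: {gross_margin(revenue, cogs):.0%} gross margin")
```

Breaking the calculation out per feature or per customer is what makes it possible to tune packaging and pricing against margin.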
  24. Evolution: Case Study - Business Alignment
    ๏ High fidelity visibility into what it costs to ship a feature
    ๏ Understand what it costs to operate a feature for a customer
    ๏ Tune packaging and pricing according to gross margin
    ๏ Smarter engineering investments:
      ๏ Focus on value add and core competencies
      ๏ Informed efficiency efforts
    ๏ Insight into previously hidden layers of data
      ๏ CAC
  25. Recap: The Payoff
    ๏ Variable month over month costs down ≈50% from peak
    ๏ Financial leverage: 90% RI coverage for sustained duration compute yielding 30% savings
    ๏ Platform leverage and optimization:
      ๏ Containers produce 50% utilization improvement and 60% savings
      ๏ 31% savings and 15% throughput improvement by acting on platform improvement
      ๏ €40,000 monthly cost avoidance leveraging ML-driven use of spot
    ๏ Improved and simplified security
    ๏ Reduced operational complexity
    ๏ Sophisticated packaging, pricing and funnel metrics
  26. Recap: Not all Roses
    ๏ Rate of change is staggering
    ๏ Use caution with spend commitments
    ๏ Spot is its own special snowflake
    ๏ Manage elasticity carefully
    ๏ Aurora IOPS spend anomaly
    ๏ i3 kernel panics
    ๏ Vendor lock-in