Site Reliability in the Serverless Age

Erik Peterson
September 24, 2018

Are SRE and serverless a match made in heaven? Just what is this serverless thing anyway, and what does it mean for building reliable systems? To answer this, let's explore SRE principles and map them to their serverless counterparts. Along the way, we'll make a few predictions about our serverless future and introduce a new concept: cost as a first-class operational metric, and the arrival of a new tribe called FinDevOps.

Transcript

  1. Site Reliability in the Serverless Age
    Erik Peterson
    CEO & Founder
    CloudZero
    [email protected] | @silvexis
    DevOpsDays Boston | 9/23/2018

  2. About Me
    Erik Peterson – [email protected], @silvexis
    • CEO and Founder of CloudZero
    • I’m recovering from the application security industry, now 100% focused on Cloud and Serverless
    • Have been building systems on AWS since 2008
    • Previously
      • Veracode
      • HP, SPI Dynamics, Sanctum
      • United Nations IAEA, US Department of State, SunTrust, Moody’s Investors
    • Fun fact: I’ve lived in 6 US states and 3 countries

  3. What is Serverless?
    What is Reliability?
    How does Serverless affect Reliability?
    The Future

  4. WHAT IS SERVERLESS?
    • Event driven
    • Invisible infrastructure
    • Automatically scales with usage
    • Fault tolerance and high availability built in
    • Never pay for idle

  5. [Image-only slide]

  6. Serverless is not just Functions as a Service
    But FaaS is one of its most important building blocks

  7. Serverless is a Spectrum (AWS edition)
    Less Serverless (0%) ← 50% → More Serverless (100%)
    EC2
    RDS
    Redshift
    ElastiCache
    Elasticsearch
    Aurora(RDS)
    ECS
    ECS (Fargate)
    Kinesis
    Aurora (Serverless)
    DynamoDB
    API Gateway
    Step Functions
    SQS
    SNS
    S3
    Lambda
    EFS

  8. Werner Vogels
    CTO Amazon Web Services

  9. So what does reliability even mean?

  10. Reliability is the trustworthiness of a system’s ability to delight the customer

  11. Two forces exist today that drive reliability
    • DevOps (culture)
      • Eliminate Dev and Ops silos
      • Accept failure as normal
      • MTTR is more important than MTBF
      • Driven to achieve the fastest feature velocity
      • Measure everything
    • Site Reliability Engineering (practice)
      • Availability
      • Latency
      • Performance
      • Efficiency
      • Change management
      • Monitoring
      • Emergency response
      • Capacity planning
      • Provisioning

  12. How does Serverless affect these forces?
    Hint: Change is coming

  13. Serverless effect on DevOps
    REQUIRED
    COST IS A 1ST CLASS METRIC
    CHANGE NOW HAPPENS FASTER THAN YOU CAN KEEP UP WITH
    RELIGION
    MEASURE WHAT’S IMPORTANT
    REQUIRED
    Eliminate organizational silos
    Accept failure as normal
    MTTR is more important than MTBF
    Optimize for feature velocity
    Measure everything
    Cost effective systems are well built systems
    NEW

  14. Serverless effect on Site Reliability Engineering
    SLO + SLA MANAGEMENT
    AUTOMATION & AUTO SCALING
    OBSERVABILITY
    SERVICE LIMIT PLANNING
    COST
    CHANGE TRACKING
    AVAILABILITY
    LATENCY
    PERFORMANCE
    EFFICIENCY
    CHANGE MANAGEMENT
    MONITORING
    EMERGENCY RESPONSE
    CAPACITY PLANNING
    PROVISIONING

  15. Deeper Dive on the hard stuff
    SLO + SLA MANAGEMENT
    COST
    OBSERVABILITY
    SERVICE LIMITS

  16. SLA + SLO Management
    • Do your cloud provider’s SLAs support your SLO or SLA?
    • Understand the aggregate availability statistics for your service
    dependencies and plan accordingly
    CloudWatch Logs
    No SLA
    (99.99% EC2)
    No SLA
    99.9% (S3 SLA)
    No SLA
    99.99% (DynamoDB SLA)
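
A minimal sketch of the "aggregate availability" idea, assuming every dependency sits on the critical path and fails independently, so the published availabilities multiply. The three figures are the ones quoted on the slide; the number of dependencies without an SLA is an assumption read off the slide labels, and the function names are illustrative.

```python
from math import prod

# Figures quoted on the slide. None marks a dependency with no published SLA.
# The count of None entries is an assumption, not a fact from the talk.
DEPENDENCY_SLAS = [0.9999, 0.999, 0.9999, None, None]


def aggregate_availability(slas):
    """Return (best-case availability, count of dependencies with no SLA).

    Assumes all dependencies are in series and fail independently,
    so the availabilities of the known dependencies multiply.
    """
    known = [s for s in slas if s is not None]
    return prod(known), len(slas) - len(known)


def downtime_minutes_per_month(availability, days=30):
    """Convert an availability fraction into allowed downtime per month."""
    return (1 - availability) * days * 24 * 60


if __name__ == "__main__":
    availability, unknown = aggregate_availability(DEPENDENCY_SLAS)
    print(f"best case: {availability:.4%}, dependencies with no SLA: {unknown}")
    print(f"downtime budget: {downtime_minutes_per_month(availability):.1f} min/month")
```

Under these assumptions the best case is roughly 99.88%, already below the weakest single SLA in the chain, and the dependencies with no SLA can only pull the real figure lower.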

  17. AWS Lambda/FaaS cost is just one small part of a Serverless system
    CloudWatch Logs
    $1.79
    $15
    $0.89
    $789!!!
    $12
    Lambda Cost: $1.79
    Total Cost: $818.68
    System Costs Per Day
    Track your complete system cost
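
A small sketch of tracking complete system cost rather than Lambda cost alone. The five daily figures are the ones on the slide; only the Lambda figure is labeled there, so the other component names (`service_b` through `service_e`) are hypothetical placeholders.

```python
from decimal import Decimal

# Daily costs from the slide. Only "lambda" is named in the source;
# the remaining keys are hypothetical placeholders for the other components.
DAILY_COSTS = {
    "lambda": Decimal("1.79"),
    "service_b": Decimal("15"),
    "service_c": Decimal("0.89"),
    "service_d": Decimal("789"),
    "service_e": Decimal("12"),
}


def cost_breakdown(costs):
    """Return the total cost and each component's share of that total."""
    total = sum(costs.values(), Decimal("0"))
    shares = {name: cost / total for name, cost in costs.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = cost_breakdown(DAILY_COSTS)
    print(f"total per day: ${total}")
    # Print components from most to least expensive.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:>10}: {share:.2%}")
```

The total matches the slide's $818.68, and Lambda accounts for well under 1% of it, which is the point of the slide.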

  18. Observability vs. Monitoring
    • Observability is a measure of how well the state of a system can be
    determined from the analysis of its outputs.
    • Serverless systems are easy to observe but very hard to analyze and, as a result, have low observability
    • Observability, however, is the key to measuring what’s important; you are going to have to figure this one out
    System → Output → Analysis = Observability
    • Output: something you need to enable or build into your systems
    • Analysis: if you aren’t doing this part, you are just doing monitoring
  19. FYI: This isn’t Analysis

  20. Serverless Capacity Planning
    Know thy limits
    • Scaling is built in, but Serverless systems have limits and constraints.
    • You will hit them once you are in prod and under heavy customer load…on a Friday…at 6pm
    • It can be very hard to figure out when the limits are being hit in a large system with many moving parts. Here are just a few examples:
    AWS Lambda
    Examples:
    • Maximum number of concurrent executions per AWS account (1000, changeable)
    • Immediate Concurrency Increase (500 or more per min, depends on region, fixed)
    API Gateway
    Examples:
    • Integration timeout (29 sec max, fixed)
    • Max Payload size (10 MB, fixed)
    Serverless Invocation Limits
    Examples:
    • S3 will asynchronously call Lambda
    • Lambda polls DynamoDB Streams only once per second, per shard
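
One way to "know thy limits" before Friday at 6pm is a back-of-the-envelope concurrency estimate. The sketch below assumes steady traffic, so required concurrency is roughly request rate times average duration (Little's law). The 1000 and 500 figures come from the slide; treating the 500 as a flat per-minute ramp from current concurrency is a simplification, and the function names and traffic numbers are hypothetical.

```python
import math

ACCOUNT_CONCURRENCY_LIMIT = 1000   # from the slide (changeable)
CONCURRENCY_RAMP_PER_MIN = 500     # from the slide; simplified to a flat ramp


def required_concurrency(requests_per_sec, avg_duration_sec):
    """Estimate concurrent executions via Little's law: rate x duration."""
    return math.ceil(requests_per_sec * avg_duration_sec)


def minutes_to_reach(current, target, ramp_per_min=CONCURRENCY_RAMP_PER_MIN):
    """Minutes of ramp needed to grow from current to target concurrency."""
    if target <= current:
        return 0
    return math.ceil((target - current) / ramp_per_min)


def check_capacity(requests_per_sec, avg_duration_sec,
                   limit=ACCOUNT_CONCURRENCY_LIMIT):
    """Return (needed concurrency, True if it fits under the account limit)."""
    needed = required_concurrency(requests_per_sec, avg_duration_sec)
    return needed, needed <= limit


if __name__ == "__main__":
    # Hypothetical load: 400 requests/sec at 3 seconds each.
    needed, fits = check_capacity(400, 3.0)
    print(f"needed: {needed}, fits under limit: {fits}")
    print(f"ramp time from 200: {minutes_to_reach(200, needed)} min")
```

Note that the API Gateway integration timeout caps the average duration a synchronous request can have, so the two limits interact.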

  21. [Image-only slide]

  22. Thinking about Cost and Architecture
    • Let’s come back to this chart for a second
    CloudWatch Logs
    $1.79
    $15
    $0.89
    $789
    $12
    This is already a big problem
    Question: Could it be worse?

  23. Thinking about Cost and Architecture
    • Yep, it can be worse
    [Diagram: a pipeline with steps numbered 1 to 7. Labels: Writes 100 files; Invoked 100x; 1000 records written; Invoked 100x; Invoked 3x per transaction; Writes 1000 files]
    What happens at step 7?
    Hint: It’s both costly and catastrophic
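
A rough sketch of why chained events get expensive: each stage multiplies the work of the stage before it. The exact wiring of the diagram is not recoverable from the transcript, so the stage factors below are illustrative only, loosely borrowing numbers from the slide labels, and `cumulative_fanout` is a hypothetical helper.

```python
from itertools import accumulate
from operator import mul


def cumulative_fanout(factors):
    """Return the running product of per-stage fan-out factors.

    Element i is the number of downstream events produced by one
    initial trigger after stage i has run.
    """
    return list(accumulate(factors, mul))


if __name__ == "__main__":
    # Illustrative factors: 100 files written, each triggering 100
    # invocations, each of which is invoked 3x per transaction.
    stages = [100, 100, 3]
    for step, events in enumerate(cumulative_fanout(stages), start=1):
        print(f"after stage {step}: {events} events per initial trigger")
```

Because the growth is multiplicative, one extra stage late in the chain can push both cost and concurrency past the limits discussed earlier.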

  24. Thinking about Cost and Architecture
    • What if you are only responsible for a small part of that system?
    • What if your part is in a separate AWS Account or from a 3rd party?
    • How will you know the effect you will have on the whole?
    • Who’s responsible?
    • Who or what is going to solve this challenge?

  25. Prediction – A New DevOps Tribe: FinDevOps
    • First suggested by Simon Wardley at ServerlessConf 2018
    • This new tribe:
      • Is a combination of development and financial practices
      • Believes that cost is a first-class operational metric
      • Understands the tight correlation between cost and well-architected Serverless systems
      • Knows intimately what the cost of a user or a transaction is
      • Tracks dependencies and data flows like their life depended on it
    Source: https://twitter.com/swardley/status/1024107922203111424
    These people are starting to emerge now; expect job descriptions looking for these skills to appear in 2019

  26. Thank You!
    [email protected]
    @silvexis
    Come visit our booth to discuss the future of
    Serverless DevOps and learn more about
    CloudZero
