Site Reliability in the Serverless Age

Are SRE and serverless a match made in heaven? Just what is this serverless thing anyway, and what does it mean for building reliable systems? To answer this, let's explore SRE principles and map them to their serverless counterparts. Along the way, we'll make a few predictions about our serverless future and introduce a new concept: cost as a first-class operational metric, and the arrival of a new tribe called FinDevOps.

Erik Peterson

September 24, 2018


  1. Site Reliability in the Serverless Age
    Erik Peterson, CEO & Founder, CloudZero
    [email protected] | @silvexis
    DevOpsDays Boston | 9/23/2018
  2. About Me
    Erik Peterson – [email protected], @silvexis
    • CEO and Founder of CloudZero
    • I’m recovering from the application security industry, now 100% focused on Cloud and Serverless
    • Have been building systems on AWS since 2008
    • Previously:
      • Veracode
      • HP, SPI Dynamics, Sanctum
      • United Nations IAEA, US Department of State, SunTrust, Moody’s Investors
    • Fun fact: I’ve lived in 6 US states and 3 countries
  3. WHAT IS SERVERLESS?
    • Event driven
    • Invisible infrastructure
    • Automatically scales with usage
    • Fault tolerance and high availability built in
    • Never pay for idle
  4. Serverless is not just Functions as a Service
    But FaaS is one of its most important building blocks
  5. Serverless is a Spectrum (AWS edition)
    [Figure: AWS services placed on an axis marked 0%, 50%, and 100%, running from Less Serverless to More Serverless. Services shown: EC2, RDS, Redshift, ElastiCache, Elasticsearch, Aurora (RDS), ECS, ECS (Fargate), Kinesis, Aurora (Serverless), DynamoDB, API Gateway, Step Functions, SQS, SNS, S3, Lambda, EFS]
  6. Two forces exist today that drive reliability
    • DevOps (culture)
      • Eliminate Dev and Ops silos
      • Accept failure as normal
      • MTTR is more important than MTBF
      • Driven to achieve the fastest feature velocity
      • Measure everything
    • Site Reliability Engineering (practice)
      • Availability
      • Latency
      • Performance
      • Efficiency
      • Change management
      • Monitoring
      • Emergency response
      • Capacity planning
      • Provisioning
  7. Serverless effect on DevOps
    DevOps principles:
    • Eliminate organizational silos
    • Accept failure as normal
    • MTTR is more important than MTBF
    • Optimize for feature velocity
    • Measure everything
    • Cost effective systems are well built systems
    Annotations on the slide (pairing with the principles is not preserved in the transcript): REQUIRED; COST IS A 1ST CLASS METRIC; CHANGE NOW HAPPENS FASTER THAN YOU CAN KEEP UP WITH; RELIGION; MEASURE WHAT’S IMPORTANT; REQUIRED; NEW
  8. Serverless effect on Site Reliability Engineering SLO + SLA MANAGEMENT

  9. Deeper Dive on the hard stuff SLO + SLA MANAGEMENT

  10. SLA + SLO Management
    • Do your cloud provider’s SLAs support your SLO or SLA?
    • Understand the aggregate availability statistics for your service dependencies and plan accordingly
    [Diagram labels: CloudWatch Logs; No SLA; (99.99% EC2); No SLA; 99.9% (S3 SLA); No SLA; 99.99% (DynamoDB SLA)]
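The aggregate availability mentioned on this slide can be estimated with a short calculation. The sketch below assumes every dependency sits in series (a request fails if any one dependency fails) and that failures are independent, so the availabilities multiply. The three figures are the SLA percentages quoted on the slide; the function names are illustrative, and dependencies with no SLA are left out because there is no published number to plug in.

```python
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def aggregate_availability(availabilities):
    """Availability of a chain of dependencies that must all be up.

    Assumes serial dependencies and independent failures.
    """
    return prod(availabilities)


def expected_downtime_minutes(availability, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime implied by an availability figure over a period."""
    return (1.0 - availability) * period_minutes


# SLA figures quoted on the slide (EC2, S3, DynamoDB), as fractions.
slide_slas = {"EC2": 0.9999, "S3": 0.999, "DynamoDB": 0.9999}

combined = aggregate_availability(slide_slas.values())
print(f"Combined availability: {combined:.4%}")
print(f"Implied downtime per 30 days: {expected_downtime_minutes(combined):.1f} minutes")
```

Under these assumptions the combined figure is lower than the weakest individual SLA, which is why an SLO cannot simply be copied from any single provider number.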
  11. AWS Lambda/FaaS cost is just one small part of a Serverless system
    System Costs Per Day
    [Diagram labels: CloudWatch Logs; $1.79; $15; $0.89; $789!!!; $12]
    Lambda Cost: $1.79
    Total Cost: $818.68
    Track your complete system cost
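To make the point of this slide concrete, the per-day figures can be summed and the Lambda share computed. This is a minimal sketch: the dollar amounts come from the slide, but the transcript does not preserve which service each non-Lambda amount belongs to, so the keys other than `lambda` are placeholder names.

```python
from decimal import Decimal

# Daily line items from the slide. Only the Lambda figure is labeled there;
# the other keys are placeholder names, not the services in the original diagram.
daily_costs = {
    "lambda": Decimal("1.79"),
    "component_b": Decimal("15"),
    "component_c": Decimal("0.89"),
    "component_d": Decimal("789"),
    "component_e": Decimal("12"),
}

total = sum(daily_costs.values(), Decimal("0"))
lambda_share = daily_costs["lambda"] / total

print(f"Total cost per day: ${total}")
print(f"Lambda share of total: {lambda_share:.2%}")
```

The sum matches the $818.68 total on the slide, and the Lambda line comes out well under 1% of it.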
  12. Observability vs. Monitoring
    • Observability is a measure of how well the state of a system can be determined from the analysis of its outputs.
    • Serverless systems are easy to observe but very hard to analyze, and as a result have low observability
    • Observability, however, is the key to measuring what’s important. You are going to have to figure this one out
    [Diagram labels: System; Output; Analysis; Observability]
    [Callouts: "Something you need to enable or build into your systems"; "If you aren’t doing this part, you are just doing monitoring"]
  13. Serverless Capacity Planning
    Know thy limits
    • Scaling is built in, but serverless systems have limits and constraints.
    • You will hit them once you are in prod and under heavy customer load…on a Friday…at 6pm
    • It can be very, very hard to figure out when the limits are being hit in a large system with many moving parts. Here are just a few examples:
    AWS Lambda examples:
    • Maximum number of concurrent executions per AWS account (1000, changeable)
    • Immediate Concurrency Increase (500 or more per min, depends on region, fixed)
    API Gateway examples:
    • Integration timeout (29 sec max, fixed)
    • Max payload size (10 MB, fixed)
    Serverless invocation limits examples:
    • S3 will asynchronously call Lambda
    • Lambda polls DynamoDB Streams only once per second, per shard
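One way to see a limit coming before Friday at 6pm is to estimate concurrency from traffic. The sketch below uses Little's law (in-flight requests equal arrival rate multiplied by average duration) and compares the estimate against two of the limits quoted on the slide. The function and field names are hypothetical, the traffic figures are made up, and the limit values are the 2018 numbers from the slide rather than anything looked up for a specific account.

```python
from dataclasses import dataclass

# Defaults quoted on the slide (2018); check current quotas for your account.
LAMBDA_CONCURRENCY_LIMIT = 1000      # per AWS account, changeable
API_GATEWAY_TIMEOUT_SECONDS = 29     # integration timeout, fixed


@dataclass
class LimitCheck:
    estimated_concurrency: float
    concurrency_headroom: float
    exceeds_concurrency: bool
    exceeds_gateway_timeout: bool


def check_limits(requests_per_second, avg_duration_seconds,
                 concurrency_limit=LAMBDA_CONCURRENCY_LIMIT):
    # Little's law: in-flight requests = arrival rate x time in system.
    estimated = requests_per_second * avg_duration_seconds
    return LimitCheck(
        estimated_concurrency=estimated,
        concurrency_headroom=concurrency_limit - estimated,
        exceeds_concurrency=estimated > concurrency_limit,
        exceeds_gateway_timeout=avg_duration_seconds > API_GATEWAY_TIMEOUT_SECONDS,
    )


# Made-up traffic levels: the first fits under the limit, the second does not.
print(check_limits(requests_per_second=200, avg_duration_seconds=2.0))
print(check_limits(requests_per_second=600, avg_duration_seconds=2.0))
```

This only covers steady-state averages; bursts are governed separately by the immediate concurrency increase listed above.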
  14. Thinking about Cost and Architecture
    • Let’s come back to this chart for a second
    [Diagram labels: CloudWatch Logs; $1.79; $15; $0.89; $789; $12]
    This is already a big problem
    Question: Could it be worse?
  15. Thinking about Cost and Architecture
    • Yep, it can be worse
    [Diagram: a pipeline with steps numbered 1 to 7. Labels, in transcript order: Writes 100 files; Invoked 100x; 1000 records written; Invoked 100x; Invoked 3x per transaction; Writes 1000 files]
    What happens at step 7?
    Hint: It’s both costly and catastrophic
  16. Thinking about Cost and Architecture
    • What if you are only responsible for a small part of that system?
    • What if your part is in a separate AWS Account or from a 3rd party?
    • How will you know the effect you will have on the whole?
    • Who’s responsible?
    • Who or what is going to solve this challenge?
    [Diagram label: 7]
  17. Prediction – A New DevOps Tribe: FinDevOps
    • First suggested by Simon Wardley @ ServerlessConf 2018
    • This new tribe:
      • Is a combination of development and financial practices
      • Believes that cost is a first-class operational metric
      • Understands the tight correlation between cost and well-architected Serverless systems
      • Knows intimately what the cost of a user or a transaction is
      • Tracks dependencies and data flows like their lives depended on it
    Source: https://twitter.com/swardley/status/1024107922203111424
    These people are starting to emerge now; expect job descriptions looking for these skills to appear in 2019
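Knowing the cost of a transaction comes down to dividing the complete system cost for a period by the number of transactions in that period. A minimal sketch follows; the $818.68 daily total is taken from the earlier cost slide, while the transaction count and the function name are invented purely for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def cost_per_transaction(total_cost, transactions):
    """Unit cost for a period; raises if there were no transactions."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return Decimal(total_cost) / Decimal(transactions)


# $818.68 is the daily total from the earlier cost slide.
# The transaction count is a made-up figure for illustration only.
daily_total = Decimal("818.68")
hypothetical_transactions = 250_000

unit_cost = cost_per_transaction(daily_total, hypothetical_transactions)
print(unit_cost.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP))
```

Dividing the whole-system total, rather than the Lambda line alone, is what keeps the unit cost honest.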
  18. Thank You!
    [email protected] | @silvexis
    Come visit our booth to discuss the future of Serverless DevOps and learn more about CloudZero