Site Reliability in the Serverless Age

Slide 1

Slide 1 text

Site Reliability in the Serverless Age
Erik Peterson, CEO & Founder, CloudZero
@silvexis
DevOpsDays Boston | 9/23/2018

Slide 2

Slide 2 text

About Me
Erik Peterson, @silvexis
• CEO and Founder of CloudZero
• I’m recovering from the application security industry, now 100% focused on Cloud and Serverless
• Have been building systems on AWS since 2008
• Previously:
  • Veracode
  • HP, SPI Dynamics, Sanctum
  • United Nations IAEA, US Department of State, SunTrust, Moody’s Investors
• Fun fact: I’ve lived in 6 US states and 3 countries

Slide 3

Slide 3 text

• What is Serverless?
• What is Reliability?
• How does Serverless affect Reliability?
• The Future

Slide 4

Slide 4 text

WHAT IS SERVERLESS?
• Event driven
• Invisible infrastructure
• Automatically scales with usage
• Fault tolerance and high availability built in
• Never pay for idle

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Serverless is not just Functions as a Service (FaaS), but FaaS is one of its most important building blocks

Slide 7

Slide 7 text

Serverless is a Spectrum (AWS edition)
[Figure: AWS services placed on a scale from 0% (Less Serverless) to 100% (More Serverless). Services shown: EC2, RDS, Redshift, ElastiCache, Elasticsearch, Aurora (RDS), ECS, ECS (Fargate), Kinesis, Aurora (Serverless), DynamoDB, API Gateway, Step Functions, SQS, SNS, S3, Lambda, EFS]

Slide 8

Slide 8 text

Werner Vogels CTO Amazon Web Services

Slide 9

Slide 9 text

So what does reliability even mean?

Slide 10

Slide 10 text

Reliability is the trustworthiness of a system’s ability to delight the customer

Slide 11

Slide 11 text

Two forces exist today that drive reliability
• DevOps (culture)
  • Eliminate Dev and Ops silos
  • Accept failure as normal
  • MTTR is more important than MTBF
  • Driven to achieve the fastest feature velocity
  • Measure everything
• Site Reliability Engineering (practice)
  • Availability
  • Latency
  • Performance
  • Efficiency
  • Change management
  • Monitoring
  • Emergency response
  • Capacity planning
  • Provisioning

Slide 12

Slide 12 text

How does Serverless affect these forces? Hint: Change is coming

Slide 13

Slide 13 text

Serverless effect on DevOps
DevOps principles:
• Eliminate organizational silos
• Accept failure as normal
• MTTR is more important than MTBF
• Optimize for feature velocity
• Measure everything
• Cost-effective systems are well-built systems (NEW)
Slide annotations on these principles:
• REQUIRED
• COST IS A 1ST CLASS METRIC
• CHANGE NOW HAPPENS FASTER THAN YOU CAN KEEP UP WITH
• RELIGION
• MEASURE WHAT’S IMPORTANT
• REQUIRED

Slide 14

Slide 14 text

Serverless effect on Site Reliability Engineering
SRE practice areas:
• AVAILABILITY
• LATENCY
• PERFORMANCE
• EFFICIENCY
• CHANGE MANAGEMENT
• MONITORING
• EMERGENCY RESPONSE
• CAPACITY PLANNING
• PROVISIONING
With Serverless:
• SLO + SLA MANAGEMENT
• AUTOMATION & AUTO SCALING
• OBSERVABILITY
• SERVICE LIMIT PLANNING
• COST
• CHANGE TRACKING

Slide 15

Slide 15 text

Deeper Dive on the hard stuff
• SLO + SLA MANAGEMENT
• COST
• OBSERVABILITY
• SERVICE LIMITS

Slide 16

Slide 16 text

SLA + SLO Management
• Do your cloud provider’s SLAs support your SLO or SLA?
• Understand the aggregate availability statistics for your service dependencies and plan accordingly
[Diagram: service dependencies labeled with their SLAs: CloudWatch Logs; No SLA (99.99% EC2); No SLA; 99.9% (S3 SLA); No SLA; 99.99% (DynamoDB SLA)]
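One way to reason about aggregate availability is to multiply the availability of every dependency a request must pass through. The sketch below is a rough illustration rather than a provider formula: it assumes the dependencies fail independently and sit in series, it uses only the three SLA figures quoted on this slide, and it leaves out the services marked "No SLA" because no number is given for them. The function names are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a path that needs every dependency to be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


def downtime_minutes(availability, days=30):
    """Expected unavailable minutes over a period of `days` days."""
    return (1 - availability) * days * 24 * 60


# SLA figures as quoted on the slide; services with no SLA are omitted.
slas = {"EC2": 0.9999, "S3": 0.999, "DynamoDB": 0.9999}

combined = serial_availability(slas.values())
print(f"Combined availability: {combined:.4%}")
print(f"Allowed downtime per 30 days: {downtime_minutes(combined):.1f} min")
```

The combined figure is always lower than the weakest single SLA, so an SLO stricter than that number cannot be backed by these SLAs alone.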

Slide 17

Slide 17 text

AWS Lambda/FaaS cost is just one small part of a Serverless system
System Costs Per Day
[Diagram: per-service daily costs, with labels: CloudWatch Logs, $1.79, $15, $0.89, $789!!!, $12]
Lambda Cost: $1.79
Total Cost: $818.68
Track your complete system cost
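The arithmetic behind the slide can be checked with a few lines. This is a minimal sketch using the five daily figures shown; only the Lambda figure is named in the slide text, so the other keys (`service_b` to `service_e`) are placeholders, not real service names.

```python
# Daily cost figures from the slide. Only "lambda" is named in the slide
# text; the remaining keys are placeholder labels.
daily_costs = {
    "lambda": 1.79,
    "service_b": 15.00,
    "service_c": 0.89,
    "service_d": 789.00,
    "service_e": 12.00,
}

total = round(sum(daily_costs.values()), 2)
lambda_share = daily_costs["lambda"] / total

print(f"Total cost per day: ${total:.2f}")
print(f"Lambda share of total: {lambda_share:.2%}")

# Rank components so the dominant cost driver is obvious.
for name, cost in sorted(daily_costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>10}: ${cost:>7.2f} ({cost / total:.1%})")
```

Tracking only the Lambda line item would cover well under 1% of what this system costs per day.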

Slide 18

Slide 18 text

Observability vs. Monitoring
• Observability is a measure of how well the state of a system can be determined from the analysis of its outputs.
• Serverless systems are easy to observe but very hard to analyze and, as a result, have low observability.
• Observability, however, is the key to measuring what’s important; you are going to have to figure this one out.
[Diagram: System Output, Analysis, Observability]
• System Output: something you need to enable or build into your systems
• Analysis: if you aren’t doing this part, you are just doing monitoring
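As one possible way to "build output into your systems", a function can emit structured log lines that carry a shared correlation ID, so that later analysis can stitch together events from different components. This is only a sketch: the helper name `log_event`, the field names, and the example values are assumptions made for illustration, not part of the talk.

```python
import json
import time


def log_event(event_name, correlation_id, **fields):
    """Emit one JSON log line and return it.

    The correlation_id lets a later analysis step join events that belong
    to the same transaction across functions and services.
    """
    record = {
        "ts": time.time(),
        "event": event_name,
        "correlation_id": correlation_id,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


# Hypothetical usage inside two different functions of the same transaction.
first = log_event("file_written", "txn-123", bytes=2048)
second = log_event("record_stored", "txn-123", table="orders")
```

Emitting the lines is only the output half; the analysis half still requires something that reads these records and answers questions about system state.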

Slide 19

Slide 19 text

FYI: This isn’t Analysis

Slide 20

Slide 20 text

Serverless Capacity Planning: Know thy limits
• Scaling is built in, but Serverless systems have limits and constraints.
• You will hit them once you are in prod and under heavy customer load…on a Friday…at 6pm
• It can be very, very hard to figure out when the limits are being hit in a large system with many moving parts.
Here are just a few examples:
AWS Lambda examples:
• Maximum number of concurrent executions per AWS account (1000, changeable)
• Immediate Concurrency Increase (500 or more per min, depends on region, fixed)
API Gateway examples:
• Integration timeout (29 sec max, fixed)
• Max payload size (10 MB, fixed)
Serverless Invocation Limits examples:
• S3 will asynchronously call Lambda
• Lambda polls DynamoDB Streams only once per second, per shard
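A common back-of-the-envelope check against the concurrency limit is to estimate concurrent executions as request rate multiplied by average duration. The sketch below applies that estimate to the 1000-execution account limit from the slide; the traffic numbers and function names are hypothetical, and real workloads are burstier than an average suggests.

```python
import math


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate concurrent executions as rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def concurrency_headroom(requests_per_second, avg_duration_seconds,
                         account_limit=1000):
    """Remaining executions before the account limit; negative means throttling."""
    needed = estimated_concurrency(requests_per_second, avg_duration_seconds)
    return account_limit - needed


# Hypothetical load: 400 requests/s, each running for 3 seconds on average.
needed = estimated_concurrency(400, 3)
headroom = concurrency_headroom(400, 3)
print(f"Estimated concurrency: {needed}, headroom: {headroom}")
```

A negative headroom means the function would be throttled at that load unless the limit is raised or the duration is reduced.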

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Thinking about Cost and Architecture
• Let’s come back to this chart for a second
[Diagram: per-service daily costs, with labels: CloudWatch Logs, $1.79, $15, $0.89, $789, $12]
This is already a big problem
Question: Could it be worse?

Slide 23

Slide 23 text

Thinking about Cost and Architecture
• Yep, it can be worse
[Diagram: a pipeline with steps numbered 1 to 7 and the labels: Writes 100 files; Invoked 100x; 1000 records written; Invoked 100x; Invoked 3x per transaction; Writes 1000 files]
What happens at step 7?
Hint: It’s both costly and catastrophic
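The reason a chain like this gets expensive is that fan-out multiplies at every hop. The sketch below shows the mechanism only: the list of per-stage multipliers is an illustrative assumption, since the slide text does not say exactly which multiplier applies to which step, and `cumulative_fan_out` is a hypothetical helper.

```python
from itertools import accumulate
from operator import mul


def cumulative_fan_out(factors):
    """Events reaching each stage, given the fan-out factor of each hop.

    One triggering event enters the first stage; each stage multiplies the
    number of events it passes downstream by its factor.
    """
    return list(accumulate(factors, mul))


# Illustrative multipliers only; not a reconstruction of the slide's diagram.
factors = [100, 100, 3]
per_stage = cumulative_fan_out(factors)

for stage, events in enumerate(per_stage, start=1):
    print(f"after hop {stage}: {events} downstream events per original event")
```

With these assumed factors a single upstream event turns into tens of thousands of downstream invocations, which is how a late step in the pipeline can end up both costly and overloaded.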

Slide 24

Slide 24 text

Thinking about Cost and Architecture
• What if you are only responsible for a small part of that system?
• What if your part is in a separate AWS Account or from a 3rd party?
• How will you know the effect you will have on the whole?
• Who’s responsible?
• Who or what is going to solve this challenge?

Slide 25

Slide 25 text

Prediction – A New DevOps Tribe: FinDevOps
• First suggested by Simon Wardley @ ServerlessConf 2018
• This new Tribe is:
  • A combination of development and financial practices
  • Believes that cost is a first-class operational metric
  • Understands the tight correlation between cost and well-architected Serverless systems
  • Knows intimately what the cost of a user or a transaction is
  • Tracks dependencies and data flows like their life depended on it
Source: https://twitter.com/swardley/status/1024107922203111424
These people are starting to emerge now; expect job descriptions looking for these skills to appear in 2019

Slide 26

Slide 26 text

Thank You!
@silvexis
Come visit our booth to discuss the future of Serverless DevOps and learn more about CloudZero