Site Reliability in the Serverless Age

Erik Peterson, CEO & Founder, CloudZero
[email protected] | @silvexis
DevOpsDays Boston | 9/23/2018

About Me
Erik Peterson – [email protected], @silvexis
• CEO and Founder of CloudZero
• I’m recovering from the application security industry, now 100% focused on Cloud and Serverless
• Have been building systems on AWS since 2008
• Previously:
  • Veracode
  • HP, SPI Dynamics, Sanctum
  • United Nations IAEA, US Department of State, SunTrust, Moody’s Investors
• Fun fact: I’ve lived in 6 US states and 3 countries

• What is Serverless?
• What is Reliability?
• How does Serverless affect Reliability?
• The Future

WHAT IS SERVERLESS?
• Event driven
• Invisible infrastructure
• Automatically scales with usage
• Fault tolerance and high availability built in
• Never pay for idle

Serverless is not just Functions as a Service, but FaaS is one of its most important building blocks

Serverless is a Spectrum (AWS edition)
[Figure: AWS services placed on an axis running from 0% (Less Serverless) through 50% to 100% (More Serverless). Services shown: EC2, RDS, Redshift, ElastiCache, Elasticsearch, Aurora (RDS), ECS, ECS (Fargate), Kinesis, Aurora (Serverless), DynamoDB, API Gateway, Step Functions, SQS, SNS, S3, Lambda, EFS]

Werner Vogels CTO Amazon Web Services

So what does reliability even mean?

Reliability is the trustworthiness of a system’s ability to delight
the customer

Two forces exist today that drive reliability
• DevOps (culture)
  • Eliminate Dev and Ops silos
  • Accept failure as normal
  • MTTR is more important than MTBF
  • Driven to achieve the fastest feature velocity
  • Measure everything
• Site Reliability Engineering (practice)
  • Availability
  • Latency
  • Performance
  • Efficiency
  • Change management
  • Monitoring
  • Emergency response
  • Capacity planning
  • Provisioning

How does Serverless affect these forces? Hint: Change is coming

Serverless effect on DevOps
DevOps principles:
• Eliminate organizational silos
• Accept failure as normal
• MTTR is more important than MTBF
• Optimize for feature velocity
• Measure everything
• Cost effective systems are well built systems (NEW)
Annotations on the slide:
• REQUIRED
• COST IS A 1ST CLASS METRIC
• CHANGE NOW HAPPENS FASTER THAN YOU CAN KEEP UP WITH
• RELIGION
• MEASURE WHAT’S IMPORTANT
• REQUIRED

Serverless effect on Site Reliability Engineering
SRE practices:
• AVAILABILITY
• LATENCY
• PERFORMANCE
• EFFICIENCY
• CHANGE MANAGEMENT
• MONITORING
• EMERGENCY RESPONSE
• CAPACITY PLANNING
• PROVISIONING
Annotations on the slide:
• SLO + SLA MANAGEMENT
• AUTOMATION & AUTO SCALING
• OBSERVABILITY
• SERVICE LIMIT PLANNING
• COST
• CHANGE TRACKING

Deeper Dive on the hard stuff
• SLO + SLA MANAGEMENT
• COST
• OBSERVABILITY
• SERVICE LIMITS

SLA + SLO Management
• Do your cloud provider’s SLAs support your SLO or SLA?
• Understand the aggregate availability statistics for your service dependencies and plan accordingly
[Diagram labels: CloudWatch Logs; No SLA (99.99% EC2); No SLA; 99.9% (S3 SLA); No SLA; 99.99% (DynamoDB SLA)]
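One way to reason about aggregate availability is to multiply the availability of every dependency a request needs. The sketch below assumes the dependencies sit in series and fail independently, which is a simplification; the function name is illustrative, and only the three SLA percentages come from the slide. Components with no SLA are left out, so the result is an optimistic upper bound.

```python
from math import prod


def aggregate_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes serial dependencies that fail independently of each other.
    """
    return prod(availabilities)


# SLA figures quoted on the slide. Components with no SLA are omitted here,
# so the real-world number can only be lower than this estimate.
slas = {"EC2": 0.9999, "S3": 0.999, "DynamoDB": 0.9999}

combined = aggregate_availability(slas.values())
print(f"Best-case aggregate availability: {combined:.4%}")
```

Under these assumptions the combined figure lands at roughly 99.88%, which is already below the weakest individual SLA (99.9%) before the components without any SLA are considered.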

AWS Lambda/FaaS cost is just one small part of a Serverless system
System Costs Per Day
[Diagram: per-service daily costs of $1.79, $15, $0.89, $789!!! and $12; one of the services shown is CloudWatch Logs]
• Lambda Cost: $1.79
• Total Cost: $818.68
Track your complete system cost
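The arithmetic behind the slide can be checked in a few lines. This is a minimal sketch using the five daily figures shown; only the Lambda figure is labeled in the transcript, so the other costs are kept anonymous rather than attributed to specific services, and the function name is hypothetical.

```python
# Daily cost figures from the slide. Only the Lambda cost is labeled in the
# transcript, so the remaining entries are deliberately left unnamed.
LAMBDA_COST = 1.79
OTHER_COSTS = [15.00, 0.89, 789.00, 12.00]


def lambda_share(lambda_cost, other_costs):
    """Return (total system cost, fraction of that total spent on Lambda)."""
    total = lambda_cost + sum(other_costs)
    return total, lambda_cost / total


total, share = lambda_share(LAMBDA_COST, OTHER_COSTS)
print(f"Total: ${total:.2f}/day, Lambda share: {share:.2%}")
```

The components add up to the slide's $818.68 total, and Lambda accounts for well under 1% of it, which is the point of tracking the complete system cost.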

Observability vs. Monitoring
• Observability is a measure of how well the state of a system can be determined from the analysis of its outputs.
• Serverless systems are easy to observe but very hard to analyze and, as a result, have low observability.
• Observability, however, is the key to measuring what’s important. You are going to have to figure this one out.
[Diagram: System → Output → Analysis → Observability]
• Output: something you need to enable or build into your systems
• Analysis: if you aren’t doing this part, you are just doing monitoring

FYI: This isn’t Analysis

Serverless Capacity Planning
Know thy limits
• Scaling is built in, but Serverless systems have limits and constraints.
• You will hit them once you are in prod and under heavy customer load…on a Friday…at 6pm
• It can be very, very hard to figure out when the limits are being hit in a large system with many moving parts.
Here are just a few examples:
AWS Lambda examples:
• Maximum number of concurrent executions per AWS account (1000, changeable)
• Immediate concurrency increase (500 or more per min, depends on region, fixed)
API Gateway examples:
• Integration timeout (29 sec max, fixed)
• Max payload size (10 MB, fixed)
Serverless invocation limits examples:
• S3 will asynchronously call Lambda
• Lambda polls DynamoDB Streams only once per second, per shard
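A rough way to see how close a workload is to the concurrency limit is to estimate concurrent executions as request rate times average duration. The sketch below is an approximation under steady load, not a measurement; the function names and the traffic numbers are hypothetical, and only the 1000 default limit comes from the slide.

```python
import math

# Default per-account concurrent execution limit cited on the slide (changeable).
ACCOUNT_CONCURRENCY_LIMIT = 1000


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate concurrent executions under steady load: rate x duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def headroom(requests_per_second, avg_duration_seconds,
             limit=ACCOUNT_CONCURRENCY_LIMIT):
    """Remaining concurrency before the limit; negative means throttling risk."""
    return limit - estimated_concurrency(requests_per_second, avg_duration_seconds)


# Hypothetical workloads: 400 req/s and 500 req/s at 2.5 s per invocation.
print(headroom(400, 2.5))
print(headroom(500, 2.5))
```

A result at or below zero suggests the account limit would be reached at that load. Every function in the account shares the same pool, so the estimate should cover the whole system rather than a single function.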

Thinking about Cost and Architecture
• Let’s come back to this chart for a second
[Diagram: per-service daily costs of $1.79, $15, $0.89, $789 and $12; one of the services shown is CloudWatch Logs]
This is already a big problem
Question: Could it be worse?

Thinking about Cost and Architecture
• Yep, it can be worse
[Diagram: a seven-step pipeline (steps 1 to 7) annotated with: Writes 100 files; Invoked 100x; 1000 records written; Invoked 100x; Invoked 3x per transaction; Writes 1000 files]
What happens at step 7?
Hint: It’s both costly and catastrophic
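To illustrate why fan-out becomes both costly and catastrophic, the sketch below multiplies per-stage fan-out factors along a chain. The factors are hypothetical and only loosely echo the slide's labels; the actual wiring between the seven steps is not recoverable from the transcript, so this is not a model of that system.

```python
from itertools import accumulate
from operator import mul


def invocations_per_stage(fan_out_factors, initial_events=1):
    """Cumulative number of events reaching each stage of a fan-out chain."""
    return [initial_events * n for n in accumulate(fan_out_factors, mul)]


# Hypothetical chain: one event fans out 100x, then 3x, then 1000x.
factors = [100, 3, 1000]
print(invocations_per_stage(factors))
```

Each stage looks modest on its own, but the last stage in this made-up chain receives 300,000 events for every single event that entered the first one.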

Thinking about Cost and Architecture
• What if you are only responsible for a small part of that system?
• What if your part is in a separate AWS Account or from a 3rd party?
• How will you know the effect you will have on the whole?
• Who’s responsible?
• Who or what is going to solve this challenge?

Prediction – A New DevOps Tribe: FinDevOps
• First suggested by Simon Wardley @ ServerlessConf 2018
• This new tribe:
  • Is a combination of development and financial practices
  • Believes that cost is a first-class operational metric
  • Understands the tight correlation between cost and well-architected Serverless systems
  • Knows intimately what the cost of a user or a transaction is
  • Tracks dependencies and data flows like their life depended on it
Source: https://twitter.com/swardley/status/1024107922203111424
These people are starting to emerge now; expect job descriptions looking for these skills to appear in 2019
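Knowing the cost of a transaction can start as simple unit economics: total system cost divided by transaction volume. In this minimal sketch the daily cost is the $818.68 total from the earlier cost slide, while the transaction volume and the function name are hypothetical.

```python
def cost_per_transaction(total_cost, transactions):
    """Average cost of one transaction over the measured period."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_cost / transactions


daily_cost = 818.68             # total daily system cost from the earlier slide
daily_transactions = 1_000_000  # hypothetical volume, not a figure from the talk

unit_cost = cost_per_transaction(daily_cost, daily_transactions)
print(f"Cost per transaction: ${unit_cost:.6f}")
```

Tracked over time, a rising unit cost at flat volume is a signal that an architectural change, rather than growth, is driving spend.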

Thank You!
[email protected] | @silvexis
Come visit our booth to discuss the future of Serverless DevOps and learn more about CloudZero
