SRE in the Cloud

"The future of the cloud changes the role of the SRE. In a large company where you are deploying your service on infrastructure built/managed in-house the SRE has the home field advantage of understanding the intricacies of that infrastructure. With more and more startups launching in the cloud which are maintained by the vendor, the local SREs role offers different challenges. Rich has launched several businesses on AWS and he will talk about his journey towards incorporating reliability into products and ensuring the development team had access to the information needed to improve their services. He will share the highlights of what he’s learned about Amazon’s Web Services and what it took for him to make it work for his companies."

Presented at SREcon14 in Santa Clara, May 30th, 2014.

Also available from https://richadams.me/talks/srecon14/

(Note: I didn't choose the title :p)

Rich Adams

May 30, 2014

Transcript

  1. SRE IN THE CLOUD
    Rich Adams
    SRECon14
    30th May, 2014

  2. Formalities

  3. (image-only slide)

  4. Formalities
    ● Hi, I'm Rich! o/
    ● I'm a systems engineer at Gracenote.
    ● (I write server applications, and manage the
    infrastructure for those applications on AWS).
    ● I'm British, sorry for the accent*.
    ● Be gentle, this is my first ever talk!
    ● (Don't worry, I'll provide an email address for hate mail
    towards the end).
    * not really.

  5. Let's Talk About The Cloud

  6. CLOUD

  7. (image-only slide)

  8. (image-only slide)

  9. Why bother?
    ● “Free” reliability and automation!
    ● Low upfront cost.
    ● Low operating cost.
    ● Faster to get up and running than on metal.
    ● Pay as you go, no minimum contracts, etc.
    ● Easier to scale than metal.
    ● Easier to learn than physical hardware (one vendor vs many).
    ● On-demand capacity and elasticity.
    Perfect for startups!

  10. Changing Roles
    SREs in a physical environment have the advantage:
    ● Know the physical hardware.
    ● Understand intricacies of entire infrastructure.
    The cloud is maintained by the vendor:
    ● Abstracts away physical hardware.
    ● How do you get reliability when you don't control the
    hardware?

  11. Just Move Servers to The Cloud!
    Right?

  12. (image-only slide)

  13. Moving to the cloud by copying your
    servers one-to-one won't work.
    I know, I tried.

  14. (Diagram: domain.tld pointing at an Application and a Database inside a Security Group, in a single Availability Zone within one Region.)

  15. What Changes?
    ● You need to re-engineer parts of your application.
    ● Producing reliable applications in the cloud is different than
    on physical hardware.
    ● Don't have access to physical infrastructure.
    ● Need to build for scalability/elasticity.
    ● Get some reliability for free, the rest you need to architect
    your way around.

  16. Wait, Free Reliability?
    ● e.g. Relational Database Service (RDS) on AWS.
    ● Automatic backups.
    ● Automatic cross data center (availability zone) redundancy.
    ● Lots of things handled for you:
    ● Patches.
    ● Replication.
    ● Read-replicas.
    ● Failover.
    Awesome, our jobs are now obsolete.
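
    A minimal sketch (not from the talk) of what that "free" reliability looks like
    when provisioning RDS with boto3; the identifier, instance class, and credentials
    below are placeholders:

      import boto3

      rds = boto3.client("rds", region_name="us-east-1")

      # Multi-AZ gives a cross-AZ standby with automatic failover;
      # BackupRetentionPeriod turns on automatic backups.
      rds.create_db_instance(
          DBInstanceIdentifier="example-db",      # placeholder name
          DBInstanceClass="db.m3.medium",
          Engine="mysql",
          AllocatedStorage=100,
          MasterUsername="admin",
          MasterUserPassword="change-me",         # never hardcode real credentials
          MultiAZ=True,
          BackupRetentionPeriod=7,                # keep automatic backups for 7 days
      )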

  17. (Diagram: the Application and RDS Master in one Availability Zone, an RDS Slave in a second Availability Zone, and DB backups in an S3 Bucket, all within one Region.)

  18. Everything Isn't Free
    ● Redundancy of application servers you need to do yourself.
    ● Load balancers need configuring (as does DNS).
    ● Auto-scaling might be automatic, but someone still has to
    configure it.
    ● At a basic level, you can just copy a server into another
    availability zone, then point your load balancer at it.

  19. (Diagram: Route 53 (DNS) pointing domain.tld at an Elastic Load Balancer in front of Application servers in two Availability Zones, each in its own Security Group, plus RDS Master/Slave and DB backups in an S3 Bucket, within one Region.)

  20. Cool, so we're done.
    Right?

  21. (image-only slide)

  22. Server Died. What Now?
    Physical Environment

  23. Server Died. What Now?
    Cloud Environment

  24. (Diagram: Route 53 (DNS) pointing domain.tld at an Elastic Load Balancer in front of Application servers in two Availability Zones, with RDS Master/Slave and DB backups, within one Region.)

  25. (Diagram: the same architecture with one Application server failing: "Uh oh!")

  26. (Diagram: the same architecture with the remaining Application server still serving traffic: "No problem!")

  27. Embrace Failure
    ● Faults don't have to be a problem, if you handle them.
    ● Isolate errors within their component(s).
    ● Each component should be able to fail, without taking
    down the entire service.
    ● Don't let fatal application errors become fatal service errors.
    ● Fail in a consistent, known way.

  28. Netflix Chaos Monkey
    The Netflix Simian Army is available on GitHub:
    https://github.com/Netflix/SimianArmy
    "We have found that the best defense against major
    unexpected failures is to fail often. By frequently
    causing failures, we force our services to be built in a
    way that is more resilient."
    http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

  29. Care about services,
    not servers.

  30. Time to Think Differently
    ● Servers are ephemeral.
    ● You no longer care about individual servers.
    ● Now you care about the service as a whole.
    ● Servers will fail. It shouldn't matter. If a server suddenly
    disappears, you don't care.
    ● Recovery, deployment, failover, etc. should all be
    automated as best they can be.
    ● Package updates, OS updates, etc. need to be managed
    by "something", whether it's a Bash script or Chef/Puppet.

  31. Time to Think Differently
    ● Monitor service as a whole, not individual servers.
    ● Alerts become notifications.
    ● If you've set up everything correctly, your health check
    should automatically destroy bad instances and spawn
    new ones. There's (usually) no action to take when getting
    an “alert”.
    ● Proactive instead of reactive monitoring.
    ● To get the benefits, you'll need to re-architect your
    application. This has some prerequisites...

  32. How to Not Care About Servers

  33. Centralized Logging
    ● Can't log to local files anymore, have to log somewhere else.
    ● Admin tools to view logs need to be remade/refactored.
    ● SSHing to grep logs becomes infeasible at scale.
    ● Can use a third-party for this!
    ● Can archive logs in S3 bucket, pass to Glacier after x days.
    ● Can't log direct to S3, no append ability (yet).
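
    A small sketch of the "archive in S3, pass to Glacier after x days" step, assuming
    boto3 and a placeholder bucket and prefix:

      import boto3

      s3 = boto3.client("s3")

      # Move archived logs to Glacier after 30 days, expire them after a year.
      s3.put_bucket_lifecycle_configuration(
          Bucket="example-log-archive",           # placeholder bucket
          LifecycleConfiguration={
              "Rules": [{
                  "ID": "archive-logs",
                  "Filter": {"Prefix": "logs/"},
                  "Status": "Enabled",
                  "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                  "Expiration": {"Days": 365},
              }]
          },
      )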

  34. (Diagram: Applications ship logs via rsyslog to a Log Server that writes to persistent Storage; a read-only Log Viewer sits on top of the Logging System.)

  35. Dynamic Configuration
    ● Like Puppet, but for application configuration.
    ● Previously, infrastructure was static and environment was
    known, so this didn't matter. Now it's dynamic, so we
    needed to account for that.
    ● Things can scale at any time, so application configuration
    needs to be updatable.
    ● Application polls for config changes every so often. Can
    update config on the fly (current memcached nodes, etc.)
    either manually or programmatically.
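
    A minimal sketch of that polling loop; the config endpoint, keys, and interval are
    hypothetical (in practice the store might be S3, DynamoDB, or a small internal
    service):

      import json
      import time
      import urllib.request

      CONFIG_URL = "http://config.internal.example/app.json"   # hypothetical endpoint
      POLL_INTERVAL = 60                                        # seconds

      current = {}

      def poll_config():
          """Fetch the config and apply any changes on the fly."""
          global current
          with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
              fresh = json.load(resp)
          if fresh != current:
              current = fresh
              # Re-point clients at the new memcached nodes, flip flags, etc.
              print("config updated:", current.get("memcached_nodes"))

      while True:
          try:
              poll_config()
          except Exception as exc:      # keep serving on stale config if a poll fails
              print("config poll failed:", exc)
          time.sleep(POLL_INTERVAL)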

  36. (Diagram: a Configuration Management UI with Configuration Validation writes config to persistent Storage; Applications poll the Configuration System for config changes.)

  37. No Temporary Files
    ● Can't store any temporary files in local storage, need to
    move files directly to where they need to be.
    ● For uploads, can use pre-signed URLs to go direct to S3.
    ● Or, add item to asynchronous queue to be processed by a
    consumer.
    ● Temporary state on a local server becomes a bad idea in the
    cloud (or any distributed application).
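
    A sketch of the pre-signed upload URL approach with boto3; the bucket and key are
    placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Hand this URL to the client; it can PUT the file straight to S3,
      # so the upload never touches the server's local (ephemeral) disk.
      url = s3.generate_presigned_url(
          ClientMethod="put_object",
          Params={"Bucket": "example-uploads", "Key": "incoming/report.csv"},
          ExpiresIn=900,                # valid for 15 minutes
      )
      print(url)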

  38. Global Session Storage
    ● Can't store sessions locally and rely on persistent load
    balancer connections.
    ● Have to store session state in a global space instead.
    ● Database works just fine for this.

  39. Controversial Opinion Ahead

  40. Disable SSH
    (Block Port 22)

  41. If you have to SSH into
    your servers, then your
    automation has failed.

  42. No SSH? Are You Mad?!?!
    ● I don't mean disabling sshd. That would be crazy.
    ● Disable at firewall level to prevent devs from cheating.
    ● “Oh, I'll just SSH in and fix this one issue.” instead of “I should
    make sure this fix is automated.”
    But what if I need to debug!
    ● Just re-enable port 22 and you're good to go. It's a few clicks, or
    3 seconds of typing.
    At scale, you simply can't SSH in to fix a problem. Get out of the
    habit early; it makes things go more smoothly later.
    Top Tip: Every time you have a manual action, automate it for next time!
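
    The "3 seconds of typing" to temporarily open (and then close) port 22 might look
    roughly like this with boto3; the security group ID and source address are
    placeholders:

      import boto3

      ec2 = boto3.client("ec2")

      SG_ID = "sg-0123456789abcdef0"    # placeholder security group
      MY_IP = "203.0.113.10/32"         # placeholder: your workstation, never 0.0.0.0/0

      # Temporarily allow SSH for debugging...
      ec2.authorize_security_group_ingress(
          GroupId=SG_ID, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=MY_IP)

      # ...and close it again when you're done.
      ec2.revoke_security_group_ingress(
          GroupId=SG_ID, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=MY_IP)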

  43. Servers can fail, so we're done.
    Right?

  44. (Diagram: the same architecture with question marks where the Application servers in each Availability Zone should be.)

  45. Need Self-Provisioning Servers

  46. Bootstrapping
    ● On boot, identify region/application/etc. Store info on
    filesystem for later use (I store in /env).
    ● Don't forget to update bootstrap scripts as the first step, so you
    can change them without having to make a new image every
    time.
    ● You want fast bootstrapping! Don't start from a fresh OS every
    time; create a base image that has most of the things you
    need, then work from that.
    ● Can use Puppet/Chef to configure, but pre-configure a
    base instance first, then save an image from that.
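
    A sketch of the "identify region/AZ/instance on boot" step using the EC2 instance
    metadata service (older IMDSv1-style calls); the /env layout follows the slide,
    everything else is an assumption:

      import os
      import urllib.request

      META = "http://169.254.169.254/latest/meta-data/"   # EC2 instance metadata

      def meta(path):
          with urllib.request.urlopen(META + path, timeout=2) as resp:
              return resp.read().decode()

      az = meta("placement/availability-zone")            # e.g. "us-east-1a"
      info = {
          "instance_id": meta("instance-id"),
          "availability_zone": az,
          "region": az[:-1],                              # strip the AZ letter
      }

      os.makedirs("/env", exist_ok=True)
      for key, value in info.items():
          with open(os.path.join("/env", key), "w") as f:
              f.write(value + "\n")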

  47. Deployment
    ● Used to push code to known servers, now each server needs
    to pull its config/code on boot instead.
    ● Deployment scripts refactored to not care about individual
    servers but to use AWS API to find active servers.
    ● How does server know which version to deploy? Or which
    environment it's in? Uses AWS tags!
    ● Can easily deploy old code versions if needed, for rollback.
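
    A sketch of using the AWS API (boto3 here) to find the active servers to deploy to;
    the tag names are examples, and pagination is ignored for brevity:

      import boto3

      ec2 = boto3.client("ec2")

      # Find running app servers for a given environment by tag.
      resp = ec2.describe_instances(Filters=[
          {"Name": "tag:Environment", "Values": ["production"]},
          {"Name": "tag:Role", "Values": ["app"]},
          {"Name": "instance-state-name", "Values": ["running"]},
      ])

      targets = [
          inst["PrivateIpAddress"]
          for reservation in resp["Reservations"]
          for inst in reservation["Instances"]
      ]
      print("deploying to:", targets)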

  48. (Diagram: the full architecture: Route 53, Elastic Load Balancer, Application servers in two Availability Zones, RDS Master/Slave, and DB backups, within one Region.)

  49. (Diagram: the same architecture with question marks in place of missing servers.)

  50. (Diagram: the architecture with Application servers present again in both Availability Zones.)

  51. (Diagram: the same restored architecture.)

  52. Reliability is Also About Security
    Insecure == Unreliable

  53. Monitoring Changes
    ● Automate your security auditing.
    ● Current intrusion detection tools may not detect AWS-specific changes.
    ● Create an IAM account with built-in "Security Audit" policy.
    ● https://s3.amazonaws.com/reinvent2013-sec402/SecConfig.py *
    ● This script will go over your account, creating a canonical
    representation of security configuration.
    ● Set up a cron job to do this every so often and compare to previous
    run. Trigger an alert for review if changes are detected.
    ● CloudTrail keeps full audit logs of all changes from web console or API.
    ● Store logs in S3 bucket with versioning so no one can modify your logs
    without you seeing.
    * From "Intrusion Detection in the Cloud", http://awsmedia.s3.amazonaws.com/SEC402.pdf

  54. Controlling Access
    ● Everyone gets an IAM account. Never log in to the master account.
    ● You may be used to using an "Operations Account", which you
    share with your entire team.
    ● Do not do that with AWS/Cloud. Everyone gets their own account,
    with just the permissions they need (least privilege principle).
    ● An IAM user can control everything in the infrastructure, so
    there's no need to use the master account.
    ● Enable multi-factor authentication for master and IAM accounts.
    ● Could give one user MFA token, another the password. Any action
    on master account then requires two users to agree. Overkill for
    my case, but someone may want to use that technique.

  55. No Hardcoded Credentials
    ● If your app has credentials baked into it, you're "doing it wrong".
    ● Use IAM Roles,
    ● Create role, specify permissions.
    ● When creating instance, specify role it should use.
    ● Whenever using AWS SDK, it will automatically retrieve
    temporary credentials with the access level specified in the
    role.
    ● All handled transparently to developers/operations.
    ● Application never needs to know the credentials,
    infrastructure manages it all for you.
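
    The point, in code: with an IAM role attached to the instance, no keys appear
    anywhere; boto3 finds and refreshes the temporary credentials itself (the bucket
    name is a placeholder):

      import boto3

      # No access keys in code, config files, or environment variables:
      # the SDK pulls temporary role credentials from the instance automatically.
      s3 = boto3.client("s3")

      for obj in s3.list_objects_v2(Bucket="example-bucket").get("Contents", []):
          print(obj["Key"])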

  56. Managing Your Infrastructure

  57. Tools, Tools, and More Tools
    ● Can write scripts using AWS CLI tools.
    ● Can use the Web Console.
    ● Useful for viewing graphs on CloudWatch, etc.
    ● CloudFormation lets you write your infrastructure in JSON, create stacks
    that can be deployed over and over. (Bonus: keep your infrastructure in
    version control!)
    ● OpsWorks uses Chef recipes; it's just point-and-click and does most of the
    work for you.
    ● DB layer, load balancer layer, cache layer, etc.
    ● Schedule periods of higher support.
    ● Scale based on latency or other factors, instead of just time-based.

  58. Scalable in a zone is not enough.
    You must use multiple zones!

  59. Redundancy is Required
    ● You absolutely must spread yourself out over multiple
    physical locations to have a reliable service.
    ● Unlike metal environments, it's just a few clicks, rather than
    a trip to another city to rack some servers.
    ● For AWS, this means always deploying into multiple
    Availability Zones (AZs).
    ● Use Elastic Load Balancer (ELB) as service endpoint.
    ● Add servers to ELB pool. ELB can see all AZs in a region.
    ● For multiple regions, need to use DNS (round robin, etc.).
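
    A sketch of the "add servers to the ELB pool" step using the classic ELB API that
    was current at the time; all names are placeholders:

      import boto3

      elb = boto3.client("elb")    # classic Elastic Load Balancing API

      # The ELB spans every AZ it's enabled in; register instances from each zone.
      elb.register_instances_with_load_balancer(
          LoadBalancerName="example-web-elb",
          Instances=[{"InstanceId": "i-0aaaaaaaaaaaaaaaa"},   # instance in AZ 1
                     {"InstanceId": "i-0bbbbbbbbbbbbbbbb"}],  # instance in AZ 2
      )

      # The health check decides when an instance is pulled out of rotation.
      elb.configure_health_check(
          LoadBalancerName="example-web-elb",
          HealthCheck={"Target": "HTTP:80/health", "Interval": 30, "Timeout": 5,
                       "UnhealthyThreshold": 2, "HealthyThreshold": 3},
      )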

  60. (Diagram: Route 53 and an Elastic Load Balancer in front of Auto Scaling Groups of Application servers across Availability Zones 1 through N, with RDS Master/Slave and DB backups in an S3 Bucket, within one Region.)

  61. It Just Works!
    Right?

  62. (image-only slide)

  63. GitHub Down? So Are We!

  64. (image-only slide)

  65. Redundasize* Critical Processes
    Problem
    We deployed direct from GitHub. When GitHub is down, or
    there's too much latency to github.com, we can't scale.
    Oops.
    Solution
    We now have a local clone of GitHub repos we pull from
    instead. GitHub is the backup if that clone goes down.
    Git is distributed, we should probably have made use of that.
    * possibly a made-up word.

  66. Which Server is the Log From?

  67. (image-only slide)

  68. Make Your Logs Useful
    Problem
    Aggregated logs didn't contain any info on the server/region. No idea which
    region/AZ is having a problem from the logs.
    Oops.
    Solution
    Now we store extra metadata with each log line.
    ● Region
    ● Availability Zone
    ● Instance ID
    ● Environment (stage/prod/test/demo, etc)
    ● Request ID
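
    A minimal sketch of stamping that metadata onto every log line with Python's
    logging module; the context values would really come from the metadata cached at
    boot, not hardcoded as they are here:

      import logging

      # Assumed helper data: in practice, read from the values cached at boot.
      CONTEXT = {"region": "us-east-1", "az": "us-east-1a",
                 "instance_id": "i-0aaaaaaaaaaaaaaaa", "env": "prod"}

      logging.basicConfig(
          format="%(asctime)s %(region)s %(az)s %(instance_id)s %(env)s %(message)s")

      log = logging.LoggerAdapter(logging.getLogger("app"), CONTEXT)
      log.warning("payment backend timed out")
      # -> ... us-east-1 us-east-1a i-0aaaaaaaaaaaaaaaa prod payment backend timed out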

  69. Server Dies During Deployment?
    Let's Just Stop Everything!

  70. (image-only slide)

  71. Cope with Failure at All Levels
    Problem
    Deployment scripts didn't account for server being replaced
    mid-deployment. Would stall deployments completely.
    Oops.
    Solution
    Check server state throughout the process and move on if it's been
    killed.
    Make sure you can cope with failure not just in your
    infrastructure, but in any scripts or tools which you use to
    manage that infrastructure.
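
    A sketch of that guard inside a deployment loop, using boto3 to check instance
    state; the deploy step and instance IDs are placeholders:

      import boto3

      ec2 = boto3.client("ec2")

      def still_running(instance_id):
          """True only if the instance is still in the 'running' state.
          (Recently replaced instances show up as shutting-down/terminated.)"""
          resp = ec2.describe_instances(InstanceIds=[instance_id])
          for reservation in resp["Reservations"]:
              for inst in reservation["Instances"]:
                  return inst["State"]["Name"] == "running"
          return False

      def deploy_to(instance_id):
          ...   # placeholder: pull code, restart services, health-check, etc.

      for instance_id in ["i-0aaaaaaaaaaaaaaaa", "i-0bbbbbbbbbbbbbbbb"]:
          if not still_running(instance_id):
              print("skipping", instance_id, "- replaced mid-deployment")
              continue
          deploy_to(instance_id)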

  72. Private Network?
    Nah, Let's Just Use Public IPs.

  73. (image-only slide)

  74. Use Private Network
    Problem
    Didn't use a VPC, so using internal IPs was painful. We just used the external
    public IPs instead. It works, but it's much more difficult to secure and
    manage.
    Oops.
    Solution
    Migrate to VPC. Migrating after the fact was difficult. Use it from the
    start and save yourself the pain.
    VPC lets you have egress firewall rules, change things on the fly,
    specify network ACLs, etc. New accounts have no choice, so this may
    be moot.

  75. Deploying a New Application?
    Sorry, You Hit Your Limit.

  76. (image-only slide)

  77. Be Aware of Cloud Limitations
    Problem
    AWS has pre-defined service limits. These are not clearly displayed unless
    you know where to look*. The first time you'll see an error is when you try
    to perform an action that exceeds a limit.
    Oops.
    Solution
    Be aware of the built-in limits so you can request increases ahead of time,
    before you start putting things to use in production.
    Other limits are things like scalability of ELBs. If you're expecting heavy traffic,
    you need to pre-warm your ELBs by injecting traffic beforehand. Or contact
    AWS to pre-warm them for you (preferred).
    You want to learn this lesson before you get the critical traffic!
    * http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
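
    A small sketch of checking a couple of the built-in limits ahead of time via the
    EC2 account attributes API:

      import boto3

      ec2 = boto3.client("ec2")

      # 'max-instances' and 'max-elastic-ips' are among the attributes EC2 exposes.
      resp = ec2.describe_account_attributes(
          AttributeNames=["max-instances", "max-elastic-ips"])

      for attr in resp["AccountAttributes"]:
          value = attr["AttributeValues"][0]["AttributeValue"]
          print(attr["AttributeName"], "=", value)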

  78. Was It All Worth It?
    (Hint: I'm Slightly Biased)

  79. Worth It?
    ● Can now handle growth in a very organic fashion.
    ● No actionable alert in... well... I can't remember.
    ● When things go wrong, instances kill themselves and we get
    a fresh instance with a known-good configuration.
    ● Deployments are not as dangerous, can deploy many times
    a day and rollback easily, so they've become routine instead
    of "OK, everyone stop what you're doing, we're going to
    deploy something".

  80. Totally Worth It
    ● Much lower cost than before.
    ● Spinning up a new application/environment used to take days.
    Now takes ~15 minutes.
    ● More freedom to prototype and play with changes.
    ● Easy to spin up a new region/environment for a few hours to
    play with settings (with minimal cost, and completely isolated
    from your current environment).
    ● Something you can't do with metal unless you already have
    the hardware prepared and ready.
    ● Developers can have their own personal prod clone to develop
    with, which means no surprises when moving to production.

  81. Useful Resources
    ● https://cloud.google.com/developers/#articles - Google
    Cloud Whitepapers and Best Practice Guides.
    ● http://www.rackspace.co.uk/whitepapers - Rackspace
    Whitepapers and Guides.
    ● http://azure.microsoft.com/blog - Microsoft Azure Blog.
    ● http://aws.typepad.com/ - AWS Blog.
    ● http://www.youtube.com/user/AmazonWebServices - Lots of
    AWS training videos, etc.

  82. More Useful Resources
    ● http://www.slideshare.net/AmazonWebServices - All slides
    and presentations from the AWS conferences. Lots of useful
    training stuff for free.
    ● http://netflix.github.com - Netflix are the masters of AWS.
    They have lots of open source stuff to share.
    ● http://aws.amazon.com/whitepapers - Oh so many papers
    from AWS on everything from security best practices, to
    financial services grid computing in the cloud.

  83. Tooting My Own Horn
    Read more about my AWS mishaps!
    wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started
    Do I suck at presenting?
    Send your hate mail to [email protected]!
    Say Hi on Twitter!
    @r_adams

  84. Thanks!
    richadams.me/talks/srecon14
