
SRE in the Cloud

"The future of the cloud changes the role of the SRE. In a large company where you are deploying your service on infrastructure built/managed in-house the SRE has the home field advantage of understanding the intricacies of that infrastructure. With more and more startups launching in the cloud which are maintained by the vendor, the local SREs role offers different challenges. Rich has launched several businesses on AWS and he will talk about his journey towards incorporating reliability into products and ensuring the development team had access to the information needed to improve their services. He will share the highlights of what he’s learned about Amazon’s Web Services and what it took for him to make it work for his companies."

Presented at SREcon14 in Santa Clara, May 30, 2014.

Also available from https://richadams.me/talks/srecon14/

(Note: I didn't choose the title :p)

Rich Adams

May 30, 2014


  1. Formalities
     • Hi, I'm Rich! o/
     • I'm a systems engineer at Gracenote.
     • (I write server applications, and manage the infrastructure for those applications on AWS.)
     • I'm British, sorry for the accent*.
     • Be gentle, this is my first ever talk!
     • (Don't worry, I'll provide an email address for hate mail towards the end.)
     * not really.
  2. Why bother?
     • “Free” reliability and automation!
     • Low upfront cost.
     • Low operating cost.
     • Faster to get up and running than on metal.
     • Pay as you go, no minimum contracts, etc.
     • Easier to scale than metal.
     • Easier to learn than physical hardware (one vendor vs. many).
     • On-demand capacity and elasticity.
     Perfect for startups!
  3. Changing Roles
     SREs in a physical environment have the advantage:
     • They know the physical hardware.
     • They understand the intricacies of the entire infrastructure.
     The cloud is maintained by the vendor:
     • It abstracts away the physical hardware.
     • How do you get reliability when you don't control the hardware?
  4. What Changes?
     • You need to re-engineer parts of your application.
     • Producing reliable applications in the cloud is different from doing so on physical hardware.
     • You don't have access to the physical infrastructure.
     • You need to build for scalability/elasticity.
     • You get some reliability for free; the rest you need to architect your way around.
  5. Wait, Free Reliability?
     • e.g. Relational Database Service (RDS) on AWS.
     • Automatic backups.
     • Automatic cross-data-center (Availability Zone) redundancy.
     • Lots of things handled for you:
       • Patches.
       • Replication.
       • Read replicas.
       • Failover.
     Awesome, our jobs are now obsolete.
  6. Everything Isn't Free
     • Redundancy of application servers is something you need to do yourself.
     • Load balancers need configuring (as does DNS).
     • Auto scaling might be automatic, but someone still has to configure it.
     • At a basic level, you can just copy a server into another Availability Zone, then point your load balancer at it (see the sketch below).
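
A minimal sketch of that "point your load balancer at it" step, using boto3 (the current AWS SDK for Python, which postdates this talk). The load balancer name, region, zones, and instance ID are placeholders, not values from the talk:

```python
import boto3

# Classic ELB client; all names/IDs below are illustrative placeholders.
elb = boto3.client("elb", region_name="us-west-2")

# Register the copied server (running in a second Availability Zone)
# with the existing load balancer so it starts receiving traffic.
elb.register_instances_with_load_balancer(
    LoadBalancerName="app-elb",
    Instances=[{"InstanceId": "i-0123456789abcdef0"}],
)

# Make sure the ELB is actually serving both Availability Zones.
elb.enable_availability_zones_for_load_balancer(
    LoadBalancerName="app-elb",
    AvailabilityZones=["us-west-2a", "us-west-2b"],
)
```
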
  7. [Architecture diagram: Route 53 (DNS) points domain.tld at an Elastic Load Balancer, which fronts Application servers in two Availability Zones within one Region, each inside its own Security Group; an RDS Master with an RDS Slave provides the database, with DB backups stored in an S3 Bucket.]
  8. [The same architecture diagram, shown again as the baseline for the failure scenario.]
  9. [The same diagram with one Availability Zone failing: "Uh oh!"]
  10. [The same diagram after traffic fails over to the healthy Availability Zone: "No problem!"]
  11. Embrace Failure
     • Faults don't have to be a problem, if you handle them.
     • Isolate errors within their component(s).
     • Each component should be able to fail without taking down the entire service.
     • Don't let fatal application errors become fatal service errors.
     • Fail in a consistent, known way.
  12. Netflix Chaos Monkey
     The Netflix Simian Army is available on GitHub: https://github.com/Netflix/SimianArmy
     "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient."
     http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
  13. Time to Think Differently
     • Servers are ephemeral.
     • You no longer care about individual servers. Now you care about the service as a whole.
     • Servers will fail. It shouldn't matter. If a server suddenly disappears, you don't care.
     • Recovery, deployment, failover, etc. should all be automated as best they can be.
     • Package updates, OS updates, etc. need to be managed by "something", whether it's a Bash script, or Chef/Puppet, etc.
  14. Time to Think Differently
     • Monitor the service as a whole, not individual servers.
     • Alerts become notifications. If you've set up everything correctly, your health check should automatically destroy bad instances and spawn new ones (see the sketch below). There's (usually) no action to take when getting an “alert”.
     • Proactive instead of reactive monitoring.
     • To get the benefits, you'll need to re-architect your application. This has some prerequisites...
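
One way to get the "bad instances get destroyed and replaced" behaviour described above is to have an Auto Scaling group trust the load balancer's health check. A hedged boto3 sketch; the group name, grace period, and region are assumptions:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Tell the Auto Scaling group to use the ELB health check rather than the
# basic EC2 status check, so an instance that fails the application-level
# health check gets terminated and replaced automatically.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="app-asg",      # placeholder group name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,          # give new instances time to boot
)
```
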
  15. Centralized Logging
     • Can't log to local files anymore; you have to log somewhere else.
     • Admin tools to view logs need to be remade/refactored.
     • SSHing in to grep logs becomes infeasible at scale.
     • You can use a third party for this!
     • You can archive logs in an S3 bucket and pass them to Glacier after x days (see the sketch below).
     • You can't log directly to S3; there's no append ability (yet).
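
The "archive logs in S3, pass to Glacier after x days" point maps onto an S3 lifecycle rule. A sketch with boto3; the bucket name, prefix, and day counts are examples, not the speaker's values:

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under logs/ to Glacier after 30 days, expire after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-archive",                     # placeholder bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```
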
  16. Dynamic Configuration
     • Like Puppet, but for application configuration.
     • Previously, infrastructure was static and the environment was known, so this didn't matter. Now it's dynamic, so we needed to account for that.
     • Things can scale at any time, so application configuration needs to be updatable.
     • The application polls for config changes every so often. Config can be updated on the fly (current memcached nodes, etc.) either manually or programmatically (see the sketch below).
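
A minimal sketch of the polling pattern (not the speaker's actual implementation), with the config stored as a JSON document in S3; the bucket, key, and polling interval are assumptions:

```python
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-app-config", "prod/app.json"   # placeholder location

def fetch_config():
    """Read the current application config (e.g. memcached node list) from S3."""
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    return json.loads(body)

config = fetch_config()
while True:
    time.sleep(60)                               # poll "every so often"
    latest = fetch_config()
    if latest != config:
        config = latest
        # Re-point connection pools, cache clients, etc. at the new values here.
        print("config updated:", config)
```
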
  17. No Temporary Files
     • Can't store any temporary files on local storage; files need to move directly to where they need to be.
     • For uploads, you can use pre-signed URLs to go direct to S3 (see the sketch below).
     • Or, add an item to an asynchronous queue to be processed by a consumer.
     • Temporary state on a local server becomes a bad idea in the cloud (or in any distributed application).
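
A sketch of the pre-signed-URL idea using boto3; the bucket, key, and expiry are illustrative. The client then PUTs the file straight to S3, so it never lands on an application server's disk:

```python
import boto3

s3 = boto3.client("s3")

# Hand this URL to the client; it can upload directly to S3 with an
# HTTP PUT for the next five minutes, bypassing the application server.
upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "user-uploads", "Key": "incoming/avatar-123.png"},
    ExpiresIn=300,
)
print(upload_url)
```
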
  18. Global Session Storage
     • Can't store sessions locally and rely on persistent load balancer connections.
     • Have to store session state in a global space instead.
     • A database works just fine for this.
  19. No SSH? Are You Mad?!?!
     • I don't mean disabling sshd. That would be crazy.
     • Disable it at the firewall level to prevent devs from cheating: “Oh, I'll just SSH in and fix this one issue.” instead of “I should make sure this fix is automated.”
     • But what if I need to debug?! Just re-enable port 22 and you're good to go. It's a few clicks, or 3 seconds of typing (see the sketch below).
     • At scale, you simply can't SSH in to fix a problem. Get out of the habit early. It makes things go smoother later.
     Top Tip: Every time you have a manual action, automate it for next time!
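
The "3 seconds of typing" to re-open port 22 could look like this with boto3 (the security group ID and source CIDR are placeholders); revoking the same rule closes it again when you're done:

```python
import boto3

ec2 = boto3.client("ec2")

SSH_RULE = dict(
    GroupId="sg-0123456789abcdef0",                    # placeholder security group
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],   # your workstation only
    }],
)

ec2.authorize_security_group_ingress(**SSH_RULE)   # temporarily allow SSH
# ... debug the instance ...
ec2.revoke_security_group_ingress(**SSH_RULE)      # close it again afterwards
```
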
  20. [The architecture diagram again, with question marks in place of the application servers: how does a brand-new instance know what it should become?]
  21. Bootstrapping
     • On boot, identify region/application/etc. and store the info on the filesystem for later use (I store it in /env); see the sketch below.
     • Don't forget to update the bootstrap scripts as the first step, so you can change them without having to make a new image every time.
     • You want fast bootstrapping! Don't start from a fresh OS every time; create a base image that has most of the things you need, then work from that.
     • You can use Puppet/Chef to configure, but pre-configure a base instance first, then save an image from that.
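
A sketch of the "identify region/etc. on boot" step using the EC2 instance metadata service. This is the older IMDSv1 style (newer setups require an IMDSv2 session token); the /env path comes from the slide, but the file name is an assumption:

```python
import json
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data"

def md(path):
    """Fetch one value from the instance metadata service."""
    return urllib.request.urlopen(f"{METADATA}/{path}", timeout=2).read().decode()

az = md("placement/availability-zone")
info = {
    "instance_id": md("instance-id"),
    "availability_zone": az,
    "region": az[:-1],            # e.g. "us-west-2a" -> "us-west-2"
}

# Persist for later use by deploy scripts, logging, etc.
with open("/env/instance.json", "w") as f:
    json.dump(info, f)
```
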
  22. Deployment
     • We used to push code to known servers; now each server pulls its config/code on boot instead.
     • Deployment scripts were refactored to not care about individual servers, but to use the AWS API to find the active servers (see the sketch below).
     • How does a server know which version to deploy? Or which environment it's in? It uses AWS tags!
     • Old code versions can easily be deployed if needed, for rollback.
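
A sketch of "use the AWS API to find active servers" with boto3; the tag names (Environment, Role) and region are examples, not necessarily what the speaker used:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Find every running application server for an environment by tag,
# instead of keeping a hardcoded host list in the deploy script.
resp = ec2.describe_instances(Filters=[
    {"Name": "tag:Environment", "Values": ["prod"]},
    {"Name": "tag:Role", "Values": ["app"]},
    {"Name": "instance-state-name", "Values": ["running"]},
])

targets = [
    instance["PrivateIpAddress"]
    for reservation in resp["Reservations"]
    for instance in reservation["Instances"]
]
print("deploying to:", targets)
```
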
  23. [The architecture diagram again: application servers running in both Availability Zones behind the Elastic Load Balancer.]
  24. [The same diagram with question marks in place of the application servers: freshly launched instances that haven't configured themselves yet.]
  25. [The same diagram once those instances have bootstrapped and pulled their code: they are serving as application servers again.]
  26. [The same diagram: the service is back at full strength.]
  27. Monitoring Changes
     • Automate your security auditing. Current intrusion detection tools may not detect AWS-specific changes.
     • Create an IAM account with the built-in "Security Audit" policy.
     • https://s3.amazonaws.com/reinvent2013-sec402/SecConfig.py *
       • This script will go over your account, creating a canonical representation of its security configuration.
       • Set up a cron job to do this every so often and compare to the previous run. Trigger an alert for review if changes are detected (see the sketch below).
     • CloudTrail keeps full audit logs of all changes made from the web console or API.
     • Store the logs in an S3 bucket with versioning, so no one can modify your logs without you seeing.
     * From "Intrusion Detection in the Cloud", http://awsmedia.s3.amazonaws.com/SEC402.pdf
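
The linked SecConfig.py does the real work; as a much smaller illustration of the snapshot-and-compare idea (not that script), each cron run could hash a canonical dump of, say, the account's security groups and alert on change. The state file path is a placeholder:

```python
import hashlib
import json
import boto3

STATE_FILE = "/var/lib/secaudit/last.sha256"   # placeholder path

ec2 = boto3.client("ec2")

# Canonical (sorted) JSON dump of the current security group configuration.
snapshot = json.dumps(ec2.describe_security_groups()["SecurityGroups"],
                      sort_keys=True, default=str)
digest = hashlib.sha256(snapshot.encode()).hexdigest()

try:
    previous = open(STATE_FILE).read().strip()
except FileNotFoundError:
    previous = None

if previous and previous != digest:
    # In practice: publish to SNS / open a ticket for review.
    print("security configuration changed since last run")

with open(STATE_FILE, "w") as f:
    f.write(digest)
```
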
  28. Controlling Access
     • Everyone gets an IAM account. Never log in to the master account.
     • You may be used to an "Operations Account" which you share with your entire team. Do not do that with AWS/the cloud.
     • Everyone gets their own account, with just the permissions they need (the least-privilege principle); see the sketch below.
     • An IAM user can control everything in the infrastructure, so there's no need to use the master account.
     • Enable multi-factor authentication for the master and IAM accounts.
     • You could give one user the MFA token and another the password; any action on the master account then requires two users to agree. Overkill for my case, but someone may want to use that technique.
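
A hedged sketch of "everyone gets their own account, with just the permissions they need": create an IAM user and attach a narrow inline policy. The user name, bucket ARN, and actions are placeholders chosen for illustration:

```python
import json
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="alice")                 # one account per person

# Inline policy granting only what this user needs (least privilege):
# read-only access to a single bucket, in this example.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::app-assets", "arn:aws:s3:::app-assets/*"],
    }],
}
iam.put_user_policy(UserName="alice",
                    PolicyName="app-assets-read-only",
                    PolicyDocument=json.dumps(policy))
```
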
  29. No Hardcoded Credentials
     • If your app has credentials baked into it, you're "doing it wrong".
     • Use IAM Roles:
       • Create a role and specify its permissions.
       • When creating an instance, specify the role it should use.
       • Whenever you use the AWS SDK, it will automatically retrieve temporary credentials with the access level specified in the role (see the sketch below).
     • It's all handled transparently for developers/operations.
     • The application never needs to know the credentials; the infrastructure manages it all for you.
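
The "SDK retrieves temporary credentials" point shows up in how little the application has to do. A sketch assuming the instance was launched with an IAM role that permits this S3 write; the bucket and key are placeholders:

```python
import boto3

# No access keys in code, config files, or environment variables.
# On an instance launched with an IAM role, boto3 falls through its
# credential chain to the instance profile and fetches short-lived
# credentials from the metadata service, refreshing them as needed.
s3 = boto3.client("s3")

s3.put_object(Bucket="app-data", Key="health/ok.txt", Body=b"ok")
```
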
  30. Tools, Tools, and More Tools
     • You can write scripts using the AWS CLI tools.
     • You can use the Web Console. Useful for viewing graphs on CloudWatch, etc.
     • CloudFormation lets you write your infrastructure in JSON and create stacks that can be deployed over and over (bonus: keep your infrastructure in version control!); see the sketch below.
     • OpsWorks uses Chef recipes; it's just point-and-click and does most of the work for you.
       • DB layer, load balancer layer, cache layer, etc.
       • Schedule periods of higher support.
       • Scale based on latency or other factors, instead of just time-based.
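
A minimal, hypothetical CloudFormation example: the JSON template (here just a single S3 bucket) lives in version control, and the same stack can be created over and over with boto3 or the CLI:

```python
import json
import boto3

# Tiny illustrative template; real stacks would describe the DB layer,
# load balancer layer, cache layer, and so on.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "LogBucket": {
            "Type": "AWS::S3::Bucket",
            # Placeholder name; S3 bucket names must be globally unique.
            "Properties": {"BucketName": "example-stack-logs"},
        }
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="example-stack",
                 TemplateBody=json.dumps(template))
```
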
  31. Redundancy is Required
     • You absolutely must spread yourself out over multiple physical locations to have a reliable service.
     • Unlike metal environments, it's just a few clicks, rather than a trip to another city to rack some servers.
     • For AWS, this means always deploying into multiple Availability Zones (AZs).
     • Use an Elastic Load Balancer (ELB) as the service endpoint. Add servers to the ELB pool; an ELB can see all AZs in a region.
     • For multiple regions, you need to use DNS (round robin, etc.).
  32. [Architecture diagram: Route 53 (DNS) points domain.tld at an Elastic Load Balancer, which fronts Auto Scaling Groups of Application servers spread across Availability Zones 1 through N, each inside its own Security Group; an RDS Master and RDS Slave provide the database, with DB backups stored in an S3 Bucket.]
  33. Redundasize* Critical Processes
     Problem: We deployed directly from GitHub. When GitHub is down, or there's too much latency to github.com, we can't scale. Oops.
     Solution: We now have a local clone of the GitHub repos that we pull from instead; GitHub is the backup if that clone goes down. Git is distributed; we should probably have made use of that.
     * possibly a made-up word.
  34. Make Your Logs Useful
     Problem: Aggregated logs didn't contain any info on the server/region. There was no way to tell from the logs which region/AZ was having a problem. Oops.
     Solution: Now we store extra metadata with each log line (see the sketch below):
     • Region
     • Availability Zone
     • Instance ID
     • Environment (stage/prod/test/demo, etc.)
     • Request ID
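
A sketch of attaching that metadata to every log line with Python's standard logging module; in practice the values would come from the bootstrap info written to /env at boot, and everything shown here is a placeholder:

```python
import logging
import uuid

# In practice these come from the /env data written at boot time.
CONTEXT = {
    "region": "us-west-2",
    "az": "us-west-2a",
    "instance_id": "i-0123456789abcdef0",
    "env": "prod",
}

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s region=%(region)s az=%(az)s "
           "instance=%(instance_id)s env=%(env)s request=%(request_id)s "
           "%(message)s",
)

def request_logger():
    """Logger that stamps every line with the instance context and a request ID."""
    extra = dict(CONTEXT, request_id=str(uuid.uuid4()))
    return logging.LoggerAdapter(logging.getLogger("app"), extra)

log = request_logger()
log.info("payment processed")
```
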
  35. Cope with Failure at All Levels
     Problem: Deployment scripts didn't account for a server being replaced mid-deployment, which would stall deployments completely. Oops.
     Solution: Check server state throughout the process and move on if it's been killed. Make sure you can cope with failure not just in your infrastructure, but in any scripts or tools which you use to manage that infrastructure.
  36. Use a Private Network
     Problem: We didn't use VPC, so using internal IPs was painful. We just used the external public IPs instead. It works, but it's much more difficult to secure and manage. Oops.
     Solution: Migrate to VPC. Migrating after the fact was difficult; use it from the start and save yourself the pain. VPC lets you have egress firewall rules, change things on the fly, specify network ACLs, etc. New accounts have no choice, so this may be moot.
  37. Be Aware of Cloud Limitations
     Problem: AWS has pre-defined service limits. These are not clearly displayed unless you know where to look*. The first time you'll see the error is when you try to perform an action which you can no longer perform. Oops.
     Solution: Be aware of the built-in limits so you can request an increase ahead of time, before you start putting things to use in production (see the sketch below). Other limits include things like the scalability of ELBs: if you're expecting heavy traffic, you need to pre-warm your ELBs by injecting traffic beforehand, or contact AWS to pre-warm them for you (preferred). You want to learn this lesson before you get the critical traffic!
     * http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
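
At least one of those limits can be checked programmatically; a hedged sketch that reads the account's classic per-region EC2 instance limit (reported via the max-instances account attribute on older accounts; newer accounts use vCPU-based quotas instead):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# "max-instances" is the classic per-region On-Demand instance limit.
attrs = ec2.describe_account_attributes(AttributeNames=["max-instances"])
limit = attrs["AccountAttributes"][0]["AttributeValues"][0]["AttributeValue"]
print("On-Demand instance limit for this region:", limit)
```
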
  38. Worth It?
     • We can now handle growth in a very organic fashion.
     • No actionable alert in... well... I can't remember.
     • When things go wrong, instances kill themselves and we get a fresh instance with a known-good configuration.
     • Deployments are not as dangerous; we can deploy many times a day and roll back easily, so they've become routine instead of "OK, everyone stop what you're doing, we're going to deploy something".
  39. Totally Worth It
     • Much lower cost than before.
     • Spinning up a new application/environment used to take days. Now it takes ~15 minutes.
     • More freedom to prototype and play with changes.
     • It's easy to spin up a new region/environment for a few hours to play with settings, with minimal cost and completely isolated from your current environment. That's something you can't do with metal unless you already have the hardware prepared and ready.
     • Developers can have their own personal prod clone to develop with, which means no surprises when moving to production.
  40. Useful Resources
     • https://cloud.google.com/developers/#articles - Google Cloud whitepapers and best-practice guides.
     • http://www.rackspace.co.uk/whitepapers - Rackspace whitepapers and guides.
     • http://azure.microsoft.com/blog - Microsoft Azure blog.
     • http://aws.typepad.com/ - AWS blog.
     • http://www.youtube.com/user/AmazonWebServices - Lots of AWS training videos, etc.
  41. More Useful Resources
     • http://www.slideshare.net/AmazonWebServices - All slides and presentations from the AWS conferences. Lots of useful training stuff for free.
     • http://netflix.github.com - Netflix are the masters of AWS. They have lots of open-source stuff to share.
     • http://aws.amazon.com/whitepapers - Oh so many papers from AWS, on everything from security best practices to financial-services grid computing in the cloud.
  42. Tooting My Own Horn
     Read more about my AWS mishaps: wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started
     Do I suck at presenting? Send your hate mail to [email protected]!
     Say hi on Twitter! @r_adams