SRE in the Cloud

"The future of the cloud changes the role of the SRE. In a large company where you are deploying your service on infrastructure built/managed in-house the SRE has the home field advantage of understanding the intricacies of that infrastructure. With more and more startups launching in the cloud which are maintained by the vendor, the local SREs role offers different challenges. Rich has launched several businesses on AWS and he will talk about his journey towards incorporating reliability into products and ensuring the development team had access to the information needed to improve their services. He will share the highlights of what he’s learned about Amazon’s Web Services and what it took for him to make it work for his companies."

Presented at SREcon14 in Santa Clara, May 30th, 2014.

Also available from https://richadams.me/talks/srecon14/

(Note: I didn't choose the title :p)

Rich Adams

May 30, 2014

Transcript

  1. SRE IN THE CLOUD
    Rich Adams
    SRECon14
    30th May, 2014

  2. Formalities

  3. (image-only slide)

  4. Formalities
    ● Hi, I'm Rich! o/
    ● I'm a systems engineer at Gracenote.
    ● (I write server applications, and manage the
    infrastructure for those applications on AWS).
    ● I'm British, sorry for the accent*.
    ● Be gentle, this is my first ever talk!
    ● (Don't worry, I'll provide an email address for hate mail
    towards the end).
    * not really.

  5. Let's Talk About The Cloud

  6. CLOUD

  7. (image-only slide)

  8. (image-only slide)

  9. Why bother?
    ● “Free” reliability and automation!
    ● Low upfront cost.
    ● Low operating cost.
    ● Faster to get up and running than on metal.
    ● Pay as you go, no minimum contracts, etc.
    ● Easier to scale than metal.
    ● Easier to learn than physical hardware (one vendor vs many).
    ● On-demand capacity and elasticity.
    Perfect for startups!

  10. Changing Roles
    SREs in a physical environment have the advantage:
    ● Know the physical hardware.
    ● Understand intricacies of entire infrastructure.
    The cloud is maintained by the vendor:
    ● Abstracts away physical hardware.
    ● How do you get reliability when you don't control the
    hardware?

  11. Just Move Servers to The Cloud!
    Right?

  12. (image-only slide)

  13. Moving to the cloud by copying your
    servers one-to-one won't work.
    I know, I tried.

  14. (Diagram: domain.tld pointing at an Application and a Database inside a Security Group, in a single Availability Zone within one Region.)

  15. What Changes?
    ● You need to re-engineer parts of your application.
    ● Producing reliable applications in the cloud is different than
    on physical hardware.
    ● Don't have access to physical infrastructure.
    ● Need to build for scalability/elasticity.
    ● Get some reliability for free, the rest you need to architect
    your way around.

  16. Wait, Free Reliability?
    ● e.g. Relational Database Service (RDS) on AWS.
    ● Automatic backups.
    ● Automatic cross data center (availability zone) redundancy.
    ● Lots of things handled for you:
    ● Patches.
    ● Replication.
    ● Read-replicas.
    ● Failover.
    Awesome, our jobs are now obsolete.
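
    A minimal sketch (not from the talk) of what that "free" reliability looks like
    when provisioning RDS with boto3; the identifier, instance class, and credentials
    below are placeholders:

      import boto3

      rds = boto3.client("rds", region_name="us-east-1")

      # Multi-AZ gives a cross-AZ standby with automatic failover;
      # BackupRetentionPeriod turns on automatic backups.
      rds.create_db_instance(
          DBInstanceIdentifier="example-db",      # placeholder name
          DBInstanceClass="db.m3.medium",
          Engine="mysql",
          AllocatedStorage=100,
          MasterUsername="admin",
          MasterUserPassword="change-me",         # never hardcode real credentials
          MultiAZ=True,
          BackupRetentionPeriod=7,                # keep automatic backups for 7 days
      )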

  17. (Diagram: the Application and RDS Master in one Availability Zone, an RDS Slave in a second Availability Zone, and DB backups in an S3 Bucket, all within one Region.)

  18. Everything Isn't Free
    ● Redundancy of application servers you need to do yourself.
    ● Load balancers need configuring (as does DNS).
    ● Auto-scaling might be automatic, but someone still has to
    configure it.
    ● At a basic level, you can just copy a server into another
    availability zone, then point your load balancer at it.

  19. (Diagram: Route 53 (DNS) pointing domain.tld at an Elastic Load Balancer in front of Application servers in two Availability Zones, each in its own Security Group, plus RDS Master/Slave and DB backups in an S3 Bucket, within one Region.)

  20. Cool, so we're done.
    Right?

  21. (image-only slide)

  22. Server Died. What Now?
    Physical Environment

  23. Server Died. What Now?
    Cloud Environment

  24. (Diagram: Route 53 (DNS) pointing domain.tld at an Elastic Load Balancer in front of Application servers in two Availability Zones, with RDS Master/Slave and DB backups, within one Region.)

  25. (Diagram: the same architecture with one Application server failing: "Uh oh!")

  26. (Diagram: the same architecture with the remaining Application server still serving traffic: "No problem!")

  27. Embrace Failure
    ● Faults don't have to be a problem, if you handle them.
    ● Isolate errors within their component(s).
    ● Each component should be able to fail, without taking
    down the entire service.
    ● Don't let fatal application errors become fatal service errors.
    ● Fail in a consistent, known way.

  28. Netflix Chaos Monkey
    The Netflix Simian Army is available on GitHub:
    https://github.com/Netflix/SimianArmy
    "We have found that the best defense against major
    unexpected failures is to fail often. By frequently
    causing failures, we force our services to be built in a
    way that is more resilient."
    http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

  29. Care about services,
    not servers.

  30. Time to Think Differently
    ● Servers are ephemeral.
    ● You no longer care about individual servers.
    ● Now you care about the service as a whole.
    ● Servers will fail. It shouldn't matter. If a server suddenly
    disappears, you don't care.
    ● Recovery, deployment, failover, etc. should all be
    automated as best they can be.
    ● Package updates, OS updates, etc. need to be managed
    by "something", whether it's a Bash script or Chef/Puppet.

  31. Time to Think Differently
    ● Monitor service as a whole, not individual servers.
    ● Alerts become notifications.
    ● If you've set up everything correctly, your health check
    should automatically destroy bad instances and spawn
    new ones. There's (usually) no action to take when getting
    an “alert”.
    ● Proactive instead of reactive monitoring.
    ● To get the benefits, you'll need to re-architect your
    application. This has some prerequisites...

  32. How to Not Care About Servers

  33. Centralized Logging
    ● Can't log to local files anymore, have to log somewhere else.
    ● Admin tools to view logs need to be remade/refactored.
    ● SSHing to grep logs becomes infeasible at scale.
    ● Can use a third-party for this!
    ● Can archive logs in S3 bucket, pass to Glacier after x days.
    ● Can't log direct to S3, no append ability (yet).
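
    A small sketch of the "archive in S3, pass to Glacier after x days" step, assuming
    boto3 and a placeholder bucket and prefix:

      import boto3

      s3 = boto3.client("s3")

      # Move archived logs to Glacier after 30 days, expire them after a year.
      s3.put_bucket_lifecycle_configuration(
          Bucket="example-log-archive",           # placeholder bucket
          LifecycleConfiguration={
              "Rules": [{
                  "ID": "archive-logs",
                  "Filter": {"Prefix": "logs/"},
                  "Status": "Enabled",
                  "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                  "Expiration": {"Days": 365},
              }]
          },
      )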

  34. (Diagram: Applications ship logs via rsyslog to a Log Server that writes to persistent Storage; a read-only Log Viewer sits on top of the Logging System.)

  35. Dynamic Configuration
    ● Like Puppet, but for application configuration.
    ● Previously, infrastructure was static and environment was
    known, so this didn't matter. Now it's dynamic, so we
    needed to account for that.
    ● Things can scale at any time, so application configuration
    needs to be updatable.
    ● Application polls for config changes every so often. Can
    update config on the fly (current memcached nodes, etc.)
    either manually or programmatically.
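
    A minimal sketch of that polling loop; the config endpoint, keys, and interval are
    hypothetical (in practice the store might be S3, DynamoDB, or a small internal
    service):

      import json
      import time
      import urllib.request

      CONFIG_URL = "http://config.internal.example/app.json"   # hypothetical endpoint
      POLL_INTERVAL = 60                                        # seconds

      current = {}

      def poll_config():
          """Fetch the config and apply any changes on the fly."""
          global current
          with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
              fresh = json.load(resp)
          if fresh != current:
              current = fresh
              # Re-point clients at the new memcached nodes, flip flags, etc.
              print("config updated:", current.get("memcached_nodes"))

      while True:
          try:
              poll_config()
          except Exception as exc:      # keep serving on stale config if a poll fails
              print("config poll failed:", exc)
          time.sleep(POLL_INTERVAL)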

  36. (Diagram: a Configuration Management UI with Configuration Validation writes config to persistent Storage; Applications poll the Configuration System for config changes.)

  37. No Temporary Files
    ● Can't store any temporary files in local storage, need to
    move files directly to where they need to be.
    ● For uploads, can use pre-signed URLs to go direct to S3.
    ● Or, add item to asynchronous queue to be processed by a
    consumer.
    ● Temporary state on a local server becomes a bad idea in the
    cloud (or any distributed application).
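
    A sketch of the pre-signed upload URL approach with boto3; the bucket and key are
    placeholders:

      import boto3

      s3 = boto3.client("s3")

      # Hand this URL to the client; it can PUT the file straight to S3,
      # so the upload never touches the server's local (ephemeral) disk.
      url = s3.generate_presigned_url(
          ClientMethod="put_object",
          Params={"Bucket": "example-uploads", "Key": "incoming/report.csv"},
          ExpiresIn=900,                # valid for 15 minutes
      )
      print(url)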

  38. Global Session Storage
    ● Can't store sessions locally and rely on persistent load
    balancer connections.
    ● Have to store session state in a global space instead.
    ● Database works just fine for this.

  39. Controversial Opinion Ahead

  40. Disable SSH
    (Block Port 22)

  41. If you have to SSH into
    your servers, then your
    automation has failed.

  42. No SSH? Are You Mad?!?!
    ● I don't mean disabling sshd. That would be crazy.
    ● Disable at firewall level to prevent devs from cheating.
    ● “Oh, I'll just SSH in and fix this one issue.” instead of “I should
    make sure this fix is automated.”
    But what if I need to debug!
    ● Just re-enable port 22 and you're good to go. It's a few clicks, or
    3 seconds of typing.
    At scale, you simply can't SSH in to fix a problem. Get out of the
    habit early; it makes things go more smoothly later.
    Top Tip: Every time you have a manual action, automate it for next time!
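
    The "3 seconds of typing" to temporarily open (and then close) port 22 might look
    roughly like this with boto3; the security group ID and source address are
    placeholders:

      import boto3

      ec2 = boto3.client("ec2")

      SG_ID = "sg-0123456789abcdef0"    # placeholder security group
      MY_IP = "203.0.113.10/32"         # placeholder: your workstation, never 0.0.0.0/0

      # Temporarily allow SSH for debugging...
      ec2.authorize_security_group_ingress(
          GroupId=SG_ID, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=MY_IP)

      # ...and close it again when you're done.
      ec2.revoke_security_group_ingress(
          GroupId=SG_ID, IpProtocol="tcp", FromPort=22, ToPort=22, CidrIp=MY_IP)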

  43. Servers can fail, so we're done.
    Right?

  44. (Diagram: the same architecture with question marks where the Application servers in each Availability Zone should be.)

  45. Need Self-Provisioning Servers

  46. Bootstrapping
    ● On boot, identify region/application/etc. Store info on
    filesystem for later use (I store in /env).
    ● Don't forget to update bootstrap scripts as the first step, so you
    can change them without having to make a new image every
    time.
    ● You want fast bootstrapping! Don't start from a fresh OS every
    time; create a base image that has most of the things you
    need, then work from that.
    ● Can use Puppet/Chef to configure, but pre-configure a
    base instance first, then save an image from that.
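
    A sketch of the "identify region/AZ/instance on boot" step using the EC2 instance
    metadata service (older IMDSv1-style calls); the /env layout follows the slide,
    everything else is an assumption:

      import os
      import urllib.request

      META = "http://169.254.169.254/latest/meta-data/"   # EC2 instance metadata

      def meta(path):
          with urllib.request.urlopen(META + path, timeout=2) as resp:
              return resp.read().decode()

      az = meta("placement/availability-zone")            # e.g. "us-east-1a"
      info = {
          "instance_id": meta("instance-id"),
          "availability_zone": az,
          "region": az[:-1],                              # strip the AZ letter
      }

      os.makedirs("/env", exist_ok=True)
      for key, value in info.items():
          with open(os.path.join("/env", key), "w") as f:
              f.write(value + "\n")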

  47. Deployment
    ● Used to push code to known servers, now each server needs
    to pull its config/code on boot instead.
    ● Deployment scripts refactored to not care about individual
    servers but to use AWS API to find active servers.
    ● How does server know which version to deploy? Or which
    environment it's in? Uses AWS tags!
    ● Can easily deploy old code versions if needed, for rollback.
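
    A sketch of using the AWS API (boto3 here) to find the active servers to deploy to;
    the tag names are examples, and pagination is ignored for brevity:

      import boto3

      ec2 = boto3.client("ec2")

      # Find running app servers for a given environment by tag.
      resp = ec2.describe_instances(Filters=[
          {"Name": "tag:Environment", "Values": ["production"]},
          {"Name": "tag:Role", "Values": ["app"]},
          {"Name": "instance-state-name", "Values": ["running"]},
      ])

      targets = [
          inst["PrivateIpAddress"]
          for reservation in resp["Reservations"]
          for inst in reservation["Instances"]
      ]
      print("deploying to:", targets)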

  48. (Diagram: the full architecture: Route 53, Elastic Load Balancer, Application servers in two Availability Zones, RDS Master/Slave, and DB backups, within one Region.)

  49. (Diagram: the same architecture with question marks in place of missing servers.)

  50. (Diagram: the architecture with Application servers present again in both Availability Zones.)

  51. (Diagram: the same restored architecture.)

  52. Reliability is Also About Security
    Insecure == Unreliable

  53. Monitoring Changes
    ● Automate your security auditing.
    ● Current intrusion detection tools may not detect AWS-specific changes.
    ● Create an IAM account with built-in "Security Audit" policy.
    ● https://s3.amazonaws.com/reinvent2013-sec402/SecConfig.py *
    ● This script will go over your account, creating a canonical
    representation of security configuration.
    ● Set up a cron job to do this every so often and compare to previous
    run. Trigger an alert for review if changes are detected.
    ● CloudTrail keeps full audit logs of all changes from web console or API.
    ● Store logs in S3 bucket with versioning so no one can modify your logs
    without you seeing.
    * From "Intrusion Detection in the Cloud", http://awsmedia.s3.amazonaws.com/SEC402.pdf

  54. Controlling Access
    ● Everyone gets an IAM account. Never log in to the master account.
    ● You may be used to using an "Operations Account", which you
    share with your entire team.
    ● Do not do that with AWS/Cloud. Everyone gets their own account,
    with just the permissions they need (least privilege principle).
    ● An IAM user can control everything in the infrastructure, so
    there's no need to use the master account.
    ● Enable multi-factor authentication for master and IAM accounts.
    ● Could give one user MFA token, another the password. Any action
    on master account then requires two users to agree. Overkill for
    my case, but someone may want to use that technique.

  55. No Hardcoded Credentials
    ● If your app has credentials baked into it, you're "doing it wrong".
    ● Use IAM Roles,
    ● Create role, specify permissions.
    ● When creating instance, specify role it should use.
    ● Whenever using AWS SDK, it will automatically retrieve
    temporary credentials with the access level specified in the
    role.
    ● All handled transparently to developers/operations.
    ● Application never needs to know the credentials,
    infrastructure manages it all for you.
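
    The point, in code: with an IAM role attached to the instance, no keys appear
    anywhere; boto3 finds and refreshes the temporary credentials itself (the bucket
    name is a placeholder):

      import boto3

      # No access keys in code, config files, or environment variables:
      # the SDK pulls temporary role credentials from the instance automatically.
      s3 = boto3.client("s3")

      for obj in s3.list_objects_v2(Bucket="example-bucket").get("Contents", []):
          print(obj["Key"])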

  56. Managing Your Infrastructure

  57. Tools, Tools, and More Tools
    ● Can write scripts using AWS CLI tools.
    ● Can use the Web Console.
    ● Useful for viewing graphs on CloudWatch, etc.
    ● CloudFormation lets you write your infrastructure in JSON, create stacks
    that can be deployed over and over. (Bonus: keep your infrastructure in
    version control!)
    ● OpsWorks uses Chef recipes; it's just point-and-click and does most of the
    work for you.
    ● DB layer, load balancer layer, cache layer, etc.
    ● Schedule periods of higher support.
    ● Scale based on latency or other factors, instead of just time-based.

  58. Scalable in a zone is not enough.
    You must use multiple zones!

  59. Redundancy is Required
    ● You absolutely must spread yourself out over multiple
    physical locations to have a reliable service.
    ● Unlike metal environments, it's just a few clicks, rather than
    a trip to another city to rack some servers.
    ● For AWS, this means always deploying into multiple
    Availability Zones (AZs).
    ● Use Elastic Load Balancer (ELB) as service endpoint.
    ● Add servers to ELB pool. ELB can see all AZs in a region.
    ● For multiple regions, need to use DNS (round robin, etc.).
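
    A sketch of the "add servers to the ELB pool" step using the classic ELB API that
    was current at the time; all names are placeholders:

      import boto3

      elb = boto3.client("elb")    # classic Elastic Load Balancing API

      # The ELB spans every AZ it's enabled in; register instances from each zone.
      elb.register_instances_with_load_balancer(
          LoadBalancerName="example-web-elb",
          Instances=[{"InstanceId": "i-0aaaaaaaaaaaaaaaa"},   # instance in AZ 1
                     {"InstanceId": "i-0bbbbbbbbbbbbbbbb"}],  # instance in AZ 2
      )

      # The health check decides when an instance is pulled out of rotation.
      elb.configure_health_check(
          LoadBalancerName="example-web-elb",
          HealthCheck={"Target": "HTTP:80/health", "Interval": 30, "Timeout": 5,
                       "UnhealthyThreshold": 2, "HealthyThreshold": 3},
      )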

  60. (Diagram: Route 53 and an Elastic Load Balancer in front of Auto Scaling Groups of Application servers across Availability Zones 1 through N, with RDS Master/Slave and DB backups in an S3 Bucket, within one Region.)

  61. It Just Works!
    Right?

  62. (image-only slide)

  63. GitHub Down? So Are We!

  64. (image-only slide)

  65. Redundasize* Critical Processes
    Problem
    We deployed direct from GitHub. When GitHub is down, or
    there's too much latency to github.com, we can't scale.
    Oops.
    Solution
    We now have a local clone of GitHub repos we pull from
    instead. GitHub is the backup if that clone goes down.
    Git is distributed, we should probably have made use of that.
    * possibly a made-up word.

  66. Which Server is the Log From?

  67. (image-only slide)

  68. Make Your Logs Useful
    Problem
    Aggregated logs didn't contain any info on the server/region. No idea which
    region/AZ is having a problem from the logs.
    Oops.
    Solution
    Now we store extra metadata with each log line.
    ● Region
    ● Availability Zone
    ● Instance ID
    ● Environment (stage/prod/test/demo, etc)
    ● Request ID
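
    A minimal sketch of stamping that metadata onto every log line with Python's
    logging module; the context values would really come from the metadata cached at
    boot, not hardcoded as they are here:

      import logging

      # Assumed helper data: in practice, read from the values cached at boot.
      CONTEXT = {"region": "us-east-1", "az": "us-east-1a",
                 "instance_id": "i-0aaaaaaaaaaaaaaaa", "env": "prod"}

      logging.basicConfig(
          format="%(asctime)s %(region)s %(az)s %(instance_id)s %(env)s %(message)s")

      log = logging.LoggerAdapter(logging.getLogger("app"), CONTEXT)
      log.warning("payment backend timed out")
      # -> ... us-east-1 us-east-1a i-0aaaaaaaaaaaaaaaa prod payment backend timed out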

  69. Server Dies During Deployment?
    Let's Just Stop Everything!

  70. (image-only slide)

  71. Cope with Failure at All Levels
    Problem
    Deployment scripts didn't account for server being replaced
    mid-deployment. Would stall deployments completely.
    Oops.
    Solution
    Check server state throughout the process and move on if it's been
    killed.
    Make sure you can cope with failure not just in your
    infrastructure, but in any scripts or tools which you use to
    manage that infrastructure.
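
    A sketch of that guard inside a deployment loop, using boto3 to check instance
    state; the deploy step and instance IDs are placeholders:

      import boto3

      ec2 = boto3.client("ec2")

      def still_running(instance_id):
          """True only if the instance is still in the 'running' state.
          (Recently replaced instances show up as shutting-down/terminated.)"""
          resp = ec2.describe_instances(InstanceIds=[instance_id])
          for reservation in resp["Reservations"]:
              for inst in reservation["Instances"]:
                  return inst["State"]["Name"] == "running"
          return False

      def deploy_to(instance_id):
          ...   # placeholder: pull code, restart services, health-check, etc.

      for instance_id in ["i-0aaaaaaaaaaaaaaaa", "i-0bbbbbbbbbbbbbbbb"]:
          if not still_running(instance_id):
              print("skipping", instance_id, "- replaced mid-deployment")
              continue
          deploy_to(instance_id)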

  72. Private Network?
    Nah, Let's Just Use Public IPs.

  73. (image-only slide)

  74. Use Private Network
    Problem
    Didn't use a VPC, so using internal IPs was painful. We just used the external
    public IPs instead. It works, but it's much more difficult to secure and
    manage.
    Oops.
    Solution
    Migrate to VPC. Migrating after the fact was difficult. Use it from the
    start and save yourself the pain.
    VPC lets you have egress firewall rules, change things on the fly,
    specify network ACLs, etc. New accounts have no choice, so this may
    be moot.

  75. Deploying a New Application?
    Sorry, You Hit Your Limit.

  76. (image-only slide)

  77. Be Aware of Cloud Limitations
    Problem
    AWS has pre-defined service limits. These are not clearly displayed unless
    you know where to look*. The first time you'll see an error is when you try
    to perform an action that exceeds a limit.
    Oops.
    Solution
    Be aware of the built-in limits so you can request increases ahead of time,
    before you start putting things to use in production.
    Other limits are things like scalability of ELBs. If you're expecting heavy traffic,
    you need to pre-warm your ELBs by injecting traffic beforehand. Or contact
    AWS to pre-warm them for you (preferred).
    You want to learn this lesson before you get the critical traffic!
    * http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
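
    A small sketch of checking a couple of the built-in limits ahead of time via the
    EC2 account attributes API:

      import boto3

      ec2 = boto3.client("ec2")

      # 'max-instances' and 'max-elastic-ips' are among the attributes EC2 exposes.
      resp = ec2.describe_account_attributes(
          AttributeNames=["max-instances", "max-elastic-ips"])

      for attr in resp["AccountAttributes"]:
          value = attr["AttributeValues"][0]["AttributeValue"]
          print(attr["AttributeName"], "=", value)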

  78. Was It All Worth It?
    (Hint: I'm Slightly Biased)

  79. Worth It?
    ● Can now handle growth in a very organic fashion.
    ● No actionable alert in... well... I can't remember.
    ● When things go wrong, instances kill themselves and we get
    a fresh instance with a known-good configuration.
    ● Deployments are not as dangerous, can deploy many times
    a day and rollback easily, so they've become routine instead
    of "OK, everyone stop what you're doing, we're going to
    deploy something".

  80. Totally Worth It
    ● Much lower cost than before.
    ● Spinning up a new application/environment used to take days.
    Now takes ~15 minutes.
    ● More freedom to prototype and play with changes.
    ● Easy to spin up a new region/environment for a few hours to
    play with settings (with minimal cost, and completely isolated
    from your current environment).
    ● Something you can't do with metal unless you already have
    the hardware prepared and ready.
    ● Developers can have their own personal prod clone to develop
    with, which means no surprises when moving to production.

  81. Useful Resources
    ● https://cloud.google.com/developers/#articles - Google
    Cloud Whitepapers and Best Practice Guides.
    ● http://www.rackspace.co.uk/whitepapers - Rackspace
    Whitepapers and Guides.
    ● http://azure.microsoft.com/blog - Microsoft Azure Blog.
    ● http://aws.typepad.com/ - AWS Blog.
    ● http://www.youtube.com/user/AmazonWebServices - Lots of
    AWS training videos, etc.

  82. More Useful Resources
    ● http://www.slideshare.net/AmazonWebServices - All slides
    and presentations from the AWS conferences. Lots of useful
    training stuff for free.
    ● http://netflix.github.com - Netflix are the masters of AWS.
    They have lots of open source stuff to share.
    ● http://aws.amazon.com/whitepapers - Oh so many papers
    from AWS on everything from security best practices, to
    financial services grid computing in the cloud.

  83. Tooting My Own Horn
    Read more about my AWS mishaps!
    wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started
    Do I suck at presenting?
    Send your hate mail to [email protected]!
    Say Hi on Twitter!
    @r_adams

  84. Thanks!
    richadams.me/talks/srecon14
