Slide 1

Slide 1 text

SRE IN THE CLOUD
Rich Adams
SRECon14, 30th May 2014

Slide 2

Slide 2 text

Formalities

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Formalities
● Hi, I'm Rich! o/
● I'm a systems engineer at Gracenote.
● (I write server applications, and manage the infrastructure for those applications on AWS).
● I'm British, sorry for the accent*.
● Be gentle, this is my first ever talk!
● (Don't worry, I'll provide an email address for hate mail towards the end).
* not really.

Slide 5

Slide 5 text

Let's Talk About The Cloud

Slide 6

Slide 6 text

CLOUD

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Why bother?
● “Free” reliability and automation!
● Low upfront cost.
● Low operating cost.
● Faster to get up and running than on metal.
● Pay as you go, no minimum contracts, etc.
● Easier to scale than metal.
● Easier to learn than physical hardware (one vendor vs many).
● On-demand capacity and elasticity.
Perfect for startups!

Slide 10

Slide 10 text

Changing Roles
SREs in a physical environment have the advantage:
● They know the physical hardware.
● They understand the intricacies of the entire infrastructure.
The cloud is maintained by the vendor:
● It abstracts away the physical hardware.
● How do you get reliability when you don't control the hardware?

Slide 11

Slide 11 text

Just Move Servers to The Cloud! Right?

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Moving to the cloud by copying your servers one-to-one won't work. I know, I tried.

Slide 14

Slide 14 text

Diagram: a single Region with one Availability Zone; inside a Security Group sit the Application and Database, with domain.tld pointing at the application.

Slide 15

Slide 15 text

What Changes?
● You need to re-engineer parts of your application.
● Producing reliable applications in the cloud is different than on physical hardware.
● Don't have access to physical infrastructure.
● Need to build for scalability/elasticity.
● Get some reliability for free, the rest you need to architect your way around.

Slide 16

Slide 16 text

Wait, Free Reliability?
● e.g. Relational Database Service (RDS) on AWS.
● Automatic backups.
● Automatic cross data center (availability zone) redundancy.
● Lots of things handled for you:
  ● Patches.
  ● Replication.
  ● Read-replicas.
  ● Failover.
Awesome, our jobs are now obsolete.
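Most of that "free" reliability boils down to a couple of flags at creation time. A minimal sketch using boto3 (one possible SDK, not something the talk prescribes; the identifier, sizes, and password are placeholders) asking RDS for a Multi-AZ standby and automated backups:

```python
# Sketch: create an RDS instance with Multi-AZ failover and automatic
# backups enabled. All names and values are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-db",      # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m1.small",
    AllocatedStorage=20,                    # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",         # keep real credentials elsewhere
    MultiAZ=True,                           # synchronous standby in another AZ
    BackupRetentionPeriod=7,                # automated backups kept for 7 days
)
```

With those two parameters set, the patching, replication, and failover listed on the slide happen on the vendor's side.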

Slide 17

Slide 17 text

Diagram: a Region with two Availability Zones; the Application and RDS Master sit in one Security Group, the RDS Slave in another, with DB backups going to an S3 Bucket; domain.tld points at the application.

Slide 18

Slide 18 text

Everything Isn't Free
● Redundancy of application servers is something you need to do yourself.
● Load balancers need configuring (as does DNS).
● Auto-scaling might be automatic, but someone still has to configure it.
● At a basic level, you can just copy a server into another availability zone, then point your load balancer at it.
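A hedged sketch of the configuration work the slide is talking about, using boto3 with the classic ELB and Auto Scaling APIs (the AMI, instance type, and every name here are invented):

```python
# Sketch: a launch configuration plus an Auto Scaling group spanning two
# AZs, registered with an existing classic ELB. Names/AMIs are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_launch_configuration(
    LaunchConfigurationName="app-lc-v1",
    ImageId="ami-12345678",          # pre-baked application image
    InstanceType="m1.small",
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="app-asg",
    LaunchConfigurationName="app-lc-v1",
    MinSize=2,                        # enough for one instance per AZ
    MaxSize=8,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["app-elb"],    # ELB health checks drive replacement
)
```

Someone still had to decide the sizes, zones, and health checks; that part is never automatic.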

Slide 19

Slide 19 text

Diagram: Route 53 (DNS) for domain.tld points at an Elastic Load Balancer fronting Application servers in two Availability Zones; RDS Master and Slave sit in separate zones, with DB backups in an S3 Bucket.

Slide 20

Slide 20 text

Cool, so we're done. Right?

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Server Died. What Now? Physical Environment

Slide 23

Slide 23 text

Server Died. What Now? Cloud Environment

Slide 24

Slide 24 text

Diagram: the multi-AZ setup from before: Route 53 (DNS), Elastic Load Balancer, Application servers in two Availability Zones, RDS Master/Slave, and DB backups.

Slide 25

Slide 25 text

Diagram: the same setup, with one Application server failing ("Uh oh!").

Slide 26

Slide 26 text

Diagram: the same setup; the load balancer routes around the failed Application server ("No problem!").

Slide 27

Slide 27 text

Embrace Failure
● Faults don't have to be a problem, if you handle them.
● Isolate errors within their component(s).
● Each component should be able to fail, without taking down the entire service.
● Don't let fatal application errors become fatal service errors.
● Fail in a consistent, known way.
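One way to read "isolate errors within their components": wrap each dependency call so a failure degrades that component instead of the whole request. A generic Python sketch (not from the talk) of a timeout-plus-fallback wrapper:

```python
# Sketch: stop a component failure from becoming a service failure by
# bounding the call and falling back to a degraded, known-good response.
import concurrent.futures
import logging

log = logging.getLogger("widget")

def fetch_recommendations():
    # Placeholder for a real dependency call (remote service, database, ...).
    raise RuntimeError("recommendations backend is down")

def call_with_fallback(func, fallback, timeout_seconds=0.5):
    """Run func(); on error or timeout, log it and return the fallback instead."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(func).result(timeout=timeout_seconds)
        except Exception as exc:          # TimeoutError is also an Exception
            log.warning("component failed, serving fallback: %s", exc)
            return fallback

# The page renders with an empty widget instead of a 500 for the whole service.
print(call_with_fallback(fetch_recommendations, fallback=[]))
```

The fallback is the "consistent, known way" to fail: an empty widget every time, never a partial or surprising one.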

Slide 28

Slide 28 text

Netflix Chaos Monkey
The Netflix Simian Army is available on GitHub: https://github.com/Netflix/SimianArmy
"We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient."
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
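The real Simian Army linked above is the thing to use; purely to illustrate the "cause failures frequently" idea, here is a toy boto3 sketch (not Netflix's implementation) that picks one instance from a hypothetical application tag and asks to terminate it, with DryRun left on so nothing actually dies:

```python
# Toy chaos sketch: terminate one random instance carrying an invented
# "Application" tag. DryRun=True means AWS only checks the request.
import random
import boto3
import botocore.exceptions

ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Application", "Values": ["example-app"]},   # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print("would terminate", victim)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=True)
    except botocore.exceptions.ClientError as exc:
        # DryRunOperation = the call would have succeeded; remove DryRun on purpose.
        print(exc.response["Error"]["Code"])
```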

Slide 29

Slide 29 text

Care about services, not servers.

Slide 30

Slide 30 text

Time to Think Differently
● Servers are ephemeral.
● You no longer care about individual servers.
● Now you care about the service as a whole.
● Servers will fail. It shouldn't matter. If a server suddenly disappears, you don't care.
● Recovery, deployment, failover, etc. should all be automated as best they can be.
● Package updates, OS updates, etc. need to be managed by "something", whether it's a Bash script, or Chef/Puppet, etc.

Slide 31

Slide 31 text

Time to Think Differently
● Monitor the service as a whole, not individual servers.
● Alerts become notifications.
● If you've set up everything correctly, your health check should automatically destroy bad instances and spawn new ones. There's (usually) no action to take when getting an “alert”.
● Proactive instead of reactive monitoring.
● To get the benefits, you'll need to re-architect your application. This has some prerequisites...
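A sketch of the "no action to take" idea: a periodic job (my assumption, not something the slides specify) that asks a classic ELB for instance health and terminates anything OutOfService so the Auto Scaling group replaces it:

```python
# Sketch: reap unhealthy instances behind a classic ELB and let the Auto
# Scaling group bring up fresh ones. The load balancer name is a placeholder.
import boto3

elb = boto3.client("elb", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

health = elb.describe_instance_health(LoadBalancerName="app-elb")

bad = [
    state["InstanceId"]
    for state in health["InstanceStates"]
    if state["State"] == "OutOfService"
]

if bad:
    print("replacing unhealthy instances:", bad)
    ec2.terminate_instances(InstanceIds=bad)   # the ASG launches replacements
```

The "alert" a human sees is just a notification that this already happened.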

Slide 32

Slide 32 text

How to Not Care About Servers

Slide 33

Slide 33 text

Centralized Logging
● Can't log to local files anymore, have to log somewhere else.
● Admin tools to view logs need to be remade/refactored.
● SSHing to grep logs becomes infeasible at scale.
● Can use a third-party for this!
● Can archive logs in an S3 bucket, pass to Glacier after x days.
● Can't log direct to S3, no append ability (yet).
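A minimal sketch of getting application logs off the box with the standard library's SysLogHandler pointed at a central rsyslog endpoint (the hostname is made up); the aggregator, not the instance, owns the files:

```python
# Sketch: log to a central rsyslog endpoint instead of local files.
# "logs.internal" is a placeholder for your aggregator's address.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("logs.internal", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("user signup completed")   # ends up on the log server, not on local disk
```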

Slide 34

Slide 34 text

Diagram: Application servers ship logs via rsyslog to a Log Server, which writes to persistent Storage; a read-only Log Viewer reads from the Logging System.

Slide 35

Slide 35 text

Dynamic Configuration
● Like Puppet, but for application configuration.
● Previously, infrastructure was static and the environment was known, so this didn't matter. Now it's dynamic, so we needed to account for that.
● Things can scale at any time, so application configuration needs to be updatable.
● The application polls for config changes every so often. Can update config on-the-fly (current memcached nodes, etc.) either manually or programmatically.
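A sketch of the polling side: a background timer that re-reads a JSON config object from S3 every minute and swaps it in when it changes. The bucket, key, and interval are assumptions, and a real system would also validate config before publishing it (as the diagram that follows shows):

```python
# Sketch: poll S3 for application config and swap it in when it changes.
# Bucket/key are placeholders; validate config before applying it in real life.
import json
import threading
import boto3

s3 = boto3.client("s3")
_config = {}

def poll_config(bucket="example-config", key="app/config.json", interval=60):
    global _config
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    new_config = json.loads(body)
    if new_config != _config:
        _config = new_config            # e.g. the current list of memcached nodes
        print("config updated:", sorted(new_config))
    threading.Timer(interval, poll_config, args=(bucket, key, interval)).start()

poll_config()
```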

Slide 36

Slide 36 text

Diagram: a Configuration Management UI feeds the Configuration System, which validates config and writes it to persistent Storage; Application servers poll for config changes.

Slide 37

Slide 37 text

No Temporary Files
● Can't store any temporary files in local storage, need to move files directly to where they need to be.
● For uploads, can use pre-signed URLs to go direct to S3.
● Or, add an item to an asynchronous queue to be processed by a consumer.
● Temporary state on a local server becomes a bad idea in the cloud (or any distributed application).
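A sketch of the pre-signed upload idea: the application hands the client a short-lived URL so the file goes straight to S3 and never touches local disk. The bucket and key are invented:

```python
# Sketch: issue a short-lived pre-signed URL so clients upload straight
# to S3, bypassing local temporary files. Bucket/key are placeholders.
import boto3

s3 = boto3.client("s3")

upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "example-uploads", "Key": "incoming/photo-123.jpg"},
    ExpiresIn=900,   # valid for 15 minutes
)

# Hand this to the client; it PUTs the file directly to S3.
print(upload_url)
```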

Slide 38

Slide 38 text

Global Session Storage
● Can't store sessions locally and rely on persistent load balancer connections.
● Have to store session state in a global space instead.
● Database works just fine for this.
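A sketch of database-backed sessions. Any shared DB-API connection works; sqlite3 is used here only so the snippet runs on its own, where a real deployment would point `conn` at the shared database (RDS in the architecture above):

```python
# Sketch: keep session state in a shared database instead of on one web
# server. sqlite3 is a local stand-in; use the shared RDS database for real.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)")

def save_session(session_id, data):
    conn.execute(
        "INSERT OR REPLACE INTO sessions (id, data) VALUES (?, ?)",
        (session_id, json.dumps(data)),
    )
    conn.commit()

def load_session(session_id):
    row = conn.execute("SELECT data FROM sessions WHERE id = ?", (session_id,)).fetchone()
    return json.loads(row[0]) if row else {}

save_session("abc123", {"user_id": 42})
print(load_session("abc123"))        # any web server in the pool can do this lookup
```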

Slide 39

Slide 39 text

Controversial Opinion Ahead

Slide 40

Slide 40 text

Disable SSH (Block Port 22)

Slide 41

Slide 41 text

If you have to SSH into your servers, then your automation has failed.

Slide 42

Slide 42 text

No SSH? Are You Mad?!?!
● I don't mean disabling sshd. That would be crazy.
● Disable it at the firewall level to prevent devs from cheating.
● “Oh, I'll just SSH in and fix this one issue.” instead of “I should make sure this fix is automated.”
But what if I need to debug!
● Just re-enable port 22 and you're good to go. It's a few clicks, or 3 seconds of typing.
At scale, you simply can't SSH in to fix a problem. Get out of the habit early. It makes things go more smoothly later.
Top Tip: Every time you have a manual action, automate it for next time!
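Blocking and re-opening port 22 is just a security group rule. A boto3 sketch of the "few clicks, or 3 seconds of typing" (the group ID and CIDR are placeholders):

```python
# Sketch: close, and when you really must debug, reopen port 22 on a
# security group. Group ID and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
SSH_RULE = dict(GroupId="sg-12345678", IpProtocol="tcp",
                FromPort=22, ToPort=22, CidrIp="203.0.113.0/24")

def block_ssh():
    ec2.revoke_security_group_ingress(**SSH_RULE)

def allow_ssh():     # the "3 seconds of typing" escape hatch
    ec2.authorize_security_group_ingress(**SSH_RULE)

block_ssh()
```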

Slide 43

Slide 43 text

Servers can fail, so we're done. Right?

Slide 44

Slide 44 text

Diagram: the multi-AZ setup with question marks where new Application servers should appear; how do they provision themselves?

Slide 45

Slide 45 text

Need Self-Provisioning Servers

Slide 46

Slide 46 text

Bootstrapping
● On boot, identify region/application/etc. Store the info on the filesystem for later use (I store it in /env).
● Don't forget to update the bootstrap scripts as the first step, so you can change them without having to make a new image every time.
● You want fast bootstrapping! Don't start from a fresh OS every time; create a base image that has most of the things you need, then work from that.
● Can use Puppet/Chef to configure, but pre-configure a base instance first, then save an image from that.
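A sketch of the "identify region/application on boot" step using the EC2 instance metadata service; the /env layout mirrors the slide, and anything beyond instance ID and zone would come from tags or user data in practice:

```python
# Sketch: on first boot, record identity facts under /env for later use.
# The metadata address is the standard EC2 endpoint; /env follows the slide.
import os
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/"

def metadata(path):
    return urllib.request.urlopen(METADATA + path, timeout=2).read().decode()

os.makedirs("/env", exist_ok=True)

facts = {
    "instance-id": metadata("instance-id"),
    "availability-zone": metadata("placement/availability-zone"),
}
facts["region"] = facts["availability-zone"][:-1]   # e.g. us-east-1a -> us-east-1

for name, value in facts.items():
    with open(os.path.join("/env", name), "w") as f:
        f.write(value + "\n")
```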

Slide 47

Slide 47 text

Deployment
● We used to push code to known servers; now each server pulls its config/code on boot instead.
● Deployment scripts were refactored to not care about individual servers, but to use the AWS API to find active servers.
● How does a server know which version to deploy? Or which environment it's in? It uses AWS tags!
● Can easily deploy old code versions if needed, for rollback.
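A sketch of "use the AWS API to find active servers": filter running instances by environment/application tags and read the deploy version from another tag. The tag names here are an invented convention, not the talk's:

```python
# Sketch: discover deploy targets from tags instead of a hard-coded host list.
# Tag keys and values are invented conventions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Application", "Values": ["example-app"]},
        {"Name": "tag:Environment", "Values": ["prod"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        print(instance["InstanceId"],
              instance.get("PrivateIpAddress"),
              "wants version", tags.get("DeployVersion"))
```

Rollback is then just changing the version tag and letting instances pull the older build.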

Slide 48

Slide 48 text

Diagram: the healthy multi-AZ setup: Route 53 (DNS), Elastic Load Balancer, Application servers in both Availability Zones, RDS Master/Slave, and DB backups.

Slide 49

Slide 49 text

Diagram: the same setup with question marks in place of the Application servers that need to provision themselves.

Slide 50

Slide 50 text

Diagram: the same setup with self-provisioned Application servers running again in both Availability Zones.

Slide 51

Slide 51 text

Diagram: as on the previous slide; both Availability Zones are serving traffic again.

Slide 52

Slide 52 text

Reliability is Also About Security
Insecure == Unreliable

Slide 53

Slide 53 text

Monitoring Changes
● Automate your security auditing.
● Current intrusion detection tools may not detect AWS-specific changes.
● Create an IAM account with the built-in "Security Audit" policy.
● https://s3.amazonaws.com/reinvent2013-sec402/SecConfig.py *
● This script will go over your account, creating a canonical representation of its security configuration.
● Set up a cron job to do this every so often and compare to the previous run. Trigger an alert for review if changes are detected.
● CloudTrail keeps full audit logs of all changes made from the web console or API.
● Store the logs in an S3 bucket with versioning so no one can modify your logs without you seeing.
* From "Intrusion Detection in the Cloud", http://awsmedia.s3.amazonaws.com/SEC402.pdf
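The SecConfig.py script linked above is the real tool; as a much simplified stand-in for the "canonical snapshot, diff against the last run" idea, a sketch that dumps IAM account details to JSON, hashes them, and flags any change since the previous cron run (the state file path is a placeholder):

```python
# Simplified stand-in for the SecConfig.py approach: snapshot IAM
# configuration, hash it, and flag changes since the last run.
import hashlib
import json
import os
import boto3

iam = boto3.client("iam")
STATE_FILE = "/var/lib/security-audit/last-hash"   # placeholder path

# Note: this call is paginated; a real job would follow IsTruncated/Marker.
snapshot = iam.get_account_authorization_details()
snapshot.pop("ResponseMetadata", None)
digest = hashlib.sha256(
    json.dumps(snapshot, sort_keys=True, default=str).encode()
).hexdigest()

previous = open(STATE_FILE).read().strip() if os.path.exists(STATE_FILE) else None
if previous and previous != digest:
    print("IAM configuration changed since last run: review it!")  # alert hook here

os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
open(STATE_FILE, "w").write(digest)
```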

Slide 54

Slide 54 text

Controlling Access
● Everyone gets an IAM account. Never log in to the master account.
● You may be used to an "Operations Account", which you share with your entire team.
● Do not do that with AWS/Cloud. Everyone gets their own account, with just the permissions they need (least-privilege principle).
● An IAM user can control everything in the infrastructure, so there's no need to use the master account.
● Enable multi-factor authentication for the master and IAM accounts.
● Could give one user the MFA token, and another the password. Any action on the master account then requires two users to agree. Overkill for my case, but someone may want to use that technique.
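A sketch of "everyone gets their own IAM account with just the permissions they need". The user name and the read-only EC2 policy are examples of the shape, not a recommendation of specific permissions:

```python
# Sketch: a per-person IAM user with a narrow, explicit policy, instead of
# a shared operations account. The name and policy are examples only.
import json
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="rich")

read_only_ec2 = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:Describe*"], "Resource": "*"}
    ],
}

iam.put_user_policy(
    UserName="rich",
    PolicyName="ec2-read-only",
    PolicyDocument=json.dumps(read_only_ec2),
)
```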

Slide 55

Slide 55 text

No Hardcoded Credentials
● If your app has credentials baked into it, you're "doing it wrong".
● Use IAM Roles:
  ● Create a role, specify its permissions.
  ● When creating an instance, specify the role it should use.
  ● Whenever you use the AWS SDK, it will automatically retrieve temporary credentials with the access level specified in the role.
● All handled transparently to developers/operations.
● The application never needs to know the credentials; the infrastructure manages it all for you.
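The point of roles is what is not in the code. A sketch (bucket and key are placeholders) showing an application call with no keys anywhere; on an instance launched with a role, the SDK fetches temporary credentials from the instance profile by itself:

```python
# Sketch: no access keys in code or config. On an instance launched with an
# IAM role, the SDK pulls temporary credentials from the instance profile
# automatically. Bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")          # note: no aws_access_key_id / secret anywhere

s3.put_object(
    Bucket="example-app-data",
    Key="reports/daily.csv",
    Body=b"date,signups\n2014-05-30,42\n",
)
```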

Slide 56

Slide 56 text

Managing Your Infrastructure

Slide 57

Slide 57 text

Tools, Tools, and More Tools
● Can write scripts using the AWS CLI tools.
● Can use the Web Console.
  ● Useful for viewing graphs on CloudWatch, etc.
● CloudFormation lets you write your infrastructure in JSON, creating stacks that can be deployed over and over. (Bonus: keep your infrastructure in version control!)
● OpsWorks uses Chef recipes; it's just point and click, and does most of the work for you.
  ● DB layer, load balancer layer, cache layer, etc.
  ● Schedule periods of higher support.
  ● Scale based on latency or other factors, instead of just time-based.
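A sketch of the CloudFormation point: the template is plain JSON you can keep in version control. Here it is built as a Python dict (deliberately tiny, a single security group; every name is invented) and pushed as a stack with boto3:

```python
# Sketch: infrastructure as a JSON template you can version-control.
# Deliberately minimal (one security group); names are placeholders.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "WebSecurityGroup": {
            "Type": "AWS::EC2::SecurityGroup",
            "Properties": {
                "GroupDescription": "Allow HTTP from anywhere",
                "SecurityGroupIngress": [
                    {"IpProtocol": "tcp", "FromPort": 80,
                     "ToPort": 80, "CidrIp": "0.0.0.0/0"}
                ],
            },
        }
    },
}

cloudformation = boto3.client("cloudformation", region_name="us-east-1")
cloudformation.create_stack(
    StackName="example-web-stack",
    TemplateBody=json.dumps(template),   # the same JSON you'd commit to git
)
```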

Slide 58

Slide 58 text

Scalable in a zone is not enough. You must use multiple zones!

Slide 59

Slide 59 text

Redundancy is Required
● You absolutely must spread yourself out over multiple physical locations to have a reliable service.
● Unlike metal environments, it's just a few clicks, rather than a trip to another city to rack some servers.
● For AWS, this means always deploying into multiple Availability Zones (AZs).
● Use an Elastic Load Balancer (ELB) as the service endpoint.
● Add servers to the ELB pool. The ELB can see all AZs in a region.
● For multiple regions, you need to use DNS (round robin, etc.).
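A small sketch of the "ELB can see all AZs in a region" point, using the classic ELB API (the load balancer name and zones are placeholders): make sure the load balancer is enabled in every zone you actually run in.

```python
# Sketch: make a classic ELB span every Availability Zone you deploy into.
# Load balancer name and zones are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.enable_availability_zones_for_load_balancer(
    LoadBalancerName="app-elb",
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# For multiple regions you go one level up and use DNS (for example Route 53
# weighted or latency-based records) rather than a single ELB.
```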

Slide 60

Slide 60 text

Diagram: the full picture: Route 53 (DNS) for domain.tld points at an Elastic Load Balancer; Auto Scaling Groups run Application servers across Availability Zones 1 through N, each in its own Security Group; RDS Master and Slave replicate across zones, with DB backups in an S3 Bucket.

Slide 61

Slide 61 text

It Just Works! Right?

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

GitHub Down? So Are We!

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Redundasize* Critical Processes
Problem
We deployed directly from GitHub. When GitHub is down, or there's too much latency to github.com, we can't scale. Oops.
Solution
We now have a local clone of the GitHub repos, which we pull from instead. GitHub is the backup if that clone goes down. Git is distributed; we should probably have made use of that.
* possibly a made-up word.
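A sketch of the fix described above: deploys fetch from an internal mirror first and fall back to github.com only if the mirror is unreachable. The remote URLs are invented:

```python
# Sketch: pull from a local mirror first, fall back to GitHub only if the
# mirror is down. Remote URLs are placeholders.
import subprocess

REMOTES = [
    "git@git.internal:example/app.git",      # local mirror (preferred)
    "git@github.com:example/app.git",        # backup
]

def fetch_release(ref="refs/heads/master"):
    for remote in REMOTES:
        try:
            subprocess.run(["git", "fetch", remote, ref], check=True, timeout=60)
            return remote
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            continue
    raise RuntimeError("all git remotes unreachable")

print("fetched from", fetch_release())
```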

Slide 66

Slide 66 text

Which Server is the Log From?

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Make Your Logs Useful
Problem
Aggregated logs didn't contain any info on the server/region. No idea which region/AZ was having a problem from the logs alone. Oops.
Solution
Now we store extra metadata with each log line:
● Region
● Availability Zone
● Instance ID
● Environment (stage/prod/test/demo, etc.)
● Request ID
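A sketch of stamping every log line with the instance's identity, reusing the facts written under /env in the earlier bootstrapping sketch (the /env layout and field names are that sketch's assumptions; the fields mirror the slide's list):

```python
# Sketch: attach region/AZ/instance-id/environment to every log record,
# reading the facts the bootstrap step wrote under /env.
import logging

def read_fact(name, default="unknown"):
    try:
        return open("/env/" + name).read().strip()
    except OSError:
        return default

context = {
    "region": read_fact("region"),
    "az": read_fact("availability-zone"),
    "instance": read_fact("instance-id"),
    "env": read_fact("environment"),
}

logging.basicConfig(
    format="%(asctime)s %(region)s %(az)s %(instance)s %(env)s %(message)s"
)
log = logging.LoggerAdapter(logging.getLogger("myapp"), context)

log.warning("cache miss rate is high")   # now you know which AZ is hurting
```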

Slide 69

Slide 69 text

Server Dies During Deployment? Let's Just Stop Everything!

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

Cope with Failure at All Levels
Problem
Deployment scripts didn't account for a server being replaced mid-deployment. This would stall deployments completely. Oops.
Solution
Check server state throughout the process and move on if it's been killed. Make sure you can cope with failure not just in your infrastructure, but in any scripts or tools you use to manage that infrastructure.
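A sketch of that fix: before each per-host step, re-check the instance's state with the API and skip it if it has been replaced, instead of stalling the whole run. deploy_to() is a hypothetical stand-in for the real per-host deploy work:

```python
# Sketch: tolerate instances dying mid-deploy. Re-check state before each
# step and skip anything that is no longer running.
import boto3
import botocore.exceptions

ec2 = boto3.client("ec2", region_name="us-east-1")

def deploy_to(instance_id):
    print("deploying to", instance_id)   # hypothetical per-host deploy step

def still_running(instance_id):
    try:
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    except botocore.exceptions.ClientError:
        return False    # the ID is no longer known to the API: treat it as gone
    states = [i["State"]["Name"] for r in reservations for i in r["Instances"]]
    return states == ["running"]

def deploy(instance_ids):
    for instance_id in instance_ids:
        if not still_running(instance_id):
            print("skipping", instance_id, "- replaced mid-deploy")
            continue
        deploy_to(instance_id)
```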

Slide 72

Slide 72 text

Private Network? Nah, Let's Just Use Public IPs.

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

Use Private Network
Problem
Didn't use VPC, so using internal IPs was painful. We just used the external public IPs instead. It works, but it's much more difficult to secure and manage. Oops.
Solution
Migrate to VPC. Migrating after the fact was difficult; use it from the start and save yourself the pain. VPC lets you have egress firewall rules, change things on the fly, specify network ACLs, etc. New accounts have no choice, so this may be moot.

Slide 75

Slide 75 text

Deploying a New Application? Sorry, You Hit Your Limit.

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Be Aware of Cloud Limitations
Problem
AWS has pre-defined service limits. These are not clearly displayed unless you know where to look*. The first time you'll see the error is when trying to perform an action which you can no longer perform. Oops.
Solution
Be aware of the built-in limits so you can request an increase ahead of time, before you start putting things to use in production. Other limits include the scalability of ELBs. If you're expecting heavy traffic, you need to pre-warm your ELBs by injecting traffic beforehand, or contact AWS to pre-warm them for you (preferred). You want to learn this lesson before you get the critical traffic!
* http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html

Slide 78

Slide 78 text

Was It All Worth It? (Hint: I'm Slightly Biased)

Slide 79

Slide 79 text

Worth It?
● Can now handle growth in a very organic fashion.
● No actionable alert in... well... I can't remember.
● When things go wrong, instances kill themselves and we get a fresh instance with a known-good configuration.
● Deployments are not as dangerous; we can deploy many times a day and roll back easily, so they've become routine instead of "OK, everyone stop what you're doing, we're going to deploy something".

Slide 80

Slide 80 text

Totally Worth It
● Much lower cost than before.
● Spinning up a new application/environment used to take days. Now it takes ~15 minutes.
● More freedom to prototype and play with changes.
● Easy to spin up a new region/environment for a few hours to play with settings (with minimal cost, and completely isolated from your current environment).
  ● Something you can't do with metal unless you already have the hardware prepared and ready.
● Developers can have their own personal prod clone to develop with, which means no surprises when moving to production.

Slide 81

Slide 81 text

Useful Resources
● https://cloud.google.com/developers/#articles - Google Cloud Whitepapers and Best Practice Guides.
● http://www.rackspace.co.uk/whitepapers - Rackspace Whitepapers and Guides.
● http://azure.microsoft.com/blog - Microsoft Azure Blog.
● http://aws.typepad.com/ - AWS Blog.
● http://www.youtube.com/user/AmazonWebServices - Lots of AWS training videos, etc.

Slide 82

Slide 82 text

More Useful Resources
● http://www.slideshare.net/AmazonWebServices - All slides and presentations from the AWS conferences. Lots of useful training stuff for free.
● http://netflix.github.com - Netflix are the masters of AWS. They have lots of open source stuff to share.
● http://aws.amazon.com/whitepapers - Oh so many papers from AWS on everything from security best practices, to financial services grid computing in the cloud.

Slide 83

Slide 83 text

Tooting My Own Horn
Read more about my AWS mishaps! wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started
Do I suck at presenting? Send your hate mail to [email protected]!
Say Hi on Twitter! @r_adams

Slide 84

Slide 84 text

Thanks! richadams.me/talks/srecon14