You got a couple Microservices, now what? Adding SRE to DevOps

This talk goes over the infrastructure needed to run Microservices in production by answering the following questions:

* Why do I want to run my software in Containers?
* What is a Kubernetes or Mesos?
* Am I going to need a DevOps or SRE team? What will they do?
* What will my Continuous Integration/Delivery look like?

Gonzalo Maldonado

November 16, 2016

Transcript

  1. You got a couple Microservices, now what? Adding SRE to

    DevOps Gonzalo Maldonado - MustWin
  2. Microservice Honeymoon ^ Your microservice saved your homepage ^ Everyone

    loves working on the microservice 2 months later ^ Is it still a microservice? ^ Why are we adding new stuff to the monolith? Can we get rid of ticket driven deployments? ^ What makes deploying a microservice so hard? ^ Where can we run this? ^ Monoliths seemed easier to maintain! ^ Datacenter 4.0 ^ Dude, where's my container? ^ The promised land Sysadmin -> DevOps -> SRE ^ The SRE-cret Sauce ^ Resource & Container Management (Schedulers) ^ Service Discovery (Consul, Skydns & Etcd) TOC —Sysadmin -> DevOps -> SRE —Microservice Honeymoon —2 months later —Meanwhile your team is doing ticket driven deployments. —The SRE-cret Sauce —References
  3. Sysadmin -> DevOps -> SRE —SysAdmin: Manages 1 or 2

    services manually. —DevOps Team: Manages ~10 services semi- programmatically. —SRE Team: Manages 100-1K services fully programmatically.
  4. Sysadmin -> DevOps -> SRE (Tech Stack) —SysAdmin: Bash, Perl

    or Python Scripts —DevOps Team: Chef, Puppet —SRE Team: Mesos, Swarm, Kubernetes, Consul, Vault
  5. We don't have 100 services, why should we care about

    the SRE tech stack? Because this stack: —Saves your team time configuring and deploying a service —Allows your engineering team to grow (a single engineer will be able to manage a couple dozen services)
  6. We don't have 100 services, why should we care about

    the SRE tech stack? Because this stack: —Prevents you from having to rewrite your infrastructure code as your app scales —Gives you elastic resources (saves you money on AWS).
  7. We don't have 100 services, why should we care about

    the SRE tech stack? —Because it makes deploying Microservices as easy as getting a Heroku app up (and you used to love microservices).
  8. When doing Microservices gets hard

  9. The Microservice Honeymoon: how a microservice saved your homepage Microservices

    are awesome.
  10. The Microservice Honeymoon: how a microservice saved your homepage —Your

    page loads decreased from 3 seconds to 20ms (Go is so fast!)
  11. The Microservice Honeymoon: how a microservice saved your homepage —Hacker

    News spikes are no longer a big deal (we're elastic!)
  12. The Microservice Honeymoon: how a microservice saved your homepage —Everyone

    loves working on the Microservice (It's only 500 lines!)
  13. 2 months later...

  14. 2 months later... —If it has 2K lines of code,

    is it still a microservice?
  15. 2 months later... —Why are people still adding stuff to

    the monolith? —The code is already there and they didn't want to rewrite it (duh.) —Debugging things is getting harder (You need to test in multiple places) —Getting a new microservice to prod is hard! (! This.)
  16. Why is creating new Microservices so hard now? (monoliths felt

    easier) "Awesome analogy by @timallenwagner: monolithic architecture=carrying a 7ft beach ball, microservice=carrying 200 loose marbles"
  17. Why is creating new Microservices so hard now? (monoliths felt

    easier) —Configuration Management (You have to repeat recipes) —Service-inter-dependency-updates (You can't change a service address or port without affecting other services) —Credentials cannot be shared —Snowflake Runtime Environments (Can't run node.js code on the JVM box)
  18. Meanwhile, your team is doing ticket driven deployments —Deploys have

    become more complicated. When there was only a Monolith, you had only one deploy and one box.
  19. Meanwhile, your team is doing ticket driven deployments —It has

    gotten to the point where your team has decided they "need a ticket" for each deploy
  20. Where can we run this? Your Sys Admin asks... —If

    you're typing apt-get to get a new environment up, you're doing something wrong. —Chef, Puppet, Ansible are good replacements, but there's something better you probably already use on your dev machine.
  21. Your Datacenter has to change

  22. Datacenter 1.0 1 "How do we use these machines?" "Can

    we automate?" "How can we integrate?" 1 http://www.slideshare.net/SebastianWeigand/containers-and-customers-55262844
  23. Datacenter 2.0 1 "We need bigger computers" "We need a

    microservice" "We need a SysAdmin" 1 http://www.slideshare.net/SebastianWeigand/containers-and-customers-55262844
  24. Datacenter 3.0 1 "We need some VMs." "We need microservices"

    "We need IT" 1 http://www.slideshare.net/SebastianWeigand/containers-and-customers-55262844
  25. Datacenter 3.5 1 "We have a lot of VMs" "We

    have lots of microservices" "We need DevOps" 1 http://www.slideshare.net/SebastianWeigand/containers-and-customers-55262844
  26. Datacenter 4.0 1 "We need to manage our VMs" "We

    need to manage our microservices" "We need SREs" 1 http://www.slideshare.net/SebastianWeigand/containers-and-customers-55262844
  27. You already heard about docker and why using containers that

    share OS resources is more efficient than using full virtual machines. But what else does docker give you? Dude, where is my container? Virtual Machines vs Docker
  28. Dude, where is my container? What else does docker give

    you? * Contained instances (You can run multiple runtimes on one box) * Incremental images. (You can use an existing image as a base) * Immutable Instances (Your images are stateless)
  29. And this gets us to The Lean Staging $ git

    commit -am "The new cool feature"
  30. The Lean Staging $ git commit -am "The new cool

    feature" $ git push
  31. The Lean Staging $ git commit -am "The new cool

    feature" $ git push Running CI ...........................
  32. The Lean Staging $ git commit -am "The new cool

    feature" $ git push Running CI ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅ CI done. Your branch is available at http://super-tesla.thunderdomes.co
  33. What do we need to get there? We need Site

    Reliability Engineers
  34. What do we need to get there? And what are

    those SRE guys going to build to achieve that?
  35. What do we need to get there? What's the SRE-cret

    sauce? —Code —Servers or The Cloud —A CI service —A deployment system
  36. What do we need to get there? Aka. The SRE-cret

    sauce —Code (We already have that) —Servers or The Cloud (Pick AWS, GCP or Azure) —A CI service (Pick Jenkins, Travis or CircleCI) —Deployment & Monitoring systems (Let's focus on this)
  37. The SRE-cret sauce. Those Deployment systems will do the following

    a. Container Management b. Service Discovery c. Configuration Management d. Authentication & Authorization
  38. This is what you want. Now that you have discovered

    Docker, you want to use it in production. While you could run all your containers on a single box, this would prevent you from scaling horizontally, and you would need downtime to add more memory to that box. Container Management
  39. Like many things in the tech world, Google was one

    of the early adopters of Schedulers. Schedulers are systems in charge of managing the cluster resources by telling applications when to run. Container Management: Enter the scheduler Architectures presented in the white-paper concerning Google's Omega Scheduler.
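To make "managing the cluster resources" concrete, here is a minimal sketch of a first-fit placement loop. It is a toy model, not how Omega, Mesos, Kubernetes, Swarm or Nomad actually work; the `Node`, `Task` and `schedule` names and the CPU/memory fields are assumptions made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A machine in the cluster with some free capacity (hypothetical model)."""
    name: str
    free_cpu: float
    free_mem_mb: int
    tasks: list = field(default_factory=list)


@dataclass
class Task:
    """A container that asks for a fixed amount of CPU and memory."""
    name: str
    cpu: float
    mem_mb: int


def schedule(tasks, nodes):
    """First-fit placement: put each task on the first node with enough room."""
    placements = {}
    for task in tasks:
        for node in nodes:
            if node.free_cpu >= task.cpu and node.free_mem_mb >= task.mem_mb:
                # Reserve the resources so later tasks see the reduced capacity.
                node.free_cpu -= task.cpu
                node.free_mem_mb -= task.mem_mb
                node.tasks.append(task.name)
                placements[task.name] = node.name
                break
        else:
            # No node had capacity: the task stays pending.
            placements[task.name] = None
    return placements
```

Real schedulers add much more on top of this (constraints, priorities, rescheduling on failure), but the core idea is the same: the scheduler, not a person, decides which box runs what.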
  40. Container Management: Scheduler Options —Mesosphere DCOS (Based on Apache Mesos)

    —Docker Swarm —Kubernetes —Nomad
  41. Each scheduler option has its own pros and cons and

    you will need to pick the one that best fits your team's needs.
  42. Container Management: Scheduler Options More info here: https://medium.com/@mustwin/a-handy-guide-to-the-mesos-kubernetes-swarm-jungle-ad6bc086c736#.6ji95fm7e

  43. Service Discovery Service discovery is a mechanism by which, when

    a new service instance is added, the rest of the services detect the change automatically.
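The mechanism can be sketched as an in-memory registry that notifies watchers whenever an instance is added or removed. This is only an illustration of the idea; `ServiceRegistry` and its methods are hypothetical and do not correspond to the API of etcd, Consul or SkyDNS.

```python
from collections import defaultdict


class ServiceRegistry:
    """Toy registry: services register addresses, watchers get notified."""

    def __init__(self):
        self._instances = defaultdict(set)
        self._watchers = defaultdict(list)

    def watch(self, service, callback):
        """Call `callback(addresses)` every time `service` changes."""
        self._watchers[service].append(callback)

    def register(self, service, address):
        self._instances[service].add(address)
        self._notify(service)

    def deregister(self, service, address):
        self._instances[service].discard(address)
        self._notify(service)

    def lookup(self, service):
        """Return the currently known addresses in a stable order."""
        return sorted(self._instances[service])

    def _notify(self, service):
        for callback in self._watchers[service]:
            callback(self.lookup(service))
```

A client that called `watch("homepage", update_upstreams)` would see every new homepage instance without anyone editing its configuration by hand.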
  44. Service Discovery: Options Load balancer + Highly available Storage. Using

    a load balancer like NGINX/HAProxy + etcd you can update service registrations dynamically. The Load Balancer takes care of DNS resolutions.
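As a rough sketch of the "update service registrations dynamically" step, the function below renders an HAProxy `backend` section from a list of instance addresses. Reading those addresses from etcd and reloading HAProxy are left out; the function name, the server naming scheme and the choice of `balance roundrobin` are assumptions for this example.

```python
def render_backend(service, addresses):
    """Render an HAProxy backend section for the given instance addresses."""
    lines = [f"backend {service}", "    balance roundrobin"]
    for index, address in enumerate(sorted(addresses), start=1):
        # `check` asks HAProxy to health-check the instance.
        lines.append(f"    server {service}-{index} {address} check")
    return "\n".join(lines) + "\n"
```

In a real setup something would watch the key/value store, call a function like this whenever the registrations change, write the result to the HAProxy configuration and trigger a reload.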
  45. Service Discovery: Options Etcd + Skydns SkyDNS performance is comparable

    to HAProxy, and it's easier to set up, although not as powerful
  46. Service Discovery: Options Consul Consul is a key/value & service

    registry with built in DNS support.
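With Consul's DNS interface, a registered service is looked up by name, following the pattern `[tag.]<service>.service[.datacenter].<domain>`, where the domain defaults to `consul`. The helper below only builds such a name; the function itself and the example service names are made up for illustration.

```python
def consul_dns_name(service, tag=None, datacenter=None, domain="consul"):
    """Build the DNS name under which Consul exposes a registered service."""
    labels = [tag, service, "service", datacenter, domain]
    # Skip the optional parts (tag, datacenter) when they are not given.
    return ".".join(label for label in labels if label)
```

A service would then connect to `consul_dns_name("homepage")` instead of a hard-coded address, and Consul's DNS answers with healthy instances.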
  47. Service Discovery: How to pick? a. Pick a scheduler *

    Kubernetes currently only supports etcd. * Mesos can use Etcd, Zookeeper or Consul. b. If you're using Consul you're done. c. For etcd: * Use HAProxy if you're already using it * Otherwise just use Skydns and call it a day
  48. Configuration Management We're going to assume your Microservices are already

    12 Factor apps [3], where: * Service configuration happens in Environment variables * Backing services are attached resources (Service Discovery FTW) [3] https://12factor.net/
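A minimal sketch of what "configuration happens in environment variables" looks like inside a service. The variable names (`DATABASE_URL`, `PORT`, `DEBUG`) and the defaults are assumptions for this example, not something the talk prescribes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    database_url: str
    port: int
    debug: bool


def load_config(env=None):
    """Read all settings from the environment (or a dict, for tests)."""
    env = os.environ if env is None else env
    return Config(
        database_url=env["DATABASE_URL"],  # required: fail fast if missing
        port=int(env.get("PORT", "8080")),
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

Because the backing database is just a URL in the environment, the scheduler can point the same image at a different database per environment without rebuilding it.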
  49. Configuration Management (Options) Most schedulers support this out of the

    box, with the caveat that most don't provide Secret management out of the box (K8s does).
  50. Secret Management (Vault) For secret management we cannot recommend Vault

    enough, because it provides: —Secure secret storage —Dynamic Secrets —Leasing and Renewal —Revocation —Auditing —Etc.
  51. Other things you will need —Monitoring: (Prometheus, Nagios, InfluxDB, Grafana)

    —An authentication Service or provider
  52. To Recap. To build The Lean Staging we will need to:

    —Set up a Scheduler (Kubernetes) —Set up a CI System (Drone, Jenkins or Travis) —Hook your GitHub/GitLab to that CI —Change the CI configuration to trigger a Container build & Deploy —Have fun!
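The "trigger a Container build & Deploy" step could look roughly like the sketch below, which only assembles the commands a CI job would run: build an image tagged with the commit, push it, and point a Kubernetes Deployment at the new tag. The registry host, the naming convention (deployment and container named after the service) and the function itself are assumptions for this example.

```python
def deploy_commands(registry, service, commit_sha, deployment=None, container=None):
    """Return the build, push and rollout commands for one commit."""
    image = f"{registry}/{service}:{commit_sha}"
    # Assumed convention: deployment and container are named after the service.
    deployment = deployment or service
    container = container or service
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
        ["kubectl", "set", "image", f"deployment/{deployment}", f"{container}={image}"],
    ]
```

A CI job could pass each command to `subprocess.run(cmd, check=True)` so that a failed build or push stops the deploy.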
  53. Gitlab made a really good proof of concept of it

    https://about.gitlab.com/2016/11/14/idea-to-production/
  54. Recommended reading for SRE Teams: Distributed Systems fundamentals: —Notes on

    Distributed Systems for Young Bloods - Jeff Hodges —You Can’t Sacrifice Partition Tolerance - Coda Hale —The Raft Consensus Algorithm - Diego Ongaro
  55. Recommended reading for SRE Teams: Microservices —Building Microservices - Sam

    Newman SRE —Site Reliability Engineering - Beyer, et al. —Continuous Delivery - Jez Humble —The Principles of Product Development Flow - Reinertsen
  56. References

    —https://medium.com/@mustwin/a-handy-guide-to-the-mesos-kubernetes-swarm-jungle-ad6bc086c736#.a2mymzvsi
    —https://medium.com/@ArmandGrillet/comparison-of-container-schedulers-c427f4f7421#.uxtk80w35
    —https://about.gitlab.com/2016/11/14/idea-to-production/
    —https://about.gitlab.com/2016/09/14/gitlab-live-event-recap/
    —https://signalfx.com/library/slides-operationalizing-docker-scale-microservices-orchestration-zenefits/
    —https://medium.com/@mattheath/a-long-journey-into-a-microservice-world-a714992d2841#.jluhzvs34
    —https://engineering.zenefits.com/2016/09/sauron-ci-automation-at-zenefits/
    —https://news.ycombinator.com/item?id=12880917
    —http://patrobinson.github.io/2016/11/05/docker-in-production/
    —https://thehftguy.wordpress.com/2016/11/01/docker-in-production-an-history-of-failure/
    —https://medium.com/google-cloud/a-survival-guide-for-containerizing-your-infrastructure-part-1-why-switch-8e8dee9fc66#.sr5nct3p3
    —https://www.youtube.com/watch?v=WiCru2zIWWs
    —https://speakerdeck.com/mattheath/microservices-and-go-goto-copenhagen-2016
  57. Questions? Slides will be posted at medium.com/@mustwin