
Docker Swarm on Production: A Logbook


Sharing my experience using Docker Swarm in production since March 2017.

Marcelo Pinheiro

December 14, 2017

Transcript

1. $ whoami • Fireman / Problem Solver / Programmer since 2000 • Ruby, Python, Golang, Java, Clojure, C#, Classic ASP, PHP, Node.js, Erlang and others • Fought, made coffee, negotiated deadlines • DevOps Engineer @ Work & Co
2. Work & Co: How do we work? • We only make digital products & services. • Prototypes, not presentations. • One team: Client + Work & Co. • Fewer people. More senior people. • Good products require good development. • We’re hiring! Find me. :)
3. Embracing Infrastructure • We mainly use our customers’ infrastructure in our projects. • Common issues: • Legacy datacenters • Bureaucratic culture • Resistance (sometimes aversion) to emergent technologies
4. Embracing Infrastructure • Our lessons: • The elapsed time to provision development / homologation environments negatively impacts our prototype-based deliveries • We prefer to spend that time configuring production infrastructure instead • Containerization: the majority of our customers have never had contact with the concept
5. Embracing Infrastructure • Embrace development / QA / UAT / homologation environments and the related infrastructure. • Developers want to develop, not do server things • Give developers an automated path to create and deploy projects with ease and quick feedback
6. Venice: Work & Co in-house solution • We have multiple teams per project across New York, Portland, Sao Paulo, Rio de Janeiro, Belgrade and other cities. • Common questions for each new project: • Hire / reallocate a DevOps guy? • A new CI / CD server? • A pipeline setup for each new application • A tedious and repetitive step
7. Venice: Work & Co in-house solution • We don’t need or want to maintain Jenkins / Go / CircleCI / whatever CI / CD solution because: • Scope-based projects • After our final delivery, customers traditionally embrace this responsibility using their own CI / CD solution • CI / CD servers can be totally different between our projects
8. Venice: Work & Co in-house solution • Why not develop a simple solution that fits our philosophy? • Fast feedback from PRs • Easy way to see build logs • Easy way to deploy a specific branch to any environment • Automated deployments for a specific environment / branch • Developers know how to generate artifacts from their projects; give them this power • Docker, Docker, Docker! (To run locally and distribute releases as images)
9. Venice: Work & Co in-house solution • Our solution: Venice. • Node.js application • RabbitMQ workers • Ansible recipes • A lot of conventions • Docker Compose • Folder structure • Configuration file (venice.json)
10. Docker Swarm: why we chose it • We studied some solutions: • Docker Swarm • Kubernetes • Amazon EC2 Container Service
11. Docker Swarm: why we chose it • Amazon EC2 Container Service goodies: • Experience from previous projects • Rock solid • Tradeoffs: • Complex to orchestrate new deployments (task definitions, tasks) • Not a bleeding-edge Docker version
12. Docker Swarm: why we chose it • Kubernetes goodies: • Reliable • Cloud agnostic • Tradeoffs: • Complexity • High learning curve, not applicable to our urgent needs • We rely a lot on our Docker Compose standards, which implies some kind of transformation to create a Kubernetes configuration file
13. Docker Swarm: why we chose it • Docker Swarm goodies: • Cloud agnostic • Swarm stacks fit our needs very well • Tradeoffs: • At the time of research, an experimental feature • When something goes wrong, be prepared for the worst.
14. Docker Swarm: why we chose it • Our final architecture on AWS: • Classic ELB • EC2 instances (c4.large for managers, m4.2xlarge for workers) • ECR to store Docker images • Traefik as load balancer for containers • Docker Swarm 1.13 (at the time of launch), today 17.09.0-ce • $0.02 tip: AWS internal traffic is much cheaper and brutally fast; consider it • Terraform (provisioning) • Ansible (configuration management) • Sysdig Cloud (monitoring)
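For context, the Swarm part of that setup boils down to a handful of standard Docker commands run by the Ansible recipes; a minimal sketch with placeholder IP addresses and tokens, not the actual recipes:
$ docker swarm init --advertise-addr 10.0.1.10              # on the first manager (placeholder private IP)
$ docker swarm join-token manager                           # prints the join command for additional managers
$ docker swarm join --token <manager-token> 10.0.1.10:2377  # on the remaining manager nodes
$ docker swarm join --token <worker-token> 10.0.1.10:2377   # on each worker node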
15. Docker Swarm Stacks • How do you deploy a container in a Docker Swarm cluster? With a Service. • Docker Swarm Services are a definition of a container you want to run in the cluster • You can run a service on all Swarm servers or on specific servers using constraints • For example: Traefik
16. Docker Swarm Stacks Connect to one of your Docker Swarm managers and type:
$ docker network create --driver overlay --attachable --subnet 10.0.0.0/16 traefik-net
$ docker service create --mode global --name traefik --constraint 'node.role==manager' --publish 80:80 --publish 8080:8080 --publish 443:443 --network traefik-net traefik:1.4.5
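Once the service is created, it can be checked with the standard service subcommands (not in the original deck):
$ docker service ls            # all services, with mode and replica counts
$ docker service ps traefik    # where the traefik tasks were scheduled and their state
$ docker service logs traefik  # aggregated logs from all traefik tasks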
17. Docker Swarm Stacks • How do you deploy a group of containers? With Stacks. • Docker Swarm Stacks are a definition of a group of services • Docker Compose version 3 was developed specifically to support deployment of stacks using a docker-compose.yml file (or any other file name you want)
18. Docker Swarm Stacks
version: '3.0'

volumes: {}

networks:
  app:
    driver: overlay
  traefik-net:
    external: true

services:
  web-server:
    deploy:
      labels:
        - venice.project.branch=master
        - venice.project.environment=cd
        - venice.project.name=venice-test
        - venice.project.tag=master-1.0.0-build.72
        - traefik.docker.network=traefik-net
        - traefik.frontend.rule=Host:master.cd.venice-test.on.work.co
        - traefik.frontend.passHostHeader=true
        - traefik.port=5000
      mode: replicated
      placement:
        constraints:
          - node.role == worker
      replicas: 1
    environment:
      RUNNING_ENV: cd
    image: 332243152968.dkr.ecr.us-east-1.amazonaws.com/venice-test/web-server:master-1.0.0-build.72
    networks:
      - app
      - traefik-net
    ports:
      - '5000'
19. Docker Swarm Stacks Connect to one of your Docker Swarm managers and type:
$ docker deploy --compose-file stack.yml my_project
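The same deployment is more commonly written with the docker stack subcommands, which also give you a way to check the result (standard Docker CLI, not from the original deck):
$ docker stack deploy --compose-file stack.yml my_project
$ docker stack ls                     # stacks and the number of services in each
$ docker stack services my_project    # services in the stack, with replica counts
$ docker stack ps my_project          # individual tasks and the nodes they run on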
20. Docker Swarm on Production: A Logbook • Version 17.03.0~ce-0 • Rock solid. • No problems with Docker CE upgrades: • 17.03.1~ce-0 • 17.03.2~ce-0
21. Docker Swarm on Production: A Logbook • Version 17.05.0~ce-0 • July 22: a new Swarm election was triggered, and one of the managers hit a memory peak during this period and got stuck. • Root cause: https://github.com/moby/moby/issues/29087 • July 27: another Swarm election was triggered and another manager went down with the same behavior • July 28: the remaining Swarm manager went down for the same reason as the others
22. Docker Swarm on Production: A Logbook • Version 17.05.0~ce-0 • Side effect: deploys to the Swarm cluster during this period started to show bizarre behaviors: • Old stack definitions conflicting with the newest ones, causing services to start two containers instead of one (e.g. master-1.0.0 vs master-1.0.1) • Traefik consequently returning HTTP 502 when requesting some containers • Removing / adding Swarm managers and workers to the cluster did not work • Known issue: https://github.com/moby/moby/issues/32195 • Upgraded to 17.06.0~ce-0
23. Docker Swarm on Production: A Logbook • Version 17.06.0~ce-0 • August 18: Traefik started to return HTTP 502 from all containers running on a specific Swarm worker server • After upgrading to 17.06.1, Swarm servers failed to join the cluster. One. By. One. • Tried to provision a new EC2 instance and join it to the cluster. Failed • During removal, other services started to fail with the same behavior • Known issue: https://github.com/moby/moby/issues/31839
24. Docker Swarm on Production: A Logbook • Lessons learned: • Spend time monitoring CPU / RAM / Load Average • Be aggressive configuring alerts to detect any strange behavior based on the applications you run in Swarm (sudden CPU / memory usage peaks) • Log level configured to DEBUG on Swarm nodes
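One way to get debug-level daemon logs on every node is the daemon configuration file; a minimal sketch, assuming a standard Linux install (the deck does not show how this was actually configured):
$ cat /etc/docker/daemon.json
{
  "debug": true
}
$ sudo systemctl restart docker    # the daemon must pick up the new option; plan restarts carefully on Swarm managers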
25. Docker Swarm on Production: A Logbook • Version 17.09.0~ce-0 • Rock solid. • November 7: some services failed to deploy into the cluster. After diving into the Docker Swarm logs, found the reason: Docker’s internal IP allocation fails when a new service is deployed. Root causes: • Traefik overlay network created with subnet CIDR 10.0.0.0/24 -> 254 usable IPs, minus 1 allocated internally by Docker • 127 stacks + 128 services = 255 IPs in total • A Docker network can’t be updated on the fly; you need to recreate it from scratch
26. Docker Swarm on Production: A Logbook • Solution: recreate the Docker Swarm cluster from scratch, recreate the Traefik overlay network with CIDR 10.0.0.0/16, and add all stacks back again.
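The network part of that fix boils down to something like this (a sketch using standard Docker CLI commands; it assumes every stack attached to the old network has already been removed, since an overlay network in use cannot be deleted):
$ docker network inspect traefik-net --format '{{json .IPAM.Config}}'   # shows the configured subnet
$ docker network rm traefik-net
$ docker network create --driver overlay --attachable --subnet 10.0.0.0/16 traefik-net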
27. Docker Swarm on Production: A Logbook • Lessons learned: • Be careful when creating a Docker overlay network; spend some time to properly configure a subnet based on your growth • Pay special attention to network errors. An increased number tells a lot about Swarm inconsistencies • Same attention to CPU peaks; be more aggressive than you think you are • Create a Swiss Army knife to recover the Swarm cluster (Ansible, Bash, Python scripts, whatever)
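The deck does not show what that Swiss Army knife contains; purely as an illustration, a first building block could be a small Bash check over the Docker CLI (assuming a Docker version recent enough to support --format on docker node ls) that flags unhealthy nodes:
#!/usr/bin/env bash
# swarm-health.sh: hypothetical helper, not Venice's actual tooling.
# Prints a warning for every node that is not Ready, or whose manager
# status is neither Leader nor Reachable.
set -euo pipefail

docker node ls --format '{{.Hostname}} {{.Status}} {{.ManagerStatus}}' |
  awk '$2 != "Ready" || ($3 != "" && $3 != "Leader" && $3 != "Reachable") { print "WARN:", $0 }'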
28. Docker Swarm on Production: A Logbook • Is Docker Swarm reliable after all? • In my opinion, yes. Maybe not for you; for us it fits well • Adding new Docker Swarm servers is very, very easy • Operation is quite simple. Read the docs • 100~150 developers / QA engineers / PMs / customer stakeholders use it every day