
Docker Swarm on Production: A Logbook

Sharing my experience using Docker Swarm in production since March 2017.


Marcelo Pinheiro

December 14, 2017

Transcript

  1. $ whoami • Fireman / Problem Solver / Programmer since 2000 • Ruby, Python, Golang, Java, Clojure, C#, Classic ASP, PHP, Node.js, Erlang and others • Fought, made coffee, negotiated deadlines • DevOps Engineer @ Work & Co
  2. Work & Co: how we work • We only make digital products & services. • Prototypes, not presentations. • One team: Client + Work & Co. • Fewer people. More senior people. • Good products require good development. • We’re hiring! Find me. :)
  3. Embracing Infrastructure • We mainly use the customer’s infrastructure in our projects. • Common issues: • Legacy datacenters • Bureaucratic culture • Resistance (sometimes aversion) to emergent technologies
  4. Embracing Infrastructure • Our lessons: • The elapsed time to provision development / homologation environments negatively impacts our prototype-based deliveries • We prefer to spend that time configuring production infrastructure instead • Containerization: most customers have never had contact with the concept
  5. Embracing Infrastructure • Embrace development / QA / UAT / homologation environments and the related infrastructure. • Developers want to develop, not do server things • Give developers an automated path to create and deploy projects with ease and quick feedback
  6. Venice: Work & Co in-house solution • We have multiple teams per project across New York, Portland, Sao Paulo, Rio de Janeiro, Belgrade and other cities. • Common questions for each new project: • Hire / reallocate a DevOps guy? • A new CI / CD server? • For each new application, a pipeline setup • A tedious and repetitive step
  7. Venice: Work & Co in-house solution • We don’t need or want to maintain Jenkins / Go / CircleCI / whatever CI / CD solution because: • Our projects are scope-based • After our final delivery, customers traditionally take over this responsibility using their own CI / CD solution • CI / CD servers can be totally different between our projects
  8. Venice: Work & Co in-house solution • Why not develop a simple solution that fits our philosophy? • Fast feedback from PRs • An easy way to see build logs • An easy way to deploy a specific branch to any environment • Automated deployments for a specific environment / branch • Developers know how to generate artifacts from a project, so give them that power • Docker, Docker, Docker! (To run locally and distribute releases as images)
  9. Venice: Work & Co in-house solution • Our solution: Venice. • Node.js application • RabbitMQ workers • Ansible recipes • A lot of conventions • Docker Compose • Folder structure • Configuration file (venice.json)
  10. Docker Swarm: why we chose it • We studied some solutions: • Docker Swarm • Kubernetes • Amazon EC2 Container Service
  11. Docker Swarm: why we chose it • Amazon EC2 Container Service goodies: • Experience from previous projects • Rock solid • Tradeoffs: • Complex to orchestrate new deployments (task definitions, tasks) • Not a bleeding-edge Docker version
  12. Docker Swarm: why we chose it • Kubernetes goodies: • Reliable • Cloud agnostic • Tradeoffs: • Complexity • A high learning curve not applicable to our urgent needs • We rely a lot on our Docker Compose standards, which implies some kind of transformation to create a Kubernetes configuration file
  13. Docker Swarm: why we chose it • Docker Swarm goodies: • Cloud agnostic • Swarm stacks fit our needs very well • Tradeoffs: • At the time of research, an experimental feature • When something goes wrong, be prepared for the worst.
  14. Docker Swarm: why we chose it • Our final architecture on AWS: • Classic ELB • EC2 instances (c4.large for managers, m4.2xlarge for workers) • ECR to store Docker images • Traefik as load balancer for containers • Docker Swarm 1.13 (at the time of launch), today 17.09.0-ce • $0.02 tip: AWS internal traffic is much cheaper and brutally fast; consider it • Terraform (provisioning) • Ansible (configuration management) • Sysdig Cloud (monitoring)
  15. Docker Swarm Stacks • How do you deploy a container in a Docker Swarm cluster? With a Service. • A Docker Swarm Service is the definition of a container you want to run in the cluster • You can run a service on all Swarm servers or on a specific server using constraints • For example: Traefik
  16. Docker Swarm Stacks Connect to one of your Docker Swarm managers and type:

$ docker network create --driver overlay --attachable --subnet 10.0.0.0/16 traefik-net

$ docker service create --mode global --name traefik --constraint 'node.role==manager' --publish 80:80 --publish 8080:8080 --publish 443:443 --network traefik-net traefik:1.4.5
  17. Docker Swarm Stacks • How do you deploy a group of containers? With Stacks. • A Docker Swarm Stack is the definition of a group of services • Docker Compose file format version 3 was specifically developed to support deploying stacks from a docker-compose.yml file (or any other file name you want)
  18. Docker Swarm Stacks

version: '3.0'

volumes: {}

networks:
  app:
    driver: overlay
  traefik-net:
    external: true

services:
  web-server:
    deploy:
      labels:
        - venice.project.branch=master
        - venice.project.environment=cd
        - venice.project.name=venice-test
        - venice.project.tag=master-1.0.0-build.72
        - traefik.docker.network=traefik-net
        - traefik.frontend.rule=Host:master.cd.venice-test.on.work.co
        - traefik.frontend.passHostHeader=true
        - traefik.port=5000
      mode: replicated
      placement:
        constraints:
          - node.role == worker
      replicas: 1
    environment:
      RUNNING_ENV: cd
    image: 332243152968.dkr.ecr.us-east-1.amazonaws.com/venice-test/web-server:master-1.0.0-build.72
    networks:
      - app
      - traefik-net
    ports:
      - '5000'
  19. Docker Swarm Stacks Connect to one of your Docker Swarm managers and type:

$ docker stack deploy --compose-file stack.yml my_project
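Once the stack is up, you can check what it created with the standard stack subcommands (a sketch; `my_project` and the `web-server` service name come from the example above):

```shell
# List the services the stack created and how many replicas are running
docker stack services my_project

# Show the individual tasks (containers) and which node each landed on
docker stack ps my_project

# Tail the logs of one of the stack's services
docker service logs my_project_web-server
```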
  20. Docker Swarm on Production: A Logbook • Version 17.03.0~ce-0 • Rock solid. • No problems with Docker CE upgrades: • 17.03.1~ce-0 • 17.03.2~ce-0
  21. Docker Swarm on Production: A Logbook • Version 17.05.0~ce-0 • July 22: a new Swarm election was triggered; one of the managers hit a memory peak during this period and got stuck. • Root cause: https://github.com/moby/moby/issues/29087 • July 27: another Swarm election was triggered and another manager went down with the same behavior • July 28: the remaining Swarm manager went down for the same reason as the others
  22. Docker Swarm on Production: A Logbook • Version 17.05.0~ce-0 • Side effect: deploys to the Swarm cluster during this period started to show bizarre behavior: • Old stack definitions conflicting with newer ones, causing services to start two containers instead of one (i.e. master-1.0.0 vs master-1.0.1) • Traefik consequently returning HTTP 502 when requesting some containers • Removing / adding Swarm managers and workers to the cluster did not work • Known issue: https://github.com/moby/moby/issues/32195 • Upgraded to 17.06.0~ce-0
  23. Docker Swarm on Production: A Logbook • Version 17.06.0~ce-0 • August 18: Traefik started to return HTTP 502 from all containers running on a specific Swarm worker server • After upgrading to 17.06.1, Swarm servers failed to join the cluster. One. By. One. • Tried to provision a new EC2 instance and join it to the cluster. Failed • During removal, other services started to fail with the same behavior • Known issue: https://github.com/moby/moby/issues/31839
  24. Docker Swarm on Production: A Logbook • Lessons learned: • Spend time monitoring CPU / RAM / load average • Be aggressive configuring alerts to detect any strange behavior in the applications you run on Swarm (sudden CPU / memory peaks) • Configure the log level to DEBUG on Swarm nodes
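One way to apply that last lesson, assuming Docker is managed by systemd and reads the default `/etc/docker/daemon.json` config path:

```shell
# Raise the Docker daemon log level to debug (overwrites daemon.json --
# merge by hand if you already have other settings in it)
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-level": "debug"
}
EOF

# Restart the daemon so the new log level takes effect
sudo systemctl restart docker
```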
  25. Docker Swarm on Production: A Logbook • Version 17.09.0~ce-0 • Rock solid. • November 7: some services failed to deploy into the cluster. After diving into the Docker Swarm logs, found the reason: Docker internal IP allocation fails when a new service is deployed. Root causes: • Traefik network driver configured with subnet CIDR 10.0.0.0/24 -> 254 IPs, minus 1 allocated internally by Docker • 127 stacks + 128 services = 255 IPs in total • A Docker network can’t be updated on the fly; you need to recreate it from scratch
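The subnet arithmetic above can be sanity-checked with Python's standard `ipaddress` module (the extra `- 1` models the one address the slide says Docker allocates internally; it is not an official Docker IPAM formula):

```python
import ipaddress

def usable_overlay_ips(cidr: str) -> int:
    """Host addresses in a subnet, minus network and broadcast addresses
    and the one address Docker allocates internally (per the slide above)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2 - 1

print(usable_overlay_ips("10.0.0.0/24"))  # 253 -- too few for ~255 allocations
print(usable_overlay_ips("10.0.0.0/16"))  # 65533
```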
  26. Docker Swarm on Production: A Logbook • Solution: recreate the Docker Swarm cluster from scratch and recreate the Traefik overlay network with CIDR 10.0.0.0/16, then bring all stacks back up.
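A rough sketch of that recovery, run on a manager (the compose file path and stack name are examples, not the author's actual scripts):

```shell
# Remove every stack so nothing holds an IP on the exhausted network
for stack in $(docker stack ls | awk 'NR > 1 { print $1 }'); do
  docker stack rm "$stack"
done

# Once the old network is released, recreate it with a /16 subnet
docker network rm traefik-net
docker network create --driver overlay --attachable --subnet 10.0.0.0/16 traefik-net

# Redeploy each stack from its compose file
docker stack deploy --compose-file stack.yml my_project
```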
  27. Docker Swarm on Production: A Logbook • Lessons learned: • Be careful when creating a Docker overlay network; spend some time properly sizing the subnet based on your growth • Pay special attention to network errors. An increased number tells a lot about Swarm inconsistencies • Same attention to CPU peaks; be more aggressive than you think you need to be • Create a swiss army knife to recover the Swarm cluster (Ansible, Bash, Python scripts, whatever)
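A minimal example of such a recovery helper (a sketch under the deck's setup, not the author's actual tooling):

```shell
#!/usr/bin/env bash
# swarm-triage.sh -- quick health check before deciding to rebuild the cluster
set -euo pipefail

# 1. Does this node still see a healthy set of managers?
docker node ls

# 2. Are any services missing replicas?
docker service ls

# 3. Which tasks on this node should be running?
docker node ps "$(hostname)" --filter 'desired-state=running'

# 4. Last resort: if manager quorum is lost, force a new single-manager
#    cluster from this node (destructive -- uncomment only when the other
#    managers are unrecoverable)
# docker swarm init --force-new-cluster
```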
  28. Docker Swarm on Production: A Logbook • Is Docker Swarm reliable after all? • In my opinion, yes. Maybe not for you; for us it fits well • Adding new Docker Swarm servers is very, very easy • Operation is quite simple. Read the docs • 100~150 developers / QA engineers / PMs / customer stakeholders using it every day