Slide 1

Container Orchestration and Management Systems: A Comparison from a Technical View
Harry Zhang, Member of #CNCF

Slide 2

The Scope of This Talk
• Kubernetes
  • by the Cloud Native Computing Foundation
• Docker 1.12+
  • by Docker Inc.
  • Compose + Swarm is effectively legacy, so it is not covered in this talk
• Mesos
  • by the Apache Software Foundation
  • only with Marathon; DC/OS is not included (the scope of the latter is larger)

Slide 3

Chapter 1: Core Idea and Architecture

Slide 4

Kubernetes
• Build the right things with containers by following its concepts and conventions
  • like a "Spring Framework" for the container ecosystem
• Design
  • master
    • api-server, scheduler, controller-manager
  • node
    • kubelet, kube-proxy
  • independent binaries
    • Pros: modular, transparent, manageable
    • Cons: a little complex to set up (1.4 is much better now)
  • network & volume plugins
  • driven by control loops

Slide 5

[Diagram: Kubernetes control flow. Master: api-server, etcd, scheduler. Nodes: kubelet SyncLoop, proxy]
1. Pod created

Slide 6

[Diagram, continued]
2. Pod object added

Slide 7

[Diagram, continued]
3.1 New pod object detected (scheduler)
3.2 Bind pod with node

Slide 8

[Diagram, continued]
4.1 Detected a pod bound to me (kubelet)
4.2 Start the containers in the pod
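The flow in slides 5-8 is the list-watch pattern: each component watches the api-server for objects it cares about and reacts. Below is a minimal, hypothetical Go sketch of a kubelet-style watcher; the `Pod`, `watch`, and channel-based event feed are illustrative stand-ins, not the real client-go API.

```go
package main

import "fmt"

// Pod is a minimal pod object; NodeName is set by the scheduler
// when it binds the pod to a node (step 3.2 above).
type Pod struct {
	Name     string
	NodeName string
}

// watch consumes pod events. Hypothetical: a real kubelet watches
// the api-server over HTTP, not an in-process channel.
func watch(events <-chan Pod, myNode string, start func(Pod)) {
	for pod := range events {
		if pod.NodeName == myNode { // step 4.1: bound to me?
			start(pod) // step 4.2: start its containers
		}
	}
}

func main() {
	events := make(chan Pod, 2)
	events <- Pod{Name: "web", NodeName: "node1"}
	events <- Pod{Name: "db", NodeName: "node2"}
	close(events)

	watch(events, "node1", func(p Pod) {
		fmt.Println("starting containers for pod", p.Name)
	})
}
```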

Slide 9

[Diagram: the full picture. Master: api-server, etcd, scheduler, controller-manager ControlLoop. Nodes: kubelet SyncLoop, proxy]
Objects: pod, replica, namespace, service, endpoint, job, deployment, volume, petset, …
Reconcile: desired world vs. real world

Slide 10

Tips: Control Theory*
• It's the basic model for (a minimal sketch follows):
  • Kubernetes controllers and all other event loops
  • the SwarmKit orchestrator
  • …
*Andrei, Neculai (2005). "Modern Control Theory – A historical Perspective"
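The reconcile pattern fits in a few lines of Go. This is a generic, hypothetical sketch, not actual Kubernetes or SwarmKit code: observe the real state, compare it with the desired state, and act to converge.

```go
package main

import (
	"fmt"
	"time"
)

// State is a simplified view of the world: how many replicas exist.
type State struct{ Replicas int }

// controlLoop runs a few reconcile iterations. A real controller
// loops forever and is usually woken by watch events, but each
// iteration is the same: observe, compare with desired, act.
func controlLoop(iterations int, desired State, observe func() State, act func(diff int)) {
	for i := 0; i < iterations; i++ {
		current := observe()
		if diff := desired.Replicas - current.Replicas; diff != 0 {
			act(diff) // create (diff > 0) or remove (diff < 0) replicas
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	world := State{Replicas: 0}
	controlLoop(3, State{Replicas: 2},
		func() State { return world },
		func(diff int) {
			world.Replicas += diff
			fmt.Println("reconciled to", world.Replicas, "replicas")
		})
}
```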

Slide 11

Docker 1.12+
• Built-in cluster support for Docker containers
  • powered by SwarmKit
• SwarmKit Design
  • built-in data store
  • manager
    • several components built into one binary
    • control-loop driven
  • worker
    • uses a pull model to connect to the manager
WARNING: SwarmKit is still an early-stage project; expect this part to change

Slide 12

[Diagram: SwarmKit Manager components: API, Store, Orchestrator, Allocator, Scheduler, Dispatcher; a `$ docker service create` command arrives at the API]
• API: accepts commands from the client
• Creates objects in a raft-based in-memory store
  • github.com/coreos/etcd/raft for consensus
  • github.com/hashicorp/go-memdb for in-memory object storage
  • state, cluster, node, service, task, network, …
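To make the store side concrete, here is a small go-memdb sketch. The go-memdb calls (schema, transactions, index lookup) are the library's real API, but the `Service` type and the single-table layout are illustrative only, not SwarmKit's actual schema.

```go
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Service is a toy stand-in for a SwarmKit service object.
type Service struct {
	ID       string
	Replicas int
}

func main() {
	// One "service" table indexed by ID; SwarmKit keeps similar
	// tables for node, task, network, and its other objects.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"service": {
				Name: "service",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Write transaction: create the object. Per the slide, SwarmKit
	// also replicates each change to peer managers through raft.
	txn := db.Txn(true)
	if err := txn.Insert("service", &Service{ID: "web", Replicas: 2}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Read transaction: look the object up by its ID index.
	raw, err := db.Txn(false).First("service", "id", "web")
	if err != nil {
		panic(err)
	}
	fmt.Printf("found: %+v\n", raw.(*Service))
}
```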

Slide 13

[Diagram: SwarmKit Manager with the Orchestrator highlighted; a Service (replica=2) fans out to two Tasks, and the loop checks whether replica=2 still holds]
• Creates Tasks from a Service object
  • Task: "start a container", etc.
• Reconcile loop for Service objects (sketched below)
  • Control Theory again
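A minimal, hypothetical sketch of the fan-out step (not SwarmKit's actual code): given a Service with replicas=2, diff the desired task count against the existing tasks and create or remove the difference.

```go
package main

import "fmt"

// Service and Task are toy stand-ins for SwarmKit objects.
type Service struct {
	Name     string
	Replicas int
}

type Task struct{ Service string }

// reconcile adjusts the task list to match the service's desired
// replica count: "check if replica=2 or not", then act on the diff.
func reconcile(svc Service, tasks []Task) []Task {
	switch {
	case len(tasks) < svc.Replicas: // too few: create tasks
		for len(tasks) < svc.Replicas {
			tasks = append(tasks, Task{Service: svc.Name})
			fmt.Println("create task for", svc.Name)
		}
	case len(tasks) > svc.Replicas: // too many: remove tasks
		tasks = tasks[:svc.Replicas]
		fmt.Println("removed extra tasks for", svc.Name)
	}
	return tasks
}

func main() {
	svc := Service{Name: "web", Replicas: 2}
	tasks := reconcile(svc, nil) // creates 2 tasks
	fmt.Println("running tasks:", len(tasks))
}
```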

Slide 14

[Diagram: SwarmKit Manager with the Allocator highlighted; a Network Create request flows through it]
• Allocates IP addresses to Services and Tasks
  • (and will allocate volumes in the future)
  • a VIP and ports for each Service
  • an IP for every endpoint (veth pair) in the network the task is attached to
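In miniature, allocation is just handing out the next free address for the service VIP and each task endpoint. A hypothetical sketch only: the real SwarmKit allocator is built on libnetwork's IPAM and also handles release, reuse, and port conflicts.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// vipAllocator hands out IPv4 addresses sequentially from a subnet.
// Toy sketch: no bounds checking, release, or reuse.
type vipAllocator struct {
	base net.IP // network address, e.g. 10.0.0.0
	next uint32 // offset of the next free address
}

func (a *vipAllocator) allocate() net.IP {
	a.next++
	ip := make(net.IP, 4)
	binary.BigEndian.PutUint32(ip, binary.BigEndian.Uint32(a.base.To4())+a.next)
	return ip
}

func main() {
	_, subnet, _ := net.ParseCIDR("10.0.0.0/24")
	alloc := &vipAllocator{base: subnet.IP}

	// One VIP for the service, then one IP per task endpoint.
	fmt.Println("service VIP:", alloc.allocate()) // 10.0.0.1
	fmt.Println("task 1 IP:  ", alloc.allocate()) // 10.0.0.2
	fmt.Println("task 2 IP:  ", alloc.allocate()) // 10.0.0.3
}
```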

Slide 15

[Diagram: SwarmKit Manager with the Scheduler highlighted]
• Assigns Tasks to Nodes
  • unassignedTasks
  • nodeHeap
    • search the heap for the best node that meets the constraints && has the lightest workload (sketched below)
  • ReadyFilter, ResourceFilter, ConstraintFilter
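The filter-then-pick idea can be shown in a few lines. A hypothetical sketch: the `Filter` type, field names, and sorting are illustrative; SwarmKit's real scheduler keeps nodes in a heap and uses its own filter interfaces.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified view of a worker for scheduling purposes.
type Node struct {
	Name      string
	Ready     bool
	FreeCPU   int // millicores
	TaskCount int // current workload
}

// Filter rejects nodes that cannot run the task, mirroring the idea
// behind ReadyFilter / ResourceFilter / ConstraintFilter.
type Filter func(Node) bool

// schedule returns the feasible node with the lightest workload.
func schedule(nodes []Node, filters []Filter) (Node, bool) {
	var feasible []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, false
	}
	// SwarmKit keeps nodes in a heap; sorting by task count is the
	// same "pick the lightest workload" idea in miniature.
	sort.Slice(feasible, func(i, j int) bool {
		return feasible[i].TaskCount < feasible[j].TaskCount
	})
	return feasible[0], true
}

func main() {
	nodes := []Node{
		{"node1", true, 500, 3},
		{"node2", true, 2000, 1},
		{"node3", false, 4000, 0},
	}
	ready := func(n Node) bool { return n.Ready }
	hasCPU := func(n Node) bool { return n.FreeCPU >= 1000 }
	if best, ok := schedule(nodes, []Filter{ready, hasCPU}); ok {
		fmt.Println("assign task to", best.Name) // node2
	}
}
```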

Slide 16

[Diagram: SwarmKit Manager with the Dispatcher highlighted; Agents hold long-lived gRPC streams to it, and Tasks flow down those streams]
• Node (agent) management
• Dispatches each assigned Task to the corresponding Node

Slide 17

[Diagram: an Agent contains a Worker and an Executor; the Executor drives the Docker Daemon through an adapter over docker.sock]
• Worker:
  • connects to the Dispatcher to check for assigned tasks
• Executor:
  • executes tasks (containers) on this Node
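The pull model can be sketched without real gRPC: the agent opens a stream to the dispatcher and executes whatever assignments arrive. Everything below is hypothetical scaffolding; in SwarmKit the stream is a gRPC session and the executor talks to the Docker daemon over docker.sock.

```go
package main

import "fmt"

// Task is a simplified assignment, e.g. "start a container".
type Task struct{ Image string }

// Dispatcher is the manager-side interface the agent pulls from.
// In SwarmKit this is a long-lived gRPC stream, not a channel.
type Dispatcher interface {
	Assignments(node string) <-chan Task
}

// Executor starts containers; the real one drives the Docker daemon.
type Executor interface {
	Run(Task) error
}

// agent is the worker loop: pull an assignment, execute it.
func agent(node string, d Dispatcher, e Executor) {
	for task := range d.Assignments(node) {
		if err := e.Run(task); err != nil {
			fmt.Println("task failed:", err)
		}
	}
}

// Toy implementations so the sketch runs.
type fakeDispatcher struct{ ch chan Task }

func (f fakeDispatcher) Assignments(string) <-chan Task { return f.ch }

type printExecutor struct{}

func (printExecutor) Run(t Task) error {
	fmt.Println("starting container from image", t.Image)
	return nil
}

func main() {
	d := fakeDispatcher{ch: make(chan Task, 2)}
	d.ch <- Task{Image: "nginx"}
	d.ch <- Task{Image: "redis"}
	close(d.ch)
	agent("worker-1", d, printExecutor{})
}
```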

Slide 18

Mesos 1.0
• A distributed systems kernel
  • originally designed to run big data jobs
  • core idea: fine-grained resource sharing
• Mesos Design
  • Master + Slave + ZooKeeper
  • two-level scheduling
    • scheduler + executor = framework
    • needs a framework like Marathon for orchestration and management
  • containerizer
    • multiple container runtime & image support (>=1.0)

Slide 19

[Diagram: MPI and Hadoop jobs feed their schedulers; the Mesos master's allocation module picks a framework to offer resources to; Mesos slaves run MPI executors with tasks]
Resource offer
*Animation: Operating Systems and Systems Programming, Lecture 24, Anthony D. Joseph, https://cs162.eecs.berkeley.edu/

Slide 20

[Diagram, continued]
Resource offer = list of (node, availableResources)
e.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
*Animation: Operating Systems and Systems Programming, Lecture 24, Anthony D. Joseph, https://cs162.eecs.berkeley.edu/
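In code, an offer is just that list of (node, availableResources) pairs, and two-level scheduling means the framework, not Mesos, decides which offers to use. The types and the `accept` helper below are hypothetical illustrations, not the mesos-go API.

```go
package main

import "fmt"

// Resources mirrors the <CPUs, memory> pairs in the offer above.
type Resources struct {
	CPUs  float64
	MemGB float64
}

// Offer is one (node, availableResources) entry; a resource offer
// is a list of these.
type Offer struct {
	Node      string
	Available Resources
}

// accept models a framework scheduler's decision: take any offer
// big enough for the task, decline the rest back to the master.
func accept(offers []Offer, need Resources) (used, declined []Offer) {
	for _, o := range offers {
		if o.Available.CPUs >= need.CPUs && o.Available.MemGB >= need.MemGB {
			used = append(used, o)
		} else {
			declined = append(declined, o)
		}
	}
	return used, declined
}

func main() {
	offers := []Offer{
		{"node1", Resources{2, 4}},
		{"node2", Resources{3, 2}},
	}
	used, declined := accept(offers, Resources{CPUs: 2, MemGB: 3})
	fmt.Println("launch tasks on:", used)    // node1
	fmt.Println("decline offers:", declined) // node2
}
```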

Slide 21

[Diagram, continued: the Hadoop scheduler makes a framework-specific scheduling decision and launches a task in a Hadoop executor on a slave; the master launches and isolates executors]
*Animation: Operating Systems and Systems Programming, Lecture 24, Anthony D. Joseph, https://cs162.eecs.berkeley.edu/

Slide 22

How does Docker plug into Mesos?
• Before 1.0
  • Docker Containerizer
  • Docker image -> task -> mesos-docker-executor -> Docker Daemon
• Mesos 1.0
  • supports multiple runtimes & images
  • MesosContainerizer
    • "Mesos native container stack"
    • Isolators
    • Launcher
[Diagram: a Mesos slave running a Hadoop executor and a mesos-docker-executor, each with a task]

Slide 23

Checkpoint

| | Kubernetes | Docker SwarmKit | Mesos+Marathon |
| Design | control-loop driven | control-loop driven (but in a single binary) | two-level scheduling |
| Coordination | etcd | built-in raft | ZooKeeper |
| Container Runtime | multiple | single, but has potential for more OCI runtimes | multiple |
| Container Image | Docker Image, ACI, more in future | Docker Image | Docker Image, ACI, more in future |
| Docker Daemon | not needed | needed | not needed |

Slide 24

About the Built-In Data Store

| Pros | Cons |
| easy to set up | hard to understand & debug |
| fewer round trips | hard to do backup/restore, migration, monitoring/audit |
| easy to do performance tuning | lacks management APIs (cf. the etcd admin guide) |

Slide 25

Chapter 2: Control Plane

Slide 26

Control Plane: Orchestration + Management
• "Defines when and what to do next throughout the automated workflow"
• workload management
• secret management
• configuration management
• scale and autoscaling
• stateful workloads
• … and more

Slide 27

Workload Management
e.g. "a web server with 2 replicas"

| | Kubernetes | Docker SwarmKit | Mesos+Marathon |
| Description | Deployment | Service | Application |
| Version Control | yes (revision) | not yet | yes (deployments) |

Slide 28

• Kubernetes "Deployment"
  • $ kubectl create -f
  • $ kubectl edit
    • this opens and edits the object stored in etcd
    • an update triggers a rolling update
  • $ kubectl set image
  • $ kubectl scale --replicas=5 …
  • $ kubectl rollout history
  • $ kubectl rollout undo --to-revision=<revision>
[Screenshot: $ kubectl edit …]

Slide 29

• Docker SwarmKit "Service"
  • $ docker service create SERVICE --replicas=5 …
  • $ docker service scale SERVICE=REPLICAS
  • $ docker service update [OPTIONS] SERVICE
    • rolling update
    • 30+ update options are supported
      • --container-label-add value
      • --container-label-rm value
      • --env-add value
      • --env-rm value
      • --image string
      • …

Slide 30

• Mesos + Marathon "Application"
  • $ dcos marathon app start [--force] []
  • $ dcos marathon app update [--force] […]
    • rolling update
    • app dependencies are respected
  • $ dcos marathon app version list [--max-count=] …
  • $ dcos marathon deployment list [--json ]
  • $ dcos marathon deployment rollback

Slide 31

Secret Management
• Kubernetes
  • Secret volume
    • encrypted and stored in etcd
    • consumed via ENV or volume
• Docker SwarmKit
  • under discussion: https://github.com/docker/swarmkit/issues/1329
• Mesos + Marathon
  • only in DC/OS
  • stored in ZooKeeper, exposed as ENV in Marathon
• A similar feature is Configuration Management

Slide 32

Configuration Management
• Kubernetes
  • ConfigMap
    • stored in etcd, consumed via ENV or volume
    • $ kubectl create configmap example-redis-config --from-file=docs/redis-config
• Docker SwarmKit
  • under discussion: https://github.com/docker/swarmkit/issues/1329
• Mesos + Marathon
  • not yet

Slide 33

Autoscaling
• Kubernetes
  • HorizontalPodAutoscaler (the scaling rule is sketched below)
    • default: CPU
    • Custom Metrics:
      • user-defined endpoint, e.g. http://localhost:9100/metrics
      • shares the same metric data structures with CNCF projects like Prometheus
• Docker SwarmKit
  • not yet: https://github.com/docker/swarmkit/issues/486#issuecomment-219133613
• Mesos + Marathon
  • a standalone `marathon-autoscale.py`
    • autoscales applications based on utilization metrics from Mesos
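The core of the HorizontalPodAutoscaler is a one-line rule: desired = ceil(current * currentMetric / targetMetric). A minimal sketch of that arithmetic; the real controller adds a tolerance band, min/max replica bounds, and cooldown windows before acting.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the HPA scaling rule:
// desired = ceil(current * currentMetric / targetMetric).
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 4 replicas at 90% average CPU with a 60% target -> scale to 6.
	fmt.Println(desiredReplicas(4, 90, 60))
	// Load drops to 30% average CPU -> scale down to 2.
	fmt.Println(desiredReplicas(4, 30, 60))
}
```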

Slide 34

Stateful Workloads
• Kubernetes
  • PetSet: replicas with stable membership and volumes
    • stable hostname
    • ordinal index
    • stable storage
• Docker SwarmKit
  • not yet, and stateful services are not recommended
• Mesos + Marathon
  • Stateful Applications
    • dynamic reservations, reservation labels, and persistent volumes
[Diagram: cassandra-0 with volume 0 at cassandra-0.cassandra.default.svc.cluster.local; cassandra-1 with volume 1 at cassandra-1.cassandra.default.svc.cluster.local]

Slide 35

Chapter 3: Service Discovery & Load Balancing

Slide 36

Service Discovery & LB
• Kubernetes
  • Load Balancer
    • iptables
  • External Access
    • NodePort: <nodeIP>:<port>
    • External LoadBalancer
    • Ingress (L7)
      • Ingress Pod: Nginx, HAProxy
      • SSL
  • Name Service
    • built-in skyDNS pod
[Diagram: a portal iptables rule (10.10.0.116:8001, random mode) spreads internal traffic across pod rules 1 and 2 to Pod 1 and Pod 2; outside traffic for http://foo.bar.com enters through an Ingress Pod on a Node]

Slide 37

Service Discovery & Load Balancing
• Docker SwarmKit
  • Load Balancer
    • ipvs NAT mode
  • External Access
    • Routing Mesh
  • Name Service
    • embedded DNS server
      • for services and tasks
• Two kinds of sandboxes
  • ingress: on every worker
  • container: on workers where a task lives
• Two networks are needed
  • ingress overlay
  • user-defined overlay
[Diagram: ingress and container sandboxes on each worker; DNS maps svc->vip; iptables & ipvs rules are updated via gossip; outside traffic enters through a port mapping when the service is created with -p, internal traffic goes through ipvs]

Slide 38

Service Discovery & Load Balancing
• Mesos + Marathon
  • Load Balancer
    • Marathon-lb: HAProxy based
    • virtual addresses (VIPs) in DC/OS
  • External Access
    • http://<host>:<port>
    • external load balancer
  • Name Service
    • Mesos-DNS
[Diagram: Marathon-lb and Mesos-DNS in front of containers running on slaves]

Slide 39

Checkpoint

| | Kubernetes | Docker SwarmKit | Mesos+Marathon |
| Filter | iptables VIP | iptables VIP | no need |
| LB | iptables random mode | ipvs NAT mode | HAProxy |
| External Access | nodeIP:port, Ingress, external IP/LB | Routing Mesh (ingress overlay) | same as exposing HAProxy to the public |
| Update | watch etcd | gossip | marathon_lb.py & template |

Slide 40

Chapter 4: Scheduling

Slide 41

Kubernetes
• Pod as the scheduling unit
  • this is unique, but why?
• Multi-Scheduler
  • pod1: scheduler1, pod2: scheduler2
• QoS tiers (a classification sketch follows)
  • anyone remember the core idea of Borg?
  • Guaranteed (requests == limits)
  • Burstable (requests < limits)
  • Best-Effort (no requests & limits)
• More Borg features are on the way
  • equivalence classes, pod-level resource boundary, …
[Diagram: a Burstable Pod]
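The three tiers fall out of a simple comparison of requests and limits. A simplified sketch: the real rules examine every container and every resource (CPU and memory) in the pod, not a single value.

```go
package main

import "fmt"

// Resources are simplified requests/limits for one container.
type Resources struct {
	RequestCPU, LimitCPU int // millicores; 0 means unset
}

// qosClass mirrors the Kubernetes QoS tiers named above.
func qosClass(r Resources) string {
	switch {
	case r.RequestCPU == 0 && r.LimitCPU == 0:
		return "BestEffort" // nothing requested, nothing limited
	case r.RequestCPU == r.LimitCPU:
		return "Guaranteed" // requests == limits
	default:
		return "Burstable" // requests set, but below limits
	}
}

func main() {
	fmt.Println(qosClass(Resources{0, 0}))      // BestEffort
	fmt.Println(qosClass(Resources{500, 500}))  // Guaranteed
	fmt.Println(qosClass(Resources{500, 1000})) // Burstable
}
```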

Slide 42

Docker SwarmKit
• Task (container) as the scheduling unit
• Multi-Scheduler
  • not yet
• Strategy
  • a pipeline of filters
    • ReadyFilter, ResourceFilter, ConstraintFilter
  • to sort the nodeHeap
• QoS tiers
  • not yet

Slide 43

Mesos + Marathon
• Task as the scheduling unit (Pod support planned)
• Multi-Scheduler
  • Mesos is designed to run multiple frameworks (schedulers)
• Strategy
  • two-level scheduling (the killer feature of Mesos)
    • "Twitter scale" …
    • fine-grained resource sharing (like Borg)
• QoS tiers
  • of course
• And much more
  • task eviction, data locality, max-min fairness, priority, offer rejection, Delay Scheduling
  • and Big Data, of course

Slide 44

Chapter 5: Summary

Slide 45

A Use Case: hyper.sh
• hyper.sh is "Docker Done the Right Way"
  • $ hyper run mysql
  • $ hyper run --link mysql wordpress
  • $ hyper fip attach 22.33.44.55 wordpress
• But Hyper.sh is powered by Kubernetes
  • and also maintains Kubernetes features

Slide 46

Extensibility Really Matters
• Hypernetes (h8s = k8s + HyperContainer) is what's backing Hyper.sh:
  • HyperContainer runtime
  • multi-tenant network based on Neutron
  • custom Cinder plugin with a Ceph backend
  • custom HAProxy-based Service
• Kubernetes is truly extensible and configurable

Slide 47

Just a Personal Opinion
• So, if:
  • I am an individual developer/org looking for something friendly that just works
    • I use Docker SwarmKit
  • I have a "Twitter scale" cluster to manage, or I am a Big Data user
    • I need Mesos
• But if what I need is an infrastructure layer to build my systems on top of, the right way
  • Kubernetes is the choice

Slide 48

THE END @resouer