About Me • Lei (Harry) Zhang • #Microsoft MVP in cloud and datacenter management • though I’m a Linux guy :/ • Previous: VMware, Baidu • Feature maintainer of Kubernetes • HyperCrew: https://hyper.sh • Publications: Docker & Kubernetes Under the Hood • PhD candidate @ZJU: Large-scale cluster management and scheduling
A survey about “boundary” • Are you comfortable with Linux containers as an effective boundary? • Yes, I use containers in my private/safe environment • No, I use containers to serve the public cloud
As long as we care about security… • We have to wrap containers inside full-blown virtual machines • But we lose cloud-native deployment • Slow startup time • Huge waste of resources • Memory tax for every container • … (figure labels: dream, reality)
Pod: lesson learned from Borg • InitContainers: one or more containers started in sequence before the pod's normal containers are started. • Share volumes, perform network operations, and perform computation prior to the app containers.
So, Pod is • The group of super-affinity containers • The atomic scheduling unit • The process group in container cloud • Do the right things • without modifying your container image • Kubernetes = Spring Framework • Pod = IoC • (diagram labels: Pod, log, app, infra container, volume, init container)
Pod is not easy to simulate • (diagram: log and app linked by super affinity) • Requirement: • app: 1G, log: 0.5G • Available: • Node_A: 1.25G, Node_B: 2G • What happens if app is scheduled to Node_A?
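The question on this slide can be worked through with a minimal sketch. It assumes a toy first-fit scheduler and hypothetical function names (`place_containers_one_by_one`, `place_pod`); it is not the Kubernetes scheduler, only an illustration of why the Pod has to be the atomic scheduling unit.

```python
# Toy first-fit placement; names and policy are illustrative assumptions.

def place_containers_one_by_one(nodes, containers):
    """Place the first container on the first node it fits, then force the
    remaining containers onto the same node (super affinity)."""
    free = dict(nodes)
    chosen = None
    for _name, request in containers:
        if chosen is None:
            chosen = next((n for n, f in free.items() if f >= request), None)
            if chosen is None:
                return None
        if free[chosen] < request:
            return None  # a co-located container no longer fits: stuck
        free[chosen] -= request
    return chosen


def place_pod(nodes, containers):
    """Treat the whole group as one unit and fit the summed request."""
    total = sum(request for _name, request in containers)
    return next((n for n, f in nodes.items() if f >= total), None)


nodes = {"Node_A": 1.25, "Node_B": 2.0}       # available memory in G
containers = [("app", 1.0), ("log", 0.5)]     # requests in G

print(place_containers_one_by_one(nodes, containers))  # None
print(place_pod(nodes, containers))                    # Node_B
```

Placing app alone on Node_A leaves 0.25G, so log (0.5G) cannot follow it. Scheduling the 1.5G group as a whole skips Node_A and lands on Node_B.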
HyperContainer is a Pod • Linux container based runtimes • wrap and encapsulate several app containers into a logical group • Hypervisor container based runtime • the hypervisor serves as a natural boundary of the Pod
HyperContainer is a Pod • kubelet Container Runtime Interface • create sandbox Foo --> create container C --> start container C • stop container C --> remove container C --> delete sandbox Foo • Sandbox • Normally: the infra container • HyperContainer: hypervisor • with HyperKernel • a HyperStart process as PID 1 • sets up the mnt namespace, launches apps from the images, etc.
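The call order on this slide can be sketched as a small in-memory state machine. The `ToyRuntime` class and its method names are hypothetical and simply mirror the slide's wording; they are not the actual CRI gRPC API, and the state checks are assumptions added to show why the ordering matters.

```python
class ToyRuntime:
    """In-memory stand-in for a sandbox-based runtime (illustrative only)."""

    def __init__(self):
        self.sandboxes = {}  # sandbox name -> {container name: state}
        self.calls = []      # ordered record of lifecycle calls

    def create_sandbox(self, sandbox):
        self.sandboxes[sandbox] = {}
        self.calls.append(f"create sandbox {sandbox}")

    def create_container(self, sandbox, container):
        # A container can only be created inside an existing sandbox.
        self.sandboxes[sandbox][container] = "created"
        self.calls.append(f"create container {container}")

    def start_container(self, sandbox, container):
        if self.sandboxes[sandbox][container] != "created":
            raise RuntimeError("container must be created before start")
        self.sandboxes[sandbox][container] = "running"
        self.calls.append(f"start container {container}")

    def stop_container(self, sandbox, container):
        self.sandboxes[sandbox][container] = "exited"
        self.calls.append(f"stop container {container}")

    def remove_container(self, sandbox, container):
        if self.sandboxes[sandbox][container] == "running":
            raise RuntimeError("stop the container before removing it")
        del self.sandboxes[sandbox][container]
        self.calls.append(f"remove container {container}")

    def delete_sandbox(self, sandbox):
        if self.sandboxes[sandbox]:
            raise RuntimeError("sandbox still holds containers")
        del self.sandboxes[sandbox]
        self.calls.append(f"delete sandbox {sandbox}")


rt = ToyRuntime()
rt.create_sandbox("Foo")
rt.create_container("Foo", "C")
rt.start_container("Foo", "C")
rt.stop_container("Foo", "C")
rt.remove_container("Foo", "C")
rt.delete_sandbox("Foo")
print(" --> ".join(rt.calls))
```

In this sketch the sandbox is the only thing that differs between runtimes: it stands for the infra container in the Linux container case and for the hypervisor in the HyperContainer case, while the call sequence stays the same.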
Define the Network • Network • a first-class API object • each tenant (created by Keystone) has its own Network • a Network maps to a Neutron “net” • a Network Controller is responsible for managing the Network lifecycle
Example • (diagram labels: api-server, etcd, scheduler, controller-manager ControlLoop, kubelet SyncLoop, proxy; objects: network, pod, replica, namespace, service, job, deployment, volume, petset, …; Desired World, Real World) • Call Neutron to create/delete network
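The control-loop idea on this slide (compare the Desired World with the Real World, then call Neutron to create or delete networks) can be sketched as below. `FakeNeutron` and `reconcile_networks` are hypothetical names for illustration; nothing here calls the real Neutron API or reflects the actual Network Controller code.

```python
class FakeNeutron:
    """In-memory stand-in for Neutron, holding the 'real world' networks."""

    def __init__(self, nets=()):
        self.nets = set(nets)

    def create_network(self, name):
        self.nets.add(name)

    def delete_network(self, name):
        self.nets.discard(name)


def reconcile_networks(desired, neutron):
    """One pass of a control loop: make Neutron match the desired set."""
    actual = set(neutron.nets)
    to_create = sorted(desired - actual)   # wanted but missing
    to_delete = sorted(actual - desired)   # present but no longer wanted
    for name in to_create:
        neutron.create_network(name)
    for name in to_delete:
        neutron.delete_network(name)
    return to_create, to_delete


neutron = FakeNeutron(nets={"old-net"})
desired = {"tenant-a", "tenant-b"}         # Network objects in the api-server
print(reconcile_networks(desired, neutron))
print(reconcile_networks(desired, neutron))  # second pass has nothing to do
```

The loop is idempotent: once the real world matches the desired world, running it again produces no calls.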
Kubernetes Network Model • Container reaches container • all containers can communicate with all other containers without NAT • Node reaches container • all nodes can communicate with all containers (and vice-versa) without NAT • IP addressing • a Pod in the cluster can be addressed by its IP
How does h8s fit that? • A Network can be assigned to one or more Namespaces • Pods belonging to the same Network can reach each other directly through IP • a Pod’s network maps to a Neutron “port” • kubelet is responsible for Pod network setup • let’s see how kubelet works
Design of kubelet • Choose Runtime (docker, rkt, hyper/remote) • InitNetworkPlugin • HandlePods {Add, Update, Remove, Delete, …} • (diagram labels: SyncLoop, PodUpdate, PLEG, NodeStatus, Network Status, status Manager, volume Manager, image Manager) • Pod Update Worker (e.g. ADD) • generate Pod status • check volume status (discussed later) • call runtime to start containers • set up Pod network (see next slide)
Multi-tenant Service • The default iptables-based kube-proxy is not tenant-aware • Endpoint Pods and Nodes with iptables rules are isolated into different networks • Hypernetes uses a built-in HAProxy as the Service portal • to handle all Service instances within the same namespace • the same OnServiceUpdate and OnEndpointsUpdate workflow • ExternalProvider • an OpenStack LB will be created as the Service • e.g. curl 58.215.33.98:8078
Kubernetes Persistent Volume • (diagram labels: Host path, Cinder volume plugin, Pod, mountPath, attach, mount, Volume Manager, desired World, reconcile) • Get mountedVolume from actualStateOfWorld • Unmount volumes in mountedVolume but not in desiredStateOfWorld • AttachVolume() if vol in desiredStateOfWorld and not attached • MountVolume() if vol in desiredStateOfWorld and not in mountedVolume • Verify devices that should be detached/unmounted are detached/unmounted • Tips: 1. -v host:path 2. attach vs. mount 3. Totally independent from container management
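The reconcile steps listed on this slide can be sketched with plain sets. The function name `reconcile_volumes` and the explicit detach step are assumptions made for the sake of a self-contained example; the slide itself only says that detached/unmounted state is verified, and this is not the kubelet Volume Manager code.

```python
def reconcile_volumes(desired, attached, mounted):
    """One reconcile pass over sets of volume names.

    desired  -- volumes in the desired state of world
    attached -- volumes currently attached to the host (mutated in place)
    mounted  -- volumes currently mounted (mutated in place)
    Returns the ordered list of operations performed.
    """
    ops = []
    # Unmount volumes that are mounted but no longer desired.
    for vol in sorted(mounted - desired):
        mounted.discard(vol)
        ops.append(("unmount", vol))
    # Attach desired volumes that are not attached yet.
    for vol in sorted(desired - attached):
        attached.add(vol)
        ops.append(("attach", vol))
    # Mount desired volumes that are not mounted yet.
    for vol in sorted(desired - mounted):
        mounted.add(vol)
        ops.append(("mount", vol))
    # Assumption: detach here what is attached but no longer desired.
    for vol in sorted(attached - desired):
        attached.discard(vol)
        ops.append(("detach", vol))
    # Verify that nothing outside the desired set is left behind.
    if mounted != desired or attached != desired:
        raise RuntimeError("actual state still differs from desired state")
    return ops


desired = {"vol-a", "vol-b"}
attached = {"vol-b", "vol-c"}
mounted = {"vol-c"}
for op in reconcile_volumes(desired, attached, mounted):
    print(op)
```

Attach and mount are kept as separate steps on purpose: a volume must be attached to the host as a block device before it can be mounted at a path, which is the "attach vs. mount" distinction in the tips.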
Persistent Volume with HyperContainer • Enhanced Cinder volume plugin • Linux container: 1. full OpenStack cluster 2. query Nova to find node 3. attach Cinder volume to host path 4. bind mount host path to Pod containers • HyperContainer: • directly attach block devices to Pod • thanks to the hypervisor based Pod boundary • eliminates extra time to query Nova • (diagram labels: Host, vol, Enhanced Cinder volume plugin, Pod, mountPath, attach vol, Volume Manager, desired World, reconcile)
Future of CRI • Keep Docker as the only default container runtime • oci-runtime, rktlet, hyperd • Frakti: the Remote Container Runtime Kit • https://github.com/kubernetes/frakti • you are welcome to try it out, star, and fork
“If the image becomes non-standard” • e.g. the Docker image somehow becomes Docker-specific • Don’t worry, kubelet.imageManager is becoming runtime-specific • but then k8s will probably choose • NO DEFAULT runtime
Full Topology • (diagram labels: Master, Object: Network, Object: Pod, Object: …, kubestack, Keystone, Neutron, Cinder, Ceph, Node, kubelet, kube-proxy, Neutron L2 Agent, Cinder Plugin, Pod)
Summary • A new way to build secure and multi-tenant Kubernetes • Kubernetes + HyperContainer + Neutron Plugin + Cinder Plugin + Keystone • Project URL: https://github.com/hyperhq/hypernetes • Roadmap • Graduate the HyperContainer runtime into k8s upstream • see HyperContainer in an official k8s release • Neutron CNI plugin • Tip: https://hyper.sh is built entirely on Hypernetes, try it out :)