
OpenStack, Kubernetes How we should face them / LINE Campus Talk in Hong Kong by Yuki Nishiwaki

25.03.2019 Campus Talk in HKUST, CUHK
26.03.2019 Campus Talk in HKU
Presented by Yuki Nishiwaki

LINE Developers

March 25, 2019

Transcript

  1. OpenStack, Kubernetes How we should face them LINE Corp Yuki

    Nishiwaki
  2. About Me • Name : Yuki Nishiwaki • Title :

    Private Cloud Platform Team Lead at LINE • Experience : ◦ OSS Contribution: ▪ rancher/rancher, kubernetes/ingress-nginx, coreos/etcd-operator, openstack/neutron ◦ Presentations: ▪ Japan Container Days v18.12 Keynote (Future of LINE CaaS Platform) ▪ Japan Container Days v18.12 (How we can develop a managed k8s service with Rancher) ▪ OpenStack Summit 2018 Vancouver (Excitingly simple multi-path OpenStack Network) ▪ OpenStack Summit 2016 Austin (Swift Private Endpoint) ▪ OpenStack Summit 2015 Tokyo (Automate Deployment & Benchmark) ▪ ….
  3. What is a (Private) Cloud? Kubernetes Cluster VM Load balancer

    Infrastructure Resources API or GUI “Provide controllability of infrastructure resources via API/GUI”
  4. Private Cloud Platform Team? Responsibility - Develop/Maintain common/fundamental functions for

    the Private Cloud (IaaS) - OpenStack-based Private Cloud - Managed Kubernetes Service Network Service Operation Platform Storage Maintain/Develop Private Cloud
  5. LINE Private Cloud Cloud Service Catalog OpenStack VM (Nova) Image

    Store (Glance) Network Controller (Neutron) Identity (Keystone) DNS Controller (Designate) Loadbalancer L4LB L7LB Kubernetes (Rancher) Storage Block Storage (Ceph) Object Storage (Ceph) Database Search/Analytics Engine (Elasticsearch) RDBMS (MySQL) KVS (Redis) Messaging (Kafka) Function (Knative) Baremetal Platform Service Network Storage
  6. OpenStack? Microservices architecture Pros: We can deploy just what

    we need Cons: Operational cost of running multiple different processes
  7. VM Creation in OpenStack? nova-api nova-scheduler nova-conductor neutron-server nova-compute neutron-agent

    neutron-dhcp-agent glance-api Libvirt VM dnsmasq 1. VM Create API request 2. Ask to do VM creation 3. Decide which host to use 4. Ask to create VM 5. Download image 6. Create port 9. Detect tap device 10. Get port detail 7. Update dhcp 8. Configure dhcp 11. Configure bridge, tap 12. Ask to create VM 13. Create VM 14. Provide DHCP
  8. Kubernetes ?

  9. Container Orchestrating? Node1 Node2 Node3 Web Service DevOps Team Load

    balancer What to do: 1. Understand which container is running on which node 2. Update container images one by one to reduce downtime 3. Configure the load balancer to distribute traffic across multiple nodes 4. If a container dies, re-create it 5. If traffic increases dramatically, add new docker nodes/containers 6. To share secret data with containers, set up NFS or distribute it to all nodes (Skip if audience knows about Kubernetes)
  10. Container Orchestrating with Kubernetes Node1 Node2 Node3 Load balancer Just

    Ask, Check! (Skip if audience knows about Kubernetes) What to do: 1. Understand which container is running on which node 2. Update container images one by one to reduce downtime 3. Configure the load balancer to distribute traffic across multiple nodes 4. If a container dies, re-create it 5. If traffic increases dramatically, add new docker nodes/containers 6. To share secret data with containers, set up NFS or distribute it to all nodes
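"Just Ask, Check!" is the declarative reconciliation model: you declare desired state and a control loop converges the actual state toward it. A minimal Python sketch of one reconciliation step (a toy model, not real Kubernetes code):

```python
def reconcile(desired_replicas, running):
    """One reconciliation step: return the actions needed so that the
    `running` container list converges to `desired_replicas`.
    `running` holds container names; dead containers were already dropped."""
    actions = []
    # Scale up: re-create containers that died or were never started.
    for i in range(desired_replicas - len(running)):
        actions.append(("start", f"web-{len(running) + i}"))
    # Scale down: stop surplus containers.
    for name in running[desired_replicas:]:
        actions.append(("stop", name))
    return actions

# A container died: 3 desired, only 2 running -> the loop restarts one.
print(reconcile(3, ["web-0", "web-1"]))
# Traffic dropped: scale down from 4 running containers to 2.
print(reconcile(2, ["web-0", "web-1", "web-2", "web-3"]))
```

The operator only edits `desired_replicas`; items 1–5 from the slide's manual checklist become the controller's job.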
  11. Difficulty of DevOps for Kubernetes, OpenStack 1. One deployment includes

    “Networking”, “Virtualization”, “Storage”.... 2. Composed of multiple processes / nodes 3. Many dependent OSS
  12. Many Dependent OSS ? • Keystone (User Controller) • Nova

    (VM Controller) ◦ More than 8 different processes • Neutron (Networking Controller) ◦ More than 4 different processes • Glance (Image Service) ◦ More than 2 different processes • Designate (DNS Controller) ◦ More than 4 processes • Dnsmasq (DHCP Service) • Libvirt (VM Monitor) • RabbitMQ (Messaging Bus) • Qemu (Hardware Emulator) OpenStack • Kubernetes ◦ More than 5 different processes • Flannel (Networking Controller) • Rancher (Managing Kubernetes Software) ◦ More than 3 different processes • Docker • Docker Machine • Etcd (KVS) Managed Kubernetes
  13. We need to walk through…. • Keystone (User Controller) •

    Nova (VM Controller) ◦ More than 8 different processes • Neutron (Networking Controller) ◦ More than 4 different processes • Glance (Image Service) ◦ More than 2 different processes • Designate (DNS Controller) ◦ More than 4 processes • Dnsmasq (DHCP Service) • Libvirt (VM Monitor) • RabbitMQ (Messaging Bus) • Qemu (Hardware Emulator) OpenStack • Kubernetes ◦ More than 5 different processes • Flannel (Networking Controller) • Rancher (Managing Kubernetes Software) ◦ More than 3 different processes • Docker • Docker Machine • Etcd (KVS) Managed Kubernetes +0.1M lines +1.5M lines +1.5M lines +0.09M lines +0.07M lines +0.03M lines +0.6M lines +3-4M lines +1M lines +0.01M lines +0.1M lines +2-3M lines Require Reading Code Require Reading Code +1M lines +0.03M lines
  14. Essence of Good Operation for OSS distributed systems 1. Read

    code until you understand it; don’t just trust documents or bug reports. 2. Grasp the internal state of running software/processes. 3. Understand that where a problem appears is not always where the root cause is. 4. Understand which software has what responsibility. 5. Don’t stop investigating a problem until you understand the root cause.
  15. Example of Essence: Reading Code, not just Documents 1. Read

    code until you understand it; don’t just trust documents. Architecture/Design Understanding - Documentation does not catch up with code - To know more about operational risks - To improve the OSS (it is not always perfect for us) Bug or Weird Behaviour - Don’t rely on just Google; make the effort to solve it ourselves - Discussion/communication inside the team is based on code.
  16. Code Reading Related Activity... Code Contributions (Only Merged) • OpenStack

    ◦ openstack/neutron (1) • Kubernetes Related ◦ coreos/etcd-operator (1) ◦ kubernetes/ingress-controller (1) ◦ rancher/norman (3) ◦ rancher/types (1) ◦ docker/machine (1) Code Reading Documents • https://github.com/ukinau/rancher-analyse • https://www.slideshare.net/linecorp/lets-unbox-rancher-20-v200 We are users of OSS, but at the same time we are developers of OSS 1. Read code until you understand it; don’t just trust documents
  17. Example of Essence: Grasp the Internal State of a Process All

    software has internal state, e.g.: • Dnsmasq-dhcp (DHCP Server) ◦ The number of DHCP entries ◦ How many times a DHCP NACK was issued ◦ ... • Nginx (Web Server) ◦ Average request processing time ◦ Average number of requests ◦ How many requests are pending ◦ …. Dnsmasq-dhcp • 3 DHCP entries • 1000 DHCP NACKs issued Nginx • 30 msec average request processing time • 10k req/min • 200 requests pending (backlog queue) 2. Grasp the internal state of running software/processes
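Exposing this internal state usually means serving counters in the Prometheus text format that the later `/metrics` slides rely on. A minimal hand-rolled Python sketch (the metric names here are invented for illustration, not real dnsmasq metrics):

```python
class Metrics:
    """Tiny in-process counter registry rendered in the Prometheus
    text exposition format ("name value" per line)."""

    def __init__(self):
        self.counters = {}

    def inc(self, name, amount=1):
        self.counters[name] = self.counters.get(name, 0) + amount

    def render(self):
        # What a GET /metrics handler would return as its body.
        return "\n".join(f"{k} {v}" for k, v in sorted(self.counters.items()))

m = Metrics()
m.inc("dnsmasq_dhcp_entries", 3)          # hypothetical metric names
m.inc("dnsmasq_dhcp_nack_total", 1000)
print(m.render())
```

Real services would use a Prometheus client library instead, but the principle is the same: every internally held number becomes externally scrapeable.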
  18. Helps us detect small problems before they become big Dnsmasq-dhcp •

    3 DHCP entries • 1000 DHCP NACKs issued Nginx • 30 msec average request processing time • 10k req/min • 200 requests pending (backlog queue) There seems to be a client sending unexpected DHCP REQUESTs. => Need to review configuration / troubleshoot The number of workers is not enough for the actual request volume. => Need to run more nginx instances or increase workers 2. Grasp the internal state of running software/processes
  19. OSS is not always operator-friendly... Access “https://<rancher-server>/metrics” Rancher

    v2.0.8 doesn’t export any internal metrics 2. Grasp the internal state of running software/processes
  20. Contribute to OSS to expose internal state 2. Grasp the internal

    state of running software/processes
  21. Confirm we can access internal metrics After our patch we

    can now visualize the internal state 2. Grasp the internal state of running software/processes
  22. Example of Essence: Grasp Software Responsibility 4. Understand which software

    has what responsibility Haproxy (Layer 7 LB) Application Server Application Server Database
  23. When a client gets an error from the system 4. Understand

    which software has what responsibility Haproxy (Layer 7 LB) Application Server Application Server Database Client ERROR: Failed to establish TCP connection
  24. If we understand the responsibility of each software component 4. Understand

    which software has what responsibility Haproxy (Layer 7 LB) Application Server Application Server Database Client ERROR: Failed to establish TCP connection Responsibility - Establish TCP connections with clients - Distribute HTTP requests across different application servers
  25. If we didn’t understand responsibility…. Haproxy (Layer 7 LB) Application

    Server Application Server Database Client ERROR: Failed to establish TCP connection DevOps Team Oh, something is happening. Check everything!!!! which needs 1 day... 2 days.... 3 days... 4. Understand which software has what responsibility
  26. OpenStack, Kubernetes are more complicated than... 4. Understand which software

    has what responsibility What if this process goes down? How does this process affect others? Understanding responsibility is more important than in a usual web system
  27. How do we usually troubleshoot / fix bugs?

  28. Monitoring System Detect Something Happened

  29. What’s happening? Kubernetes Cluster Server Cluster Agent Kubernetes Cluster Cluster

    Agent Failed to establish websocket session Websocket Websocket
  30. Check log of Cluster Agent $ kubectl logs -f cattle-cluster-agent-df7f69b68-s7mqg

    -n cattle-system INFO: Environment: CATTLE_ADDRESS=172.18.6.6 CATTLE_CA_CHECKSUM=8b791af7a1dd5f28ca19f8dd689bb816d399ed02753f2472cf25d1eea5c20be1 CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-df7f69b68-s7mqg CATTLE_SERVER=https://rancher.com INFO: Using resolv.conf: nameserver 172.19.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local ERROR: https://rancher.com/ping is not accessible (Could not resolve host: rancher.com) Somehow this container seems to have failed to resolve the domain name. Kubernetes Cluster Cluster Agent
  31. Check log of Cluster Agent $ kubectl logs -f cattle-cluster-agent-df7f69b68-s7mqg

    -n cattle-system INFO: Environment: CATTLE_ADDRESS=172.18.6.6 CATTLE_CA_CHECKSUM=8b791af7a1dd5f28ca19f8dd689bb816d399ed02753f2472cf25d1eea5c20be1 CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-df7f69b68-s7mqg CATTLE_SERVER=https://rancher.com INFO: Using resolv.conf: nameserver 172.19.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local ERROR: https://rancher.com/ping is not accessible (Could not resolve host: rancher.com) Somehow this container seems to have failed to resolve the domain name. Kubernetes Cluster Cluster Agent $ kubectl get svc -n kube-system NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) kube-dns ClusterIP 172.19.0.10 <none> 53/UDP,53/TCP
  32. Kube DNS Problem? $ kubectl logs kube-dns-5ccb66df65-dhrqx -n kube-system kubedns

    | grep '^E' $ # no error detected Kubernetes Cluster Cluster Agent Kube DNS Busybox $ kubectl run -it busybox --image busybox -- sh / # nslookup google.com Server: 172.19.0.10 Address: 172.19.0.10:53 Non-authoritative answer: Name: google.com Address: 216.58.196.238 Check log of Kube DNS Check if other containers can resolve DNS
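The nslookup check above can also be scripted from inside a pod; a small Python sketch using the standard-library resolver, which honors the same resolv.conf the agent logged:

```python
import socket

def can_resolve(hostname):
    """Return True if the configured resolver can resolve `hostname`
    -- the same check that `nslookup` performs from the busybox pod."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# "localhost" resolves everywhere; inside a pod you would instead test
# cluster-internal names like "kube-dns.kube-system.svc.cluster.local"
# or the rancher server hostname that the agent failed on.
print(can_resolve("localhost"))
```

Running this in the failing agent's pod versus the busybox pod would reproduce the asymmetry the slides observe: same resolver IP, different results per node.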
  34. Only containers running on Node1 failed to resolve Kubernetes Cluster

    Node1 Node2 Node3 Cluster Agent BusyBox Kube DNS
  35. What about network connectivity between Node1 and Node3? Node1 Node2 Node3

    $ tcpdump -i eth0 port 8472 and src host <node 1> … # no output $ tcpdump -i eth0 port 8472 and src host <node 2> 23:11:32.017389 IP 10.0.0.2.56312 > 10.0.0.3.otv: OTV, flags [I] (0x08), overlay 0, instance 1 IP 172.18.5.7 > 172.18.6.5: ICMP echo request, id 15872, seq 273, length 64 eth0 VXLAN UDP Port eth0 eth0 dstIP: Node3 dstPort: 8472 srcIP: Node2 srcPort: 56312 Container Ether frame vxlan header VXLAN Overlay Network dstIP: Node3 dstPort: 8472 srcIP: Node1 srcPort: 56312 Container Ether frame vxlan header IP Header UDP Header VXLAN Header Container Ether Frame Missing
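The encapsulation in the diagram (outer IP/UDP to port 8472, an 8-byte VXLAN header, then the inner container Ethernet frame) can be illustrated in a few lines of Python following the RFC 7348 header layout (the inner frame here is a dummy placeholder):

```python
import struct

def vxlan_header(vni):
    """8-byte VXLAN header (RFC 7348): first word carries the I-flag
    (0x08 in the top byte), second word carries the 24-bit VNI shifted
    left past the final reserved byte."""
    return struct.pack("!II", 0x08000000, vni << 8)

inner_frame = b"\xaa" * 64                 # stand-in for the container Ethernet frame
payload = vxlan_header(1) + inner_frame    # body of the UDP datagram sent to port 8472
print(payload[:8].hex())                   # the 8 header bytes tcpdump decodes as "OTV/VXLAN"
```

This is why tcpdump filters on UDP port 8472: everything the containers exchange rides inside these datagrams, and no datagrams from Node1 means the overlay path itself is broken.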
  36. Who is responsible for the Container Network? • Building the container network

    is just a pre-condition for Kubernetes • It is supposed to be done outside of Kubernetes
  37. What do we use? • Flannel, which is

    used to connect Linux containers • Flannel supports multiple backends like vxlan, ipip…. Responsibility of Flannel in our case? • Configure the Linux kernel to create the termination device of the overlay network • Configure the Linux kernel routing, bridging, and related subsystems
  38. Understand how flannel configures Linux networking zz:zz:zz:zz:zz:zz eth0: 10.0.0.1/24 Flannel.1:

    172.17.1.0/32 cni: 172.17.1.1/24 yy:yy:yy:yy:yy:yy Pod A xx:xx:xx:xx:xx:xx eth0: 172.17.1.2/24 cc:cc:cc:cc:cc:cc eth0: 10.0.0.2/24 Flannel.1: 172.17.2.0/32 cni: 172.17.2.1/24 bb:bb:bb:bb:bb:bb Pod B aa:aa:aa:aa:aa:aa eth0: 172.17.2.2/24 $ ip r 172.17.2.0/24 via 172.17.2.0 dev flannel.1 $ ip n 172.17.2.0 flannel.1 lladdr cc:cc:cc:cc:cc:cc $ bridge fdb show dev flannel.1 cc:cc:cc:cc:cc:cc dev flannel.1 dst 10.0.0.2 Configure 1 route, 1 arp entry, 1 fdb entry per host route arp entry fdb entry
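Per the slide, each remote node contributes exactly one route, one ARP entry, and one FDB entry. A Python sketch that generates those three entries from the node metadata shown in the diagram (an illustrative model of what flannel programs into the kernel, not flannel's code):

```python
def flannel_entries(remote):
    """Build the route / ARP / FDB lines flannel programs into the
    kernel for one remote node -- one of each, as on the slide."""
    return {
        # Pod subnet is reached via the remote VTEP IP on flannel.1.
        "route": f"{remote['subnet']} via {remote['vtep_ip']} dev flannel.1",
        # Static ARP: remote VTEP IP -> remote VTEP MAC.
        "arp":   f"{remote['vtep_ip']} dev flannel.1 lladdr {remote['vtep_mac']}",
        # FDB: remote VTEP MAC -> remote node's underlay IP (VXLAN outer dst).
        "fdb":   f"{remote['vtep_mac']} dev flannel.1 dst {remote['node_ip']}",
    }

# Values for Node2 taken from the diagram above.
node2 = {"subnet": "172.17.2.0/24", "vtep_ip": "172.17.2.0",
         "vtep_mac": "cc:cc:cc:cc:cc:cc", "node_ip": "10.0.0.2"}
for line in flannel_entries(node2).values():
    print(line)
```

Comparing the output of this generator with `ip r` / `ip n` / `bridge fdb` is exactly the check the next slide performs by hand.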
  39. Check Linux Network Configuration Routing table 172.17.1.0/24 via 172.17.1.0 dev

    flannel.1 172.17.2.0/24 dev cni 172.17.3.0/24 via 172.17.3.0 dev flannel.1 ARP cache 172.17.1.0 flannel.1 lladdr aa:aa:aa:aa:aa:aa 172.17.3.0 flannel.1 lladdr cc:cc:cc:cc:cc:cc FDB aa:aa:aa:aa:aa:aa dev flannel.1 dst 10.0.0.1 cc:cc:cc:cc:cc:cc dev flannel.1 dst 10.0.0.3 Routing table 172.17.2.0/24 via 172.17.2.0 dev flannel.1 172.17.3.0/24 dev cni ARP cache 172.17.2.0 flannel.1 lladdr bb:bb:bb:bb:bb:bb FDB bb:bb:bb:bb:bb:bb dev flannel.1 dst 10.0.0.2 Routing table 172.17.1.0/24 dev cni 172.17.2.0/24 via 172.17.2.0 dev flannel.1 172.17.3.0/24 via 172.17.3.0 dev flannel.1 ARP cache 172.17.2.0 flannel.1 lladdr bb:bb:bb:bb:bb:bb 172.17.3.0 flannel.1 lladdr cc:cc:cc:cc:cc:cc FDB bb:bb:bb:bb:bb:bb dev flannel.1 dst 10.0.0.2 cc:cc:cc:cc:cc:cc dev flannel.1 dst 10.0.0.3 Node1 Node2 Node3 Node1-related information is missing: • 172.17.1.0/24 via 172.17.1.0 dev flannel.1 • 172.17.1.0 flannel.1 lladdr aa:aa:aa:aa:aa:aa • aa:aa:aa:aa:aa:aa dev flannel.1 dst 10.0.0.1
  40. Is Flannel having some problem? $ kubectl logs kube-flannel-knwd7 -n kube-system

    kube-flannel | grep -v '^I' # exclude info logs $ Node1 Node2 Node3 eth0 eth0 eth0 Flannel Flannel Flannel But there is no error log….
  41. Reading Code / Understand How Flannel Works Deeply The Flannel agent will: • Store

    node-specific metadata into k8s node annotations when flannel starts • Set up route, fdb, and arp cache entries when a node has the flannel annotations $ kubectl get node yuki-testc1 -o yaml apiVersion: v1 kind: Node Metadata: Annotations: flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"9a:4f:ef:9c:2e:2f"}' flannel.alpha.coreos.com/backend-type: vxlan flannel.alpha.coreos.com/kube-subnet-manager: "true" flannel.alpha.coreos.com/public-ip: 10.0.0.1 All nodes should have these annotations
  42. Check kubernetes node annotations apiVersion: v1 kind: Node Metadata: Annotations:

    flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"bb:bb:bb:bb:bb:bb"}' flannel.alpha.coreos.com/backend-type: vxlan flannel.alpha.coreos.com/kube-subnet-manager: "true" flannel.alpha.coreos.com/public-ip: 10.0.0.2 rke.cattle.io/external-ip: 10.0.0.2 apiVersion: v1 kind: Node Metadata: Annotations: flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"cc:cc:cc:cc:cc:cc"}' flannel.alpha.coreos.com/backend-type: vxlan flannel.alpha.coreos.com/kube-subnet-manager: "true" flannel.alpha.coreos.com/public-ip: 10.0.0.3 rke.cattle.io/external-ip: 10.0.0.3 apiVersion: v1 kind: Node Metadata: Annotations: rke.cattle.io/external-ip: 10.0.0.1 Node1 Node2 Node3 Missing flannel-related annotations. The flannel running on node3 could not configure entries for node1 because there are no annotations => Why doesn’t node1 have the flannel annotations? => Why does node2 have node1’s network information?
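Checking which nodes are missing the flannel annotations can be scripted instead of eyeballed; a Python sketch over node objects shaped like the `kubectl get node -o yaml` output above (a toy check, not a real kubectl plugin):

```python
REQUIRED = (
    "flannel.alpha.coreos.com/backend-data",
    "flannel.alpha.coreos.com/backend-type",
    "flannel.alpha.coreos.com/public-ip",
)

def nodes_missing_flannel(nodes):
    """Return the names of nodes lacking any required flannel annotation."""
    bad = []
    for node in nodes:
        annotations = node["metadata"].get("annotations", {})
        if not all(key in annotations for key in REQUIRED):
            bad.append(node["metadata"]["name"])
    return bad

# Minimal reproduction of the slide: node1 only has Rancher's annotation.
nodes = [
    {"metadata": {"name": "node1", "annotations": {
        "rke.cattle.io/external-ip": "10.0.0.1"}}},
    {"metadata": {"name": "node2", "annotations": {
        "flannel.alpha.coreos.com/backend-data": '{"VtepMAC":"bb:bb:bb:bb:bb:bb"}',
        "flannel.alpha.coreos.com/backend-type": "vxlan",
        "flannel.alpha.coreos.com/public-ip": "10.0.0.2",
        "rke.cattle.io/external-ip": "10.0.0.2"}}},
]
print(nodes_missing_flannel(nodes))
```

Fed with the output of `kubectl get nodes -o json`, such a check would have flagged node1 immediately.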
  43. The annotations have been changed by someone else... apiVersion: v1 kind:

    Node Metadata: Annotations: flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"bb:bb:bb:bb:bb:bb"}' flannel.alpha.coreos.com/backend-type: vxlan flannel.alpha.coreos.com/kube-subnet-manager: "true" flannel.alpha.coreos.com/public-ip: 10.0.0.2 rke.cattle.io/external-ip: 10.0.0.2 apiVersion: v1 kind: Node Metadata: Annotations: flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"cc:cc:cc:cc:cc:cc"}' flannel.alpha.coreos.com/backend-type: vxlan flannel.alpha.coreos.com/kube-subnet-manager: "true" flannel.alpha.coreos.com/public-ip: 10.0.0.3 rke.cattle.io/external-ip: 10.0.0.3 apiVersion: v1 kind: Node Metadata: Annotations: rke.cattle.io/external-ip: 10.0.0.1 Node1 Node2 Node3 Server Rancher Server also updates kubernetes node annotations => Why have the flannel annotations on only node1 gone...
  44. Reading Code / Understand How Rancher Works When Rancher builds

    Kubernetes nodes: 1. Get current node annotations 2. Build desired annotations 3. Get the node resource 4. Replace annotations with the desired ones 5. Update the node with the desired annotations FunctionA FunctionB Few-second interval This logic ignores optimistic locking
  45. How does this logic cause the problem? Rancher updates annotations Server Node2

    is up Node2 is booting Flannel updates annotations Rancher updates annotations Server Node1 is up Node1 is booting Flannel updates annotations Create Node 2 Update Annotation Rancher’s Annotation Rancher’s Annotation Flannel’s Annotation Flannel’s Annotation Rancher’s Annotation Update Annotation Create Node 1 Existing annotations are dropped because of this logic
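The lost update can be reproduced in a few lines: a replace-style writer drops the keys a concurrent writer added between its read and its write, while a merge-style writer keeps them. A Python sketch of both behaviours (an illustrative model, not Rancher's actual Go code):

```python
def replace_annotations(node, desired):
    # Replace-style write: overwrite the whole map.
    # Any key another writer added in the meantime is silently lost.
    node["annotations"] = dict(desired)

def merge_annotations(node, desired):
    # Merge-style write: only touch the keys this writer owns.
    node["annotations"].update(desired)

rancher_desired = {"rke.cattle.io/external-ip": "10.0.0.1"}

# Flannel wrote its annotation between Rancher's read and Rancher's write...
node = {"annotations": {"flannel.alpha.coreos.com/public-ip": "10.0.0.1"}}
replace_annotations(node, rancher_desired)
print(sorted(node["annotations"]))   # flannel's key is gone

node = {"annotations": {"flannel.alpha.coreos.com/public-ip": "10.0.0.1"}}
merge_annotations(node, rancher_desired)
print(sorted(node["annotations"]))   # both keys survive
```

The real fixes are either optimistic locking (retry on resourceVersion conflict) or the merge behaviour; the patch on the next slide chose the merge as the smaller change.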
  46. Write patch and confirm it works diff --git a/pkg/controllers/user/nodesyncer/nodessyncer.go b/pkg/controllers/user/nodesyncer/nodessyncer.go

    index 11cc9c4e..64526ccf 100644 --- a/pkg/controllers/user/nodesyncer/nodessyncer.go +++ b/pkg/controllers/user/nodesyncer/nodessyncer.go @@ -143,7 +143,19 @@ func (m *NodesSyncer) syncLabels(key string, obj *v3.Node) error { toUpdate.Labels = obj.Spec.DesiredNodeLabels } if updateAnnotations { - toUpdate.Annotations = obj.Spec.DesiredNodeAnnotations + // NOTE: This is just a workaround. + // There are multiple solutions to the problem of https://github.com/rancher/rancher/issues/13644 + // and this problem is a kind of design bug, so solving its root cause needs design decisions. + // That's why for now we solved the problem with the solution that doesn't have to change many places: + // we don't want to create/maintain a large change which has a high possibility of not being merged upstream. + // The Rancher community tends to hesitate to merge big changes created by engineers not from Rancher Labs. + + // The solution is to change NodeSyncer so that it does not replace annotations with desiredAnnotations but just updates the annotations + // specified in desiredAnnotations. This change has the side-effect that users can no longer delete existing annotations + // via desiredAnnotations, but we believed this case is rare, so we chose this solution. + for k, v := range obj.Spec.DesiredNodeAnnotations { + toUpdate.Annotations[k] = v
  47. Reporting/Proposing to OSS Community

  48. Look back at the troubleshooting One of the agents failed to connect

    to the rancher server First suspect: Rancher Agent itself? Second suspect: Kube-dns? Third suspect: Kubernetes? Fifth suspect: Rancher Server? => Need to read code Fourth suspect: Flannel? => Need to read code What we observed
  49. Look back at the troubleshooting One of the agents failed to connect

    to the rancher server First suspect: Rancher Agent itself? Second suspect: Kube-dns? Third suspect: Kubernetes? Fifth suspect: Rancher Server? => Need to read code Fourth suspect: Flannel? => Need to read code What we observed 3. Understand that where a problem appears is not always where the root cause is
  50. Look back at the troubleshooting One of the agents failed to connect

    to the rancher server First suspect: Rancher Agent itself? Second suspect: Kube-dns? Third suspect: Kubernetes? Fifth suspect: Rancher Server? => Need to read code Fourth suspect: Flannel? => Need to read code What we observed 3. Understand that where a problem appears is not always where the root cause is 4. Understand which software has what responsibility
  51. Look back at the troubleshooting One of the agents failed to connect

    to the rancher server First suspect: Rancher Agent itself? Second suspect: Kube-dns? Third suspect: Kubernetes? Fifth suspect: Rancher Server? => Need to read code Fourth suspect: Flannel? => Need to read code What we observed 3. Understand that where a problem appears is not always where the root cause is 4. Understand which software has what responsibility 5. Don’t stop investigating a problem until you understand the root cause => There were chances to stop digging and just apply a workaround: • Second suspect (Kube DNS): as some information on the internet described, if we had deployed kube-dns on all nodes, this problem would have seemed hidden • Fourth suspect (Flannel): if we had just assumed the flannel annotations disappeared for a trivial reason and manually fixed them, this problem would have seemed hidden
  52. Essence of Good Operation for OSS distributed systems Remind: 1. Read

    code until you understand it; don’t just trust documents or bug reports. 2. Grasp the internal state of running software/processes. 3. Understand that where a problem appears is not always where the root cause is. 4. Understand which software has what responsibility. 5. Don’t stop investigating a problem until you understand the root cause.
  53. Being a developer in the Private Cloud Platform Team • Wide range

    of technologies ◦ Microservice/distributed-system operation knowledge ◦ Reading large amounts of code ◦ Many dependent OSS e.g. OpenStack, Kubernetes, Rancher, Docker…. ◦ Networking e.g. OS networking, overlay networks ◦ Virtualization e.g. containers, Libvirt/KVM • Strong problem-solving skills ◦ Troubleshooting tends to be complicated, like a cascading disaster • A mindset of communicating with / contributing to the OSS community ◦ As much as possible, we want to follow upstream development to reduce costs • Continuously learning new tech deeply ◦ Don’t stop at just playing with new tech! There are many chances to improve ourselves
  54. What’s coming in the future? • Our private cloud scale is

    getting bigger ◦ More regions (current: 3 regions) ◦ More hypervisors (current: 1000+ HVs) ◦ More clusters for specific use cases • Enhance operations (making operation easy) ◦ To be able to operate a large-scale cloud with a small team • Make cloud-native related components production-ready ◦ Managed Kubernetes Service is not production-ready yet...
  55. Thanks for listening