Slide 1

Introduction of Private Cloud in LINE
Yuki Nishiwaki

Slide 2

Agenda
1. Introduction/Background of Private Cloud
2. OpenStack in LINE
3. Challenges of OpenStack

Slide 3

Who are we?
Responsibility:
- Develop/maintain common/fundamental functions for the private cloud (IaaS)
- Think about optimization for the whole private cloud
Areas: Network, Service, Operation, Platform, Storage
Software: IaaS (OpenStack + α), Kubernetes
Knowledge: software, network, virtualization, Linux

Slide 4

Before Private Cloud
Problems/Concerns:
1. Many manual procedures
2. Cost to communicate
3. Difficult to scale provisioning (for 2000+ engineers)
In detail:
1. Need to ask the infrastructure department to change infrastructure or add servers
2. Need to predict (difficult) how many servers are needed and when
3. Tend to keep an unnecessary amount of servers
(Diagram: dev teams ask the infrastructure department for a server; buying, registering, and setting one up takes about 3 months.)

Slide 5

After Private Cloud
Improved:
1. Automated many operations
2. No communication cost
3. No provisioning cost
4. Optimized resource usage
But:
1. Software development is needed
In detail, developers get infrastructure resources without human interaction:
- automated resource allocation and deallocation
- no need for prediction
- no unnecessary resources
(Diagram: a dev team asks the private cloud for a server via API and gets it in just a few seconds; the infrastructure department maintains the cloud.)

Slide 6

Private Cloud
- Platform: OpenStack (VM: Nova, Image Store: Glance, Network Controller: Neutron, Identity: Keystone, DNS Controller: Designate), Kubernetes (Rancher), Function (Knative), Baremetal
- Network: Loadbalancer (L4LB, L7LB)
- Storage: Block Storage (Ceph), Object Storage (Ceph)
- Database: Search/Analytics Engine (Elasticsearch), RDBMS (MySQL), KVS (Redis), Messaging (Kafka)
- Operation: Operation Tools

Slide 7

Today's Topic: OpenStack
- VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
(Out of scope today: L4/L7 loadbalancers, Kubernetes (Rancher), Ceph block/object storage, Elasticsearch, MySQL, Redis, Kafka, Knative, Baremetal, operation tools.)

Slide 8

OpenStack
- Open source software to build a private cloud (like AWS)
- Microservice architecture
  - Use only the parts/components you need
  - Scale out only the parts/components that need it

Slide 9


Microservice Architecture of OpenStack

Slide 10

Assemble your own cloud: Private Cloud

Slide 11

OpenStack in LINE
- Introduced: 2016
- Version: Mitaka + customization
- Clusters: 4
- Hypervisors: 1100+ (Dev: 400; Prod region 1: 600; Prod region 2: 76; Prod region 3: 80)
- VMs: 26000+ (Dev: 15503; Prod region 1: 8870; Prod region 2: 335; Prod region 3: 229)

Slide 12

Difficulty of building an OpenStack cloud
(Diagram: a datacenter network of core, aggregation, and ToR switches, with racks of hypervisors, OpenStack API servers, and OpenStack databases.)
- Knowledge of networking: design/plan the whole DC network
- Knowledge of operating a large product: build operation tools that are not tied to a specific software; consider user support
- Knowledge of server kitting: communicate with the procurement department
- Knowledge of OpenStack software: design the deployment, deploy, customize, and troubleshoot OpenStack components and related software

Slide 13

Building OpenStack is not completed by one team (teams: Network, Operation, Platform; members: 3+, 4+, and 4+)
Operation:
- Maintain the golden VM image, Elasticsearch for logging, and Prometheus for alerting
- Develop operation tools
- User support
- Buy new servers
Network:
- Design/plan the DC and inter-DC network
- Implement a network orchestrator (outside OpenStack)
Platform:
- Design the OpenStack deployment
- Deploy and customize OpenStack
- Troubleshooting

Slide 14

Challenges of OpenStack
Basically, we are trying to make OpenStack (IaaS) stable.
What we have done:
1. Legacy system integration
2. Bring a new network architecture into OpenStack networking
3. Maintain customizations of OSS while keeping up with upstream
What we will do:
1. Scale emulation environment
2. Internal communication visualizing/tuning
3. Containerize OpenStack
4. Event Hub as a platform


Slide 16

Challenge 1: Integration with legacy systems
Even before the cloud, we had many company-wide systems: Configuration Management, CMDB, IPDB, a Monitoring System, and Server Login Authority Management.
(Diagram: Dev asks Infra for a new server; after setup, Infra registers its spec, OS, and location in the CMDB, its IP address and hostname in the IPDB, the server as a monitoring target, and its acceptable users in login authority management.)

Slide 17

Challenge 1: Integration with legacy systems
After the private cloud, "server creation" completes without the infrastructure department's involvement. Thus the private cloud itself should register the new server in each legacy system (Configuration Management, CMDB, IPDB, Monitoring System, Server Login Authority Management).
(Diagram: Dev creates a server on the private cloud; the cloud performs all the registrations.)
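The registration flow above can be sketched in a few lines. This is an illustrative stand-in, not LINE's actual code: the system names and record fields are hypothetical, and the point is only that the cloud itself fans a new server out to every legacy system.

```python
# Illustrative sketch (not LINE's actual integration): after the cloud
# creates a server, the cloud, not a human, registers it everywhere.

class LegacySystem:
    """In-memory stand-in for a company-wide system such as a CMDB."""
    def __init__(self, name):
        self.name = name
        self.records = []

    def register(self, record):
        self.records.append(record)

class PrivateCloud:
    def __init__(self, legacy_systems):
        self.legacy_systems = legacy_systems
        self.servers = {}

    def create_server(self, hostname, ip, spec):
        server = {"hostname": hostname, "ip": ip, "spec": spec}
        self.servers[hostname] = server
        # Fan the new server out to every legacy system automatically.
        for system in self.legacy_systems:
            system.register(server)
        return server

cmdb = LegacySystem("CMDB")
ipdb = LegacySystem("IPDB")
monitoring = LegacySystem("Monitoring")
cloud = PrivateCloud([cmdb, ipdb, monitoring])
cloud.create_server("vm-001", "10.0.0.5", "4vcpu/8GB")
print(len(cmdb.records), len(ipdb.records), len(monitoring.records))  # 1 1 1
```

In a real deployment each `register` would be an API call to the legacy system, triggered from the cloud's server-creation workflow.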

Slide 18

Challenge 2: New network architecture in our DC
For scalability and operability, we introduced a Clos network architecture and terminate L3 on the hypervisor.
(Diagram: previous vs new network design.)

Slide 19

Challenge 2: Support the new architecture in OpenStack
The OSS Neutron implementation (neutron-server, neutron-dhcp-agent, neutron-metadata-agent, neutron-linuxbridge-agent) expects VMs to share an L2 network. We want VMs not to share an L2 network, so we replaced neutron-linuxbridge-agent with our own neutron-custom-agent.

Slide 20

Challenge 3: Improve how we customize OSS
- We have customized many OpenStack components (Nova, Glance, Neutron, Keystone, Designate)
- Previously we forked a specific upstream version and piled customize commits (for A, for B, for C, ...) on top, again and again
- With that forked history, it is difficult to take a specific patch out of our customized OpenStack

Slide 21

Challenge 3: Improve how we customize OSS
- Don't fork / stop forking
- In git, maintain only the patch files (patch for A, patch for B, patch for C) plus the base commit ID of the specific upstream version
=> taking a patch out is much easier than before
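A minimal sketch of the idea, assuming nothing about LINE's actual tooling: the repository holds only a base commit ID and an ordered patch series, and a build applies them on top of upstream. Dropping one patch is then a one-entry change, unlike untangling a forked history. The "patches" below are simple text transforms rather than real diffs, to keep the sketch self-contained.

```python
# Hypothetical sketch of the patch-series approach (base + ordered patches).

BASE_COMMIT = "mitaka-abc123"  # pinned upstream version (made-up ID)

# The git repo stores only these, not a forked copy of OpenStack.
patch_series = [
    ("patch-for-A.patch", lambda src: src + "\n# feature A"),
    ("patch-for-B.patch", lambda src: src + "\n# feature B"),
    ("patch-for-C.patch", lambda src: src + "\n# feature C"),
]

def build(upstream_source, series):
    """Apply the patch series, in order, on top of upstream."""
    src = upstream_source
    for _name, apply_patch in series:
        src = apply_patch(src)
    return src

upstream = f"# nova @ {BASE_COMMIT}"
full = build(upstream, patch_series)

# Taking patch B out is just removing one entry from the series:
without_b = build(
    upstream, [p for p in patch_series if p[0] != "patch-for-B.patch"])
print("# feature B" in full, "# feature B" in without_b)  # True False
```

With real git, the same workflow is roughly `git checkout <base commit>` followed by `git am patches/*.patch`; deleting one `.patch` file removes that customization from the next build.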

Slide 22

Challenges differ from Day 1 to Day 2
Day 1 (so far):
- Develop user-facing features
  - Keep the same experience as before (legacy systems)
  - Support the new architecture
- Daily operation
  - Predictable
  - Unpredictable (driven by trouble)
Day 2 (from now):
- Enhance operation
- Optimize development
- Reduce daily operation
  - Predictable
  - Unpredictable

Slide 23

Challenges of OpenStack
Basically, we are trying to make OpenStack (IaaS) stable.
What we have done:
1. Legacy system integration
2. Bring a new network architecture into OpenStack networking
3. Maintain customizations of OSS while keeping up with upstream
What we will do:
1. Scale emulation environment
2. Internal communication visualizing/tuning
3. Containerize OpenStack
4. Event Hub as a platform

Slide 24

Future Challenge 1: Scale emulation environment
- Introduced: 2016
- Version: Mitaka + customization
- Clusters: 4 + 1 (WIP: semi-public cloud)
- Hypervisors: 1100+ (Dev: 400; Prod region 1: 600; Prod region 2: 76; Prod region 3: 80)
- VMs: 26000+ (Dev: 15503; Prod region 1: 8870; Prod region 2: 335; Prod region 3: 229)
The number of hypervisors is continuously increasing, and we have faced timing/scale-related errors and operations that took a long time.

Slide 25

Future Challenge 1: Scale emulation environment
We need an environment that simulates scale, without preparing the same number of hypervisors, from the following points of view:
- Database access
- RPC over RabbitMQ
These are control-plane-specific loads, so we can also use this environment to tune the OpenStack control plane.

Slide 26

Future Challenge 1: Scale emulation environment
- Implement fake agents (nova-compute, neutron-agent)
- Use containers instead of actual hypervisors
(Diagram: in the real environment, the control plane orchestrates 600 hypervisors, each running nova-compute and neutron-agent. The scale environment uses the same control plane, but the 600 "hypervisors" are fake HVs: Docker containers running the fake nova-compute/neutron-agent, packed onto a single server.)

Slide 27

Future Challenge 1: Scale emulation environment
(Same setup as the previous slide.) A new fake HV is easy to add, so we can emulate any scale.
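As a hedged illustration of the fake-agent idea (the class and method names below are invented, not the real nova-compute RPC interface): a fake agent answers the control plane's calls and heartbeats like a real hypervisor but skips the actual virtualization work, so hundreds of them fit on one server.

```python
# Invented sketch: a fake compute agent that exercises the control plane's
# RPC and database load without doing any real virtualization.

class FakeComputeAgent:
    def __init__(self, hostname):
        self.hostname = hostname
        self.instances = {}

    def heartbeat(self):
        # A real agent would report state to the control plane over
        # RabbitMQ; here we just return the record it would send.
        return {"host": self.hostname, "status": "up",
                "instances": len(self.instances)}

    def spawn(self, instance_id):
        # A real nova-compute would drive libvirt here; the fake agent
        # only records the instance, so spawning is instant and cheap.
        self.instances[instance_id] = "ACTIVE"
        return self.instances[instance_id]

# Emulating 600 hypervisors is just 600 cheap objects (or containers):
fleet = [FakeComputeAgent(f"fake-hv-{i}") for i in range(600)]
fleet[0].spawn("vm-1")
print(len(fleet), fleet[0].heartbeat()["instances"])  # 600 1
```

From the control plane's point of view the load (RPC traffic, state reports, database writes) looks the same, which is exactly the part that needs tuning.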

Slide 28

Future Challenge 2: Communication visualizing
There are two types of communication among the OpenStack services (e.g. Keystone, Nova, Neutron):
- RESTful API between components
- RPC over a messaging bus inside a component (e.g. neutron-server to neutron-agent)

Slide 29

Future Challenge 2: Communication visualizing
Any of this communication can break at any time:
- because of scale
- because of improper configuration
And an error sometimes propagates from one component to another.

Slide 30

Future Challenge 2: Communication visualizing
1. This kind of issue is very difficult to troubleshoot, because:
- the error propagates from one component to another
- logs do not always carry enough information
- logs only appear once something has already happened
2. Sometimes the problem can be predicted from metrics, such as:
- how many RPCs were received
- how many RPCs are waiting for a reply

Slide 31

Future Challenge 2: Communication visualizing
So: monitor communication-related metrics (RESTful API and RPC) with a monitoring tool.
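A toy sketch of the kind of counters meant here (the wrapper and metric names are invented; a real deployment would export them to a monitoring tool): count RPCs sent and replies received, so "waiting for reply" becomes an observable metric.

```python
# Invented sketch: wrap an RPC client with counters so a growing reply
# backlog is visible before errors ever reach the logs.

class InstrumentedRpcClient:
    def __init__(self):
        self.metrics = {"rpc_sent": 0, "rpc_replied": 0}

    def call(self, method):
        self.metrics["rpc_sent"] += 1
        call_id = self.metrics["rpc_sent"]
        return call_id  # the reply arrives later, asynchronously

    def on_reply(self, call_id):
        self.metrics["rpc_replied"] += 1

    def waiting_for_reply(self):
        # A growing backlog here can predict trouble before logs show it.
        return self.metrics["rpc_sent"] - self.metrics["rpc_replied"]

client = InstrumentedRpcClient()
a = client.call("sync_routers")
b = client.call("report_state")
client.on_reply(a)
print(client.metrics["rpc_sent"], client.waiting_for_reply())  # 2 1
```

Scraping such counters periodically turns communication health into a time series, which is what the monitoring tool on the slide would graph and alert on.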

Slide 32

Future Challenge 3: Containerize OpenStack
Motivation / current pain points:
- Complexity of packaging tools like RPM (dependencies between packages, configuration for new files)
  => we need to rebuild the RPM every time we change the code
- Impossible to run different versions of OpenStack on the same server, because they share common OpenStack libraries
  => we actually deployed many more control-plane servers than we need
- Lack of observability for the software running on the control plane:
  - in the deployment script (Ansible, Chef, ...) there is no way to tell which part installs dependent libraries and which part installs our software
  - the deployment script does not track the software after it is deployed
  - we cannot notice if a developer runs some temporary script

Slide 33

Future Challenge 3: Containerize OpenStack
(Diagram: today, Ansible playbooks install the common library and software such as nova-api and neutron-server from RPM onto each server and start them. After containerization, Kubernetes manifests pull nova-api and neutron-server container images, each bundling its own common library, from a Docker registry.)

Slide 34

Future Challenge 4: EventHub as a Platform
(Component overview again: OpenStack (Nova, Glance, Neutron, Keystone, Designate), L4/L7 loadbalancers, Kubernetes (Rancher), Ceph block/object storage, Elasticsearch, MySQL, Redis, Kafka, Knative, Baremetal, operation tools.)

Slide 35

Future Challenge 4: EventHub as a Platform
These components depend on each other: some component or operation script wants to do something
- when a user (actually a project) in Keystone is deleted
- when a VM is created
- when a real server is added to a loadbalancer

Slide 36

Pub/Sub concept in microservice architecture
- Each component (authentication, VM, network, ...) publishes the important events of its own component to a messaging bus (RabbitMQ)
- Components subscribe to just the events they are interested in
- A component can act when an interesting event happens, without having to know which components it needs to work with

Slide 37

Pub/Sub concept in microservice architecture
(Same as the previous slide.) This mechanism allows us to extend the private cloud (microservices) in the future without changing existing code.
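The decoupling described above can be sketched with a minimal in-memory bus (a stand-in for RabbitMQ; the topic name and API are invented): the publisher never learns who is listening, so a new subscriber can be added without touching existing code.

```python
# Minimal pub/sub sketch; a stand-in for RabbitMQ, not a real client.

class MessageBus:
    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        # The publisher has no idea who (if anyone) is listening.
        for callback in self.subscribers.get(topic, []):
            callback(event)

bus = MessageBus()
cleaned_up = []

# A network component reacts to project deletion; Keystone never knows it.
bus.subscribe("identity.project.deleted",
              lambda e: cleaned_up.append(e["project_id"]))

# The authentication component just publishes its own important event.
bus.publish("identity.project.deleted", {"project_id": "proj-42"})
print(cleaned_up)  # ['proj-42']
```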

Slide 38

Future Challenge 4: EventHub as a Platform
The publishing side of this notification logic is already implemented in OpenStack, but...
(Diagram: Keystone and Nova publish events to the messaging bus (RabbitMQ); operation scripts A and B, the L7LB, and Kubernetes each subscribe, and each has to implement both its rabbitmq-access logic and its business logic.)

Slide 39

Future Challenge 4: EventHub as a Platform
- Sometimes the rabbitmq-access code grows bigger than the actual business logic
- Every component/script has to implement that access logic before doing anything useful

Slide 40

Future Challenge 4: EventHub as a Platform
We are currently developing a new component (on top of Function as a Service) that lets us register a program together with the events it is interested in. Subscribers then provide only business logic, with no messaging code, which makes it much easier to cooperate with other components.
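A hedged sketch of that registration model (the `EventHub` class and its API are invented to illustrate the slide, not the component under development): the hub owns the bus connection, and an operation script registers nothing but its business logic.

```python
# Invented sketch: the hub centralizes all messaging access; scripts
# register (event, function) pairs and contain no rabbitmq code at all.

class EventHub:
    def __init__(self):
        self.handlers = {}  # event name -> registered functions

    def register(self, event_name, fn):
        """Register a program with the event it is interested in."""
        self.handlers.setdefault(event_name, []).append(fn)

    def on_bus_event(self, event_name, payload):
        # In reality the hub would consume this from RabbitMQ and run the
        # function on a FaaS runtime; here we call it directly.
        return [fn(payload) for fn in self.handlers.get(event_name, [])]

hub = EventHub()

# An operation script: pure business logic, no messaging boilerplate.
def remove_vm_from_lb(payload):
    return f"removed {payload['vm']} from loadbalancer"

hub.register("compute.instance.deleted", remove_vm_from_lb)
print(hub.on_bus_event("compute.instance.deleted", {"vm": "vm-9"}))
# ['removed vm-9 from loadbalancer']
```

The rabbitmq-access logic that previously had to be duplicated in every script now lives once, inside the hub.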

Slide 41

For the further future: IaaS to PaaS, CaaS, ...
We are currently trying to introduce an additional abstraction layer on top of IaaS.
- https://engineering.linecorp.com/ja/blog/japan-container-days-v18-12-report/
- https://www.slideshare.net/linecorp/lines-private-cloud-meet-cloud-native-world