Introduction of Private Cloud in LINE

LINE Developers
February 15, 2019



  1. Who are we?

    Responsibility
    - Develop and maintain the common, fundamental functions of the Private Cloud (IaaS)
    - Optimize the Private Cloud as a whole
    Diagram labels: Network, Service, Operation, Platform, Storage
    Software
    - IaaS (OpenStack + α)
    - Kubernetes
    Knowledge
    - Software
    - Network, Virtualization, Linux
  2. Before Private Cloud

    Problems/Concerns:
    1. Many manual procedures
    2. Communication cost
    3. Provisioning is difficult to scale
       a. For 2000+ engineers
    Problems/Concerns:
    1. Need to ask the Infrastructure Department
       - to change the infrastructure
       - for more servers
    2. Need to predict (difficult)
       - how many servers
       - when the servers are needed
    3. Tend to hold an unnecessary number of servers
    Diagram: Dev Team 1 and Dev Team 2 ask the Infrastructure Department "Give us a server"; the Infrastructure Department provides a server after about 3 months (buy, register, set up).
  3. After Private Cloud

    Improved
    1. Many operations are automated
    2. No communication cost
    3. No provisioning cost
    4. Optimized resource usage
    But
    1. Software development is needed
    Improved
    1. Get infrastructure resources without human interaction
       - Automated resource allocation
       - Automated resource deallocation
       - No prediction needed
       - No unnecessary resources needed
    Diagram: Dev Team 1 asks the Private Cloud "Give us a server" through the API and gets a server in just a few seconds; the Infrastructure Department maintains the Private Cloud.
  4. Private Cloud

    Categories: Platform, Service, Network, Storage, Operation
    OpenStack: VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
    Loadbalancer: L4LB, L7LB
    Kubernetes (Rancher)
    Storage: Block Storage (Ceph), Object Storage (Ceph)
    Database: Search/Analytics Engine (Elasticsearch), RDBMS (MySQL), KVS (Redis)
    Messaging (Kafka)
    Function (Knative)
    Baremetal
    Operation Tools
  5. Today’s Topic

    OpenStack: VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
    Loadbalancer: L4LB, L7LB
    Kubernetes (Rancher)
    Storage: Block Storage (Ceph), Object Storage (Ceph)
    Database: Search/Analytics Engine (Elasticsearch), RDBMS (MySQL), KVS (Redis)
    Messaging (Kafka)
    Function (Knative)
    Baremetal
    Operation Tools
  6. OpenStack

    • Open source software for building a private cloud (like AWS)
    • Microservice architecture
      ◦ Use only the components you need
      ◦ Scale out only the components that need it
  7. OpenStack in LINE

    Introduced: 2016
    Version: Mitaka + customization
    Number of clusters: 4
    Number of hypervisors: 1100+
    • Dev Cluster: 400
    • Prod Cluster: 600 (region 1)
    • Prod Cluster: 76 (region 2)
    • Prod Cluster: 80 (region 3)
    Number of VMs: 26000+
    • Dev Cluster: 15503
    • Prod Cluster: 8870 (region 1)
    • Prod Cluster: 335 (region 2)
    • Prod Cluster: 229 (region 3)
  8. Difficulty of building OpenStack Cloud

    (Diagram: datacenter network with Core, Aggregation, and ToR switches; racks of hypervisors; OpenStack API and OpenStack database servers.)
    • Knowledge of networking
      ◦ Design/plan the whole DC network
    • Knowledge of operating a large product
      ◦ Build operation tools that are not tied to specific software
      ◦ Consider user support
    • Knowledge of server kitting
      ◦ Communicate with the procurement department
    • Knowledge of OpenStack software
      ◦ Design the OpenStack deployment
      ◦ Deploy OpenStack
      ◦ Customize OpenStack
      ◦ Troubleshooting
        ▪ OpenStack components
        ▪ Related software
  9. Building OpenStack cannot be done by one team

    Teams: Network, Operation, Platform (team size labels on the slide: 3+, 4+, 4+)
    • Maintain
      ◦ Golden VM image
      ◦ Elasticsearch for logging
      ◦ Prometheus for alerting
    • Develop operation tools
    • User support
    • Buy new servers
    • Design/planning
      ◦ DC network
      ◦ Inter-DC network
    • Implement network orchestrator (outside OpenStack)
    • Design OpenStack deployment
    • Deploy OpenStack
    • Customize OpenStack
    • Troubleshooting
  10. Challenge of OpenStack

    Basically, we are trying to make OpenStack (IaaS) stable.
    What we have done
    1. Legacy system integration
    2. Bring the new network architecture into OpenStack networking
    3. Maintain our customization of OSS while keeping up with upstream
    What we will do
    1. Scale emulation environment
    2. Internal communication visualizing/tuning
    3. Containerize OpenStack
    4. Event Hub as a platform
  12. Challenge 1: Integration with Legacy System

    Even before the cloud, we had many company-wide systems, each with its own purpose:
    • CMDB: register spec, OS, location, ...
    • IPDB: register IP address and hostname
    • Monitoring System: register the server as a monitoring target
    • Server Login Authority Management: register the users allowed on the server
    • Configuration Management: set up the server
    Diagram: Dev asks Infra for a new server; Infra registers the server in each system.
  13. Challenge 1: Integration with Legacy System

    After the private cloud, “Server Creation” completes without involving the Infrastructure department. The Private Cloud itself therefore has to register each new server.
    Diagram: Dev creates a new server through the Private Cloud; the Private Cloud registers it in Configuration Management, CMDB, Monitoring System, Server Login Authority Management, and IPDB.
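The registration step above can be sketched as a fan-out from the server-creation path to each legacy system. This is a minimal sketch, not the implementation described in the talk: `LegacyRegistry`, `register_everywhere`, and the record fields are hypothetical names, and each system is modeled as an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Server:
    hostname: str
    ip_address: str
    spec: str
    owner: str


@dataclass
class LegacyRegistry:
    """Hypothetical stand-in for one company-wide system (CMDB, IPDB, ...)."""
    name: str
    records: Dict[str, dict] = field(default_factory=dict)

    def register(self, key: str, record: dict) -> None:
        self.records[key] = record


def register_everywhere(server: Server, systems: Dict[str, LegacyRegistry]) -> List[str]:
    """Register a newly created server in every legacy system.

    Returns the names of the systems that were updated, in order.
    """
    payloads = {
        "cmdb": {"spec": server.spec},
        "ipdb": {"ip_address": server.ip_address},
        "monitoring": {"target": server.hostname},
        "login_authority": {"allowed_users": [server.owner]},
    }
    updated = []
    for name, payload in payloads.items():
        systems[name].register(server.hostname, payload)
        updated.append(name)
    return updated


systems = {n: LegacyRegistry(n) for n in ("cmdb", "ipdb", "monitoring", "login_authority")}
vm = Server("vm-001", "10.0.0.5", "2vCPU/4GB", "dev-team-1")
updated = register_everywhere(vm, systems)
print(updated)
```

In a real deployment each `register` call would be a request to the corresponding system, and partial failures would need retry or rollback handling that this sketch leaves out.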
  14. Challenge 2: New Network Architecture in our DC

    For scalability and operability, we introduced a Clos network architecture and terminate L3 on the hypervisor.
    (Diagram: previous architecture vs. new architecture.)
  15. Challenge 2: Support new architecture in OpenStack

    Network Controller (Neutron)
    • neutron-server
    • neutron-dhcp-agent
    • neutron-metadata-agent
    • neutron-linuxbridge-agent (OSS implementation): expects VMs to share an L2 network
    We want no VMs to share an L2 network, so the OSS agent is replaced with a new neutron-custom-agent.
  16. Challenge 3: Improve Customization for OSS

    • We have customized many OpenStack components: VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
    • Previously we simply piled one customization on top of another
    Diagram, VM (Nova): a specific upstream version is forked into the LINE version, which carries interleaved customization commits (for A, for C, for A, for B, for A).
    It is difficult to take one specific patch out of our customized OpenStack.
  17. Challenge 3: Improve Customization for OSS

    Before, VM (Nova): a specific upstream version forked into the LINE version, with interleaved customization commits (for A, for C, for A, for B, for A), maintained in git.
    After, VM (Nova): the base commit ID of a specific upstream version plus patch for A, patch for B, patch for C, maintained in git.
    • Don't fork / stop forking
    • Maintain only the patch files in git
    => A patch is easier to take out than before
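The patch-file approach above can be sketched as "base commit plus an ordered list of patch files". This is a minimal sketch under assumptions: the commit ID and patch file names are hypothetical, and the function only builds the `git` command lines (it does not run them).

```python
from typing import List


def build_commands(base_commit: str, patches: List[str]) -> List[List[str]]:
    """Return the git commands that rebuild the customized tree.

    The tree is always upstream at `base_commit` plus the listed patches,
    applied in order.
    """
    commands = [["git", "checkout", "--detach", base_commit]]
    for patch in patches:
        commands.append(["git", "apply", patch])
    return commands


# Hypothetical base commit and patch files, one file per customization.
BASE_COMMIT = "abc1234"
patches = [
    "patches/0001-feature-a.patch",
    "patches/0002-feature-b.patch",
    "patches/0003-feature-c.patch",
]

# Taking customization B out is a one-line change to the patch list.
without_b = [p for p in patches if "feature-b" not in p]

for cmd in build_commands(BASE_COMMIT, without_b):
    print(" ".join(cmd))
```

With a fork, removing B would mean reverting every interleaved commit that touched B; here it only means dropping one file, although later patches may still need a refresh if they depended on B.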
  18. The challenge changes from Day 1 to Day 2

    Day 1 (so far)
    • Develop user-facing features
      ◦ Keep the same experience as before (legacy system)
      ◦ Support the new architecture
    • Daily operation
      ◦ Predictable
      ◦ Unpredictable, driven by trouble
    Day 2 (from now)
    • Enhance operation
    • Optimize development
    • Reduce daily operation
      ◦ Predictable
      ◦ Unpredictable
  20. Future Challenge 1: Scale Emulation Environment

    Introduced: 2016
    Version: Mitaka + customization
    Number of clusters: 4+1 (WIP: Semi Public Cloud)
    Number of hypervisors: 1100+
    • Dev Cluster: 400
    • Prod Cluster: 600 (region 1)
    • Prod Cluster: 76 (region 2)
    • Prod Cluster: 80 (region 3)
    Number of VMs: 26000+
    • Dev Cluster: 15503
    • Prod Cluster: 8870 (region 1)
    • Prod Cluster: 335 (region 2)
    • Prod Cluster: 229 (region 3)
    The number of hypervisors keeps increasing, and we have faced:
    - Timing/scale-related errors
    - Some operations taking a long time
  21. Future Challenge 1: Scale Emulation Environment

    We need an environment that simulates scale, without preparing the same number of hypervisors, from the following points of view:
    • Database access
    • RPC over RabbitMQ
    These are control-plane-specific loads. We can use this environment to tune the OpenStack control plane.
  22. Future Challenge 1: Scale Emulation Environment

    • Implement fake agents (nova-compute, neutron-agent)
    • Use containers instead of actual HVs
    • Use the same environment for the control plane
    Diagram, real environment: the control plane orchestrates/manages 600 hypervisors (HV), each running nova-compute and neutron-agent.
    Diagram, scale environment: the control plane orchestrates/manages 600 fake HVs, each a docker container running nova-compute and neutron-agent on a shared server.
  23. Future Challenge 1: Scale Emulation Environment

    • Implement fake agents (nova-compute, neutron-agent)
    • Use containers instead of actual HVs
    • Use the same environment for the control plane
    Diagram: real environment with 600 HVs vs. scale environment with 600 fake HVs (docker containers running nova-compute and neutron-agent).
    It is easy to add a new fake HV => we can emulate any scale.
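The idea of loading the control plane with fake hypervisors can be sketched in a few lines. This is a toy model under assumptions, not the fake agent from the talk: `ControlPlane` and `FakeHypervisor` are hypothetical classes, and one heartbeat is assumed to cost one RPC message and one database write.

```python
from collections import Counter


class ControlPlane:
    """Stand-in for the control plane; it only counts the load it receives."""

    def __init__(self) -> None:
        self.load = Counter()
        self.known_hosts = set()

    def report_state(self, host: str) -> None:
        # Assumption: each state report is one RPC message and one DB write.
        self.load["rpc_messages"] += 1
        self.load["db_writes"] += 1
        self.known_hosts.add(host)


class FakeHypervisor:
    """Fake agent: reports to the control plane like a real one but runs no VMs."""

    def __init__(self, host: str, control_plane: ControlPlane) -> None:
        self.host = host
        self.control_plane = control_plane

    def heartbeat(self) -> None:
        self.control_plane.report_state(self.host)


def emulate(num_hypervisors: int, rounds: int) -> ControlPlane:
    """Run `rounds` heartbeat cycles for `num_hypervisors` fake hypervisors."""
    control_plane = ControlPlane()
    agents = [
        FakeHypervisor(f"fake-hv-{i:04d}", control_plane)
        for i in range(num_hypervisors)
    ]
    for _ in range(rounds):
        for agent in agents:
            agent.heartbeat()
    return control_plane


cp = emulate(num_hypervisors=600, rounds=3)
print(len(cp.known_hosts), cp.load["rpc_messages"], cp.load["db_writes"])
```

Changing `num_hypervisors` is the only step needed to emulate a larger cluster, which is the property the slide highlights.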
  24. Future Challenge 2: Communication Visualizing

    There are 2 types of communication among OpenStack software:
    • RESTful API (between components)
    • RPC over a messaging bus (inside a component)
    Diagram: Authentication (Keystone), VM (Nova), and Network (Neutron) are microservices that talk to each other over RESTful APIs; inside Neutron, neutron-server and neutron-agent talk over RPC.
  25. Future Challenge 2: Communication Visualizing

    Diagram: Authentication (Keystone), VM (Nova), and Network (Neutron) connected by RESTful APIs; neutron-server and neutron-agent connected by RPC.
    Any of these links can break at any time. Communication can fail:
    - because of scale
    - because of improper configuration
    Errors sometimes propagate from one component to another.
  26. Future Challenge 2: Communication Visualizing

    Diagram: Authentication (Keystone), VM (Nova), and Network (Neutron) connected by RESTful APIs; neutron-server and neutron-agent connected by RPC.
    Any of these links can break at any time. Communication can fail:
    - because of scale
    - because of improper configuration
    Errors sometimes propagate from one component to another.
    1. This kind of issue is very difficult to troubleshoot because
       - errors propagate from one component to another
       - logs do not always contain enough information
       - logs appear only when something happens
    2. Sometimes a problem can be predicted from metrics
       - how many RPCs were received
       - how many RPCs are waiting for a reply
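The two metrics named above can be sketched as a pair of counters. This is a minimal sketch with hypothetical names (`RpcMetrics`, `looks_stuck`) and an arbitrary threshold; it does not hook into any real messaging library.

```python
import threading


class RpcMetrics:
    """Thread-safe counters for the RPC traffic seen by one service."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.received = 0           # RPC requests received by this service
        self.waiting_for_reply = 0  # RPC calls sent whose reply has not arrived

    def on_received(self) -> None:
        with self._lock:
            self.received += 1

    def on_call_sent(self) -> None:
        with self._lock:
            self.waiting_for_reply += 1

    def on_reply(self) -> None:
        with self._lock:
            self.waiting_for_reply -= 1

    def snapshot(self) -> dict:
        with self._lock:
            return {
                "received": self.received,
                "waiting_for_reply": self.waiting_for_reply,
            }


def looks_stuck(snapshot: dict, max_waiting: int) -> bool:
    """Flag a service whose unanswered calls exceed an (assumed) threshold."""
    return snapshot["waiting_for_reply"] > max_waiting


metrics = RpcMetrics()
for _ in range(5):
    metrics.on_call_sent()
for _ in range(2):
    metrics.on_reply()
metrics.on_received()

snap = metrics.snapshot()
print(snap, looks_stuck(snap, max_waiting=2))
```

A growing `waiting_for_reply` value is visible before any error is logged, which is why such a metric can predict trouble that logs only show afterwards.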
  27. Future Challenge 2: Communication Visualizing

    Diagram: Authentication (Keystone), VM (Nova), and Network (Neutron) connected by RESTful APIs; neutron-server and neutron-agent connected by RPC.
    Monitoring tool: monitor communication-related metrics.
  28. Future Challenge 3: Containerize OpenStack

    Motivation/Current Pain Point
    • Complexity of packaging tools like RPM
      ◦ Dependencies between packages
      ◦ Configuration for new files
      => We need to build an RPM every time we change the code
    • Impossible to run different versions of OpenStack on the same server
      ◦ Dependency on OpenStack's common libraries
      => We have deployed many more control plane servers than we actually need
    • Lack of observability for all software running on the control plane
      ◦ In the deployment scripts (Ansible, Chef…) there is no way to tell which part installs dependent libraries and which part installs our software
      ◦ Deployment scripts don't track the software that runs after deployment
      ◦ We cannot notice when a developer runs a temporary script
  29. Future Challenge 3: Containerize OpenStack

    Diagram, current: an Ansible playbook per server installs libraries, installs software, and starts software; packages (nova-api, neutron-server, common-library) are fetched from an RPM server.
    Diagram, containerized: K8s manifests install and start software on the servers; container images (for example a nova-api container bundling nova-api and common-library) are fetched from a Docker Registry.
  30. Future Challenge 4: EventHub as a Platform

    OpenStack: VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
    Loadbalancer: L4LB, L7LB
    Kubernetes (Rancher)
    Storage: Block Storage (Ceph), Object Storage (Ceph)
    Database: Search/Analytics Engine (Elasticsearch), RDBMS (MySQL), KVS (Redis)
    Messaging (Kafka)
    Function (Knative)
    Baremetal
    Operation Tools
  31. Future Challenge 4: EventHub as a Platform

    OpenStack: VM (Nova), Image Store (Glance), Network Controller (Neutron), Identity (Keystone), DNS Controller (Designate)
    Loadbalancer: L4LB, L7LB
    Kubernetes (Rancher)
    Storage: Block Storage (Ceph), Object Storage (Ceph)
    Database: Search/Analytics Engine (Elasticsearch), RDBMS (MySQL), KVS (Redis)
    Messaging (Kafka)
    Function (Knative)
    Baremetal
    Operation Tools
    These components depend on each other. Some components and operation scripts want to do something:
    • when a user (actually a project) in Keystone is deleted
    • when a VM is created
    • when a RealServer is added to a load balancer
  32. Pub/Sub Concept in Microservice Architecture

    Components: Authentication Component, VM Component, Network Component, connected by a messaging bus (RabbitMQ)
    • Publish: each component publishes the important events of its own component
    • Subscribe: each component subscribes only to the events it is interested in
    A component can do something when an event it is interested in happens.
    A component doesn't have to consider which other components it needs to work with.
  33. Pub/Sub Concept in Microservice Architecture

    Components: Authentication Component, VM Component, Network Component, connected by a messaging bus
    • Publish: each component publishes the important events of its own component
    • Subscribe: each component subscribes only to the events it is interested in
    A component can do something when an event it is interested in happens.
    A component doesn't have to consider which other components it needs to work with.
    This mechanism allows us to extend the Private Cloud (microservices) in the future without changing existing code.
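The pub/sub idea above can be sketched with an in-memory bus. This is a minimal sketch: `MessageBus` and the topic name `vm.created` are hypothetical, and a Python dictionary stands in for RabbitMQ.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class MessageBus:
    """In-memory stand-in for the messaging bus."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = MessageBus()
log: List[str] = []

# The network component subscribes to VM events.
bus.subscribe("vm.created", lambda e: log.append(f"network: set up port for {e['vm']}"))
delivered_before = bus.publish("vm.created", {"vm": "vm-001"})

# A new subscriber is added later; the publishing side is not touched.
bus.subscribe("vm.created", lambda e: log.append(f"dns: add record for {e['vm']}"))
delivered_after = bus.publish("vm.created", {"vm": "vm-002"})

print(delivered_before, delivered_after)
print(log)
```

The publisher's call is identical before and after the DNS subscriber appears, which is the "extend without changing existing code" property.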
  34. Future Challenge 4: EventHub as a Platform

    This part of the notification logic has already been implemented in OpenStack, but...
    Diagram: Authentication Component (Keystone) and VM Component (Nova) publish events to the messaging bus (RabbitMQ). Operation Script A, Operation Script B, L7LB, and Kubernetes subscribe to events, and each of them contains both its own logic for accessing RabbitMQ and its business logic.
  35. Future Challenge 4: EventHub as a Platform

    This part of the notification logic has already been implemented in OpenStack, but...
    Diagram: Authentication Component (Keystone) and VM Component (Nova) publish events to the messaging bus (RabbitMQ). Operation Script A, Operation Script B, L7LB, and Kubernetes subscribe to events, and each of them contains both its own logic for accessing RabbitMQ and its business logic.
    • Sometimes the code for accessing RabbitMQ grows bigger than the actual business logic
    • Every component and script needs to implement that logic first
  36. Future Challenge 4: EventHub as a Platform

    We are currently developing a new component that lets us register a program together with the events it is interested in. It will make it easier to work with other components.
    Diagram: Authentication Component (Keystone) and VM Component (Nova) publish events to the messaging bus (RabbitMQ). A new Function as a Service component subscribes to events and holds the logic for accessing RabbitMQ once; Operation Script A, Operation Script B, L7LB, and Kubernetes contain only business logic.
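Registering a program against an event can be sketched as a hub that owns the message handling, so registered functions contain only business logic. This is a sketch under assumptions, not the component under development: `EventHub` and its methods are hypothetical, messages are assumed to be JSON with `event_type` and `payload` fields, and the event type strings are illustrative.

```python
import json
from typing import Callable, Dict, List

Function = Callable[[dict], str]


class EventHub:
    """Owns message decoding and routing; functions hold only business logic."""

    def __init__(self) -> None:
        self._functions: Dict[str, List[Function]] = {}

    def on(self, event_type: str) -> Callable[[Function], Function]:
        """Decorator that registers a function for one event type."""
        def decorator(func: Function) -> Function:
            self._functions.setdefault(event_type, []).append(func)
            return func
        return decorator

    def handle_raw(self, body: str) -> List[str]:
        """Decode one raw message and run every function registered for it."""
        message = json.loads(body)
        functions = self._functions.get(message["event_type"], [])
        return [func(message["payload"]) for func in functions]


hub = EventHub()


@hub.on("identity.project.deleted")
def cleanup_project(payload: dict) -> str:
    return f"clean up resources of project {payload['project_id']}"


@hub.on("compute.instance.create.end")
def register_vm(payload: dict) -> str:
    return f"register {payload['hostname']} in operation tools"


raw = json.dumps({
    "event_type": "compute.instance.create.end",
    "payload": {"hostname": "vm-001"},
})
results = hub.handle_raw(raw)
print(results)
```

Connection handling, acknowledgement, and retries would live in the hub next to `handle_raw`, so they are written once instead of once per script.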
  37. Further ahead: IaaS to PaaS, CaaS…

    We are currently trying to introduce an additional abstraction layer above IaaS.
    • https://engineering.linecorp.com/ja/blog/japan-container-days-v18-12-report/
    • https://www.slideshare.net/linecorp/lines-private-cloud-meet-cloud-native-world