LINE’s Journey; Road to 4 Million Cores in the Private Cloud

Masahito Muroi / Senior SWE, Manager / LINE
Mitsuhiro Tanino / Senior SWE / LINE
#ossummit

LINE Developers

December 21, 2022

Transcript

  1. LINE’s Journey; Road to 4 Million Cores in the Private Cloud
    Masahito Muroi / Senior SWE, Manager / LINE
    Mitsuhiro Tanino / Senior SWE / LINE
  2. About us
    • Masahito Muroi: Senior Software Engineer / Manager; working at LINE for over 3 years
    • Mitsuhiro Tanino: Senior Software Engineer; working at LINE for over 3 years
  3. Verda Private Cloud
    • Verda is the private cloud platform for LINE
    • Services shown: NAT, Load Balancer, VM / Baremetal, DNS, App engine (like Heroku), Image Repo, Shared Filesystem, Elasticsearch, Container, And More…
  4. Verda & LINE infra scale (as of Sep. 2022)
    • Virtual Machine: 100,000+
    • Baremetal server: 46,000+
    • Hypervisor: 7,600+
    • All Physical Servers: 70,000+
    • Peak of User Traffic: 3 Tbps+
  5. Service stack in Verda
    • Layers: IaaS, Managed Service, PaaS
    • Services: Identity, VM/Baremetal, Network, Image, DNS, Block Storage, Object Storage, Shared FS, Kubernetes, MySQL, Redis, Function, Elasticsearch, Kafka, CI/CD PIPE, Load Balancer, NAT
  6. Start up period (2016-2019)
    • LINE infra problems
      – Infra provisioning took too long
        • ~2 weeks to provision 1 VM
    • Verda solved
      – Opened the private cloud to LINE developers with a minimum API set
        • VM, Baremetal, Block Storage
    • Timeline: Start up (2016-2019), Expansion (2019-2021), New infra (2022-)
    • Before Verda: 1. App developers apply an infra request WF; 2. Infra team consults on the request details; 3. Infra team sets up the infra; 4. Infra team serves the configured infra (VMs, storage); 5. App developers start to set up apps
    • After Verda: 1. App developers create a resource by API/GUI; 2. Automatic provisioning (VMs, storage); 3. App developers start to set up apps
    • Infra team after Verda: a. Bulk resource management (HV); b. Administrate the Verda cloud
  7. Start up period (2016-2019)
    • OpenStack challenges
      – Start serving the common OpenStack API to developers ASAP
      – Focused on opening the API to developers
        • Minimum API set from OpenStack
        • Developed many LINE original APIs and components
          – Baremetal API
          – API filters
    • Culture changes
      – Verda changed the character of infra resources from a facility to an on-demand, API-manageable resource
        • App team view: less communication with the Infra team to convey infra demands
        • Infra team view: bulk facility management
  8. Expansion period (2019-2021)
    • Problems
      – LINE developers had to install the common middleware set by themselves
        • Infra preparation: 2 weeks → 10 mins
        • Middleware preparation: no change
    • Verda solved
      – Opened managed middleware service APIs
        • Kubernetes, MySQL, Redis, etc.
    • Flow before managed services: 1. App developers create infra resources by API/GUI; 2. Automatic provisioning (VMs, storage); 3. App developers request DB setup; 4. DBA installs DB middleware; 5. DBA starts monitoring and DB administration; 6. DBA serves the DB
    • Flow with managed services: 1. App developers create a DB resource by API/GUI; 2. Automatic DB resource provisioning (MySQL cluster on VMs); a. DBA starts monitoring and DB administration
  9. Expansion period (2019-2021)
    • OpenStack challenges
      – Opening managed services triggered rapid growth of the OpenStack scale
        • 1,400 hypervisors to 6,000 hypervisors in 2 years
        • The rapid growth required changes to the OpenStack deployment topology and tooling
      – Some open source OpenStack API plugins were not mature enough for a large-scale cluster
        • Kubernetes Cinder CSI plugin
        • Ansible Keystone user management plugin
    • Culture changes
      – LINE developers can focus on developing service applications
  10. New infra period (2022-)
    • Problems
      – Infra management skills depend heavily on each development team’s Verda knowledge
        • Standard infra management tools can’t use some Verda APIs
        • Some teams can use sophisticated infra management tools with Verda
        • Others rely on traditional manual operation
    • Diagram labels: App developers; Resource Provisioning tools; Verda with Baremetal API; VMs, MySQL cluster, Kubernetes cluster, PMs
    • Developer steps in the diagram: 1. Develop Verda original API set modules in the tool; 2. Develop application infra information
  11. New infra period (2022-)
    • Verda solved
      – Straighten the API stack in Verda to follow the de facto standard API set
    • Diagram labels: App developers; Resource Provisioning tools; OpenStack standard API; libvirt driver; baremetal driver; VMs, MySQL cluster, Kubernetes cluster, PMs
    • Developer step in the diagram: 1. Develop application infra information
  12. New infra period (2022-)
    • OpenStack challenges
      – Revisit the OpenStack API philosophy
        • Unified API to manage several types of backend resources
        • Renovated some API implementations
    • Culture change
      – Resolve tool silos in the application development teams
  13. Summary by period
    • Start up (2016-2019)
      – Verda realized: change of infra communication style
      – OpenStack challenge: open the IaaS API
    • Expansion (2019-2021)
      – Verda realized: change of middleware management style
      – OpenStack challenge: support 500% rapid growth in 3 years
    • New infra (2022-)
      – Verda realized: change of team knowledge gap
      – OpenStack challenge: straighten the API stack
  14. Lessons learned from the 7-year journey
    • Culture change made drastic improvements
    • Technical bottlenecks depend on the infra scale
    • The open source ecosystem is powerful
  15. Background
    • Verda provides VMs and Baremetal servers for application developers to host their applications
      – VM: we support the OpenStack-based IaaS management system
      – Baremetal: we supported an in-house server management system
    • Diagram labels: App developers; Resource Provisioning tools; Verda with OpenStack standard API (libvirt driver) and Baremetal API (baremetal management)
  16. Background
    • However, providing two different management systems caused problems
      – Developers need to understand two completely different types of API to automate VM and Baremetal operations
      – Verda operators always need to develop the same functionality for both OpenStack and the in-house management system, which increased our operation cost
    • We started a new project in 2020 to improve baremetal server management
  17. Requirements for baremetal server management
    • We had multiple requirements for the developers and Verda operators
    • For application developers
      – To provide unified APIs for multiple resources
      – To provide the same functionality for both VM and Baremetal servers
      – To provide a private stock management system
  18. Requirements for baremetal server management
    • For Verda operators
      – To reduce development, maintenance, and management cost as much as possible
      – To re-use existing strong hardware layer management systems
        • We already had hardware management systems for IPMI operation and OS installation, which were distributed to multiple data centers in multiple regions
  19. What we did to achieve the requirements
    • Developed a Nova compute driver for baremetal server management
      – We decided to develop Nova’s compute driver rather than using OpenStack Ironic
    • Implemented baremetal server stock management mechanisms for Nova
    • Provided a feature to distribute baremetal servers for HA purposes
    • Prepared a CI/CD pipeline to deploy nova compute services
  20. Introduction to baremetal compute driver
    • What is baremetal driver
    • Architecture
    • Deep dive to features
  21. What is baremetal compute driver
    • OpenStack Nova’s compute driver developed by LINE
    • The driver communicates with the physical server management system to build up baremetal servers
    • Basic operations like create, delete, rebuild, etc. are supported
    • Allows Verda users to create a new baremetal server from their pre-assigned stock
    • Diagram labels: App developers; Resource Provisioning tools; OpenStack standard API; libvirt driver; baremetal driver; VMs, MySQL cluster, Kubernetes cluster, PMs
  22. New architecture: summary of the baremetal instance creation flow
    1. A user requests a new baremetal instance via the dashboard
    2. Nova API receives the request
    3. Nova-scheduler picks a Nova compute service to launch the instance
    4. Baremetal nova-compute asks IPMI management to PXE-boot the baremetal server
    5. Nova-compute creates an OS install task and waits until it completes
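Steps 4 and 5 of the creation flow can be pictured with a small Python sketch. This is only an illustration under stated assumptions: `FakeIpmiManager`, `FakeOsInstaller`, `spawn_baremetal`, the image name, and the "ACTIVE"/"ERROR" result strings are all hypothetical stand-ins, not the actual LINE driver or any Nova API.

```python
import time


class FakeIpmiManager:
    """Hypothetical stand-in for the IPMI management system."""

    def __init__(self):
        self.pxe_requested = []

    def request_pxe_boot(self, server):
        # Record that the server was asked to PXE-boot.
        self.pxe_requested.append(server)


class FakeOsInstaller:
    """Hypothetical stand-in for the OS install system.

    A task reports completion on the N-th poll (N = polls_until_done).
    """

    def __init__(self, polls_until_done=3):
        self._polls = polls_until_done
        self._remaining = {}

    def create_task(self, server, image):
        task_id = f"install-{server}"
        self._remaining[task_id] = self._polls
        return task_id

    def is_done(self, task_id):
        self._remaining[task_id] -= 1
        return self._remaining[task_id] <= 0


def spawn_baremetal(server, image, ipmi, installer, poll_interval=0.0, max_polls=10):
    """Sketch of steps 4 and 5: PXE boot request, then wait for the OS install."""
    ipmi.request_pxe_boot(server)                   # step 4
    task_id = installer.create_task(server, image)  # step 5: create the task
    for _ in range(max_polls):                      # step 5: wait for completion
        if installer.is_done(task_id):
            return "ACTIVE"
        time.sleep(poll_interval)
    return "ERROR"  # gave up waiting (illustrative timeout handling)


ipmi = FakeIpmiManager()
print(spawn_baremetal("server_zzzz1", "example-image", ipmi, FakeOsInstaller()))
```

The polling loop with a bounded number of attempts is one simple way to "wait until completion"; the slides do not say how the real driver waits or handles timeouts.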
  23. Verda Dashboard - VM / Baremetal server management
    • "Instance" management view for developers
  24. Deep dive to features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver
  25. Stock management
    • Stock management with host aggregates
      – Two kinds of stock: public and private
      – Stocks are registered to Nova’s host aggregates
  26. Stock management
    • Private Stock example

    $ openstack aggregate show TEST_N3.small.metal.uuid12345
    +-------------------+----------------------------------------------------------------------------------------+
    | Field             | Value                                                                                  |
    +-------------------+----------------------------------------------------------------------------------------+
    | availability_zone | None                                                                                   |
    | created_at        | 2022-04-11T01:55:52.000000                                                             |
    | deleted           | False                                                                                  |
    | deleted_at        | None                                                                                   |
    | hosts             | server_zzzz1, server_zzzz2, server_zzzz3, server_zzzz4, server_zzzz5, server_zzzz6 …   |
    | id                | 317                                                                                    |
    | name              | TEST_N3.small.metal.uuid12345                                                          |
    | properties        | TEST_N3.small.metal='true', baremetal='true', filter_flavor_id='1234567'               |
    | updated_at        | None                                                                                   |
    +-------------------+----------------------------------------------------------------------------------------+
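To illustrate how a stock expressed as a host aggregate could gate which servers are eligible, here is a minimal Python sketch. It assumes, based only on the example output, that the `filter_flavor_id` property names the flavor the stock is reserved for and that `baremetal='true'` marks baremetal stock. The `Aggregate` class and `hosts_for_flavor` function are hypothetical, and this is not Nova scheduler code or LINE's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Aggregate:
    """Hypothetical in-memory stand-in for a Nova host aggregate."""
    name: str
    hosts: List[str]
    properties: Dict[str, str] = field(default_factory=dict)


def hosts_for_flavor(aggregates, flavor_id):
    """Return baremetal hosts whose aggregate is reserved for flavor_id.

    Assumption: an aggregate is eligible when baremetal == 'true' and
    filter_flavor_id equals the requested flavor id.
    """
    matched = set()
    for agg in aggregates:
        if agg.properties.get("baremetal") != "true":
            continue
        if agg.properties.get("filter_flavor_id") != flavor_id:
            continue
        matched.update(agg.hosts)
    return sorted(matched)


private_stock = Aggregate(
    name="TEST_N3.small.metal.uuid12345",
    hosts=["server_zzzz1", "server_zzzz2", "server_zzzz3"],
    properties={
        "TEST_N3.small.metal": "true",
        "baremetal": "true",
        "filter_flavor_id": "1234567",
    },
)
# A second aggregate with a made-up flavor id, to show non-matching stock.
other_stock = Aggregate(
    name="other-stock",
    hosts=["server_yyyy1"],
    properties={"baremetal": "true", "filter_flavor_id": "7654321"},
)

print(hosts_for_flavor([private_stock, other_stock], "1234567"))
```

Only the hosts of the aggregate whose `filter_flavor_id` matches the request are returned; how the real scheduler consumes these properties is not described in the slides.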
  27. Stock management
    • Workflow system for private stock management
    • Developers create a workflow (WF) request, then a Verda operator checks whether it is suitable
  28. Deep dive to features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver
  29. HA Group support with failure domain
    • Production VMs and Baremetal servers require high availability based on location
    • Verda supports
      – Multi-Regions
      – Multi-Availability Zones
      – Server rack level failure domain
    • Diagram labels: Region1, Region2; AZ1, AZ2, AZ3; Rack1, Rack2, Rack3 within the AZs; annotations "Available via HA Group" and "Request from stock WF"
    • Reference: https://logmi.jp/tech/articles/327491
  30. HA Group support with failure domain
    • HA Group
      – Allows users to deploy baremetal servers with HA based on failure domains
      – Supported policies (policy: summary)
        • Hard: multiple servers must be distributed to multiple failure domains
        • Soft: multiple servers will be distributed to multiple failure domains as much as possible
        • None: skip HA group
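The three policies can be sketched in Python as follows. This is a minimal illustration, assuming the failure domain is a server rack and that stock is given as a mapping from rack to free servers. The `place` function, `PlacementError`, and the rack and server names are hypothetical, and the round-robin strategy is a swapped-in simple technique, not the algorithm Verda uses.

```python
class PlacementError(Exception):
    """Raised when a request cannot be satisfied under the chosen policy."""


def place(servers_by_rack, count, policy):
    """Pick `count` servers from {rack: [servers]} under an HA policy.

    hard: every picked server comes from a different rack, otherwise fail.
    soft: spread over racks round-robin, reusing racks when necessary.
    none: ignore racks and take the first available servers.
    """
    if policy not in ("hard", "soft", "none"):
        raise ValueError(f"unknown policy: {policy}")

    # Work on copies so the caller's stock is not modified.
    racks = {rack: list(servers) for rack, servers in servers_by_rack.items()}

    if policy == "none":
        flat = [s for rack in sorted(racks) for s in racks[rack]]
        if len(flat) < count:
            raise PlacementError("not enough servers in stock")
        return flat[:count]

    non_empty = sum(1 for servers in racks.values() if servers)
    if policy == "hard" and non_empty < count:
        raise PlacementError("not enough distinct racks for the hard policy")

    picked = []
    while len(picked) < count:
        progressed = False
        for rack in sorted(racks):  # one server per rack per round
            if racks[rack] and len(picked) < count:
                picked.append(racks[rack].pop(0))
                progressed = True
        if not progressed:
            raise PlacementError("not enough servers in stock")
    return picked


stock = {"rack1": ["a1", "a2"], "rack2": ["b1"], "rack3": ["c1"]}
print(place(stock, 3, "hard"))
print(place(stock, 4, "soft"))
```

With three racks in stock, a hard request for three servers lands on three different racks, while a soft request for four reuses one rack instead of failing.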
  31. HA Group support with failure domain
    • Verda users can select an HA Group policy based on the requirements of their services
  32. HA Group support with failure domain
    • Example: the HA Group “Hard” policy distributed 5 baremetal servers to different failure domains (= server racks)
  33. Deep dive to features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver
  34. Deployment procedure: deployment flow
    1. An operator registers new server information in git
    2. Argo CD watches the repository and syncs the new change to Verda Kubernetes
    3. During the deployment job, the Ansible k8s module deploys the ConfigMap and Deployment
    4. The Ansible OpenStack module registers the servers to the host aggregate via nova-api
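Step 4 has to be safe to re-run every time Argo CD syncs, so the registration is naturally idempotent. The Python sketch below shows one way to compute only the missing registrations. It assumes the git repository yields a mapping from aggregate name to desired hosts and that the current aggregate membership can be read back. `plan_registrations` is a hypothetical helper, not part of Ansible, Argo CD, or Nova.

```python
def plan_registrations(desired, current):
    """Return (aggregate, host) pairs that still need to be registered.

    desired: {aggregate_name: [hosts]} as declared in git (assumed format).
    current: {aggregate_name: [hosts]} as currently known to Nova.
    Hosts that are already members are skipped, so re-running is a no-op.
    """
    plan = []
    for aggregate in sorted(desired):
        already = set(current.get(aggregate, []))
        for host in desired[aggregate]:
            if host not in already:
                plan.append((aggregate, host))
                already.add(host)  # also guards against duplicates in git
    return plan


desired = {"TEST_N3.small.metal.uuid12345": ["server_zzzz1", "server_zzzz2"]}
current = {"TEST_N3.small.metal.uuid12345": ["server_zzzz1"]}

for aggregate, host in plan_registrations(desired, current):
    print(f"would add {host} to {aggregate}")
```

Each planned pair would then be applied through nova-api by whatever module performs the actual call; removal of hosts is deliberately left out because the slides only mention registration.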