
LINE’s Journey; Road to 4 Million Cores in the Private Cloud

Masahito Muroi / Senior SWE, Manager / LINE
Mitsuhiro Tanino / Senior SWE / LINE
#ossummit

LINE Developers

December 21, 2022

Transcript

  1. Masahito Muroi / Senior SWE, Manager / LINE
    Mitsuhiro Tanino / Senior SWE / LINE
    #ossummit
    LINE’s Journey;
    Road to 4 Million Cores in the Private Cloud

  2. #ossummit
    About us
    Masahito Muroi
    • Senior Software Engineer / Manager
    • Working at LINE for over 3 years
    Mitsuhiro Tanino
    • Senior Software Engineer
    • Working at LINE for over 3 years

  3. #ossummit
    Verda Private Cloud
    • Verda is the private cloud platform for LINE
    • Services include:
    – NAT
    – Load Balancer
    – VM / Baremetal
    – DNS
    – App engine (like Heroku)
    – Image Repo
    – Shared Filesystem
    – Elasticsearch
    – Container
    – And More…

  4. #ossummit
    Verda & LINE infra scale
    Virtual Machine 100,000+
    Baremetal server 46,000+
    Hypervisor 7,600+
    All Physical Servers 70,000+
    Peak of User Traffic 3Tbps+
    As of Sep. 2022

  5. #ossummit
    Service stack in the Verda
    • IaaS: Identity, VM/Baremetal, Network, Image, DNS, Block Storage, Object Storage, Shared FS, Load Balancer, NAT
    • Managed Service: Kubernetes, MySQL, Redis, Elasticsearch, Kafka
    • PaaS: Function, CI/CD PIPE

  6. #ossummit
    • Start up (2016-2019)
    • Expansion (2019-2021)
    • New infra (2022-)

  7. #ossummit
    Start up period
    • LINE Infra Problems
    – Infra provisioning took too long
    • About 2 weeks to provision 1 VM
    • Verda's solution
    – Opened the private cloud to LINE developers with a minimum API set
    • VM, Baremetal, Block Storage
    Before Verda (App developers and Infra team)
    1. Apply infra request WF
    2. Consult the request details
    3. Set up the infra (VM, VM, Storage)
    4. Serve the configured infra
    5. Start to set up apps
    After Verda (App developers)
    1. Create a resource by API/GUI (VM, VM, Storage)
    2. Automatic provisioning
    3. Start to set up apps
    After Verda (Infra team, HV)
    a. Bulk resource management
    b. Administrate the Verda cloud

  8. #ossummit
    Start up period
    • OpenStack challenges
    – Serve the common OpenStack API to developers as soon as possible
    – Focused on opening the API to developers
    • Minimum API set from OpenStack
    • Developed many LINE-original APIs and components
    – Baremetal API
    – API filters
    • Culture changes
    – Verda changed the character of infra resources from a facility to an on-demand, API-manageable resource.
    • App team view: less communication with the Infra team to convey infra demands
    • Infra team view: bulk facility management

  9. #ossummit
    Expansion period
    • Problems
    – LINE developers had to install the common middleware set by themselves.
    • Infra preparation: 2 weeks → 10 mins
    • Middleware preparation: no change
    • Verda's solution
    – Opened managed middleware service APIs
    • Kubernetes, MySQL, Redis, etc.
    Before managed services (App developers and DBA)
    1. Create infra resources by API/GUI (VM, VM, Storage)
    2. Automatic provisioning
    3. Request DB setup
    4. Install DB middleware
    5. Start monitoring and DB administration
    6. Serve the DB
    After managed services (App developers and DBA)
    1. Create DB resource by API/GUI
    2. Automatic DB resource provisioning (MySQL cluster on VM, VM)
    a. Start monitoring and DB administration

  10. #ossummit
    Expansion period
    • OpenStack challenges
    – Opening managed services triggered rapid growth of the OpenStack scale.
    • 1,400 hypervisors to 6,000 hypervisors in 2 years
    • The rapid growth required changes to the OpenStack deployment topology and tooling
    – Some open source OpenStack API plugins were not mature enough for large-scale clusters.
    • Kubernetes Cinder CSI plugin
    • Ansible Keystone user management plugin
    • Culture changes
    – LINE developers can focus on developing service applications.

  11. #ossummit
    New infra period
    • Problems
    – Infra management skill depends heavily on each development team’s Verda knowledge
    • Standard infra management tools can’t use some Verda APIs
    • Some teams can use sophisticated infra management tools with Verda
    • Others rely on traditional manual operation.
    Diagram: App developers → Resource Provisioning tools → Verda
    1. Develop Verda original API set modules in the tool
    2. Develop application infra information
    Verda resources: MySQL cluster (VM, VM), Kubernetes cluster (VM, VM), PM, PM (Baremetal API)

  12. #ossummit
    New infra period
    • Verda's solution
    – Straighten the API stack in Verda to follow the de facto standard API set
    Diagram: App developers → Resource Provisioning tools → OpenStack standard API → Verda
    1. Develop application infra information
    Drivers behind the OpenStack standard API: libvirt driver, baremetal driver
    Verda resources: MySQL cluster (VM, VM), Kubernetes cluster (VM, VM), PM, PM

  13. #ossummit
    New infra period
    • OpenStack challenges
    – Revisit the OpenStack API philosophy
    • Unified API to manage several types of backend resources
    • Renovated some API implementations
    • Culture change
    – Resolve tool silos across application development teams

  14. #ossummit
    Start up (2016-2019)
    • Verda realized: Change Infra communication style
    • OpenStack challenge: Open the IaaS API
    Expansion (2019-2021)
    • Verda realized: Change middleware management style
    • OpenStack challenge: Support 500% rapid growth in 3 years
    New infra (2022-)
    • Verda realized: Close the team knowledge gap
    • OpenStack challenge: Straighten API stack

  15. #ossummit
    Lessons learned from the 7-year journey
    • Culture change made drastic improvements
    • Technical bottlenecks depend on the infra scale
    • The open source ecosystem is powerful

  16. Introduction to baremetal server management in Verda

  17. #ossummit
    Background
    • Verda provides VMs and Baremetal servers for application developers to host their applications
    – VM: we support an OpenStack-based IaaS management system
    – Baremetal: we supported an in-house server management system
    Diagram: App developers → Resource Provisioning tools → Verda
    • OpenStack standard API → libvirt driver
    • Baremetal API → baremetal management

  18. #ossummit
    Background
    • However, providing two different management systems caused problems:
    – Developers needed to understand two completely different types of API to automate VM and Baremetal operations
    – Verda operators always needed to develop the same functionality for both OpenStack and the in-house management system. This increased our operation cost.
    • We started a new project in 2020 to improve baremetal server management.

  19. #ossummit
    Requirements for baremetal server management
    • We had multiple requirements, for both developers and Verda operators
    • For application developers
    – To provide unified APIs for multiple resources
    – To provide the same functionality for both VM and Baremetal server
    – To provide a private stock management system

  20. #ossummit
    Requirements for baremetal server management
    • For Verda operators
    – To reduce development, maintenance, and management cost as much as possible
    – To reuse our existing, robust hardware-layer management systems
    • We already had hardware management systems for IPMI operation and OS installation, distributed across multiple data centers in multiple regions

  21. #ossummit
    What we did to achieve the requirements
    • Developed a Nova compute driver for baremetal server management
    – We decided to develop our own Nova compute driver rather than use OpenStack Ironic
    • Implemented baremetal server stock management mechanisms for Nova
    • Provided a feature to distribute baremetal servers for HA purposes
    • Prepared a CI/CD pipeline to deploy nova-compute services

  22. #ossummit
    Introduction to baremetal compute driver
    • What is the baremetal driver
    • Architecture
    • Deep dive into features

  23. #ossummit
    What is the baremetal compute driver
    • An OpenStack Nova compute driver developed by LINE
    • The driver communicates with the physical server management system to build up baremetal servers
    • Basic operations such as create, delete, and rebuild are supported
    • Allows Verda users to create a new baremetal server from their pre-assigned stock
    Diagram: App developers → Resource Provisioning tools → OpenStack standard API → Verda
    Drivers behind the OpenStack standard API: libvirt driver, baremetal driver
    Verda resources: MySQL cluster (VM, VM), Kubernetes cluster (VM, VM), PM, PM

  24. #ossummit
    New architecture
    Summary of baremetal instance creation flow:
    1. A user requests creation of a new baremetal instance via the dashboard
    2. Nova API receives the request
    3. Nova-scheduler picks a Nova compute to launch the instance
    4. Baremetal nova-compute asks IPMI management to PXE boot the baremetal server
    5. Nova-compute creates an OS install task and waits until it completes
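
Steps 4 and 5 of the creation flow can be sketched roughly as follows. This is only an illustration, not LINE's driver code: `FakeIpmiManager`, `FakeOsInstaller`, `spawn_baremetal`, and the returned state strings are hypothetical names invented here, and the real IPMI and OS installation systems are replaced with in-memory stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIpmiManager:
    """Hypothetical stand-in for the IPMI management system."""
    pxe_requests: list = field(default_factory=list)

    def request_pxe_boot(self, server: str) -> None:
        # Record the request instead of talking to real hardware.
        self.pxe_requests.append(server)


@dataclass
class FakeOsInstaller:
    """Hypothetical stand-in for the OS installation system."""
    polls_until_done: int = 2
    _polls: dict = field(default_factory=dict)

    def create_task(self, server: str, image: str) -> str:
        task_id = f"install-{server}"
        self._polls[task_id] = 0
        return task_id

    def is_done(self, task_id: str) -> bool:
        # Pretend the install finishes after a fixed number of polls.
        self._polls[task_id] += 1
        return self._polls[task_id] >= self.polls_until_done


def spawn_baremetal(server, image, ipmi, installer, max_polls=10):
    """Step 4: request PXE boot. Step 5: create an OS install task and wait."""
    ipmi.request_pxe_boot(server)
    task_id = installer.create_task(server, image)
    for _ in range(max_polls):
        # A real driver would sleep between polls; omitted in this sketch.
        if installer.is_done(task_id):
            return "ACTIVE"
    return "ERROR"


if __name__ == "__main__":
    state = spawn_baremetal("server_zzzz1", "example-image",
                            FakeIpmiManager(), FakeOsInstaller())
    print(state)
```

The bounded polling loop is one simple way to express "wait until the completion" while still reporting an error state if the install never finishes.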

  25. #ossummit
    Verda Dashboard - VM / Baremetal server management
    • "Instance" management view for developers

  26. #ossummit
    Deep dive into features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver

  27. #ossummit
    Stock management
    • Stock management with host aggregates
    – Two kinds of stock: public and private
    – Stocks are registered in Nova host aggregates

  28. #ossummit
    Stock management
    • Private Stock example
    $ openstack aggregate show TEST_N3.small.metal.uuid12345
    +-------------------+--------------------------------------------------------------------------------------+
    | Field             | Value                                                                                |
    +-------------------+--------------------------------------------------------------------------------------+
    | availability_zone | None                                                                                 |
    | created_at        | 2022-04-11T01:55:52.000000                                                           |
    | deleted           | False                                                                                |
    | deleted_at        | None                                                                                 |
    | hosts             | server_zzzz1, server_zzzz2, server_zzzz3, server_zzzz4, server_zzzz5, server_zzzz6 … |
    | id                | 317                                                                                  |
    | name              | TEST_N3.small.metal.uuid12345                                                        |
    | properties        | TEST_N3.small.metal='true', baremetal='true', filter_flavor_id='1234567'             |
    | updated_at        | None                                                                                 |
    +-------------------+--------------------------------------------------------------------------------------+
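
The slides do not describe how the scheduler consumes these aggregate properties. As a minimal sketch, assuming that `filter_flavor_id` restricts a private stock to one flavor and that `baremetal='true'` marks baremetal aggregates, candidate hosts could be selected like this. The function `candidate_hosts` and the dictionary layout are hypothetical and are not part of Nova or of LINE's implementation.

```python
def candidate_hosts(aggregates, flavor_id):
    """Return hosts from baremetal aggregates whose filter_flavor_id matches.

    Assumption: aggregate properties are plain string key/value pairs,
    as shown in the `openstack aggregate show` output.
    """
    hosts = []
    for aggregate in aggregates:
        props = aggregate["properties"]
        if props.get("baremetal") != "true":
            continue  # not a baremetal stock
        if props.get("filter_flavor_id") != str(flavor_id):
            continue  # stock is reserved for a different flavor
        hosts.extend(aggregate["hosts"])
    return hosts


# Example data modelled on the private stock shown above (host list shortened).
private_stock = {
    "name": "TEST_N3.small.metal.uuid12345",
    "hosts": ["server_zzzz1", "server_zzzz2", "server_zzzz3"],
    "properties": {
        "TEST_N3.small.metal": "true",
        "baremetal": "true",
        "filter_flavor_id": "1234567",
    },
}

if __name__ == "__main__":
    print(candidate_hosts([private_stock], 1234567))
```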

  29. #ossummit
    Stock management
    • Workflow (WF) system for private stock management
    • Developers create a WF request, then a Verda operator checks whether it is suitable

  30. #ossummit
    Deep dive into features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver

  31. #ossummit
    HA Group support with failure domain
    • Production VMs and Baremetal servers require high availability based on location
    • Verda supports
    – Multi-Regions
    – Multi-Availability Zones
    – Server rack level failure domain
    Diagram: failure domain hierarchy
    • Regions: Region1, Region2
    • Availability zones: AZ1, AZ2, AZ3
    • Racks per AZ: Rack1, Rack2, Rack3 / Rack1, Rack2 / Rack1, Rack2
    • Labels: "Available via HA Group", "Request from stock WF"
    Reference: https://logmi.jp/tech/articles/327491

  32. #ossummit
    HA Group support with failure domain
    • HA Group
    – Allows users to deploy baremetal servers with HA based on failure domains
    – Supported policies
    Policy   Summary of the policy
    Hard     Multiple servers must be distributed to multiple failure domains
    Soft     Multiple servers will be distributed to multiple failure domains as much as possible
    None     Skip HA group
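
The three policies can be illustrated with a small placement sketch. This is not LINE's scheduler logic: `place_with_ha` is a hypothetical function, racks are used as the failure domain, and "Hard" is interpreted here as "every server lands in a different rack", which is an assumption based on the policy summary.

```python
def place_with_ha(count, free_by_rack, policy):
    """Pick `count` servers from `free_by_rack` ({rack: [server, ...]}).

    hard: each server must come from a different rack, otherwise fail.
    soft: spread across racks round-robin, reusing racks when needed.
    none: ignore racks and take the first free servers.
    """
    policy = policy.lower()
    if policy not in {"hard", "soft", "none"}:
        raise ValueError(f"unknown HA group policy: {policy}")

    # Copy the lists so the caller's inventory is not modified.
    pools = {rack: list(servers)
             for rack, servers in free_by_rack.items() if servers}

    if policy == "none":
        flat = [s for servers in pools.values() for s in servers]
        if len(flat) < count:
            raise RuntimeError("not enough free servers")
        return flat[:count]

    if policy == "hard":
        if len(pools) < count:
            raise RuntimeError("not enough failure domains for hard policy")
        return [servers[0] for servers in list(pools.values())[:count]]

    # soft: take one server per rack per round until enough are chosen
    chosen = []
    while len(chosen) < count:
        progressed = False
        for servers in pools.values():
            if servers and len(chosen) < count:
                chosen.append(servers.pop(0))
                progressed = True
        if not progressed:
            raise RuntimeError("not enough free servers")
    return chosen


if __name__ == "__main__":
    free = {"rack1": ["a1", "a2"], "rack2": ["b1"], "rack3": ["c1"]}
    print(place_with_ha(3, free, "hard"))
    print(place_with_ha(4, free, "soft"))
```

With three racks available, a "hard" request for four servers fails, while the same request under "soft" succeeds by placing two servers in one rack.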

  33. #ossummit
    HA Group support with failure domain
    • Verda users can select an HA Group policy based on the requirements of their services

  34. #ossummit
    HA Group support with failure domain
    • Example: the HA Group “Hard” policy distributed 5 baremetal servers to different failure domains (= server racks)

  35. #ossummit
    Deep dive into features
    • Stock management
    • HA group support
    • Deployment procedure of baremetal driver

  36. #ossummit
    Deployment procedure
    Deployment flow:
    1. An operator registers new server information in git
    2. Argo CD watches the repository and syncs the new change to Verda Kubernetes
    3. During the deployment job, the Ansible k8s module deploys the ConfigMap and Deployment
    4. The Ansible OpenStack module registers the servers to a host aggregate via nova-api
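
The last step, registering servers to host aggregates, is the kind of operation that benefits from being idempotent when driven from git. The sketch below assumes a hypothetical inventory format (aggregate name mapped to host names) and a hypothetical helper `hosts_to_register`; it only computes which hosts are still missing and does not call Ansible or nova-api.

```python
def hosts_to_register(inventory, current_aggregates):
    """Compute which hosts still need to be added to each aggregate.

    inventory:           {aggregate_name: [host, ...]} as tracked in git
                         (hypothetical format)
    current_aggregates:  {aggregate_name: [host, ...]} as currently registered
    Returns only aggregates with at least one missing host, so running the
    deployment job twice produces an empty plan the second time.
    """
    plan = {}
    for aggregate, wanted in inventory.items():
        existing = set(current_aggregates.get(aggregate, []))
        missing = [host for host in wanted if host not in existing]
        if missing:
            plan[aggregate] = missing
    return plan


if __name__ == "__main__":
    inventory = {"TEST_N3.small.metal.uuid12345":
                 ["server_zzzz1", "server_zzzz2", "server_zzzz3"]}
    current = {"TEST_N3.small.metal.uuid12345": ["server_zzzz1"]}
    print(hosts_to_register(inventory, current))
```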
