A beginner's journey of operating production-level Private Cloud using OpenStack

Presentation materials at Cloud Operator Days Tokyo 2023

LINE Developers

August 21, 2023

Transcript

  1. HELLO!
    A BEGINNER’S JOURNEY OF OPERATING
    PRODUCTION-LEVEL PRIVATE CLOUD
    USING OPENSTACK
    BHARADWAJ ANUPINDI
    LINE CORPORATION, TOKYO
    NISHA BRAHMANKAR
    LINE CORPORATION, TOKYO
    CLOUD OPERATOR DAYS
    TOKYO 2023

  2. TOPICS TO TALK ABOUT
    • Beginner’s view of OpenStack: overview of cloud computing and OpenStack.
    • LINE’s cloud & first task: overview of LINE's private cloud, VERDA, and our first task to play with it.
    • Our work: brief details about our work, including nova features and l2isolate containerization.
    • Challenges: challenges faced as beginners working with OpenStack and a large production cloud.
    • Current project: concise details about the OpenStack upgrade project that we are working on.

  3. What is Cloud Computing?
    BEGINNER’S VIEW OF OPENSTACK
    BEFORE CLOUD
    Idea -> Hardware Requirement -> Cost & Time -> Work on Idea
    WHAT IS CLOUD?
    ➢ Cloud computing is running and managing workloads within clouds.
    ➢ Clouds are environments that abstract, pool, and share scalable resources
    (memory, network, storage, etc.) across the internet.
    WHY CLOUD?
    Instant, Cost Efficient, Scalable, Reliable, Security

  4. Layers of Cloud Computing
    BEGINNER’S VIEW OF OPENSTACK
    Each layer is drawn as a stack of APPLICATIONS, MIDDLE, and SERVERS.
    Layer | Typical users       | Analogy
    IaaS  | IT Admins           | Leased Car
    PaaS  | Software Developers | Taxi / Uber
    SaaS  | End Users           | Bus

  5. OpenStack and its scale
    BEGINNER’S VIEW OF OPENSTACK
    WHY OPENSTACK?
    OpenStack is a cloud operating system that controls large pools of compute, storage, and
    networking resources, all managed and provisioned through APIs.
    What runs on OpenStack? Auto Industry, Energy

  6. OpenStack Components
    BEGINNER’S VIEW OF OPENSTACK

  7. VERDA: LINE’s PRIVATE CLOUD
    LINE’S CLOUD & FIRST TASK

  8. VERDA: LINE’s PRIVATE CLOUD
    LINE’S CLOUD & FIRST TASK
    AS OF JAN 2023
    • Physical Servers: 70,000+
    • Baremetal Servers: 46,000+
    • Hypervisors: 10,000+
    • Virtual Machines: 100,000+

  9. VERDA: LINE’s PRIVATE CLOUD
    LINE’S CLOUD & FIRST TASK
    Minimum API set for LINE developers

  10. First Task: Play with Personal Verda
    LINE’S CLOUD & FIRST TASK
    • Beginners need an OpenStack playground
    • Multiple options already exist:
      ➢ Devstack:
        • Quick setup of an OpenStack environment
        • Abstracts details from the user
        • Not suitable for testing custom features
    • What we do at LINE (Personal Verda):
      • Use ansible to create personal OpenStack clusters
      • An automated script uses the dev OpenStack cluster to create smaller clusters
    • Uses of Personal Verda:
      • Beginner's playground
      • Personal test environment before staging or dev/production
    dev cluster hypervisors -> dev cluster VMs -> (configured) personal cluster hypervisors -> personal cluster VMs

  11. 1. Features in nova
    OUR WORK
    A. User-script feature for PM (Physical Machine)
    - OpenStack nova provides a user-script feature for VMs
      - nova-metadata reads the script from the nova DB
      - cloud-init runs this script when the VM boots
    - We added the same feature for PM in LINE's original nova-baremetal driver
      - using LINE’s custom APIs
    $ openstack server create --image "CentOS 7.9" --user-data userscript.sh testvm
    B. Schedule VM on specified aggregate
    - Specify the target aggregate in the OpenStack VM create command
    - Adds a new nova scheduling filter
      - It keeps only the hosts in the specified aggregate
      - It matches the aggregate key-value pair based on the hint
    $ openstack server create --image "CentOS 7.9" --availability-zone nova --hint aggregate_id=23 testvm
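
As an illustration of the user-script feature, the file passed with --user-data could look like the sketch below. The slide does not show the contents of userscript.sh, so everything in it is hypothetical; the only behavior relied on is that cloud-init treats user data starting with "#!" as a script and runs it on first boot. The marker path is an assumption chosen for the example.

```shell
#!/bin/sh
# Hypothetical userscript.sh: cloud-init runs user data that starts with "#!"
# as a script, once, on the instance's first boot.
# The marker path is an assumption for illustration only.
MARKER="${MARKER:-/tmp/userscript-ran}"

# Leave a trace so an operator can confirm that the script actually ran.
printf 'userscript ran on %s\n' "$(uname -n)" > "$MARKER"
```

After the machine boots, the presence of the marker file is a quick way to confirm that the script was delivered and executed.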

  12. 2. Neutron Agent Containerization
    OUR WORK
    • Many OpenStack services run on the same hypervisor
    • Running them as containerized processes (using docker or podman) avoids package dependency
    clashes among them
    • At LINE, we containerized our neutron-agent (l2isolate agent) using podman
    • This enables us to use different package versions independently

  13. Keystone Authentication
    CHALLENGES
    • Challenge: multiple environments to work with
      - No. of test environments: 5
      - No. of production environments: 4
      - No. of development environments: 1
    • Multiple openrc (authentication URL) files are needed to authenticate against the keystone regions of
    each environment.
    • We need to source different regions' openrc files quickly.
    Some tricks and solutions:
    • Some shell (bash/zsh) config setup and tricks can help:
      • Change the prompt color and display the env/region name
      • Add the line source ~/.bashrc to the end of each region's openrc file
      • Enable shell autocomplete and autosuggestions by adding
        bind 'set show-all-if-ambiguous on'
        bind 'TAB:menu-complete'
    • Alternatively, third-party tools can manage multiple openrc files and provide a
    user-friendly interface for fast context switching, for example rally.
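
A minimal sketch of the prompt trick for ~/.bashrc, assuming each openrc file exports the standard OS_REGION_NAME variable. The function name region_color, the convention that production region names contain "prod", and the color choices are all assumptions for illustration. Because each openrc file ends with source ~/.bashrc, the prompt is rebuilt every time a different region is sourced.

```shell
# region_color REGION: pick an ANSI color code for the prompt.
# Assumes production region names contain "prod" (hypothetical convention).
region_color() {
    case "$1" in
        *prod*) echo 31 ;;  # red: production
        *)      echo 32 ;;  # green: test and development
    esac
}

region="${OS_REGION_NAME:-no-region}"
# \[ and \] mark the color codes as non-printing so bash line editing stays aligned.
PS1="\[\033[$(region_color "$region")m\][$region]\[\033[0m\] \u@\h:\w\\$ "
```

With this in place, sourcing a production openrc turns the region tag red, which makes it harder to run a command against the wrong environment by accident.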

  14. How do we deploy?
    CHALLENGES
    • Huge number of compute nodes (~10k hypervisors)
    • It takes a few hours to deploy to the compute nodes (for nova-compute or neutron-agent deployments)
    • Use smaller playbooks/handlers instead of running the entire task file in ansible
      • For example, if you only need to restart the neutron-agents, run only the playbook restart-agent.yml
    • The hypervisors are divided into host subgroups in the ansible hosts file
      • This enables us to deploy the same task to the hypervisor groups in parallel
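
One way to run the same small playbook against several host subgroups at once is sketched below. The slides do not show the actual command, so this is only an assumption of how it could be done: the function name, the PLAYBOOK_CMD variable, the inventory file name, and the group names are hypothetical, while -i and --limit are standard ansible-playbook options.

```shell
# Command used to run a playbook; overridable, e.g. for a dry run.
PLAYBOOK_CMD="${PLAYBOOK_CMD:-ansible-playbook}"

# run_on_groups PLAYBOOK INVENTORY GROUP...
# Starts one playbook run per host subgroup in the background, waits for all
# of them, and returns non-zero if any run failed.
run_on_groups() {
    playbook="$1"
    inventory="$2"
    shift 2
    pids=""
    for group in "$@"; do
        "$PLAYBOOK_CMD" -i "$inventory" --limit "$group" "$playbook" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

For example, `run_on_groups restart-agent.yml hosts.ini hv_group_1 hv_group_2 hv_group_3` would restart the agents on three subgroups concurrently. The output of the runs is interleaved, so redirecting each run to its own log file may be preferable in practice.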

  15. ERROR/OUTAGE Handling
    CHALLENGES
    • Error: openstack server show gives the error "failed to allocate network"
      • Get the tap interface ID (tapXXXX) of the VM using the command:
        • neutron port-list --device-id=
      • The first 11 characters of the port ID become the XXXX in tapXXXX above.
      • Check the tap device on the compute node: ip addr | grep
    • The tap interface failed to be created on the compute node. Run the following checks:
      1) check whether the neutron-agent is running
        - systemctl status
      2) check whether too many interfaces are already present; some hardware imposes an upper limit
      3) check that the iproute, iptables, and ipset packages are installed
        - yum list | grep (on RHEL)
      4) check the neutron-agent logs and search for ERROR messages
    • nova-api failed to receive any VIF attachment info from the neutron-server:
      • Check the messaging driver being used and whether messages are moving from neutron-server to nova-api
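
The tap-device lookup can be scripted. The sketch below assumes upstream Neutron naming, where the device name is capped at 14 characters, so "tap" is followed by the first 11 characters of the port ID; the length is an optional argument in case a custom agent truncates differently. The function names are hypothetical.

```shell
# tap_name PORT_ID [LEN]: derive the tap device name from a Neutron port ID.
# LEN defaults to 11 characters (assumption based on upstream Neutron naming).
tap_name() {
    printf 'tap%s\n' "$(printf '%s' "$1" | cut -c1-"${2:-11}")"
}

# tap_exists PORT_ID: succeed if the tap device is present on this compute node.
tap_exists() {
    ip link show "$(tap_name "$1")" >/dev/null 2>&1
}
```

On a compute node, `tap_exists <port-id> || echo "tap device missing"` gives a quick yes/no answer before digging into the agent logs.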

  16. ERROR/OUTAGE Handling
    CHALLENGES
    • Error/Issue: IP is assigned but not pingable
      1) check the status of the DHCP server (like dnsmasq):
        - check that the dnsmasq/dhcp-agent process is running and is assigning the IP
        - use tcpdump to check the communication through the tap interface:
          - tcpdump -ni
      2) check whether the security group rules allow inbound traffic on the compute node:
        - if not, create a new SG rule for inbound traffic to VMs using:
          - openstack security group rule create --options
    • Some useful tools and commands:
      - ip addr | grep
      - route -n
      - iptables -L
      - cat /etc/sysconfig/iptables
      - iptables -nL
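
The slide leaves the tcpdump interface and the security group options as placeholders. The sketch below shows one plausible way to fill them in, assuming DHCP traffic on UDP ports 67 and 68 and an ingress ICMP rule that accepts ping from any source. The function names and the wide-open 0.0.0.0/0 source are assumptions for illustration, not the authors' settings.

```shell
# watch_dhcp TAP_DEVICE: print DHCP traffic (UDP ports 67 and 68) on the tap interface.
watch_dhcp() {
    tcpdump -ni "$1" 'udp port 67 or udp port 68'
}

# allow_ping SECURITY_GROUP: add an ingress ICMP rule so the VMs answer ping.
allow_ping() {
    openstack security group rule create \
        --ingress --protocol icmp --remote-ip 0.0.0.0/0 "$1"
}
```

In a production cloud the remote IP range would normally be narrowed to the networks that really need to reach the VMs.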

  17. Upgrade OpenStack to Zed version
    CURRENT PROJECT
    Old version:
    • Release Date: Old
    • Reached EOL
    • Compatible with python 2.7 (EOL: 2020)
    Zed version:
    • Release Date: Oct 2022
    • Maintained
    • Compatible with python 3
    HOW TO ACHIEVE?
    ➢ Test setup of the Zed version for a comfortable switch
    ➢ Install and set up an e2e multinode OpenStack environment with the Zed version

  18. Upgrade OpenStack to Zed version
    CURRENT PROJECT
    Steps:
    1) Zed upstream code (with custom patches)
    2) All unit tests & functional tests pass
    3) Download repos and install the services
    4) Modify configuration files for Zed compatibility
    5) Ensure e2e API correctness using the CLI
    Controller Node: Nova-api, Nova-conductor, Nova-scheduler, Nova-novncproxy, Neutron-server, Glance-api, Placement-api
    Compute Node: Nova-compute, Neutron-agent
    Configuration files: Nova.conf, Neutron.conf, Glance.conf, Placement.conf

  19. Challenges:
    CURRENT PROJECT: UPGRADE OPENSTACK
    • Config parameters deprecated in many of the services' configs:
      • rabbit_hosts, rpc_backend, allow_overlapping_ips, dnsmasq_dns_server, secure_proxy_ssl_header
    • Required networking packages must be installed, such as:
      • dnsmasq-utils
      • ipset
      • conntrack
    • Linux kernel version (compatibility with networking):
      • Some kernel parameters, like nf_hooks_lwtunnel, were introduced in Linux kernel v5.15 or above
      • The nf_conntrack module is needed on the compute nodes to enable network tunneling (libvirtd installs this module
      automatically)
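
Both checks lend themselves to a small pre-upgrade script. The sketch below is an assumption about how that could look, not the authors' tooling: the function names are hypothetical, the option list is copied from the slide, and sort -V requires GNU coreutils.

```shell
# Option names listed on the slide as deprecated.
DEPRECATED_OPTS='rabbit_hosts rpc_backend allow_overlapping_ips dnsmasq_dns_server secure_proxy_ssl_header'

# find_deprecated_opts FILE...: print file:line:text for every uncommented use.
find_deprecated_opts() {
    pattern="^[[:space:]]*($(printf '%s' "$DEPRECATED_OPTS" | tr ' ' '|'))[[:space:]]*="
    grep -nHE "$pattern" "$@"
}

# kernel_at_least WANTED [CURRENT]: succeed if CURRENT (default: the running
# kernel) is at least WANTED, using version-aware sorting.
kernel_at_least() {
    current="${2:-$(uname -r)}"
    [ "$(printf '%s\n%s\n' "$1" "$current" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `find_deprecated_opts /etc/nova/nova.conf /etc/neutron/neutron.conf` lists the lines that need attention, and `kernel_at_least 5.15 || echo "kernel too old for nf_hooks_lwtunnel"` flags hosts that need a newer kernel.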

  21. Contacts
    BHARADWAJ ANUPINDI
    LINKEDIN
    @avs-bharadwaj-530ba6147
    NISHA BRAHMANKAR
    LINKEDIN
    @nisha-brahmankar-04

  22. THANKS!