
A beginner's journey of operating production-level Private Cloud using OpenStack

Presentation materials at Cloud Operator Days Tokyo 2023

LINE Developers

August 21, 2023

Transcript

  1. HELLO! A BEGINNER'S JOURNEY OF OPERATING PRODUCTION-LEVEL PRIVATE CLOUD USING OPENSTACK
     Bharadwaj Anupindi, LINE Corporation, Tokyo
     Nisha Brahmankar, LINE Corporation, Tokyo
     Cloud Operator Days Tokyo 2023
  2. TOPICS TO TALK ABOUT
     • Beginner's view of OpenStack: overview of cloud computing and OpenStack.
     • LINE's cloud & first task: overview of LINE's private cloud, VERDA, and our first task to play with it.
     • Our work: brief details about our work, including nova features and l2isolate containerization.
     • Challenges: challenges faced as beginners working with OpenStack and a large production cloud.
     • Current project: concise details about the OpenStack upgrade project that we are working on.
  3. What is Cloud Computing? (BEGINNER'S VIEW OF OPENSTACK)
     WHAT IS CLOUD?
     ➢ Cloud computing is running and managing workloads within clouds.
     ➢ Clouds are environments that abstract, pool, and share scalable resources (memory, network, storage, etc.) across the internet.
     BEFORE CLOUD: Idea → Hardware requirement → Cost & time → Work on idea
     WHY CLOUD? Instant, cost efficient, scalable, reliable, secure.
  4. Layers of Cloud Computing (BEGINNER'S VIEW OF OPENSTACK)
     [Diagram: the IaaS, PaaS, and SaaS layers each stack applications, middleware, and servers, and serve IT admins, software developers, and end users respectively; analogous to a leased car, a taxi/Uber, and a bus.]
  5. OpenStack and its scale (BEGINNER'S VIEW OF OPENSTACK)
     WHY OPENSTACK? OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources, all managed and provisioned through APIs.
     What runs on OpenStack? Industries ranging from auto to energy.
  6. VERDA: LINE's PRIVATE CLOUD (LINE'S CLOUD & FIRST TASK)
     As of Jan 2023: 70,000+ physical servers, 46,000+ baremetal servers, 10,000+ hypervisors, 100,000+ virtual machines.
  7. VERDA: LINE's PRIVATE CLOUD (LINE'S CLOUD & FIRST TASK)
     VERDA exposes a minimum API set to LINE developers.
  8. First Task: Play with Personal Verda (LINE'S CLOUD & FIRST TASK)
     • Beginners → need an OpenStack playground.
     • Multiple options already exist in the market:
       ➢ Devstack:
         • Quick setup of an OpenStack environment
         • Abstracts details from the user
         • Not suitable for testing custom features
     • What we do at LINE (Personal Verda):
       • Use ansible to create personal OpenStack clusters (see the sketch after this list).
       • An automated script uses the dev OpenStack cluster to create smaller clusters:
         dev cluster hypervisors -> dev cluster VMs -> (configured) personal cluster hypervisors -> personal cluster VMs
     • Uses of Personal Verda:
       • Beginner's playground
       • Personal test environment before staging or dev/production
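     A minimal sketch of how such a personal-cluster build might be driven. The flavor, server names, inventory file, and playbook name below are hypothetical illustrations, not LINE's actual automation.

        # Hypothetical sketch: boot VMs on the dev cluster, then configure them
        # as a small personal OpenStack cluster with ansible. Names are made up.
        $ openstack server create --image "CentOS 7.9" --flavor m1.large personal-hv-01
        $ openstack server create --image "CentOS 7.9" --flavor m1.large personal-hv-02
        $ ansible-playbook -i personal_inventory.ini personal-verda.yml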
  9. 1. Features in nova (OUR WORK)
     A. user-script feature for PM (Physical Machine)
        - OpenStack nova provides a user-script feature for VMs:
          - nova-metadata accesses the script from the nova DB
          - cloud-init runs this script on VM bootup
        - We added the same feature for PMs in LINE's original nova-baremetal driver, using LINE's custom APIs.
        A > $ openstack server create --image "CentOS 7.9" --user-data userscript.sh testvm
     B. Schedule VM on a specified aggregate
        - Specify the target aggregate in the OpenStack VM create command.
        - Adds a new nova scheduling filter:
          - It filters the hosts in the specified aggregate.
          - It matches the aggregate key-value pair based on the scheduler hint.
        B > $ openstack server create --image "CentOS 7.9" --availability-zone nova --hint aggregate_id=23 testvm
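     For context, a minimal sketch of the two inputs the commands above assume: a cloud-init user-data script, and a host aggregate carrying a property the custom filter could match against the aggregate_id hint. The script contents and the aggregate/host/property names are illustrative assumptions; the real filter may match the aggregate ID directly.

        # Example user-data script that cloud-init runs on first boot
        $ cat userscript.sh
        #!/bin/bash
        echo "provisioned at $(date)" >> /var/log/first-boot.log

        # Create a host aggregate and tag it so the scheduling hint can match it
        $ openstack aggregate create --zone nova rack-a-aggregate
        $ openstack aggregate add host rack-a-aggregate compute-001
        $ openstack aggregate set --property aggregate_id=23 rack-a-aggregate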
 10. 2. Neutron Agent Containerization (OUR WORK)
     • Many OpenStack services run on the same hypervisor.
     • Run them as containerized processes (using docker or podman) to avoid package dependencies or clashes among them.
     • At LINE, we containerized our neutron agent (the l2isolate agent) using podman.
     • This lets us use different package versions independently.
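     A minimal sketch of what running an agent under podman could look like; the image name, container name, and mounts are assumptions for illustration, not the actual l2isolate deployment.

        # Hypothetical example: run a neutron agent in a podman container with
        # host networking so it can manage interfaces on the hypervisor.
        $ podman run -d --name l2isolate-agent \
            --net=host --privileged \
            -v /etc/neutron:/etc/neutron:ro \
            -v /var/log/neutron:/var/log/neutron \
            registry.example.com/neutron-l2isolate:latest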
 11. Keystone Authentication (CHALLENGES)
     • Challenge: multiple environments to work with
       - No. of test environments: 5
       - No. of production environments: 4
       - No. of development environments: 1
     • Multiple openrc (authentication URL) files to authenticate against the keystone regions of each environment.
     • Need to quickly source a different region's openrc file.
     Some tricks and solutions:
     • Some shell (bash/zsh) config setup and tricks can help (see the snippet below):
       • Change the prompt color and display the env/region name.
       • Add the line source ~/.bashrc to the end of each region's openrc file.
       • Enable shell autocomplete and autosuggestions by adding:
         bind 'set show-all-if-ambiguous on'
         bind 'TAB:menu-complete'
     • Alternatively, third-party tools can manage multiple openrc files and provide a user-friendly interface for fast context switching, for example rally.
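     As an illustration, a minimal ~/.bashrc fragment along those lines; the prompt format and color are assumptions about how the current region could be shown, relying on OS_REGION_NAME being exported by each openrc file.

        # Show the currently sourced region in a colored bash prompt.
        # Falls back to "no-region" if no openrc file has been sourced yet.
        export PS1='\[\e[1;33m\][${OS_REGION_NAME:-no-region}]\[\e[0m\] \u@\h:\w\$ '
        # Friendlier tab completion
        bind 'set show-all-if-ambiguous on'
        bind 'TAB:menu-complete'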
 12. How we Deploy? (CHALLENGES)
     • Huge number of compute nodes: ~10k hypervisors.
     • It takes a few hours to deploy to the compute nodes (for nova-compute or neutron-agent deployments).
     • Use smaller playbooks/handlers instead of running the entire task file in ansible.
       • For example, if you only need to restart the neutron agents, run only the playbook restart-agent.yml.
     • The hypervisors are divided into host subgroups in the ansible hosts file.
     • This enables us to deploy the same task in parallel to the hypervisor groups (see the example below).
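     A minimal sketch of that idea; the inventory path, group name, and fork count are hypothetical values, not LINE's actual setup.

        # Run only the small restart playbook, limited to one hypervisor subgroup,
        # with a higher fork count so hosts in the group are handled in parallel.
        $ ansible-playbook -i inventories/production/hosts restart-agent.yml \
            --limit hv_group_a --forks 50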
 13. ERROR/OUTAGE Handling (CHALLENGES)
     • Error: os server show <UUID> gives the error "failed to allocate network".
       • Get the tap-interface ID (tapXXXX) of the VM using the command:
         neutron port-list --device-id=<UUID>
         The first 10 characters of the port ID become the XXXX in tapXXXX above.
       • Check the tap device on the compute node: ip addr | grep <tapXXXX>
     • If the tap interface failed to be created on the compute node, run the following checks (see the sketch below):
       1) Check whether the neutron agent is running: systemctl status <neutron-agent>
       2) Check whether too many interfaces are already present; some hardware imposes an upper limit.
       3) Check whether the iproute, iptables, and ipset packages are installed: yum list | grep <package> (on RHEL-based OSes)
       4) Check the neutron agent logs and search for ERROR messages.
     • If nova-api fails to receive any vif attachment info from the neutron-server:
       • Check the messaging driver being used: are messages actually being moved from neutron-server to nova-api?
 14. ERROR/OUTAGE Handling (CHALLENGES)
     • Error/Issue: an IP is assigned but not pingable (see the sketch below).
       1) Check the status of the DHCP server (e.g. dnsmasq):
          - Check whether the dnsmasq/dhcp-agent process is running and is assigning the IP.
          - Use tcpdump to check the communication through the tap interface: tcpdump -ni <tap-id>
       2) Check whether the security group rules allow inbound traffic on the compute node:
          - If not, create a new SG rule for inbound traffic to the VMs using: openstack security group rule create --options <name>
     • Some useful tools and commands:
       - ip addr | grep <tap-device>
       - route -n
       - iptables -L
       - cat /etc/sysconfig/iptables
       - iptables -nL
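     For illustration, a hedged version of those two checks with concrete options filled in; the tap name and the security group name test-sg are hypothetical, and the exact rule you need depends on your traffic.

        # Watch DHCP traffic on the VM's tap device (ports 67/68)
        $ tcpdump -ni tapXXXXXXXXXX port 67 or port 68
        # Allow inbound ICMP (ping) on a security group, as one possible fix
        $ openstack security group rule create --ingress --protocol icmp test-sg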
 15. Upgrade OpenStack to Zed version (CURRENT PROJECT)
     • Current (old) version: reached EOL in 2020; compatible with Python 2.7.
     • Zed: released Oct 2022; maintained; compatible with Python 3.
     HOW TO ACHIEVE?
     ➢ Test setup of the Zed version for a comfortable switch.
     ➢ Install and set up an e2e multinode OpenStack environment with the Zed version.
 16. Upgrade OpenStack to Zed version: Controller Node & Compute Node (CURRENT PROJECT)
     • Start from the Zed upstream code (with our custom patches); all unit tests & functional tests pass.
     • Download the repos and install the services (see the sketch below).
     • Modify the configuration files (nova.conf, neutron.conf, glance.conf, placement.conf) for Zed compatibility.
     • Ensure e2e API correctness using the CLI.
     • Controller node services: nova-api, nova-conductor, nova-scheduler, nova-novncproxy, neutron-server, glance-api, placement-api.
     • Compute node services: nova-compute, neutron-agent.
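     A minimal, hedged sketch of the "download repos and install the services" step for one service; the virtualenv path is an assumption, and LINE's custom patches would be applied on top of the checkout.

        # Example for nova; the other services follow the same pattern.
        $ git clone -b stable/zed https://opendev.org/openstack/nova.git
        $ cd nova
        $ python3 -m venv /opt/openstack/venv && source /opt/openstack/venv/bin/activate
        $ pip install -e .
        $ tox -e py3        # run the unit tests before deploying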
 17. Challenges (CURRENT PROJECT: UPGRADE OPENSTACK)
     • Config parameters deprecated in many of the services' configs (see the sketch below):
       • rabbit_hosts, rpc_backend, allow_overlapping_ips, dnsmasq_dns_server, secure_proxy_ssl_header
     • Required networking package installations, such as:
       • dnsmasq-utils
       • ipset
       • conntrack
     • Linux kernel version (network compatibility):
       • Some kernel parameters, such as nf_hooks_lwtunnel, were only introduced in Linux kernel v5.15 and later.
       • The nf_conntrack module is needed on the compute nodes to enable network tunneling (libvirtd loads this module automatically).
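     One way these checks and fixes might look on a RHEL-family compute node; the transport_url value, the use of crudini, and the exact package names are illustrative assumptions, not the actual upgrade tooling.

        # Replace deprecated messaging options with transport_url (example value)
        $ crudini --set /etc/nova/nova.conf DEFAULT transport_url rabbit://user:pass@rabbit-host:5672/
        # Install the required networking packages (names may differ per distro)
        $ yum install -y dnsmasq-utils ipset conntrack-tools
        # Verify the kernel version and that nf_conntrack is available
        $ uname -r
        $ lsmod | grep nf_conntrack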