How We Integrate and Develop Private Cloud in LINE

Nowadays, more and more companies move their infrastructure to public clouds instead of managing it themselves. However, LINE has chosen to build and operate its own private cloud (Verda) for company use, and doing so is a totally different experience compared to using a public cloud directly. In this talk, we will share our experience of integrating and developing our own private cloud using open source software.
https://coscup.org/2020/en/agenda/PZPMCR


LINE Developers

August 01, 2020

Transcript

  1. How We Integrate and Develop Private Cloud in LINE Gene

    Kuo 2020.08.01 COSCUP
  2. About Me • Gene Kuo • Infrastructure Engineer @LINE •

    Platform Team • Co-Organizer @Cloud Native Taiwan User Group • OpenStack / Kubernetes 2
  3. Outline • Why LINE chose to build its own private

    cloud? • Basic Overview of Verda • The difficulties we faced integrating and operating open source projects in our private cloud • The difficulties we faced when developing our in-house features and components on top of OpenStack • Future Works • Demo — OpenStack Development Workflow 3
  4. WHY BUILD OUR OWN CLOUD? 4

  5. Why Build Our Own Cloud? • We already have a

    large infrastructure and teams managing it • Make use of what we have • Benefit from it continuously • Cost • More transparent, no hidden costs • Easier to predict and manage • Take full control • No vendor lock-in • Change in motivation 5
  6. Change in Motivation 6 LINE Employee Platform (Cloud) Team Ask

    for new feature Discuss for details of feature Implement the feature Release new feature Ask to use a new feature Open to use new feature Create rule to use new feature Public Cloud Emoji from https://openmoji.org/
  7. Change in Motivation 7 LINE Employee Platform (Cloud) Team Ask

    for new feature Discuss for details of feature Implement the feature Release new feature Ask to use a new feature Open to use new feature Create rule to use new feature Public Cloud Solving problems that other LINE department face Mapping company policy to public cloud e.g. security, cost, and etc. Emoji from https://openmoji.org/
  8. Change in Motivation • Work as a proxy to public

    cloud services —> Work as developers solving users' problems • Do feasibility checks of public cloud services against company policies —> Implement new features for users • Respond Yes/No to requests to use cloud services —> Provide value to end users and app developers 8
  9. BASIC OVERVIEW OF VERDA 9

  10. 10

  11. 11 IaaS PaaS FaaS High Level Architecture VM Identity Network

    Image DNS Block Storage Object Storage Bare metal LB Kubernetes Kafka Redis MySQL ElasticSearch Function as a Service
  12. About Verda's Scale 12 Data as of 7/13/2020

    Hypervisors: 2000+ • Virtual machines: 50 thousand+ • Physical servers: 20000~ • Baremetal used by services such as LINE TV
  13. Difficulties Integrating Open Source Projects 13

  14. Difficulties Integrating Open Source • Upstream version does not 100%

    fit our needs • Features • Policies • In-house component integration • Upgrades / Operation • Complexity • Development • Debugging • Performance issues at large scale 14
  15. Example 15 Issue Solution While enabling PCI devices such as

    GPUs on top of OpenStack, we need to manage them with quotas, which is not supported upstream Develop a patch to OpenStack's quota-managing logic that adds PCI devices as one of the quota resources Upstream Version Does Not 100% Fit Our Needs
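A minimal sketch of that idea, with hypothetical names (Nova's real quota code is far more involved): treat each PCI device class as a countable per-project quota resource and reject reservations that would exceed the limit.

```python
class QuotaExceeded(Exception):
    """Raised when a reservation would exceed a project's quota."""


class PciQuota:
    def __init__(self, limits):
        # limits: per-project caps per PCI resource, e.g. {"gpu": 4}
        self.limits = limits
        self.usage = {}  # (project, resource) -> count in use

    def reserve(self, project, resource, count):
        used = self.usage.get((project, resource), 0)
        limit = self.limits.get(resource, 0)
        if used + count > limit:
            raise QuotaExceeded(f"{resource} quota exceeded for {project}")
        self.usage[(project, resource)] = used + count
```

A scheduler or API layer would call `reserve` before attaching the device, rolling back on failure.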
  16. Example 16 Complexity Issue Solution Most Infrastructure related open source

    projects are complicated, and when we integrate them together the whole system becomes extremely complicated 1. Weekly technical sharing on the issues / projects that each team member works on, improving members' understanding of the software. 2. Wiki pages documenting detailed debugging processes for when outages / issues happen in our environment.
  17. Example 17 Performance Issue at Large Scale Issue Solution OpenStack

    upstream filter scheduler will time out when scheduling 100+ instances at a scale of more than 1000+ hypervisors Short term: modify the scheduler to cache weigher results when scheduling multiple instances Long term: develop our own scheduler and integrate it with OpenStack
  18. Solution 18 Performance Issue at Large Scale

    hosts = get_all_hosts()
    result = []
    for n in range(num_of_requested_instances):
        # Filtering: keep only hosts that pass every filter
        hosts = [h for h in hosts if host_pass_filters(h)]
        # Weighing: every weigher scores every remaining host
        for w in weighers:
            for h in hosts:
                h.weight += w.weight(h)
        # Sort by weight, then pick one of the top hosts at random
        weighed_hosts = sorted(hosts, key=lambda h: h.weight, reverse=True)
        chosen_host = random.choice(weighed_hosts[0:scheduler_host_subset_size])
        result.append(chosen_host)
    return result
  19. Solution 19 Performance Issue at Large Scale

    hosts = get_all_hosts()
    result = []
    # Pre-filter and pre-weigh all hosts once
    pre_filtered_hosts = get_filtered_hosts(hosts)
    pre_weighed_hosts = get_weighed_hosts(pre_filtered_hosts)
    # Store in a priority queue, highest weight first
    weighed_host_pq = PriorityQueue()
    for weighed_host in pre_weighed_hosts:
        weighed_host_pq.put((-1 * weighed_host.weight, weighed_host))
    for n in range(num_of_requested_instances):
        filtered_hosts = []
        # Re-run filters only until enough candidates are collected
        while not weighed_host_pq.empty():
            most_weighed_host = weighed_host_pq.get()[1]
            if host_pass_filters(most_weighed_host):
                filtered_hosts.append(most_weighed_host)
            if len(filtered_hosts) == scheduler_host_subset_size:
                break
        chosen_host = random.choice(filtered_hosts)
        result.append(chosen_host)
    return result
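The pre-weigh-then-filter design above can be sketched as runnable code with `heapq`; the host structure and the free-RAM filter here are illustrative stand-ins, not Nova's internals.

```python
import heapq
import random


def schedule(hosts, num_instances, subset_size=3):
    # Weigh every host once (here: more free RAM = higher weight) and
    # build a max-priority queue (negated weight; index breaks ties).
    pq = [(-h["free_ram"], i, h) for i, h in enumerate(hosts)]
    heapq.heapify(pq)
    result = []
    for _ in range(num_instances):
        candidates, popped = [], []
        # Pop best-weighed hosts, re-checking the filter, until we have
        # subset_size candidates or the queue is exhausted.
        while pq and len(candidates) < subset_size:
            entry = heapq.heappop(pq)
            popped.append(entry)
            if entry[2]["free_ram"] >= 1:  # filter: enough capacity left
                candidates.append(entry[2])
        chosen = random.choice(candidates)
        chosen["free_ram"] -= 1  # consume capacity on the chosen host
        # Push popped hosts back with their updated weights.
        for _, i, h in popped:
            heapq.heappush(pq, (-h["free_ram"], i, h))
        result.append(chosen["name"])
    return result
```

Unlike the slide's pseudocode, this sketch re-inserts popped hosts with updated weights so that placing one instance changes the ranking for the next.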
  20. Solution 20 Performance Issue at Large Scale Hypervisor count: 800

  21. Solutions • Upstream version does not 100% fit our needs

    • A development team to build the necessary features and integrate with in-house components • Upgrades / Operation • An operation / SRE team to develop deployment / monitoring platforms • Complexity • Weekly technical sharing • Performance issues at large scale 21
  22. Difficulties Developing on Top of OpenStack 22

  23. Difficulties Developing OpenStack • Keeping our system aligned with “The

    OpenStack Way” • Managing custom changes alongside upstream code • Not every part of our changes can be implemented as a plugin • We have to manage patches to open source projects 23
  24. Keeping “The OpenStack Way” • Business logic integration between OpenStack

    and legacy systems • Project structure mapping to LINE's company organization structure • Some concepts in OpenStack don't exist in the legacy systems • e.g. service accounts • Integration between OpenStack and ACLs • In-house components need to follow OpenStack features 24
  25. Keeping “The OpenStack Way” 25 Integration among OpenStack and ACL

    Issue Solution OpenStack assigns a new IP to a VM on each creation, but DB ACLs assume the client IP is fixed A Static IP NodePool feature in VKS ensures that newly spawned nodes replacing a malfunctioning node have the same IP address as the node being replaced
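The mechanism can be illustrated with a toy node pool; the class and method names below are made up for illustration and are not VKS's actual API.

```python
class StaticIPNodePool:
    """Pool of nodes with fixed IPs; replacements inherit the old node's IP."""

    def __init__(self, ips):
        self.free_ips = list(ips)
        self.assigned = {}  # node name -> ip

    def add_node(self, name, replaces=None):
        if replaces is not None:
            # Replacement node inherits the failed node's fixed IP,
            # so DB ACLs keyed on client IP keep working.
            ip = self.assigned.pop(replaces)
        else:
            ip = self.free_ips.pop(0)
        self.assigned[name] = ip
        return ip
```

Because the replacement node reuses the exact address, no ACL update is needed when a node is swapped out.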
  26. Manage Custom Changes • Easy to add/take out custom logic

    to/from the OpenStack repository • Patches should be applicable against upstream OpenStack • Easy to differentiate whether custom logic is for our own business, an OpenStack general issue, or a cherry-pick from upstream • Make customizations independent from OpenStack • Implemented as a plugin or just cherry-picked (ideal) • Nothing is better than less customization • Do our best to push custom logic upstream 26 What We Want
  27. Manage Custom Changes 27 What We Come Up With Requirement

    Solution 1. Easy to add/take out custom logic to/from the OpenStack repository 2. Patches should be applicable against upstream OpenStack A repo containing only patches. All patches are based on upstream OpenStack and are applied when building packages / containers
  28. Manage Custom Changes 28 What We Come Up With Requirement

    Solution Easy to differentiate whether custom logic is for our own business, an OpenStack general issue, or a cherry-pick from upstream Separate the different types of patches (own logic, general issue, cherry-pick) into folders with corresponding names If possible, implement own business logic as a plugin
  29. OpenStack Development Workflow 29 Clone repo Build Base Codebase Make

    changes Generate diff (patch) Add new patch Make review Create PR Remove Unused Patches Deploy to test environments Deploy to production Release QA check
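The "Generate diff (patch)" step of this workflow can be approximated with Python's standard `difflib`; the file path and contents below are invented, borrowing the `dummy.py` name from the repo layout.

```python
import difflib

# Diff an edited copy of a file against its upstream base to produce a
# unified .patch; paths mirror the dummy.py example, content is invented.
upstream = ["def dummy():\n", "    return 'upstream'\n"]
custom = ["def dummy():\n", "    return 'custom logic'\n"]

patch = "".join(difflib.unified_diff(
    upstream, custom,
    fromfile="a/nova/dummy.py", tofile="b/nova/dummy.py"))
print(patch)
```

In practice `git diff` produces the same unified format, which is what tools like `git apply` consume when the patch is replayed onto upstream.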
  30. OpenStack Development Repo 30

    .
    ├── base-commit
    ├── custom-source
    │   └── nova
    │       └── dummy.py
    ├── dockerfiles
    ├── patches
    │   └── nova
    │       ├── cherry-pick
    │       └── customs
    │           └── dummy.patch
    ├── review
    │   └── nova
    └── scripts
        └── apply_patch.sh
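A guess at what a script like `scripts/apply_patch.sh` boils down to, sketched in Python: collect the patch files for a project by category folder and replay them onto the upstream checkout. The category ordering (cherry-picks before custom logic) is an assumption, not LINE's documented tooling.

```python
import subprocess
from pathlib import Path


def collect_patches(repo_root, project):
    """Gather patches for one project, cherry-picks before custom logic."""
    base = Path(repo_root) / "patches" / project
    ordered = []
    for category in ("cherry-pick", "customs"):  # assumed ordering
        folder = base / category
        if folder.is_dir():
            ordered.extend(sorted(folder.glob("*.patch")))
    return ordered


def apply_patches(src_dir, patches):
    # Apply each patch on top of the upstream checkout at src_dir.
    for p in patches:
        subprocess.run(["git", "apply", str(p)], cwd=src_dir, check=True)
```

Keeping only patches in the repo (rather than a fork) makes it easy to rebase onto a new upstream release: re-run the apply step and fix whichever patches no longer apply.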
  31. Future Works 31

  32. Future Works • Upgrade OpenStack version • Provide more scalable

    IaaS services • Integrate our in-house scheduler with OpenStack • Disaster recovery tests • Scalability tests / benchmarks • Upstream contribution • oslo.metrics • Large Scale SIG 32
  33. DEMO OpenStack Development Workflow 33

  34. OpenStack Development Workflow 34 Clone repo Build Base Codebase Make

    changes Generate diff (patch) Add new patch Make review Create PR Remove Unused Patches Deploy to test environments Deploy to production Release QA check
  35. Q&A For more info about our team 35