Slide 1

How We Integrate and Develop Private Cloud in LINE
Gene Kuo · 2020.08.01 · COSCUP

Slide 2

About Me
• Gene Kuo
• Infrastructure Engineer @LINE, Platform Team
• Co-Organizer @Cloud Native Taiwan User Group
• OpenStack / Kubernetes

Slide 3

Outline
• Why did LINE choose to build its own private cloud?
• Basic overview of Verda
• Difficulties we faced integrating and operating open source projects in our private cloud
• Difficulties we faced developing our in-house features and components on top of OpenStack
• Future work
• Demo: OpenStack development workflow

Slide 4

WHY BUILD OUR OWN CLOUD?

Slide 5

Why Build Our Own Cloud?
• We already have large infrastructure and teams managing it
  • Make use of what we have
  • Benefit from it continuously
• Cost
  • More transparent, no hidden costs
  • Easier to predict and manage
• Take full control
  • No vendor lock-in
• Change in motivation

Slide 6

Change in Motivation
Diagram: a LINE employee asks the Platform (Cloud) Team for a new feature; the team discusses the details of the feature, implements it, and releases it. Alternatively, the employee asks to use a new public cloud feature; the team creates a rule for using it and opens it for use.
(Emoji from https://openmoji.org/)

Slide 7

Change in Motivation
Diagram (annotated): on the in-house side, the Platform (Cloud) Team solves problems that other LINE departments face; on the public cloud side, it maps company policy (e.g. security, cost) to the public cloud.
(Emoji from https://openmoji.org/)

Slide 8

Change in Motivation
• Work as a proxy to public cloud services → work as developers solving users' problems
• Do feasibility checks of public cloud services against company policies → implement new features for users
• Respond yes/no to requests to use cloud services → provide value to end users and app developers

Slide 9

BASIC OVERVIEW OF VERDA

Slide 10

Slide 11

High Level Architecture
• IaaS: VM, Identity, Network, Image, DNS, Block Storage, Object Storage, Bare Metal, LB
• PaaS: Kubernetes, Kafka, Redis, MySQL, Elasticsearch
• FaaS: Function as a Service

Slide 12

Verda: About Scale (data as of 7/13/2020)
• Hypervisors: 2000+
• Virtual machines: 50+ thousand
• Baremetal physical servers: 20000~
• Example workload: LINE TV

Slide 13

Difficulties Integrating Open Source Projects

Slide 14

Difficulties Integrating Open Source
• Upstream version does not 100% fit our needs
  • Features
  • Policies
  • In-house component integration
• Upgrades / operation
• Complexity
  • Development
  • Debugging
• Performance issues at large scale

Slide 15

Example: Upstream Version Does Not 100% Fit Our Needs
• Issue: when enabling PCI devices such as GPUs on top of OpenStack, we need to manage them with quotas, which upstream does not support.
• Solution: develop a patch to OpenStack's quota management logic that adds PCI devices as one of the quota resources.
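The quota patch itself is in-house, so the sketch below only illustrates the idea of treating PCI devices as one more quota resource. All names here (`quotas`, `check_quota`, `QuotaExceeded`) are hypothetical stand-ins, not Nova's actual quota API:

```python
# Hypothetical sketch: treat PCI devices (e.g. GPUs) as a quota
# resource alongside instances and cores. Not Nova's real quota code.

class QuotaExceeded(Exception):
    pass

# Per-project limits, extended with a "pci_devices" resource.
quotas = {"instances": 10, "cores": 40, "pci_devices": 4}

def check_quota(usage, requested):
    """Raise QuotaExceeded if any requested resource would exceed its limit."""
    for resource, amount in requested.items():
        limit = quotas.get(resource)
        used = usage.get(resource, 0)
        if limit is not None and used + amount > limit:
            raise QuotaExceeded(f"{resource}: {used} + {amount} > {limit}")

# A request for a GPU instance now counts against the PCI quota too.
usage = {"instances": 3, "cores": 12, "pci_devices": 3}
check_quota(usage, {"instances": 1, "cores": 4, "pci_devices": 1})  # within quota
```

The point of the patch is only the extra key in the limits table: once PCI devices are tracked like any other resource, the existing check logic applies unchanged.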

Slide 16

Example: Complexity
• Issue: most infrastructure-related open source projects are complicated, and when we integrate them together the whole system becomes extremely complicated.
• Solutions:
  1. Weekly technical sharing on the issues/projects each team member works on, improving members' understanding of the software.
  2. Wiki pages documenting the detailed debugging process to follow when an outage or issue happens in our environment.

Slide 17

Example: Performance Issue at Large Scale
• Issue: the upstream OpenStack filter scheduler times out when scheduling 100+ instances at a scale of more than 1000 hypervisors.
• Short-term solution: modify the scheduler to cache weigher results when scheduling multiple instances.
• Long-term solution: develop our own scheduler and integrate it with OpenStack.

Slide 18

Solution: Performance Issue at Large Scale (upstream scheduling loop, simplified)

hosts = get_all_hosts()
result = []
for _ in range(num_of_requested_instances):
    # Filtering: keep only the hosts that pass every filter
    filtered_hosts = [h for h in hosts if host_passes_filters(h)]

    # Weighting: every weigher scores every filtered host, every iteration
    for h in filtered_hosts:
        h.weight = sum(w.weight(h) for w in weighers)

    # Sorting: best-weighted hosts first, then pick randomly from the top subset
    weighted_hosts = sorted(filtered_hosts, key=lambda h: h.weight, reverse=True)
    chosen_host = random.choice(weighted_hosts[0:scheduler_host_subset_size])
    result.append(chosen_host)

    # The next instance is scheduled against the remaining filtered hosts
    hosts = filtered_hosts
return result

Slide 19

Solution: Performance Issue at Large Scale (cached weighing with a priority queue)

hosts = get_all_hosts()
result = []

# Pre-filter and pre-weigh all hosts once
pre_filtered_hosts = get_filtered_hosts(hosts)
pre_weighed_hosts = get_weighed_hosts(pre_filtered_hosts)

# Store in a priority queue, heaviest weight first
weighed_host_pq = PriorityQueue()
for weighed_host in pre_weighed_hosts:
    weighed_host_pq.put((-1 * weighed_host.weight, weighed_host))

for _ in range(num_of_requested_instances):
    filtered_hosts = []
    # Filtering: pop the best-weighted hosts until the subset is full
    while not weighed_host_pq.empty():
        most_weighed_host = weighed_host_pq.get()[1]
        if host_passes_filters(most_weighed_host):
            filtered_hosts.append(most_weighed_host)
        if len(filtered_hosts) == scheduler_host_subset_size:
            break
    chosen_host = random.choice(filtered_hosts)
    result.append(chosen_host)
return result
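The cached-weight idea can be exercised end to end in a toy setting. Everything below is a stand-in for illustration only: `Host`, the free-RAM weigher, and the capacity filter are invented here, and re-pushing touched hosts after each pick is a simplification, not LINE's actual scheduler:

```python
import heapq
import random

class Host:
    def __init__(self, name, free_ram):
        self.name = name
        self.free_ram = free_ram
        self.weight = 0.0

def schedule(hosts, num_instances, ram_per_instance, subset_size=3):
    """Pick one host per requested instance, best-weighted first."""
    # Pre-weigh every host once; here weight is simply free RAM.
    pq = []
    for h in hosts:
        h.weight = h.free_ram
        heapq.heappush(pq, (-h.weight, h.name, h))  # max-heap via negation

    result = []
    for _ in range(num_instances):
        candidates = []
        # Pop best-weighted hosts until the subset is full; hosts that
        # fail the filter are dropped for good (toy simplification).
        while pq and len(candidates) < subset_size:
            _, _, h = heapq.heappop(pq)
            if h.free_ram >= ram_per_instance:  # stand-in filter
                candidates.append(h)
        if not candidates:
            raise RuntimeError("no host can fit the instance")
        chosen = random.choice(candidates)
        chosen.free_ram -= ram_per_instance
        result.append(chosen)
        # Re-weigh the touched hosts and push them back for the next round.
        for h in candidates:
            h.weight = h.free_ram
            heapq.heappush(pq, (-h.weight, h.name, h))
    return result
```

With `subset_size=1` the choice is deterministic: the heaviest host wins every round until its free RAM falls behind the next candidate's.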

Slide 20

Solution: Performance Issue at Large Scale
(Benchmark chart; hypervisor count: 800)

Slide 21

Solutions
• Upstream version does not 100% fit our needs
  • A development team develops the necessary features and integrates them with in-house components
• Upgrades / operation
  • An operation / SRE team develops the deployment / monitoring platform
• Complexity
  • Weekly technical sharing
• Performance issues at large scale

Slide 22

Difficulties Developing on Top of OpenStack

Slide 23

Difficulties Developing on OpenStack
• Keeping our system aligned with "The OpenStack Way"
• Managing custom changes alongside upstream code
  • Not every part of our changes can be implemented as a plugin
  • We have to manage patches to open source projects

Slide 24

Keeping "The OpenStack Way"
• Business logic integration between OpenStack and legacy systems
  • Project structure mapping to LINE's company organization structure
  • Some concepts in OpenStack don't exist in the legacy system, e.g. service accounts
• Integration between OpenStack and ACLs
• In-house components need to follow OpenStack features

Slide 25

Keeping "The OpenStack Way": Integration Between OpenStack and ACLs
• Issue: OpenStack assigns a VM a new IP on every creation, but the DB ACL assumes client IPs are fixed.
• Solution: a Static IP NodePool feature in VKS ensures that a newly spawned node replacing a malfunctioning node gets the same IP address as the node it replaces.
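VKS is in-house, so the sketch below only illustrates the contract of a static IP node pool: IPs belong to the pool, and a replacement node inherits the broken node's IP, keeping IP-based DB ACL entries valid. The class and method names are hypothetical, not real VKS code:

```python
# Hypothetical sketch of a static IP node pool. IPs are reserved for
# the pool's lifetime, independent of any single node, so a
# replacement node can reuse the broken node's address.

class Node:
    def __init__(self, name, ip):
        self.name = name
        self.ip = ip

class StaticIPNodePool:
    def __init__(self, reserved_ips):
        self.reserved_ips = list(reserved_ips)
        self.nodes = [Node(f"node-{i}", ip)
                      for i, ip in enumerate(self.reserved_ips)]

    def replace(self, broken):
        """Spawn a replacement node bound to the broken node's IP."""
        replacement = Node(broken.name + "-replacement", broken.ip)
        self.nodes = [replacement if n is broken else n for n in self.nodes]
        return replacement
```

The design choice being illustrated: because the ACL keys on IPs, the pool, not the node, owns the addresses; node identity can churn freely as long as the address set is stable.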

Slide 26

Manage Custom Changes: What We Want
• Easy to add custom logic to, and take it out of, the OpenStack repository
  • Patches should be applicable against upstream OpenStack
• Easy to differentiate whether custom logic is for our own business, fixes a general OpenStack issue, or is a cherry-pick of upstream code
• Make customizations independent from OpenStack
  • Ideally implemented as a plugin, or just cherry-picked
• The less customization, the better
  • Do our best to upstream our custom logic

Slide 27

Manage Custom Changes: What We Came Up With
• Requirements: (1) easy to add/remove custom logic to/from the OpenStack repository; (2) patches should be applicable against upstream OpenStack.
• Solution: a repo containing only patches. All patches are based on upstream OpenStack and are applied when building packages/containers.

Slide 28

Manage Custom Changes: What We Came Up With
• Requirement: easy to differentiate whether custom logic is for our own business, fixes a general OpenStack issue, or is a cherry-pick of upstream code.
• Solution: separate the different types of patches (own logic, general issue, cherry-pick) into folders with corresponding names. If possible, implement our own business logic as a plugin.

Slide 29

OpenStack Development Workflow
Clone repo → Build base codebase → Make changes → Generate diff (patch) → Add new patch → Make review → Create PR → Remove unused patches → Deploy to test environments → QA check → Deploy to production → Release

Slide 30

OpenStack Development Repo

.
├── base-commit
├── custom-source
│   └── nova
│       └── dummy.py
├── dockerfiles
├── patches
│   └── nova
│       ├── cherry-pick
│       └── customs
│           └── dummy.patch
├── review
│   └── nova
└── scripts
    └── apply_patch.sh
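A build step over a layout like this might enumerate and apply patches roughly as follows. The folder names (`cherry-pick`, `customs`) come from the tree above, but the functions, the ordering, and the use of `git apply` are assumptions for illustration, not the contents of `scripts/apply_patch.sh`:

```python
# Illustrative sketch of applying a patches-only repo on top of an
# upstream checkout. Folder names follow the repo tree; everything
# else is a hypothetical stand-in for the real apply script.
import subprocess
from pathlib import Path

def ordered_patches(patches_dir):
    """List patch files in application order: cherry-picks, then customs."""
    ordered = []
    for category in ("cherry-pick", "customs"):
        ordered.extend(sorted(Path(patches_dir, category).glob("*.patch")))
    return ordered

def apply_patches(repo_dir, patches_dir):
    """Apply each patch on top of the upstream checkout in repo_dir."""
    for patch in ordered_patches(patches_dir):
        subprocess.run(["git", "apply", str(patch)], cwd=repo_dir, check=True)
```

Keeping each patch type in its own folder makes the application order explicit and lets unused patches be removed by deleting a single file.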

Slide 31

Future Work

Slide 32

Future Work
• Upgrade the OpenStack version
• Provide a more scalable IaaS service
  • Integrate our in-house scheduler into OpenStack
• Disaster recovery tests
• Scalability tests / benchmarks
• Upstream contribution
  • oslo.metrics
  • Large Scale SIG

Slide 33

DEMO: OpenStack Development Workflow

Slide 34

OpenStack Development Workflow
Clone repo → Build base codebase → Make changes → Generate diff (patch) → Add new patch → Make review → Create PR → Remove unused patches → Deploy to test environments → QA check → Deploy to production → Release

Slide 35

Q&A
For more info about our team