How We Integrate and Develop Private Cloud in LINE

Nowadays, more and more companies move their infrastructure to public clouds instead of managing it themselves. However, LINE has chosen to build and operate its own private cloud (Verda) for company use, and doing so is a totally different experience compared to using a public cloud directly. In this talk, we will share our experience of integrating and developing our own private cloud using open source software.
https://coscup.org/2020/en/agenda/PZPMCR


LINE Developers

August 01, 2020

Transcript

  1. How We Integrate and Develop Private Cloud in LINE Gene

    Kuo 2020.08.01 COSCUP
  2. About Me • Gene Kuo • Infrastructure Engineer @LINE •

    Platform Team • Co-Organizer @Cloud Native Taiwan User Group • OpenStack / Kubernetes 2
  3. Outline • Why LINE chose to build its own private

    cloud? • Basic Overview of Verda • The difficulties we faced integrating and operating open source projects in our private cloud • The difficulties we faced when developing our in-house features and components on top of OpenStack • Future Works • Demo — OpenStack Development Workflow 3
  4. WHY BUILD OUR OWN CLOUD? 4

  5. Why Build Our Own Cloud? • We already have a

    large infrastructure and teams managing it • Make use of what we have • Benefit from it continuously • Cost • More transparent, no hidden costs • Easier to predict and manage • Take full control • No vendor lock-in • Change in motivation 5
  6. Change in Motivation 6 LINE Employee Platform (Cloud) Team Ask

    for new feature Discuss for details of feature Implement the feature Release new feature Ask to use a new feature Open to use new feature Create rule to use new feature Public Cloud Emoji from https://openmoji.org/
  7. Change in Motivation 7 LINE Employee Platform (Cloud) Team Ask

    for new feature Discuss for details of feature Implement the feature Release new feature Ask to use a new feature Open to use new feature Create rule to use new feature Public Cloud Solving problems that other LINE department face Mapping company policy to public cloud e.g. security, cost, and etc. Emoji from https://openmoji.org/
  8. Change in Motivation • Work as a proxy to public

    cloud services —> Work as developers solving users' problems • Do feasibility checks of public cloud services against company policies —> Implement new features for users • Respond Yes/No to requests to use cloud services —> Provide value to end users and app developers 8
  9. BASIC OVERVIEW OF VERDA 9

  10. 10

  11. 11 IaaS PaaS FaaS High Level Architecture VM Identity Network

    Image DNS Block Storage Object Storage Bare metal LB Kubernetes Kafka Redis MySQL ElasticSearch Function as a Service
  12. About Verda's Scale 12 Data as of 7/13/2020

    Hypervisors: 2000+ • Virtual machines: 50 thousand+ • Physical servers: 20000~ • Baremetal used by services such as LINE TV
  13. Difficulties Integrating Open Source Projects 13

  14. Difficulties Integrating Open Source • Upstream version does not 100%

    fit our needs • Features • Policies • In-house component integration • Upgrades / Operation • Complexity • Development • Debugging • Performance issues at large scale 14
  15. Example 15 Issue Solution While enabling PCI devices such as

    GPUs on top of OpenStack, we need to manage them with quotas, which is not supported upstream Develop a patch to OpenStack's quota-managing logic that adds PCI devices as one of the quota resources Upstream Version Does Not 100% Fit Our Needs
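A minimal sketch of that idea, with hypothetical names (Nova's real quota code is far more involved): treat each PCI device class as a countable per-project quota resource and reject reservations that would exceed the limit.

```python
class QuotaExceeded(Exception):
    """Raised when a reservation would exceed a project's quota."""


class PciQuota:
    def __init__(self, limits):
        # limits: per-project caps per PCI resource, e.g. {"gpu": 4}
        self.limits = limits
        self.usage = {}  # (project, resource) -> count in use

    def reserve(self, project, resource, count):
        used = self.usage.get((project, resource), 0)
        limit = self.limits.get(resource, 0)
        if used + count > limit:
            raise QuotaExceeded(f"{resource} quota exceeded for {project}")
        self.usage[(project, resource)] = used + count
```

A scheduler or API layer would call `reserve` before attaching the device, rolling back on failure.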
  16. Example 16 Complexity Issue Solution Most Infrastructure related open source

    projects are complicated, and when we integrate them together the whole system becomes extremely complicated 1. Weekly technical sharing on the issues / projects that each team member works on, improving members' understanding of the software. 2. Wiki pages documenting detailed debugging processes for when outages / issues happen in our environment.
  17. Example 17 Performance Issue at Large Scale Issue Solution OpenStack

    upstream filter scheduler will time out when scheduling 100+ instances at a scale of more than 1000+ hypervisors Short term: modify the scheduler to cache weigher results when scheduling multiple instances Long term: develop our own scheduler and integrate it with OpenStack
  18. Solution 18 Performance Issue at Large Scale

    hosts = get_all_hosts()
    result = []
    for n in range(num_of_requested_instances):
        # Filtering: keep only hosts that pass every filter
        hosts = [h for h in hosts if host_pass_filters(h)]
        # Weighing: every weigher scores every remaining host
        for w in weighers:
            for h in hosts:
                h.weight += w.weight(h)
        # Sort by weight, then pick one of the top hosts at random
        weighed_hosts = sorted(hosts, key=lambda h: h.weight, reverse=True)
        chosen_host = random.choice(weighed_hosts[0:scheduler_host_subset_size])
        result.append(chosen_host)
    return result
  19. Solution 19 Performance Issue at Large Scale

    hosts = get_all_hosts()
    result = []
    # Pre-filter and pre-weigh all hosts once
    pre_filtered_hosts = get_filtered_hosts(hosts)
    pre_weighed_hosts = get_weighed_hosts(pre_filtered_hosts)
    # Store in a priority queue, highest weight first
    weighed_host_pq = PriorityQueue()
    for weighed_host in pre_weighed_hosts:
        weighed_host_pq.put((-1 * weighed_host.weight, weighed_host))
    for n in range(num_of_requested_instances):
        filtered_hosts = []
        # Re-run filters only until enough candidates are collected
        while not weighed_host_pq.empty():
            most_weighed_host = weighed_host_pq.get()[1]
            if host_pass_filters(most_weighed_host):
                filtered_hosts.append(most_weighed_host)
            if len(filtered_hosts) == scheduler_host_subset_size:
                break
        chosen_host = random.choice(filtered_hosts)
        result.append(chosen_host)
    return result
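The pre-weigh-then-filter design above can be sketched as runnable code with `heapq`; the host structure and the free-RAM filter here are illustrative stand-ins, not Nova's internals.

```python
import heapq
import random


def schedule(hosts, num_instances, subset_size=3):
    # Weigh every host once (here: more free RAM = higher weight) and
    # build a max-priority queue (negated weight; index breaks ties).
    pq = [(-h["free_ram"], i, h) for i, h in enumerate(hosts)]
    heapq.heapify(pq)
    result = []
    for _ in range(num_instances):
        candidates, popped = [], []
        # Pop best-weighed hosts, re-checking the filter, until we have
        # subset_size candidates or the queue is exhausted.
        while pq and len(candidates) < subset_size:
            entry = heapq.heappop(pq)
            popped.append(entry)
            if entry[2]["free_ram"] >= 1:  # filter: enough capacity left
                candidates.append(entry[2])
        chosen = random.choice(candidates)
        chosen["free_ram"] -= 1  # consume capacity on the chosen host
        # Push popped hosts back with their updated weights.
        for _, i, h in popped:
            heapq.heappush(pq, (-h["free_ram"], i, h))
        result.append(chosen["name"])
    return result
```

Unlike the slide's pseudocode, this sketch re-inserts popped hosts with updated weights so that placing one instance changes the ranking for the next.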
  20. Solution 20 Performance Issue at Large Scale Hypervisor count: 800

  21. Solutions • Upstream version does not 100% fit our needs

    • A development team to build the necessary features and integrate with in-house components • Upgrades / Operation • An operation / SRE team to develop deployment / monitoring platforms • Complexity • Weekly technical sharing • Performance issues at large scale 21
  22. Difficulties Developing on Top of OpenStack 22

  23. Difficulties Developing OpenStack • Keeping our system aligned with “The

    OpenStack Way” • Managing custom changes alongside upstream code • Not every part of our changes can be implemented as a plugin • We have to manage patches to open source projects 23
  24. Keeping “The OpenStack Way” • Business logic integration between OpenStack

    and legacy systems • Project structure mapping to LINE's company organization structure • Some concepts in OpenStack don't exist in the legacy systems • e.g. service accounts • Integration between OpenStack and ACLs • In-house components need to follow OpenStack features 24
  25. Keeping “The OpenStack Way” 25 Integration among OpenStack and ACL

    Issue Solution OpenStack assigns a new IP to a VM on each creation, but DB ACLs assume the client IP is fixed A Static IP NodePool feature in VKS ensures that newly spawned nodes replacing a malfunctioning node have the same IP address as the node being replaced
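The mechanism can be illustrated with a toy node pool; the class and method names below are made up for illustration and are not VKS's actual API.

```python
class StaticIPNodePool:
    """Pool of nodes with fixed IPs; replacements inherit the old node's IP."""

    def __init__(self, ips):
        self.free_ips = list(ips)
        self.assigned = {}  # node name -> ip

    def add_node(self, name, replaces=None):
        if replaces is not None:
            # Replacement node inherits the failed node's fixed IP,
            # so DB ACLs keyed on client IP keep working.
            ip = self.assigned.pop(replaces)
        else:
            ip = self.free_ips.pop(0)
        self.assigned[name] = ip
        return ip
```

Because the replacement node reuses the exact address, no ACL update is needed when a node is swapped out.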
  26. Manage Custom Changes • Easy to add/take out custom logic

    to/from the OpenStack repository • Patches should be applicable against upstream OpenStack • Easy to differentiate whether custom logic is for our own business, an OpenStack general issue, or a cherry-pick from upstream • Make customizations independent from OpenStack • Implemented as a plugin or just cherry-picked (ideal) • Nothing is better than less customization • Do our best to push custom logic upstream 26 What We Want
  27. Manage Custom Changes 27 What We Come Up With Requirement

    Solution 1. Easy to add/take out custom logic to/from the OpenStack repository 2. Patches should be applicable against upstream OpenStack A repo containing only patches. All patches are based on upstream OpenStack and are applied when building packages / containers
  28. Manage Custom Changes 28 What We Come Up With Requirement

    Solution Easy to differentiate whether custom logic is for our own business, an OpenStack general issue, or a cherry-pick from upstream Separate the different types of patches (own logic, general issue, cherry-pick) into folders with corresponding names If possible, implement own business logic as a plugin
  29. OpenStack Development Workflow 29 Clone repo Build Base Codebase Make

    changes Generate diff (patch) Add new patch Make review Create PR Remove Unused Patches Deploy to test environments Deploy to production Release QA check
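The "Generate diff (patch)" step of this workflow can be approximated with Python's standard `difflib`; the file path and contents below are invented, borrowing the `dummy.py` name from the repo layout.

```python
import difflib

# Diff an edited copy of a file against its upstream base to produce a
# unified .patch; paths mirror the dummy.py example, content is invented.
upstream = ["def dummy():\n", "    return 'upstream'\n"]
custom = ["def dummy():\n", "    return 'custom logic'\n"]

patch = "".join(difflib.unified_diff(
    upstream, custom,
    fromfile="a/nova/dummy.py", tofile="b/nova/dummy.py"))
print(patch)
```

In practice `git diff` produces the same unified format, which is what tools like `git apply` consume when the patch is replayed onto upstream.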
  30. OpenStack Development Repo 30

    .
    ├── base-commit
    ├── custom-source
    │   └── nova
    │       └── dummy.py
    ├── dockerfiles
    ├── patches
    │   └── nova
    │       ├── cherry-pick
    │       └── customs
    │           └── dummy.patch
    ├── review
    │   └── nova
    └── scripts
        └── apply_patch.sh
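A guess at what a script like `scripts/apply_patch.sh` boils down to, sketched in Python: collect the patch files for a project by category folder and replay them onto the upstream checkout. The category ordering (cherry-picks before custom logic) is an assumption, not LINE's documented tooling.

```python
import subprocess
from pathlib import Path


def collect_patches(repo_root, project):
    """Gather patches for one project, cherry-picks before custom logic."""
    base = Path(repo_root) / "patches" / project
    ordered = []
    for category in ("cherry-pick", "customs"):  # assumed ordering
        folder = base / category
        if folder.is_dir():
            ordered.extend(sorted(folder.glob("*.patch")))
    return ordered


def apply_patches(src_dir, patches):
    # Apply each patch on top of the upstream checkout at src_dir.
    for p in patches:
        subprocess.run(["git", "apply", str(p)], cwd=src_dir, check=True)
```

Keeping only patches in the repo (rather than a fork) makes it easy to rebase onto a new upstream release: re-run the apply step and fix whichever patches no longer apply.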
  31. Future Works 31

  32. Future Works • Upgrade OpenStack version • Provide more scalable

    IaaS services • Integrate our in-house scheduler with OpenStack • Disaster recovery tests • Scalability tests / benchmarks • Upstream contribution • oslo.metrics • Large Scale SIG 32
  33. DEMO OpenStack Development Workflow 33

  34. OpenStack Development Workflow 34 Clone repo Build Base Codebase Make

    changes Generate diff (patch) Add new patch Make review Create PR Remove Unused Patches Deploy to test environments Deploy to production Release QA check
  35. Q&A For more info about our team 35