OpenStack Operations and Upstream Efforts at LINE / Our OpenStack Operation and Upstream Policy

LINE Developers

July 24, 2019

Transcript

  1. About Me
    2011.08 ~ : livedoor → NHN Japan → LINE
    2017.02 ~ : Infra platform dept.
    • Develop the in-house private cloud components (MySQL, approval)
    2018.04 ~ : Develop/operate the OpenStack components (Verda platform development team)
    • In charge of Keystone and Designate
  2. Introduction of our private cloud
    • Called 'Verda' internally
    • Launched in April 2017
    • Using the 'Mitaka' release
      ◦ partially the 'Newton' release
  3. Introduction of our private cloud services
    [Diagram: Verda service map. Group labels: IaaS, Managed Service, Platform, Network, Service, Storage, UI. Services shown: Identity, VM/PM, LB, DNS, KaaS (Rancher), FaaS, Object Storage, Block Storage, MySQL, Redis, Elasticsearch, Kafka, Verda UI.]
  4. Scale of our private cloud clusters
      Cluster               Hypervisors   VMs     Records
      production region 1   700           13242   35000
      production region 2   80            538     -
      production region 3   80            562     -
      Dev                   600           20378   39000
    • We have 4 clusters.
    • They work as a single tenant.
    • Production has 3 regions.
    • Hypervisors
      ◦ Prod: about 860
      ◦ Dev: about 600
    • VMs
      ◦ Prod: 14342
      ◦ Dev: 20378
    • In August, we will start providing a new cluster. (NEW!)
  5. Active VMs transition for dev and production
    • Dev: 219% UP (2018/06: 9237 VMs → 2019/06: 20378 VMs)
    • Prod: 387% UP (2018/06: 3701 VMs → 2019/06: 14342 VMs)
  6. Custom requirements in Designate
    • Disaster Recovery (DR) for DNS
      ◦ customize the architecture
    • How we manage the extra API filter (API Gateway)
      ◦ Paste Deploy
    • Our upgrade strategy
      ◦ customize source code management
  7. Why we need DR
    • Even if a disaster strikes and we can't operate region 1, we have to keep providing the LINE service.
    • There are many development branches
      ◦ Tokyo, Fukuoka, Kyoto
      ◦ Korea, China, Taiwan, Thailand, Vietnam, Indonesia
  8. Pain points in achieving DR
    • We had deployed the Designate service only to region 1
      ◦ Install Designate
    • Designate itself doesn't provide Disaster Recovery functionality.
      ◦ Think about how to build DR
    • How do we monitor the DR
      ◦ Use other region or
  9. Design before starting the DR project
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway, Designate, PowerDNS, BIND (x4), MySQL, nova compute (x2), RabbitMQ (x2), Shovel.]
    A shovel can move messages between brokers in different administrative domains.
  10. Design before starting the DR project
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway, Designate, PowerDNS, BIND (x4), MySQL, nova compute (x2), RabbitMQ (x2), Shovel.]
  11. Deployed Designate to Region 2
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  12. Region 1 Designate is hit by a disaster
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
    Region 2 MySQL doesn't have the Region 1 data.
  13. Introduce pools configuration
    • What's the pools.yml
      ◦ Designate supports multiple 'pools' of DNS service
      ◦ The `also_notifies` and `targets` entries in pools.yml decide how update data is sent.
        ▪ also_notifies: optional list of additional IP/ports to which designate-mdns will send DNS NOTIFY packets
        ▪ targets: lists the designate-mdns servers (masters) from which PowerDNS servers should request zone transfers (AXFRs)
      ◦ Various backends (PowerDNS, BIND, etc.) can be set from Newton onward.
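As a rough illustration of the pool layout described above, the sketch below builds one pool entry as a Python dict that mirrors the YAML structure. The key names `targets`, `masters`, `options`, and `also_notifies` follow the upstream Designate sample pools file; the `type` string, every host and port, and the helper `make_target` are placeholders for illustration, not LINE's configuration, and real backends need more options than shown here.

```python
import json


def make_target(backend_type, dns_host, mdns_hosts, description):
    """Build one `targets` entry: a DNS server that Designate manages."""
    return {
        "type": backend_type,
        "description": description,
        # designate-mdns servers this DNS server requests zone transfers from
        "masters": [{"host": host, "port": 5354} for host in mdns_hosts],
        # Reduced to host/port here; real backends take further options.
        "options": {"host": dns_host, "port": 53},
    }


pool = {
    "name": "default",
    "targets": [
        make_target("powerdns", "192.0.2.10", ["192.0.2.1"], "region 1 PowerDNS (placeholder)"),
        make_target("powerdns", "198.51.100.10", ["192.0.2.1"], "region 2 PowerDNS (placeholder)"),
    ],
    # Extra servers that receive DNS NOTIFY packets from designate-mdns
    "also_notifies": [{"host": "192.0.2.20", "port": 53}],
}

print(json.dumps(pool, indent=2))
```

Listing two targets in one pool is the shape the later "multi target" slides rely on: each Designate knows about the DNS servers in both regions.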
  14. Failed Plan: Set two backends in pools.yml
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
    MySQL doesn't have the same zone serial number.
  15. Failed Plan: Set up shovels toward each other
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  16. Actual Design: Set the cross-region replication
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  17. Actual Design: Set the cross-region replication
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
    Configure the MySQL connection endpoint of region 1.
  18. Actual Design: Set multiple targets on each Designate
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  19. We need to change the configuration after losing region 1
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  20. We need to change the configuration after losing region 1
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
    Update the pools.yml (delete the region 1 information) and update the 'Master' column.
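The failover edit described on this slide can be sketched as a small transformation over a pool definition. This is only an illustration under assumptions: the pool is held as a Python dict using the upstream key names `targets`, `masters`, and `also_notifies`, the helper `drop_lost_hosts` is hypothetical, and all addresses are placeholders. How the edited file is then loaded into Designate is not shown.

```python
import copy


def drop_lost_hosts(pool, lost_hosts):
    """Return a copy of `pool` with every reference to `lost_hosts` removed."""
    lost = set(lost_hosts)
    updated = copy.deepcopy(pool)
    surviving = []
    for target in updated.get("targets", []):
        if target["options"]["host"] in lost:
            continue  # this DNS server lived in the lost region
        # Keep only the masters (designate-mdns servers) that are still reachable.
        target["masters"] = [m for m in target["masters"] if m["host"] not in lost]
        surviving.append(target)
    updated["targets"] = surviving
    updated["also_notifies"] = [
        n for n in updated.get("also_notifies", []) if n["host"] not in lost
    ]
    return updated


# Placeholder pool: one DNS server per region, masters in both regions.
example_pool = {
    "name": "default",
    "targets": [
        {"options": {"host": "192.0.2.10", "port": 53},
         "masters": [{"host": "192.0.2.1", "port": 5354},
                     {"host": "198.51.100.1", "port": 5354}]},
        {"options": {"host": "198.51.100.10", "port": 53},
         "masters": [{"host": "192.0.2.1", "port": 5354},
                     {"host": "198.51.100.1", "port": 5354}]},
    ],
    "also_notifies": [],
}

region1_hosts = ["192.0.2.10", "192.0.2.1"]  # everything assumed lost with region 1
after_failover = drop_lost_hosts(example_pool, region1_hosts)
```

The function returns a new dict rather than editing in place, so the pre-disaster definition stays available for restoring region 1 later.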
  21. How to monitor the DNS service
    [Diagram: Region 1 and Region 2, each reached "from UI or API", plus a Monitor Server in Region 3. Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
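The slide does not say what the Region 3 monitor server actually checks. One plausible probe, sketched here with only the standard library, sends a DNS query to each region's server and treats a NOERROR reply as healthy. The function names, the probed record name, and the server addresses in the usage note are all assumptions.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary identifier echoed back by the server


def build_query(name, qtype=1):
    """Encode a minimal DNS query (RFC 1035) for `name`; qtype 1 = A, 6 = SOA."""
    # Header: id, flags (0 = standard query, no recursion), 1 question, no other records
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def check_server(server, name, timeout=2.0):
    """Return True if `server` answers the query with RCODE 0 (NOERROR)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    return reply_id == QUERY_ID and (flags & 0x000F) == 0


def check_regions(servers_by_region, name):
    """Probe one DNS server per region and report which ones answer."""
    return {region: check_server(host, name) for region, host in servers_by_region.items()}
```

A call such as `check_regions({"region1": "192.0.2.10", "region2": "198.51.100.10"}, "probe.example.com")` (placeholder addresses) would run from the third region, so the monitor itself survives the loss of either DNS region.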
  22. Why we need the extra API filter
    • Don't register unmanaged IPs
      ◦ we need a whitelist
    • Don't register unmanaged domains
  23. DNS architecture
    [Diagram: Region 1 and Region 2, each reached "from UI or API". Labels: API Gateway (x2), Designate (x2), PowerDNS (x2), BIND (x4), MySQL (x2), nova compute (x2), RabbitMQ (x2).]
  24. Pain points of this flow
    • This flow is difficult for us to troubleshoot
    • It incurs a maintenance cost
      ◦ Deploy
      ◦ New requirement
  25. How do we change
    • Component endpoints are managed by PasteDeploy.
      ◦ It is easy to use and scalable.
      ◦ We can inject business logic.
    We changed the API Gateway (Flask) into a WSGI application and configured it in paste.ini.
  26. Registering flow
    • Paste Deploy can inject business logic before the main logic.
    • Easy to manage
    • Easy to troubleshoot
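A minimal sketch of such a filter, assuming the business logic is the IP whitelist mentioned earlier. The `filter_factory(global_conf, **local_conf)` signature is the Paste Deploy convention for filters; the class name, the `allowed_networks` option, the 403 payload, and the assumption that IPs arrive in a JSON `records` list are illustrative and not taken from LINE's code.

```python
import io
import ipaddress
import json

# Hypothetical paste.ini wiring (section and module names are made up):
#   [filter:ip_whitelist]
#   paste.filter_factory = verda_filters:filter_factory
#   allowed_networks = 10.0.0.0/8


class IPWhitelistFilter:
    """WSGI middleware that rejects record requests containing unmanaged IPs."""

    def __init__(self, app, allowed_networks):
        self.app = app
        self.allowed = [ipaddress.ip_network(net) for net in allowed_networks]

    def _is_managed(self, value):
        try:
            address = ipaddress.ip_address(str(value))
        except ValueError:
            return True  # not an IP (e.g. a CNAME target); left to other filters
        return any(address in net for net in self.allowed)

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") in ("POST", "PUT"):
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length)
            environ["wsgi.input"] = io.BytesIO(body)  # let the wrapped app re-read it
            try:
                records = json.loads(body or b"{}").get("records", [])
            except (ValueError, AttributeError):
                records = []  # malformed body: let the main application report it
            if not isinstance(records, list):
                records = []
            if not all(self._is_managed(record) for record in records):
                payload = json.dumps({"code": 403, "message": "unmanaged IP"}).encode("utf-8")
                start_response("403 Forbidden", [
                    ("Content-Type", "application/json"),
                    ("Content-Length", str(len(payload))),
                ])
                return [payload]
        return self.app(environ, start_response)


def filter_factory(global_conf, **local_conf):
    """Paste Deploy entry point: returns a callable that wraps the next app."""
    networks = local_conf.get("allowed_networks", "").split()

    def wrap(app):
        return IPWhitelistFilter(app, networks)

    return wrap
```

Because the check runs in the same process and pipeline as the main application, a rejected request shows up in one place, which matches the "easy to troubleshoot" point above.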
  27. Examples of our use cases
    • Request filter
      ◦ Designate: IP, domain
      ◦ Keystone: user
    • Replace a response format (Keystone)
      ◦ Verda provides SSO authentication that uses Keystone and Apache Mellon.
      ◦ If Keystone can't find the user, return a 401 error in JSON format.
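The response-format case can be sketched as a second WSGI middleware that buffers the wrapped application's response and swaps the body when the status is 401. The class name and the JSON payload are assumptions made for illustration; the slides do not show the actual format returned.

```python
import json


class JsonUnauthorized:
    """WSGI middleware that turns any 401 response into a JSON body."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = list(headers)
            return lambda data: None  # no-op write() callable required by WSGI

        result = self.app(environ, capture)
        try:
            chunks = list(result)  # buffer so the status is known before replying
        finally:
            if hasattr(result, "close"):
                result.close()

        status, headers = captured["status"], captured["headers"]
        if status.startswith("401"):
            # Hypothetical payload shape; adjust to whatever clients expect.
            payload = json.dumps(
                {"error": {"code": 401, "title": "Unauthorized", "message": "user not found"}}
            ).encode("utf-8")
            headers = [
                (key, value) for key, value in headers
                if key.lower() not in ("content-type", "content-length")
            ]
            headers += [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(payload))),
            ]
            chunks = [payload]
        start_response(status, headers)
        return chunks
```

Buffering the whole response is acceptable for small API replies such as authentication errors; it would be a poor fit for large streamed bodies.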
  28. Why we decided to upgrade Designate
    • Want to use a new search API
      ◦ We had customized the search API, but it was a little slow
      ◦ A new recordset api "/v2/recordsets" is exposed with GET method allowed only. The api can be used for retrieving recordsets across all the zones under a tenant. Filtering on certain fields is supported as well.
        https://docs.openstack.org/releasenotes/designate/newton.html
  29. What's the pain point of upgrade
    [Diagram: Upstream Designate (Change 1, Change 2, Change 3) is forked into our designate repository (Change A, Change B, Change C), which feeds the build server and the yum repo.]
    The reason is that we were adding the code directly.
  30. Difficulty of upgrade
    [Diagram: Upstream Designate (Change 1, Change 2, Change 3) is forked into our designate repository (Change A, Change B, Change C), which feeds the build server and the yum repo.]
    I cannot keep up with the changes. I don't know what effect they have.
  31. Difficulty of upgrade
    [Diagram: Upstream Designate (Change 1, Change 2, Change 3) is forked into our designate repository (Change A, Change B, Change C), which feeds the build server and the yum repo.]
    Summarize all changes.
  32. How to upgrade a component
    [Diagram: Upstream Designate and a separate line-designate repository holding patch A, patch B, and patch C (formerly Change A, Change B, Change C) feed the build server and the yum repo.]
    Make a new repository for our changes.
  33. Upgrade a component
    The directory is separated into cherry-pick and customs.
    • cherry-pick: from upstream
      There are two types of patch in the cherry-pick directory:
      1. .patch: this is NOT an original upstream patch.
      2. .upstream: this is an original upstream patch.
    • customs: requirements and internal issues
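To make the directory convention concrete, here is a small sketch that sorts patch files into the three kinds described above. The function name and the file names in the usage note are hypothetical; only the directory names and the `.patch` / `.upstream` suffixes come from the slide.

```python
from pathlib import PurePosixPath


def plan_patches(paths):
    """Group patch files by directory and by whether they are verbatim upstream."""
    plan = {"upstream": [], "modified_upstream": [], "customs": []}
    for raw in sorted(paths):
        path = PurePosixPath(raw)
        if not path.parts:
            continue
        top = path.parts[0]
        if top == "cherry-pick" and path.suffix == ".upstream":
            plan["upstream"].append(raw)  # original upstream patch
        elif top == "cherry-pick" and path.suffix == ".patch":
            plan["modified_upstream"].append(raw)  # upstream change, adapted locally
        elif top == "customs":
            plan["customs"].append(raw)  # internal requirements and fixes
    return plan
```

Calling `plan_patches(["cherry-pick/0001-fix.upstream", "customs/0001-ip-filter.patch"])` returns the files grouped under `upstream` and `customs`. One plausible benefit of the split, stated here as an assumption, is that verbatim upstream patches are the easiest to drop once an upgrade already contains them.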
  34. How to upgrade and rollback
    UPGRADE
    1. build a new package
    2. update the package version on Ansible
    3. deploy
    ROLLBACK
    1. stop the component's process
    2. delete the installed package
    3. fix the package version on Ansible
    4. re-install the package
    [Diagram label: Designate]
  35. Conclusion and future plan in Designate
    • Conclusion
      ◦ We have a DR architecture for DNS
      ◦ We have been using Paste Deploy effectively
      ◦ We have a plan for upgrading Designate
    • Future plan
      ◦ Build the control plane on k8s
        ▪ There is an issue with the current way of upgrading
          • the upgrade and rollback processes are a pain in the neck.
        ▪ troubleshooting
      ◦ upgrade to the latest release
  36. Our OSS system policy
    • Don't trust the documentation alone
    • Grasp the code and the internal state of the running process
    • Discussion inside the team is based on code
  37. Contribute to
    1. DB Deadlock issue
      ◦ https://review.opendev.org/#/c/647711/
    2. service_status checker issue
      ◦ https://review.opendev.org/#/c/657382/
    3. Improve log message for better understanding
      ◦ https://review.opendev.org/#/c/669154/
  39. What happened on Designate
    • I got a report from the user side.
    • The record status is ERROR, but it can still respond.
  40. Check log
    transaction1 / transaction2 (statement numbers as shown on the slide)
    2. UPDATE zones SET version=(zones.version + 1), updated_at='2019-02-21 12:30:47.183292', serial=1550752338, status='PENDING', action='UPDATE' WHERE zones.id = '5e5ffc278c5745baae5287c160f54dce' AND zones.deleted = '0'
    3. UPDATE records SET version=(records.version + 1), updated_at='2019-02-21 12:30:47.508057', data='ns.example.jp. domain.example.com. 1550752338 3552 600 86400 3600', hash='39795ee18c6e3c9ad1c0190c6a3d8d4f', status='PENDING', action='UPDATE', serial=1550752338 WHERE records.id = '7a655eeda4d446cdaa81caf19ab55fcc'
    7. UPDATE records SET version=(records.version + 1), updated_at='2019-02-21 12:30:11.282188', status='ACTIVE', action='NONE' WHERE records.id = '7a655eeda4d446cdaa81caf19ab55fcc'
    8. UPDATE zones SET version=(zones.version + 1), updated_at='2019-02-21 12:30:47.178028', status='ACTIVE', action='NONE' WHERE zones.id = '5e5ffc278c5745baae5287c160f54dce' AND zones.deleted = '0'
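Reading the slide as two columns, with statements 2 and 3 in transaction1 and statements 7 and 8 in transaction2 (an assumption, since the transcript flattens the layout), the two transactions update the same zone row and the same record row in opposite order. That is the classic shape of a lock-order deadlock. The sketch below is a toy model of that reasoning, not Designate code: it lists the rows each transaction locks, in order, and reports any pair locked in reverse. The row labels abbreviate the IDs in the log.

```python
def lock_order_conflicts(tx_a, tx_b):
    """Return pairs of resources that tx_a and tx_b lock in opposite order."""
    conflicts = []
    for i, first in enumerate(tx_a):
        for second in tx_a[i + 1:]:
            both_shared = first in tx_b and second in tx_b
            if both_shared and tx_b.index(second) < tx_b.index(first):
                conflicts.append((first, second))
    return conflicts


# Rows locked by each transaction, in statement order (IDs abbreviated).
transaction1 = ["zones:5e5ffc27", "records:7a655eed"]  # statements 2, 3
transaction2 = ["records:7a655eed", "zones:5e5ffc27"]  # statements 7, 8

conflicts = lock_order_conflicts(transaction1, transaction2)
print(conflicts)
```

A non-empty result means each transaction can end up holding one row while waiting for the other, which the database resolves by aborting one of them. Making both code paths touch the rows in the same order removes the cycle in this model.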
  41. Summarize internally
    • When we hit an error, we write up an internal report (sharing how to resolve it).
  42. Check bug report
    I sometimes find a bug report describing the same situation, so I check the reports before submitting a patch.
    https://bugs.launchpad.net/designate/+bug/1785459
  43. Backport to supported (not EOL) OpenStack releases
    We never forget to backport to the supported OpenStack releases, because this is our motivation for upgrading.
  44. Session from LINE
    • [Related Event] Rancher Day
    • 7/24 11:30~
    • https://eventregist.com/e/rdt2019
  45. Open Infrastructure Summit Shanghai
    • Open Infrastructure Summit
    • https://www.openstack.org/summit/shanghai-2019/
    • November 4-6, 2019
    • Submitted 2 CFPs.
      ◦ How we used RabbitMQ in wrong way at a scale
      ◦ Multi-Site, Shared Zone, Extra API Filter...How we brought Designate up in Our Production Private Cloud