Hyperscale Projects, Why and How to Build on Ceph’s Limitations

Tech-Verse2022

November 17, 2022

Transcript

  1. Hyperscale Projects, Why and How to Build on Ceph’s Limitations

    Ilsoo Byun/LINE Plus
  2. About me

  3. Agenda

    - Motivation & Design Considerations
    - Limitation of Ceph
    - Federation of Clusters
    - Hybrid storage for efficiency
    - Conclusions
  4. Motivation

  5. What to store: VOOM & Album

    - Image/video: from several KB to several GB
    - Metadata: tens of bytes
    - Total size: hundreds of PB
    - Total objects: 300+ billion
  6. Ceph

    Ceph is an open source software-defined storage solution that is highly scalable. http://www.yet.org/2012/12/staas/
  7. Ceph usage status within LINE

    • 30+ Clusters
    • 2,500+ Servers
    • 70,000+ OSDs
    • 700+ PB
  8. Hyper-scale Object Storage

    • Single Cluster x 10 ea
    • 3,000+ OSDs
    • 150+ Hosts
    • SSD 3 PiB
    • HDD 20 PiB
  9. Limitations on Ceph

    • Beyond a certain scale, the behavior of ceph-mgr becomes severely unstable
      • https://ceph.io/en/news/blog/2022/scaletesting-with-pawsey/
      • https://ceph.io/en/news/blog/2022/mgr-ttlcache
      • ceph-mgr daemons were killed repeatedly when adding OSDs
      • ceph status showed incorrect information
    • Ceph monitor is not perfect!
      • ceph-mon was OOM-killed because of client issues
  10. Unstable ceph-mgr beyond a certain scale: ceph-mgr restarted repeatedly

  11. Unresponsive Ceph monitors

    • Increased PAXOS commit latency

  12. Monitor’s log trim bug

  13. Don't put all your eggs in one basket

  14. Design Considerations

    • Storing hundreds of petabytes of data
    • Sustainable to scale
    • Fault-tolerant
    • S3-compatible object storage
    • Storage efficiency
    • Acting as a single cluster
  15. Federation of clusters

  16. Constraints

    • The number of backend clusters can be increased.
    • The existing clusters can't be removed.
    • Weight can be changed.
    • Rebalance (reshuffle) is not allowed.
  17. Illusion of a single cluster

    Looks like a single cluster to users.
    (Diagram: DNS round-robin, load balancers, a routing layer of nginx instances, and load balancers in front of Ceph Cluster #1, Ceph Cluster #2, Ceph Cluster #3, ..., Ceph Cluster #n.)
  18. The router chooses a cluster deterministically based on a cluster map belonging to a bucket.

    • Routing Layer: determines which cluster a bucket belongs to
    • Cluster Map (epoch 1): Cluster #1, Cluster #2
    • Cluster Map (epoch 2): Cluster #1, Cluster #2, Cluster #3, Cluster #4
    • Buckets shown: Bucket A, Bucket B, Bucket C, Bucket D
  19. Cluster Map: how to find a bucket location

    The cluster map increases its epoch when:
    • adding a new cluster
    • changing the weights of clusters
    (Diagram: Cluster Map #1 with Cluster #1 and Cluster #2; values shown: 10, 5, 10, 5.)
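Slides 18 and 19 describe deterministic routing driven by an epoch-versioned cluster map. The following is a minimal sketch of one way such routing could work, not the implementation presented in the talk. It assumes that each bucket records the epoch of the map in force when it was created and that the cluster is picked by hashing the bucket name into a weighted range. The names `ClusterMap`, `Router`, `pick_cluster`, and the cluster labels are hypothetical, and the weights 10 and 5 are used purely as illustrative values.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMap:
    """One immutable version of the cluster map (hypothetical structure)."""
    epoch: int
    weights: dict  # cluster name -> positive integer weight


def pick_cluster(bucket: str, cmap: ClusterMap) -> str:
    """Deterministically map a bucket name onto a cluster, proportional to weight."""
    total = sum(cmap.weights.values())
    digest = hashlib.sha256(bucket.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") % total
    # Walk the clusters in a fixed order so every router agrees on the result.
    for name in sorted(cmap.weights):
        point -= cmap.weights[name]
        if point < 0:
            return name
    raise AssertionError("unreachable when total weight is positive")


class Router:
    """Keeps every published map; a bucket stays pinned to its creation epoch."""

    def __init__(self):
        self.maps = {}          # epoch -> ClusterMap
        self.bucket_epoch = {}  # bucket name -> epoch recorded at creation
        self.current = 0

    def publish(self, weights: dict) -> int:
        """Adding a cluster or changing weights produces a new epoch."""
        self.current += 1
        self.maps[self.current] = ClusterMap(self.current, dict(weights))
        return self.current

    def create_bucket(self, bucket: str) -> str:
        self.bucket_epoch[bucket] = self.current
        return self.locate(bucket)

    def locate(self, bucket: str) -> str:
        return pick_cluster(bucket, self.maps[self.bucket_epoch[bucket]])
```

Because old maps are never discarded and a bucket always resolves against its own epoch, publishing a new map in this sketch changes where new buckets land without moving existing ones, which is consistent with the "no reshuffle" constraint on slide 16.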
  20. Keep S3 compatibility: simple routing is not enough

    • Multipart Upload: 1. Init request & get upload-id 2. Upload parts by upload-id 3. Complete by upload-id
    • Multipart Copy: 1. Init request & get upload-id 2. Copy part of other object 3. Complete by upload-id
    (Diagram: object A with head, Tail #1, Tail #2; object B with head.)
  21. Keep S3 compatibility: simple routing is not enough

    • Multipart Upload: 1. Init request & get upload-id 2. Upload parts by upload-id 3. Complete by upload-id
    • Multipart Copy: 1. Init request & get upload-id 2. Copy part of other object 3. Complete by upload-id
    (Diagram: object A with head, Tail #1, Tail #2; object B with head; Cluster #1 and Cluster #2.)
  22. Why rule out reshuffling? Too many objects to move

    • Inter-cluster reshuffling is inefficient
    • Inner clusters are independently scalable
    • We can control incoming traffic to each internal cluster
    • An increase in total capacity does not necessarily mean an increase in traffic.
  23. Storage efficiency

  24. Hybrid storage type

    • Not a tiered solution
    • Choose media based on object size
    • Runtime configurable option
    • Is small object? Yes: SSD (3 replica). No: HDD (EC).
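The decision on slide 24 can be sketched as a single size check. This is only an illustration: the 64 KiB threshold, the pool names, and the `PlacementConfig` type are assumptions, since the slide does not state where the small/large boundary lies or how the runtime option is exposed.

```python
from dataclasses import dataclass


@dataclass
class PlacementConfig:
    # Hypothetical threshold; the slide only says it is configurable at runtime.
    small_object_max_bytes: int = 64 * 1024


def choose_pool(size_bytes: int, config: PlacementConfig) -> str:
    """Small objects go to replicated SSD, everything else to erasure-coded HDD."""
    if size_bytes <= config.small_object_max_bytes:
        return "ssd-replica3"  # hypothetical pool name: SSD, 3 replicas
    return "hdd-ec"            # hypothetical pool name: HDD, erasure coded


config = PlacementConfig()
small = choose_pool(1024, config)             # 1 KiB object
large = choose_pool(10 * 1024 * 1024, config)  # 10 MiB object

# Changing the option at runtime changes placement for later writes only.
config.small_object_max_bytes = 512
after_change = choose_pool(1024, config)
```

Note that this is a placement decision made once at write time, not a tiering policy that later migrates objects between media, in line with the "not a tiered solution" bullet.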
  25. Inefficiency of EC profile

    • How to separate storing media
      • Hot vs. Cold
      • Small vs. Large
    • 3 Replication = 3x overhead
    • 4:3 EC = 1.75x overhead
    • Ex. 1KB object
      • 3 replication = 12KB
      • 4:3 EC = 28KB
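The 1KB figures on slide 25 are consistent with every replica or EC chunk occupying at least one 4 KB allocation unit on disk. That unit size is an assumption inferred from the numbers, not something the slide states, and the model below ignores striping details and metadata. Under that assumption the arithmetic can be checked as follows:

```python
import math

ALLOC_UNIT = 4 * 1024  # assumed minimum on-disk allocation per replica or chunk


def allocated(nbytes: int, unit: int = ALLOC_UNIT) -> int:
    """Round a payload up to whole allocation units (at least one unit)."""
    return max(1, math.ceil(nbytes / unit)) * unit


def replicated_footprint(size: int, replicas: int = 3) -> int:
    """Each replica stores the full object."""
    return replicas * allocated(size)


def ec_footprint(size: int, k: int = 4, m: int = 3) -> int:
    """The object is split into k data chunks plus m coding chunks of equal size."""
    chunk = math.ceil(size / k)
    return (k + m) * allocated(chunk)


KB = 1024
MB = 1024 * KB

small_rep = replicated_footprint(1 * KB) // KB  # 3 replicas x 4 KB
small_ec = ec_footprint(1 * KB) // KB           # 7 chunks x 4 KB
large_rep = replicated_footprint(4 * MB) / (4 * MB)
large_ec = ec_footprint(4 * MB) / (4 * MB)
```

In this simplified model a 1 KB object costs 12 KB with 3 replicas but 28 KB with 4:3 EC, while a 4 MB object shows the nominal 3x and 1.75x overheads. This is the motivation for sending small objects to replicated storage and large ones to EC.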
  26. Lifecycle Threads Optimization

    • LC.1: Bucket A, Bucket B, Bucket C
    • LC.2: Bucket D, Bucket E, Bucket F
    • LC.3: Bucket G, Bucket H, Bucket I
    • LC.4: Bucket J, Bucket K, Bucket L
    (Diagram: four LC Workers; two of them are marked "Locked!")
  27. Conclusions

    • To overcome the limitations of Ceph, we federated several clusters to act as a single cluster
      • Cluster federation
      • S3 compatibility
      • High storage efficiency
    • We can achieve sustainable scalability
  28. Thank you!