
Hyperscale Projects, Why and How to Build on Ceph’s Limitations

Tech-Verse2022

November 17, 2022

Transcript

  1. Hyperscale Projects, Why
    and How to Build on
    Ceph’s Limitations
    Ilsoo Byun/LINE Plus

  2. Agenda
    - Motivation & Design Considerations
    - Limitations of Ceph
    - Federation of Clusters
    - Hybrid storage for efficiency
    - Conclusions

  3. What to store
    VOOM & Album
    - Image/video: from several KB to several GB
    - Metadata: tens of bytes
    - Total size: hundreds of PB
    - Total objects: 300+ billion

  4. Ceph
    Ceph is an open source software-defined storage solution that is highly scalable.
    http://www.yet.org/2012/12/staas/

  5. Ceph usage status within LINE
    • 30+ Clusters
    • 2,500+ Servers
    • 70,000+ OSDs
    • 700+ PB

  6. Hyper-scale Object Storage
    • Single Cluster x 10 ea
    • 3,000+ OSDs
    • 150+ Hosts
    • SSD 3 PiB
    • HDD 20 PiB

  7. Limitations of Ceph
    • Beyond a certain scale, the behavior of ceph-mgr becomes severely unstable
      • https://ceph.io/en/news/blog/2022/scaletesting-with-pawsey/
      • https://ceph.io/en/news/blog/2022/mgr-ttlcache
      • ceph-mgrs were killed repeatedly when adding OSDs
      • ceph status showed incorrect information
    • The Ceph monitor is not perfect either!
      • ceph-mon was OOM-killed due to client issues

  8. Unstable ceph-mgr beyond a certain scale
    ceph-mgr was restarted repeatedly

  9. Unresponsive Ceph monitors
    • Increased Paxos commit latency

  10. Monitor’s log trim bug

  11. Don't put all your eggs in one basket

  12. Design Considerations
    • Storing hundreds of petabytes of data
    • Sustainably scalable
    • Fault-tolerant
    • S3-compatible object storage
    • Storage efficiency
    • Acting as a single cluster

  13. Federation of clusters

  14. Constraints
    • The number of backend clusters can be increased.
    • The existing clusters can't be removed.
    • Weight can be changed.
    • Rebalancing (reshuffling) is not allowed.

  15. Illusion of a single cluster
    (Architecture: DNS round-robin → front load balancers → routing layer of nginx instances → per-cluster load balancers → Ceph Cluster #1 … Ceph Cluster #n)
    Looks like a single cluster to users

  16. Routing Layer
    Determines which cluster a bucket belongs to
    • The router chooses a cluster deterministically, based on the cluster map that the bucket belongs to
    (Diagram: Cluster Map (epoch 1) holds Cluster #1 and Cluster #2; Cluster Map (epoch 2) adds Cluster #3 and Cluster #4; Buckets A, B, C, and D are each resolved to a cluster through these maps)
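
As a concrete illustration of such deterministic routing, here is a minimal sketch (the class name, weights, and the rendezvous-hash scheme are assumptions for illustration, not LINE's actual implementation) in which each epoch of the cluster map resolves a bucket name to a cluster by weighted hashing:

```python
import hashlib
import math

class ClusterMap:
    """One epoch of the cluster map: backend clusters and their weights."""

    def __init__(self, epoch, clusters):
        self.epoch = epoch        # e.g. 1, 2, ...
        self.clusters = clusters  # e.g. {"cluster-1": 10, "cluster-2": 5}

    def pick(self, bucket):
        """Deterministically map a bucket to one cluster, weighted by capacity.

        Weighted rendezvous hashing: for a given epoch, the same bucket name
        always resolves to the same cluster, with no lookup table required.
        """
        def score(cluster, weight):
            digest = hashlib.sha256(f"{bucket}:{cluster}".encode()).hexdigest()
            u = (int(digest, 16) + 1) / (2**256 + 1)  # uniform in (0, 1)
            return -weight / math.log(u)              # higher weight => higher score

        return max(self.clusters, key=lambda c: score(c, self.clusters[c]))

# Epoch 1 knows two clusters; a given bucket always resolves to the same one.
epoch1 = ClusterMap(1, {"cluster-1": 10, "cluster-2": 5})
print(epoch1.pick("bucket-a"))
```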

  17. Cluster Map
    How to find a bucket location
    The cluster map increases its epoch when:
    • adding a new cluster
    • changing the weights of clusters
    (Diagram: Cluster Map #1 holds Cluster #1 with weight 10 and Cluster #2 with weight 5)
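
Building on the `ClusterMap` sketch above, one plausible way to combine epoch bumps with the no-reshuffle constraint from slide 14 (an assumption for illustration, not necessarily the presented design) is to pin each bucket to the epoch it was created under, so a new epoch only affects newly created buckets:

```python
class Router:
    """Routing-layer sketch: resolve buckets to clusters without reshuffling."""

    def __init__(self, initial_map):
        self.maps = {initial_map.epoch: initial_map}  # epoch -> ClusterMap
        self.current = initial_map
        self.bucket_epoch = {}                        # bucket -> creation epoch

    def bump_epoch(self, clusters):
        """Publish a new epoch when a cluster is added or weights change."""
        new_map = ClusterMap(self.current.epoch + 1, clusters)
        self.maps[new_map.epoch] = new_map
        self.current = new_map

    def create_bucket(self, bucket):
        self.bucket_epoch[bucket] = self.current.epoch

    def locate(self, bucket):
        """Resolve a bucket via the map of the epoch it was created under."""
        return self.maps[self.bucket_epoch[bucket]].pick(bucket)

# Weights 10 : 5 => roughly two thirds of new buckets land on cluster-1.
router = Router(ClusterMap(1, {"cluster-1": 10, "cluster-2": 5}))
router.create_bucket("bucket-a")
before = router.locate("bucket-a")

# Adding clusters bumps the epoch but never moves existing buckets.
router.bump_epoch({"cluster-1": 10, "cluster-2": 5, "cluster-3": 10, "cluster-4": 10})
assert router.locate("bucket-a") == before
```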

  18. Keep S3 compatibility
    Simple routing is not enough
    Multipart Upload
    1. Init request & get upload-id
    2. Upload parts by upload-id
    3. Complete by upload-id
    Multipart Copy
    1. Init request & get upload-id
    2. Copy parts from another object
    3. Complete by upload-id
    (Diagram: object A is stored as a head plus tail parts #1 and #2; object B is stored as a head)

  19. Keep S3 compatibility
    Simple routing is not enough
    Multipart Upload
    1. Init request & get upload-id
    2. Upload parts by upload-id
    3. Complete by upload-id
    Multipart Copy
    1. Init request & get upload-id
    2. Copy parts from another object
    3. Complete by upload-id
    (Diagram: object A's head and tail parts #1 and #2 sit in Cluster #1, while object B's head sits in Cluster #2)
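
One way the routing layer could preserve this part of the S3 API is to make all multipart requests sticky to the cluster that issued the upload-id. The sketch below is purely illustrative (the wrapping scheme and function names are assumptions, not the presented solution):

```python
import base64
import uuid

def issue_upload_id(cluster: str, backend_upload_id: str) -> str:
    """Wrap the backend upload-id so it also identifies the owning cluster."""
    token = f"{cluster}:{backend_upload_id}".encode()
    return base64.urlsafe_b64encode(token).decode()

def resolve_upload_id(upload_id: str) -> tuple[str, str]:
    """Recover (cluster, backend upload-id) so UploadPart and
    CompleteMultipartUpload requests can be routed back to the
    cluster where the upload was initiated."""
    cluster, backend_id = base64.urlsafe_b64decode(upload_id).decode().split(":", 1)
    return cluster, backend_id

wrapped = issue_upload_id("cluster-1", uuid.uuid4().hex)
print(resolve_upload_id(wrapped))  # ('cluster-1', '<backend upload-id>')
```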

  20. Why rule out reshuffling?
    There is simply too much data to move
    • Inter-cluster reshuffling is inefficient
    • The internal clusters are independently scalable
    • We can control the incoming traffic to each internal cluster
    • An increase in total capacity does not necessarily mean an increase in traffic

  21. Storage efficiency

  22. Hybrid storage type
    • Not a tiered solution
    • Choose the storage medium based on object size
    • Runtime-configurable option
    (Decision: is it a small object? Yes → SSD (3 replicas); No → HDD (EC))
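
A minimal sketch of that size-based decision (the threshold name and value, and the pool names, are assumptions used for illustration; the actual cutoff is a runtime-configurable option):

```python
# Hypothetical runtime-configurable cutoff between "small" and "large" objects.
SMALL_OBJECT_THRESHOLD = 128 * 1024  # bytes

def choose_pool(object_size: int) -> str:
    """Small objects go to the replicated SSD pool, large ones to the EC HDD pool."""
    if object_size <= SMALL_OBJECT_THRESHOLD:
        return "ssd-replica-3"   # 3x replication on SSD
    return "hdd-ec-4-3"          # 4+3 erasure coding on HDD

print(choose_pool(1024))           # ssd-replica-3
print(choose_pool(50 * 1024**2))   # hdd-ec-4-3
```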

  23. Inefficiency of the EC profile
    • How to split data across storage media?
      • Hot vs. cold
      • Small vs. large
    • 3-way replication = 3x overhead
    • 4:3 EC = 1.75x overhead
    • Example: a 1 KB object (assuming a 4 KB minimum allocation unit per chunk)
      • 3-way replication = 12 KB
      • 4:3 EC = 28 KB
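
The arithmetic behind the 12 KB and 28 KB figures, assuming each stored chunk is rounded up to a 4 KB minimum allocation unit (the rounding is implied by the numbers on the slide):

```python
import math

MIN_ALLOC = 4 * 1024  # assumed minimum allocation size per stored chunk (bytes)

def round_up(n: int, unit: int = MIN_ALLOC) -> int:
    return math.ceil(n / unit) * unit

def replicated_footprint(obj_size: int, replicas: int = 3) -> int:
    """Each replica stores the whole object, rounded up to the allocation unit."""
    return replicas * round_up(obj_size)

def ec_footprint(obj_size: int, k: int = 4, m: int = 3) -> int:
    """The object is striped into k data chunks, m coding chunks are added,
    and every chunk is rounded up to the allocation unit."""
    return (k + m) * round_up(math.ceil(obj_size / k))

one_kb = 1024
print(replicated_footprint(one_kb))  # 12288 bytes = 12 KB
print(ec_footprint(one_kb))          # 28672 bytes = 28 KB
```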

  24. Lifecycle Threads Optimization
    (Diagram: lifecycle shards LC.1-LC.4, each covering three buckets (A-L); several LC workers run in parallel, and a worker that picks a shard already claimed by another worker hits its lock: "Locked!")
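
A rough sketch of the shard-locking pattern shown in the diagram (a generic illustration, not RGW's actual lifecycle code): each worker tries to claim a lifecycle shard with a non-blocking lock and skips shards that another worker already owns.

```python
import threading

# Lifecycle shards as shown on the slide: each shard covers a few buckets.
LC_SHARDS = {
    "LC.1": ["Bucket A", "Bucket B", "Bucket C"],
    "LC.2": ["Bucket D", "Bucket E", "Bucket F"],
    "LC.3": ["Bucket G", "Bucket H", "Bucket I"],
    "LC.4": ["Bucket J", "Bucket K", "Bucket L"],
}
shard_locks = {shard: threading.Lock() for shard in LC_SHARDS}

def lc_worker() -> None:
    """Process any shard we can lock; skip shards held by other workers ("Locked!")."""
    for shard, buckets in LC_SHARDS.items():
        if not shard_locks[shard].acquire(blocking=False):
            continue  # another worker owns this shard
        try:
            for bucket in buckets:
                pass  # apply the bucket's lifecycle rules here
        finally:
            shard_locks[shard].release()

workers = [threading.Thread(target=lc_worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```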

  25. Conclusions
    • To overcome the limitations of Ceph, we federated several clusters to act as a single cluster
      • Cluster federation
      • S3 compatibility
      • High storage efficiency
    • We can achieve sustainable scalability
