Slide 1

Hyperscale Projects: Why and How to Build on Ceph’s Limitations
Ilsoo Byun, LINE Plus

Slide 2

About me

Slide 3

Agenda
- Motivation & Design Considerations
- Limitations of Ceph
- Federation of Clusters
- Hybrid Storage for Efficiency
- Conclusions

Slide 4

Motivation

Slide 5

What to store: VOOM & Album
- Image/video: from several KB to several GB
- Metadata: tens of bytes
- Total size: hundreds of PB
- Total objects: 300+ billion

Slide 6

Ceph
Ceph is an open-source software-defined storage solution that is highly scalable.
http://www.yet.org/2012/12/staas/

Slide 7

Ceph usage status within LINE
• 30+ Clusters
• 2,500+ Servers
• 70,000+ OSDs
• 700+ PB

Slide 8

Hyper-scale Object Storage
• Single cluster × 10
• 3,000+ OSDs
• 150+ Hosts
• SSD: 3 PiB
• HDD: 20 PiB

Slide 9

Limitations of Ceph
• Beyond a certain scale, the behavior of ceph-mgr becomes severely unstable
  • https://ceph.io/en/news/blog/2022/scaletesting-with-pawsey/
  • https://ceph.io/en/news/blog/2022/mgr-ttlcache
  • ceph-mgr daemons were killed repeatedly when adding OSDs
  • ceph status showed incorrect information
• The Ceph monitor is not perfect either!
  • ceph-mon was OOM-killed due to client issues

Slide 10

Unstable ceph-mgr beyond a certain scale
ceph-mgr restarted repeatedly

Slide 11

Unresponsive Ceph monitors
• Increased Paxos commit latency

Slide 12

Monitor’s log trim bug

Slide 13

Don't put all your eggs in one basket

Slide 14

Design Considerations
• Storing hundreds of petabytes of data
• Sustainable scalability
• Fault tolerance
• S3-compatible object storage
• Storage efficiency
• Acting as a single cluster

Slide 15

Federation of clusters

Slide 16

Constraints
• The number of backend clusters can be increased.
• Existing clusters cannot be removed.
• Cluster weights can be changed.
• Rebalancing (reshuffling) is not allowed.

Slide 17

Illusion of a single cluster
[Architecture diagram: DNS round-robin in front of load balancers, a routing layer of nginx instances, and Ceph Clusters #1 through #n behind it. The federation looks like a single cluster to users.]

Slide 18

The router chooses a cluster deterministically, based on the cluster map (epoch) that the bucket belongs to.
[Diagram: the routing layer determines which cluster a bucket belongs to. Cluster Map (epoch 1) holds Clusters #1 and #2; Cluster Map (epoch 2) holds Clusters #1 through #4; Buckets A through D are mapped through these cluster maps.]

Slide 19

Cluster Map
The cluster map increases its epoch when:
• adding a new cluster
• changing the weights of clusters
[Diagram: how to find a bucket location. Cluster Map #1 holds Cluster #1 with weight 10 and Cluster #2 with weight 5.]
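
The talk doesn't show the routing algorithm itself; below is a minimal sketch, assuming weighted rendezvous hashing and the weights from the slide. CLUSTER_MAPS, choose_cluster, and the epoch-2 entries are illustrative assumptions, not LINE's actual code.

```python
import hashlib
import math

# Epoch-versioned cluster maps. Weights (10 and 5, from the slide) steer
# how many *new* buckets land on each cluster; epoch-2 entries are made up.
CLUSTER_MAPS = {
    1: {"cluster-1": 10, "cluster-2": 5},
    2: {"cluster-1": 10, "cluster-2": 5, "cluster-3": 10, "cluster-4": 10},
}

def _uniform(bucket: str, cluster: str) -> float:
    """Stable per-(bucket, cluster) value, uniform in (0, 1)."""
    digest = hashlib.sha256(f"{bucket}/{cluster}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 0.5) / 2**64

def choose_cluster(bucket: str, epoch: int) -> str:
    """Weighted rendezvous hashing over the map of the bucket's epoch."""
    cmap = CLUSTER_MAPS[epoch]
    return max(cmap, key=lambda c: cmap[c] / -math.log(_uniform(bucket, c)))
```

Because a bucket stays pinned to the epoch it was created in, publishing a new epoch (new clusters or new weights) never moves existing buckets, which matches the "no rebalancing" constraint on slide 16.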

Slide 20

Keep S3 compatibility: simple routing is not enough
Multipart Upload:
1. Init request & get upload-id
2. Upload parts by upload-id
3. Complete by upload-id
Multipart Copy:
1. Init request & get upload-id
2. Copy a part of another object
3. Complete by upload-id
[Diagram: object A (head, tail #1, tail #2) and object B (head).]

Slide 21

Keep S3 compatibility: simple routing is not enough
[Same multipart upload/copy steps and diagram as the previous slide, now with objects A and B placed across Cluster #1 and Cluster #2.]
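
Every multipart call after the init carries an upload-id rather than a bucket/key the router can hash directly, which is why simple routing breaks. Below is a hedged sketch of one way a router could pin an upload to the cluster that created it; UPLOAD_SESSIONS and the helper names are illustrative, not from the talk.

```python
import uuid

# Hypothetical in-memory session table: upload-id -> cluster.
# A real routing layer would need shared, durable state across routers.
UPLOAD_SESSIONS: dict[str, str] = {}

def init_multipart(bucket: str, epoch: int) -> str:
    """InitMultipartUpload: pick the bucket's cluster and remember the id."""
    cluster = choose_cluster(bucket, epoch)  # from the earlier routing sketch
    upload_id = str(uuid.uuid4())            # stand-in for the id RGW returns
    UPLOAD_SESSIONS[upload_id] = cluster
    return upload_id

def route_multipart(upload_id: str) -> str:
    """Route UploadPart / UploadPartCopy / CompleteMultipartUpload.

    All of these carry the upload-id, so they must land on the cluster
    that issued it. UploadPartCopy is the hard case: the copy source
    (object A on the slide) may live in a different cluster than the
    destination (object B), so the gateway has to read across clusters.
    """
    return UPLOAD_SESSIONS[upload_id]
```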

Slide 22

Why rule out reshuffling? Too many objects to move
• Inter-cluster reshuffling is inefficient
• Internal clusters are independently scalable
• We can control incoming traffic to each internal cluster
• An increase in total capacity does not necessarily mean an increase in traffic

Slide 23

Storage efficiency

Slide 24

Hybrid storage type
• Not a tiered solution
• Media chosen based on object size
• Runtime-configurable option
[Flowchart: is the object small? Yes → SSD (3-replica); No → HDD (EC).]
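
A minimal sketch of the size-based decision in the flowchart. The actual cutoff is described only as a runtime-configurable option; the 64 KiB value and the pool names here are assumptions for illustration.

```python
# Assumed runtime-configurable cutoff; the talk doesn't state the value.
SMALL_OBJECT_THRESHOLD = 64 * 1024  # 64 KiB

def choose_pool(object_size: int) -> str:
    """Route by object size, not access temperature (not a tiering policy)."""
    if object_size <= SMALL_OBJECT_THRESHOLD:
        return "ssd-3replica-pool"  # small objects: 3x replication on SSD
    return "hdd-ec-pool"            # large objects: 4:3 erasure coding on HDD
```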

Slide 25

Inefficiency of the EC profile
• How to separate storage media?
  • Hot vs. cold
  • Small vs. large
• 3x replication = 3x overhead
• 4:3 EC = 1.75x overhead
• Example: a 1 KB object (with a 4 KB allocation unit per replica/shard)
  • 3x replication = 12 KB stored
  • 4:3 EC = 28 KB stored
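
The 12 KB and 28 KB figures follow from per-shard allocation granularity; a worked check, assuming BlueStore's 4 KiB minimum allocation unit:

```python
ALLOC = 4 * 1024  # assumed BlueStore min_alloc_size (4 KiB)

def stored_replicated(size: int, replicas: int = 3) -> int:
    """Each replica rounds the object up to whole allocation units."""
    units = -(-size // ALLOC)  # ceiling division
    return replicas * units * ALLOC

def stored_ec(size: int, k: int = 4, m: int = 3) -> int:
    """Each of the k+m shards rounds its slice up to allocation units."""
    shard = -(-size // k)
    units = -(-shard // ALLOC)
    return (k + m) * units * ALLOC

print(stored_replicated(1024))  # 12288 bytes = 12 KiB
print(stored_ec(1024))          # 28672 bytes = 28 KiB
```

For tiny objects, 4:3 EC stores more than twice what 3x replication does, despite its nominal 1.75x overhead, which is why small objects go to the replicated SSD pool.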

Slide 26

Lifecycle Threads Optimization
[Diagram: four lifecycle shards (LC.1 to LC.4), each holding three buckets (Bucket A through Bucket L), processed by four LC workers; two shards are marked "Locked!".]
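
A general illustration of the contention the diagram shows, not the talk's actual patch: lifecycle work is split into shards, each guarded by a lock, and a worker can skip shards another worker already holds instead of blocking on them.

```python
import threading

# Shard layout copied from the diagram; the locking behavior is assumed.
SHARDS = {
    "LC.1": ["Bucket A", "Bucket B", "Bucket C"],
    "LC.2": ["Bucket D", "Bucket E", "Bucket F"],
    "LC.3": ["Bucket G", "Bucket H", "Bucket I"],
    "LC.4": ["Bucket J", "Bucket K", "Bucket L"],
}
LOCKS = {name: threading.Lock() for name in SHARDS}

def process_lifecycle(bucket: str) -> None:
    print(f"applying lifecycle rules to {bucket}")  # placeholder work

def lc_worker() -> None:
    for name, buckets in SHARDS.items():
        if not LOCKS[name].acquire(blocking=False):
            continue  # shard is "Locked!" by another worker: move on
        try:
            for bucket in buckets:
                process_lifecycle(bucket)
        finally:
            LOCKS[name].release()
```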

Slide 27

Conclusions
• To overcome the limitations of Ceph, we federated several clusters to act as a single cluster
  • Cluster federation
  • S3 compatibility
  • High storage efficiency
• With this, we can achieve sustainable scalability

Slide 28

Thank you!