
Hyperscale Projects, Why and How to Build on Ceph’s Limitations

Tech-Verse2022

November 17, 2022

Transcript

  1. Hyperscale Projects, Why
    and How to Build on
    Ceph’s Limitations
    Ilsoo Byun/LINE Plus

  2. Agenda
    - Motivation & Design Considerations
    - Limitations of Ceph
    - Federation of Clusters
    - Hybrid storage for efficiency
    - Conclusions

  3. What to store
    VOOM & Album
    - Image/video: from several KB to several GB
    - Metadata: tens of bytes
    - Total size: hundreds of PB
    - Total objects: 300+ billion

  4. Ceph
    Ceph is an open source software-defined storage solution that is highly scalable.
    http://www.yet.org/2012/12/staas/

  5. Ceph usage status within LINE
    • 30+ Clusters
    • 2,500+ Servers
    • 70,000+ OSDs
    • 700+ PB

  6. Hyper-scale Object Storage
    • Single Cluster x 10 ea
    • 3,000+ OSDs
    • 150+ Hosts
    • SSD 3 PiB
    • HDD 20 PiB

  7. Limitations of Ceph
    • Beyond a certain scale, the behavior of ceph-mgr becomes severely unstable
      • https://ceph.io/en/news/blog/2022/scaletesting-with-pawsey/
      • https://ceph.io/en/news/blog/2022/mgr-ttlcache
      • ceph-mgrs were killed repeatedly when adding OSDs
      • ceph status showed incorrect information
    • The Ceph monitor is not perfect either!
      • ceph-mon was OOM-killed due to client issues

  8. Unstable ceph-mgr beyond a certain scale
    ceph-mgr was restarted repeatedly

  9. Unresponsive Ceph monitors
    • Increased Paxos commit latency

  10. Monitor’s log trim bug

  11. Don't put all your eggs in one basket

  12. Design Considerations
    • Storing hundreds of petabytes of data
    • Sustainably scalable
    • Fault-tolerant
    • S3-compatible object storage
    • Storage efficiency
    • Acting as a single cluster

  13. Federation of clusters

  14. Constraints
    • The number of backend clusters can be increased.
    • The existing clusters can't be removed.
    • Weight can be changed.
    • Rebalancing (reshuffling) is not allowed.

  15. Illusion of a single cluster
    (Architecture: DNS round-robin → front load balancers → routing layer of nginx instances → per-cluster load balancers → Ceph Cluster #1 … Ceph Cluster #n)
    Looks like a single cluster to users

  16. Routing Layer
    Determines which cluster a bucket belongs to
    • The router chooses a cluster deterministically, based on the cluster map that the bucket belongs to
    (Diagram: Cluster Map (epoch 1) holds Cluster #1 and Cluster #2; Cluster Map (epoch 2) adds Cluster #3 and Cluster #4; Buckets A, B, C, and D are each resolved to a cluster through these maps)
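
As a concrete illustration of such deterministic routing, here is a minimal sketch (the class name, weights, and the rendezvous-hash scheme are assumptions for illustration, not LINE's actual implementation) in which each epoch of the cluster map resolves a bucket name to a cluster by weighted hashing:

```python
import hashlib
import math

class ClusterMap:
    """One epoch of the cluster map: backend clusters and their weights."""

    def __init__(self, epoch, clusters):
        self.epoch = epoch        # e.g. 1, 2, ...
        self.clusters = clusters  # e.g. {"cluster-1": 10, "cluster-2": 5}

    def pick(self, bucket):
        """Deterministically map a bucket to one cluster, weighted by capacity.

        Weighted rendezvous hashing: for a given epoch, the same bucket name
        always resolves to the same cluster, with no lookup table required.
        """
        def score(cluster, weight):
            digest = hashlib.sha256(f"{bucket}:{cluster}".encode()).hexdigest()
            u = (int(digest, 16) + 1) / (2**256 + 1)  # uniform in (0, 1)
            return -weight / math.log(u)              # higher weight => higher score

        return max(self.clusters, key=lambda c: score(c, self.clusters[c]))

# Epoch 1 knows two clusters; a given bucket always resolves to the same one.
epoch1 = ClusterMap(1, {"cluster-1": 10, "cluster-2": 5})
print(epoch1.pick("bucket-a"))
```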

  17. Cluster Map
    How to find a bucket location
    The cluster map increases its epoch when:
    • adding a new cluster
    • changing the weights of clusters
    (Diagram: Cluster Map #1 holds Cluster #1 with weight 10 and Cluster #2 with weight 5)
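
Building on the `ClusterMap` sketch above, one plausible way to combine epoch bumps with the no-reshuffle constraint from slide 14 (an assumption for illustration, not necessarily the presented design) is to pin each bucket to the epoch it was created under, so a new epoch only affects newly created buckets:

```python
class Router:
    """Routing-layer sketch: resolve buckets to clusters without reshuffling."""

    def __init__(self, initial_map):
        self.maps = {initial_map.epoch: initial_map}  # epoch -> ClusterMap
        self.current = initial_map
        self.bucket_epoch = {}                        # bucket -> creation epoch

    def bump_epoch(self, clusters):
        """Publish a new epoch when a cluster is added or weights change."""
        new_map = ClusterMap(self.current.epoch + 1, clusters)
        self.maps[new_map.epoch] = new_map
        self.current = new_map

    def create_bucket(self, bucket):
        self.bucket_epoch[bucket] = self.current.epoch

    def locate(self, bucket):
        """Resolve a bucket via the map of the epoch it was created under."""
        return self.maps[self.bucket_epoch[bucket]].pick(bucket)

# Weights 10 : 5 => roughly two thirds of new buckets land on cluster-1.
router = Router(ClusterMap(1, {"cluster-1": 10, "cluster-2": 5}))
router.create_bucket("bucket-a")
before = router.locate("bucket-a")

# Adding clusters bumps the epoch but never moves existing buckets.
router.bump_epoch({"cluster-1": 10, "cluster-2": 5, "cluster-3": 10, "cluster-4": 10})
assert router.locate("bucket-a") == before
```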

  18. Keep S3 compatibility
    Simple routing is not enough
    Multipart Upload
    1. Init request & get upload-id
    2. Upload parts by upload-id
    3. Complete by upload-id
    Multipart Copy
    1. Init request & get upload-id
    2. Copy parts from another object
    3. Complete by upload-id
    (Diagram: object A is stored as a head plus tail parts #1 and #2; object B is stored as a head)

  19. Keep S3 compatibility
    Simple routing is not enough
    Multipart Upload
    1. Init request & get upload-id
    2. Upload parts by upload-id
    3. Complete by upload-id
    Multipart Copy
    1. Init request & get upload-id
    2. Copy parts from another object
    3. Complete by upload-id
    (Diagram: object A's head and tail parts #1 and #2 sit in Cluster #1, while object B's head sits in Cluster #2)
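
One way the routing layer could preserve this part of the S3 API is to make all multipart requests sticky to the cluster that issued the upload-id. The sketch below is purely illustrative (the wrapping scheme and function names are assumptions, not the presented solution):

```python
import base64
import uuid

def issue_upload_id(cluster: str, backend_upload_id: str) -> str:
    """Wrap the backend upload-id so it also identifies the owning cluster."""
    token = f"{cluster}:{backend_upload_id}".encode()
    return base64.urlsafe_b64encode(token).decode()

def resolve_upload_id(upload_id: str) -> tuple[str, str]:
    """Recover (cluster, backend upload-id) so UploadPart and
    CompleteMultipartUpload requests can be routed back to the
    cluster where the upload was initiated."""
    cluster, backend_id = base64.urlsafe_b64decode(upload_id).decode().split(":", 1)
    return cluster, backend_id

wrapped = issue_upload_id("cluster-1", uuid.uuid4().hex)
print(resolve_upload_id(wrapped))  # ('cluster-1', '<backend upload-id>')
```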

  20. Why rule out reshuffling?
    There is simply too much data to move
    • Inter-cluster reshuffling is inefficient
    • The internal clusters are independently scalable
    • We can control the incoming traffic to each internal cluster
    • An increase in total capacity does not necessarily mean an increase in traffic

  21. Storage efficiency

  22. Hybrid storage type
    • Not a tiered solution
    • Choose the storage medium based on object size
    • Runtime-configurable option
    (Decision: is it a small object? Yes → SSD (3 replicas); No → HDD (EC))
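
A minimal sketch of that size-based decision (the threshold name and value, and the pool names, are assumptions used for illustration; the actual cutoff is a runtime-configurable option):

```python
# Hypothetical runtime-configurable cutoff between "small" and "large" objects.
SMALL_OBJECT_THRESHOLD = 128 * 1024  # bytes

def choose_pool(object_size: int) -> str:
    """Small objects go to the replicated SSD pool, large ones to the EC HDD pool."""
    if object_size <= SMALL_OBJECT_THRESHOLD:
        return "ssd-replica-3"   # 3x replication on SSD
    return "hdd-ec-4-3"          # 4+3 erasure coding on HDD

print(choose_pool(1024))           # ssd-replica-3
print(choose_pool(50 * 1024**2))   # hdd-ec-4-3
```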

  23. Inefficiency of the EC profile
    • How to split data across storage media?
      • Hot vs. cold
      • Small vs. large
    • 3-way replication = 3x overhead
    • 4:3 EC = 1.75x overhead
    • Example: a 1 KB object (assuming a 4 KB minimum allocation unit per chunk)
      • 3-way replication = 12 KB
      • 4:3 EC = 28 KB
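
The arithmetic behind the 12 KB and 28 KB figures, assuming each stored chunk is rounded up to a 4 KB minimum allocation unit (the rounding is implied by the numbers on the slide):

```python
import math

MIN_ALLOC = 4 * 1024  # assumed minimum allocation size per stored chunk (bytes)

def round_up(n: int, unit: int = MIN_ALLOC) -> int:
    return math.ceil(n / unit) * unit

def replicated_footprint(obj_size: int, replicas: int = 3) -> int:
    """Each replica stores the whole object, rounded up to the allocation unit."""
    return replicas * round_up(obj_size)

def ec_footprint(obj_size: int, k: int = 4, m: int = 3) -> int:
    """The object is striped into k data chunks, m coding chunks are added,
    and every chunk is rounded up to the allocation unit."""
    return (k + m) * round_up(math.ceil(obj_size / k))

one_kb = 1024
print(replicated_footprint(one_kb))  # 12288 bytes = 12 KB
print(ec_footprint(one_kb))          # 28672 bytes = 28 KB
```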

  24. Lifecycle Threads Optimization
    (Diagram: lifecycle shards LC.1-LC.4, each covering three buckets (A-L); several LC workers run in parallel, and a worker that picks a shard already claimed by another worker hits its lock: "Locked!")
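
A rough sketch of the shard-locking pattern shown in the diagram (a generic illustration, not RGW's actual lifecycle code): each worker tries to claim a lifecycle shard with a non-blocking lock and skips shards that another worker already owns.

```python
import threading

# Lifecycle shards as shown on the slide: each shard covers a few buckets.
LC_SHARDS = {
    "LC.1": ["Bucket A", "Bucket B", "Bucket C"],
    "LC.2": ["Bucket D", "Bucket E", "Bucket F"],
    "LC.3": ["Bucket G", "Bucket H", "Bucket I"],
    "LC.4": ["Bucket J", "Bucket K", "Bucket L"],
}
shard_locks = {shard: threading.Lock() for shard in LC_SHARDS}

def lc_worker() -> None:
    """Process any shard we can lock; skip shards held by other workers ("Locked!")."""
    for shard, buckets in LC_SHARDS.items():
        if not shard_locks[shard].acquire(blocking=False):
            continue  # another worker owns this shard
        try:
            for bucket in buckets:
                pass  # apply the bucket's lifecycle rules here
        finally:
            shard_locks[shard].release()

workers = [threading.Thread(target=lc_worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```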

  25. Conclusions
    • To overcome the limitations of Ceph, we federated several clusters to act as a single cluster
      • Cluster federation
      • S3 compatibility
      • High storage efficiency
    • We can achieve sustainable scalability
