
Cybozu's storage system and their contributions to OSS

Cybozu
June 14, 2023

Transcript

  1. Toshikuni Fukaya @ Cybozu
     Satoru Takeuchi @ Cybozu
     Cybozu's storage system and their contributions to OSS


  2. Who am I?
     Toshikuni Fukaya
     • 10+ years of DevOps experience for cybozu.com
     • Contributions to a wide range of OSS projects
     • Apache httpd, nginx, OpenSSL, MySQL, etc.
     • Now working on a storage system on Kubernetes (K8s)


  3. Agenda
    • Our current storage system
    • The new storage system
    • Conclusion


  4. Agenda
    • Our current storage system
    • The new storage system
    • Conclusion


  5. Our business
     cybozu.com
     • Providing groupware as a service
     • 56,000 companies
     • 2,500,000 users


  6. cybozu.com internals
     Overview
     • We have our own infrastructure
     • Racks, bare-metal servers, switches
     • A Linux-based virtual machine environment
     • We built its control plane ourselves
     • The storage stack is also self-built


  7. cybozu.com internals
     Storage stack
     • We use OSS to build our storage stack
     • Linux MD (SW RAID) for data replication
     • Linux-IO (iSCSI) for remote connections between VMs and storage servers
     • LVM for volume management
     [Diagram: VMs connect to storage servers over iSCSI; data is replicated across storage servers by SW RAID 1; LVM splits a physical volume into small virtual volumes]


  8. Pain points
     No scalability
     • IOPS, bandwidth, and capacity are limited by a single storage server
     • We divide users into groups of users (shards)
     • We allocate storage servers for each shard
     • This introduces another pain: operation costs
     😣 Too many…


  9. Who am I?
     Satoru Takeuchi
     • A developer of Cybozu's storage system
     • Ex-Linux kernel developer
     • A Rook maintainer


  10. Agenda
    • Our current storage system
    • The new storage system
    • Conclusion


  11. Cybozu's new storage system
     Requirements
     • Reduce operation costs and keep them from growing at scale
     • Scalable block storage and object storage
     • Tolerance of rack failures and bit rot
     [Diagram: a distributed storage layer of OSDs (data) backed by HDD, SSD, and NVMe SSD devices across nodes, serving block volumes and object storage buckets to applications]


  12. Cybozu's new storage system
     Architecture
     • Runs on top of Kubernetes (K8s) to reduce operation costs
     • Provides scalable block storage and object storage, consumed by apps as sketched below
     • Ceph: an open source distributed storage system
     • Rook: an open source Ceph orchestrator running on top of K8s
     [Diagram: Ceph clusters inside a K8s cluster; Rook manages the OSDs (Ceph's data structure), which sit on HDD/SSD/NVMe SSD devices on each node and serve block volumes and buckets to applications]
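     How an application consumes this storage in practice: a minimal sketch of a block-volume request, assuming a StorageClass backed by the Ceph RBD CSI driver that Rook deploys (the names rook-ceph-block and app-data, and the size, are illustrative, not our actual configuration):

     kind: PersistentVolumeClaim
     apiVersion: v1
     metadata:
       name: app-data                       # hypothetical claim name
     spec:
       accessModes:
       - ReadWriteOnce
       storageClassName: rook-ceph-block    # assumed RBD-backed StorageClass
       resources:
         requests:
           storage: 100Gi

     Buckets are requested in a similarly declarative way through Rook's ObjectBucketClaim resource.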


  13. Why open source?
     Our policy on user data
     • To resolve critical problems (e.g., data corruption) as thoroughly and as quickly as possible
     • We want to read the source code to troubleshoot, and to use a locally changed version if necessary
     Read source code / Use locally changed version:
     • Buy proprietary products: – / –
     • Buy OSS support: ✅ / –
     • Manage upstream OSS by ourselves: ✅ / ✅  👈 Our selection


  14. What is Kubernetes?
     Declarative configuration
     • Declare the desired state of the K8s cluster in "resources" (YAML format)
     • If there are differences between the current state and the desired state, K8s changes the cluster to meet the desired state
     kind: Deployment
     metadata:
       name: nginx
     spec:
       replicas: 3
     (simplified; a complete manifest is sketched below)
     [Diagram: a K8s cluster with three nodes, each running one nginx replica]
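     For reference, a fuller version of the simplified manifest above: a minimal Deployment sketch that applies as written (the labels and the nginx image tag are illustrative):

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: nginx
     spec:
       replicas: 3                  # desired state: three nginx replicas
       selector:
         matchLabels:
           app: nginx
       template:
         metadata:
           labels:
             app: nginx
         spec:
           containers:
           - name: nginx
             image: nginx:1.25      # illustrative image tag

     If the cluster currently runs fewer (or more) than three replicas, K8s creates or removes pods until the observed state matches this declaration.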


  15. What is Ceph?
     An open source distributed storage system
     • Block volumes, a distributed filesystem, and S3-compatible object storage
     • Widely used for 20 years
     • Meets all our requirements
     [Diagram: a Ceph cluster with one OSD per disk across nodes, forming a storage pool that serves block volumes, a filesystem, and object storage to applications]


  16. What is Rook?
     A Ceph orchestrator running on top of K8s
     • An official project of the Cloud Native Computing Foundation (CNCF)
     • A Ceph cluster is managed by a "CephCluster" resource
     • Rook creates the number of OSDs written in the "count" field
     • OSDs are evenly distributed over nodes and racks (with additional configuration)
     kind: CephCluster
     metadata:
       name: test-cluster
     spec:
       storage:
         count: 3
     (simplified; a fuller sketch follows after the diagram)
     [Diagram: a Ceph cluster spanning three racks, each with two nodes and two disks per node; the three OSDs are spread one per rack and together form the storage pool]
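     A fuller sketch of such a resource, assuming PVC-based OSDs via Rook's storageClassDeviceSets (the namespace, Ceph image tag, storage class name, and sizes are illustrative, not our actual configuration):

     apiVersion: ceph.rook.io/v1
     kind: CephCluster
     metadata:
       name: test-cluster
       namespace: rook-ceph                 # illustrative namespace
     spec:
       cephVersion:
         image: quay.io/ceph/ceph:v17       # illustrative Ceph image
       dataDirHostPath: /var/lib/rook
       mon:
         count: 3
       storage:
         storageClassDeviceSets:
         - name: set1
           count: 3                         # Rook creates this many OSDs
           volumeClaimTemplates:
           - metadata:
               name: data
             spec:
               accessModes:
               - ReadWriteOnce
               storageClassName: local-storage   # illustrative
               volumeMode: Block
               resources:
                 requests:
                   storage: 1Ti

     The even spreading over nodes and racks mentioned above is the "additional configuration": it is typically expressed with placement rules (for example, topology spread constraints) on the device set.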


  17. What is Rook?
     An example of a daily operation
     • OSDs can be added just by increasing the "count" field (see the fragment below the diagram)
     kind: CephCluster
     metadata:
       name: test-cluster
     spec:
       storage:
         count: 6
     (simplified)
     [Diagram: the same three-rack cluster; after the count is raised from 3 to 6, Rook adds three more OSDs, still spread evenly across the racks]
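     In terms of the fuller sketch on the previous slide, this operation is a one-field edit that is then re-applied to the cluster:

     spec:
       storage:
         storageClassDeviceSets:
         - name: set1
           count: 6                 # was 3; Rook creates the three additional OSDs

     Applied with "kubectl apply -f cephcluster.yaml" (cephcluster.yaml is a hypothetical file name for the manifest above).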


  18. Caveats
     Rook is not all-powerful
     • Rook greatly reduces the Ceph cluster's operation costs
     • However, we must have deep knowledge not only of Rook but also of Ceph
     • If Ceph is the suspect, knowledge of Rook alone is insufficient
     [Diagram: Rook manages Ceph for day-2 operations, but troubleshooting reaches down into Ceph itself]


  19. Our development style
     Upstream-first development
     • If we find bugs or missing features, we send PRs upstream whenever possible
     • This reduces the total maintenance cost, and everyone gets the improvements
     • Sometimes we use local patches temporarily in case of emergency
     [Diagram: with upstream-first development, our change is merged into the official versions (v1.0 → v1.1 → v1.2); if we instead managed a local version, our change would need to be backported onto every official release]


  20. Daily work
     Check every update of Ceph and Rook
     • GitHub, Slack, mailing lists, blogs, and so on
     • Find important new features, performance improvements, and so on
     • Find critical bugs and their workarounds, and watch the fixes for those bugs
     • Build up knowledge of Ceph and Rook
     [Diagram: we watch the upstream Ceph and Rook communities, use what we learn to develop our Rook/Ceph cluster, and feed the results back upstream]


  21. Our contributions to OSS projects
     Feed back everything
     • Shared our experience and know-how in presentations
     • Reviewed and backported fixes for critical bugs
     • Implemented Rook features (e.g., even OSD spreading)
     • Worked as a Rook maintainer for several years


  22. Challenges
     Remaining work
     • More automation, such as automatic OSD replacement on node failure or retirement
     • Implement backup/restore and remote replication (one possible building block is sketched below the diagram)
     • Gain more knowledge of Ceph and Rook
     [Diagram: two data centers (A and B), each running a Rook-managed Ceph cluster on its own K8s cluster; data is replicated between the sites and backed up]
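     One possible building block for the remote-replication item: a sketch assuming Rook's RBD mirroring support (CephRBDMirror and the pool-level mirroring fields are existing Rook resources, but the names, namespace, and counts here are illustrative, not our actual configuration):

     apiVersion: ceph.rook.io/v1
     kind: CephRBDMirror
     metadata:
       name: rbd-mirror
       namespace: rook-ceph      # illustrative namespace
     spec:
       count: 1                  # number of rbd-mirror daemons
     ---
     apiVersion: ceph.rook.io/v1
     kind: CephBlockPool
     metadata:
       name: replicapool
       namespace: rook-ceph
     spec:
       failureDomain: rack       # tolerate rack failures
       replicated:
         size: 3
       mirroring:
         enabled: true
         mode: image             # mirror block volumes image-by-image to the peer site

     Peering the two sites additionally requires exchanging bootstrap peer secrets between the clusters; backup/restore remains a separate problem.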


  23. Agenda
    • Our current storage system
    • The new storage system
    • Conclusion


  24. Conclusion
     • Our new storage system overcomes the pain points of the current one
     • These improvements are achieved by using open source
     • We give all of our results back to the open source communities
     • We will continue developing the new storage system and contributing to open source
