
Cybozu's storage system and their contributions to OSS

Cybozu
June 14, 2023

Transcript

  1. Cybozu's storage system and their contributions to OSS
     Toshikuni Fukaya @Cybozu, Satoru Takeuchi @Cybozu
  2. Who am I? Toshikuni Fukaya
     • 10+ years of experience in DevOps for cybozu.com
     • Widespread contributions to OSS: Apache httpd, nginx, OpenSSL, MySQL, etc.
     • Now working on a storage system on Kubernetes (K8s)
  3. Agenda
     • Our current storage system
     • The new storage system
     • Conclusion
  4. Agenda
     • Our current storage system
     • The new storage system
     • Conclusion
  5. Our business: cybozu.com
     • Providing groupware as a service
     • 56,000 companies
     • 2,500,000 users
  6. cybozu.com internals: Overview
     • We have our own infrastructure: racks, bare-metal servers, switches
     • A Linux-based virtual machine environment
     • We built its control plane ourselves
     • The storage stack is also self-built
  7. cybozu.com internals: Storage stack
     • We use OSS to build our storage stack
     • Linux MD (software RAID) for data replication
     • Linux-IO (iSCSI) for remote connections between VMs and storage servers
     • LVM for volume management
     [Diagram: data is replicated between two storage servers by software RAID 1; each physical volume is split into small virtual volumes with LVM and exposed to VMs over iSCSI connections]
  8. Pain points: No scalability
     • IOPS, bandwidth, and capacity are limited by a single storage server
     • We divide users into groups of users (shards)
     • We allocate storage servers to each shard
     • This introduces another pain point: operation costs (too many shards 😣)
  9. Who am I? Satoru Takeuchi
     • A developer of Cybozu's storage system
     • Ex-Linux kernel developer
     • A Rook maintainer
  10. Agenda
      • Our current storage system
      • The new storage system
      • Conclusion
  11. Cybozu's new storage system: Requirements
      • Reduce operation costs and keep them from growing at scale
      • Scalable block storage and object storage
      • Tolerance of rack failures and bit rot
      [Diagram: distributed storage spanning nodes with HDDs, SSDs, and NVMe SSDs; OSDs hold the data; apps consume block volumes and buckets]
  12. Cybozu's new storage system: Architecture
      • Runs on top of Kubernetes (K8s) to reduce operation costs
      • Provides scalable block storage and object storage
      • Ceph: an open-source distributed storage system
      • Rook: an open-source Ceph orchestrator running on top of K8s
      [Diagram: Ceph clusters running inside a K8s cluster; Rook manages OSDs (Ceph's data structure) on HDDs, SSDs, and NVMe SSDs across nodes; apps consume block volumes and buckets]
  13. Why open source? Our policy on user data
      • Resolve critical problems (e.g., data corruption) as thoroughly and as quickly as possible
      • We want to read the source code to troubleshoot, and to use a locally patched version if necessary
      • Buy proprietary products: we can neither read the source code nor use a locally patched version
      • Buy OSS support: we can read the source code, but not use a locally patched version
      • Manage upstream OSS by ourselves: we can do both 👉 our selection
  14. What is Kubernetes? Declarative configuration
      • Declare the desired state of the K8s cluster in "resources" (YAML format)
      • If the current state differs from the desired state, K8s changes the cluster to meet the desired state (see the sketch below)
      [Diagram: a resource declaring three nginx replicas; the K8s cluster runs three nginx Pods across its nodes]
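      A minimal sketch of such a resource, assuming a plain nginx Deployment (the name, labels, and image tag are illustrative placeholders, not taken from the slides):

        # Declares the desired state; K8s converges the cluster to match it.
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx
        spec:
          replicas: 3                # desired state: three nginx Pods
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.25  # placeholder image tag

      Applying this manifest (e.g., with kubectl apply) and later editing the replicas value is the whole operational interface; K8s reconciles any difference it finds.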
  15. What is Ceph? An open-source distributed storage system
      • Block volumes, a distributed filesystem, and S3-compatible object storage
      • Widely used, with about 20 years of history
      • Meets all our requirements (see the sketch below for how an app consumes a block volume)
      [Diagram: a Ceph cluster pooling disks from multiple nodes into a storage pool via OSDs; apps consume block volumes, the filesystem, and object storage]
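      For illustration, a minimal sketch of how an application on K8s would request a Ceph-backed block volume, assuming a StorageClass backed by the Ceph RBD CSI driver already exists (the StorageClass name rook-ceph-block and the requested size are assumptions):

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: app-data
        spec:
          accessModes:
            - ReadWriteOnce                    # block volume mounted by one node at a time
          storageClassName: rook-ceph-block    # assumed name of the Ceph RBD StorageClass
          resources:
            requests:
              storage: 10Gi                    # assumed size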
  16. What is Rook? A Ceph orchestrator running on top of K8s
      • An official project of the Cloud Native Computing Foundation (CNCF)
      • A Ceph cluster is managed through the "CephCluster" resource
      • Rook creates the number of OSDs written in the "count" field (see the sketch below)
      • OSDs are spread evenly over nodes and racks (with additional configuration)
      [Diagram: a CephCluster resource with count: 3; three OSDs spread over disks in three racks]
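      An abbreviated sketch of such a CephCluster resource, assuming a PVC-based cluster with a single storageClassDeviceSet; the namespace, Ceph image tag, StorageClass name, and device size are assumptions, and only the fields relevant to the slide are shown:

        apiVersion: ceph.rook.io/v1
        kind: CephCluster
        metadata:
          name: test-cluster
          namespace: rook-ceph                 # assumed namespace
        spec:
          cephVersion:
            image: quay.io/ceph/ceph:v17       # assumed Ceph image
          dataDirHostPath: /var/lib/rook
          mon:
            count: 3
          storage:
            storageClassDeviceSets:
              - name: set1
                count: 3                       # Rook creates this many OSDs
                volumeClaimTemplates:
                  - metadata:
                      name: data
                    spec:
                      accessModes: ["ReadWriteOnce"]
                      volumeMode: Block
                      storageClassName: local-block   # assumed StorageClass for OSD devices
                      resources:
                        requests:
                          storage: 1Ti         # assumed device size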
  17. What is Rook? An example of a daily operation
      • OSDs can be added just by increasing the "count" field (see the sketch below)
      [Diagram: the same CephCluster resource with count: 6; six OSDs now spread over the disks in the three racks]
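      A sketch of the changed fragment, assuming the storageClassDeviceSets layout shown above; once the edit is applied, Rook notices the difference between the declared and actual number of OSDs and provisions the missing ones, spreading them over nodes and racks as configured:

          storage:
            storageClassDeviceSets:
              - name: set1
                count: 6                       # was 3; Rook creates three additional OSDs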
  18. Caveats: Rook is not all-powerful
      • Rook greatly reduces the Ceph cluster's operation costs
      • However, we must have deep knowledge of not only Rook but also Ceph
      • If Ceph is the suspect, knowledge of Rook alone is insufficient for troubleshooting
      [Diagram: Rook manages Ceph for day-2 operations; troubleshooting requires knowledge of both Rook and Ceph]
  19. Our development style: Upstream-first development
      • If we find bugs or missing features, we send PRs upstream whenever possible
      • This reduces the total maintenance cost, and everyone gets the improvements
      • We sometimes use local patches temporarily in case of emergency
      [Diagram: with upstream-first development our change is merged into the official v1.1 and carried forward to v1.2 automatically; with a locally managed version the change must be backported onto each official release (v1.0, v1.1, v1.2)]
  20. Daily work: Check every update of Ceph and Rook
      • GitHub, Slack, mailing lists, blogs, and so on
      • Find important new features, performance improvements, and so on
      • Find critical bugs and their workarounds, and watch the fixes for those bugs
      • Gain knowledge of Ceph and Rook
      [Diagram: we watch the upstream Ceph and Rook communities, develop our own Rook/Ceph clusters, and feed our findings back upstream]
  21. Our contributions to OSS projects: Feed back everything
      • Shared our experience and know-how in presentations
      • Reviewed and backported fixes for critical bugs
      • Implemented Rook features (e.g., even OSD spreading)
      • Worked as a Rook maintainer for several years
      • …
  22. Challenges: Remaining work
      • More automation, such as automatic OSD replacement on node failure or retirement
      • Implement backup/restore and remote replication
      • Gain more knowledge of Ceph and Rook
      [Diagram: Rook-managed Ceph clusters in data center A and data center B, with backup data and replication between the two K8s clusters]
  23. Agenda
      • Our current storage system
      • The new storage system
      • Conclusion
  24. Conclusion
      • Our new storage system overcomes the current system's pain points
      • These improvements are achieved by using open source
      • We give all results back to the open source communities
      • We will continue to develop the new storage system and to contribute to open source