Cybozu's storage system and their contributions to OSS

Cybozu

June 14, 2023

Transcript

  1. Toshikuni Fukaya @ Cybozu, Satoru Takeuchi @ Cybozu: Cybozu's storage system and their contributions to OSS
  2. Who am I? Toshikuni Fukaya • 10+ years of experience in DevOps for cybozu.com • Widespread contributions to OSS • Apache httpd, nginx, OpenSSL, MySQL, etc. • Now works on a storage system on Kubernetes (K8s)
  3. Agenda • Our current storage system • The new storage system • Conclusion
  4. Agenda • Our current storage system • The new storage system • Conclusion
  5. Our business: cybozu.com • Providing groupware software as a service • 56,000 companies • 2,500,000 users
  6. cybozu.com internals: Overview • We have our own infrastructure • Racks, bare-metal servers, switches • A Linux-based virtual machine environment • We built its control plane ourselves • The storage stack is also self-built
  7. cybozu.com internals: Storage stack • We use OSS to build our storage stack • Linux MD (SW RAID) for data replication • Linux-IO (iSCSI) for the remote connection between VMs and storage servers • LVM for volume management
    [Diagram: two storage servers replicate data with SW RAID 1; LVM splits a physical volume into small virtual volumes; VMs reach the volumes over iSCSI connections]
  8. Pain points: No scalability • IOPS, bandwidth, and capacity are limited by a single storage server • We divide users into groups of users (shards) • We allocate storage servers for each shard • This introduces another pain: operation costs 😣 Too many…
  9. Who am I? Satoru Takeuchi • A developer of Cybozu's storage system • Ex-Linux kernel developer • A Rook maintainer
  10. Agenda • Our current storage system • The new storage system • Conclusion
  11. Cybozu's new storage system: Requirements • Reduce operation costs and keep them from growing at scale • Scalable block storage and object storage • Tolerance of rack failures and bit rot
    [Diagram: a distributed storage layer of OSDs (data) over HDD, SSD, and NVMe SSD devices, serving a block volume and an object bucket to apps on multiple nodes]
  12. Cybozu's new storage system: Architecture • Runs on top of Kubernetes (K8s) to reduce operation costs • Provides scalable block storage and object storage • Ceph: an open source distributed storage system • Rook: an open source Ceph orchestrator running on top of K8s
    [Diagram: a K8s cluster in which Rook manages Ceph clusters; OSDs (Ceph's data-holding component) sit on HDD, SSD, and NVMe SSD devices across nodes and serve a block volume and an object bucket to apps]
  13. Why open source? Our policy on user data • To resolve critical problems (e.g., data corruption) as completely and as quickly as possible • We want to read the source code to troubleshoot, and to use a locally changed version if necessary
    [Table: buy proprietary products: read source code ✗, use locally changed version ✗; buy OSS support: read source code ✅, use locally changed version ✗; manage upstream OSS by ourselves: read source code ✅, use locally changed version ✅ 👉 our selection]
  14. What is Kubernetes? Declarative configuration • Declare the desired state of the K8s cluster in "resources" (YAML format) • If there is a difference between the current state and the desired state, K8s changes the cluster to meet the desired state
    [Slide example: a resource declaring "kind: Pod, metadata: name: nginx, …, spec: replicas: count: 3, …" next to a K8s cluster of three nodes, each running nginx]
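    A minimal, self-contained sketch of this declare-and-reconcile flow (not taken from the slides; in practice the replica count is declared on a Deployment rather than a bare Pod, and the nginx image tag here is only illustrative):

        apiVersion: apps/v1
        kind: Deployment            # declares a desired number of Pod replicas
        metadata:
          name: nginx
        spec:
          replicas: 3               # desired state: three nginx Pods
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.25   # illustrative image tag
                  ports:
                    - containerPort: 80

    Applying this manifest (e.g., with kubectl apply) records the desired state; the controllers then create or delete Pods until three nginx replicas are running, and keep doing so if a Pod or node later disappears.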
  15. What is Ceph? An open source distributed storage system • Block volumes, a distributed filesystem, and S3-compatible object storage • Widely used for 20 years • Meets all our requirements
    [Diagram: a Ceph cluster runs one OSD per disk across nodes; the resulting storage pool serves a block volume, a filesystem, and object storage to apps]
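    A hedged sketch of how an application on the K8s cluster can consume a Ceph block volume once Rook's CSI driver is in place: the app requests storage with an ordinary PersistentVolumeClaim (the StorageClass name below comes from Rook's sample manifests and may differ in a real cluster; the claim name and size are illustrative):

        apiVersion: v1
        kind: PersistentVolumeClaim   # the app asks for a Ceph block volume
        metadata:
          name: app-data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi           # illustrative size
          storageClassName: rook-ceph-block   # StorageClass name from Rook's examples; may differ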
  16. What is Rook? A Ceph orchestrator running on top of K8s • An official project of the Cloud Native Computing Foundation (CNCF) • A Ceph cluster is managed by a "CephCluster" resource • Rook creates the number of OSDs specified in the "count" field • OSDs are evenly distributed over nodes and racks (through additional configuration)
    [Slide example: a resource with "kind: CephCluster, metadata: name: test-cluster, …, storage: …, count: 3" next to a Ceph cluster whose three OSDs are spread over disks in nodes across three racks]
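    A minimal sketch of such a CephCluster manifest, assuming Rook's PVC-based OSD provisioning (the namespace, image tag, device-set name, sizes, and StorageClass are illustrative; the "count" field is the one the slide refers to):

        apiVersion: ceph.rook.io/v1
        kind: CephCluster
        metadata:
          name: test-cluster
          namespace: rook-ceph              # namespace used in Rook's examples
        spec:
          cephVersion:
            image: quay.io/ceph/ceph:v17    # illustrative Ceph container image
          dataDirHostPath: /var/lib/rook
          mon:
            count: 3                        # three Ceph monitors for quorum
          storage:
            storageClassDeviceSets:
              - name: set0                  # illustrative device-set name
                count: 3                    # Rook creates this many OSDs
                volumeClaimTemplates:
                  - metadata:
                      name: data
                    spec:
                      resources:
                        requests:
                          storage: 1Ti      # illustrative OSD size
                      storageClassName: local-storage   # illustrative local-PV StorageClass
                      volumeMode: Block

    The even spreading of OSDs over nodes and racks mentioned on the slide comes from additional placement and failure-domain settings that are not shown in this sketch.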
  17. What is Rook? An example of a daily operation • OSDs can be added just by increasing the "count" field
    [Slide example: the same CephCluster resource with "count: 6"; the Ceph cluster now has six OSDs spread over disks in nodes across three racks]
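    In terms of the sketch above, the operation is a one-line edit to the same manifest followed by re-applying it (file name assumed):

        # cephcluster.yaml: fragment of the CephCluster above, only the OSD count changes
        storage:
          storageClassDeviceSets:
            - name: set0
              count: 6        # was 3; Rook provisions the three additional OSDs
        # kubectl apply -f cephcluster.yaml re-submits the desired state and Rook reconciles it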
  18. Caveats: Rook is not all-powerful • Rook greatly reduces the Ceph cluster's operation costs • However, we must have deep knowledge of not only Rook but also Ceph • If Ceph is the suspect, knowledge of Rook alone is insufficient
    [Diagram: Rook manages Ceph; day-2 operations and troubleshooting involve both layers]
  19. Our development style: Upstream-first development • If we find bugs or missing features, we send PRs upstream whenever possible • This reduces the total maintenance cost, and everyone gets the improvements • Sometimes we use local patches temporarily in emergencies
    [Diagram: with upstream-first development, our change is merged into the official v1.1 and carried into v1.2; if we managed a local version instead, our change would need to be backported onto each new official release (v1.0, v1.1, v1.2)]
  20. Daily work: Check every update of Ceph and Rook • GitHub, Slack, mailing lists, blogs, and so on • Find important new features, performance improvements, and so on • Find critical bugs and their workarounds, and watch the fixes of those bugs • Gain knowledge of Ceph and Rook
    [Diagram: we watch the upstream Ceph and Rook communities, develop our Rook/Ceph cluster, and feed our findings back upstream]
  21. Our contributions to OSS projects: Feed back everything • Shared our experience and know-how in several presentations • Reviewed and backported fixes for critical bugs • Implemented Rook features (e.g., even OSD spreading) • Worked as a Rook maintainer for several years …
  22. Challenges: Remaining work • More automation, such as automatic OSD replacement on node failure or retirement • Implement backup/restore and remote replication • Gain more knowledge of Ceph and Rook
    [Diagram: K8s clusters in data centers A and B, each with a Rook-managed Ceph cluster of OSDs on HDDs; data is replicated between them and backups are taken]
  23. Agenda • Our current storage system • The new storage system • Conclusion
  24. Conclusion • Our new storage system overcomes the current one's pain points • These improvements are accomplished by using open source • We give all results back to the open source communities • We will continue to develop the new storage system and to contribute to open source