Slide 1

Cybozu's storage system and their contributions to OSS
Toshikuni Fukaya @Cybozu
Satoru Takeuchi @Cybozu

Slide 2

Who am I? Toshikuni Fukaya
• 10+ years of DevOps experience for cybozu.com
• Widespread contributions to OSS
  • Apache httpd, nginx, OpenSSL, MySQL, etc.
• Now working on a storage system on Kubernetes (K8s)

Slide 3

Agenda
• Our current storage system
• The new storage system
• Conclusion

Slide 4

Agenda
• Our current storage system
• The new storage system
• Conclusion

Slide 5

Our business: cybozu.com
• Providing groupware software as a service
• 56,000 companies
• 2,500,000 users

Slide 6

cybozu.com internals: Overview
• We have our own infrastructure
  • Racks, bare-metal servers, switches
• Linux-based virtual machine environment
  • We built its control plane
• The storage stack is also self-built

Slide 7

cybozu.com internals: Storage stack
• We use OSS to build our storage stack
  • Linux MD (software RAID) for data replication
  • Linux-IO (iSCSI) for the remote connection between VMs and storage servers
  • LVM for volume management
(Diagram: two storage servers replicate data via software RAID 1; LVM splits a physical volume into small virtual volumes; VMs reach the storage servers over iSCSI connections)

Slide 8

Pain points: No scalability
• IOPS, bandwidth, and capacity are limited by a single storage server
• We divide users into groups (shards)
  • We allocate storage servers for each shard
  • This introduces another pain: operation costs 😣 Too many…

Slide 9

Who am I? Satoru Takeuchi
• A developer of Cybozu's storage system
• Ex-Linux kernel developer
• Rook maintainer

Slide 10

Agenda
• Our current storage system
• The new storage system
• Conclusion

Slide 11

Cybozu's new storage system: Requirements
• Reduce operation costs and keep them from increasing at scale
• Scalable block storage and object storage
• Toleration of rack failures and bit rot
(Diagram: a distributed storage system spanning nodes with HDDs, SSDs, and NVMe SSDs; applications use block volumes and buckets backed by OSDs holding the data)

Slide 12

Cybozu's new storage system: Architecture
• Running on top of Kubernetes (K8s) to reduce operation costs
• Provides scalable block storage and object storage (a sketch of the relevant resources follows below)
• Ceph: an open source distributed storage system
• Rook: an open source Ceph orchestrator running on top of K8s
(Diagram: a K8s cluster hosting Ceph clusters; OSDs, Ceph's data structure holding the data, sit on HDDs, SSDs, and NVMe SSDs across nodes; Rook manages the OSDs; applications use block volumes and buckets)
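As a concrete illustration of how Rook exposes block and object storage, here is a minimal sketch of the two custom resources involved. It assumes Rook's CephBlockPool and CephObjectStore APIs, with illustrative names, replica sizes, and the rack failure domain from the requirements slide; it is not Cybozu's actual configuration.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: block-pool            # illustrative name
  namespace: rook-ceph
spec:
  failureDomain: rack         # place replicas in different racks
  replicated:
    size: 3
---
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: object-store          # illustrative name
  namespace: rook-ceph
spec:
  metadataPool:
    failureDomain: rack
    replicated:
      size: 3
  dataPool:
    failureDomain: rack
    replicated:
      size: 3
  gateway:
    port: 80
    instances: 2              # S3-compatible gateway endpoints

Applications would then consume block volumes through a StorageClass backed by the pool, and buckets through Rook's ObjectBucketClaim resources; those details are omitted here.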

Slide 13

Why open source? Our policy on user data
• To resolve critical problems (e.g., data corruption) as thoroughly and as quickly as possible
• We want to read the source code to troubleshoot, and to use a locally changed version if necessary
How the options compare (read source code / use a locally changed version):
• Buy proprietary products: neither
• Buy OSS support: read source code ✅
• Manage upstream OSS by ourselves: read source code ✅, use locally changed version ✅ 👈 our selection

Slide 14

What is Kubernetes: Declarative configuration
• Declare the desired state of the K8s cluster in "resources" (YAML format)
• If the current state differs from the desired state, K8s changes the cluster to meet the desired state (a sketch follows below)
(Diagram: a simplified resource named nginx with a replica count of 3, realized as three nginx pods across the nodes of a K8s cluster)
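The resource shown on the slide is simplified. For a concrete feel of declarative configuration, here is a minimal sketch of a standard Deployment declaring three nginx replicas; the names and image tag are illustrative and not taken from the talk.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3                  # desired state: three nginx pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25    # illustrative image tag
          ports:
            - containerPort: 80

If a node fails and a pod disappears, the current state (two pods) no longer matches the desired state (three), so K8s schedules a replacement automatically.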

Slide 15

What is Ceph: An open source distributed storage system
• Block volumes, a distributed filesystem, and S3-compatible object storage
• Widely used for 20 years
• Meets all our requirements
(Diagram: OSDs on disks across nodes form a Ceph cluster's storage pool, which serves block volumes, a filesystem, and object storage to applications)

Slide 16

What is Rook: A Ceph orchestrator running on top of K8s
• An official project of the Cloud Native Computing Foundation (CNCF)
• A Ceph cluster is managed by a "CephCluster" resource
• Rook creates the number of OSDs written in the "count" field (a fuller sketch follows below)
• OSDs are evenly distributed over nodes and racks (through additional configuration)
(Diagram: a CephCluster resource named test-cluster with storage count: 3 results in three OSDs on disks spread across nodes and racks, forming the cluster's storage pool)
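The slide shows only a fragment of the CephCluster resource. Below is a minimal, hedged sketch of a PVC-based cluster using Rook's storageClassDeviceSets; the Ceph image tag, storage class name, and sizes are illustrative assumptions, not Cybozu's manifest.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: test-cluster
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: quay.io/ceph/ceph:v18           # illustrative Ceph release
  mon:
    count: 3
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3                            # Rook creates this many OSDs
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              storageClassName: local-storage   # illustrative storage class
              volumeMode: Block
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 1Ti

Even spreading over nodes and racks is the "additional configuration" mentioned above; in Rook it can be expressed through the device set's placement settings (for example, topology spread constraints), which are omitted from this sketch.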

Slide 17

What is Rook: An example of daily operation
• OSDs can be added just by increasing the "count" field (a sketch follows below)
(Diagram: the same CephCluster resource with count raised to 6 results in six OSDs spread across disks, nodes, and racks)
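Relative to the CephCluster sketch above, scaling out is a one-field edit; a hypothetical fragment:

  storage:
    storageClassDeviceSets:
      - name: set1
        count: 6    # was 3; Rook creates three more OSDs to match the desired state

Once the new OSDs join the cluster, Ceph rebalances data onto them automatically.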

Slide 18

Caveats: Rook is not all-powerful
• Rook greatly reduces the Ceph cluster's operation cost
• However, we must have deep knowledge of not only Rook but also Ceph
  • If Ceph is a suspect, knowledge of Rook alone is insufficient for troubleshooting
(Diagram: Rook manages Ceph and handles day-2 operations, but troubleshooting reaches into Ceph itself)

Slide 19

Our development style: Upstream-first development
• If we find bugs or missing features, we send PRs upstream whenever possible
• This reduces the total maintenance cost, and everyone gets the improvement
• Sometimes we use local patches temporarily in case of emergency
(Diagram: with a locally managed version, our change must be backported to every official release, v1.0 through v1.2; with upstream-first development, the change lands upstream and is included in the official versions)

Slide 20

Daily work: Check every update of Ceph and Rook
• GitHub, Slack, mailing lists, blogs, and so on
• Find important new features, performance improvements, and so on
• Find critical bugs and their workarounds, and watch the fixes of these bugs
• Gain knowledge of Ceph and Rook
(Diagram: we watch the upstream Ceph and Rook communities, develop and run our own Rook/Ceph clusters, and feed our findings back upstream)

Slide 21

Our contributions to OSS projects: Feed back everything
• Shared our experience and know-how in presentations
• Reviewed and backported fixes for critical bugs
• Implemented Rook features (e.g., even OSD spreading)
• Worked as a Rook maintainer for several years
…

Slide 22

Challenges: Remaining work
• More automation, such as automatic OSD replacement on node failure or retirement
• Implement backup/restore and remote replication (a sketch follows below)
• Gain more knowledge of Ceph and Rook
(Diagram: two data centers, A and B, each with a K8s cluster running Rook-managed Ceph clusters; data is replicated from A to B and backed up)
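For the remote replication item, one candidate building block is Ceph's RBD mirroring as exposed by Rook. The following is a minimal sketch assuming Rook's CephBlockPool mirroring settings and CephRBDMirror resource; it is not a statement of Cybozu's actual plan.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: block-pool
  namespace: rook-ceph
spec:
  replicated:
    size: 3
  mirroring:
    enabled: true
    mode: image                # mirror individual RBD images to a peer cluster
---
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: rbd-mirror
  namespace: rook-ceph
spec:
  count: 1                     # number of rbd-mirror daemons

Peer credentials still have to be exchanged between the two sites, which is omitted from this sketch.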

Slide 23

Agenda
• Our current storage system
• The new storage system
• Conclusion

Slide 24

Conclusion
• Our new storage system overcomes the current one's pain points
• These improvements are achieved by using open source
• We give all results back to the open source communities
• We will continue to develop the new storage system and to contribute to open source

Slide 25

THANK YOU!