Slide 1

Overview: cunoFS enables native, unmodified, and accelerated access to lower-cost, infinitely scalable object storage buckets, reducing costs.

Slide 2

cunoFS: High-Level View
High-throughput, scalable, cost-effective file storage
- Typically 10x faster, at down to 1/10th the cost
- Scales to supercomputing levels of performance
How?
- Client-side rather than server-side storage technology
- Leverages our 40+ man-years of IP creating high-performance client-side virtual filesystems
- Allows us to disentangle and predict application access patterns
- Exploits object storage for scalability and cost
- Solved POSIX <-> object compatibility problems previously thought to be impossible
- Innovative encoding and compression of POSIX metadata

Slide 3

cunoFS workloads
- Machine Learning: high-throughput training/inferencing
- Media & Entertainment: collaborative media workflows
- Finance: financial models and kdb+ loading
- Life Sciences: omics pipelines, images, analysis
- Scientific Computing / HPC: large-scale workloads (1000s of compute nodes)

Slide 4

PetaGene quick intro
- PetaSuite / PetaLink (2016): 100% lossless compression of genomics data files, resulting in files 60-91% smaller; transparent, random-access, just-in-time decompression; a user-mode library streams compressed data to analysis tools as native files
- PetaSuite Cloud & cunoFS (2018): compress and move data to local or cloud object storage in one step; accelerated, transparent access to files in object storage as local file systems
- PetaSuite Protect (2019): FIPS 140-2 encryption; fine-grained access control and redaction; transaction control; full audit

Slide 6

Customers & partners: Hong Kong Genome Project

Slide 7

cunoFS: A brief history
- PetaSuite Cloud Edition was built for large genomics workloads, BUT customers were using it for all sorts of non-genomics data
- Although that was not the intended use case, PetaSuite Cloud Edition was being compared to general file-on-object competitors:
  - We won by 10x on fast large-file access (what we built it for)
  - Lost on small files
  - Lost on POSIX compatibility
- What if we intentionally designed for these use cases?
- cunoFS solved the above problems and further boosted performance
- Re-branded as cunoFS to expand outside Life Sciences and Genomics

Slide 8

What we do
- We do not sell hardware
- We do not sell software for building your own object storage
- We sell software that enables object storage (cloud / on-prem) to be used as fast, POSIX file storage

Slide 9

Quick refresher: Object vs File Storage

File Storage                                  | Object Storage
----------------------------------------------|-----------------------------------------
Directory structure                           | Flat, no directory structure
Scalability is hard                           | Highly scalable
POSIX compatible: UID/GID, symlinks, hardlinks, ACLs, ... | Not POSIX compatible
Direct access through OS (syscalls)           | REST API (HTTPS) inside application
Strong consistency guarantees                 | Varies, but usually weaker consistency
Usually single site                           | Often geo-distributed
RAID, duplication, or other redundancy        | Erasure coded (highly efficient)
Typically low(er) latency                     | Typically high(er) latency
Fast random-access writes                     | Whole object replaced at a time

Slide 10

Quick refresher: Object vs File Storage
- Object storage can be an order of magnitude less expensive than file storage
- Cost example on AWS (sanity-checked in the sketch below):
  - 1TB on FSx Lustre = $1,680-$7,200 pa (scratch to high-performance file storage)
  - 1TB on EFS (standard storage) = $3,600 pa
  - 1TB on S3 = $276 pa
  - 1TB on EFS (20% standard, 80% infrequent) = $960 pa (AWS EFS pricing example 2)
  - 1TB on S3 (20% standard, 80% infrequent) = $175 pa
  - 1TB EFS Machine Learning workload = $6,792 pa (AWS EFS pricing example 6)
  - 1TB on S3 with this workload = $340 pa
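
The headline S3 and EFS figures follow directly from per-GB-month list prices. The rates below are assumptions based on published AWS pricing at the time (S3 Standard $0.023, S3 Infrequent Access $0.0125, EFS Standard $0.30 per GB-month, with 1 TB taken as 1,000 GB):

```c
/* Reproduce the per-annum storage figures above from per-GB-month
 * list prices (assumed values; check current AWS pricing). */
#include <stdio.h>

int main(void) {
    const double GB     = 1000.0;   /* 1 TB treated as 1,000 GB        */
    const double s3_std = 0.023;    /* $/GB-month, S3 Standard         */
    const double s3_ia  = 0.0125;   /* $/GB-month, S3 Infrequent Access */
    const double efs    = 0.30;     /* $/GB-month, EFS Standard        */

    printf("EFS standard: $%.0f pa\n", GB * efs * 12);       /* 3600 */
    printf("S3 standard:  $%.0f pa\n", GB * s3_std * 12);    /*  276 */
    printf("S3 20/80 mix: $%.0f pa\n",
           (0.2 * GB * s3_std + 0.8 * GB * s3_ia) * 12);     /*  175 */
    return 0;
}
```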

Slide 11

Object is often a second-class storage tier

Slide 12

Common deployment: object as a second-class storage tier
[Diagram: POSIX workloads go through a POSIX interface on a high-performance file system (flash and/or disk tiers), with object storage relegated to an archival tier behind an object interface]

Slide 13

Common deployment: object as a second-class storage tier
[Same diagram, annotated: the file-system-to-object path achieves only 2-3 Gbps on AWS, e.g. FSx Lustre + S3 and a competitor + S3]

Slide 14

cunoFS: object as a first-class POSIX storage tier
[Diagram: POSIX workloads go through cunoFS Professional's POSIX interface directly onto object storage]

Slide 15

Bridging the POSIX <-> Object gap: background on other attempts

Slide 16

Until now: pick two of three
[Trade-off diagram: s3fs, s3ql (and most commercial offerings), and goofys each sacrifice a different property, e.g. being "Fast". Image based on the OLF 2020 talk: Exploring trade-offs in S3 file systems]

Slide 17

Until now: pick two of three
[Same trade-off diagram, with cunoFS placed at the intersection of all three properties. Image based on the OLF 2020 talk: Exploring trade-offs in S3 file systems]

Slide 18

Usual competing approach: Object+DB gateway
[Diagram: POSIX workloads reach object storage via NFS file gateways and a metadata server backed by a separate FS-metadata DB (bottlenecks highlighted in red). The result is POSIX compliant, but file data on the object store is scrambled, and native object workloads must go through an object gateway]

Slide 20

Introducing cunoFS
[Diagram: POSIX workloads each run through a cuno client directly against object storage; files are stored as objects and remain POSIX compliant, while native object workloads access the same buckets directly]

Slide 21

Software for Simple, Transparent Object FS

Slide 22

cunoFS use cases
- On cloud:
  - Replacing cloud file storage (faster + cheaper), e.g. AWS EFS, FSx Lustre, and similar offerings on Azure and Google Cloud
  - Many customers already use a lot of object storage, BUT their apps are unable to use it directly
  - Eliminate staging to/from block storage (e.g. AWS EBS)
  - Convenient, fast access to object storage repositories
- On prem:
  - Replacing expensive file storage with on-prem object storage
  - Eliminates staging
  - Fast file-system backups and disaster recovery
- Hybrid:
  - Bursting workloads with datasets from an on-prem object store TO cloud compute
  - Access in-cloud object stores from on-prem

Slide 23

Big Files: bandwidth (Gbps) copying 5x 32 GiB files on a c5n.18xlarge instance in AWS Ohio

                Read    Write
cuno + S3       56.9    52.3
AWS CLI + S3     2.5     2.0
cp + EBS         2.8     2.6
cp + EFS         2.1     0.9
cp + FSx         2.1     1.2
Goofys + S3      6.6     3.8

Slide 24

Small Files

Slide 25

Extreme scalability: cuno aggregate throughput scalability, using ior_easy_read on c5n.18xlarge nodes in AWS Virginia, achieving over 10 Tbps aggregate throughput

Number of nodes used:   1    2    4    8     16     32     64    128     256
Aggregate Gbps:        85  168  295  596  1,047  2,048  3,521  6,155  10,936

Slide 26

ML training + inferencing
Example: single-node PyTorch with an 86 GB Zarr image dataset running on Google Cloud Platform
- PyTorch + s3fs: ~260 Mbps
- PyTorch + SSD: ~350 Mbps (90 GB ext4 on GCP SSD)
- PyTorch + cunoFS: ~20 Gbps (60-75x speedup)
(Note that PyTorch randomly shuffles this dataset, as per standard practice)

Slide 27

Machine Learning
- ML workloads appear to need high IOPS and to be very large in size
- GPUs are throughput-hungry and need to be fed randomly shuffled datasets over and over again
- In the past, IT budget holders saw two classes of storage: small but fast (expensive per TB), and large but slower (inexpensive per TB)
- Now they are alarmed that ML appears to need BOTH large and fast storage (many PB of expensive high-IOPS flash storage such as NVMe)
- However, what is actually needed is high-throughput storage

Slide 28

Dell whitepaper on cunoFS
- Dell independently analysed cunoFS on their own object storage
- Validated the performance metrics seen on AWS
- https://infohub.delltechnologies.com/t/cuno-s3-performance-and-validation-testing

Slide 29

Throughput isn't IOPS
- Application workloads that need small, random pieces of data tend to be IOPS-limited
  - Truly random I/O is not suitable for object storage
  - This includes most database workloads
- However, we argue that most applications are throughput-limited rather than IOPS-limited: most workloads are not really random I/O, but quite deterministic (see the arithmetic sketch below)
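
A back-of-envelope illustration of the distinction, with an assumed IOPS budget and assumed request sizes (purely illustrative numbers, not cunoFS measurements): the same IOPS yield wildly different bandwidth depending on request size.

```c
/* Throughput vs IOPS: bandwidth = IOPS x request size.
 * The IOPS budget and request sizes below are assumptions. */
#include <stdio.h>

int main(void) {
    const double iops  = 10000.0;             /* I/O operations per second */
    const double small = 4.0 * 1024;          /* 4 KiB random-read requests */
    const double large = 8.0 * 1024 * 1024;   /* 8 MiB sequential requests  */

    /* Gbps = bytes/s * 8 bits / 1e9 */
    printf("4 KiB requests: %.2f Gbps\n", iops * small * 8 / 1e9);  /* ~0.33 */
    printf("8 MiB requests: %.0f Gbps\n", iops * large * 8 / 1e9);  /* ~671  */
    return 0;
}
```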

Slide 30

Case Studies: ongoing PoCs
- Hedge fund: globally distributed; wants to speed up analysis while reducing costs; using cunoFS to replace AWS FSx Lustre and EFS
- Pharma companies: scientific compute (e.g. omics analyses) and ML workloads on lots of images; one pharma would eventually like to place all their users' home directories on cunoFS with object storage
- Media organisation: testing cunoFS with media workloads
- Large supercomputing centre: HPC workloads
- Space agencies: one PoC commenced, another upcoming

Slide 31

Faster cloud-native apps
- Samtools has built-in S3 support, but not Azure or Google Cloud
- cunoFS intercepts applications, can override built-in S3 support, and adds Azure & GCP support
- Download is 4.8x faster
- Upload is 30x faster
cuno enables organisations to quickly and easily broaden the usage of object storage without the need for code changes or updates to existing mission-critical applications.
[Chart: real-world samtools benchmarks; time in seconds for download and upload, cuno + samtools vs samtools alone]

Slide 32

Comparison Matrix

Slide 33

Demo

Slide 34

Questions

Slide 35

FUSE overheads
[Diagram: an application call such as "read 1KB" passes through libraries into the Linux kernel as a syscall, is translated and bounced back up to a user-mode FS and its libraries, then back down through the kernel to the hardware (incl. networking)]
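
For reference, a minimal read-only FUSE filesystem makes the round trip concrete: every callback below runs in user space, so each application syscall that reaches it pays kernel->user->kernel transitions plus extra data copies. This is a generic libfuse 3 sketch, not cunoFS code:

```c
/* Minimal read-only FUSE filesystem exposing one file.
 * Build: cc fusedemo.c `pkg-config fuse3 --cflags --libs` -o fusedemo */
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <string.h>
#include <errno.h>

static const char *msg = "hello from user space\n";

/* stat(2) on the mount triggers this user-space callback */
static int fs_getattr(const char *path, struct stat *st,
                      struct fuse_file_info *fi)
{
    (void) fi;
    memset(st, 0, sizeof *st);
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = (off_t) strlen(msg);
    } else {
        return -ENOENT;
    }
    return 0;
}

/* each read(2) bounces through the kernel into this function and back */
static int fs_read(const char *path, char *buf, size_t size, off_t off,
                   struct fuse_file_info *fi)
{
    (void) fi;
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    size_t len = strlen(msg);
    if ((size_t) off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, msg + off, size);   /* copied again kernel-side to the app */
    return (int) size;
}

static const struct fuse_operations ops = {
    .getattr = fs_getattr,
    .read    = fs_read,
};

int main(int argc, char *argv[])
{
    /* e.g. ./fusedemo -f /mnt/demo ; then `cat /mnt/demo/hello` drives
     * the callbacks above, one user/kernel round trip per syscall */
    return fuse_main(argc, argv, &ops, NULL);
}
```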

Slide 36

The problem with SHIMs (e.g. LD_PRELOAD)
[Diagram: with a dynamic binary, a SHIM in user space traps the application's library calls and routes them to a user-mode FS before they reach the kernel as syscalls. Static binaries, and mixed binaries (e.g. golang) that issue syscalls directly, bypass the SHIM, so those syscalls reach the kernel untrapped]
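
For context, the classic LD_PRELOAD shim looks like the sketch below (this is the approach being critiqued, not cunoFS's mechanism). It only catches calls routed through libc's dynamic symbol resolution, which is exactly why static binaries and golang binaries that issue raw syscalls slip past it untrapped:

```c
/* shim.c: classic LD_PRELOAD interposer on libc's open() wrapper.
 * Build: cc -shared -fPIC shim.c -o shim.so -ldl
 * Use:   LD_PRELOAD=./shim.so ls /tmp
 * Only dynamically linked programs calling libc's open() are trapped;
 * static binaries and golang binaries issuing raw syscalls bypass this. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <fcntl.h>

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)   /* resolve the "next" open in link order, i.e. libc's */
        real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

    fprintf(stderr, "[shim] open(%s)\n", path);  /* a user-mode FS would
                                                    redirect the call here */

    if (flags & O_CREAT) {   /* open() is variadic: mode follows O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode_t mode = va_arg(ap, mode_t);
        va_end(ap);
        return real_open(path, flags, mode);
    }
    return real_open(path, flags);
}
```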

Slide 37

Our approach
- Ultra-fast, lightweight, robust interception
- Works with dynamic binaries, static binaries, and mixed binaries (golang)
- Runs anywhere, including inside unprivileged containers

Slide 38

POSIX metadata on Object
- Previous attempts:
  - Use object storage for data chunks, with a separate DB for metadata
    - Separate DB servers; scalability issues; consistency issues in backup/restore with the object store
    - Data is scrambled and inaccessible via the native object API
  - Store POSIX metadata inside per-object metadata (very slow, and immutable)
    - Per-object overhead of retrieving metadata (e.g. dirlisting 1000 files => 1000+ API calls)
    - Changing metadata (e.g. access permissions) involves overwriting the whole object
- Our approach:
  - POSIX metadata tends to be highly compressible, and slow-changing
  - Novel composable encoding for POSIX metadata
  - Encode POSIX metadata as hidden, long filenames (flavour illustrated below)
  - A single LIST API operation retrieves both the list of files and the encoded metadata
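
To illustrate the flavour of filename-encoded metadata only: the field layout below is invented for illustration, and is NOT cunoFS's actual encoding, which the deck describes as a novel composable scheme. Packing attributes into a hidden sibling key is what lets one LIST call return names and metadata together:

```c
/* Illustrative only: pack uid/gid/mode/mtime into a hidden sibling
 * object key, so a LIST over the prefix returns metadata "for free"
 * and a chmod/chown becomes a cheap key rewrite, not an object rewrite. */
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical layout: ".meta.<name>.<uid>.<gid>.<octal mode>.<mtime>" */
static void encode_meta_key(char *out, size_t n, const char *name,
                            const struct stat *st)
{
    snprintf(out, n, ".meta.%s.%u.%u.%o.%ld", name,
             (unsigned) st->st_uid, (unsigned) st->st_gid,
             (unsigned) (st->st_mode & 07777), (long) st->st_mtime);
}

int main(void)
{
    struct stat st = {0};
    st.st_uid = 1000;
    st.st_gid = 1000;
    st.st_mode = 0644;
    st.st_mtime = 1700000000;

    char key[256];
    encode_meta_key(key, sizeof key, "report.csv", &st);
    printf("%s\n", key);   /* .meta.report.csv.1000.1000.644.1700000000 */
    return 0;
}
```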

Slide 39

AWS S3 defines "S3-compatible"
- The AWS S3 spec can be vague in places and misinterpreted
- Ultimately, AWS S3's behaviour is the "real" spec, and what SDKs are written against
- Unlike POSIX, which is more easily defined and tested, S3 compatibility is very mixed
- We've had to build cunoFS to be resilient to S3 backend bugs and missing APIs
  - Some vendors only support a subset of the S3 APIs; that is not what we mean by bugs
  - We've dealt with very many different backend S3 API bugs and behaviours so our customers don't have to, including from major vendors
  - Even Google Cloud, with considerable resources, has bugs we've had to work around
  - Worse, we've detected some bugs that only appear at high load!
- cunoFS detects backends and adapts accordingly
  - cunoFS can also run conformance testing when pairing buckets, adapting to what it detects

Slide 40

Go To Market
- Currently focused on enterprise sales
  - Early engagement with some object storage partners
  - Engaging with regional resellers
- Publicly downloadable free trial to be launched (in July)
  - Currently there's a waiting list for PoCs
- Pricing:
  - Volume under management (enterprise pricing)
  - Price per PB continuously decreases as volume grows
  - Small-volume per-TB pricing coming soon

Slide 41

Roadmap
- Q3:
  - Mac and Windows clients
  - Kubernetes CSI driver
- Q4:
  - Further ML acceleration
  - Tap-and-go migration tool
  - Native ARM support
- Even more performance gains coming...

Slide 42

Competitors: Bridging File <=> Object
- There are various open-source products, but they are too slow, unscalable, not POSIX compatible, or simply not practical
- There are object storage vendors that promise POSIX access as a feature, but all the organisations we've spoken to say these were unworkable
- Many organisations we've spoken to say their object storage is poorly utilised, because either it lacks POSIX or POSIX access is cripplingly slow
  - We see ourselves as complementing rather than competing
- Azure offers a file gateway to object storage, and even they don't recommend their file gateway be used for anything other than navigating the directory tree

Slide 43

Competitors: Cloud (& on-prem) File Storage
- We see ourselves as competing more with:
  - AWS EFS, FSx Lustre, Azure Files, Azure Lustre, Google Cloud Filestore
  - On-prem equivalents: general file storage, scalable file storage
  - For throughput workloads, we are faster, more scalable, and less expensive
- To some extent we are also competing with:
  - AWS EBS, Azure Disk Storage, Google Persistent Disk
  - cunoFS doesn't need to be pre-provisioned
  - For throughput workloads, we are faster, more scalable, and less expensive
- We are not competing with high-IOPS file storage, e.g. Weka
  - We see ourselves as complementing rather than competing