Slide 1

Slide 1 text

Juicedata Intro June 2024

Slide 2

Slide 2 text

About Juicedata
● The company behind JuiceFS
● Community edition released under Apache 2.0 license, widely adopted globally
● Cloud Service on AWS, Azure, and GCP
● Today
  ○ 10K stars on GitHub (the fastest-growing open-source project in distributed file systems)
  ○ Max volume exceeds 100PB
  ○ More than 70B files in one volume

Davies Liu, Founder and CEO
■ MooseFS contributor
■ Author of BeansDB [a] and DPark [b]
■ Formerly at Meta and Databricks

Rui Su, Co-founder
■ Former CEO & Founder of a startup
■ Former Tech Lead & PM at Douban Movie [c]
■ 3 years of experience at an NGO
■ 1st engineer at Maxthon Browser

[a] BeansDB: a simple object store created in 2000, used in production environments with petabytes of data.
[b] DPark: a Spark Python clone written in 2010.
[c] Douban Movie: dubbed the Chinese version of IMDb/Rotten Tomatoes.

Slide 3

Slide 3 text

Timeline
● 2017 Founded (2 people)
● 2018 - 2020 Early releases with several clients
● 2021 Community Edition (Open Source) released under AGPLv3 License
● 2022 Community Edition v1.0 LTS GA under Apache 2.0 License
● 2023 Enterprise Edition v5.0 GA, CE v1.1 LTS GA
● 2024 Community Edition v1.2 LTS, currently at RC (20+ people)

Slide 4

Slide 4 text

Pain Points & Challenges
- S3 is very popular in web services, but has limited compatibility with data-intensive workloads.
- No file system was designed for the cloud; most were built for bare metal, are not elastic, and are much more expensive than S3.
- Throughput is usually tied to data capacity and can't scale independently.
- Managing LOSF (lots of small files) is a persistent challenge in file storage, especially in the AI domain.
- Data access in multi-cloud and hybrid-cloud environments is an emerging requirement.

Slide 5

Slide 5 text

Mission & Vision
Storage does not have to mean expensive hardware and complex maintenance work. Juicedata is committed to empowering every enterprise to easily tackle the challenges of massive data and high-performance workloads.
POSIX, Elastic, High throughput, 10B in one volume, Multi-cloud, Not expensive

Slide 6

Slide 6 text

Community & Enterprise Editions: Targeting Different User Groups
● Community Edition: Geared towards general-purpose distributed file systems, emphasizing ease of maintenance, usability, and customization.
[Figure: JuiceFS Architecture (Community Edition)]

Slide 7

Slide 7 text

Community & Enterprise Editions: Targeting Different User Groups
● Enterprise Edition: Designed for data-intensive, high-performance workloads.
[Figure: JuiceFS Architecture (Enterprise Edition)]

Slide 8

Slide 8 text

Wide Compatibility
Fully POSIX compatible, with strong consistency and support for thousands of concurrent clients running mixed workloads

Slide 9

Slide 9 text

Performance & scalability
- Disaggregated performance and data capacity
- Multi-layer cache to scale performance
  - Local cache with SSDs or memory on computing nodes
  - More data could be cached in the independent cache group
- Object store for data durability
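The layered read path described on this slide can be pictured with a small read-through model. This is only an illustrative sketch, not JuiceFS's actual implementation: the `Tier` class, the `read_block` function, and the tier names are hypothetical, and each tier is an in-memory dictionary standing in for local SSD/memory, the independent cache group, and the object store.

```python
from typing import Dict, Optional, Tuple


class Tier:
    """Hypothetical storage tier: an in-memory dict standing in for a real backend."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.blocks.get(key)

    def put(self, key: str, data: bytes) -> None:
        self.blocks[key] = data


def read_block(key: str, local: Tier, group: Tier, store: Tier) -> Tuple[bytes, str]:
    """Return (data, name of the tier that served it), filling faster tiers on a miss."""
    data = local.get(key)
    if data is not None:
        return data, local.name          # fastest path: cache on the compute node
    data = group.get(key)
    if data is not None:
        local.put(key, data)             # promote into the local cache
        return data, group.name
    data = store.get(key)
    if data is None:
        raise KeyError(key)
    group.put(key, data)                 # fill both cache layers on the way back
    local.put(key, data)
    return data, store.name


local = Tier("local-cache")
group = Tier("cache-group")
store = Tier("object-store")
store.put("chunk-0", b"payload")         # durable copy lives in the object store

first = read_block("chunk-0", local, group, store)
second = read_block("chunk-0", local, group, store)
print(first[1], second[1])
```

In this model the first read falls through to the object store and later reads are served from the node-local cache, which is the sense in which adding cache, rather than capacity, scales read performance.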

Slide 10

Slide 10 text

Multi-cloud
- Transparent data replication is very important when a single region does not have enough computing nodes, especially for GPUs today.

Slide 11

Slide 11 text

Community Edition (open source)
● GitHub ~10K stars
● WeChat group 30K (Chinese Users)
● Slack channel 670
● Max volume from community
  ○ Capacity > 100PB
  ○ Inodes > 70B

Slide 12

Slide 12 text

JuiceFS Use Cases
Unstructured data store in AI
- Generative AI
- Autonomous Driving
- Quantitative Trading
- BioTech
Data Lake
- HDFS alternative, S3 improvement
- Integrated with all components of big data ecosystem
Kubernetes Persistent Volume

Slide 13

Slide 13 text

Case Studies - all data-intensive workloads
- GenAI: LLM model pre-training
- Autonomous Driving: perception model training
- Quantitative Capital: trading model training
- BioTech: gene sequencing pipeline

Slide 14

Slide 14 text

Case Study
- An LLM startup company, valued at $2.5B
- Unified storage for LLM pre-training pipeline
- Local NVMe cache on each GPU node + independent NVMe cache group + object storage (cloud service and Ceph RADOS)
- > 300GBps throughput in ONE JuiceFS volume
- Cloud + DC, auto data replication

See also:
- JuiceFS Benchmark on MLPerf
- NAVER uses JuiceFS in its machine learning platform
- BentoML uses JuiceFS to reduce model deployment time

Slide 15

Slide 15 text

Case Study
- An autonomous driving company, invested by GM, Mercedes-Benz, Toyota
- Some metrics in the production environment
  - 20B files in ONE volume, average 100KiB per file
  - 450K metadata QPS
  - File read QPS 300K
  - Avg response latency 0.4ms
  - Throughput peak 70GiB/s
  - Cache hit > 80%
  - Synchronization latency between 2 sites about 20ms
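As a back-of-the-envelope check, the file count and average file size quoted on this slide imply a data volume of roughly 1.8 PiB (about 2 PB). The sketch below only multiplies the two slide figures; it assumes the 100KiB average applies uniformly to all 20B files, and the variable names are illustrative.

```python
KIB = 1024
PIB = 1024 ** 5

files = 20_000_000_000          # 20B files in one volume (from the slide)
avg_file_bytes = 100 * KIB      # average 100KiB per file (from the slide)

total_bytes = files * avg_file_bytes
total_pib = total_bytes / PIB

print(f"{total_bytes:,} bytes is about {total_pib:.2f} PiB")
```

The point of the estimate is that this workload is dominated by file count rather than raw capacity, which matches the small-file (LOSF) pain point raised earlier in the deck.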

Slide 16

Slide 16 text

Case Study
- A quantitative investment firm running on the cloud, founded in Silicon Valley, AUM exceeding $15B.
- Early on, they used AWS FSx for Lustre and Aliyun CPFS (IBM GPFS), but throughput could only scale with data capacity, and both were too expensive.
- JuiceFS Cloud Service enables them to decouple performance and capacity at a low cost.

More details:
- Metabit: Setting Up a Cloud-Based Machine Learning Platform with JuiceFS

Slide 17

Slide 17 text

Case Study - nf-core/rnaseq with test_full dataset
- Prestigious biotech company (founded 125 years ago)
- Uses Nextflow to run long-running genomic pipelines
- Leverages JuiceFS to improve performance by 75% and reduce cost by 60%

Slide 18

Slide 18 text

Comparison with S3FS/Goofys and AWS EFS
Ref.
- JuiceFS vs. S3FS
- Performance comparison
- POSIX compatibility comparison

Slide 19

Slide 19 text

Comparison with AWS FSx for Lustre

Deployment
- AWS FSx for Lustre: AWS Managed Service
- JuiceFS: Juicedata Managed Service, On-Prem, and AWS Marketplace

Data Access Protocol
- AWS FSx for Lustre: POSIX
- JuiceFS: POSIX, HDFS and S3

Performance
- AWS FSx for Lustre: Peak of aggregate throughput 1000MB/s/TiB
- JuiceFS: Unlimited aggregate throughput, scale by adding more cache

Multi-cloud and Hybrid-cloud
- AWS FSx for Lustre: No
- JuiceFS: Yes

Pricing
- AWS FSx for Lustre: $0.6 per GB-month + Provisioned metadata IOPS fee + Backup storage fee + Data transfer fee
- JuiceFS: $0.04 per GB-month
  ● $0.02 per GB-month (Juicedata charge)
  ● $0.02 per GB-month (approximate, S3 charge)

https://juicefs.com/en/pricing
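To make the pricing row concrete, the following sketch applies the per-GB-month rates quoted on this slide to a hypothetical 1 PB volume (taken here as 1,000,000 GB). The capacity and the function name are assumptions for illustration only, and the FSx figure covers the storage rate alone, without the metadata IOPS, backup, and data transfer fees the slide lists separately.

```python
# Rates per GB-month, as quoted on the slide.
FSX_LUSTRE_RATE = 0.60
JUICEDATA_RATE = 0.02
S3_RATE = 0.02  # approximate, per the slide


def monthly_cost(capacity_gb: float, rate_per_gb_month: float) -> float:
    """Storage cost for one month at a flat per-GB rate."""
    return capacity_gb * rate_per_gb_month


capacity_gb = 1_000_000  # hypothetical 1 PB volume (assumption, decimal units)

fsx_cost = monthly_cost(capacity_gb, FSX_LUSTRE_RATE)
juicefs_cost = monthly_cost(capacity_gb, JUICEDATA_RATE + S3_RATE)
ratio = fsx_cost / juicefs_cost

print(f"FSx for Lustre: ${fsx_cost:,.0f} per month")
print(f"JuiceFS + S3:   ${juicefs_cost:,.0f} per month")
print(f"Ratio:          {ratio:.0f}x")
```

Under these assumptions the storage-rate gap is about 15x; actual bills depend on region, tier, request volume, and the additional fees noted above.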

Slide 20

Slide 20 text

Go to Market
- Cloud providers
- Ecosystem
- Some of our users

Slide 21

Slide 21 text

Next
- Continuous investment in R&D
- Community growth in North America
- Prepare a marketing and sales team here

Slide 22

Slide 22 text

Thank You❤

Rui Su
Cofounder of Juicedata
● linkedin.com/in/suave
● x.com/suavesu

Join JuiceFS community
● go.juicefs.com/slack