Juicedata - IT Press Tour #56 June 2024

The IT Press Tour

June 13, 2024

Transcript

  1. About Juicedata
    • The company behind JuiceFS
    • Community edition released under Apache 2.0 license, widely adopted globally.
    • Cloud Service on AWS, Azure, and GCP
    • Today
      ◦ 10K stars on GitHub (the fastest-growing open-source project in distributed file systems)
      ◦ Max volume exceeds 100PB
      ◦ More than 70B files in one volume

    Davies Liu, Founder and CEO
      ▪ MooseFS contributor
      ▪ Author of BeansDB [a] and DPark [b]
      ▪ Formerly at Meta and Databricks

    Rui Su, Co-founder
      ▪ Former CEO & Founder of a startup
      ▪ Former Tech Lead & PM at Douban Movie [c]
      ▪ 3 years of experience in NGO
      ▪ 1st engineer at Maxthon Browser

    [a] BeansDB: a simple object storage created in 2000, used in production environments with petabytes of data.
    [b] DPark: a Spark Python clone written in 2010.
    [c] Douban Movie: dubbed as the Chinese version of IMDb/Rotten Tomatoes.
  2. Timeline
    • 2017: Founded (2 people)
    • 2018 - 2020: Early releases with several clients
    • 2021: Community Edition (Open Source) released under AGPLv3 License
    • 2022: Community Edition v1.0 LTS GA under Apache 2.0 License
    • 2023: Enterprise Edition v5.0 GA, CE v1.1 LTS GA
    • 2024: Community Edition v1.2 LTS, currently in RC (20+ people)
  3. Pain Points & Challenges
    - S3 is very popular in web services, but has limited compatibility with data-intensive workloads.
    - No file system was designed for the cloud; most were built for bare metal, are not elastic, and are much more expensive than S3.
    - Throughput is usually tied to data capacity and can't scale up independently.
    - Managing LOSF (lots of small files) is a persistent challenge of file storage, especially in the AI domain.
    - Data access in multi-cloud and hybrid-cloud environments is an emerging requirement.
  4. Mission & Vision
    Storage does not mean expensive hardware and complex maintenance work. Juicedata is committed to empowering every enterprise to easily tackle the challenges of massive data and high-performance workloads.

    POSIX, Elastic, High throughput, 10B in one volume, Multi-cloud, Not expensive
  5. Community & Enterprise Editions: Targeting Different User Groups
    • Community Edition: Geared towards general-purpose distributed file systems, emphasizing ease of maintenance, usability, and customization.

    [Figure: JuiceFS Architecture (Community Edition)]
  6. Community & Enterprise Editions: Targeting Different User Groups
    • Enterprise Edition: Designed for data-intensive, high-performance workloads.

    [Figure: JuiceFS Architecture (Enterprise Edition)]
  7. Performance & scalability
    - Disaggregated performance and data capacity
    - Multi-layer cache to scale performance
    - Local cache with SSDs or memory on computing nodes
    - More data could be cached in the independent cache group
    - Object store for data durability
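
The layering on this slide (local cache on the compute node, then an independent cache group, then the object store) can be illustrated with a small read-through lookup. This is a minimal sketch under stated assumptions, not JuiceFS's implementation: `CacheTier`, `read_block`, and the block key `chunk-0001` are hypothetical names, and each tier is just an in-memory dict.

```python
from typing import Callable, Dict, List, Optional


class CacheTier:
    """Hypothetical cache layer; a dict stands in for SSD, memory, or a cache group."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}
        self.hits = 0

    def get(self, key: str) -> Optional[bytes]:
        data = self.blocks.get(key)
        if data is not None:
            self.hits += 1
        return data

    def put(self, key: str, data: bytes) -> None:
        self.blocks[key] = data


def read_block(
    key: str,
    tiers: List[CacheTier],
    fetch_from_object_store: Callable[[str], bytes],
) -> bytes:
    """Check tiers from fastest to slowest; fall back to the object store on a full miss."""
    for depth, tier in enumerate(tiers):
        data = tier.get(key)
        if data is not None:
            # Copy the block into every faster tier so the next read is cheaper.
            for faster in tiers[:depth]:
                faster.put(key, data)
            return data
    # Full miss: the object store is the durable source of truth.
    data = fetch_from_object_store(key)
    for tier in tiers:
        tier.put(key, data)
    return data


# Illustrative setup: one durable object and a counter of object-store reads.
object_store = {"chunk-0001": b"training-shard"}
object_store_reads: List[str] = []


def fetch(key: str) -> bytes:
    object_store_reads.append(key)
    return object_store[key]


local_cache = CacheTier("local-ssd")
cache_group = CacheTier("cache-group")
tiers = [local_cache, cache_group]

first = read_block("chunk-0001", tiers, fetch)   # misses both tiers, reads the object store
second = read_block("chunk-0001", tiers, fetch)  # served from the local cache
```

In this sketch, a second compute node with an empty local cache but the same `cache_group` would be served from the cache group rather than the object store, which is the sense in which adding cache capacity scales read performance independently of stored data.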
  8. Multi-cloud
    - Transparent data replication is very important when computing nodes are insufficient in a single region, especially for GPUs today.
  9. Community Edition (open source)
    • GitHub ~10K stars
    • WeChat group 30K (Chinese Users)
    • Slack channel 670
    • Max volume from community
      ◦ Capacity > 100PB
      ◦ Inodes > 70B
  10. JuiceFS Use Cases
    Unstructured data store in AI
      - Generative AI
      - Autonomous Driving
      - Quantitative Trading
      - BioTech
    Data Lake
      - HDFS alternative, S3 improvement
      - Integrated with all components of big data ecosystem
    Kubernetes Persistent Volume
  11. Case Studies - all data-intensive workloads
    - GenAI: LLM model pre-training
    - Autonomous Driving: perception model training
    - Quantitative Capital: trading model training
    - BioTech: gene sequencing pipeline
  12. Case Study
    - An LLM startup company, valued at $2.5B
    - Unified storage for LLM pre-training pipeline
    - Local NVMe cache on each GPU node + independent NVMe cache group + object storage (cloud service and Ceph RADOS)
    - > 300GBps throughput in ONE JuiceFS volume
    - Cloud + DC, auto data replication

    See also:
    - JuiceFS Benchmark on MLPerf
    - NAVER uses JuiceFS in its machine learning platform
    - BentoML uses JuiceFS to reduce model deployment time
  13. Case Study
    - An autonomous driving company, invested by GM, Mercedes-Benz, Toyota
    - Some metrics in the production environment
      - 20B files in ONE volume, average 100KiB per file
      - 450K metadata QPS
      - File read QPS 300K
      - Avg response latency 0.4ms
      - Throughput peak 70GiB/s
      - Cache hit > 80%
      - Synchronization latency between 2 sites about 20ms
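
As a rough worked figure, the file count and average file size quoted on this slide imply the data volume below. This is a back-of-the-envelope sketch: it assumes the 100KiB average applies to all 20B files, and the total is derived here rather than stated in the deck.

```python
# Figures quoted on the slide.
FILE_COUNT = 20_000_000_000      # 20B files in one volume
AVG_FILE_BYTES = 100 * 1024      # 100 KiB average per file

# Derived estimate (not a number from the deck).
total_bytes = FILE_COUNT * AVG_FILE_BYTES
total_pb = total_bytes / 10**15   # decimal petabytes
total_pib = total_bytes / 2**50   # binary pebibytes

print(f"{total_bytes:,} bytes = {total_pb:.3f} PB = {total_pib:.2f} PiB")
```

The point of the estimate is that the volume holds on the order of 2 PB spread over tens of billions of small files, which is the LOSF pattern the deck describes as a persistent challenge.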
  14. Case Study
    - A quantitative capital firm working on the cloud, founded in Silicon Valley, AUM exceeding $15B.
    - Early on, they used AWS FSx for Lustre and Aliyun CPFS (IBM GPFS), but throughput scaling was limited by data capacity and the services were too expensive.
    - JuiceFS Cloud Service enables them to decouple performance and capacity at a low cost.

    More details:
    - Metabit: Setting Up a Cloud-Based Machine Learning Platform with JuiceFS
  15. Case Study
    - nf-core/rnaseq with test_full dataset
    - Prestigious biotech company (founded 125 years ago)
    - Uses Nextflow to run long-running genomic pipelines
    - Leverages JuiceFS to improve performance by 75% and reduce cost by 60%
  16. Comparison with S3FS/Goofys and AWS EFS
    Ref.
    - JuiceFS vs. S3FS
    - Performance comparison
    - POSIX compatibility comparison
  17. Comparison with AWS FSx for Lustre
    Deployment
      - AWS FSx for Lustre: AWS Managed Service
      - JuiceFS: Juicedata Managed Service, On-Prem, and AWS Marketplace
    Data Access Protocol
      - AWS FSx for Lustre: POSIX
      - JuiceFS: POSIX, HDFS and S3
    Performance
      - AWS FSx for Lustre: Peak of aggregate throughput 1000MB/s/TiB
      - JuiceFS: Unlimited aggregate throughput, scale by adding more cache
    Multi-cloud and Hybrid-cloud
      - AWS FSx for Lustre: No
      - JuiceFS: Yes
    Pricing
      - AWS FSx for Lustre: $0.6 per GB-month + Provisioned metadata IOPS fee + Backup storage fee + Data transfer fee
      - JuiceFS: $0.04 per GB-month
        • $0.02 per GB-month (Juicedata charge)
        • $0.02 per GB-month (approximate, S3 charge)
    https://juicefs.com/en/pricing
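
To make the per-GB prices on this slide concrete, the sketch below applies them to a hypothetical 100 TB volume (taken as 100,000 GB). The capacity is an assumption chosen for illustration. The FSx for Lustre figure covers only the quoted $0.6 per GB-month storage price and leaves out the metadata IOPS, backup, and data transfer fees that the slide lists separately.

```python
from decimal import Decimal

# Per GB-month prices quoted on the slide.
FSX_STORAGE_PER_GB = Decimal("0.6")
JUICEDATA_FEE_PER_GB = Decimal("0.02")
S3_FEE_PER_GB = Decimal("0.02")   # marked "approximate" on the slide


def monthly_storage_cost(capacity_gb: int, price_per_gb_month: Decimal) -> Decimal:
    """Storage-only monthly cost in dollars; extra fees are intentionally excluded."""
    return capacity_gb * price_per_gb_month


capacity_gb = 100_000  # hypothetical 100 TB volume, not a figure from the deck

fsx_cost = monthly_storage_cost(capacity_gb, FSX_STORAGE_PER_GB)
juicefs_cost = monthly_storage_cost(capacity_gb, JUICEDATA_FEE_PER_GB + S3_FEE_PER_GB)
ratio = fsx_cost / juicefs_cost

print(f"FSx for Lustre (storage only): ${fsx_cost:,.2f} per month")
print(f"JuiceFS + S3:                  ${juicefs_cost:,.2f} per month")
print(f"Ratio:                         {ratio}x")
```

Because both prices are linear in capacity, the ratio between the two storage-only figures is the same at any volume size; only the absolute dollar amounts change.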
  18. Next
    - Continuous investment in R&D
    - Community growth in North America
    - Prepare marketing and sales team here
  19. Thank You ❤
    Rui Su, Cofounder of Juicedata
      • linkedin.com/in/suave
      • x.com/suavesu
    Join JuiceFS community
      • go.juicefs.com/slack