Paradigm4 flexFS - IT Press Tour #68 June 2026

The IT Press Tour

June 09, 2026

Transcript

  1. THE PROBLEM: The typical AI stack has cracks in its foundation

    • Most workloads speak POSIX. AI training, HPC pipelines, analytics engines, and even AI agents expect files, directories, and low-latency I/O. That's not changing.
    • Low object-storage costs are compelling. Object storage is the right economic choice at scale: elastic, cheap, durable. But it brings high latency and a foreign API. That's not changing either.
    • The gap taxes every workload. GPU idle time. Slow pipelines. Over-provisioned file systems. Data scientists doing plumbing instead of science.
    Data infrastructure is inadvertently making data expensive and hard to use.
    Our Customers Faced This Challenge
  2. Of course, we wanted it all, but budgets and offerings were limited
  3. Previous Industry Attempts to Get It All

    “Lift and Shift” Datacenter Storage into Public Cloud
    • Physical parallel-filesystem storage designs copied onto cloud servers with connected disks
    • Often managed the same way as on-premises versions
    • Integrate with cloud object stores via a “bolt-on” tier, at best
    • Throughput directly dependent on deployed capacity, e.g., Lustre, DDN, WEKA, etc.
    • Don’t satisfy all four needs, especially cost (incl. operational)
  4. flexFS: Object-native Parallel Filesystem …an entirely different approach

    Hyperscale object storage brings:
    • Massively scalable, elastic capacity
    • Separately scalable, elastic throughput
    • Low costs
    But also comes with:
    • High-latency file and metadata I/O
    • Its own APIs (most software knows about files, not objects)
    Leverage hyperscale object store for its strengths and add:
    • POSIX file system semantics
    • Low-latency metadata I/O
    • Tunable, low-latency file I/O
    • Minimal operational overhead
    • Outstanding price-performance
    You can have it all!
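  5. flexFS Architecture

    [Architecture diagram. Labeled components, all within a public cloud (AWS, Azure, GCP, OCI, etc.): (1) flexFS Metadata Server(s); Compute Instances (w/ flexFS client); flexFS Proxy Group (WB cache); Object Storage (S3, Azure Blob, etc.). Labeled flows: Metadata, File Data. Callouts 2, 3, and 4 appear in the diagram, but their targets are not recoverable from the transcript.]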
  6. Supported flexFS Configurations

    • Single-Region Cloud: Entire environment within a single cloud region managed by a single cloud vendor
    • Multi-Region, Multi-Cloud: Environment spans multiple cloud regions and/or multiple cloud vendors
    • On-Premises: All flexFS services and file data are in a private data center, lab, or office
    • Hybrid: Data is primarily stored in the cloud but also needs to be accessible on premises, or vice versa
    • Converged*: Storage services co-resident on compute nodes, except the object-store back end
    * Converged configs enable near-local-NVMe performance using networked object storage, as demonstrated on OCI by Oracle working jointly with Paradigm4. For details, see https://blogs.oracle.com/cloud-infrastructure/accelerate-ai-workloads-on-oci-with-flexfs-cache.
  7. Current Use Cases in Production

    • Common storage fabric and scratch space
    • R&D Data Commons
    • Data staging/munging area
    • Data Lakehouse Extended Storage
    • Elastic scratch space for cloud bioinformatics workflows
    • Direct file access for cloud bioinformatics workflows
  8. Top-5 Global Biopharmaceutical Company

    • Use Case: Research Data Commons serves as a global central repo for clinical and other research data
    • Problem/Challenge: Existing FS had high admin overhead, high downtime to reprovision, budget overruns
    • Competing Solution: Combination of AWS S3, EFS, EBS and FSx for Lustre
    • Current Data Storage: 1.14 PB, >160M files and folders
    • Cost Savings in One Year: $1.44 million, 59% less than the competing AWS solution
  9. flexFS ROI Summary: Top-5 Pharma · Sep 2022 – Mar 2026 (43 months)

    • flexFS + S3 (43 months): $2.53M (actual billing + S3 backend)
    • AWS provisioned (43 months): $5.65M (enterprise tiers, realistic ops)
    • Cumulative savings: $3.13M (55% of AWS cost avoided)
    • 2025 full-year savings: $1.44M (59% of AWS avoided)
    • Mar 2026 monthly saving: $164K/mo (flexFS $110K vs AWS $274K)
    • Lustre over-provisioning waste: $332K (~24% of FSx spend, 43 mo)
    At current scale (1.14 PB), the entire flexFS+S3 bill ($110K/mo) is less than competing EFS storage alone ($141K/mo).
    AWS-alternative parameters:
    • Distribution: FSx Lustre 25% · EFS 40% · EBS 10% · S3 25%
    • FSx: Persistent_2 SSD 500 MB/s/TiB @ $0.170/GB-mo · EFS: Standard Regional @ $0.300/GB-mo · EBS: gp3+provisioned @ $0.125/GB-mo · S3: Standard @ $0.023/GB-mo
    • AWS includes: EFS Elastic Throughput I/O ($0.03/GB reads, $0.06/GB writes) · cross-AZ transfer · AWS Backup (FSx 80%, EFS 80%, EBS 60%)
    • FSx provisioned at 30% headroom in 2.4 TiB increments, no shrink after scale-up*
    • flexFS+S3: actual billing + $0.023/GB-mo S3 backend
    • Both include i4i.16xlarge @ $4,008/mo
    * Before AWS launched “Intelligent Tiering” in May 2025. However, scaling Lustre throughput still adds charges (over $0.50 for every Mbps), as does SSD read-cache capacity (write caching is not supported).
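The headline ROI figures can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers printed on the slide; the function and variable names are illustrative and not part of any Paradigm4 tooling. Using the rounded totals as inputs, the cumulative saving comes out at $3.12M rather than the slide's $3.13M, which is consistent with the slide having been computed from unrounded billing data. The blended rate covers storage list prices only and leaves out the I/O, transfer, backup, and headroom charges the slide also counts.

```python
# Figures copied from the ROI slide (USD). Nothing here is measured independently.
FLEXFS_TOTAL = 2.53e6   # flexFS + S3, 43 months
AWS_TOTAL = 5.65e6      # AWS provisioned alternative, 43 months

# AWS-alternative capacity mix and list prices ($/GB-month) from the slide.
AWS_MIX = {
    "FSx Lustre": (0.25, 0.170),
    "EFS": (0.40, 0.300),
    "EBS": (0.10, 0.125),
    "S3": (0.25, 0.023),
}


def savings(aws_cost, flexfs_cost):
    """Return (absolute saving, fraction of the AWS cost avoided)."""
    saved = aws_cost - flexfs_cost
    return saved, saved / aws_cost


def blended_rate(mix):
    """Capacity-weighted storage price in $/GB-month (storage only)."""
    return sum(share * price for share, price in mix.values())


cum_saved, cum_frac = savings(AWS_TOTAL, FLEXFS_TOTAL)
mo_saved, mo_frac = savings(274e3, 110e3)  # Mar 2026 monthly bills from the slide

print(f"cumulative saving: ${cum_saved / 1e6:.2f}M ({cum_frac:.0%} of AWS cost)")
print(f"Mar 2026 saving:   ${mo_saved / 1e3:.0f}K/mo ({mo_frac:.0%} of AWS cost)")
print(f"blended AWS storage rate: ${blended_rate(AWS_MIX):.5f}/GB-mo")
```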
  10. Advantages Beyond Cost Savings

    • 🕐 Built-in time travel: Point-in-time recovery at no added cost. AWS Backup for FSx = $0.047/GB-mo; EFS = $0.050/GB-mo. At 1.14 PB scale that's ~$32K/mo if done separately.
    • 🌐 Native multi-AZ, zero transfer cost: flexFS serves data across AZs transparently. EFS charges $0.01/GB cross-AZ. FSx Lustre is single-AZ by default. flexFS incurs no cross-AZ transfer charges.
    • ↕ True elasticity, no over-provisioning: flexFS grows and shrinks automatically; you pay for bytes used. FSx requires capacity in 2.4 TiB increments and cannot shrink, wasting $332K over 43 months.
    • 🔐 Extended POSIX ACLs for clinical data: setfacl/getfacl works uniformly across HPC, AWS Batch, Databricks, and REVEAL. EFS uses NFSv4 ACLs (complex). S3 has IAM/bucket policies only, with no POSIX ACLs.
    • ⚡ Proxy group shared caching: Reference genomes (30+ GB) cached once across all compute nodes. FSx: each client warms its own cache independently. EFS: no client-side caching layer.
    • 💻 All Linux flavors + macOS: Single mount on Amazon Linux, Ubuntu, RHEL, and macOS via NFS re-export. FSx requires the Lustre kernel module, which is not available on macOS or non-Amazon Linux.
    • 🔗 Single namespace across all services: Same /flexfs path from HPC/SLURM, AWS Batch, Databricks, REVEAL, and HTTP. AWS stack requires separate NFS mounts, S3 URIs, and Lustre mounts per service type.
    • 📉 Cost efficiency improves at scale: flexFS effective rate fell from ~$90/TB-mo at 25 TB (2022) to ~$66/TB-mo at 1.14 PB (2026). EFS Standard stays flat at $307/TB-mo. FSx stays flat at $174/TB-mo.
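The per-TB rates quoted for EFS and FSx follow from the per-GB list prices on the ROI slide if one assumes 1,024 GB per TB. That conversion factor is an assumption on my part; it is used here only because it reproduces the slide's $307 and $174 figures. The short sketch below also divides each rate by the slide's ~$66/TB-mo flexFS figure, taking all numbers at face value, which gives roughly 4.7x for EFS and 2.6x for FSx.

```python
GB_PER_TB = 1024  # assumed binary convention; it matches the slide's rounded rates


def per_tb_month(price_per_gb_month):
    """Convert a $/GB-month list price to $/TB-month."""
    return price_per_gb_month * GB_PER_TB


EFS_TB = per_tb_month(0.300)   # EFS Standard Regional list price from the slide
FSX_TB = per_tb_month(0.170)   # FSx Lustre Persistent_2 SSD list price from the slide
FLEXFS_TB_2026 = 66.0          # flexFS effective rate at 1.14 PB, as quoted

for name, rate in (("EFS Standard", EFS_TB), ("FSx Lustre", FSX_TB)):
    ratio = rate / FLEXFS_TB_2026
    print(f"{name}: ${rate:.0f}/TB-mo, {ratio:.1f}x the quoted flexFS rate")
```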
  11. flexFS Operations Features

    Deduplication
    • Identifies duplicate files within a flexFS volume and optionally replaces them with hard links to reclaim storage. Duplicates are verified through checksum comparison and byte-for-byte validation before making any changes.
    Optimized find utility
    • Filesystem search tool that queries the metadata server to locate files and directories matching specified criteria
    • Similar to the Unix find command but operates directly on the metadata store instead of traversing the mounted volume.
    Non-Disruptive Updates
    • Mount clients auto-update in place with a seamless FUSE session handoff: no unmount, no interruption, no data loss.
    • flexFS server updates pause I/O for less than 1 second, with no impact on data.
    Kubernetes Native
    • CSI volume driver with Helm chart for dynamic and static provisioning. Mount flexFS volumes directly into pods.
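The deduplication flow described on this slide (checksum match, then byte-for-byte confirmation, then an optional hard-link replacement) can be illustrated on any POSIX filesystem. The sketch below is a generic illustration of that technique and not flexFS's own tool; the function names, the choice of SHA-256, and the temporary-link naming are all assumptions. Unlike the flexFS utility, it walks the mounted tree instead of querying a metadata server.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk=1 << 20):
    """Checksum a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def dedupe(root, replace=False):
    """Find duplicate regular files under root; optionally hard-link them.

    Returns a list of (kept_path, duplicate_path) pairs.
    """
    by_key = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_key[(os.path.getsize(path), sha256_of(path))].append(path)

    pairs = []
    for paths in by_key.values():
        keep = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keep, dup):
                continue  # already hard links to the same inode
            if not filecmp.cmp(keep, dup, shallow=False):
                continue  # checksum matched but bytes differ: leave untouched
            pairs.append((keep, dup))
            if replace:
                tmp = dup + ".dedupe-tmp"  # hypothetical temp name
                os.link(keep, tmp)         # new hard link to the kept file
                os.replace(tmp, dup)       # atomically swap it over the duplicate
    return pairs
```

Calling `dedupe(root)` with the default `replace=False` only reports candidates, which mirrors the "optionally replaces" wording on the slide. Note that hard-linking makes the duplicate share the kept file's ownership, mode, and timestamps, so a real tool would need a policy for files whose metadata differs.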
  12. flexFS cost relief increases as data volume grows

    [Chart: Cost of flexFS vs Lustre and EFS for 100-800 TB]
  13. flexFS Saves Time and Money Four Ways

    Save on file storage costs
    • 2-5x cheaper than EFS and FSx for Lustre
    • Pay only for what you use
    Save on distributed computing costs
    • Big savings on large HPC jobs from reduced file I/O time
    Avoid end user downtime
    • Avoid downtime with high elasticity and no storage-cluster resizing
    Lower operational overhead
    • Minimal infrastructure to monitor and maintain
    • No re-provisioning to increase capacity
  14. Newer Use Cases

    • Data Lakehouse Acceleration
    • Coupled-Architecture DBMS Modernization
    • AI/ML Training and Execution Acceleration
    • Agentic AI/ML Workspace with Persistence
  15. Data Lakehouse Acceleration

    Problem: High-latency "Metadata Tax" and "Small-File Congestion" in Spark, Presto, and other OTF Data Lakehouses.
    The flexFS Advantages
    • Sub-Second Planning: Metadata service eliminates "planning hangs."
    • Throughput Saturation: Increased parallelism and more-efficient byte-range requests enable 2X - 7X performance gains on Spark workloads.
    • Elasticity: Unlike HDFS, flexFS enables compute clusters to scale instantly (e.g., from 500 to 1,000 nodes) without data rebalancing.
    See “Accelerating Data Lakehouses with flexFS” whitepaper for details.
    TPC-H Results
    | Engine | S3 | flexFS direct | flexFS proxied |
    | --- | --- | --- | --- |
    | Spark | 1,191s | 796s | 532s |
    | Spark + Comet | 788s | 1,257s | 301s |
    | Spark + Gluten | 566s | 275s | 176s |
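To read the TPC-H table as speedups, each S3 time can be divided by the matching flexFS time. The sketch below does that with the values exactly as they appear in the transcript; the dictionary layout and function name are illustrative only. A ratio above 1 means faster than S3 and a ratio below 1 means slower, which is the case for the Spark + Comet "flexFS direct" figure as transcribed.

```python
# Wall-clock seconds copied from the slide's TPC-H table.
TPCH_SECONDS = {
    "Spark": {"S3": 1191, "flexFS direct": 796, "flexFS proxied": 532},
    "Spark + Comet": {"S3": 788, "flexFS direct": 1257, "flexFS proxied": 301},
    "Spark + Gluten": {"S3": 566, "flexFS direct": 275, "flexFS proxied": 176},
}


def speedups(results, baseline="S3"):
    """Map each engine to {config: baseline_time / config_time}."""
    return {
        engine: {
            cfg: times[baseline] / t
            for cfg, t in times.items()
            if cfg != baseline
        }
        for engine, times in results.items()
    }


for engine, ratios in speedups(TPCH_SECONDS).items():
    line = ", ".join(f"{cfg} {r:.2f}x" for cfg, r in ratios.items())
    print(f"{engine}: {line}")
```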
  16. Coupled-Architecture DBMS Modernization

    Objective: Upgrade Coupled-Architecture databases (MPP DW, Graph and Vector) to use elastic, high-throughput Object Storage via flexFS.
    Problem: DBMS clusters currently rely on strict POSIX (atomic renames, locking, etc.) and low-latency, high-throughput I/O on direct-attached disks.
    flexFS Advantages:
    • Independent Resource Scaling: De-couple file-storage growth from compute clusters, right-sizing infrastructure and reducing TCO by up to 60%, with no code changes.
    • Intelligent RAM + NVMe Cache: Proxy Group handles random and burst I/O, speeding results and reducing I/O pressure on the object store.
    • Local-NVMe-Level Performance: Converged Compute & Proxy config provides the throughput and IOPS that high-speed engines expect.*
    • DBMS Snapshots Enabled: Leverages built-in, zero-copy “Time-Travel” metadata mapping that also provides instant, no-cost backups.
    * As demonstrated on OCI by Oracle working jointly with Paradigm4. For details, see https://blogs.oracle.com/cloud-infrastructure/accelerate-ai-workloads-on-oci-with-flexfs-cache.
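For readers less familiar with the "strict POSIX" behaviors named in the problem statement, the sketch below shows the two patterns a database engine typically depends on: publishing a file through an atomic rename, and serializing appends with an advisory lock. It is a generic Unix illustration with hypothetical function names, not code from flexFS or from any DBMS, and the slides do not say which lock types (flock versus fcntl record locks) flexFS supports.

```python
import fcntl
import os
import tempfile


def atomic_write(path, data: bytes):
    """Publish data so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to storage before publishing
        os.replace(tmp, path)      # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def append_locked(path, record: bytes):
    """Append one record while holding an exclusive advisory lock on the file."""
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(record)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Plain object-store APIs offer neither primitive directly, which is why engines built on these patterns usually need a POSIX layer before they can move off direct-attached disks.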
  17. AI/ML Training and Acceleration

    Objective: Eliminate GPU Starvation during large-scale model training (PyTorch, TensorFlow, JAX).
    Problem: High-end GPUs sit idle while standard S3 drivers struggle to feed data fast enough, particularly during random shuffles or massive model checkpoints.
    flexFS Advantages
    • 2x Speedup (Non-Proxied): Even without a cache, flexFS can access data in half the time of S3 direct by utilizing Object-Native Parallelism to saturate the network pipe.
    • Small-I/O Optimization: Byte-Range reads are optimized for GPU utilization and High Bandwidth Memory (HBM) efficiency.
    • Instant Checkpointing: flexFS absorbs massive model saves at near local-NVMe performance, allowing the GPU cluster to resume training in seconds rather than minutes.
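From the application side, byte-range parallelism over a mounted file looks like ordinary positional reads issued concurrently. The sketch below is a generic POSIX illustration of that access pattern, for example a data loader pulling shards of one large file; it says nothing about how flexFS maps such reads onto object storage internally. The function names and the default worker count are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, chunk):
    """Cut [0, size) into (offset, length) pieces of at most chunk bytes."""
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def read_ranges(path, ranges, workers=8):
    """Read (offset, length) ranges of one file concurrently.

    Results come back in the same order as the requested ranges.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # os.pread takes an explicit offset and leaves the file position
            # alone, so all threads can safely share one descriptor.
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), ranges))
    finally:
        os.close(fd)
```

A caller would typically combine the two, as in `read_ranges(path, split_ranges(os.path.getsize(path), 8 << 20))`. A production reader would also loop on short reads, since `os.pread` may return fewer bytes than requested.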
  18. Agentic Workspace with Persistence

    Objective: High-performance AI/ML memory substrate for autonomous agents to rapidly access context, store reasoning, and manage multi-modal artifacts.
    Problem: Traditional RAG over S3 is too slow for multi-step agents (10+ "hops"). They require low-latency storage for intermediate data, and POSIX provides a more agent-native environment.
    flexFS Advantages:
    • Pointers, Not Payloads: Agents share file paths instead of massive data copies. Data cache is "warmed" once for all compute nodes.
    • Efficient Byte-Range I/O: Agents access only relevant sections of large files (e.g., 500MB PDF) to reduce latency and token costs.
    • POSIX Scratchpad: Native file-access API environment for agents to save, execute, and log Python scripts, offering a local-disk feel while persisting results to a shared object store or Data Lake.
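The "POSIX Scratchpad" and "Pointers, Not Payloads" ideas can be pictured with ordinary file operations. In the sketch below, which is a hypothetical illustration and not an agent framework or a flexFS API, an agent saves a script into a shared workspace directory, runs it, writes a log next to it, and returns only the log's path for the next agent to open. The helper names and log layout are assumptions, and sandboxing of agent-written code is deliberately out of scope.

```python
import subprocess
import sys
from pathlib import Path


def run_in_scratchpad(workspace, name, source, timeout=30):
    """Save a script under workspace, run it, log the outcome, return the log path."""
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    script = ws / f"{name}.py"
    log = ws / f"{name}.log"
    script.write_text(source)
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True, text=True, timeout=timeout, cwd=ws,
    )
    log.write_text(
        f"exit={result.returncode}\n"
        f"--- stdout ---\n{result.stdout}"
        f"--- stderr ---\n{result.stderr}"
    )
    return log  # pass the path on, not the contents


def read_section(path, offset, length):
    """Read only one slice of a large file instead of loading all of it."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

On a shared mount such as the `/flexfs` path mentioned earlier in the deck, the returned path would resolve identically on every compute node, which is what lets agents exchange pointers instead of copies.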
  19. flexFS Fast Time-to-Value, Low Maintenance, Trusted Reliability

    • Out-of-the-Box ready to support Agentic AI use cases
    • Installation typically under an hour
    • Most customers need only one server
    • Drop-in replacement for EFS, FSx for Lustre, OCI File Storage, GC Filestore, Azure Files, etc.
    • Effectively unlimited storage
    • Very low maintenance: “set it and forget it”
    • 11 nines data durability on hyperscale cloud
    • Continuous snapshots
    • End-to-end data encryption
  20. What do you think of the idea of a “File Lakehouse” category?

    We’re exploring the idea of defining a category in today’s AI/ML/Analytics landscape: the “File Lakehouse”.
    We’d like your thoughts:
    • Is this concept useful?
    • Would it resonate with your audiences?
    • Does it create more clarity?
  21. Modern AI/ML/Analytics Architecture: with File Lakehouse

    [Architecture diagram. Recoverable labels, in transcript order:]
    • Data Lakehouse | Coupled-Architecture DBMS | File Lakehouse
    • Object Store (infinite capacity, cost effective storage)
    • Object-native Parallel Filesystem: Elastic | Massively Parallel | 100% POSIX | Throughput Acceleration | Low Latency I/O | Advanced Write Back Caching | Optimized Block I/O
    • Analytical Engines (SQL, Dataframes, Batch Streaming)
    • OTF Management (ACID, Schema, Time Travel, Gov)
    • Business Intelligence | Reporting | Analytics
    • AI/ML Models | Generative - Agentic - Causal
    • Data Lakehouse Management
    • MPP DW (SQL) | Graph (NoSQL) | Vector (NoSQL)
    • De-couple Compute & Storage, Enable Snapshotting (HPC Metadata | 100% POSIX | Low Latency I/O | Proxies)
    • AI Inference & Training
    • Data Lakehouse Acceleration (HPC Metadata | 100% POSIX | Byte-Range I/O | Proxies)
    • Bespoke AI Model Training (HPC | Elastic Throughput | o-direct)
    • Unstructured Data Management (HPC | POSIX | Agents | Images | PDFs | Video | Sound)
    • Bespoke AI Model Inference (HPC | Elastic Throughput | o-direct)
    • AI/ML Acceleration (HPC | Elastic Throughput | o-direct)
    • Zero Copy (label appears twice)