Slide 31
User 1: ML platform startup
Dask-on-Ray → Datasets → Horovod
• Dask-on-Ray and Datasets were 10x faster
than Pandas + S3 + Petastorm, even at
small data and cluster scales
• Benchmark: NYC Taxi dataset (5 GB
subset), single g4dn.4xlarge instance
Case study - benchmark results
User 2: large transport tech company
S3 → Datasets → Horovod
• Datasets from S3 was 4x faster
than Petastorm from S3
• Benchmark: 1.5 TB synthetic
tabular dataset, 70 shuffle workers
(c5.18xlarge), 16 trainers
(c5.18xlarge), 3 shuffle windows
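
The "3 shuffle windows" setting can be pictured with a small pure-Python sketch. It is an illustration of the general idea only (split the data into contiguous windows, then shuffle within each window), not Ray Datasets' implementation. The function name `windowed_shuffle` and its parameters are hypothetical.

```python
import random


def windowed_shuffle(records, num_windows, seed=0):
    """Shuffle records within contiguous windows (illustrative sketch only)."""
    rng = random.Random(seed)
    n = len(records)
    if n == 0:
        return []
    # Ceiling division: every record lands in exactly one window,
    # and at most num_windows windows are produced.
    window_size = -(-n // num_windows)
    out = []
    for start in range(0, n, window_size):
        window = list(records[start:start + window_size])
        rng.shuffle(window)  # records never leave their own window
        out.extend(window)
    return out


# Toy data: 12 records split into 3 windows of 4 records each.
shuffled = windowed_shuffle(list(range(12)), num_windows=3)
print(shuffled)
```

Fewer, larger windows mix records more thoroughly, while more, smaller windows need less data in flight at once. A single window corresponds to a full global shuffle.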
Throughput
Petastorm 1.8 GB/s
Datasets 7.38 GB/s
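
As a quick check on the throughput figures above, the arithmetic below derives the speedup ratio and a rough time for one pass over 1.5 TB. It assumes 1.5 TB = 1500 GB and that the quoted throughput is sustained for the whole pass. Neither assumption is stated on the slide.

```python
# Throughput figures from the slide, in GB/s.
petastorm_gbps = 1.8
datasets_gbps = 7.38

# Dataset size from the slide: 1.5 TB, taken as 1500 GB (assumption: decimal units).
dataset_gb = 1.5 * 1000

# Ratio of the two throughputs.
speedup = datasets_gbps / petastorm_gbps

# Time for one full pass, assuming the throughput is sustained (assumption).
petastorm_seconds = dataset_gb / petastorm_gbps
datasets_seconds = dataset_gb / datasets_gbps

print(f"speedup: {speedup:.1f}x")                              # speedup: 4.1x
print(f"Petastorm pass: {petastorm_seconds / 60:.1f} min")     # Petastorm pass: 13.9 min
print(f"Datasets pass:  {datasets_seconds / 60:.1f} min")      # Datasets pass:  3.4 min
```

The ratio of about 4.1x is consistent with a 4x improvement.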
Ray Datasets gives a higher-quality
shuffle AND better performance,
even at small scales!