Per-epoch Shuffling Data Loader: Mix It Up As You Train! (Clark Zinzow, Anyscale)

Per-Epoch Shuffling Data Loader: Mix it up as you train!
Clark Zinzow, Software Engineer @ Anyscale

What’s this talk about?
• The need for and challenges of large-scale shuffled data loading for distributed model training
• A compelling Ray-based solution

A bit about me
• Software engineer at Anyscale
• Working on the core Ray system, focusing on supporting large-scale data processing
• Before Anyscale, I was a Ray user!

What’s a shuffling data loader? And why do we need it?

Your typical ML pipeline: Data Preprocessing -> Model Training -> Model Deployment
What we’re interested in

Data Preprocessing + Model Training
[Diagram: Data Preprocessing stores data in Storage (S3, HDFS); Model Training Workers load it from storage]
Loading data from storage into the model trainers is simple, right?

Not so simple!
With
• large data scales
• distributed data-parallel training
data loading can become a significant training bottleneck!

We’re not just loading the data once...
[Diagram: Model Training Workers load data from Storage (S3, HDFS) for epoch 1, epoch 2, and epoch 3]
    for epoch in range(num_epochs):
        for data, target in load_data():
            train(data, target)
We’re loading data at the start of every epoch.

We’re not just loading data into a single trainer...
[Diagram: Storage (S3, HDFS) feeding Trainer #1, Trainer #2, and Trainer #3]
We’re loading partitions of data into multiple trainers, spanning multiple machines.

We’re not just loading data...
[Diagram: Storage (S3, HDFS), a Shuffler, and Trainer #1, Trainer #2, and Trainer #3]
We’re also shuffling the data at the start of every epoch.

Why do we want to shuffle the data at the start of every epoch? How does model training benefit?

Shuffling decorrelates samples within and across batches

Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
• Batch gradient descent - batches are more representative of the entire dataset, improving the estimate of the “true” full-dataset gradient
• Batch/stochastic gradient descent - gradient updates on individual samples are independent of sample ordering
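To make the "more representative batches" point concrete, here is a minimal sketch on made-up data. It assumes a hypothetical dataset whose values sit in sorted order, and compares how far per-batch means stray from each other with and without shuffling. The helper name `batch_mean_spread` and the data are illustrative assumptions, not something from the talk.

```python
import random
import statistics


def batch_mean_spread(values, batch_size):
    """Return the standard deviation of the per-batch means."""
    batches = [values[i:i + batch_size]
               for i in range(0, len(values), batch_size)]
    means = [statistics.fmean(batch) for batch in batches]
    return statistics.pstdev(means)


# Hypothetical dataset that happens to be stored in sorted order.
data = [float(i) for i in range(1000)]

# Same rows, shuffled once with a fixed seed for repeatability.
shuffled = data[:]
random.Random(0).shuffle(shuffled)

sorted_spread = batch_mean_spread(data, batch_size=50)
shuffled_spread = batch_mean_spread(shuffled, batch_size=50)

# A smaller spread means each batch looks more like the whole dataset.
print(f"sorted: {sorted_spread:.1f}  shuffled: {shuffled_spread:.1f}")
```

In the sorted case every batch covers only a narrow slice of the value range, so batch statistics (and, by analogy, batch gradients) are poor estimates of the full-dataset quantity.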

End result: better model accuracy after a set number of epochs
You can think of it as per-epoch shuffling improving the statistical gain of each step in the training process.

What are the challenges in creating a shuffling data loader?
Large data scale and support for distributed data-parallel training make things more difficult (and fun)

You want:
• High-quality, global shuffle
• No significant increase in training time
• Convenient integration with ML frameworks

You need to balance...
• obtaining a high-quality, global shuffle...
• without significantly increasing training time for each epoch due to trainers waiting for data to be shuffled...
• without hogging significant cluster resources to run the shuffle...
• without devoting significant engineering resources to developing a complex data loading and distributed shuffle strategy...
• while making it easy to integrate with model training frameworks and infrastructure.

What solutions already exist? And where do they fall short?

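PyTorch
    from torch.utils.data import BatchSampler, DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets
    from torchvision.transforms import ToTensor
    data = datasets.MNIST(root="data", download=True, transform=ToTensor())
    # BatchSampler requires drop_last; a batch sampler is passed via batch_sampler.
    sampler = BatchSampler(DistributedSampler(data), batch_size, drop_last=False)
    dataloader = DataLoader(data, batch_sampler=sampler)
Pitfalls:
• PyTorch-specific, locked in to framework
• Iterable dataset shuffle is per-shard, not global
• For map-style datasets + distributed sampler, main process has to coordinate shuffling and data loading
• Reports of performance issues with multiprocessing parallel data loading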

TensorFlow
    import tensorflow as tf
    import tensorflow_datasets as tfds
    ds = tfds.load('mnist', split='train', shuffle_files=True)
    ds = ds.shuffle(shuffle_buffer_size, reshuffle_each_iteration=True)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    ds = ds.repeat(num_epochs)
Pitfalls:
• TensorFlow-specific, locked in to framework
• Shuffle is per-shard, not global
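One way to see why a buffer-based shuffle is not a global shuffle is the pure-Python sketch below. It mirrors the fixed-size buffer behavior documented for tf.data's shuffle (fill a buffer, emit a random element, replace it with the next input element), but it is only an illustration: `buffered_shuffle` is a hypothetical helper, not TensorFlow code, and it assumes `buffer_size >= 1`.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(buffered_shuffle(range(1000), buffer_size=10))

# Every element is still present exactly once...
assert sorted(out) == list(range(1000))
# ...but the element emitted at position p always comes from the first
# p + buffer_size inputs, so late rows can never reach early batches.
assert all(value < pos + 10 for pos, value in enumerate(out))
```

With a buffer much smaller than the dataset, rows only move a short distance forward from their on-disk position, which is why the quality of the shuffle depends on the buffer size.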

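Spark
    from pyspark.sql import functions as F
    df = spark.read.parquet("/some/file.parquet")
    # functions.shuffle() only permutes the elements of an array column,
    # so order the rows by a random key to shuffle the rows themselves.
    shuffled_rows = df.orderBy(F.rand()).collect()
Pitfalls:
• Requires a separate cluster, adding to infra ops
• Difficult to integrate with training within ML frameworks
• Hard to pipeline shuffling with training
• Shuffling all epochs in advance can be expensive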

Petastorm
    from petastorm import make_batch_reader
    make_batch_reader("/some/file.parquet", num_epochs=num_epochs)
Pitfalls:
• Pseudo-global shuffle isn't truly global; it shuffles row groups/partitions instead of individual rows
• Pseudo-global shuffle produces spikes in batch wait times for trainers, resulting in longer training times

Existing solutions and pitfalls
[Table: solutions vs. pitfalls; the per-solution marks were lost in extraction]
Pitfalls:
• ML framework-specific
• Shuffle is not global
• Reports of performance issues
• Requires separate infrastructure
• Difficult to integrate with ML frameworks

User Stories Uber’s DL teams

Uber’s DL data loading issues - latency spikes
With global shuffling, they experienced data loading latency spikes and throughput instability.

Uber’s DL data loading issues - training loss oscillation
Without global shuffling, the statistical sampling of training data was poor, resulting in large training loss oscillation.

Uber’s DL data loading issues - model accuracy
Both issues negatively affected model accuracy within a set time window.

Some users are even rolling their own Ray-based data loaders!

NVIDIA’s data loading issue - hanging
• PyTorch’s multiprocessing data loader occasionally hangs, hurting training times
• They train small models that are IO-bound, so data loading performance is important
• A simple Ray-based data loader (a multiprocessing drop-in replacement) achieves higher throughput than TensorFlow’s data loader and matches PyTorch’s data loader, without the occasional hangs

A Ray-based Shuffling Data Loader

Why is Ray a good fit?
• System Design
• Ecosystem

Ray System Design
• Distributed in-memory object store
• Fast, smart, decentralized scheduling
• Heterogeneous resources

Ray Architecture Overview
• The Global Control Store (GCS) contains all cluster and task state, and is the only truly stateful component of a Ray cluster
• A driver is a Ray client (application code) and is colocated with one or more Ray workers
• Each node in a Ray cluster has a raylet that (1) makes scheduling decisions for tasks created by local drivers/workers, and (2) facilitates object transfers between nodes
• The object store (Plasma) is used for persisting and communicating all task/actor inputs and outputs; workers only read from and write to their local object store, with the raylet facilitating transfers to other workers

Heterogeneous Resources
Ray can house disparate node types on a single cluster:
• Differing hardware (CPUs and GPUs)
• Differing number of CPU cores
• Differing amount of CPU memory
These heterogeneous resources are all schedulable at the task and actor level, enabling fine-grained resource provisioning
    Node 1: 8 vCPUs, 16 GiB RAM
    Node 2: 16 vCPUs, 4 GPUs, 64 GiB RAM
    Node 3: 32 vCPUs, 1 GPU, 128 GiB RAM

Distributed In-memory Object Store
Performance:
• Intermediate results/shuffled outputs are transparently cached in memory and transferred over the network
• Zero-copy reads on same-node workers via shared memory
Reliability:
• Ray scheduler limits how much total memory can be used by objects on a single node
• When the object store is full, objects are spilled to external storage

Scheduling
• Decentralized scheduling
• Hot node mitigation - will attempt to schedule tasks onto nodes with low memory utilization
• Locality-aware scheduling - will try to schedule tasks on the node with the most task dependency bytes already local
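The locality-aware rule above can be sketched as a toy placement function. This is not Ray's scheduler: `pick_node` and its dictionary inputs are hypothetical, and real scheduling also weighs resource availability and load. It only illustrates "choose the node that already holds the most bytes of the task's dependencies".

```python
def pick_node(task_deps, object_locations, object_sizes):
    """Return the node holding the most bytes of a task's dependencies.

    task_deps: object IDs the task needs.
    object_locations: object ID -> nodes that hold a copy.
    object_sizes: object ID -> size in bytes.
    Returns None when no dependency is local to any node.
    """
    local_bytes = {}
    for obj in task_deps:
        for node in object_locations.get(obj, ()):
            local_bytes[node] = local_bytes.get(node, 0) + object_sizes[obj]
    if not local_bytes:
        return None
    # Sort node names first so that ties are broken deterministically.
    return max(sorted(local_bytes), key=local_bytes.get)


# Hypothetical cluster state: node-1 holds both dependencies (150 bytes),
# node-2 holds only the smaller one (50 bytes).
locations = {"a": ["node-1"], "b": ["node-2", "node-1"]}
sizes = {"a": 100, "b": 50}
print(pick_node(["a", "b"], locations, sizes))
```

For a shuffle, this kind of rule matters because reducers consume large mapper outputs; running a reducer where most of its inputs already live avoids moving those bytes over the network.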

Ray Ecosystem
Data preprocessing, model training, and model serving, all on Ray

Ray Ecosystem
[Diagram: Native Libraries and 3rd Party Libraries built on Ray, a universal framework for distributed computing]

Shuffling Data Loader Design: API and Architecture

Shuffling Data Loader - Dataset API
Shuffling data loader exposes an iterable dataset API, which yields globally-shuffled GPU batches.
    from ray_shuffling_data_loader.dataset import ShufflingDataset
    ds = ShufflingDataset(
        filenames, num_epochs, num_trainers, batch_size, rank,
        num_reducers=num_reducers, max_concurrent_epochs=2)
    for epoch in range(num_epochs):
        ds.set_epoch(epoch)
        for batch_idx, batch in enumerate(ds):
            print(f"Batch: {batch_idx}")
• Dataset will kick off the shuffle in the background, pipelining shuffling with model training up to the throttle limit.
• Training data can be read from local disk, S3, HDFS, or anything else that fsspec supports.

Shuffling Data Loader - Torch Dataset API
Torch dataset API, for distributed training of a Torch model.
    from ray_shuffling_data_loader.torch_dataset import TorchShufflingDataset
    ds = TorchShufflingDataset(
        filenames, num_epochs, num_trainers, batch_size, rank,
        num_reducers, max_concurrent_epochs,
        feature_columns, feature_types, label_column, label_type=label_type)
    def _train(epoch):
        ds.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(ds):
            ...  # Training step for model.
• Dataset transforms GPU batches from Pandas DataFrames to PyTorch Tensors, using the provided feature and label spec.
• Seamlessly integrates with Torch training on Horovod-on-Ray.

Horovod
• A distributed deep learning training framework supporting TensorFlow, Keras, PyTorch, and Apache MXNet.
• Uses the MPI model for distributed training instead of the parameter server model (Distributed TensorFlow uses the latter).
• Focus on speed and ease of use: MPI/Gloo collective communication protocols, such as ring-allreduce, yield fast, scalable model training while requiring very little code change when transitioning to distributed training.

Horovod-on-Ray
You can run Horovod on a Ray cluster!
    import ray
    from horovod.ray import RayExecutor
    # Start the Ray cluster.
    ray.init()
    # Start num_workers actors on the cluster
    # using the standard RayExecutor.
    setting = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(
        setting, num_workers=num_workers, use_gpu=True)
    # This will launch the actors.
    executor.start()
    # Run the actual training function.
    executor.run(training_fn)

Shuffling Data Loader - Architecture
[Diagram: two Mappers, two Reducers, two Trainers]
Queue is streaming; trainers will pull data directly from reducers

Shuffling Data Loader - Architecture
[Diagram: two Mappers, two Reducers, two Trainers, Store, Queue]

Shuffling Data Loader - Architecture
[Diagram: Input, two Mappers, two Reducers, Output]
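The slides name the mapper and reducer stages but do not spell out what each does, so the following is a minimal single-process sketch of a common two-stage distributed shuffle, under that assumption: each mapper scatters the rows of one input partition across reducers at random, and each reducer merges its parts from every mapper and shuffles them locally. The function names and the use of plain lists in place of files and DataFrames are illustrative choices.

```python
import random


def shuffle_map(rows, num_reducers, rng):
    """Mapper: scatter each row of one input partition to a random reducer."""
    parts = [[] for _ in range(num_reducers)]
    for row in rows:
        parts[rng.randrange(num_reducers)].append(row)
    return parts


def shuffle_reduce(parts, rng):
    """Reducer: merge the parts sent by all mappers, then shuffle locally."""
    merged = [row for part in parts for row in part]
    rng.shuffle(merged)
    return merged


def shuffle_epoch(partitions, num_reducers, seed):
    """Run one map/reduce shuffle round and return one output per reducer."""
    rng = random.Random(seed)
    mapped = [shuffle_map(rows, num_reducers, rng) for rows in partitions]
    return [
        shuffle_reduce([mapper_out[r] for mapper_out in mapped], rng)
        for r in range(num_reducers)
    ]


# Two hypothetical input partitions, standing in for two input files.
partitions = [list(range(0, 50)), list(range(50, 100))]
outputs = shuffle_epoch(partitions, num_reducers=2, seed=0)
print([len(out) for out in outputs])
```

Because any row can land in any reducer, rows from different input files mix freely, which is what distinguishes this from a per-shard shuffle. In a distributed setting the mappers and reducers would run as parallel tasks, with the intermediate parts passed through a shared store.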

Optimizations
• Pipelining
• Prefetching
• Data locality

Pipelining
• Pipelines shuffling with model training, shuffling data for future epochs concurrently with training for the current epoch.
• If shuffle throughput >= training throughput, hot shuffled batches will be available for trainers (after the first epoch).
• Backpressure on the shuffle is enforced by the queue, adhering to the throttle limit: the maximum number of allowed concurrent epoch shuffles.
[Figure: 4 epochs of pipelined shuffling and training, at most 2 concurrent epoch shuffles]
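The throttled pipelining described above can be sketched with a background thread and a semaphore. This is a single-machine illustration under stated assumptions, not the data loader's implementation: `train_pipelined` is a hypothetical name, an in-memory list stands in for the dataset, and appending to a list stands in for the training loop.

```python
import queue
import random
import threading


def train_pipelined(rows, num_epochs, max_concurrent_epochs=2, seed=0):
    """Shuffle future epochs in the background while consuming the current one."""
    slots = threading.Semaphore(max_concurrent_epochs)  # throttle limit
    ready = queue.Queue()
    lock = threading.Lock()
    stats = {"in_flight": 0, "peak": 0}

    def shuffler():
        for epoch in range(num_epochs):
            slots.acquire()  # backpressure: wait while too many epochs are in flight
            with lock:
                stats["in_flight"] += 1
                stats["peak"] = max(stats["peak"], stats["in_flight"])
            shuffled = list(rows)
            random.Random(seed + epoch).shuffle(shuffled)
            ready.put((epoch, shuffled))

    worker = threading.Thread(target=shuffler, daemon=True)
    worker.start()

    seen = []
    for _ in range(num_epochs):
        epoch, shuffled = ready.get()  # blocks only if the shuffle is behind
        seen.append((epoch, shuffled))  # stand-in for the training loop
        with lock:
            stats["in_flight"] -= 1
        slots.release()  # epoch consumed: another shuffle may start
    worker.join()
    return seen, stats["peak"]


seen, peak = train_pipelined(list(range(20)), num_epochs=4)
print([epoch for epoch, _ in seen], "peak in-flight epochs:", peak)
```

The semaphore plays the role of the throttle limit: the shuffler can run ahead of training, but never by more than `max_concurrent_epochs` epochs, which bounds how much shuffled data is held at once.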

Queue Prefetching
• Trainers prefetch batches of shuffle outputs from the queue, pipelining the transfer of future batches to the trainer while the trainer is processing current batches.
• Decreases average and maximum batch wait time on trainers; prevents stragglers from slowing down training.
• Guarantees that straggler reducers won’t block other shuffle outputs; a trainer immediately gets a batch when one becomes available.
[Figure: trainer and queue timelines, synchronous vs. prefetching]
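Prefetching of this kind can be sketched as a generator that fills a small bounded buffer from a background thread. This is an illustrative stand-in, not the loader's code: `prefetch` and `depth` are hypothetical names, and any iterable plays the role of the queue of shuffle outputs.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the batch stream


def prefetch(batches, depth=2):
    """Yield items from `batches`, fetching up to `depth` items ahead."""
    buf = queue.Queue(maxsize=depth)

    def fill():
        for batch in batches:
            buf.put(batch)  # blocks once `depth` batches are waiting
        buf.put(_DONE)

    # Daemon thread, so an abandoned generator does not keep the process alive.
    threading.Thread(target=fill, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


for batch in prefetch(range(5), depth=2):
    # The next batches are being fetched while this one is processed.
    print("training on batch", batch)
```

While the consumer is busy with one batch, the background thread keeps pulling the next ones, so the time spent waiting on a batch overlaps with compute instead of adding to it.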

Data Locality (Coming Soon!)
• Currently exploring placing shuffle reducers on the same node as trainers.
• Should virtually eliminate the communication overhead of reducers transferring shuffle outputs to trainers, yielding higher throughput and fewer batch wait-time spikes on trainers.
[Diagram: Trainers and Reducers on the same node, connected via shared memory]

Ray-based solution addresses pitfalls
• ML framework-agnostic
• Shuffle is global
• Mitigates performance issues
• Runs on shared Ray cluster
• Easy to integrate with ML frameworks

Benchmarks

Scaling Cluster Size
Benchmark config: data read from S3; AWS EC2 i3.8xlarge (32 vCPU, 244GB RAM); 110GB object store per node
[Chart annotations: Average throughput: 4M records/s; Average throughput: 5.5M records/s]

Scaling Dataset Size
Benchmark config: data read from S3; AWS EC2 i3.8xlarge (32 vCPU, 244GB RAM); 110GB object store per node
[Chart annotations: Average throughput: 4M records/s; Average throughput: 5.5M records/s; Average throughput: 5M records/s]

User Stories Uber deep learning

Uber’s DL teams
Uber is currently experimenting with integrating the shuffling data loader into their DL pipelines. At their current data scale, we are able to consistently deliver globally-shuffled GPU batches to their trainers in under 100ms after the first epoch.

Takeaways
• Per-epoch shuffling is needed to avoid overfitting for many DL models
• Ray has a compelling data loader that seamlessly integrates with Torch training on Horovod-on-Ray
• Ray as a common substrate for your ML pipeline gives you best-in-class UX, performance, and ops

Connect with us
Ray: https://github.com/ray-project/ray
Data Loader: https://github.com/ray-project/ray_shuffling_data_loader
Horovod: https://github.com/horovod/horovod
Join the Ray Discussion Forum: https://discuss.ray.io/
We’re hiring! https://jobs.lever.co/anyscale

Thank You Enjoy the rest of the summit!
