Per-epoch Shuffling Data Loader: Mix It Up As You Train! (Clark Zinzow, Anyscale)

Shuffling training data, both before training and between epochs, helps prevent model overfitting: batches become more representative of the entire dataset (in batch gradient descent), and gradient updates on individual samples become independent of the sample ordering (within batches, or in stochastic gradient descent). The end result of high-quality per-epoch shuffling is better model accuracy after a set number of epochs. When the training dataset is small and the model is trained on a single GPU, shuffling the entire dataset before each epoch is easy. When the dataset is large or the model is trained in parallel on multiple GPUs, however, data loading and shuffling time can dominate training time, even with each trainer operating on a manageable subset of the training data. Many data loaders therefore compromise the quality of their per-epoch shuffle to save on data loading time, shuffling only within the trainer's data partition or not shuffling at all.
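
As a toy illustration of the difference between a global shuffle and a partition-local shuffle (the names and sizes below are made up for illustration; this is not the loader described in the talk):

    import numpy as np

    data = np.arange(12)                   # stand-in for 12 training samples
    partitions = np.split(data, 2)         # each trainer's fixed partition

    # Partition-local shuffle: samples never move between trainers.
    local_epoch = [np.random.permutation(p) for p in partitions]

    # Global shuffle: every sample can land in any trainer's partition each epoch.
    global_epoch = np.split(np.random.permutation(data), 2)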

In this talk, we will go over our Ray-based per-epoch shuffling data loader, which delivers globally shuffled batches at high throughput to dozens of trainers via an easy-to-use iterable dataset interface. When paired with Horovod-on-Ray, you get distributed model training with high-throughput shuffled data loading, all running on a fast, reliable, resource-efficient Ray cluster!

Anyscale

July 15, 2021

Transcript

  1. Per-Epoch Shuffling Data Loader Mix it up as you train!

    Clark Zinzow Software Engineer @ Anyscale
  2. What’s this talk about?

    The need for and challenges of large-scale shuffled data loading for distributed model training, and a compelling Ray-based solution.
  3. A bit about me

    Software engineer at Anyscale, working on the core Ray system and focusing on supporting large-scale data processing. Before Anyscale, I was a Ray user!
  4. Data Preprocessing + Model Training

    [Diagram: data preprocessing -> storing in storage (S3, HDFS) -> loading into model training workers] Loading data from storage into the model trainers is simple, right?
  5. Not so simple!

    With large data scales and distributed data-parallel training, data loading can become a significant training bottleneck!
  6. We’re not just loading the data once...

    for epoch in range(num_epochs):
        for data, target in load_data():
            train(data, target)
    We’re loading data from storage (S3, HDFS) into the model training workers at the start of every epoch: load data for epoch 1, epoch 2, epoch 3, ...
  7. We’re not just loading data into a single trainer...

    We’re loading partitions of data from storage (S3, HDFS) into multiple trainers (Trainer #1, #2, #3), spanning multiple machines.
  8. We’re not just loading data...

    We’re also shuffling the data at the start of every epoch: a shuffler sits between storage (S3, HDFS) and the trainers (Trainer #1, #2, #3).
  9. How does model training benefit?

    Why do we want to shuffle the data at the start of every epoch?
  10. Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting

    • Batch gradient descent - batches are more representative of the entire dataset, improving the estimate of the “true” full-dataset gradient
    • Batch/stochastic gradient descent - gradient updates on individual samples are independent of sample ordering
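
    A minimal illustration (made-up sizes, not from the talk) of how a fresh per-epoch permutation changes which samples land in each batch:

    import numpy as np

    num_samples, batch_size = 8, 4
    X = np.arange(num_samples)                        # stand-in for training samples

    for epoch in range(2):
        order = np.random.permutation(num_samples)    # fresh shuffle every epoch
        for start in range(0, num_samples, batch_size):
            batch = X[order[start:start + batch_size]]
            print(epoch, batch)                       # each epoch mixes samples differently
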
  11. End result: better model accuracy after a set number of epochs

    You can think of it as per-epoch shuffling improving the statistical gain of each step in the training process.
  12. What are the challenges in creating a shuffling data loader?

    Large data scale and supporting distributed data-parallel training make things more difficult (and fun).
  13. You want:

    • High-quality, global shuffle
    • No significant increase in training time
    • Convenient integration with ML frameworks
  14. You need to balance...

    obtaining a high-quality, global shuffle...
    without significantly increasing training time for each epoch due to trainers waiting for data to be shuffled...
    without hogging significant cluster resources to run the shuffle...
    without devoting significant engineering resources to developing a complex data loading and distributed shuffle strategy...
    while making it easy to integrate with model training frameworks and infrastructure.
  15. PyTorch

    data = datasets.MNIST(
        root="data", download=True, transform=ToTensor())
    sampler = BatchSampler(
        DistributedSampler(data), batch_size, drop_last=False)
    dataloader = DataLoader(
        data, batch_sampler=sampler)
    Pitfalls:
    • PyTorch-specific, locked in to the framework
    • Iterable dataset shuffle is per-shard, not global
    • For map-style datasets + a distributed sampler, the main process has to coordinate shuffling and data loading
    • Reports of performance issues with multiprocessing parallel data loading
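
    For per-epoch reshuffling with the map-style dataset + distributed sampler approach above, the sampler's epoch has to be set each epoch. A minimal sketch, assuming an already-initialized torch.distributed process group and the data, batch_size, and num_epochs names from above:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    sampler = DistributedSampler(data, shuffle=True)   # requires torch.distributed to be initialized
    loader = DataLoader(data, batch_size=batch_size, sampler=sampler)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)    # reseed the shuffle; otherwise every epoch repeats the same order
        for batch, target in loader:
            pass                    # training step goes here
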
  16. TensorFlow

    ds = tfds.load(
        'mnist', split='train', shuffle_files=True)
    ds = ds.shuffle(
        shuffle_buffer_size, reshuffle_each_iteration=True)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    ds = ds.repeat(num_epochs)
    Pitfalls:
    • TensorFlow-specific, locked in to the framework
    • Shuffle is per-shard, not global
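
    In a distributed setting the per-shard limitation shows up once each worker takes its shard before shuffling. A hypothetical sketch (num_workers, worker_index, shuffle_buffer_size, and batch_size are assumed variables, not from the slide):

    import tensorflow as tf
    import tensorflow_datasets as tfds

    ds = tfds.load('mnist', split='train', shuffle_files=True)
    ds = ds.shard(num_shards=num_workers, index=worker_index)   # fixed per-worker partition
    ds = ds.shuffle(shuffle_buffer_size, reshuffle_each_iteration=True)
    ds = ds.batch(batch_size)
    # The shuffle buffer only ever mixes samples from this worker's shard,
    # so no epoch produces a truly global shuffle across workers.
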
  17. Spark

    df = spark.read.parquet(
        "/some/file.parquet")
    # Global row shuffle via a random sort.
    df.orderBy(pyspark.sql.functions.rand()).collect()
    Pitfalls:
    • Requires a separate cluster, adding to infra ops
    • Difficult to integrate with training within ML frameworks
    • Hard to pipeline shuffling with training
    • Shuffling all epochs in advance can be expensive
  18. Petastorm

    make_batch_reader(
        "/some/file.parquet", num_epochs=num_epochs)
    Pitfalls:
    • Pseudo-global shuffle isn't truly global: it shuffles row groups/partitions instead of individual rows
    • Pseudo-global shuffle produces spikes in batch wait times for trainers, resulting in longer training times
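
    A toy, plain-Python sketch (not Petastorm's API) of why a row-group-level shuffle is only pseudo-global: the order of groups changes, but samples that share a row group always stay adjacent.

    import random

    row_groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # samples grouped on disk
    random.shuffle(row_groups)                       # shuffle groups, not rows
    epoch_order = [sample for group in row_groups for sample in group]
    # e.g. [6, 7, 8, 0, 1, 2, 3, 4, 5]: neighbors within a group never separate
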
  19. Existing solutions and pitfalls

    [Table: solutions vs. pitfalls] The pitfalls across the existing solutions: ML framework-specific; shuffle is not global; reports of performance issues; requires separate infrastructure; difficult to integrate with ML frameworks.
  20. Uber’s DL data loading issues - latency spikes

    With global shuffling, Uber experienced data loading latency spikes and throughput instability.
  21. Uber’s DL data loading issues - training loss oscillation

    Without global shuffling, the statistical sampling of the training data was poor, resulting in large oscillations in training loss.
  22. Uber’s DL data loading issues - model accuracy

    Both issues negatively affected model accuracy within a set time window.
  23. NVIDIA’s data loading issue - hanging

    • PyTorch’s multiprocessing data loader occasionally hangs, hurting training times
    • NVIDIA is training small models that are IO-bound, so data loading performance is important
    • A simple Ray-based data loader (a drop-in replacement for multiprocessing) achieves higher throughput than TensorFlow’s data loader and matches PyTorch’s, without the occasional hangs
  24. Ray Architecture Overview

    • The Global Control Store (GCS) contains all cluster/task state and is the only truly stateful component of a Ray cluster
    • A driver is a Ray client (application code) and is colocated with one or more Ray workers
    • Each node in a Ray cluster has a raylet that (1) makes scheduling decisions for tasks created by local drivers/workers, and (2) facilitates object transfers between nodes
    • The object store (Plasma) is used for persisting and communicating all task/actor inputs and outputs; workers only read/write from/to their local object store, with the raylet facilitating transfers to other workers
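
    A minimal sketch of the pieces just described, using the standard Ray task API (illustrative only, not part of the shuffling data loader):

    import ray

    ray.init()

    @ray.remote
    def double(x):
        return x * 2

    # The task's result lands in the object store of the node that ran it;
    # ray.get reads it back (zero-copy on the same node, raylet transfer otherwise).
    ref = double.remote(21)
    print(ray.get(ref))   # -> 42
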
  25. Heterogeneous Resources

    Ray can house disparate node types in a single cluster:
    • Differing hardware (CPUs and GPUs)
    • Differing numbers of CPU cores
    • Differing amounts of CPU memory
    These heterogeneous resources are all schedulable at the task and actor level, enabling fine-grained resource provisioning. Example: Node 1 (8 vCPUs, 16 GiB RAM), Node 2 (16 vCPUs, 4 GPUs, 64 GiB RAM), Node 3 (32 vCPUs, 1 GPU, 128 GiB RAM).
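
    A short sketch of per-task and per-actor resource requests on such a cluster (generic Ray API; the function, class, and resource figures are made up for illustration):

    import ray

    @ray.remote(num_cpus=4)
    def shuffle_map(block):
        return sorted(block)          # placeholder for CPU-bound shuffle work

    @ray.remote(num_gpus=1)
    class Trainer:
        def train_step(self, batch):
            return len(batch)         # placeholder for GPU-bound training work
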
  26. Distributed In-memory Object Store

    Performance:
    • Intermediate results/shuffled outputs are transparently cached in memory and transferred over the network
    • Zero-copy reads for same-node workers via shared memory
    Reliability:
    • The Ray scheduler limits how much total memory can be used by objects on a single node
    • When the object store is full, objects are spilled to external storage
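
    A minimal sketch of object-store usage (generic Ray API, illustrative only): a large array put into the store is read zero-copy by same-node workers, while remote readers trigger a raylet-managed transfer.

    import numpy as np
    import ray

    ray.init()

    shuffled_block = np.random.permutation(1_000_000)
    ref = ray.put(shuffled_block)        # stored once in the local object store

    @ray.remote
    def consume(block):
        return block[:4].tolist()        # zero-copy read if on the same node

    print(ray.get(consume.remote(ref)))
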
  27. Scheduling

    • Decentralized scheduling
    • Hot node mitigation - Ray will attempt to schedule tasks onto nodes with low memory utilization
    • Locality-aware scheduling - Ray will try to schedule tasks on the node with the most task-dependency bytes already local
  28. Shuffling Data Loader - Dataset API

    The shuffling data loader exposes an iterable dataset API, which yields globally-shuffled GPU batches.
    from ray_shuffling_data_loader.dataset \
        import ShufflingDataset
    ds = ShufflingDataset(
        filenames, num_epochs, num_trainers,
        batch_size, rank,
        num_reducers=num_reducers,
        max_concurrent_epochs=2)
    for epoch in range(num_epochs):
        ds.set_epoch(epoch)
        for batch_idx, batch in enumerate(ds):
            print(f"Batch: {batch_idx}")
    The dataset kicks off the shuffle in the background, pipelining shuffling with model training up to the throttle limit. Training data can be read from local disk, S3, HDFS - anything that fsspec supports.
  29. Shuffling Data Loader - Torch Dataset API

    A Torch dataset API, for distributed training of Torch models.
    from ray_shuffling_data_loader.torch_dataset \
        import TorchShufflingDataset
    ds = TorchShufflingDataset(
        filenames, num_epochs, num_trainers,
        batch_size, rank, num_reducers,
        max_concurrent_epochs,
        feature_columns, feature_types,
        label_column, label_type=label_type)
    def _train(epoch):
        ds.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(ds):
            ...  # Training step for model.
    The dataset transforms GPU batches from Pandas DataFrames into PyTorch tensors, using the provided feature and label spec. It seamlessly integrates with Torch training on Horovod-on-Ray.
  30. Horovod

    A distributed deep learning training framework supporting TensorFlow, Keras, PyTorch, and Apache MXNet. It uses the MPI model for distributed training instead of the parameter server model (Distributed TensorFlow uses the latter). The focus is on speed and ease of use: MPI/Gloo collective communication protocols, such as ring-allreduce, yield fast, scalable model training while requiring very little code change when transitioning to distributed training.
  31. Horovod-on-Ray

    You can run Horovod on a Ray cluster!
    import ray
    from horovod.ray import RayExecutor
    # Start the Ray cluster.
    ray.init()
    # Start num_workers actors on the cluster
    # using the standard RayExecutor.
    executor = RayExecutor(
        setting, num_workers=num_workers, use_gpu=True)
    # This will launch the actors.
    executor.start()
    # Run the actual training function.
    executor.run(training_fn)
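
    A hypothetical training_fn (an assumption for illustration, not from the slide) sketching what executor.run would invoke on each worker, using Horovod's standard PyTorch API:

    import horovod.torch as hvd
    import torch

    def training_fn():
        hvd.init()                                   # one Horovod process per Ray actor
        torch.cuda.set_device(hvd.local_rank())      # pin each worker to its GPU
        model = torch.nn.Linear(10, 1).cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
        optimizer = hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters())
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        # A per-epoch loop pulling globally shuffled batches would go here.
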
  32. Shuffling Data Loader - Architecture

    [Diagram: mappers -> reducers -> queue -> trainers] The queue is streaming; trainers pull data directly from the reducers.
  33. Pipelining

    Pipelines shuffling with model training, shuffling data for future epochs concurrently with training on the current epoch. If shuffle throughput >= training throughput, hot shuffled batches will be available for trainers (after the first epoch). Backpressure on the shuffle is enforced by the queue, adhering to the throttle limit: the maximum number of allowed concurrent epoch shuffles. [Timeline: 4 epochs of pipelined shuffling and training, with at most 2 concurrent epoch shuffles]
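
    A toy sketch of the throttling idea using threads and a bounded queue (an illustration, not the loader's actual implementation): shuffles for future epochs run ahead of training, but at most max_concurrent_epochs are in flight at once.

    import queue
    import threading

    max_concurrent_epochs = 2
    epoch_queue = queue.Queue(maxsize=max_concurrent_epochs)   # enforces backpressure

    def shuffle_epochs(num_epochs):
        for epoch in range(num_epochs):
            shuffled = f"shuffled batches for epoch {epoch}"   # stand-in for a real shuffle
            epoch_queue.put(shuffled)     # blocks once the throttle limit is reached

    def train(num_epochs):
        for _ in range(num_epochs):
            batches = epoch_queue.get()   # hot batches whenever shuffling keeps up
            print("training on", batches)

    threading.Thread(target=shuffle_epochs, args=(4,), daemon=True).start()
    train(4)
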
  34. Queue Prefetching

    Trainers prefetch batches of shuffle outputs from the queue, pipelining the transfer of future batches to the trainer while the trainer is processing the current batches. This decreases the average and max batch wait-time on trainers and prevents stragglers from slowing down training. It also guarantees that straggler reducers won’t block other shuffle outputs; the trainer gets a batch as soon as one becomes available. [Diagram: synchronous fetching vs. prefetching from the queue]
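
    A generic prefetching-iterator sketch (an assumption for illustration, not the loader's code): a background thread keeps pulling upcoming batches while the caller trains on the current one, smoothing out per-batch wait times.

    import queue
    import threading

    def prefetch(batch_iter, depth=4):
        buf = queue.Queue(maxsize=depth)
        done = object()                  # sentinel marking the end of the stream

        def worker():
            for batch in batch_iter:
                buf.put(batch)
            buf.put(done)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            batch = buf.get()
            if batch is done:
                return
            yield batch

    # Usage: for batch in prefetch(shuffled_batches): train_step(batch)
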
  35. Data Locality (Coming Soon!)

    We are currently exploring placing shuffle reducers on the same node as trainers, sharing memory. This should virtually eliminate the communication overhead of reducers transferring shuffle outputs to trainers, yielding higher throughput and fewer batch wait-time spikes on trainers.
  36. Ray-based solution addresses pitfalls

    • ML framework-agnostic
    • Shuffle is global
    • Mitigates performance issues
    • Runs on a shared Ray cluster
    • Easy to integrate with ML frameworks
  37. Scaling Cluster Size

    Average throughput across the benchmarked cluster sizes: 4M records/s and 5.5M records/s. Benchmark config: data read from S3; AWS EC2 i3.8xlarge (32 vCPU, 244 GB RAM); 110 GB object store per node.
  38. Scaling Dataset Size

    Average throughput across the benchmarked dataset sizes: 4M, 5.5M, and 5M records/s. Benchmark config: data read from S3; AWS EC2 i3.8xlarge (32 vCPU, 244 GB RAM); 110 GB object store per node.
  39. Uber’s DL teams

    Uber is currently experimenting with integrating the shuffling data loader into their DL pipelines. At their current data scale, we are able to consistently deliver globally-shuffled GPU batches to their trainers in under 100 ms after the first epoch.
  40. Takeaways

    • Per-epoch shuffling is needed to avoid overfitting for many DL models
    • Ray has a compelling data loader that seamlessly integrates with Torch training on Horovod-on-Ray
    • Ray as a common substrate for your ML pipeline gives you best-in-class UX, performance, and ops