How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery (Richard Decal, Dendra Systems)

This talk outlines how Dendra Systems leverages Ray and Anyscale to parallelize its workloads across a cluster containing dozens of GPUs in a single Python script. Richard discusses how Dendra optimized its inference pipelines to saturate the cluster's network I/O limits using Ray Serve.
Richard then describes how Anyscale makes it easy to run this Ray application in production: it runs seamlessly in Dendra's Bitbucket CI and also supports a microservice that launches jobs programmatically.

Anyscale

July 16, 2021

Transcript

  1. How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery

    Richard Decal, Lead ML Engineer
  2. Traditional methods of ecological restoration do not scale

    • Invasive weeds monitoring: person on truck
    • Ecosystem restoration: manual, little automation, or optimized for monoculture
    • Ecosystem analytics: ground surveys
    Source: Wikipedia
  3. Dendra’s mission is to create the tools needed to power scalable ecosystem restoration

    Advantages:
    • Scale: 11x faster
    • Cost: 3x cheaper
    • Safe: steep & unstable ground
    • Targeted
    • Robust
    We're a global team of data ecologists, engineers and drone operators who come together to provide a holistic system designed to restore native lands.
  4. Dendra Technologies and Restoration

    • Native species & tree stem density: Track native species richness and tree stem density to understand land rehabilitation progression.
    • Fauna & habitat: Monitor fauna to validate success through the native animals that return.
    • Biodiversity assessment: Assess biodiversity to identify the most suitable species to thrive in the ecosystem and ensure general monitoring.
    [Image labels: species quantification; fauna paths and migration routes; goanna; snake; native species assessment]
  5. Dendra Technologies and Restoration

    • Pest monitoring: Track known and unknown pests over an ecosystem to facilitate intervention.
    • Legacy infrastructure: Monitor legacy infrastructure to clean the land.
    • Weed management: Identify invasive weed coverage and monitor progress for transparency and financial forecasting.
    [Image labels: priority heatmaps; weed species; feral cat; feral goat; dog tracks; boat; car, wheel & other; whole site weed mapping; ant nest]
  6. Dendra Technologies and Restoration

    • Erosion and soil health monitoring: Quantify erosion risk, alert of safety issues, and take early action (prevention is easier and cheaper than cure).
    [Image labels: tunneling (RGB and DEM); mass movement (RGB and DEM); gully; pooling]
    Scale: 400 soccer fields of imagery per drone per day
  7. Action

    Each seeding drone can reseed up to 60 hectares per day, carrying 700 kg of daily payload (2 polar bears' worth). Tailored seed mixtures (>50 species). 10 drones flying in a swarm could plant as many as 300,000 trees per day.
  8. About me

    • ML engineer on a personal and professional mission to fight climate change.
    • Former molecular biologist.
    • Lead ML Engineer at Dendra. Founded the ML team.
    • Ray user since version 0.6.4.
    • Experience with MapReduce and PySpark, but not distributed applications.
  9. Journey 1: Tuning deep neural networks at scale

    • First attempt was using AWS SageMaker.
      ◦ Pro: Great if following templates and using off-the-shelf solutions.
      ◦ Con: Too inflexible for bringing our own specialized models and our in-house platform.
    • Discovered Ray Tune.
  10. Ray Tune saves time and money in hyperparameter tournaments

    [Chart: loss (lower is better) against time (training epochs), with early-terminated trials marked. 16 of 24 hyperparameter configurations terminated after 1 epoch.]
    Porting our PyTorch code to Ray SGD was easy. It enables training to scale from my laptop to dozens of GPUs.
    It costs hundreds of dollars to train a single hyperparameter trial to completion. The Ray Tune scheduler algorithm aggressively terminates under-performing trials rather than training them to completion.
    - This saves us money, and enables us to search a larger hyperparameter space since bad samples are cheap ($17).
    - Cost without terminations: $21,000 (estimated). Cost with termination: $3,740.
    - With trial resuming, we can do coarse-to-fine hyperparameter searches.
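The early-termination idea on the Ray Tune slide can be sketched without Ray. What follows is a minimal successive-halving loop in plain Python, offered as an illustration only: it is not Ray Tune's scheduler, and `successive_halving`, the toy `loss_after` curve, and every parameter value are hypothetical. The last lines work out the savings implied by the two dollar figures quoted on the slide.

```python
import math


def loss_after(config: float, epochs: int) -> float:
    """Toy loss curve (an assumption for illustration): configs closer
    to 0.3 reach a lower loss, and loss shrinks with more epochs."""
    return abs(config - 0.3) + 1.0 / (1 + epochs)


def successive_halving(configs, max_epochs=8):
    """Train all trials briefly, keep the better half, double the budget.

    Returns (best_config, total_epochs_spent). A simplified stand-in
    for an early-stopping scheduler, not Ray Tune's implementation.
    """
    survivors = list(configs)
    epochs, trained, spent_total = 1, 0, 0
    while True:
        # Only the epochs added since the previous round are paid for,
        # which models resuming surviving trials from a checkpoint.
        spent_total += len(survivors) * (epochs - trained)
        trained = epochs
        if len(survivors) == 1 or epochs >= max_epochs:
            break
        survivors.sort(key=lambda c: loss_after(c, epochs))
        survivors = survivors[: math.ceil(len(survivors) / 2)]
        epochs = min(epochs * 2, max_epochs)
    best = min(survivors, key=lambda c: loss_after(c, epochs))
    return best, spent_total


configs = [i / 24 for i in range(24)]  # 24 hypothetical trials
best, spent = successive_halving(configs)
full_cost = len(configs) * 8  # epochs if every trial ran to completion

# Savings implied by the slide's own figures ($21,000 vs. $3,740).
saved_fraction = 1 - 3740 / 21000
print(f"best config: {best:.3f}, epochs spent: {spent} of {full_cost}")
print(f"cost reduction from the slide's figures: {saved_fraction:.0%}")
```

In this toy run, half of the trials are dropped after a single epoch, so most of the budget goes to the few promising configurations. The slide's two figures correspond to a cost reduction of roughly 82%.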
  11. Journey 2: Scaling inference workloads to millions of images

    Requirements:
    • Scale: a serving solution that can scale to hundreds of millions of images in future.
      ◦ Parallelize across multiple workers.
    • Maximize GPU utilization: network I/O is the bottleneck. How do we scale as we add or remove workers from our cluster?
  12. Saturating network I/O in the inference pipeline

    [Diagram labels: head node; worker nodes; Ray distributed queue actor; S3 client actors; model replica; opportunistic batching; AWS Kinesis Firehose; post-processing, batching, sharding. Data flow labels: image S3 URLs; images; streaming results; Parquet shards.]
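The slide names opportunistic batching without showing how it is implemented. One common reading of the term is: wait for the first item, then take whatever else is already queued instead of waiting for a full batch. The sketch below illustrates that reading with the standard library's `queue.Queue` standing in for the Ray distributed queue actor; `take_batch`, its parameters, and the example bucket name are hypothetical and not taken from the talk.

```python
import queue


def take_batch(q: queue.Queue, max_batch_size: int, first_timeout: float = 1.0):
    """Block for the first item, then grab whatever else is already
    queued (up to max_batch_size) without waiting for a full batch."""
    try:
        batch = [q.get(timeout=first_timeout)]
    except queue.Empty:
        return []  # no work arrived within the timeout
    while len(batch) < max_batch_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break  # nothing else is ready: run with a partial batch
    return batch


# Hypothetical image keys standing in for the S3 URLs in the diagram.
work = queue.Queue()
for i in range(10):
    work.put(f"s3://example-bucket/tile_{i}.png")

batches = []
while not work.empty():
    batches.append(take_batch(work, max_batch_size=4))

print([len(b) for b in batches])
```

The design choice is that a model replica never idles waiting to fill a batch: under heavy load batches are full, and under light load latency stays low because partial batches run immediately.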
  13. Why we became Anyscale customers

    Open source Ray is great, but it lacks:
    - Administration: robust and programmatic control of clusters
      - The ability to start and stop clusters easily is necessary for us to scale to support more customers
    - Performance: fast start times and scaling
      - Keeping EC2 instances around is expensive, and the EBS volumes of stopped instances still cost money
      - No elastic scaling
      - If the setup changes, we need to destroy nodes and rebuild, or build a Docker CI pipeline
    Anyscale solves these problems with:
    (1) The Anyscale SDK
      (a) APIs for organizing, creating, and automating everything around cluster management
    (2) Managed Application Images
      (a) Simple automation capabilities for integrating into CI pipelines to move to production
  14. Journey 3: Continuously testing pipelines that require GPUs

    • Goal: to test training and inference pipelines. Make sure we can overfit on a single image, etc.
    • Problem: Bitbucket CI Pipelines do not have access to GPUs.
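Overfitting on a single example is a common smoke test for a training pipeline: if the loss cannot be driven to near zero on one sample, something in the data path, model, or optimizer is broken. The sketch below shows the shape of such a test in plain Python, with a toy linear model standing in for the real network. The function names, the three-value "image", and the thresholds are all hypothetical; the talk does not show Dendra's actual tests.

```python
def train_on_single_example(x, y, steps=200, lr=0.1):
    """Fit y ≈ w·x + b to ONE example with plain gradient descent and
    return the squared-error loss recorded at every step."""
    w = [0.0] * len(x)
    b = 0.0
    losses = []
    for _ in range(steps):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        losses.append(err * err)
        # Gradient of err**2 with respect to each parameter.
        w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
        b -= lr * 2 * err
    return losses


def test_can_overfit_single_example():
    # Hypothetical "image": three pixel values and a scalar target.
    losses = train_on_single_example(x=[0.2, 0.5, 0.1], y=1.0)
    assert losses[-1] < 1e-6, "model failed to memorize one example"
    assert losses[-1] < losses[0]


test_can_overfit_single_example()
```

A real version of this test would train the production model on one image, which is why it needs a GPU that the CI runner itself does not have.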
  15. Continuous integration pipelines on Anyscale GPUs

    Solution: extend our CI tests using the Anyscale SDK to launch sessions.
    [Screenshot label: tests pass]
  16. Journey 4: Programmatically standing up clusters and running jobs

    • Problems:
      ◦ Running jobs manually is tedious, slow, and error-prone.
      ◦ Setting up a cluster can take >25 minutes due to many dependencies.
    • Goal: to expose Anyscale apps via a simple API so that our systems are decoupled.
  17. Future: Programmatic inference runs

    Solution:
    - Ray Serve microservice to launch inference clusters (via Anyscale SDK)
    - Hot start application using Anyscale AppConfig registry for pre-compiled app images.
    [Diagram labels: myDendra; users; results in S3, webhook alert; tests pass]
  18. Conclusions

    • Ray Tune enables us to conduct hyperparameter tournaments on large search spaces, and can scale without changing any code.
    • Ray Serve allows us to future-proof our inference pipelines to work at any scale, and we can hot-start clusters on demand.
    • Anyscale makes it much easier to interact with Ray clusters.
    • The Anyscale SDK enables regular testing of mission-critical ML pipelines with Bitbucket CI integration.
    • Ray core is much more friendly than Python’s multiprocessing and is easier to debug than PySpark. It makes workloads faster on your laptop through easy vertical scaling, and can scale to any limit by horizontal scaling without changing any code.
  19. THANK YOU!

    www.dendra.io @DendraSystems
    www.richarddecal.com @AIJedi
    Acknowledgements: Will, Bill, Simon, Ian, Charles, Amog, Ed, Richard, Sang from the Anyscale team