How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery (Richard Decal, Dendra Systems)

How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery
Richard Decal, Lead ML Engineer

Current rate of ecological destruction is unparalleled
2B hectares of degraded land
Image sources: NYT, NZ forestry

Traditional methods of ecological restoration do not scale
• Invasive weeds monitoring: Person on truck
• Ecosystem restoration: Manual, little automation, or optimized for monoculture
• Ecosystem analytics: Ground surveys
Source: Wikipedia

Dendra’s mission is to create the tools needed to power scalable ecosystem restoration
We're a global team of data ecologists, engineers and drone operators who come together to provide a holistic system designed to restore native lands.
Advantages
• Scale: 11x faster
• Cost: 3x cheaper
• Safe: Steep & unstable ground
• Targeted
• Robust

Dendra Technologies and Restoration
• Native species & Tree stem density: Track native species richness and tree stem density to understand land rehabilitation progression.
• Fauna & Habitat: Monitor fauna to validate success through the native animals that return.
• Biodiversity assessment: Assess biodiversity to identify the most suitable species to thrive in the ecosystem and ensure general monitoring.
Image labels: Species quantification; Fauna paths and migration routes; Goanna; Snake; Native species assessment

Dendra Technologies and Restoration
• Pest monitoring: Track known and unknown pests over an ecosystem to facilitate intervention.
• Legacy infrastructure: Monitor legacy infrastructure to clean the land.
• Weed management: Identify invasive weed coverage and monitor progress for transparency and financial forecasting.
Image labels: Priority heatmaps; Weed species; Feral cat; Feral goat; Dog tracks; Boat; Car, wheel & other; Whole site weed mapping; Ant nest

Dendra Technologies and Restoration
• Erosion and soil health monitoring: Quantify erosion risk, alert of safety issues, and take early action (prevention is easier and cheaper than cure).
• Scale: 400 soccer fields of imagery per drone per day
Image labels: Tunneling RGB; Tunneling DEM; Mass movement RGB; Mass movement DEM; Gully; Pooling

Action
• Each seeding drone can reseed up to 60 hectares per day, carrying 700 kg of daily payload (2 polar bears worth).
• Tailored seed mixtures (>50 species).
• 10 drones flying in a swarm could plant as many as 300,000 trees per day.

About me
• ML engineer on a personal and professional mission to fight climate change.
• Former molecular biologist.
• Lead ML Engineer at Dendra. Founded ML team.
• Ray user since version 0.6.4
• Experience with MapReduce and PySpark, but not distributed applications.

Map of Dendra’s ML Platform
Diagram labels: Feature engineering; Hyperparameter tournaments; Inference; Testing; Hypothesis (with Ray scheduler)

Journey 1: Tuning deep neural networks at scale
• First attempt was using AWS SageMaker.
  ◦ Pro: Great if following templates and using off-the-shelf solutions
  ◦ Con: Too inflexible for bringing our own specialized models and our in-house platform.
• Discovered Ray Tune

Ray Tune saves time and money in hyperparameter tournaments
• Porting our PyTorch code to Ray SGD was easy. Enables training to scale from my laptop to dozens of GPUs.
• It costs hundreds of dollars to train a single hyperparameter trial to completion.
• Ray Tune scheduler algorithm aggressively terminates under-performing trials rather than training to completion.
  - This saves us money, and enables us to search a larger hyperparameter space since bad samples are cheap ($17)
  - Cost without terminations: $21,000 (estimated). Cost with termination: $3,740.
  - With trial resuming, we can do coarse-to-fine hyperparameter searches.
Plot: Loss (lower is better) vs. time (training epochs), with early terminated trials marked. 16 of 24 hyperparameter configurations terminated after 1 epoch.
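The cost figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. It assumes that the $17 figure is the cost of one trial stopped after a single epoch, which the slide implies but does not state outright. The variable names are illustrative.

```python
# Figures quoted on the slide (USD).
TOTAL_TRIALS = 24
COST_WITHOUT_TERMINATION = 21_000  # estimated on the slide
COST_WITH_TERMINATION = 3_740
TERMINATED_AFTER_ONE_EPOCH = 16
COST_PER_EARLY_TERMINATED_TRIAL = 17  # assumption: cost of a 1-epoch trial

# Implied average cost of training one trial to completion.
full_trial_cost = COST_WITHOUT_TERMINATION / TOTAL_TRIALS

# Spend on the trials stopped after one epoch, and on all the others.
early_spend = TERMINATED_AFTER_ONE_EPOCH * COST_PER_EARLY_TERMINATED_TRIAL
remaining_spend = COST_WITH_TERMINATION - early_spend

# Overall saving from early termination.
savings = COST_WITHOUT_TERMINATION - COST_WITH_TERMINATION
savings_fraction = savings / COST_WITHOUT_TERMINATION

print(f"Average full trial cost: ${full_trial_cost:,.0f}")
print(f"Spend on early-terminated trials: ${early_spend:,}")
print(f"Spend on remaining trials: ${remaining_spend:,}")
print(f"Savings: ${savings:,} ({savings_fraction:.0%})")
```

Under these assumptions, a full trial averages $875, which is consistent with "hundreds of dollars" per completed trial, and early termination cuts the tournament cost by roughly 82%.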

Journey 2: Scaling inference workloads to millions of images
Requirements
• Scale: serving solution that can scale to hundreds of millions of images in future
  ◦ parallelize across multiple workers
• Maximize GPU utilization: Network I/O bottleneck.
• How to scale as we add or remove workers from our cluster.

Saturating Network I/O in inference pipeline
Diagram labels: Head node; Worker nodes; Ray distributed queue actor; S3 Client Actor (three shown); Model replica; opportunistic batching; AWS Kinesis Firehose; Post processing, batching, sharding
Data flow labels: Image S3 URLs; Images; Streaming results; Parquet shards
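The "opportunistic batching" label in this diagram can be illustrated with a minimal sketch. It assumes the model replica waits for at least one image and then takes whatever else is already queued, up to a maximum batch size, instead of waiting for a full batch. The standard-library `queue.Queue` stands in for the Ray distributed queue actor, and `take_batch` and its parameters are hypothetical names, not taken from Dendra's code.

```python
import queue
from typing import Any, List


def take_batch(q: queue.Queue, max_batch_size: int,
               first_item_timeout: float = 1.0) -> List[Any]:
    """Wait for one item, then drain whatever is already queued (up to the cap)."""
    batch: List[Any] = []
    try:
        # Block only for the first item so an idle consumer does not spin.
        batch.append(q.get(timeout=first_item_timeout))
    except queue.Empty:
        return batch  # nothing arrived in time
    while len(batch) < max_batch_size:
        try:
            # Never wait for more items: take only what is available now.
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


# Usage: five images queued, batches capped at four.
image_queue: queue.Queue = queue.Queue()
for i in range(5):
    image_queue.put(f"image-{i}")

first = take_batch(image_queue, max_batch_size=4)
second = take_batch(image_queue, max_batch_size=4)
third = take_batch(image_queue, max_batch_size=4, first_item_timeout=0.01)
print(len(first), len(second), len(third))
```

The design trade-off is that batches are full when downloads outpace the GPU and small when they lag, so the GPU is never held idle waiting for a batch to fill.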

Why we became Anyscale customers
Open source Ray is great, but it lacks:
- Administration: Robust and programmatic control of clusters
  - Ability to start and stop clusters easily is necessary for us to scale to support more customers
- Performance: fast start times and scaling
  - Keeping EC2 instances around is expensive and stopped instance EBS volumes still cost money
  - No elastic scaling
  - If setup changes, need to destroy nodes and rebuild or build a Docker CI pipeline
Anyscale solves these problems with:
(1) the Anyscale SDK
  (a) APIs for organizing, creating, and automating everything around cluster management
(2) Managed Application Images
  (a) Simple automation capabilities for integrating into CI pipelines to move to production

Journey 3: Continuously testing pipelines that require GPUs
• Goal: to test training and inference pipelines. Make sure we can overfit on a single image, etc.
• Problem: Bitbucket CI Pipelines do not have access to GPUs.
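The "overfit on a single image" check can be sketched as an ordinary unit test. The example below is a toy stand-in: it fits a tiny linear model to one example with plain gradient descent in pure Python, so it runs without a GPU. A real pipeline test would presumably train the actual network on one image instead. The function names, step count, and learning rate are all hypothetical.

```python
def train_on_single_example(x, y, steps=200, lr=0.05):
    """Fit y ~ w.x + b to one example by gradient descent; return the loss per step."""
    w = [0.0] * len(x)
    b = 0.0
    losses = []
    for _ in range(steps):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        losses.append(err * err)  # squared error before this step's update
        # Gradient of the squared error with respect to w and b.
        w = [wi - lr * 2 * err * xi for wi, xi in zip(w, x)]
        b -= lr * 2 * err
    return losses


def test_can_overfit_single_example():
    """A healthy training loop should drive the loss on one example to ~0."""
    losses = train_on_single_example(x=[0.2, -0.4, 0.9], y=1.5)
    assert losses[-1] < losses[0]
    assert losses[-1] < 1e-6


test_can_overfit_single_example()
```

If a change breaks the loss, the optimizer step, or the data loading, a test like this fails quickly, which is why it is a common smoke test for training pipelines.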

Continuous integration pipelines on Anyscale GPUs
Solution: extend our CI tests using the Anyscale SDK to launch sessions.
Screenshot label: Tests pass

Passing build and tests

Journey 4: Programmatically standing up clusters and running jobs
• Problems:
  ◦ Running jobs manually is tedious, slow, and error prone
  ◦ Setting up a cluster can take >25 minutes due to many dependencies.
• Goal: to expose Anyscale apps via simple API such that our systems are decoupled

Future: Programmatic inference runs
Solution:
- Ray Serve microservice to launch inference clusters (via Anyscale SDK)
- Hot start application using Anyscale AppConfig registry for pre-compiled app images.
Diagram labels: myDendra; Users; Tests pass; Results in S3, webhook alert

Conclusions
• Ray Tune enables us to conduct hyperparameter tournaments on large search spaces, and can scale without changing any code.
• Ray Serve allows us to future-proof our inference pipelines to work at any scale, and we can hot-start clusters on demand.
• Anyscale makes it much easier to interact with Ray clusters.
• Anyscale SDK enables regular testing of mission-critical ML pipelines with Bitbucket CI integration.
• Ray core is much more friendly than Python’s multiprocessing and is easier to debug than PySpark. It makes workloads faster on your laptop by easy vertical scaling, and can scale to any limit by horizontal scaling without changing any code.

THANK YOU!
Acknowledgements: Will, Bill, Simon, Ian, Charles, Amog, Ed, Richard, Sang from Anyscale Team
www.dendra.io @DendraSystems
www.richarddecal.com @AIJedi
