How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery (Richard Decal, Dendra Systems)

This talk outlines how Dendra Systems uses Ray and Anyscale to parallelize its workloads across a cluster of dozens of GPUs from a single Python script. Richard discusses how Dendra optimized its inference pipelines with Ray Serve to saturate the cluster's network I/O limits.
Richard then describes how Anyscale makes it easy to run this Ray application in production: it runs seamlessly in Dendra's Bitbucket CI and also supports a microservice that launches jobs programmatically.

Anyscale

July 16, 2021

Transcript

  1. How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery
    Richard Decal
    Lead ML Engineer

  2. Current rate of ecological destruction is unparalleled
    Image sources: NYT, NZ forestry
    2B hectares of degraded land

  3. Traditional methods of ecological restoration do not scale
    Invasive weeds monitoring: person on truck
    Ecosystem restoration: manual, little automation, or optimized for monoculture
    Ecosystem analytics: ground surveys
    Source: Wikipedia

  4. Dendra’s mission is to create the tools needed to power scalable ecosystem restoration
    Advantages
    Scale: 11x faster
    Cost: 3x cheaper
    Safe: steep & unstable ground
    Targeted
    Robust
    We're a global team of data ecologists, engineers and drone operators who come together to provide a holistic system designed to restore native lands.

  5. Dendra Technologies and Restoration
    Native species & tree stem density: Track native species richness and tree stem density to understand land rehabilitation progression.
    Fauna & habitat: Monitor fauna to validate success through the native animals that return.
    Biodiversity assessment: Assess biodiversity to identify the most suitable species to thrive in the ecosystem and ensure general monitoring.
    Image labels: species quantification; fauna paths and migration routes; goanna; snake; native species assessment

  6. Dendra Technologies and Restoration
    Pest monitoring: Track known and unknown pests over an ecosystem to facilitate intervention.
    Legacy infrastructure: Monitor legacy infrastructure to clean the land.
    Weed management: Identify invasive weed coverage and monitor progress for transparency and financial forecasting.
    Image labels: priority heatmaps; weed species; feral cat; feral goat; dog tracks; boat; car, wheel & other; whole site weed mapping; ant nest

  7. Dendra Technologies and Restoration
    Erosion and soil health monitoring: Quantify erosion risk, alert to safety issues, and take early action (prevention is easier and cheaper than cure).
    Scale: 400 soccer fields of imagery per drone per day
    Image labels: tunneling (RGB, DEM); mass movement (RGB, DEM); gully; pooling

  8. Action
    Each seeding drone can reseed up to 60 hectares per day, carrying 700 kg of daily payload (2 polar bears' worth).
    Tailored seed mixtures (>50 species).
    10 drones flying in a swarm could plant as many as 300,000 trees per day.

  9. About me
    ML engineer on a personal and professional mission to fight climate change.
    Former molecular biologist.
    Lead ML Engineer at Dendra. Founded ML team.
    Ray user since version 0.6.4
    Experience with MapReduce and PySpark, but not distributed applications.

  10. Map of Dendra’s ML Platform
    Diagram labels: hypothesis; feature engineering; hyperparameter tournaments; inference; testing; (with Ray scheduler)

  12. Journey 1: Tuning deep neural networks at scale
    ● First attempt was using AWS SageMaker.
    ○ Pro: Great if following templates and using off-the-shelf solutions
    ○ Con: Too inflexible for bringing our own specialized models and our in-house platform.
    ● Discovered Ray Tune

  13. Ray Tune saves time and money in hyperparameter tournaments
    Chart: loss (lower is better) against time (training epochs), with early-terminated trials marked. 16 of 24 hyperparameter configurations were terminated after 1 epoch.
    Porting our PyTorch code to Ray SGD was easy. It enables training to scale from my laptop to dozens of GPUs.
    It costs hundreds of dollars to train a single hyperparameter trial to completion. Ray Tune's scheduler algorithm aggressively terminates under-performing trials rather than training them to completion.
    - This saves us money, and enables us to search a larger hyperparameter space since bad samples are cheap ($17).
    - Cost without terminations: $21,000 (estimated). Cost with termination: $3,740.
    - With trial resuming, we can do coarse-to-fine hyperparameter searches.
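
The early-termination idea on this slide can be sketched in plain Python. This is a minimal illustration of a successive-halving-style rule, not Ray Tune's scheduler (the slide does not say which scheduler was used). The function `successive_halving`, its `keep_fraction` parameter, and the toy loss curves are assumptions made up for the example. The two dollar figures at the end are the ones quoted on the slide.

```python
def successive_halving(loss_curves, keep_fraction=0.5):
    """Return how many epochs each trial trains under a simple halving rule.

    loss_curves maps a trial name to its per-epoch validation losses
    (lower is better). After every epoch only the best `keep_fraction`
    of the surviving trials continue; the rest are terminated early.
    """
    n_epochs = min(len(curve) for curve in loss_curves.values())
    alive = sorted(loss_curves)  # deterministic starting order
    epochs_trained = {name: 0 for name in loss_curves}
    for epoch in range(n_epochs):
        for name in alive:
            epochs_trained[name] += 1
        # Rank survivors by their loss at this epoch and keep the best ones.
        alive.sort(key=lambda name: loss_curves[name][epoch])
        alive = alive[:max(1, int(len(alive) * keep_fraction))]
    return epochs_trained


# Toy loss curves (made up for illustration only).
toy_curves = {
    "a": [1.0, 0.5, 0.2],
    "b": [2.0, 1.5, 1.0],
    "c": [3.0, 2.0, 1.5],
    "d": [1.5, 0.9, 0.8],
}
epochs = successive_halving(toy_curves)
total_with_stopping = sum(epochs.values())
total_without_stopping = sum(len(curve) for curve in toy_curves.values())

# Savings implied by the figures quoted on the slide.
cost_without = 21_000  # USD, estimated cost without terminations
cost_with = 3_740      # USD, cost with termination
savings_fraction = (cost_without - cost_with) / cost_without
```

On the toy curves, trials b and c stop after one epoch, d after two, and only a trains to completion, so 7 epochs are trained instead of 12. By the slide's own figures, early termination cut the estimated cost by roughly 82%.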

  14. Journey 2: Scaling inference workloads to millions of images
    Requirements
    ● Scale: a serving solution that can scale to hundreds of millions of images in the future
    ○ Parallelize across multiple workers
    ● Maximize GPU utilization: network I/O is the bottleneck. How do we scale as we add or remove workers from our cluster?

  15. Saturating Network I/O in inference pipeline
    Diagram labels: head node; worker nodes; Ray distributed queue actor; S3 client actors (three shown); model replica; opportunistic batching; image S3 URLs; images; streaming results; post-processing, batching, sharding; AWS Kinesis Firehose; Parquet shards
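
The "opportunistic batching" label can be illustrated with a small standard-library sketch. It assumes the term means "take whatever is already queued, up to a cap, rather than waiting for a full batch"; the slide does not define it. It uses threads and `queue.Queue` rather than Ray actors, and `fake_download`, `downloader`, `take_batch`, and the bucket name are hypothetical stand-ins, not Dendra's code.

```python
import queue
import threading


def fake_download(url):
    """Stand-in for an S3 GET; returns placeholder bytes (hypothetical)."""
    return url.encode()


def downloader(urls, out_queue):
    """Mimics one S3 client: fetch each image and enqueue it."""
    for url in urls:
        out_queue.put((url, fake_download(url)))


def take_batch(in_queue, max_batch_size, timeout=1.0):
    """Wait for one item, then take whatever else is already queued,
    up to max_batch_size, without waiting for the batch to fill."""
    batch = [in_queue.get(timeout=timeout)]
    while len(batch) < max_batch_size:
        try:
            batch.append(in_queue.get_nowait())
        except queue.Empty:
            break
    return batch


urls = [f"s3://example-bucket/img_{i}.png" for i in range(10)]
images = queue.Queue()
# Three concurrent downloaders, each taking a slice of the URL list.
workers = [
    threading.Thread(target=downloader, args=(urls[i::3], images))
    for i in range(3)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

batch_sizes = []
seen = []
while not images.empty():
    batch = take_batch(images, max_batch_size=4)
    batch_sizes.append(len(batch))
    seen.extend(url for url, _ in batch)
```

Because the consumer never blocks waiting for a full batch, the final batch is simply smaller. With ten queued images and a cap of four, the batches have sizes 4, 4, and 2.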

  16. Why we became Anyscale customers
    Open source Ray is great, but it lacks:
    - Administration: robust and programmatic control of clusters
    - The ability to start and stop clusters easily is necessary for us to scale to support more customers
    - Performance: fast start times and scaling
    - Keeping EC2 instances around is expensive, and the EBS volumes of stopped instances still cost money
    - No elastic scaling
    - If the setup changes, we need to destroy and rebuild nodes, or build a Docker CI pipeline
    Anyscale solves these problems with:
    (1) the Anyscale SDK
    (a) APIs for organizing, creating, and automating everything around cluster management
    (2) Managed Application Images
    (a) Simple automation capabilities for integrating into CI pipelines to move to production

  17. Journey 3: Continuously testing pipelines that require GPUs
    ● Goal: to test training and inference pipelines. Make sure we can overfit on a single image, etc.
    ● Problem: Bitbucket CI Pipelines do not have access to GPUs.
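
The "overfit on a single image" check can be sketched as an ordinary unit test. The version below is a toy: it fits a linear model to one made-up example in pure Python instead of training a neural network on a GPU. The names `train_on_single_example` and `test_can_overfit_single_example`, the input values, and the loss threshold are all assumptions for illustration.

```python
def train_on_single_example(x, y, lr=0.1, steps=200):
    """Fit a linear model to one (x, y) pair with gradient descent
    and return the final squared-error loss."""
    w = [0.0] * len(x)
    b = 0.0
    for _ in range(steps):
        err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
        # Gradient of 0.5 * err**2 with respect to w and b.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
    err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
    return 0.5 * err ** 2


def test_can_overfit_single_example():
    """If the training loop is wired correctly, the loss on the one
    training example should collapse to (nearly) zero."""
    loss = train_on_single_example(x=[0.5, -1.0, 2.0], y=3.0)
    assert loss < 1e-9, f"failed to overfit one example: loss={loss}"
```

A test like this catches wiring bugs, such as a wrong gradient sign or parameters that never update, long before a full training run would reveal them.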

  18. Continuous integration pipelines on Anyscale GPUs
    Solution: extend our CI tests using the Anyscale SDK to launch sessions.
    Tests pass

  19. Passing build and tests

  21. Journey 4: Programmatically standing up clusters and running jobs
    ● Problems:
    ○ Running jobs manually is tedious, slow, and error-prone
    ○ Setting up a cluster can take >25 minutes due to many dependencies.
    ● Goal: to expose Anyscale apps via a simple API so that our systems are decoupled

  22. Future: Programmatic inference runs
    Solution:
    - Ray Serve microservice to launch inference clusters (via Anyscale SDK)
    - Hot-start the application using the Anyscale AppConfig registry for pre-compiled app images.
    Diagram labels: myDendra; users; results in S3; webhook alert; tests pass
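
One way to picture the decoupling goal is a request handler that knows nothing about clusters and simply delegates to an injected launcher. The sketch below is hypothetical throughout: `handle_launch_request`, the `image_urls` and `webhook_url` fields, and the `launch_cluster` callable are invented for illustration. It calls neither Ray Serve nor the Anyscale SDK, whose APIs the slides do not show.

```python
import json
import uuid


def handle_launch_request(body, launch_cluster):
    """Validate a JSON job request and hand it to a cluster launcher.

    `launch_cluster` is an injected callable (hypothetical). In
    production it would wrap whatever SDK call starts the cluster,
    so callers never depend on cluster details.
    """
    try:
        request = json.loads(body)
    except json.JSONDecodeError:
        return {"status": 400, "error": "body is not valid JSON"}
    if not isinstance(request, dict):
        return {"status": 400, "error": "body must be a JSON object"}
    missing = [k for k in ("image_urls", "webhook_url") if k not in request]
    if missing:
        return {"status": 400, "error": f"missing fields: {missing}"}
    job_id = str(uuid.uuid4())
    launch_cluster(job_id, request["image_urls"], request["webhook_url"])
    return {"status": 202, "job_id": job_id}


# A fake launcher that only records what it was asked to start.
launched = []


def fake_launcher(job_id, image_urls, webhook_url):
    launched.append((job_id, list(image_urls), webhook_url))
```

Injecting the launcher keeps the handler testable without a cluster: in CI the fake launcher records calls, while a production launcher would start the job and later post to the webhook once results are in S3.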

  23. Conclusions
    ● Ray Tune enables us to conduct hyperparameter tournaments on large search spaces, and can scale without changing any code.
    ● Ray Serve allows us to future-proof our inference pipelines to work at any scale, and we can hot-start clusters on demand.
    ● Anyscale makes it much easier to interact with Ray clusters.
    ● The Anyscale SDK enables regular testing of mission-critical ML pipelines with Bitbucket CI integration.
    ● Ray core is much friendlier than Python’s multiprocessing and is easier to debug than PySpark. It makes workloads faster on your laptop through easy vertical scaling, and can scale to any limit through horizontal scaling without changing any code.

  24. www.dendra.io
    @DendraSystems
    THANK YOU!
    Acknowledgements: Will, Bill, Simon, Ian, Charles, Amog, Ed, Richard, Sang from the Anyscale team
    www.richarddecal.com
    @AIJedi
