
Distributed Vector Generation in One Step with Ray

December 07, 2022

The combination of big data and deep learning has fundamentally changed how we approach data science; through machine learning models and application-specific code, computers can "understand" unstructured data (think images and audio). To leverage these capabilities, most applications require a continuous pipeline of one or more trained models interlinked with data processing functions. The Towhee open-source project aims to give users a tool for creating these end-to-end pipelines through a simple API: it provides generic data processing scripts, pre-trained ML models, training tools, and a Pythonic toolkit to stitch all the parts into a pipeline on a local machine. Building such pipelines locally was hard enough; scaling them is harder still, and it has remained a key challenge for us from day one.

This talk will discuss how integrating a Ray engine into Towhee has helped our users confidently scale their unstructured data processing pipelines and other ML applications. Using Ray as a backend, our users can easily schedule compute-heavy Towhee operations using Ray Actors found in Ray Core. First, we will discuss some challenges in building scalable machine learning pipelines. Second, we will elaborate on the pipeline development-to-deployment gap and why Ray is the obvious choice for our users and us. Finally, we will provide a live demo of a real-world data processing application scaled using Ray.



  1. One Step to Distributed Embedding Vector Generation with Ray


  2. Speaker
    Filip Haltmayer
    Software Engineer
    [email protected]



  3. 01 Vector Database Background
    02 Pipeline Solution
    03 How We Use Ray


  4. 01
    Vector Database Background


  5. What is Unstructured Data


  6. Unstructured Data Processing


  7. Vectors are Different


  8. Vector Index Types


  9. The Milvus Vector Database
    •Our first open-source project

    •Search, store, and query vectors at the trillion scale
    •Takes advantage of SOTA vector indexing
    algorithms and heavy parallelization to achieve
    top speeds

    •Fully cloud native and elastic


  10. The Problem Our Users Face
    •For many users, storing and searching their embedding vectors wasn’t enough

    •They need help generating those vectors

    •They have no access to internal ML teams

    •They don’t have the manpower to create the complex models needed

    •How can we help these users?


  11. 02
    Pipeline Solution


  12. Why It’s Hard
    •Unstructured data processing is complex

    •Data: Multiple modalities

    •Tools: Lack of standards

    •Solutions: One size doesn’t fit all

    •Projects: Limited resources
    (Image: expectation vs. reality)


  13. What is Towhee
    •Our second open-source project

    •Framework for Unstructured Data ETL

    •Simplifies pipeline creation for heavy compute and embedding generation
    (Diagram: Unstructured Data → Deep Learning Models)


  14. What Towhee Brings
    •Modular Pythonic API for chaining different types of Operators into pipelines

    •Provides a hub of over 700 SOTA models and data processing operators

    •Integrates various tools and libraries


  15. What’s Missing
    •A key feature that a project like this needs is the ability to scale

    •Distributing compute and organizing execution is a project of its own

    •We need a simple backend for parallel compute that users can set up
    and run on their own

    •We also want a backend that won’t require major changes to our
    existing design

    •Ultimately, we need a fast path to the cloud


  16. 03
    How We Use Ray


  17. Why We Went With Ray Core
    •Ability to distribute certain compute-heavy Operators to external
    machines with ease

    •Ability to keep state for certain Operators

    •Ability to keep the current Towhee design pattern

    •And most important, developer friendly for usage and deployment


  18. Distributing to the Cluster
    •Ray allows easy deployment of the cluster, both locally and in the
    cloud

    •Simple-to-use API for converting functions into remote functions

    •Ray’s ability to remotely connect to the cluster with Ray Client offers
    great benefits for our use case:

    •Allows us and our users to test pipelines remotely, offering
    speedups in development and testing

    •Allows us and our users to approximate computation speedups
    and to optimize before finalizing


  19. Keeping Operator State
    •Statefulness in Operators is needed because
    some Operators track metrics, calculate
    averages, etc.

    •By wrapping the operator in an Actor we are
    able to keep its state and call its function

    •No additional changes are required by the
    user within the pipeline definition


  20. Dealing with Package Requirements
    •We figured that the best way to deal with
    package requirements would be to treat each
    Actor as its own Towhee pipeline

    •In order to achieve this in Ray we install
    Towhee on the cluster during ray.init()

    •Within each Actor we set unique cache
    directories to avoid race conditions and
    deadlocks between two actors on the same
    machine using the same cache

    •Package requirements are then dealt with
    during runtime
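The two pieces of that setup can be sketched as below; the `/tmp/towhee_cache` path and the per-actor naming scheme are illustrative, not Towhee's actual layout:

```python
import os

# Runtime environment handed to ray.init(): Ray installs the listed pip
# packages on every node of the cluster before tasks run, e.g.
#   ray.init(runtime_env={"pip": ["towhee"]})
runtime_env = {"pip": ["towhee"]}

def actor_cache_dir(actor_id: int) -> str:
    # Each actor derives a unique cache directory from its id, so two
    # actors on the same machine never race on a shared model cache.
    return os.path.join("/tmp/towhee_cache", f"actor_{actor_id}")
```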


  21. Keeping Current Design Pattern
    •Towhee is built on a coroutine structure,
    allowing more efficient execution for large
    workloads

    •To parallelize certain operations locally,
    we opted to use ThreadPoolExecutor

    •Ray ActorPool can smoothly replace
    ThreadPoolExecutor with little extra work


  22. What the Result Looks Like


  23. Issues We Faced
    •We haven’t upgraded to the newest version of
    Ray, so some of these may already be resolved

    •Our main issue was trouble serializing
    lambdas and functions

    •Storing callables in the object store also
    ran into some difficulties

    •The most likely route forward is custom
    serialization


  24. Future Work
    •Move to full async engine and incorporate
    async Actors for further speedup

    •We want to ultimately work with Ray Serve

    •Convert Pipeline to DAG that can then be
    executed in Serve


  25. Q&A
