Distributed Vector Generation in One Step with Ray

The combination of big data and deep learning has fundamentally changed how we approach data science; through machine learning models and application-specific code, computers can "understand" unstructured data (think images and audio). To leverage these capabilities, most applications require a continuous pipeline of one or more trained models interlinked with data processing functions. The Towhee open-source project aims to give users a tool for creating these end-to-end pipelines with a simple API. It does so by providing generic data processing scripts, pre-trained ML models, training tools, and a Pythonic toolkit to stitch all the parts into a pipeline on a local machine. Doing this locally was hard enough, but scaling these pipelines is a step beyond that and has remained a key challenge for us from day one.

This talk will discuss how integrating a Ray engine into Towhee has helped our users confidently scale their unstructured data processing pipelines and other ML applications. Using Ray as a backend, our users can easily schedule compute-heavy Towhee operations using Ray Actors found in Ray Core. First, we will discuss some challenges in building scalable machine learning pipelines. Second, we will elaborate on the pipeline development-to-deployment gap and why Ray is the obvious choice for our users and us. Finally, we will provide a live demo of a real-world data processing application scaled using Ray.

Anyscale

December 07, 2022

Transcript

  1. The Milvus Vector Database
     • Our first open-source project
     • Search, store, and query vectors at the trillion scale
     • Takes advantage of SOTA vector indexing algorithms and heavy parallelization to achieve top speeds
     • Fully cloud native and elastic
  2. The Problem Our Users Face
     • For many users, storing and searching their embedding vectors wasn't enough
     • They need help generating their vectors
     • No access to internal ML teams
     • Not enough manpower to create the complex models needed
     • How can we help these users?
  3. Why It's Hard
     • Unstructured data processing is complex
     • Data: multiple modalities
     • Tools: lack of standards
     • Solutions: one size doesn't fit all
     • Projects: limited resources
     (Slide graphic: expected vs. reality)
  4. What Is Towhee
     • Our second open-source project
     • Framework for unstructured data ETL
     • Simplifies pipeline creation for heavy compute and embedding generation
     (Slide diagram: Unstructured Data → Deep Learning Models → Vectors)
  5. What Towhee Brings
     • Modular Pythonic API for chaining different types of Operators into Pipelines (see the sketch below)
     • Provides a hub of 700+ SOTA models and data processing functions
     • Integrates various tools and libraries
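
     The chaining idea can be sketched in plain Python. This is only an illustration of the pattern, not Towhee's actual API; the Pipeline class and the image_decode / image_embedding operators below are hypothetical stand-ins.

     # Illustrative sketch of chaining Operators into a Pipeline; the class
     # and operator names are hypothetical, not Towhee's real API.
     class Pipeline:
         def __init__(self, operators=None):
             self.operators = operators or []

         def then(self, operator):
             """Append another Operator (any callable) to the chain."""
             return Pipeline(self.operators + [operator])

         def __call__(self, data):
             for op in self.operators:
                 data = op(data)
             return data

     # Hypothetical Operators: decode an image path, then embed the image.
     def image_decode(path):
         return f"decoded({path})"

     def image_embedding(image):
         return [0.1, 0.2, 0.3]  # placeholder embedding vector

     embed = Pipeline().then(image_decode).then(image_embedding)
     vector = embed("cat.jpg")
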
  6. What's Missing
     • A key feature that a project like this needs is the ability to scale
     • Distributing compute and organizing execution is a project of its own
     • We need a simple backend for parallel compute that users can set up and run on their own
     • We also hope for a backend that won't require major changes to our framework
     • Ultimately, we need a fast way to get onto the cloud
  7. Why We Went With Ray Core
     • Ability to distribute certain compute-heavy Operators to external machines with ease
     • Ability to keep state for certain Operators
     • Ability to keep the current Towhee design pattern
     • And most importantly, developer-friendly for usage and deployment
  8. Distributing to the Cluster
     • Ray allows for easy deployment of the cluster, both locally and in the cloud
     • Simple-to-use API for converting functions to remote functions (see the sketch below)
     • Ray's ability to connect to the cluster remotely with Ray Client offers great benefits for our use case:
     • Allows us and our users to test pipelines remotely, speeding up development and testing
     • Allows us and our users to estimate computation speedups and optimize before localizing
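
     A minimal sketch of this pattern, assuming a Ray Client address such as ray://head-node:10001 and a placeholder operator body:

     import ray

     # Connect to a remote cluster through Ray Client; the address is a placeholder.
     ray.init(address="ray://head-node:10001")

     # Any compute-heavy function becomes a remote function with @ray.remote.
     @ray.remote
     def embed_batch(batch):
         # stand-in for a compute-heavy Towhee Operator
         return [len(item) for item in batch]

     # Schedule work on the cluster and gather the results.
     futures = [embed_batch.remote(b) for b in (["a.jpg", "b.jpg"], ["c.jpg"])]
     results = ray.get(futures)
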
  9. Keeping Operator State
     • Statefulness is needed because some Operators track metrics, calculate running averages, etc.
     • By wrapping the Operator in an Actor we are able to keep its state and call its functions (see the sketch below)
     • No additional changes are required from the user within the pipeline definition
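
     A minimal sketch of wrapping a stateful Operator in a Ray Actor; OperatorActor and the double function are hypothetical examples, not Towhee internals:

     import ray

     ray.init()

     @ray.remote
     class OperatorActor:
         """Wraps an Operator so its state survives across calls."""
         def __init__(self, operator):
             self.operator = operator
             self.calls = 0  # example of state: a running call counter

         def process(self, data):
             self.calls += 1
             return self.operator(data)

         def num_calls(self):
             return self.calls

     def double(x):
         return x * 2

     actor = OperatorActor.remote(double)
     result = ray.get(actor.process.remote(21))   # 42
     count = ray.get(actor.num_calls.remote())    # 1
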
  10. Dealing with Package Requirements
     • We decided the best way to deal with package requirements is to treat each Actor as its own Towhee pipeline
     • To achieve this in Ray, we install Towhee on the cluster during ray.init() (see the sketch below)
     • Within each Actor we set unique cache directories to avoid race conditions and deadlocks between two Actors on the same machine sharing the same cache
     • Package requirements are then dealt with at runtime
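
     One way this can look is installing packages through Ray's runtime_env at ray.init() and giving each Actor its own cache directory. This is a sketch of the idea; the cluster address, the package list, and the TOWHEE_CACHE variable name are assumptions for illustration.

     import os
     import uuid
     import ray

     # Install packages on the workers at connection time via a runtime_env;
     # the address and package list are placeholders.
     ray.init(
         address="ray://head-node:10001",
         runtime_env={"pip": ["towhee"]},
     )

     @ray.remote
     class PipelineActor:
         def __init__(self):
             # Each Actor gets its own cache directory so two Actors on the
             # same machine never race on (or deadlock over) a shared cache.
             # TOWHEE_CACHE is a hypothetical variable name for illustration.
             cache_dir = f"/tmp/towhee_cache_{uuid.uuid4().hex}"
             os.makedirs(cache_dir, exist_ok=True)
             os.environ["TOWHEE_CACHE"] = cache_dir
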
  11. Keeping the Current Design Pattern
     • Towhee is built on a coroutine structure, allowing more efficient execution for large pipelines
     • To parallelize certain operations we opted to use ThreadPoolExecutor for local execution
     • Ray's ActorPool can smoothly replace ThreadPoolExecutor with little extra work (see the sketch below)
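
     A minimal sketch of the swap, with a placeholder embed method standing in for a real Operator call:

     import ray
     from ray.util import ActorPool

     ray.init()

     @ray.remote
     class EmbedActor:
         def embed(self, item):
             return len(item)  # stand-in for a compute-heavy Operator call

     # Where a ThreadPoolExecutor maps work over local threads, ActorPool
     # maps the same work over a pool of remote Actors.
     pool = ActorPool([EmbedActor.remote() for _ in range(4)])
     results = list(pool.map(lambda actor, item: actor.embed.remote(item),
                             ["cat.jpg", "dog.jpg", "bird.jpg"]))
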
  12. Issues We Faced
     • We haven't upgraded to the newest version of Ray, so these may already have been addressed
     • Our main issue was trouble serializing lambdas and functions
     • Storing callables in the object store ran into some difficulties
     • The most likely route forward is custom serializers (see the sketch below)
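
     A sketch of what the custom-serializer route could look like using ray.util.register_serializer; the CallableWrapper class and OPERATOR_REGISTRY are hypothetical, and this is not something we have shipped.

     import ray
     import ray.util

     # Hypothetical registry that lets a callable be rebuilt from its name.
     OPERATOR_REGISTRY = {"double": lambda x: x * 2}

     class CallableWrapper:
         """Hypothetical wrapper around a callable that is hard to pickle."""
         def __init__(self, name):
             self.name = name
             self.fn = OPERATOR_REGISTRY[name]

         def __call__(self, data):
             return self.fn(data)

     ray.init()

     # Serialize the wrapper by name only and rebuild the callable on the
     # receiving side instead of pickling the function object itself.
     ray.util.register_serializer(
         CallableWrapper,
         serializer=lambda wrapper: wrapper.name,
         deserializer=lambda name: CallableWrapper(name),
     )

     ref = ray.put(CallableWrapper("double"))  # now storable in the object store
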
  13. Future Work
     • Move to a fully async engine and incorporate async Actors for further speedups (see the sketch below)
     • We ultimately want to work with Ray Serve
     • Convert the Pipeline to a DAG that can then be executed in Serve
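
     A minimal sketch of an async Actor; the embed method and the asyncio.sleep call stand in for real asynchronous work:

     import asyncio
     import ray

     ray.init()

     # Defining async methods makes this an async Actor, so one Actor can
     # service many pipeline calls concurrently instead of blocking.
     @ray.remote
     class AsyncEmbedActor:
         async def embed(self, item):
             await asyncio.sleep(0.01)  # stand-in for awaiting real async work
             return len(item)

     actor = AsyncEmbedActor.remote()
     refs = [actor.embed.remote(x) for x in ["cat.jpg", "dog.jpg"]]
     print(ray.get(refs))
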