Distributed Vector Generation in One Step with Ray

The combination of big data and deep learning has fundamentally changed how we approach data science; through machine learning models and application-specific code, computers can "understand" unstructured data (think images and audio). To leverage these capabilities, most applications require a continuous pipeline of one or more trained models interlinked with data processing functions. The Towhee open-source project aims to give users a tool for creating these end-to-end pipelines with a simple API. It does so by providing generic data processing scripts, pre-trained ML models, training tools, and a Pythonic toolkit to stitch all the parts into a pipeline on a local machine. Doing this locally was hard enough, but scaling these pipelines is a step beyond that and has remained a key challenge for us from day one.

This talk will discuss how integrating a Ray engine into Towhee has helped our users confidently scale their unstructured data processing pipelines and other ML applications. Using Ray as a backend, our users can easily schedule compute-heavy Towhee operations using Ray Actors found in Ray Core. First, we will discuss some challenges in building scalable machine learning pipelines. Second, we will elaborate on the pipeline development-to-deployment gap and why Ray is the obvious choice for our users and us. Finally, we will provide a live demo of a real-world data processing application scaled using Ray.

Anyscale

December 07, 2022

Transcript

  1. The Milvus Vector Database
     • Our first open-source project
     • Search, store, and query vectors at the trillion scale
     • Takes advantage of SOTA vector indexing algorithms and heavy parallelization to achieve top speeds
     • Fully cloud native and elastic
  2. The Problem Our Users Face
     • For many users, storing and searching their embedding vectors wasn't enough
     • They need help generating their vectors
     • No access to internal ML teams
     • Not enough manpower to create the complex models needed
     • How can we help these users?
  3. Why It's Hard
     • Unstructured data processing is complex
     • Data: multiple modalities
     • Tools: lack of standards
     • Solutions: one size doesn't fit all
     • Projects: limited resources
     (Slide graphic: expected vs. reality)
  4. What Is Towhee
     • Our second open-source project
     • Framework for unstructured data ETL
     • Simplifies pipeline creation for heavy compute and embedding generation
     (Slide diagram: Unstructured Data → Deep Learning Models → Vectors)
  5. What Towhee Brings
     • Modular Pythonic API for chaining different types of Operators into Pipelines (see the sketch below)
     • Provides a hub of 700+ SOTA models and data processing functions
     • Integrates various tools and libraries
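
     The chaining idea can be sketched in plain Python. This is only an illustration of the pattern, not Towhee's actual API; the Pipeline class and the image_decode / image_embedding operators below are hypothetical stand-ins.

     # Illustrative sketch of chaining Operators into a Pipeline; the class
     # and operator names are hypothetical, not Towhee's real API.
     class Pipeline:
         def __init__(self, operators=None):
             self.operators = operators or []

         def then(self, operator):
             """Append another Operator (any callable) to the chain."""
             return Pipeline(self.operators + [operator])

         def __call__(self, data):
             for op in self.operators:
                 data = op(data)
             return data

     # Hypothetical Operators: decode an image path, then embed the image.
     def image_decode(path):
         return f"decoded({path})"

     def image_embedding(image):
         return [0.1, 0.2, 0.3]  # placeholder embedding vector

     embed = Pipeline().then(image_decode).then(image_embedding)
     vector = embed("cat.jpg")
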
  6. What's Missing
     • A key feature that a project like this needs is the ability to scale
     • Distributing compute and organizing execution is a project of its own
     • We need a simple backend for parallel compute that users can set up and run on their own
     • We also hope for a backend that won't require major changes to our framework
     • Ultimately, we need a fast way to get onto the cloud
  7. Why We Went With Ray Core
     • Ability to distribute certain compute-heavy Operators to external machines with ease
     • Ability to keep state for certain Operators
     • Ability to keep the current Towhee design pattern
     • And most importantly, developer-friendly for usage and deployment
  8. Distributing to the Cluster
     • Ray allows for easy deployment of the cluster, both locally and in the cloud
     • Simple-to-use API for converting functions to remote functions (see the sketch below)
     • Ray's ability to connect to the cluster remotely with Ray Client offers great benefits for our use case:
     • Allows us and our users to test pipelines remotely, speeding up development and testing
     • Allows us and our users to estimate computation speedups and optimize before localizing
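
     A minimal sketch of this pattern, assuming a Ray Client address such as ray://head-node:10001 and a placeholder operator body:

     import ray

     # Connect to a remote cluster through Ray Client; the address is a placeholder.
     ray.init(address="ray://head-node:10001")

     # Any compute-heavy function becomes a remote function with @ray.remote.
     @ray.remote
     def embed_batch(batch):
         # stand-in for a compute-heavy Towhee Operator
         return [len(item) for item in batch]

     # Schedule work on the cluster and gather the results.
     futures = [embed_batch.remote(b) for b in (["a.jpg", "b.jpg"], ["c.jpg"])]
     results = ray.get(futures)
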
  9. Keeping Operator State
     • Statefulness is needed because some Operators track metrics, calculate running averages, etc.
     • By wrapping the Operator in an Actor we are able to keep its state and call its functions (see the sketch below)
     • No additional changes are required from the user within the pipeline definition
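
     A minimal sketch of wrapping a stateful Operator in a Ray Actor; OperatorActor and the double function are hypothetical examples, not Towhee internals:

     import ray

     ray.init()

     @ray.remote
     class OperatorActor:
         """Wraps an Operator so its state survives across calls."""
         def __init__(self, operator):
             self.operator = operator
             self.calls = 0  # example of state: a running call counter

         def process(self, data):
             self.calls += 1
             return self.operator(data)

         def num_calls(self):
             return self.calls

     def double(x):
         return x * 2

     actor = OperatorActor.remote(double)
     result = ray.get(actor.process.remote(21))   # 42
     count = ray.get(actor.num_calls.remote())    # 1
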
  10. Dealing with Package Requirements
     • We decided the best way to deal with package requirements is to treat each Actor as its own Towhee pipeline
     • To achieve this in Ray, we install Towhee on the cluster during ray.init() (see the sketch below)
     • Within each Actor we set unique cache directories to avoid race conditions and deadlocks between two Actors on the same machine sharing the same cache
     • Package requirements are then dealt with at runtime
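
     One way this can look is installing packages through Ray's runtime_env at ray.init() and giving each Actor its own cache directory. This is a sketch of the idea; the cluster address, the package list, and the TOWHEE_CACHE variable name are assumptions for illustration.

     import os
     import uuid
     import ray

     # Install packages on the workers at connection time via a runtime_env;
     # the address and package list are placeholders.
     ray.init(
         address="ray://head-node:10001",
         runtime_env={"pip": ["towhee"]},
     )

     @ray.remote
     class PipelineActor:
         def __init__(self):
             # Each Actor gets its own cache directory so two Actors on the
             # same machine never race on (or deadlock over) a shared cache.
             # TOWHEE_CACHE is a hypothetical variable name for illustration.
             cache_dir = f"/tmp/towhee_cache_{uuid.uuid4().hex}"
             os.makedirs(cache_dir, exist_ok=True)
             os.environ["TOWHEE_CACHE"] = cache_dir
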
  11. Keeping the Current Design Pattern
     • Towhee is built on a coroutine structure, allowing more efficient execution for large pipelines
     • To parallelize certain operations we opted to use ThreadPoolExecutor for local execution
     • Ray's ActorPool can smoothly replace ThreadPoolExecutor with little extra work (see the sketch below)
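
     A minimal sketch of the swap, with a placeholder embed method standing in for a real Operator call:

     import ray
     from ray.util import ActorPool

     ray.init()

     @ray.remote
     class EmbedActor:
         def embed(self, item):
             return len(item)  # stand-in for a compute-heavy Operator call

     # Where a ThreadPoolExecutor maps work over local threads, ActorPool
     # maps the same work over a pool of remote Actors.
     pool = ActorPool([EmbedActor.remote() for _ in range(4)])
     results = list(pool.map(lambda actor, item: actor.embed.remote(item),
                             ["cat.jpg", "dog.jpg", "bird.jpg"]))
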
  12. Issues We Faced
     • We haven't upgraded to the newest version of Ray, so these may already have been addressed
     • Our main issue was trouble serializing lambdas and functions
     • Storing callables in the object store ran into some difficulties
     • The most likely route forward is custom serializers (see the sketch below)
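
     A sketch of what the custom-serializer route could look like using ray.util.register_serializer; the CallableWrapper class and OPERATOR_REGISTRY are hypothetical, and this is not something we have shipped.

     import ray
     import ray.util

     # Hypothetical registry that lets a callable be rebuilt from its name.
     OPERATOR_REGISTRY = {"double": lambda x: x * 2}

     class CallableWrapper:
         """Hypothetical wrapper around a callable that is hard to pickle."""
         def __init__(self, name):
             self.name = name
             self.fn = OPERATOR_REGISTRY[name]

         def __call__(self, data):
             return self.fn(data)

     ray.init()

     # Serialize the wrapper by name only and rebuild the callable on the
     # receiving side instead of pickling the function object itself.
     ray.util.register_serializer(
         CallableWrapper,
         serializer=lambda wrapper: wrapper.name,
         deserializer=lambda name: CallableWrapper(name),
     )

     ref = ray.put(CallableWrapper("double"))  # now storable in the object store
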
  13. Future Work
     • Move to a fully async engine and incorporate async Actors for further speedups (see the sketch below)
     • We ultimately want to work with Ray Serve
     • Convert the Pipeline to a DAG that can then be executed in Serve
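
     A minimal sketch of an async Actor; the embed method and the asyncio.sleep call stand in for real asynchronous work:

     import asyncio
     import ray

     ray.init()

     # Defining async methods makes this an async Actor, so one Actor can
     # service many pipeline calls concurrently instead of blocking.
     @ray.remote
     class AsyncEmbedActor:
         async def embed(self, item):
             await asyncio.sleep(0.01)  # stand-in for awaiting real async work
             return len(item)

     actor = AsyncEmbedActor.remote()
     refs = [actor.embed.remote(x) for x in ["cat.jpg", "dog.jpg"]]
     print(ray.get(refs))
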