Slide 1

One Step to Distributed Embedding Vector Generation with Ray

Slide 2

Speaker: Filip Haltmayer, Software Engineer | [email protected] | LinkedIn: filiphaltmayer

Slide 3

CONTENTS
01 Vector Database Background
02 Pipeline Solution
03 How We Use Ray

Slide 4

01 Vector Database Background

Slide 5

What is Unstructured Data

Slide 6

Unstructured Data Processing

Slide 7

Vectors are Different

Slide 8

Vector Index Types

Slide 9

The Milvus Vector Database
• Our first open-source project
• Search, store, and query vectors at the trillion scale
• Takes advantage of SOTA vector indexing algorithms and heavy parallelization to achieve top speeds
• Fully cloud native and elastic

Slide 10

The Problem Our Users Face
• For many users, storing and searching their embedding vectors wasn't enough
• They need help generating their vectors:
• No access to internal ML teams
• Not enough manpower to create the complex models needed
• How can we help these users?

Slide 11

02 Pipeline Solution

Slide 12

Why It’s Hard
• Unstructured data processing is complex
• Data: multiple modalities
• Tools: lack of standards
• Solutions: one size doesn’t fit all
• Projects: limited resources
[Images: Expected vs. Reality]

Slide 13

What is Towhee
• Our second open-source project
• Framework for unstructured data ETL
• Simplifies pipeline creation for heavy compute and embedding generation
[Diagram: Unstructured Data → Deep Learning Models → Vectors]

Slide 14

What Towhee Brings
• Modular, pythonic API for chaining different types of Operators into Pipelines
• Provides a hub of over 700+ SOTA models and data processing functions
• Integrates various tools and libraries
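A minimal sketch of the chaining idea in plain Python (a hypothetical toy, not Towhee's actual API): each map() call appends an Operator, and run() threads the data through the chain.

```python
# Hypothetical toy illustrating a modular, pythonic Operator pipeline;
# this is NOT Towhee's real API, only a sketch of the chaining pattern.
class Pipeline:
    def __init__(self):
        self.ops = []

    def map(self, op):
        # Append an Operator (any callable) and return self for chaining.
        self.ops.append(op)
        return self

    def run(self, data):
        # Thread every item through the chained Operators in order.
        for op in self.ops:
            data = [op(x) for x in data]
        return data

# Two toy "Operators": normalize, then "embed" (here, just the length).
p = Pipeline().map(str.lower).map(len)
result = p.run(["Cat", "Horse"])
print(result)  # [3, 5]
```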

Slide 15

What’s Missing
• A key feature a project like this needs is the ability to scale
• Distributing compute and organizing execution is a project of its own
• We need a simple backend for parallel compute that users can set up and run on their own
• We also want a backend that won’t require major changes to our framework
• Ultimately, we need a fast way to get onto the cloud

Slide 16

03 How We Use Ray

Slide 17

Why We Went With Ray Core
• Ability to distribute certain compute-heavy Operators to external machines with ease
• Ability to keep state for certain Operators
• Ability to keep the current Towhee design pattern
• And most importantly, developer friendly for usage and deployment

Slide 18

Distributing to the Cluster
• Ray allows easy deployment of the cluster, both locally and in the cloud
• Simple-to-use API for converting functions to remote functions
• Ray’s ability to connect to the cluster remotely with Ray Client offers great benefits for our use case:
• Lets us and our users test pipelines remotely, speeding up development and testing
• Lets us and our users approximate computation speedups and optimize before localizing

Slide 19

Keeping Operator State
• Statefulness in Operators is needed because some Operators track metrics, calculate averages, etc.
• By wrapping the Operator in an Actor, we are able to keep its state and call its functions
• No additional changes are required from the user within the pipeline definition

Slide 20

Dealing with Package Requirements
• We decided the best way to deal with package requirements is to treat each Actor as its own Towhee pipeline
• To achieve this in Ray, we install Towhee on the cluster during ray.init()
• Within each Actor we set unique cache directories to avoid race conditions and deadlocks between two Actors on the same machine using the same cache
• Package requirements are then dealt with at runtime

Slide 21

Keeping the Current Design Pattern
• Towhee is built on a coroutine structure, allowing more efficient execution for large pipelines
• To parallelize certain operations, we opted to use ThreadPoolExecutor for local execution
• Ray ActorPool can smoothly replace ThreadPoolExecutor with little extra work

Slide 22

What the Result Looks Like

Slide 23

Issues We Faced
• We haven’t upgraded to the newest version of Ray, so these may already have been addressed
• Our main issue was trouble serializing lambdas and functions
• Storing callables in the object store proved difficult
• The most likely fix is custom serializers

Slide 24

Future Work
• Move to a fully async engine and incorporate async Actors for further speedup
• We ultimately want to work with Ray Serve
• Convert the Pipeline to a DAG that can then be executed in Serve

Slide 25

Q&A
https://github.com/milvus-io/milvus
https://zilliz.com
https://github.com/towhee-io/towhee