Building a Petabyte-Scale Vector Store

Slide 1

Slide 1 text

Slide 2

Slide 2 text

©2023 DataStax. – All rights reserved
Cédrick Lunven (clunven), Software Engineer
❖ Trainer
❖ Public Speaker
❖ Developer Tooling (sdk, cli)
❖ Developer Apps
❖ Team Lead
❖ Creator of ff4j (ff4j.org)
❖ Helping with Langchain4j

Slide 3

Slide 3 text

Agenda
● Retrieval Augmented Generation (RAG) principles
● Apache Cassandra™ as a Vector Database
● Introduction to jVector
● Integrating the ecosystem
● Road to Real-Time generative AI

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Are (effective) prompts good enough?

Prompt
● Context: I am XXX, I want to YYY in order to ZZZ (objectives)
● Tasks: I want you to create this and that because…
● Roles and Persona: You are an assistant; we are targeting AI developers
● Constraints: format, style, must have, forbidden; if you do not know, say you do not know

[Diagram labels: Conversation Agent (REST API), LLM Chat completion API (REST API)]
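The four prompt sections listed on this slide can be assembled from a simple template. The following is a minimal sketch, not a prescribed format: the name `build_prompt`, the template wording, and the sample values are all assumptions made for illustration.

```python
# Hypothetical template: one line per prompt section named on the slide.
PROMPT_TEMPLATE = (
    "Context: {context}\n"
    "Task: {task}\n"
    "Role and persona: {role}\n"
    "Constraints: {constraints}\n"
)


def build_prompt(context, task, role, constraints):
    """Fill the template; constraints is a list joined into one line."""
    return PROMPT_TEMPLATE.format(
        context=context.strip(),
        task=task.strip(),
        role=role.strip(),
        constraints="; ".join(c.strip() for c in constraints),
    )


if __name__ == "__main__":
    print(build_prompt(
        context="I am a developer and I want to learn vector search",
        task="Explain what an embedding is",
        role="You are an assistant for AI developers",
        constraints=["Answer in three sentences",
                     "If you do not know, say you do not know"],
    ))
```

Keeping the sections as separate arguments makes it easy to version or swap one part (for example the constraints) without touching the others.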

Slide 8

Slide 8 text

Aside: LLM parameters
● Temperature: controls the “creativity” of generations
    Temp = 0: always use the highest-probability next token
    Temp = 1: max creativity
● Top K: limit selection of the next token to the K most probable candidates
● Top P: select the next token from the smallest set of candidates whose cumulative probability reaches P
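To make the three parameters concrete, here is a small sketch of how they could be applied when picking the next token. It assumes the model exposes raw scores (logits) per candidate token; the function name `sample_next_token` and the sample logits are hypothetical, and real LLM APIs may combine these filters differently.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a {token: logit} mapping."""
    # Temperature 0 is treated as greedy decoding: take the best token.
    if temperature == 0:
        return max(logits, key=logits.get)

    # Softmax over temperature-scaled logits (shifted for numerical stability).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(val - peak) for tok, val in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top K: keep only the K most probable candidates.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top P: keep the smallest prefix whose cumulative probability reaches P.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    probs = [prob for _, prob in ranked]
    # random.choices renormalizes the remaining weights for us.
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    example_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}  # made-up scores
    print(sample_next_token(example_logits, temperature=0.7, top_k=2))
```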

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Self Consistency
Sampling multiple reasoning paths can produce better results (March 2022).

How does it work in practice?
● The task needs to have a correct answer
● Different reasoning paths need to be explored
● Effectively achieved by increasing the temperature, top p, and top k parameters of the LLM
● Need to sample 5-20 reasoning paths
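The voting step behind self consistency can be sketched in a few lines. This assumes a hypothetical `generate` callable that runs one sampled reasoning path and returns only its final answer (or `None` if no answer could be extracted); it is an illustration of majority voting, not the exact procedure from the 2022 paper.

```python
from collections import Counter


def self_consistent_answer(generate, question, n_paths=5):
    """Sample n_paths answers and return (most common answer, vote share)."""
    answers = [generate(question) for _ in range(n_paths)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / n_paths


if __name__ == "__main__":
    # Stand-in for an LLM call with a non-zero temperature: canned answers.
    canned = iter(["18", "26", "18", "17", "18"])
    print(self_consistent_answer(lambda q: next(canned), "How many apples?"))
```

The vote share gives a rough confidence signal: a low share suggests the sampled paths disagree and the answer deserves a second look.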

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Limitations of “LLM only” mode
● The LLM can be outdated
● The LLM does not know your data
● The LLM is not tuned, so it is hard to steer
● The LLM hallucinates if not properly prompted
● The LLM works with limited input windows (tokens)

[Diagram labels: Conversation Agent (REST API), LLM Chat completion API (REST API)]

Slide 17

Slide 17 text

Retrieval-Augmented Generation

Prompt
● Context: I am XXX, I want to YYY in order to ZZZ (objectives)
● Tasks: I want you to create this and that because…
● Roles and Persona: You are an assistant; we are targeting AI developers
● Constraints: format, style, must have, forbidden; if you do not know, say you do not know
● Unstructured data, issued from SEMANTIC SEARCH: new pieces of text that give the LLM more information to specialize the response

[Diagram labels: Conversation Agent (REST API), LLM Chat completion API (REST API)]

Slide 18

Slide 18 text

Data Ingestion: Vectorization
[Diagram labels: Your DATA, Your Website, DOCUMENT SPLITTER, Sentence Transformer (Embeddings API), Vector Space, PROMPT vector, Conversation Agent (REST API), LLM Chat completion API (REST API)]

Slide 19

Slide 19 text

Semantic Search
[Diagram labels: Your DATA, Your Website, DOCUMENT SPLITTER, Sentence Transformer (Embeddings API), Vector DATABASE, Vector Space, PROMPT vector, SIMILARITY SEARCH (ANN), Conversation Agent (REST API), LLM Chat completion API (REST API)]

Slide 20

Slide 20 text

Retrieval-Augmented Generation
[Diagram labels: Your DATA, Your Website, DOCUMENT SPLITTER, Sentence Transformer (Embeddings API), Vector DATABASE, Vector Space, PROMPT vector, SIMILARITY SEARCH, PROMPT+RAG, ANSWER, Conversation Agent (REST API), LLM Chat completion API (REST API)]
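The flow shown across these three diagrams (split, embed, search, augment the prompt) can be sketched end to end in plain Python. Everything here is a toy stand-in: `embed` counts words instead of calling an embeddings API, the search is an exact scan rather than ANN against a vector database, and all function names and sample chunks are assumptions for illustration.

```python
import math
import re
from collections import Counter


def split_document(text, max_words=12):
    """Document splitter: cut the text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for an embeddings API)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_rag_prompt(question, chunks, top_k=2):
    """Similarity search over chunks, then prepend the best ones to the prompt."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    context = "\n".join(f"- {c}" for c in ranked[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    chunks = [
        "Cassandra is a distributed NoSQL database",
        "Bananas are rich in potassium",
        "Vector search in Cassandra uses the jVector library",
    ]
    print(build_rag_prompt("Which library powers vector search in Cassandra?", chunks, top_k=1))
```

In a real application the resulting string would be sent to the chat completion API; the sketch stops at prompt assembly so it runs without any external service.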

Slide 21

Slide 21 text

Retrieval and generation are not the same
Seems obvious, but standard RAG examples generally treat both steps exactly the same.

Retrieval:
● It is important to find the correct/best-matching documents
● Varying the size of the embedded text affects how documents cluster in the embedding space

Generation:
● Models support large context windows
● Models are good at summarizing and at extracting facts and data from long documents

Slide 22

Slide 22 text

Slide 23

Slide 23 text

FLARE: Forward-Looking Active REtrieval
https://arxiv.org/abs/2305.06983

1. Query the knowledge base for the question (Vector Search)
2. Extract more questions from the knowledge base (LLM)
3. Query the knowledge base with all questions (Vector Search)
4. Summarize and check the answer (LLM)
Repeat 1–4 until the answer looks good
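The four steps on this slide map naturally onto a loop. The sketch below follows the slide's simplified description rather than the full algorithm from the paper; the callables `search`, `extract_questions`, `summarize`, and `looks_good` are hypothetical placeholders for vector search and LLM calls, and `max_rounds` is an added safety bound.

```python
def flare_style_loop(question, search, extract_questions, summarize, looks_good,
                     max_rounds=3):
    """Iterate retrieval and generation until the answer passes a check."""
    questions = [question]
    answer = None
    for _ in range(max_rounds):
        # Step 1: vector search for the original question.
        passages = list(search(question))

        # Step 2: let the LLM derive follow-up questions from what was found.
        for follow_up in extract_questions(question, passages):
            if follow_up not in questions:
                questions.append(follow_up)

        # Step 3: vector search with all questions, skipping duplicate passages.
        for q in questions:
            for passage in search(q):
                if passage not in passages:
                    passages.append(passage)

        # Step 4: summarize and check; stop as soon as the answer looks good.
        answer = summarize(question, passages)
        if looks_good(answer):
            break
    return answer


if __name__ == "__main__":
    # Stubs standing in for a vector store and an LLM.
    knowledge_base = {"main question": ["A"], "follow-up question": ["B"]}
    result = flare_style_loop(
        "main question",
        search=lambda q: knowledge_base.get(q, []),
        extract_questions=lambda q, passages: ["follow-up question"],
        summarize=lambda q, passages: " ".join(passages),
        looks_good=lambda ans: "B" in ans,
    )
    print(result)
```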

Slide 24

Slide 24 text

Vector Database?
[Diagram labels: Vector DATABASE, Vector Space, SIMILARITY SEARCH]
● Handling A LOT of vectors (chunking)
● Effective search algorithms
● Performant, resilient
● Dynamic (vector sizes)

● Metadata filtering
● Keyword search
● Semantic caching
● Chat history
● Key-value cache

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

NoSQL distributed database
1 installation = 1 NODE
✔ Capacity = ~2-4 TB
✔ Throughput = LOTS Tx/sec/core
Communication:
✔ Gossiping
✔ No master (peer-to-peer)
DataCenter (DC) | Ring

Slide 28

Slide 28 text

NoSQL distributed database
Table: sensors_by_network (slide labels: Partition Key, Primary Key)

network   sensor   temperature
forest    f001     92
forest    f002     88
volcano   v001     210
sea       s001     45
sea       s002     50
home      h001     72
road      r001     105
road      r002     110
ice       i001     35
car       c001     69
dog       d001     40
car       c002     70
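How a partition key decides where a row lives on the ring can be sketched as follows. This is a simplified illustration under stated assumptions: the node names are hypothetical, the ring is split into equal ranges, `network` is taken to be the partition key, and MD5 is used only as a stand-in hash (Cassandra's default partitioner is Murmur3-based, and real clusters also replicate each partition to several nodes).

```python
import hashlib
from bisect import bisect_left

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical 4-node ring
RING_SIZE = 2 ** 32

# Each node owns the token range that ends at its token (equal split).
TOKENS = [(i + 1) * RING_SIZE // len(NODES) for i in range(len(NODES))]


def token_for(partition_key):
    """Hash the partition key to a position on the ring (stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def node_for(partition_key):
    """Find the first node whose token is >= the key's token."""
    index = bisect_left(TOKENS, token_for(partition_key)) % len(NODES)
    return NODES[index]


if __name__ == "__main__":
    for network in ["forest", "volcano", "sea", "home", "road", "ice", "car", "dog"]:
        print(network, "->", node_for(network))
```

Because only the partition key is hashed, every row sharing the same `network` value (for example both `forest` sensors) lands on the same node, which is what makes single-partition reads cheap.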

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Apache Cassandra
● High Availability (Always On): every second of downtime translates into lost revenue
● Linear Scalability (Hyper Scalability): millions of operations per day, hour, or second
● Global Distribution (Data Everywhere): on-premises, hybrid, multi-cloud, centralized, or edge
● Low Latency (Faster Pace): every millisecond of latency has consequences

Slide 31

Slide 31 text

5 Hard Problems We’re Solving
● Scale-Out Capabilities: no upper limits
● Garbage Collection: pruning obsolete index information
● Effective Use of Disk: enabling high throughput
● Composability: predicates, term-based searches (aka Hybrid Search)
● Concurrency: non-blocking, multi-threaded index construction
https://thenewstack.io/5-hard-problems-in-vector-search-and-how-cassandra-solves-them/

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Searching for Neighbors

SELECT * FROM vsearch.products
ORDER BY item_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 1;

 id | description                                     | item_vector                 | name
----+-------------------------------------------------+-----------------------------+---------------------
  5 | A deep learning display that controls your mood | [0.1, 0.05, 0.08, 0.3, 0.6] | Vision Vector Frame

Slide 36

Slide 36 text

Integration

Composes with partitioning:
SELECT * FROM demo WHERE partition_id = ? ORDER BY embedding ANN OF ? LIMIT 100

Composes with other SAI indexes:
SELECT * FROM demo WHERE (c1 = ? AND c2 = ?) OR c3 = ? ORDER BY embedding ANN OF ? LIMIT 5

Global ANN everywhere:
SELECT * FROM demo ORDER BY embedding ANN OF ? LIMIT 10

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Build around LLM & embeddings
[Diagram labels: LLM (any), Embeddings (any), Your GenAI application]
Quite a lot of logic to handle all this…
● Prompt templating
● Memory of the past/context mgmt
● Domain-specific knowledge
● Retrieval-Augmented Generation
● Caching
● Prompt versioning/mgmt
● Storage (vector or otherwise)
● Reranking (e.g. MMR)
● Agents
● Doc ingestion
● "Chain of thought"
● …

Slide 46

Slide 46 text

Top Frameworks (Python ecosystem)

LangChain ("must know")
● Messy, bloated: but it's everywhere
● Broader coverage of everything around LLMs

LlamaIndex ("should know")
● More considerate in its growth
● Mainly data ingestion and storage/retrieval

Semantic Kernel ("Sem… what?")
● Still a lesser player
● Auto-planner feature
● Possibly the best-structured

…
● Probably not worth checking right now

[Chart labels: current GitHub star count ~60k; TILs mentioning framework: 27. GitHub forks and StackOverflow tell the same story.]

Slide 47

Slide 47 text

Introducing Vector Search with AI Apps
[Diagram layers: Use Cases, Queries, Tables & Indexes (Cassandra)]
● AI-powered application: ChatBot | AI Assistant | Copilot
● Labels: Text Generation, Search Engine, Multi Model, Similarity Search, ChatMemory, Semantic Cache, Meta-Data Filtering, KV Cache, Prompt Template, RAG, LLM History
● Frameworks: 🦜🔗 Langchain, 🦙 Llama index, Semantic Kernel
● Columns: vector (vector), metadata_s (map), body_blob (text), keys
● APIs: Conversation Agent (REST API), LLM Chat completion API (REST API), Embeddings API (REST API)

Slide 48

Slide 48 text

CASSIO: Dedicated models for your Queries
[Diagram labels: LLM Providers (Embeddings API, Chat completion API; REST API), 🦜🔗 Langchain adapters, connection; operations on each table: create, put, search; features: VectorStore, ChatMemory, Semantic Cache, Meta-Data Filtering, KV Cache]

Table classes:
● PlainCassandraTable: Prompt Template (map data into prompts)
● MetadataVectorCassandraTable: VectorStore (store embeddings as knowledge for LLM)
● ClusteredMetadataVectorCassandraTable: VectorStore + partition aware (store embeddings as knowledge for LLM)

Schemas:

CREATE TABLE $name (
  row_id text PRIMARY KEY,
  attributes_blob text,
  body_blob text,
  metadata_s map,
  vector vector
)

CREATE TABLE $name (
  partition_id text,
  row_id text,
  attributes_blob text,
  body_blob text,
  metadata_s map,
  vector vector,
  PRIMARY KEY (partition_id, row_id)
)

CREATE TABLE BaseTable (
  row_id text PRIMARY KEY,
  body_blob text,
  metadata_s map
)

Slide 49

Slide 49 text

My First Application: Genai-Demo (Text Generation)
[Diagram labels: GITPOD; Astra DB VECTOR; LLM Providers: OpenAI (Embeddings API, Chat completion API; REST API); OpenAI JAVA CLIENT: com.theokanning.openai-gpt3-java; astra-sdk-vector: MetadataVectorCassandraTable; Cassandra Drivers; Module Vector]

CREATE TABLE philosophers (
  partition_id text,
  row_id text,
  attributes_blob text,
  body_blob text,
  metadata_s map,
  vector vector,
  PRIMARY KEY (partition_id, row_id)
)

Slide 50

Slide 50 text

My Second Application: Genai-Demo-week2 (Text Generation)
[Diagram labels: GenAI Application; Astra DB VECTOR; LLM Provider: VertexAI (Embeddings API, Chat completion API; REST API); Langchain4j; Langchain4j-vertex-ai; langchain4j-cassandra (support of SDK Vector); astra-sdk-vector: MetadataVectorCassandraTable; Cassandra Drivers; Module Vector]

CREATE TABLE philosophers (
  partition_id text,
  row_id text,
  attributes_blob text,
  body_blob text,
  metadata_s map,
  vector vector,
  PRIMARY KEY (partition_id, row_id)
)

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Slide 53

Slide 53 text

What is LangStream?
● Framework for developing generative AI applications
● Runtime environment for generative AI applications
● Data integration platform to bring relevant and recent data to gen AI
● Powered by proven technology: Kubernetes, Kafka, Kafka Connect

Slide 54

Slide 54 text

LangStream
[Diagram labels: Kubernetes, Kafka, LangStream, LangStream Operator, Control Plane and API, WebSocket Gateway, Agent (chat completion), Agent (OSS embedding), Vector DB, AI Services, DB]

Slide 55

Slide 55 text

Slide 56

Slide 56 text