Slide 1

Slide 1 text

Making Sense of Vector Databases
Balkrishna Rawool
02 Dec 2024

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Balkrishna Rawool
IT Chapter Lead, ING Bank
@BalaRawool

Slide 4

Slide 4 text

Vector
- In mathematics, a vector is a quantity that has magnitude and direction.
- In computer science, a vector is a group or array of numbers.
- Examples:
  - [2.1, 0.34, 4.678]
  - [4]
  - [0.13483997249264842, 0.26967994498529685, −0.40451991747794525, 0.5393598899705937, 0.674199862463242]
- The two views are equivalent: a vector with magnitude m and angle θ is the same as a vector with coordinates V(x, y, z).
- The elements of a vector are also called the dimensions of the vector.
[Figure: a vector v⃗ drawn in X-Y-Z space, shown both as magnitude m with angle θ and as coordinates V(x, y, z)]
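As a minimal illustration of the computer-science view (class and variable names are mine, not from the talk), a vector is just an array of numbers whose length is its number of dimensions:

    class VectorExample {
        public static void main(String[] args) {
            float[] v = {2.1f, 0.34f, 4.678f}; // a 3-dimensional vector
            float[] w = {4f};                  // a 1-dimensional vector
            System.out.println(v.length + " and " + w.length + " dimensions");
        }
    }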

Slide 5

Slide 5 text

Vector Embeddings
- Typically a vector is calculated from some data by applying an embedding algorithm/model.
- Here, the vector is also known as an embedding or a vector embedding.
- An embedding model puts data with similar meaning close to each other.
- The measurement of this closeness is called the similarity or distance between the two vectors/embeddings.
[Diagram: data/tokens pass through an embedding model/algorithm and the resulting vectors are stored in a vector DB]

Slide 6

Slide 6 text

Vector Databases
- Vector databases store vectors.
- Operations (sketched as a Java interface below):
  - Inserting
  - Updating
  - Deleting
  - Indexing
  - Searching
- Examples:
  - PostgreSQL with pgvector
  - Elasticsearch
  - Redis
  - Milvus
  - MongoDB
  - Neo4j
  - Pinecone
  - Chroma
  - Weaviate
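A hypothetical interface showing how the five operations above could look as code; the names and signatures are my assumptions, not the API of any of the listed products:

    import java.util.List;

    // Sketch of the core vector-database operations.
    interface VectorDatabase {
        void insert(String id, float[] vector);    // Inserting
        void update(String id, float[] vector);    // Updating
        void delete(String id);                    // Deleting
        void buildIndex();                         // Indexing
        List<String> search(float[] query, int k); // Searching: k nearest neighbours
    }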

Slide 7

Slide 7 text

Simple VectorDB using Java Vector API

Slide 8

Slide 8 text

Simple VectorDB using Java Vector API
- Java Vector API
  - Takes advantage of CPU architectures for faster processing
  - Single Instruction Multiple Data (SIMD)
  - Current status: Ninth Incubator (JEP 489)
- Simple VectorDB implementation
  - Stores vectors in memory
  - Insertion and update
  - Selection
  - Searching with cosine similarity
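A minimal sketch of a SIMD dot product with the incubating Vector API (compile and run with --add-modules jdk.incubator.vector); this is not the talk's actual implementation, just an illustration of the API:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    class SimdDot {
        static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        static float dot(float[] a, float[] b) {
            float sum = 0f;
            int i = 0;
            int bound = SPECIES.loopBound(a.length);
            for (; i < bound; i += SPECIES.length()) {
                // Process SPECIES.length() elements per iteration in one SIMD lane set
                FloatVector va = FloatVector.fromArray(SPECIES, a, i);
                FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
                sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
            }
            for (; i < a.length; i++) { // scalar tail for leftover elements
                sum += a[i] * b[i];
            }
            return sum;
        }
    }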

Slide 9

Slide 9 text

Example 1: Simple Vectors

Slide 10

Slide 10 text

Example 1: Simple Vectors
- A world that consists only of gray colors.
- All embeddings are one-dimensional: [0], [1], [2], [3], [4], [5], [6], [7], [8], [9].
[Diagram: (1) data is mapped to embeddings, (2) embeddings are stored in the vector DB, (3) a query is searched, (4) nearest neighbours are returned]

Slide 11

Slide 11 text

Similarity/Distance
- Two vectors which have similar meaning are close together.
- Similarity/distance is a measurement of how close they are.
- Ways of calculating distance:
  - Manhattan Distance
  - Dot Product Similarity
  - Cosine Similarity
  - Euclidean Distance
- Manhattan Distance:
  Manhattan distance between V1(x1, y1, z1) and V2(x2, y2, z2) = |x1 − x2| + |y1 − y2| + |z1 − z2|
[Figure: vectors V1(x1, y1, z1) and V2(x2, y2, z2) in X-Y-Z space]

Slide 12

Slide 12 text

Similarity/Distance
- Dot Product Similarity:
  Dot product of V1 and V2 = (x1 * x2) + (y1 * y2) + (z1 * z2)
- Euclidean Distance:
  - Straight-line distance between two vectors.
  Euclidean distance between V1 and V2 = √((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)
[Figure: vectors V1(x1, y1, z1) and V2(x2, y2, z2) in X-Y-Z space]
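The three distance measures defined so far, written out in plain Java for arbitrary dimensions (method names are mine; illustrative only):

    class Distances {
        static double manhattan(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return d; // sum of per-dimension absolute differences
        }
        static double dotProduct(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += a[i] * b[i];
            return d; // sum of per-dimension products
        }
        static double euclidean(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(d); // straight-line distance
        }
    }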

Slide 13

Slide 13 text

Magnitude and Normalization of a Vector
- The magnitude or length of a vector is its length from the origin.
  Magnitude of V = |V| = √(x² + y² + z²)
- A normalized vector is the same vector but without its magnitude.
- So, a unit vector in the same direction as the original vector.
  Normalized V = V / |V| = (x/|V|, y/|V|, z/|V|)
[Figure: vector V(x, y, z) in X-Y-Z space]
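The two definitions above as a small self-contained sketch (names are mine):

    class VectorMath {
        static double magnitude(double[] v) {
            double sum = 0;
            for (double x : v) sum += x * x;
            return Math.sqrt(sum);          // |V| = √(x² + y² + z² + ...)
        }
        static double[] normalize(double[] v) {
            double m = magnitude(v);
            double[] unit = new double[v.length];
            for (int i = 0; i < v.length; i++) unit[i] = v[i] / m;
            return unit;                    // V / |V|, a unit vector
        }
    }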

Slide 14

Slide 14 text

Cosine Similarity
- Cosine similarity of two vectors is the cosine of the angle between them.
  Cosine similarity between V1 and V2 = cos θ = (V1 · V2) / (|V1| |V2|)
[Figure: vectors V1(x1, y1, z1) and V2(x2, y2, z2) in X-Y-Z space, with angle θ between them]
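The formula combines the dot product and the magnitudes from the previous slides; a compact illustrative version:

    class CosineSimilarity {
        // cos θ = (V1 · V2) / (|V1| |V2|)
        static double cosineSimilarity(double[] a, double[] b) {
            double dot = 0, magA = 0, magB = 0;
            for (int i = 0; i < a.length; i++) {
                dot  += a[i] * b[i];
                magA += a[i] * a[i];
                magB += b[i] * b[i];
            }
            return dot / (Math.sqrt(magA) * Math.sqrt(magB));
        }
    }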

Slide 15

Slide 15 text

Example 2: Cosine Similarity

Slide 16

Slide 16 text

Example 2: Cosine Similarity
- The world becomes colorful, ranging from RGB(1,1,1) to RGB(255,255,255).
- Embedding algorithm:
  - Map a color to a vector with its 3 RGB values
  - Normalize the vector
[Diagram: (1) RGB colors pass through the embedding algorithm, (2) vectors are stored in the vector DB, (3) a query is searched, (4) nearest neighbours are returned]
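The slide's embedding algorithm as a sketch (method name is mine). Note that the range starting at RGB(1,1,1) avoids the zero vector, which cannot be normalized:

    class ColorEmbedding {
        // Map a color to its 3 RGB components, then normalize to a unit vector.
        static double[] embedColor(int r, int g, int b) {
            double m = Math.sqrt((double) r * r + (double) g * g + (double) b * b);
            return new double[] { r / m, g / m, b / m };
        }
    }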

Slide 17

Slide 17 text

Embedding Models
- Embedding models convert input data/tokens into vectors.
- These vectors capture the meaning of and relationships between tokens.
- Examples:
  - word2vec – Text/words
  - GloVe (Global Vectors) – Text/words
  - BERT (Bidirectional Encoder Representations from Transformers) – Text
  - CNN (Convolutional Neural Network) – Images/Videos
  - img2vec – Images
  - GPT (Generative Pre-trained Transformer) – Multimodal
  - CLIP (Contrastive Language-Image Pre-training) – Multimodal (Text and Images)

Slide 18

Slide 18 text

Example 3: word2vec embedding

Slide 19

Slide 19 text

Example 3: word2vec Embeddings
- Convert words to vectors.
- Embedding algorithm:
  - word2vec
  - Deeplearning4j's word2vec implementation, applied to 4 Wikipedia pages
- The vectors capture the meaning of and relationships between tokens.
[Diagram: (1) the 4 wiki pages pass through word2vec, (2) vectors are stored in PostgreSQL + pgvector, (3) a word is searched, (4) nearest neighbours are returned]
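A sketch of training word2vec with Deeplearning4j, along the lines of the DL4J examples; the input file name, hyperparameter values and query word are my assumptions, not the talk's actual code:

    import java.util.Collection;
    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

    class Word2VecDemo {
        public static void main(String[] args) throws Exception {
            SentenceIterator iter = new BasicLineIterator("wiki-pages.txt"); // hypothetical corpus file
            TokenizerFactory tokenizer = new DefaultTokenizerFactory();
            Word2Vec vec = new Word2Vec.Builder()
                    .minWordFrequency(5)
                    .layerSize(100)        // number of embedding dimensions
                    .windowSize(5)
                    .iterate(iter)
                    .tokenizerFactory(tokenizer)
                    .build();
            vec.fit();                                          // train on the corpus
            double[] embedding = vec.getWordVector("database"); // the vector for one word
            Collection<String> nearest = vec.wordsNearest("database", 4); // 4 closest words
            System.out.println(nearest);
        }
    }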

Slide 20

Slide 20 text

Indexing Algorithms
- k-d Tree (k-Dimensional Tree)
  - A binary search tree with each layer using a different dimension.
- R-tree (Rectangle Tree)
  - Vectors are grouped into rectangles.
- LSH (Locality-Sensitive Hashing)
  - Vectors are placed in buckets of hashes (see the sketch after this list).
- Annoy (Approximate Nearest Neighbours Oh Yeah)
  - A forest of multiple binary search trees.
- HNSW (Hierarchical Navigable Small World)
- ScaNN (Scalable Nearest Neighbours)
  - Hybrid approach with several techniques combined.
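To make the LSH bullet concrete, here is a minimal sketch of one classic LSH family, random-hyperplane hashing for cosine similarity: each hyperplane contributes one bit (which side of the plane the vector falls on), and vectors with the same bit pattern land in the same bucket. All names are mine:

    import java.util.Random;

    class LshSketch {
        final double[][] planes; // random hyperplane normals

        LshSketch(int bits, int dims, long seed) {
            Random rnd = new Random(seed);
            planes = new double[bits][dims];
            for (double[] p : planes)
                for (int i = 0; i < dims; i++) p[i] = rnd.nextGaussian();
        }

        int bucket(double[] v) {
            int hash = 0;
            for (int b = 0; b < planes.length; b++) {
                double dot = 0;
                for (int i = 0; i < v.length; i++) dot += planes[b][i] * v[i];
                if (dot >= 0) hash |= 1 << b;   // which side of hyperplane b?
            }
            return hash; // vectors in the same bucket are likely close
        }
    }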

Slide 21

Slide 21 text

HNSW (Hierarchical Navigable Small World)
- Vectors are plotted onto a graph, with each vector being a node; "close" vectors are connected by edges.
- The graph is divided into layers with varying granularity.
- To search for the nearest neighbour (sketched in code below):
  - Start with a fixed entry point on the highest layer.
  - Find a vector with the least distance from the query vector.
  - Use this vector as the entry point to the next layer.
  - Repeat until you reach the lowest layer, which has maximum granularity.
[Diagram: layers 2, 1 and 0 of the graph, with the search moving from the entry point towards the query vector]
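A highly simplified sketch of the layered greedy search described above; real HNSW keeps a candidate list per layer rather than a single best node, and all types and names here are mine:

    import java.util.List;
    import java.util.Map;

    class HnswSketch {
        final List<Map<Integer, List<Integer>>> neighbours; // per-layer adjacency lists
        final double[][] vectors;
        final int entryPoint;

        HnswSketch(List<Map<Integer, List<Integer>>> neighbours,
                   double[][] vectors, int entryPoint) {
            this.neighbours = neighbours;
            this.vectors = vectors;
            this.entryPoint = entryPoint;
        }

        int search(double[] query) {
            int current = entryPoint;
            for (int layer = neighbours.size() - 1; layer >= 0; layer--) {
                boolean improved = true;
                while (improved) {               // greedy walk on this layer
                    improved = false;
                    for (int n : neighbours.get(layer).getOrDefault(current, List.of())) {
                        if (distance(vectors[n], query) < distance(vectors[current], query)) {
                            current = n;         // move closer to the query
                            improved = true;
                        }
                    }
                }
                // 'current' becomes the entry point for the next, finer layer
            }
            return current; // nearest neighbour found on layer 0
        }

        static double distance(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
            return d; // squared Euclidean distance; enough for comparisons
        }
    }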

Slide 22

Slide 22 text

Example 4: HNSW Indexing

Slide 23

Slide 23 text

Example 4: HNSW Indexing
- HNSW indexing of vectors from a relatively large wiki dataset.
- Embedding algorithm:
  - llama3 embeddings, applied to 10 Wikipedia pages
- Search for the 5 nearest neighbours.
[Diagram: (1) the 10 wiki pages pass through the llama3 embedding model, (2) vectors are stored in PostgreSQL + pgvector, (3) a prompt is searched, (4) nearest neighbours are returned]
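A sketch of HNSW indexing and 5-nearest-neighbour search with pgvector over JDBC; the table and column names, connection details and the (truncated) query embedding are my assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    class PgVectorHnswDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/vectors", "user", "password")) {
                try (Statement st = conn.createStatement()) {
                    // Build an HNSW index using cosine distance (pgvector 0.5.0+)
                    st.execute("CREATE INDEX IF NOT EXISTS pages_hnsw_idx ON pages "
                            + "USING hnsw (embedding vector_cosine_ops)");
                }
                // <=> is pgvector's cosine-distance operator
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT content FROM pages ORDER BY embedding <=> ?::vector LIMIT 5")) {
                    ps.setString(1, "[0.12, -0.07, 0.33]"); // illustrative query embedding
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }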

Slide 24

Slide 24 text

Retrieval Augmented Generation (RAG)
- Augment the prompt to an LLM with specific knowledge that is relevant to the prompt.
[Diagram: (1) data passes through the embedding model, (2) vectors are stored in the vector DB, (3) the prompt is used to search for nearest neighbours, (4) prompt and neighbours go to the LLM]
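The four diagram steps as a code sketch; Embedder, VectorStore and Llm are hypothetical placeholder interfaces, not a real API:

    import java.util.List;

    interface Embedder   { float[] embed(String text); }
    interface VectorStore { List<String> search(float[] query, int k); }
    interface Llm        { String generate(String prompt); }

    class RagSketch {
        final Embedder embedder; final VectorStore store; final Llm llm;
        RagSketch(Embedder e, VectorStore s, Llm l) { embedder = e; store = s; llm = l; }

        String answer(String prompt) {
            float[] queryVector = embedder.embed(prompt);       // 1. embed the prompt
            List<String> context = store.search(queryVector, 4); // 2. nearest neighbours
            String augmented = "Answer using this context:\n"
                    + String.join("\n", context)
                    + "\nQuestion: " + prompt;                   // 3. augment the prompt
            return llm.generate(augmented);                      // 4. let the LLM answer
        }
    }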

Slide 25

Slide 25 text

Example 5: Retrieval Augmented Generation

Slide 26

Slide 26 text

Example 5: RAG
- Use RAG to ask an LLM questions about the Epic Comic Co book store.
- Embedding model: llama3. Vector DB: PostgreSQL with pgvector.
[Diagram: (1) the FAQ passes through the llama3 embedding model, (2) vectors are stored in PostgreSQL + pgvector, (3) the prompt is searched for nearest neighbours, (4) prompt and neighbours go to llama3]

Slide 27

Slide 27 text

Example 6: img2vec Embedding

Slide 28

Slide 28 text

Example 6: img2vec Embedding
- Vectorize a large set of images and search for similar images.
- Embedding model: img2vec
- Vector DB: Weaviate
[Diagram: (1) the images pass through the img2vec embedding model, (2) vectors are stored in Weaviate, (3) a new image is searched, (4) nearest neighbours are returned]

Slide 29

Slide 29 text

Source code and contact
@balarawool.bluesky.social
@[email protected]
@BalaRawool

Slide 30

Slide 30 text

No content