that has magnitude and direction, written v⃗.
- In computer science, a vector is a group or array of numbers.
- Examples:
  - [2.1, 0.34, 4.678]
  - [4]
  - [0.13483997249264842, 0.26967994498529685, −0.40451991747794525, 0.5393598899705937, 0.674199862463242]
- The two definitions are equivalent.
- The elements in a vector are also called the dimensions of the vector.
[Figure: a vector V(x, y, z) on the X, Y, Z axes; its magnitude m and direction θ describe the same vector: (m, θ) = V(x, y, z)]
some data by applying an embedding algorithm/model.
- Here, the vector is also known as an embedding or vector embedding.
- An embedding model puts data with similar meaning close to each other.
- The measure of this closeness is called the similarity or distance between the two vectors/embeddings.
[Diagram: Data/Tokens → Embedding model/algorithm → Vector DB]
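To make "similarity" concrete, here is a minimal sketch of cosine similarity in plain Java (the metric the VectorDB demo on the next slide uses for search); the class and method names and the float[] representation are my own choices, not from the slides.

```java
// Minimal sketch: cosine similarity between two vectors stored as float[].
// cos(a, b) = (a · b) / (|a| * |b|); ~1.0 means same direction, 0 means orthogonal.
public final class Similarity {

    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] v1 = {2.1f, 0.34f, 4.678f};
        float[] v2 = {1.9f, 0.40f, 4.500f};
        System.out.println(cosine(v1, v2)); // close to 1.0: similar direction
    }
}
```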
API
- Takes advantage of CPU architectures for faster processing
  - Single Instruction Multiple Data (SIMD)
- Current status: ninth incubator (JEP 489)
- Simple VectorDB implementation
  - Stores vectors in memory
  - Insertion and update
  - Selection
  - Searching with cosine similarity
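A minimal sketch of a SIMD dot product with the incubating Vector API; this is my own illustration of the idea, assuming a recent JDK and compiling/running with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// SIMD dot product with the incubating Vector API (JEP 489).
public final class SimdDot {

    // Pick the widest vector shape the CPU supports (e.g. 8 floats on AVX2).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration, one SIMD instruction each.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the leftover elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

The same pattern covers the cosine-similarity search: dot product and norms are all reductions over lanes, which is where the SIMD speed-up comes from.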
of a vector is just the same vector but without magnitude.
- So, a unit vector in the same direction as the original vector.
- The magnitude or length of a vector is its distance from the origin:
  |V| = √(x² + y² + z²)
- Normalized V = V/|V| = (x/|V|, y/|V|, z/|V|)
[Figure: vector V(x, y, z) on the X, Y, Z axes]
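A short sketch of magnitude and normalization in plain Java; the helper class and method names are illustrative, not from the slides.

```java
// Magnitude and normalization of a vector stored as float[].
public final class VectorMath {

    // |V| = sqrt(x1^2 + x2^2 + ... + xn^2)
    public static double magnitude(float[] v) {
        double sum = 0.0;
        for (float x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // V / |V|: same direction, length 1.
    public static float[] normalize(float[] v) {
        double m = magnitude(v);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / m);
        }
        return unit;
    }
}
```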
from RGB(1,1,1) to RGB(255,255,255)
- Embedding algorithm:
  - Map each color to a vector with its 3 RGB values
  - Normalize the vector (see the sketch below)
[Diagram: RGB Colors → Embedding Algorithm → Vector DB, with numbered steps: store, query, search, neighbours]
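A minimal sketch of this color-embedding idea, reusing the Similarity and VectorMath helpers sketched above; the class name and sample colors are my own.

```java
// Embed an RGB color as a normalized 3-dimensional vector.
// Normalization removes the magnitude (brightness) component, so two colors
// with a similar hue land close together regardless of how bright they are.
public final class ColorEmbedding {

    public static float[] embed(int r, int g, int b) {
        return VectorMath.normalize(new float[] {r, g, b}); // VectorMath sketch from above
    }

    public static void main(String[] args) {
        float[] darkRed   = embed(100, 10, 10);
        float[] brightRed = embed(250, 25, 25);
        // Same direction, so cosine similarity is ~1.0 despite different brightness.
        System.out.println(Similarity.cosine(darkRed, brightRed));
    }
}
```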
- Embedding algorithm:
  - word2vec
  - Deeplearning4j word2vec implementation, applied to 4 Wikipedia pages (see the sketch below)
- Vectors contain meaning and relationships between tokens.
[Diagram: 4 wiki pages → word2vec → PostgreSQL + pgvector, with numbered steps: store word, search, neighbours]
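A hedged sketch of training word2vec with Deeplearning4j's standard Word2Vec builder; the file path and parameter values are illustrative, and the slides' actual configuration may differ.

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public final class WikiWord2Vec {

    public static void main(String[] args) throws Exception {
        // One sentence per line; "wiki-pages.txt" is a placeholder for the 4 exported pages.
        SentenceIterator iter = new BasicLineIterator("wiki-pages.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare words
                .layerSize(100)        // 100-dimensional embeddings
                .windowSize(5)         // context window around each token
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Nearest neighbours in embedding space capture relatedness between tokens.
        System.out.println(vec.wordsNearest("java", 10));
    }
}
```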
A binary search tree with each layer splitting on a different dimension.
- R-tree (Rectangle Tree)
  - Vectors are grouped into rectangles.
- LSH (Locality-Sensitive Hashing)
  - Vectors are placed in buckets of hashes (see the sketch after this list).
- Annoy (Approximate Nearest Neighbours Oh Yeah)
  - A forest of multiple binary search trees
- HNSW (Hierarchical Navigable Small World)
- ScaNN (Scalable Nearest Neighbours)
  - Hybrid approach with several techniques combined
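To make "buckets of hashes" concrete, here is a minimal sketch of random-hyperplane LSH, one common LSH family for cosine similarity; it is my own illustration, not an implementation from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Random-hyperplane LSH: each hash bit records which side of a random
// hyperplane a vector falls on. Nearby vectors tend to get the same bits,
// so they land in the same bucket and become search candidates.
public final class SimpleLsh {

    private final float[][] hyperplanes;                      // one hyperplane per hash bit
    private final Map<Integer, List<float[]>> buckets = new HashMap<>();

    public SimpleLsh(int dimensions, int bits, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new float[bits][dimensions];
        for (float[] h : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                h[d] = (float) rnd.nextGaussian();
            }
        }
    }

    private int hash(float[] v) {
        int h = 0;
        for (int bit = 0; bit < hyperplanes.length; bit++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[bit][d] * v[d];
            }
            if (dot >= 0) {
                h |= 1 << bit;                                // set the bit for this side
            }
        }
        return h;
    }

    public void insert(float[] v) {
        buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(v);
    }

    // Candidate neighbours: only the vectors sharing the query's bucket.
    public List<float[]> candidates(float[] query) {
        return buckets.getOrDefault(hash(query), List.of());
    }
}
```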
onto a graph with each vector being a node; "close" vectors are connected by edges.
- The graph is divided into layers of varying granularity.
- To search for the nearest neighbour:
  - Start at a fixed entry point on the highest layer.
  - Find the vector with the least distance from the query vector.
  - Use this vector as the entry point to the next layer.
  - Repeat until you reach the lowest layer, which has maximum granularity.
[Figure: layered HNSW graph with an entry point on Layer 2, descending through Layer 1 to Layer 0 toward the query vector]
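A highly simplified sketch of that layered descent: greedy search on each layer, then drop one layer down. Real HNSW keeps a candidate beam (efSearch) and dynamic neighbour lists; the Node interface here is hypothetical, just to show the core idea.

```java
import java.util.List;

public final class HnswSearchSketch {

    // Hypothetical node type: each node holds a vector and, per layer, its edges.
    interface Node {
        float[] vector();
        List<Node> neighbours(int layer);
    }

    static double distance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static Node search(Node entryPoint, float[] query, int topLayer) {
        Node current = entryPoint;
        // Walk down from the coarsest layer to layer 0 (maximum granularity).
        for (int layer = topLayer; layer >= 0; layer--) {
            boolean improved = true;
            while (improved) {
                improved = false;
                // Greedy step: move to any neighbour closer to the query.
                for (Node n : current.neighbours(layer)) {
                    if (distance(n.vector(), query) < distance(current.vector(), query)) {
                        current = n;
                        improved = true;
                    }
                }
            }
            // The closest node on this layer becomes the entry point one layer down.
        }
        return current; // approximate nearest neighbour
    }
}
```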
of a relatively large wiki dataset
- Embedding algorithm:
  - llama3 embeddings, applied to 10 Wikipedia pages
- Search for the 5 nearest neighbours (see the sketch below)
[Diagram: 10 wiki pages → llama3 embedding → PostgreSQL + pgvector; prompt → store → search → neighbours]
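A hedged sketch of the pgvector side over plain JDBC; the table name, connection details, and toy 3-dimensional embeddings are illustrative (real llama3 embeddings are much larger, e.g. 4096 dimensions). pgvector's <=> operator is cosine distance.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Store embeddings in PostgreSQL + pgvector and fetch the 5 nearest neighbours.
public final class PgVectorDemo {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/wiki", "postgres", "postgres")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS vector");
                st.execute("CREATE TABLE IF NOT EXISTS chunks ("
                        + "id bigserial PRIMARY KEY, "
                        + "content text, "
                        + "embedding vector(3))");  // toy dimension for the sketch
            }

            // Store: embedding serialized in pgvector's '[x1,x2,...]' text form.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO chunks (content, embedding) VALUES (?, ?::vector)")) {
                ins.setString(1, "some wiki paragraph");
                ins.setString(2, "[0.12, -0.03, 0.98]"); // placeholder embedding
                ins.executeUpdate();
            }

            // Search: <=> is pgvector's cosine-distance operator.
            try (PreparedStatement q = conn.prepareStatement(
                    "SELECT content FROM chunks ORDER BY embedding <=> ?::vector LIMIT 5")) {
                q.setString(1, "[0.11, -0.02, 0.97]");   // the prompt's embedding
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("content"));
                    }
                }
            }
        }
    }
}
```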
images and search for similar images
- Embedding model: img2vec
- Vector DB: Weaviate (see the sketch below)
[Diagram: Lots of images → img2vec embedding → Weaviate; new image → store → search → neighbours]
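A hedged sketch of the search step over Weaviate's GraphQL endpoint with Java's built-in HttpClient, assuming an "Image" class vectorized by the img2vec-neural module (which provides the nearImage operator) and a local Weaviate instance; the class name, property names, and file path are all illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Ask Weaviate for the 3 images most similar to a new image.
public final class WeaviateImageSearch {

    public static void main(String[] args) throws Exception {
        String base64 = Base64.getEncoder()
                .encodeToString(Files.readAllBytes(Path.of("new-image.jpg")));

        // nearImage is provided by the img2vec-neural vectorizer module.
        String graphql = """
                { Get { Image(nearImage: {image: "%s"}, limit: 3) { filename } } }
                """.formatted(base64);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/graphql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"query\": " + jsonString(graphql) + "}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with the 3 nearest images
    }

    // Minimal JSON string escaping for the embedded GraphQL query.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                       .replace("\n", "\\n") + "\"";
    }
}
```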