that has magnitude and direction, written v⃗.
- In computer science, a vector is a group or array of numbers.
- Examples:
  - [2.1, 0.34, 4.678]
  - [4]
  - [0.13483997249264842, 0.26967994498529685, −0.40451991747794525, 0.5393598899705937, 0.674199862463242]
- The two definitions are equivalent.
- The elements in a vector are also called the dimensions of the vector.
[Figure: a vector V(x, y, z) on the X, Y, Z axes; its magnitude m and direction θ describe the same vector: (m, θ) = V(x, y, z)]
some data by applying an embedding algorithm/model.
- Here, the vector is also known as an embedding or vector embedding.
- An embedding model puts data with similar meaning close to each other.
- The measure of this closeness is called the similarity or distance between the two vectors/embeddings.
[Diagram: Data/Tokens → Embedding model/algorithm → Vector DB]
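To make "similarity" concrete, here is a minimal sketch of cosine similarity in plain Java (the metric the VectorDB demo on the next slide uses for search); the class and method names and the float[] representation are my own choices, not from the slides.

```java
// Minimal sketch: cosine similarity between two vectors stored as float[].
// cos(a, b) = (a · b) / (|a| * |b|); ~1.0 means same direction, 0 means orthogonal.
public final class Similarity {

    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] v1 = {2.1f, 0.34f, 4.678f};
        float[] v2 = {1.9f, 0.40f, 4.500f};
        System.out.println(cosine(v1, v2)); // close to 1.0: similar direction
    }
}
```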
API
- Takes advantage of CPU architectures for faster processing
  - Single Instruction Multiple Data (SIMD)
- Current status: ninth incubator (JEP 489)
- Simple VectorDB implementation
  - Stores vectors in memory
  - Insertion and update
  - Selection
  - Searching with cosine similarity
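A minimal sketch of a SIMD dot product with the incubating Vector API; this is my own illustration of the idea, assuming a recent JDK and compiling/running with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// SIMD dot product with the incubating Vector API (JEP 489).
public final class SimdDot {

    // Pick the widest vector shape the CPU supports (e.g. 8 floats on AVX2).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration, one SIMD instruction each.
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the leftover elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

The same pattern covers the cosine-similarity search: dot product and norms are all reductions over lanes, which is where the SIMD speed-up comes from.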
of a vector is just the same vector but without magnitude.
- So, a unit vector in the same direction as the original vector.
- The magnitude or length of a vector is its distance from the origin:
  |V| = √(x² + y² + z²)
- Normalized V = V/|V| = (x/|V|, y/|V|, z/|V|)
[Figure: vector V(x, y, z) on the X, Y, Z axes]
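A short sketch of magnitude and normalization in plain Java; the helper class and method names are illustrative, not from the slides.

```java
// Magnitude and normalization of a vector stored as float[].
public final class VectorMath {

    // |V| = sqrt(x1^2 + x2^2 + ... + xn^2)
    public static double magnitude(float[] v) {
        double sum = 0.0;
        for (float x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // V / |V|: same direction, length 1.
    public static float[] normalize(float[] v) {
        double m = magnitude(v);
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            unit[i] = (float) (v[i] / m);
        }
        return unit;
    }
}
```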
from RGB(1,1,1) to RGB(255,255,255)
- Embedding algorithm:
  - Map each color to a vector with its 3 RGB values
  - Normalize the vector (see the sketch below)
[Diagram: RGB Colors → Embedding Algorithm → Vector DB, with numbered steps: store, query, search, neighbours]
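A minimal sketch of this color-embedding idea, reusing the Similarity and VectorMath helpers sketched above; the class name and sample colors are my own.

```java
// Embed an RGB color as a normalized 3-dimensional vector.
// Normalization removes the magnitude (brightness) component, so two colors
// with a similar hue land close together regardless of how bright they are.
public final class ColorEmbedding {

    public static float[] embed(int r, int g, int b) {
        return VectorMath.normalize(new float[] {r, g, b}); // VectorMath sketch from above
    }

    public static void main(String[] args) {
        float[] darkRed   = embed(100, 10, 10);
        float[] brightRed = embed(250, 25, 25);
        // Same direction, so cosine similarity is ~1.0 despite different brightness.
        System.out.println(Similarity.cosine(darkRed, brightRed));
    }
}
```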
- Embedding algorithm:
  - word2vec
  - Deeplearning4j word2vec implementation, applied to 4 Wikipedia pages (see the sketch below)
- Vectors contain meaning and relationships between tokens.
[Diagram: 4 wiki pages → word2vec → PostgreSQL + pgvector, with numbered steps: store word, search, neighbours]
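A hedged sketch of training word2vec with Deeplearning4j's standard Word2Vec builder; the file path and parameter values are illustrative, and the slides' actual configuration may differ.

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public final class WikiWord2Vec {

    public static void main(String[] args) throws Exception {
        // One sentence per line; "wiki-pages.txt" is a placeholder for the 4 exported pages.
        SentenceIterator iter = new BasicLineIterator("wiki-pages.txt");
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare words
                .layerSize(100)        // 100-dimensional embeddings
                .windowSize(5)         // context window around each token
                .iterate(iter)
                .tokenizerFactory(tokenizer)
                .build();
        vec.fit();

        // Nearest neighbours in embedding space capture relatedness between tokens.
        System.out.println(vec.wordsNearest("java", 10));
    }
}
```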
A binary search tree with each layer splitting on a different dimension.
- R-tree (Rectangle Tree)
  - Vectors are grouped into rectangles.
- LSH (Locality-Sensitive Hashing)
  - Vectors are placed in buckets of hashes (see the sketch after this list).
- Annoy (Approximate Nearest Neighbours Oh Yeah)
  - A forest of multiple binary search trees
- HNSW (Hierarchical Navigable Small World)
- ScaNN (Scalable Nearest Neighbours)
  - Hybrid approach with several techniques combined
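To make "buckets of hashes" concrete, here is a minimal sketch of random-hyperplane LSH, one common LSH family for cosine similarity; it is my own illustration, not an implementation from the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Random-hyperplane LSH: each hash bit records which side of a random
// hyperplane a vector falls on. Nearby vectors tend to get the same bits,
// so they land in the same bucket and become search candidates.
public final class SimpleLsh {

    private final float[][] hyperplanes;                      // one hyperplane per hash bit
    private final Map<Integer, List<float[]>> buckets = new HashMap<>();

    public SimpleLsh(int dimensions, int bits, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new float[bits][dimensions];
        for (float[] h : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                h[d] = (float) rnd.nextGaussian();
            }
        }
    }

    private int hash(float[] v) {
        int h = 0;
        for (int bit = 0; bit < hyperplanes.length; bit++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[bit][d] * v[d];
            }
            if (dot >= 0) {
                h |= 1 << bit;                                // set the bit for this side
            }
        }
        return h;
    }

    public void insert(float[] v) {
        buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(v);
    }

    // Candidate neighbours: only the vectors sharing the query's bucket.
    public List<float[]> candidates(float[] query) {
        return buckets.getOrDefault(hash(query), List.of());
    }
}
```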
onto a graph with each vector being a node; "close" vectors are connected by edges.
- The graph is divided into layers of varying granularity.
- To search for the nearest neighbour:
  - Start at a fixed entry point on the highest layer.
  - Find the vector with the least distance from the query vector.
  - Use this vector as the entry point to the next layer.
  - Repeat until you reach the lowest layer, which has maximum granularity.
[Figure: layered HNSW graph with an entry point on Layer 2, descending through Layer 1 to Layer 0 toward the query vector]
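A highly simplified sketch of that layered descent: greedy search on each layer, then drop one layer down. Real HNSW keeps a candidate beam (efSearch) and dynamic neighbour lists; the Node interface here is hypothetical, just to show the core idea.

```java
import java.util.List;

public final class HnswSearchSketch {

    // Hypothetical node type: each node holds a vector and, per layer, its edges.
    interface Node {
        float[] vector();
        List<Node> neighbours(int layer);
    }

    static double distance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static Node search(Node entryPoint, float[] query, int topLayer) {
        Node current = entryPoint;
        // Walk down from the coarsest layer to layer 0 (maximum granularity).
        for (int layer = topLayer; layer >= 0; layer--) {
            boolean improved = true;
            while (improved) {
                improved = false;
                // Greedy step: move to any neighbour closer to the query.
                for (Node n : current.neighbours(layer)) {
                    if (distance(n.vector(), query) < distance(current.vector(), query)) {
                        current = n;
                        improved = true;
                    }
                }
            }
            // The closest node on this layer becomes the entry point one layer down.
        }
        return current; // approximate nearest neighbour
    }
}
```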
of a relatively large wiki dataset
- Embedding algorithm:
  - llama3 embeddings, applied to 10 Wikipedia pages
- Search for the 5 nearest neighbours (see the sketch below)
[Diagram: 10 wiki pages → llama3 embedding → PostgreSQL + pgvector; prompt → store → search → neighbours]
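A hedged sketch of the pgvector side over plain JDBC; the table name, connection details, and toy 3-dimensional embeddings are illustrative (real llama3 embeddings are much larger, e.g. 4096 dimensions). pgvector's <=> operator is cosine distance.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Store embeddings in PostgreSQL + pgvector and fetch the 5 nearest neighbours.
public final class PgVectorDemo {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/wiki", "postgres", "postgres")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS vector");
                st.execute("CREATE TABLE IF NOT EXISTS chunks ("
                        + "id bigserial PRIMARY KEY, "
                        + "content text, "
                        + "embedding vector(3))");  // toy dimension for the sketch
            }

            // Store: embedding serialized in pgvector's '[x1,x2,...]' text form.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO chunks (content, embedding) VALUES (?, ?::vector)")) {
                ins.setString(1, "some wiki paragraph");
                ins.setString(2, "[0.12, -0.03, 0.98]"); // placeholder embedding
                ins.executeUpdate();
            }

            // Search: <=> is pgvector's cosine-distance operator.
            try (PreparedStatement q = conn.prepareStatement(
                    "SELECT content FROM chunks ORDER BY embedding <=> ?::vector LIMIT 5")) {
                q.setString(1, "[0.11, -0.02, 0.97]");   // the prompt's embedding
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("content"));
                    }
                }
            }
        }
    }
}
```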
images and search for similar images
- Embedding model: img2vec
- Vector DB: Weaviate (see the sketch below)
[Diagram: Lots of images → img2vec embedding → Weaviate; new image → store → search → neighbours]
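A hedged sketch of the search step over Weaviate's GraphQL endpoint with Java's built-in HttpClient, assuming an "Image" class vectorized by the img2vec-neural module (which provides the nearImage operator) and a local Weaviate instance; the class name, property names, and file path are all illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Ask Weaviate for the 3 images most similar to a new image.
public final class WeaviateImageSearch {

    public static void main(String[] args) throws Exception {
        String base64 = Base64.getEncoder()
                .encodeToString(Files.readAllBytes(Path.of("new-image.jpg")));

        // nearImage is provided by the img2vec-neural vectorizer module.
        String graphql = """
                { Get { Image(nearImage: {image: "%s"}, limit: 3) { filename } } }
                """.formatted(base64);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/graphql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"query\": " + jsonString(graphql) + "}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with the 3 nearest images
    }

    // Minimal JSON string escaping for the embedded GraphQL query.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                       .replace("\n", "\\n") + "\"";
    }
}
```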