
apidays Paris 2024 - Embeddings: Core Concepts for Developers, Jocelyn Matthews, Pinecone

Embeddings: Core Concepts for Developers
Jocelyn Matthews, DevRel, Head of Developer Community at Pinecone

apidays Paris 2024 - The Future API Stack for Mass Innovation
December 3 - 5, 2024

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

Transcript

  1. What are embeddings? Embeddings are numerical representations that capture the essential features and relationships of discrete objects, like words or documents, in a continuous vector space.
  2. Embeddings:
     • Are dynamic and context-sensitive
     • Capture the essence of the data they represent
     • Are influenced by the context in which they are used
     • Are adaptable, which makes them powerful
     Humans think in sensations, words, and ideas. Computers think in numbers.
  3. You don't need to memorize this now.
     Vector: a list of numbers that tells us about something.
     Vector space: an environment in which vectors exist.
     Semantics: the study of meaning communicated through language.
  4. Vectors. A vector is a mathematical structure with a size and a direction. For example, we can think of a vector as a point in space, with the "direction" being an arrow from (0,0,0) to that point in the vector space.
  5. Vectors. As developers, it might be easier to think of a vector as an array containing numerical values. For example: vector = [0, -2, ... 4]
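     To make slide 4's "size and direction" concrete in code, here is a minimal NumPy sketch; the values are a toy stand-in, not from the talk:

     import numpy as np

     vector = np.array([0.0, -2.0, 4.0])  # toy 3-D vector for illustration

     # Magnitude ("size"): the Euclidean length of the vector
     magnitude = np.linalg.norm(vector)   # ~4.47

     # Direction: the unit vector pointing the same way
     direction = vector / magnitude       # [0.0, -0.447, 0.894]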
  6. Vectors. When we look at a bunch of vectors in one space, we can say that some are closer to one another, while others are far apart. Some vectors can seem to cluster together, while others could be sparsely distributed in the space.
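     "Closer" and "farther apart" are usually measured with cosine similarity. A minimal sketch with invented toy vectors (real embeddings would come from a model):

     import numpy as np

     def cosine_similarity(a, b):
         # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
         return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

     coffee = np.array([0.9, 0.1, 0.0])   # toy vectors, invented for illustration
     tea    = np.array([0.8, 0.2, 0.1])
     galaxy = np.array([0.0, 0.1, 0.9])

     print(cosine_similarity(coffee, tea))     # high: these vectors cluster together
     print(cosine_similarity(coffee, galaxy))  # low: these vectors are far apart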
  7. An example you can bank on:
     🏦 Where is the Bank of England?
     🌱 Where is the grassy bank?
     🛩️ How does a plane bank?
     🐝 "the bees decided to have a mutiny against their queen"
     🐝 "flying stinging insects rebelled in opposition to the matriarch"
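     The two bee sentences share almost no vocabulary, yet a sentence-embedding model places them close together. A sketch using the sentence-transformers library; the model name is a common default, not necessarily the one used in the talk:

     from sentence_transformers import SentenceTransformer, util

     model = SentenceTransformer("all-MiniLM-L6-v2")

     a = "the bees decided to have a mutiny against their queen"
     b = "flying stinging insects rebelled in opposition to the matriarch"

     emb_a, emb_b = model.encode([a, b])

     # High cosine similarity despite almost no shared words
     print(util.cos_sim(emb_a, emb_b))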
  8. Word arithmetic: king – man + woman = queen. Image: Peter Sutor, "Metaconcepts: Isolating Context in Word Embeddings"
  9. Word arithmetic: king – man + woman = queen. "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
  10. Word arithmetic: king – man + woman = queen. "Adding the vectors associated with the words king and woman while subtracting man is equal to the vector associated with queen. This describes a gender relationship." – MIT Technology Review, 2015
  11. Word arithmetic: Paris – France + Poland = Warsaw. "In this case, the vector difference between Paris and France captures the concept of capital city." – MIT Technology Review, 2015
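     Both analogies can be reproduced with classic pretrained word vectors. A sketch using gensim's downloader and GloVe embeddings (an assumed setup; whether "queen" and "warsaw" come out on top depends on the model you load):

     import gensim.downloader as api

     # Pretrained 100-dimensional GloVe vectors (a sizable one-time download)
     model = api.load("glove-wiki-gigaword-100")

     # king – man + woman ≈ queen
     print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

     # Paris – France + Poland ≈ Warsaw (this model lowercases its vocabulary)
     print(model.most_similar(positive=["paris", "poland"], negative=["france"], topn=1))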
  12. Together and apart: Coffee, Hospital, Music, Restaurant, School, Cup, Caffeine, Morning, Galaxy, Dinosaur, Doctor, Patient, Surgery, Volcano, Unicorn, Song, Melody, Instrument, Asteroid, Bacteria, Food, Menu, Waiter, Nebula, Dragon, Teacher, Classroom, Student, Volcano, Spaceship, Exam
  13. Dimensionality! (same word cloud as slide 12)
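     Real embeddings live in hundreds of dimensions, so clusters like the word cloud above are usually visualized by projecting down to 2-D. A sketch with scikit-learn's PCA over the same GloVe vectors as before (assumed setup, not the talk's own code):

     import gensim.downloader as api
     from sklearn.decomposition import PCA

     model = api.load("glove-wiki-gigaword-100")

     words = ["coffee", "cup", "caffeine", "doctor", "patient", "surgery",
              "song", "melody", "instrument", "galaxy", "asteroid", "nebula"]

     # Stack the 100-D word vectors and project them down to two dimensions
     points = PCA(n_components=2).fit_transform([model[w] for w in words])

     for word, (x, y) in zip(words, points):
         print(f"{word:12s} {x:+.2f} {y:+.2f}")  # related words land near each other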
  14.–17. What's The Fallacy? Why "Green : Blue :: Orange : Red" is imperfect as a teaching tool (one slide, repeated four times as a progressive build):
     • Simplicity of relationships: linear vs. nuanced
     • Lack of context: how are the words used?
     • Dimensionality: 3D vs. hundreds of dimensions
     • Oversimplification
  18.–19. Check the vectors. The distance between red and orange is strikingly similar to the distance between blue and green. But when we tested this to verify it, we got interesting results that reveal the model's semantic "understanding" of the relationship:

     # Find the term whose distance and direction from "red" best match
     # the distance and direction from "green" to "blue"
     target_distance = distance_green_blue
     target_direction = direction_green_blue

     # Define a list of terms to compare
     terms = ["red", "orange", "yellow", "green", "blue",
              "purple", "pink", "black", "white", "gray"]

     # Get the embedding for each term
     term_embeddings = {term: get_embedding(term) for term in terms}

     # Find the term with the closest distance and same direction
     # relative to the target distance and direction
     closest_term = None
     closest_distance = float('inf')
     start_term = "red"
     start_embedding = get_embedding(start_term)

     for term, embedding in term_embeddings.items():
         if term == start_term:
             continue
         distance, direction = cosine_distance_and_direction(start_embedding, embedding)
         if direction == target_direction and abs(distance - target_distance) < closest_distance:
             closest_distance = abs(distance - target_distance)
             closest_term = term

     closest_term, closest_distance

     This yields ('purple', np.float64(0.006596347059928065)): purple, not orange.
  20. Why not 'orange'? The result ('purple', np.float64(0.006596347059928065)) suggests that, in this model's embedding space, "purple" sits at almost exactly the same cosine distance and direction from "red" as "blue" does from "green", while "orange" does not. In other words, the model judged "red" and "purple" to be the closer semantic pair, most likely because of the specific contexts and relationships it captured during training. (Same code as slides 18–19.)
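     The snippet above leans on two helpers the slides never define. One plausible reconstruction (an assumption, not the speaker's actual code) is that get_embedding wraps an embedding API and cosine_distance_and_direction returns a cosine distance plus a coarse direction signal:

     import numpy as np
     from openai import OpenAI  # assumption: any embedding API would work here

     client = OpenAI()

     def get_embedding(text, model="text-embedding-3-small"):
         # Hypothetical helper: fetch one embedding vector for `text`
         response = client.embeddings.create(model=model, input=text)
         return np.array(response.data[0].embedding)

     def cosine_distance_and_direction(a, b):
         # Cosine distance: 1 minus cosine similarity
         distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
         # Coarse "direction": the sign of the summed difference vector.
         # This is a guess at what the slides meant; the talk never defines it.
         direction = np.sign(np.sum(b - a))
         return distance, direction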
  21. What are embeddings? Embeddings are numerical representations that capture the essential features and relationships of discrete objects, like words or documents, in a continuous vector space.
  22. The most important thing to understand: embeddings are numerical representations of data that capture semantic meaning and allow for efficient comparison of similarity.
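     In practice, "efficient comparison" often means one vectorized operation over many embeddings at once. A minimal NumPy sketch of brute-force top-k similarity search over random stand-in embeddings (production systems use a vector index instead):

     import numpy as np

     rng = np.random.default_rng(0)
     corpus = rng.normal(size=(10_000, 384))  # 10k fake document embeddings
     corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

     query = rng.normal(size=384)
     query /= np.linalg.norm(query)

     scores = corpus @ query                  # cosine similarities in one matmul
     top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 nearest documents
     print(top_k, scores[top_k])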
  23. Key points about embeddings:
     1. They can represent various data types, not just text.
     2. Dimensionality matters: the number of values in each vector shapes what it can capture.
     3. Context sensitivity affects interpretation and application.
  24. Applications of embeddings include:
     - Semantic search
     - Question-answering applications
     - Image search
     - Audio search
     - Recommender systems
     - Anomaly detection
     "Generate your own embeddings" (Inference API)
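     The "Generate your own embeddings" pointer refers to Pinecone's Inference API. A sketch of what that looks like in the Pinecone Python SDK (model name and parameters are illustrative; check the current docs for the exact surface):

     from pinecone import Pinecone

     pc = Pinecone(api_key="YOUR_API_KEY")

     # Generate embeddings server-side via the Inference API
     embeddings = pc.inference.embed(
         model="multilingual-e5-large",
         inputs=["Where is the grassy bank?", "Where is the Bank of England?"],
         parameters={"input_type": "passage"},
     )

     for e in embeddings:
         print(len(e.values))  # dimensionality of each returned vector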
  25. © 2024 Pinecone – All rights reserved
     1. Questions? #hallwaytrack
     2. Recording? YouTube!
     3. Slides? Ask me